2026-05-24

JIT Skills Slash Hermes System Prompt 95% — Multiplied Across 39 Calls Per Session

hermes, tokens, optimization, embeddings, plugins

Hermes Agent ships with 123 installed skills. Before JIT, every single one was dumped into the system prompt — 4,200 tokens of skill names and descriptions, every turn, whether the task needed pixel-art guidance or not. The fix was semantic embedding search that finds the right skill before the agent's turn and injects just that one. System prompt: 226 tokens. But the bigger win is what happens per-turn when you stop making the agent call a tool to find skills.

What JIT Skills Does

Instead of a flat skill listing, the system prompt gets a 226-token guidance block telling the agent it can call search_skills("query") to find relevant skills by semantic similarity. When the agent's turn begins, the user's query is embedded and compared against a pre-built index of 123 skill descriptions. The top match is injected directly into context, so the agent sees the full SKILL.md content alongside the user's message with no tool call required.

Before (full mode):
  System prompt: "Available skills: [123 entries × ~34 tokens each]..."
  → 4,200 tokens

After (JIT mode):
  System prompt: "Use search_skills(query) to find skills on-demand..."
  → 226 tokens
  + Pre-loaded skill injected before each turn
  → 1,500–3,000 tokens saved per avoided search_skills call

The pre-loading is the mechanism that compounds. Without it, the agent would call search_skills as a tool, consuming a round-trip of API overhead plus the returned skill content. Pre-loading skips that tool call entirely — the skill arrives before the agent thinks to ask.
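The pre-load step can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: `preload_skill`, the `embed` callable, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def preload_skill(
    user_message: str,
    embed: Callable[[str], Vector],   # hypothetical: any text -> vector function
    index: dict[str, Vector],         # skill name -> precomputed embedding
    skill_bodies: dict[str, str],     # skill name -> full SKILL.md text
) -> str:
    """Return this turn's context with the best-matching skill prepended."""
    query_vec = embed(user_message)
    best = max(index, key=lambda name: cosine(query_vec, index[name]))
    return f"[Pre-loaded skill: {best}]\n{skill_bodies[best]}\n\n{user_message}"


# Toy 2-dimensional "embeddings", purely for demonstration.
toy_index = {"pixel-art": [1.0, 0.0], "github-pr-workflow": [0.0, 1.0]}
toy_bodies = {
    "pixel-art": "(full SKILL.md text for pixel-art)",
    "github-pr-workflow": "(full SKILL.md text for github-pr-workflow)",
}


def toy_embed(text: str) -> Vector:
    return [1.0, 0.1] if "pixel" in text else [0.1, 1.0]


context = preload_skill("draw a pixel sprite", toy_embed, toy_index, toy_bodies)
```

The point of the sketch is the ordering: the lookup happens before the model sees the message, so no tool-call round trip is spent on it.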

How We Built It

The architecture has four layers:

Plugin (~/.hermes/plugins/jit-skills/): Registers the search_skills tool and handles the semantic embedding pipeline. The build_embeddings.py script scans all 123 SKILL.md files, concatenates each skill's name, description, category, and the first 500 characters of its body, then embeds the resulting text via one of four backends (the best available one is auto-detected).
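As a rough sketch of the concatenation step, assuming the four fields have already been parsed out of each SKILL.md (the `Skill` dataclass and `embedding_text` helper are hypothetical names, not taken from build_embeddings.py):

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    category: str
    body: str  # SKILL.md content after the metadata


def embedding_text(skill: Skill, body_chars: int = 500) -> str:
    """Join the fields into one string to embed, truncating the body."""
    parts = [skill.name, skill.description, skill.category, skill.body[:body_chars]]
    # Skip empty fields so a missing category does not leave a blank line.
    return "\n".join(p.strip() for p in parts if p.strip())


example = embedding_text(Skill("pixel-art", "Make sprites", "creative", "b" * 900))
```

Truncating the body keeps every skill's text at a similar length, so long skill files do not dominate the embedding.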

Embedding backends (4-tier fallback):

| Tier | Backend | Model | Latency |
|------|---------|-------|---------|
| 1 | LiteLLM | nomic-embed-text-v2-moe (local RTX 3070) | ~150ms |
| 2 | OpenAI | text-embedding-3-small | ~200ms |
| 3 | sentence-transformers | all-MiniLM-L6-v2 (CPU) | ~50ms |
| 4 | Keyword/TF-IDF | Pure Python stdlib | <1ms |

If LiteLLM is down, it falls through to OpenAI. If no API key is set, sentence-transformers runs locally. If even that can't import, TF-IDF over a 5,000-word vocabulary always works. The index is saved as a numpy .npz file (123 × 768-dim float32) plus a JSON metadata sidecar.
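The fall-through behaviour can be sketched as an ordered list of backend factories, where the first one that initialises wins. The factory functions below are hypothetical stand-ins that simulate failures; the real plugin's detection logic may differ.

```python
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def pick_backend(
    candidates: list[tuple[str, Callable[[], Embedder]]],
) -> tuple[str, Embedder]:
    """Try each backend factory in order; return the first that initialises."""
    errors = []
    for name, factory in candidates:
        try:
            return name, factory()
        except Exception as exc:  # proxy down, missing API key, failed import
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no embedding backend available: " + "; ".join(errors))


# Hypothetical factories standing in for the four tiers.
def litellm_backend() -> Embedder:
    raise ConnectionError("proxy not reachable")


def openai_backend() -> Embedder:
    raise KeyError("OPENAI_API_KEY")


def sentence_transformers_backend() -> Embedder:
    raise ImportError("sentence_transformers not installed")


def keyword_backend() -> Embedder:
    return lambda text: [float(len(text.split()))]  # placeholder vector


name, embed = pick_backend([
    ("litellm", litellm_backend),
    ("openai", openai_backend),
    ("sentence-transformers", sentence_transformers_backend),
    ("keyword", keyword_backend),
])
```

One caveat with any scheme like this: the backends produce vectors of different dimensions, so queries have to be embedded with the same backend that built the index.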

Source patches: Two patches to Hermes core — one adding skills.mode to the config schema, another gating the system prompt builder to return the 226-token JIT block when mode == "jit". A nightly cron job (d08175aa988b) runs a re-apply script at 3AM UTC that detects and restores patches after hermes update overwrites them.
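The gating patch amounts to a branch on the config value. A minimal sketch, assuming a builder function shaped like the one below (`build_skills_section` and the guidance wording are placeholders, not the actual Hermes source):

```python
# Placeholder wording; the real JIT block is about 226 tokens long.
JIT_GUIDANCE = "Use search_skills(query) to find skills on-demand."


def build_skills_section(mode: str, skills: list[tuple[str, str]]) -> str:
    """Return the skills portion of the system prompt for a given skills.mode."""
    if mode == "jit":
        return JIT_GUIDANCE
    if mode == "full":
        lines = [f"- {name}: {description}" for name, description in skills]
        return "Available skills:\n" + "\n".join(lines)
    raise ValueError(f"unknown skills.mode: {mode!r}")


skills = [("pixel-art", "Pixel art guidance"), ("github-pr-workflow", "PR review flow")]
jit_section = build_skills_section("jit", skills)
full_section = build_skills_section("full", skills)
```

Note that the JIT branch ignores the skill list entirely, which is why its size stays constant as more skills are installed.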

Activation:

hermes config set skills.mode jit   # 226-token prompt
hermes config set skills.mode full  # 4,200-token prompt (original)
hermes gateway restart

The Calls-Per-Session Multiplier

We log every API call Hermes makes — provider, token counts, latency, cost. Querying the token logger's SQLite database across all sessions gives us a baseline for what "per-turn" actually means in practice.

| Metric | Value |
|--------|-------|
| Average calls per session | 39.2 |
| Maximum calls (single session) | 184 |
| Total sessions logged | 39 |
| Total API calls | 1,529 |
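Figures like these come from a simple aggregation over the call log. The sketch below uses an in-memory database with a hypothetical schema (`api_calls` with one row per call); the real logger's table and column names are not shown in this post.

```python
import sqlite3

# Hypothetical schema: one row per API call.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE api_calls (session_id TEXT, input_tokens INTEGER, output_tokens INTEGER)"
)
conn.executemany(
    "INSERT INTO api_calls VALUES (?, ?, ?)",
    [("s1", 1200, 300), ("s1", 1500, 250), ("s1", 900, 100), ("s2", 2000, 400)],
)

# Session count, total calls, average and maximum calls per session.
sessions, total_calls, avg_calls, max_calls = conn.execute(
    """
    SELECT COUNT(*), SUM(calls), AVG(calls), MAX(calls)
    FROM (SELECT session_id, COUNT(*) AS calls FROM api_calls GROUP BY session_id)
    """
).fetchone()

# Sessions ranked by call count, with token totals.
top_sessions = conn.execute(
    """
    SELECT session_id, COUNT(*) AS calls, SUM(input_tokens), SUM(output_tokens)
    FROM api_calls
    GROUP BY session_id
    ORDER BY calls DESC
    LIMIT 5
    """
).fetchall()
```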

The pre-load mechanism saves one search_skills tool call per turn. Treating each of the 39 average calls as a turn that would otherwise have needed a lookup, and using a conservative 1,500 tokens saved per avoided call, that's 58,500 tokens per session on top of the roughly 4,000-token system prompt reduction: 62,500 tokens saved per session in total.

At DeepSeek's current promo pricing ($0.14/M input tokens), that's $0.00875 per session. At standard pricing, ~$0.035. Over months of daily use, the absolute dollar savings are modest — tens of dollars. The real savings are in context quality.

The maximum session we recorded hit 184 API calls. At that extreme, JIT pre-loading saves 276,000 tokens just from avoided tool calls. For a model with a 128K context window, that's the difference between the agent fitting its conversation history or getting truncated mid-task.
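The arithmetic behind these figures, written out so the inputs are easy to change. The function names are illustrative; the numbers are the ones quoted above.

```python
def tokens_saved(calls: int, per_call: int = 1_500, prompt_reduction: int = 4_000) -> int:
    """Tokens saved in one session: avoided tool calls plus the prompt reduction."""
    return calls * per_call + prompt_reduction


def dollars(tokens: int, price_per_million: float) -> float:
    """Cost of a token count at a given price per million input tokens."""
    return tokens / 1_000_000 * price_per_million


average_session = tokens_saved(39)             # 39 * 1,500 + 4,000 = 62,500
extreme_tool_calls_only = 184 * 1_500          # 276,000
promo_cost = dollars(average_session, 0.14)    # about $0.00875
```

The per-call term overtakes the 4,000-token prompt reduction as soon as a session passes three calls, which is why the per-turn savings dominate.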

The top five sessions by call count:

session_id           calls   input_tokens   output_tokens
20260521_184144       184      458,109         57,253
20260520_020659       104      309,556         31,222
20260521_180330       100      274,561         39,514
20260521_012858        99      225,596         44,702
20260521_151705        97      251,443         42,741

These are multi-turn sessions with heavy tool use — code generation, debugging loops, multi-file refactors. Every turn that would have called search_skills instead had the relevant skill pre-loaded, saving 1,500–3,000 tokens each time.

Search Quality

The semantic embedding approach consistently finds the right skill:

| Query | Top Result | Cosine Score |
|-------|------------|--------------|
| "debug litellm proxy" | litellm-proxy-debug | 0.68 |
| "github PR workflow code review" | github-pr-workflow | 0.69 |
| "pixel art NES palette" | pixel-art | 0.69 |
| "AI songwriting music" | songwriting-and-ai-music | 0.67 |
| "Next.js frontend design UI" | frontend-design | 0.57 |
| "post tweets search twitter" | xdk-twitter | 0.53 |

Even the keyword/TF-IDF fallback finds correct matches, though at lower scores (0.2–0.6). The frontend-design query at 0.57 is one of the two weakest results (only the Twitter query scores lower, at 0.53): it's a broad query against a diverse skill set, and cosine similarity over static embeddings struggles with abstract concepts. We're experimenting with re-ranking via a second-pass LLM call for scores below 0.60, but the pre-loaded single best match already covers most requests well enough that the agent rarely needs to call search_skills itself.
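For a sense of what the keyword tier does, here is a minimal stdlib-only TF-IDF search. It is a sketch of the general technique rather than the plugin's implementation: the tokenizer, the smoothed IDF formula, and the two toy skill descriptions are all assumptions.

```python
import math
import re
from collections import Counter

SparseVec = dict[str, float]


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf(docs: dict[str, str]) -> tuple[dict[str, SparseVec], dict[str, float]]:
    """Return per-document TF-IDF vectors and the IDF table used to build them."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    # Smoothed IDF: never zero, and terms seen in every document still count a little.
    idf = {term: math.log((1 + len(docs)) / (1 + n)) + 1 for term, n in df.items()}
    vectors = {
        name: {t: c * idf[t] for t, c in Counter(toks).items()}
        for name, toks in tokens.items()
    }
    return vectors, idf


def cosine(a: SparseVec, b: SparseVec) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, vectors: dict[str, SparseVec], idf: dict[str, float]) -> tuple[str, float]:
    """Return the best-matching document name and its cosine score."""
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    return max(((name, cosine(q, vec)) for name, vec in vectors.items()), key=lambda p: p[1])


# Toy descriptions, invented for the example.
docs = {
    "pixel-art": "pixel art sprites with limited palettes such as NES",
    "github-pr-workflow": "github pull request workflow and code review",
}
vectors, idf = build_tfidf(docs)
best, score = search("pixel art NES palette", vectors, idf)
```

Without stemming, "palette" does not match "palettes", which is one reason keyword scores run lower than embedding scores even when the top result is correct.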

What We Learned

Pre-loading is the real win, not the system prompt reduction. The 4,000-token system prompt savings get the attention because it's a clean before/after number. But the per-turn savings from eliminating tool calls compound across 39 calls per session and dwarf the base reduction for any session longer than 2-3 turns. If you're building a similar system, design for pre-loading from day one.

Source patches are fragile. hermes update overwrites config.py and prompt_builder.py. The re-apply script handles this, but the better solution is an upstream hook — Hermes core should expose a system_prompt_skills hook that plugins can override without touching source. The production plan outlines the PR for this.

The embedding index must be rebuilt after installing new skills. It's not automatic. The nightly cron handles this by running the browse.sh auto-installer which rebuilds the index after fetching new skills, but ad-hoc skill installs during the day will miss the index until the next 3AM rebuild.
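One way to close that gap would be a staleness check that compares file modification times. This is a sketch of the idea only: `index_is_stale` is a hypothetical helper, and the directory layout in the usage lines is an assumption, not the plugin's documented paths.

```python
from pathlib import Path


def index_is_stale(skills_dir: Path, index_file: Path) -> bool:
    """True if the index is missing or older than any SKILL.md under skills_dir."""
    if not index_file.exists():
        return True
    built_at = index_file.stat().st_mtime
    return any(p.stat().st_mtime > built_at for p in skills_dir.rglob("SKILL.md"))


# Example usage with assumed locations:
# if index_is_stale(Path.home() / ".hermes" / "skills", Path("embeddings.npz")):
#     print("index out of date; run: python3 build_embeddings.py --force")
```

A check like this catches new and edited skills but not deleted ones, so a periodic full rebuild would still be worth keeping.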

Four fallback tiers is overkill in practice. The LiteLLM/nomic-embed backend has never failed. The TF-IDF fallback exists purely as a safety net and costs <1ms. The OpenAI and sentence-transformers tiers are dead code in our setup but cost nothing to keep — they make the plugin portable to any Hermes install without our local infrastructure.

Getting Started

# Clone the plugin
git clone https://github.com/underdown/hermes-jit-skills.git ~/.hermes/plugins/jit-skills

# Build the embedding index (auto-detects best backend)
cd ~/.hermes/plugins/jit-skills
python3 build_embeddings.py --force

# Apply the source patches (two files)
bash /root/reapply-jit-patches.sh

# Enable and restart
hermes config set skills.mode jit
hermes gateway restart

# Verify
hermes tools list | grep search_skills

The plugin works immediately. Test it by asking Hermes about something domain-specific — you'll see the relevant skill pre-loaded in the agent's context (visible with /debug prompt if you have that enabled).

The repo: https://github.com/underdown/hermes-jit-skills