2026-05-25

Semantic Skills — Skill Pre-Loading That Keeps Prompt Caching Intact

hermes, skills, caching, tokens, optimization, plugin

The jit-skills plugin cut Hermes Agent's system prompt from 4,200 tokens to 226 by replacing the full skill listing with a search_skills tool, a saving of about 95% on the system prompt. The catch: the agent had to call the tool on the first turn of every task, burning 1,500 to 3,000 tokens. The obvious alternative, pre-loading the top-match skill into the system prompt, would break exact-prefix prompt caching.

semantic-skills v2 fixes both problems. It pre-loads the best-matching skill into the user message via a pre_gateway_dispatch hook — no tool call needed, and the system prompt stays byte-identical across turns.

The Architecture Change

In v1 (jit-skills), skill delivery was reactive:

User message → agent calls search_skills → tool response → agent reads skill

This added one tool-call turn to every new task. On GPT-5.5 or Claude, that turn costs 1,500 to 3,000 tokens each time.

In v2 (semantic-skills), skill delivery is pre-loaded at the gateway level:

User message arrives at gateway
  → pre_gateway_dispatch hook fires (before auth, before agent dispatch)
  → embed query against 407 skill vectors
  → top match above 0.30 threshold → rewrite message with skill prepended
  → LLM sees: [stable 226-token system prompt] [user msg: skill + query]

The pre_gateway_dispatch hook fires at the gateway layer — before authentication, before platform pairing, before the agent even wakes up. It embeds the user's query, finds the top match, and rewrites the incoming message text to include the skill content at the top. The system prompt is never touched.
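The "embed, rank, threshold" step can be sketched in plain Python. This is a minimal illustration, not the plugin's code: the index layout (a list of vector/metadata pairs), the toy 3-dimensional vectors, and the skill names are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def rank_by_cosine_similarity(query_embedding, index):
    """Return (best_score, best_meta) for the closest entry in `index`.

    `index` is a list of (vector, metadata) pairs. That layout is an
    assumption for illustration, not the plugin's on-disk format.
    """
    best_score, best_meta = -1.0, None
    for vector, meta in index:
        score = cosine_similarity(query_embedding, vector)
        if score > best_score:
            best_score, best_meta = score, meta
    return best_score, best_meta

# Toy index with hypothetical skill names; real embeddings are much wider.
toy_index = [
    ([1.0, 0.0, 0.0], {"name": "git-rebase"}),
    ([0.0, 1.0, 0.0], {"name": "docker-compose"}),
]

THRESHOLD = 0.30
score, meta = rank_by_cosine_similarity([0.9, 0.1, 0.0], toy_index)
selected = meta["name"] if score >= THRESHOLD else None
```

With 407 skill vectors a linear scan like this is cheap, which is presumably why the lookup can run on every incoming message.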

This is a better injection point than pre_llm_call for two reasons. First, it fires earlier in the pipeline, so the skill content is baked into the raw message before any downstream processing. Second, it returns {"action": "rewrite", "text": "..."} to replace the message text directly, rather than injecting ephemeral context that the agent loop has to manage.

Feature	jit-skills v1	semantic-skills v2
Skill delivery	Agent calls search_skills tool	Pre-loaded via pre_gateway_dispatch hook
First-turn tool call	Yes (1,500-3,000 tokens)	Eliminated
System prompt	226 tokens, cache-friendly	226 tokens, cache-friendly
Injection point	Tool result (messages array)	pre_gateway_dispatch (gateway layer, before auth)
Session cache	None	Skips re-injection on follow-up turns
Net tokens saved per session (vs full)	~1,000-2,500	~4,000

Why This Matters for Prompt Caching

Prompt caching on every major provider (Anthropic, OpenAI, DeepSeek) uses exact-prefix matching. If the system prompt changes by even one byte between turns, nothing after the change can be served from cache. Because the system prompt comes first, that means the whole request.

The v1 plugin had a tension: if you pre-loaded a skill into the system prompt, it would break the cache. The only safe choice was to make the agent call the search_skills tool and accept the tool-call overhead.

v2 resolves this by injecting into the user message, not the system prompt. The system prompt stays at 226 tokens — identical across every turn. The per-turn skill content goes into the user message, which is always part of the uncached tail. The cache contract is preserved.

This is the same pattern Claude Code uses internally: stable tool stubs and system instructions at the top, dynamic content appended to the messages array.
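The effect on the cacheable prefix can be shown with a toy comparison of two consecutive requests in one session. This is a sketch under simplifying assumptions: it compares JSON-serialized bytes, whereas real providers match on tokens and apply their own minimum lengths and breakpoints. The prompt text, skill text, and messages are invented for the example.

```python
import json

BASE_SYSTEM = "You are Hermes. Call search_skills when you need a skill."

def msg(role, content):
    return {"role": role, "content": content}

def serialize(messages):
    """Flatten a messages array to bytes, a stand-in for the provider's view."""
    return json.dumps(messages).encode("utf-8")

def shared_prefix_len(a, b):
    """Number of leading bytes two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

skill_a = "SKILL A: how to rebase a branch"
skill_b = "SKILL B: how to build a container image"
q1, a1, q2 = "rebase my feature branch", "Done.", "now build the image"

# Strategy 1: the matched skill is placed in the system prompt.
turn1_sys = [msg("system", BASE_SYSTEM + "\n" + skill_a), msg("user", q1)]
turn2_sys = [msg("system", BASE_SYSTEM + "\n" + skill_b), msg("user", q1),
             msg("assistant", a1), msg("user", q2)]

# Strategy 2: the matched skill is prepended to the user message.
turn1_usr = [msg("system", BASE_SYSTEM), msg("user", skill_a + "\n\n" + q1)]
turn2_usr = [msg("system", BASE_SYSTEM), msg("user", skill_a + "\n\n" + q1),
             msg("assistant", a1), msg("user", skill_b + "\n\n" + q2)]

shared_sys = shared_prefix_len(serialize(turn1_sys), serialize(turn2_sys))
shared_usr = shared_prefix_len(serialize(turn1_usr), serialize(turn2_usr))
```

In the first strategy the two requests diverge inside the system message, so none of the conversation history after it can be reused. In the second, everything sent on turn 1 is still a prefix of turn 2 (apart from the closing bracket of the serialized array).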

The pre_gateway_dispatch Hook

The hook fires at the gateway level, before auth, pairing, or agent dispatch. It embeds the incoming message, finds the top match, and rewrites the message text:

# Maps session key -> name of the last skill injected in that session
_SESSION_CACHE = {}

def on_pre_gateway_dispatch(event, gateway, session_store, **kwargs):
    text = event.text
    if len(text.split()) < 3:  # skip short queries
        return None

    # Embed the query and rank it against the pre-built skill index
    # (`embeddings` is assumed to be loaded elsewhere in the plugin)
    store = lazy_import_embedding_store()
    query_embedding = store.embed_text(text)
    best_score, best_meta = rank_by_cosine_similarity(query_embedding, embeddings)

    if best_score < 0.30:  # threshold
        return None

    # Avoid re-injecting the same skill on follow-up turns
    session_key = event.source.chat_id or event.source.user_id
    if _SESSION_CACHE.get(session_key) == best_meta["name"]:
        return None

    # Read the full SKILL.md content only once we know it will be injected
    skill_content = store.read_skill_content(best_meta["path"])
    _SESSION_CACHE[session_key] = best_meta["name"]

    rewritten = (
        f"Relevant skill (pre-loaded, score {best_score:.2f}): {best_meta['name']}\n"
        f"{'─' * 60}\n"
        f"{skill_content}\n"
        f"{'─' * 60}\n\n"
        "If this skill is insufficient, call search_skills for additional matches.\n\n"
        f"{text}"
    )

    return {"action": "rewrite", "text": rewritten}

The hook has three guardrails:

Short queries (fewer than 3 words) are skipped, since one- or two-word replies such as "ok" or "thanks" rarely need a skill
Score threshold of 0.30 — below this, the query is too ambiguous for reliable pre-loading
Session cache — the same skill is not re-injected on follow-up turns in the same session
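To make the three guardrails easy to test in isolation, they can be pulled into one pure function. This is a hypothetical refactor for illustration, not how the plugin is structured; `should_preload`, the constant names, the session keys, and the skill names are all assumptions.

```python
MIN_WORDS = 3
THRESHOLD = 0.30

def should_preload(text, best_score, skill_name, session_key, session_cache):
    """Apply the three guardrails; return True when the skill should be injected."""
    if len(text.split()) < MIN_WORDS:
        return False  # guardrail 1: short query
    if best_score < THRESHOLD:
        return False  # guardrail 2: match too weak
    if session_cache.get(session_key) == skill_name:
        return False  # guardrail 3: already injected in this session
    session_cache[session_key] = skill_name
    return True

cache = {}
decisions = [
    should_preload("ok thanks", 0.90, "git-rebase", "chat-1", cache),
    should_preload("what is the weather like", 0.12, "git-rebase", "chat-1", cache),
    should_preload("rebase my feature branch", 0.55, "git-rebase", "chat-1", cache),
    should_preload("now rebase the other branch", 0.61, "git-rebase", "chat-1", cache),
    should_preload("rebase my feature branch", 0.55, "git-rebase", "chat-2", cache),
]
```

The fourth call is skipped because the same skill was just injected for chat-1, while the fifth goes through because chat-2 is a different session.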

The return value {"action": "rewrite", "text": "..."} tells the gateway to replace the raw message text before it reaches the agent loop. If the hook returns None, the message passes through unchanged and the agent can still use the search_skills tool as a fallback.
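On the receiving side, the contract amounts to a small branch. The sketch below shows how a gateway might apply a hook's return value. `apply_hook_result` is a hypothetical name, and ignoring unrecognized actions is an assumption; the real gateway may support other actions or handle them differently.

```python
def apply_hook_result(original_text, result):
    """Return the message text the agent should see after a hook has run."""
    if result is None:
        return original_text  # hook declined: pass the message through
    if result.get("action") == "rewrite" and isinstance(result.get("text"), str):
        return result["text"]  # hook replaced the message text
    return original_text  # unrecognized action: ignored here (assumption)

unchanged = apply_hook_result("fix the failing test", None)
rewritten = apply_hook_result(
    "fix the failing test",
    {"action": "rewrite", "text": "SKILL CONTENT\n\nfix the failing test"},
)
```

Because the None path leaves the message untouched, a hook failure or a below-threshold match degrades to v1 behavior, with search_skills still available to the agent.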

Token Savings

The total savings vs full mode (all 407+ skills listed in the system prompt):

Source	Full Mode	JIT v1	Semantic v2
System prompt	4,200	226	226
First-turn tool call	0	1,500-3,000	0
Total per session	4,200	1,726-3,226	226

v2 saves a further 1,500 to 3,000 tokens over v1 each time it eliminates the first-turn tool call. Because v1 paid that cost on every new task, the savings add up over a typical Hermes session (39 turns[^1]).
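The totals in the table follow directly from the two inputs, as this short calculation shows. Like the table, it counts only the system prompt and the first-turn tool call; it leaves out the pre-loaded skill text itself, whose size is not stated here. The variable names are illustrative.

```python
SYSTEM_FULL = 4_200         # system prompt with every skill listed
SYSTEM_SLIM = 226           # system prompt with only the search_skills stub
TOOL_CALL = (1_500, 3_000)  # low and high estimate for the v1 first-turn call

# (low, high) token totals per mode, using only the table's two sources
totals = {
    "full": (SYSTEM_FULL, SYSTEM_FULL),
    "jit_v1": (SYSTEM_SLIM + TOOL_CALL[0], SYSTEM_SLIM + TOOL_CALL[1]),
    "semantic_v2": (SYSTEM_SLIM, SYSTEM_SLIM),
}

# Savings relative to full mode, again as (low, high)
savings_vs_full = {
    name: (SYSTEM_FULL - hi, SYSTEM_FULL - lo)
    for name, (lo, hi) in totals.items()
}

system_prompt_cut_pct = 100 * (SYSTEM_FULL - SYSTEM_SLIM) / SYSTEM_FULL
```

On these figures v1 nets roughly 1,000 to 2,500 tokens against full mode once its tool call is counted, while v2 keeps the whole system-prompt reduction of about 4,000.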

Installation

# Install the plugin
hermes plugins install underdown/semantic-skills --enable

# Build the embedding index
cd ~/.hermes/plugins/semantic-skills
python3 build_embeddings.py --backend litellm --force

# Enable semantic mode (also accepts "jit" for backward compatibility)
hermes config set skills.mode semantic

# Restart
hermes gateway restart

Migrating from jit-skills

If you are using the v1 jit-skills plugin:

hermes plugins uninstall jit-skills
hermes plugins install underdown/semantic-skills --enable
cd ~/.hermes/plugins/semantic-skills
python3 build_embeddings.py --backend litellm --force
hermes config set skills.mode semantic
hermes gateway restart

The search_skills tool remains available as a fallback. The embedding index format is compatible — no rebuild strictly required if the index already exists, but rebuilding with --force ensures the index covers any newly installed skills.

What It Does Not Do

The pre-load hook injects exactly one skill — the top match above threshold. It does not rank multiple candidates or inject ranked lists. If the pre-loaded skill is insufficient, the agent calls search_skills to find alternatives. This is by design: injecting multiple skills would bloat the user message and undermine the token savings.

The hook does not modify the system prompt. This is a hard constraint — pre_gateway_dispatch rewrites the incoming message text at the gateway layer. The system prompt is assembled later and cannot be touched by this hook.

Availability

GitHub: underdown/semantic-skills

The jit-skills plugin has been deprecated and its repository updated with migration instructions. Both skills.mode: "jit" and skills.mode: "semantic" are accepted by the source patch for backward compatibility.

[^1]: Average session length derived from Hermes Agent session DB analysis across 50+ sessions. A "turn" includes the user message, the assistant response, and all tool-call iterations in between.