2026-05-25

Logging Semantic Skill Injection Decisions

hermes, semantic-skills, observability, plugins

The semantic-skills plugin has been running for about a day now, pre-loading skills into the user message at the gateway level. It uses a cosine similarity threshold of 0.30 to decide whether to inject a skill. Above threshold? Skill gets inlined. Below? Falls through to the search_skills tool.
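The decision rule above can be sketched in a few lines. This is a minimal illustration, not the plugin's actual code: `cosine_similarity`, `choose_skill`, and the dict of skill vectors are hypothetical names, and treating a score exactly equal to the threshold as an injection is an assumption.

```python
import math

THRESHOLD = 0.30  # the value quoted in this post


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose_skill(message_vec, skill_vecs, threshold=THRESHOLD):
    """Return (best_skill, best_score, inject) for one message embedding.

    skill_vecs is a hypothetical {skill_name: embedding} mapping.
    inject=False means the message falls through to search_skills.
    """
    best_name, best_score = None, -1.0
    for name, vec in skill_vecs.items():
        score = cosine_similarity(message_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    # Assumption: a score exactly at the threshold counts as an injection.
    inject = best_name is not None and best_score >= threshold
    return best_name, best_score, inject
```

Calling `choose_skill([1.0, 0.1], {"git": [1.0, 0.0], "docker": [0.0, 1.0]})` picks `git` and sets `inject` to `True`, because the two vectors point in nearly the same direction.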

The question is: is 0.30 the right number?

No way to know without data. So we instrumented it.

What gets logged

Every message that hits the pre_gateway_dispatch hook now writes one row to a CSV:

| Field | What it tells us |
|-------|-----------------|
| top_score | Cosine similarity of the best match |
| top_skill | Which skill matched |
| injected | Did we inject it? (1/0) |
| skip_reason | Why we skipped: below_threshold, cached_duplicate, no_index, embed_fail |
| scores_top3 | Top 3 matches with scores, so we can see if runner-up skills would've been better |

The logger is fail-open — if the CSV write throws, the message still dispatches normally. Observability can never be a failure mode.
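A fail-open writer might look roughly like this. It is a sketch under assumptions: `log_decision` is a hypothetical name, the column order simply follows the table above, and the real plugin may encode `scores_top3` differently. The point is only that every exception is swallowed, so the caller can dispatch the message no matter what happens here.

```python
import csv
import os

# Column names taken from the table above; the order is an assumption.
FIELDS = ["top_score", "top_skill", "injected", "skip_reason", "scores_top3"]


def log_decision(path, row):
    """Append one decision row to the CSV at `path`.

    Returns True on success and False on any failure. It never raises,
    so a broken log cannot block message dispatch.
    """
    try:
        # Write the header only when the file is missing or empty.
        is_new = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
            if is_new:
                writer.writeheader()
            writer.writerow(row)
        return True
    except Exception:
        # Fail open: report the failure to the caller, but do not propagate it.
        return False
```

Returning a boolean rather than raising keeps the hook's control flow trivial: the caller can ignore the result entirely, or count failures somewhere cheap.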

What we're looking for

After a day or two of data, three questions become answerable:

1. Threshold sweet spot. How many messages score between 0.20 and 0.30 (near-misses)? If there are a lot, the threshold is too high and we're making unnecessary search_skills calls. If injected messages routinely score above 0.50, there is headroom to raise the threshold and cut marginal injections.
2. Session dedup efficiency. How often does cached_duplicate fire? This tells us whether the session-level cache (don't re-inject the same skill on follow-up turns) is doing meaningful work or just burning a dict lookup.
3. Skill dominance. Which skills get injected most? If one skill accounts for 60% or more of injections, we might want to special-case it: always inject it, never inject it, or give it a different threshold.
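All three questions reduce to simple counts over the log. The sketch below is not the real analyze_injections.py; `summarize` is a hypothetical helper, it assumes `injected` is stored as the string "1" or "0", and the sample rows are made up purely to exercise the code.

```python
import csv
import io
from collections import Counter


def summarize(rows, lo=0.20, threshold=0.30):
    """Count near-misses, cache hits, and the most-injected skill."""
    rows = list(rows)
    # Near-misses: skipped for being below threshold, but scoring in [lo, threshold).
    near_misses = sum(
        1 for r in rows
        if r["skip_reason"] == "below_threshold"
        and lo <= float(r["top_score"]) < threshold
    )
    cached = sum(1 for r in rows if r["skip_reason"] == "cached_duplicate")
    injected = Counter(r["top_skill"] for r in rows if r["injected"] == "1")
    n_injected = sum(injected.values())
    top_skill, top_count = injected.most_common(1)[0] if injected else (None, 0)
    return {
        "total": len(rows),
        "near_misses": near_misses,
        "cached_duplicates": cached,
        "injected": n_injected,
        "top_skill": top_skill,
        "top_skill_share": top_count / n_injected if n_injected else 0.0,
    }


# Made-up rows for illustration only; these are not real log data.
SAMPLE = """top_score,top_skill,injected,skip_reason,scores_top3
0.45,git,1,,
0.27,docker,0,below_threshold,
0.12,git,0,below_threshold,
0.52,git,0,cached_duplicate,
0.38,docker,1,,
0.61,git,1,,
"""

stats = summarize(csv.DictReader(io.StringIO(SAMPLE)))
```

On the made-up sample, one of the two below-threshold rows lands in the 0.20 to 0.30 band, one row is a cache hit, and `git` accounts for two of the three injections.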

The analysis script

python3 ~/.hermes/plugins/semantic-skills/scripts/analyze_injections.py

Prints injection rate, score distribution, top injected skills, below-threshold near-misses, and cache-hit avoidance stats. Designed to be piped into a cron report or eyeballed manually.

Why this matters

The difference between a 0.25 and 0.35 threshold could be hundreds of unnecessary search_skills tool calls per week — each one costing ~200 tokens and ~1 second of latency. But setting the threshold too low means injecting irrelevant skills that bloat the prompt and confuse the agent.
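To put rough numbers on that, here is the arithmetic as a tiny helper. The per-call costs are the estimates quoted above; the figure of 300 extra calls per week is a hypothetical input, not a measurement.

```python
def fallback_cost(extra_calls, tokens_per_call=200, seconds_per_call=1.0):
    """Return (tokens, seconds) spent on avoidable search_skills calls."""
    return extra_calls * tokens_per_call, extra_calls * seconds_per_call


# Hypothetical: 300 avoidable calls in a week.
tokens, seconds = fallback_cost(300)
print(f"{tokens} tokens, {seconds / 60:.1f} minutes of added latency")
```

Under those assumptions the week costs 60,000 tokens and about five minutes of cumulative latency, spread across individual turns.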

The injection log lives at ~/.hermes/plugins/semantic-skills/injection_log.csv. The analysis script is in the same directory under scripts/.