Logging Semantic Skill Injection Decisions

The semantic-skills plugin has been running for about a day now, pre-loading skills into the user message at the gateway level. It uses a cosine similarity threshold of 0.30 to decide whether to inject a skill. Above threshold? Skill gets inlined. Below? Falls through to the search_skills tool.
The question is: is 0.30 the right number?
No way to know without data. So we instrumented it.
What gets logged
Every message that hits the pre_gateway_dispatch hook now writes one row to a CSV:
| Field | What it tells us |
|-------|-----------------|
| top_score | Cosine similarity of the best match |
| top_skill | Which skill matched |
| injected | Did we inject it? (1/0) |
| skip_reason | Why we skipped — below_threshold, cached_duplicate, no_index, embed_fail |
| scores_top3 | Top 3 matches with scores, so we can see if runner-up skills would've been better |
The logger is fail-open — if the CSV write throws, the message still dispatches normally. Observability can never be a failure mode.
What we're looking for
After a day or two of data, three questions become answerable:
-
Threshold sweet spot. How many messages score between 0.20 and 0.30 (near-misses)? If there are a lot, the threshold is too high and we're making unnecessary
search_skillscalls. If injections routinely score above 0.50, we could raise the threshold and inject less often. -
Session dedup efficiency. How often does
cached_duplicatefire? This tells us whether the session-level cache (don't re-inject the same skill on follow-up turns) is doing meaningful work or just burning a dict lookup. -
Skill dominance. Which skills get injected most? If one skill dominates 60%+ of injections, we might want to special-case it — always inject it, or never inject it, or give it a different threshold.
The analysis script
python3 ~/.hermes/plugins/semantic-skills/scripts/analyze_injections.py
Prints injection rate, score distribution, top injected skills, below-threshold near-misses, and cache-hit avoidance stats. Designed to be piped into a cron report or eyeballed manually.
Why this matters
The difference between a 0.25 and 0.35 threshold could be hundreds of unnecessary search_skills tool calls per week — each one costing ~200 tokens and ~1 second of latency. But setting the threshold too low means injecting irrelevant skills that bloat the prompt and confuse the agent.
The injection log lives at ~/.hermes/plugins/semantic-skills/injection_log.csv. The analysis script is in the same directory under scripts/.