2026-05-25

Deterministic Prefix Ordering: How Production Agents Get 90% Prompt Cache Hit Rates

Tags: hermes, llm, caching, optimization, tokens, infrastructure

Prompt caching is the least-glamorous and most-effective cost lever in LLM infrastructure. It cuts input token costs by up to 90% and time-to-first-token latency by up to 80 to 85%, depending on the provider, but it operates on a single constraint: exact-prefix matching. Change one byte in the cached portion, and the entire cache invalidates. Production agents work around this by enforcing deterministic prefix ordering, which is a structural convention, not an AI technique. Here is how the teams shipping at scale implement it.

How Prompt Caching Actually Works

Prompt caching is not semantic. It does not recognize that two prompts are "basically the same." It is exact-prefix matching: the request must be identical, token for token, from the first position up to the cache boundary, which in practice means byte-identical input. The provider stores the key/value tensors computed during the attention pass over the prompt's first N tokens. If the next request begins with the identical prefix, the server skips the re-computation and reuses the cached tensors[^1].
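To make the matching rule concrete, here is a minimal sketch that counts how many leading tokens two requests share. The integer token IDs are stand-ins invented for illustration, not real tokenizer output, and `shared_prefix_len` is a hypothetical helper rather than any provider's API.

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Stand-in token IDs for a 2,000-token prompt that has already been cached.
cached_prompt = list(range(2000))

# Appending to the end leaves the whole cached prefix reusable.
appended = cached_prompt + [9001, 9002]

# Changing the very first token leaves nothing to reuse.
edited_at_start = [7] + cached_prompt[1:]

reusable_after_append = shared_prefix_len(cached_prompt, appended)
reusable_after_edit = shared_prefix_len(cached_prompt, edited_at_start)
```

The append case keeps all 2,000 tokens reusable, while the early edit drops the reusable prefix to zero even though 1,999 tokens are unchanged.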

The cache hierarchy for Anthropic's Claude, whose caching model is the most widely adopted, is:

[tools] → [system messages] → [messages up to cache breakpoint] → [new tail]

The tools array and system messages are part of the cached prefix. Alter tool definitions, change the system prompt wording, or even toggle extended thinking settings, and the entire cache invalidates[^2].

OpenAI's implementation follows the same prefix-matching rule but activates automatically, with no explicit cache_control markers needed. DeepSeek and xAI also auto-cache. The commonly cited threshold is a 1,024-token minimum prefix, with caching in 128-token increments beyond that[^3]. Exact minimums differ by provider and model, so check the current documentation for the one you use.
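The threshold rule described above (a 1,024-token minimum, then 128-token increments) can be sketched as a small function. The constants come from the text; the function name is hypothetical, and real providers may round differently.

```python
MIN_PREFIX = 1024  # minimum cacheable prefix length, per the text
INCREMENT = 128    # cache granularity beyond the minimum, per the text


def cached_token_count(matching_prefix_tokens: int) -> int:
    """Return how many tokens of a matching prefix would be served from cache."""
    if matching_prefix_tokens < MIN_PREFIX:
        return 0  # below the threshold nothing is cached
    full_increments = (matching_prefix_tokens - MIN_PREFIX) // INCREMENT
    return MIN_PREFIX + full_increments * INCREMENT


examples = {n: cached_token_count(n) for n in (900, 1024, 1100, 1152, 2000)}
```

Under this rule a 2,000-token matching prefix yields 1,920 cached tokens, and anything below 1,024 yields none.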

Provider	Cache Type	Min Prefix (tokens)	Max Input Cost Reduction	Max TTFT Reduction
Anthropic	Explicit markers	1,024	90%	85%
OpenAI	Automatic	1,024	90%	80%
DeepSeek	Automatic	1,024	90%	80%
Google (Gemini)	Explicit markers	1,024	75%	not published

Cache reads are much cheaper than base input tokens, but on Anthropic a cache write costs 25% more than a normal input pass. The payoff comes from reusing an expensive prefix multiple times. On list prices alone a single cache hit more than recovers the write premium; a more conservative rule of thumb is roughly 4+ turns[^2].
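A back-of-the-envelope check of the write premium, as a sketch under idealized assumptions: the whole prompt is the cached prefix, the cache never expires, writes cost 1.25x base price, and reads cost 0.10x (the 90% reduction quoted above). The function names are hypothetical.

```python
def cumulative_cost(prefix_tokens: int, turns: int, cached: bool,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total input cost, in base-price token units, after `turns` requests."""
    if turns == 0:
        return 0.0
    if not cached:
        return float(prefix_tokens * turns)
    # The first request writes the cache; every later request reads from it.
    return prefix_tokens * (write_mult + read_mult * (turns - 1))


def break_even_turn(prefix_tokens: int) -> int:
    """First turn at which the cached session is no more expensive than uncached."""
    turn = 1
    while cumulative_cost(prefix_tokens, turn, True) > cumulative_cost(prefix_tokens, turn, False):
        turn += 1
    return turn


first_cheaper_turn = break_even_turn(10_000)
```

This is a best case. Caches that expire between turns must be rewritten at the premium rate, which pushes the real break-even later, so the more conservative figure in the text is a safer planning number.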

The Pattern: Deterministic Prefix Ordering

Every provider's cookbook recommends the same structure. Place durable, unchanging content at the beginning. Place volatile, session-specific content at the end.

The canonical order:

1. Tool definitions          ← most stable
2. System prompt             ← stable per session
3. Reference documents       ← stable per session
4. Conversation history      ← grows but earlier messages are stable
5. User message              ← always new

If tools, system prompt, and reference docs are byte-identical across turns, the prefix caches perfectly. The user message — the only truly dynamic part — sits after the cache breakpoint.

This is not a product feature. It is a developer convention enforced by prompt structure. Every agent that wants cache hits must obey it.
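A minimal sketch of the convention, assuming a generic request shape rather than any specific provider's schema. `build_request` and `prefix_fingerprint` are hypothetical helpers, and the tool, system prompt, and document strings are invented for illustration. The fingerprint deliberately does not sort keys or normalize anything, because the goal is to detect byte-level drift in the stable part.

```python
import hashlib
import json


def build_request(tools, system_prompt, reference_docs, history, user_message):
    """Assemble a request in canonical order: stable parts first, new input last."""
    return {
        "tools": list(tools),                       # 1. most stable
        "system": [system_prompt, *reference_docs],  # 2-3. stable per session
        "messages": [*history, {"role": "user", "content": user_message}],  # 4-5.
    }


def prefix_fingerprint(request) -> str:
    """Hash the parts that are supposed to be identical on every turn."""
    stable = {"tools": request["tools"], "system": request["system"]}
    blob = json.dumps(stable, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


TOOLS = [{"name": "read_file", "description": "Read a file from disk"}]
SYSTEM = "You are a coding agent."
DOCS = ["Style guide: prefer small functions."]

turn_1 = build_request(TOOLS, SYSTEM, DOCS, [], "List the failing tests.")
history = turn_1["messages"] + [{"role": "assistant", "content": "Two tests fail."}]
turn_2 = build_request(TOOLS, SYSTEM, DOCS, history, "Fix the first one.")

# A per-turn edit to the system prompt changes the fingerprint.
drifted = build_request(TOOLS, SYSTEM + " Today is Monday.", DOCS, history, "Fix the first one.")
```

Logging the fingerprint on every request gives a cheap signal: if it changes mid-session, something upstream of the user message is mutating the prefix.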

Who Is Shipping This

Claude Code: Agent Architecture Designed Around Caching

Anthropic's engineering team has been the most public about designing agents for prompt caching from the architecture level:

"One of the biggest realizations I've had working on Claude Code is that you fundamentally have to design agents for prompt caching first. Almost every feature touches on it somehow."

This comes from a Claude Code engineer who published a detailed breakdown of their approach in April 2026[^4]. The key decisions:

- Stable tool stubs are always present, in the same order. Tool definitions sit at the top of the prefix and never change during a session. Even when tool search results are dynamic, the stubs remain static.
- Messages are appended, never modified. New user input, assistant responses, and tool results are added to the end of the messages array. Earlier messages are never rewritten. This keeps the cached prefix intact, so only the new tail needs fresh computation.
- Repository guidance lives in stable files (CLAUDE.md), not ad-hoc chat messages. This removes the temptation to "restate the same instructions differently" and break the cache.
- Cache-unfriendly patterns are flagged as anti-patterns and avoided by design: switching models mid-session, changing tool definitions, toggling thinking settings, re-pasting the same context with tiny wording changes, and rewriting the brief instead of extending it[^2].

The result: Claude Code sessions maintain cache hits across multi-turn conversations without the user needing to think about it. The agent's internal prompt structure enforces the convention.
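The append-only rule is easy to enforce in code. The following is an illustrative sketch, not Claude Code's implementation: `AppendOnlyHistory` is a hypothetical wrapper that exposes no way to edit or delete earlier messages, so every earlier snapshot remains a prefix of every later one.

```python
class AppendOnlyHistory:
    """Message log that can grow at the end but never be rewritten."""

    def __init__(self):
        self._messages = []

    def append(self, role: str, content: str) -> None:
        self._messages.append({"role": role, "content": content})

    def snapshot(self) -> tuple:
        # Return an immutable copy so callers cannot mutate past messages.
        return tuple((m["role"], m["content"]) for m in self._messages)


history = AppendOnlyHistory()
history.append("user", "Run the test suite.")
history.append("assistant", "3 passed, 1 failed.")
before = history.snapshot()

history.append("user", "Show me the failure.")
after = history.snapshot()

# The earlier conversation is an exact prefix of the later one.
prefix_preserved = after[:len(before)] == before
```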

LiteLLM: Auto-Injected Cache Control for 10+ Providers

LiteLLM runs a proxy layer that automatically injects cache_control markers into outgoing requests. It supports Anthropic, AWS Bedrock, Vertex AI, Google AI Studio, Azure AI, OpenRouter, Databricks, DashScope/Qwen, MiniMax, and Z.ai/GLM[^5].

The configuration is declarative — specify which message roles to cache and LiteLLM adds the markers:

from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[...],
    cache_control_injection_points=[
        {"location": "message", "role": "system"},  # mark system messages
        {"location": "message", "index": -2},  # mark the second-to-last message
    ],
)

LiteLLM does not rewrite prompts. It adds breakpoint markers at user-configured positions. The developer still owns the prefix structure. The value is provider abstraction — one config works across all supported backends without per-provider code changes.

Reported savings from teams using LiteLLM with prompt caching range from 30% to 90% depending on workload[^6]. The high end comes from context-heavy use cases (legal documents, codebase analysis) where the cached prefix is 10,000+ tokens.

Portkey: Dual-Layer Caching with Semantic Fallback

Portkey runs two caching layers. Layer 1 is standard exact-prefix caching through the provider's native implementation. When that misses, Layer 2 checks a semantic cache — it looks for a previously cached response to a semantically similar prompt, not just a byte-identical prefix[^7].

This is the closest production implementation to "prompt rewriting for caching" — but it operates at the response level, not the KV-compute level. A semantic cache hit returns the stored response directly, bypassing the LLM entirely. It does not save compute on the new request; it skips it.

The trade-off: semantic caching works across different prompt formulations and model providers, with a 90-day TTL (vs. 5-60 minutes for exact-prefix caching). But it requires similarity threshold tuning and risks serving stale responses to prompts that look similar but require different answers.
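To illustrate the response-level layer and why the threshold matters, here is a toy sketch. It is not Portkey's API: `SimilarityResponseCache` is hypothetical, and `difflib.SequenceMatcher` stands in for the embedding similarity a real semantic cache would use.

```python
from difflib import SequenceMatcher


class SimilarityResponseCache:
    """Toy response cache: returns a stored answer for a similar-enough prompt."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self._store = {}  # prompt text -> stored response

    def put(self, prompt: str, response: str) -> None:
        self._store[prompt] = response

    def get(self, prompt: str):
        best_response, best_score = None, 0.0
        for known_prompt, response in self._store.items():
            # String similarity is a crude placeholder for embedding distance.
            score = SequenceMatcher(None, prompt, known_prompt).ratio()
            if score > best_score:
                best_response, best_score = response, score
        if best_score >= self.threshold:
            return best_response  # hit: the LLM call is skipped entirely
        return None  # miss: fall through to the model


cache = SimilarityResponseCache(threshold=0.9)
cache.put("What is the capital of France?", "Paris")

reworded_hit = cache.get("What's the capital of France?")
unrelated_miss = cache.get("Summarize this contract.")
```

Lowering the threshold raises the hit rate but also the chance of returning an answer to a question the user did not ask, which is the tuning risk described above.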

The Deliberately-Long-Prompt Trick

One counterintuitive optimization that appears in multiple production setups: deliberately lengthening a prompt so it crosses the 1,024-token minimum threshold for caching. A 900-token prompt has a 0% cache rate regardless of structure. Padding it to 1,100 tokens with stable, reusable content turns a 0% rate into a potential 50%+ rate. The math[^3]:

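- 900-token prompt: 0% cache rate, full input cost every turn
- 1,100-token prompt with a 50% cache hit rate: 33% lower input cost than the 900-token baseline, assuming cached tokens bill at a 90% discount
- 1,100-token prompt with a 70% cache hit rate: 55% lower input cost than the 900-token baseline, under the same assumption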

The extra 200 tokens per request pay for themselves once the hit rate exceeds roughly 20%; above that, the cache savings outweigh the padding.
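The arithmetic behind these figures can be reproduced with a short sketch. It assumes that cached tokens bill at a 90% discount and that the whole padded prompt counts as cached on a hit, which is the simplification the numbers above imply. The function names are hypothetical.

```python
def effective_tokens(prompt_tokens: int, hit_rate: float,
                     cached_discount: float = 0.90) -> float:
    """Average billed tokens per request at a given cache hit rate."""
    return prompt_tokens * (1 - hit_rate * cached_discount)


def savings_vs_baseline(padded_tokens: int, hit_rate: float,
                        baseline_tokens: int = 900) -> float:
    """Fractional saving of the padded prompt against the uncacheable baseline."""
    return 1 - effective_tokens(padded_tokens, hit_rate) / baseline_tokens


def break_even_hit_rate(padded_tokens: int, baseline_tokens: int = 900,
                        cached_discount: float = 0.90) -> float:
    """Hit rate at which the padded prompt costs the same as the baseline."""
    return (1 - baseline_tokens / padded_tokens) / cached_discount


saving_at_50 = round(savings_vs_baseline(1100, 0.50) * 100)
saving_at_70 = round(savings_vs_baseline(1100, 0.70) * 100)
break_even = break_even_hit_rate(1100)
```

With these assumptions the savings come out at 33% and 55%, and the break-even hit rate is just over 20%.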

What Breaks the Cache

The research is consistent across providers on what invalidates exact-prefix caching. The list is worth internalizing because every item is a common pattern in agent code:

Cache Killer	Why It Breaks	Fix
Dynamic content at the start	Even a timestamp at position 1 invalidates the entire prefix	Move timestamps to metadata, never the prompt body
Changing tool definitions mid-session	Tools sit at the top of the prefix; any change invalidates everything	Freeze tool stubs for the session duration
Rewriting the system prompt per turn	System messages are cached; per-turn injection breaks the prefix	Inject dynamic content into later messages, not the system block
Toggling thinking settings	Affects message block encoding and breaks byte-identical matching	Set thinking budget once at session start
Switching models	Different model = different KV cache lane; no carry-over	Treat model switches as cache boundaries
Re-pasting context with wording changes	The agent restates the same guidance differently each turn	Put durable guidance in stable files, reference them by path
Inconsistent whitespace or ordering	Trailing spaces, different line endings, or reordered keys break exact match	Canonicalize whitespace and JSON key ordering at the proxy layer
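The last row of the table can be handled mechanically. Below is a minimal sketch of canonicalization, assuming it runs once at the proxy layer before a request is sent. Both helper names are hypothetical.

```python
import json


def canonical_text(text: str) -> str:
    """Normalize line endings and strip trailing whitespace from every line."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n".join(line.rstrip() for line in unified.split("\n"))


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


windows_style = "Be concise.  \r\nCite sources.\t\r\n"
unix_style = "Be concise.\nCite sources.\n"

tool_a = {"name": "search", "parameters": {"query": "string", "limit": "integer"}}
tool_b = {"parameters": {"limit": "integer", "query": "string"}, "name": "search"}

same_text = canonical_text(windows_style) == canonical_text(unix_style)
same_json = canonical_json(tool_a) == canonical_json(tool_b)
```

Canonicalization only helps if it is applied on every request; applying it on some code paths and not others reintroduces the mismatch.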

The Tension in Agent Harnesses

Agent harnesses like Hermes have a built-in conflict with prompt caching. The system prompt structure is:

[persona] [memory] [tool definitions] [skill listing] [user message]

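JIT skill retrieval cuts system prompt tokens by 83% to 95%, depending on where the injected content lives (see the table below), by removing the 123 skill descriptions and injecting only the relevant one. But when the injection lands in the system prompt, it changes the system prompt per turn, which is exactly the pattern that breaks prefix caching.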

The trade-off:

Approach	System Prompt Tokens	Cache-Friendly	Net Effect
Full skill listing (123 skills)	4,200	Yes	Stable prefix, but 4,200 tokens every turn
JIT skills (1 skill injected)	726	No	5.8x fewer tokens but breaks the cache per turn
JIT + memory in messages	226	Yes	Stable system prefix + dynamic skill in the tail

The third row is the practical resolution: move all dynamic content (skill injection, memory, user message) into the messages array after the system block. The system prompt becomes a fully stable 226-token skeleton that caches perfectly. The per-turn dynamic content sits in the uncached tail where it belongs.

For agent harness builders, the rule is simple: the system prompt is a cache contract. Anything that changes per turn must live in the messages array, not the system block. If a skill injection modifies the system prompt, it breaks the contract.
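A sketch of the cache contract in harness code, under stated assumptions: the skeleton text, the `<skill>` and `<memory>` wrappers, and `build_turn` are all invented for illustration and are not Hermes internals. The point is only that the system block is a constant and the per-turn material rides in the final message.

```python
# Hypothetical stable skeleton: persona and fixed rules only, never edited per turn.
SYSTEM_SKELETON = "You are an agent. Use tools when needed."


def build_turn(history, skill_text, memory_text, user_message):
    """Keep the system block constant; put per-turn content in the last message."""
    tail = (
        f"<skill>\n{skill_text}\n</skill>\n"
        f"<memory>\n{memory_text}\n</memory>\n"
        f"{user_message}"
    )
    return {
        "system": SYSTEM_SKELETON,
        "messages": [*history, {"role": "user", "content": tail}],
    }


turn_a = build_turn([], "git-rebase skill text", "user prefers terse answers",
                    "Rebase my branch.")

history_b = turn_a["messages"] + [{"role": "assistant", "content": "Done."}]
turn_b = build_turn(history_b, "sql-tuning skill text", "user prefers terse answers",
                    "Speed up this query.")
```

A different skill is injected on each turn, yet the system block and all earlier messages are unchanged, so the prefix stays cacheable.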

What Nobody Is Doing (Yet)

There is no production prompt-rewriting proxy that uses a cheap LLM to normalize user input before forwarding to an expensive model. The idea surfaces in discussions but fails on latency and cost arithmetic:

- The rewrite model consumes tokens and adds latency
- The canonical form must be reused many times to amortize the rewrite cost
- In single-turn chat, every user message is unique, so no reuse is possible
- In multi-turn conversations, the conversation history is the dynamic tail and cannot be meaningfully "normalized" without changing its content

The one place this math works: batch processing. If 10,000 prompts share identical instructions but differ only in a single data field, normalizing them into an identical prefix structure before batching makes sense. Several RAG pipelines do this implicitly by separating document context from query text in a fixed template.
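For the batch case, the fixed template is the whole technique. This sketch compares a template that puts the shared instructions first with one that puts the variable field first. The instruction text and record strings are invented, and `os.path.commonprefix` is used only because it compares strings character by character.

```python
import os

# Hypothetical shared instructions, identical for every item in the batch.
INSTRUCTIONS = "Extract the vendor name and total amount. Reply as JSON."


def render_cache_friendly(record: str) -> str:
    """Shared instructions first, variable data last."""
    return f"{INSTRUCTIONS}\n\nRecord:\n{record}"


def render_cache_hostile(record: str) -> str:
    """Variable data first: the prompts diverge almost immediately."""
    return f"Record: {record}\n\n{INSTRUCTIONS}"


records = ["invoice 1041 from Acme, total 300 USD", "receipt 77 from Globex, total 12 EUR"]

friendly_shared = os.path.commonprefix([render_cache_friendly(r) for r in records])
hostile_shared = os.path.commonprefix([render_cache_hostile(r) for r in records])
```

The friendly template shares the entire instruction block across the batch, while the hostile one shares only the literal `Record: ` label.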

Getting Started

To maximize cache hit rates in an existing agent:

1. Audit your prefix. Log the first 2,048 tokens of every request. If the prefix changes across turns, find what is changing and why.
2. Freeze the system prompt. No per-turn injection. No dynamic dates. No user-specific preamble. The system block is immutable for the session duration.
3. Order messages append-only. Never rewrite earlier messages. New tool results, assistant responses, and user input are appended to the end.
4. Stabilize tool definitions. Load them once at session start and never mutate them. If you need different tools for different turns, you need different sessions.
5. Monitor cache hit rate. Major providers report cached token counts in the response's usage block, for example cached_tokens under prompt_tokens_details on OpenAI and cache_read_input_tokens on Anthropic. Track it. A sub-50% hit rate for a mature agent means the prefix is unstable.
6. Pad if necessary. If your stable prefix is under 1,024 tokens, add a static reference document or style guide to push it over the threshold. The extra tokens are cheaper than a 0% cache rate.
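The monitoring step can be sketched as one function over the usage block. It assumes the usage payload has already been converted to a plain dict and handles two shapes: OpenAI-style (`prompt_tokens` with `prompt_tokens_details.cached_tokens`) and Anthropic-style (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`). The function name and sample numbers are hypothetical; verify the field names against your provider's current response schema.

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache for a single response."""
    if "prompt_tokens" in usage:
        # OpenAI-style: prompt_tokens already includes the cached tokens.
        total = usage["prompt_tokens"]
        cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    else:
        # Anthropic-style: input_tokens excludes cache reads and cache writes.
        cached = usage.get("cache_read_input_tokens", 0)
        total = (
            usage.get("input_tokens", 0)
            + cached
            + usage.get("cache_creation_input_tokens", 0)
        )
    return cached / total if total else 0.0


openai_like = {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1920}}
anthropic_like = {"input_tokens": 100, "cache_read_input_tokens": 900,
                  "cache_creation_input_tokens": 0}
first_turn = {"input_tokens": 50, "cache_read_input_tokens": 0,
              "cache_creation_input_tokens": 950}

rates = [cache_hit_rate(u) for u in (openai_like, anthropic_like, first_turn)]
```

The first turn of a session always reports zero because it is writing the cache, so track the rate per session rather than alerting on individual requests.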

This is not an AI problem. It is a structural engineering problem solved by convention and enforced by discipline. The teams getting 90% cache hit rates are not using smarter models — they are using dumber, more deterministic prompts.

[^1]: OpenAI. "Prompt Caching 201." OpenAI Cookbook. 2025.

[^2]: Mager, Dan. "Claude: How Prompt Caching Actually Works." Mager.co. April 29, 2026.

[^3]: OpenAI. "Prompt Caching." OpenAI API Docs. 2025.

[^4]: @trq212. "One of the biggest realizations I've had working on Claude Code is that you fundamentally have to design agents for prompt caching first." X. 2026.

[^5]: LiteLLM. "Auto-Inject Prompt Caching Checkpoints." LiteLLM Docs. 2025.

[^6]: Prasanna, Vijit. "Prompt Caching, LiteLLM, and the 8600-Token Bug." LinkedIn. 2025.

[^7]: Portkey. "OpenAI's Prompt Caching: A Deep Dive." Portkey Blog. 2025.