2026-05-26

GBrain + Mnemosyne: Creating the Voltron of Memory Plugins With Hot Cache and Cold Storage

hermes, gbrain, mnemosyne, memory, agents

Mnemosyne is a community-built memory plugin for Hermes Agent by AxDSan -- SQLite-backed, zero-dependency, sub-millisecond recall. It injects context into the system prompt every turn. GBrain is @garrytan's knowledge management engine -- slug-based markdown pages with hybrid vector + keyword search, knowledge graph edges, and 42 built-in skills. They occupy the same conceptual space but operate at different temperatures: Mnemosyne is a hot cache, GBrain is cold storage.

The problem is that the agent does not know what GBrain contains without querying it. Every relevant fact in GBrain requires an explicit MCP tool call to retrieve. If the agent does not know to ask, it never sees the data.

The fix connects them at two levels.

Architecture

Level	What it does	Latency	Context cost
Level 1 -- Digest	Ambient table of contents: people, projects, concepts	0 ms	~113 chars
Level 2 -- Injection	Relevance-gated page summaries on session start	~6 s (first turn only)	~300-500 chars
Before (baseline)	Agent must guess whether to query GBrain	2+ turns to retrieve	Full MCP call chain

Level 1: Digest Sync

A 30-line Python script runs every 6 hours via Hermes cron. It calls gbrain stats and gbrain list, filters out auto-generated fact pages, and produces a compact entry:

GBrain digest (2026-05-26 15:07 UTC): 30 pages, 28 embedded, 15 tags. People: Ryan Underdown. Projects: Hermes Agent Setup. Key concepts: Embedding Providers, GBrain Architecture, Hermes Agent Skills, Mnemosyne Memory System, SkillOpt.

The digest is stored in Mnemosyne memory, so it reaches the agent with the rest of the injected context on every turn. The agent knows which topics GBrain covers without asking. When the user mentions one of them, the agent can point to the relevant page immediately instead of spending a round-trip MCP call to find out whether it exists.

The digest averages 113 characters against Mnemosyne's 2,200-character budget, roughly 5% of it. The cost is negligible.
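The formatting step of the digest can be sketched as follows. This is a minimal illustration, not the script from the repo: the page record shape (`title`, `type`), the `"fact"` type label, and the `build_digest` name are assumptions, and parsing the real `gbrain stats` and `gbrain list` output is left out because its format is not shown here.

```python
from datetime import datetime, timezone

# Hypothetical record shape: {"title": str, "type": str}.
# Assumption: auto-generated fact pages carry type "fact".

def build_digest(pages, embedded, tags, now, skip_types=("fact",)):
    """Format a one-line digest from already-parsed GBrain page records."""
    # Drop auto-generated pages so only curated titles are listed.
    kept = [p for p in pages if p["type"] not in skip_types]

    def titles(kind):
        return ", ".join(sorted(p["title"] for p in kept if p["type"] == kind))

    parts = [
        f"GBrain digest ({now:%Y-%m-%d %H:%M} UTC): "
        f"{len(pages)} pages, {embedded} embedded, {tags} tags."
    ]
    sections = (("People", "person"), ("Projects", "project"), ("Key concepts", "concept"))
    for label, kind in sections:
        names = titles(kind)
        if names:  # omit empty sections to keep the digest short
            parts.append(f"{label}: {names}.")
    return " ".join(parts)


example_pages = [
    {"title": "Hermes Agent Setup", "type": "project"},
    {"title": "Ryan Underdown", "type": "person"},
    {"title": "SkillOpt", "type": "concept"},
]
print(build_digest(example_pages, embedded=3, tags=2,
                   now=datetime(2026, 5, 26, 15, 7, tzinfo=timezone.utc)))
```

Keeping the formatter separate from the CLI calls makes the output length easy to check against the memory budget before anything is written to Mnemosyne.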

Level 2: Relevance-Gated Injection

A pre_llm_call hook fires on the first turn of each session. It embeds the user's query, runs GBrain's hybrid search, and injects the top 3 matching page summaries into the user's message.
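The control flow of such a hook might look like the sketch below. The signature, the `search` callable, the `(title, summary, score)` result shape, and the `min_score` threshold are all hypothetical; the actual Hermes plugin interface and GBrain search call are in the repo, not reproduced here.

```python
from typing import Callable, Optional

# Hypothetical result shape: (title, summary, score) tuples, best match first.
SearchFn = Callable[[str], list[tuple[str, str, float]]]


def pre_llm_call(user_message: str, is_first_turn: bool, search: SearchFn,
                 top_k: int = 3, min_score: float = 0.5) -> Optional[str]:
    """Return an augmented first-turn message, or None to leave it untouched."""
    if not is_first_turn:
        return None  # later turns pass through unchanged
    # Relevance gate: keep only hits above the (assumed) score threshold.
    hits = [h for h in search(user_message) if h[2] >= min_score][:top_k]
    if not hits:
        return None  # nothing relevant enough to inject
    context = "\n".join(f"- {title}: {summary}" for title, summary, _ in hits)
    return f"{user_message}\n\n[GBrain context]\n{context}"
```

Returning `None` rather than an unchanged copy of the message makes the "no modification" case explicit, which matters for the caching argument later in the post.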

Test queries and their matches:

Query	GBrain page matched	Relevance
"hermes skills architecture"	Hermes Agent Skills	1.12
"what is skillopt"	SkillOpt -- Self-Evolving Agent Skills	matched
"embedding providers"	Embedding Providers	matched

GBrain Instance

The test instance runs GBrain v0.41.2.0 with PGLite (embedded WASM Postgres). Embeddings use Nomic embed v2 MoE (768 dimensions) through LM Studio on a remote desktop via Tailscale. The LITELLM_BASE_URL env var bypasses the local LiteLLM proxy, pointing GBrain's embedding provider directly at LM Studio's OpenAI-compatible API.
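For orientation, an OpenAI-compatible embeddings call is a POST to an `/embeddings` path with a `model` and an `input` field. The sketch below only builds such a request from `LITELLM_BASE_URL`; it is not GBrain's code. The fallback URL, the model identifier, and the assumption that the base URL already ends in `/v1` are illustrative guesses.

```python
import json
import os
import urllib.request


def build_embedding_request(text: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    # Assumption: the base URL already includes the /v1 prefix.
    base_url = os.environ.get("LITELLM_BASE_URL", "http://localhost:1234/v1")
    # Hypothetical model identifier; use whatever name LM Studio reports.
    body = json.dumps({"model": "nomic-embed-v2-moe", "input": text}).encode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` would return JSON whose `data[0].embedding` field holds the vector, which should have 768 entries for the model described above.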

Seven pages were seeded for the test: five concepts (Hermes skills, GBrain architecture, Mnemosyne memory, SkillOpt paper, embedding providers), one person (Ryan Underdown), and one project (Hermes Agent Setup). GBrain's hybrid search correctly matched all three test queries to their relevant pages.

What This Does Not Do

This is not a full RAG pipeline. It does not chunk pages, embed them in a separate vector index, or add per-turn retrieval latency. Level 1 is a static digest refreshed on a schedule. Level 2 fires once per session and adds latency to the first response only.

The trade-off is intentional. A full RAG pipeline would duplicate GBrain's existing hybrid search infrastructure and add token bloat to every turn. The two-level approach gives the agent the "shape" of GBrain (Level 1) and on-demand page summaries when relevant (Level 2), without reinventing GBrain inside Mnemosyne.

Cache Safety

Prompt caching is the single largest cost lever for LLM inference. Any approach that modifies the system prompt or user message on every turn risks invalidating cached prefixes and multiplying token costs. The two-level design avoids this completely.

Level 1 (digest): The digest lives in Mnemosyne, which is injected into the system prompt at session start. The digest changes at most once every 6 hours, when the cron fires. Provider prompt caches -- Anthropic's 5-minute TTL, DeepSeek's automatic caching, OpenAI's automatic caching -- are far shorter-lived: with TTLs of 5 to 10 minutes, a cached prefix expires 36 to 72 times between digest updates. A digest refresh therefore almost never lands on a prefix that is still cached.
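The 36-to-72 figure is the 6-hour refresh interval divided by a cache TTL of 10 or 5 minutes. A quick check, assuming those TTL bounds:

```python
DIGEST_INTERVAL_MIN = 6 * 60  # cron period from the text: 6 hours


def ttl_windows_per_refresh(ttl_minutes: int) -> int:
    """How many full cache lifetimes fit between two digest refreshes."""
    return DIGEST_INTERVAL_MIN // ttl_minutes


print(ttl_windows_per_refresh(5))   # 72
print(ttl_windows_per_refresh(10))  # 36
```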

Within a session, the system prompt is assembled once from files on disk. If the cron fires mid-session, the running agent keeps the old digest. No mid-conversation invalidation.
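The snapshot behavior can be illustrated with a small sketch. The `Session` class and the idea of a single digest file are hypothetical stand-ins for however Hermes and Mnemosyne actually assemble the prompt; the point is only that the prompt is read once and then held in memory.

```python
from pathlib import Path


class Session:
    """Hypothetical session that snapshots the digest when it starts."""

    def __init__(self, base_prompt: str, digest_path: Path):
        # Read once at session start; later file changes are not observed.
        digest = digest_path.read_text(encoding="utf-8").strip()
        self.system_prompt = f"{base_prompt}\n\n{digest}"
```

A session created before the cron rewrites the file keeps the old digest for its whole lifetime, and only the next session picks up the new one.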

Level 2 (injection): The pre_llm_call hook only acts when is_first_turn=True. A first turn is by definition a new session with a new user message, so there is no cached conversation prefix to preserve. On every subsequent turn, the hook returns None, producing a byte-identical user message to what would have been sent without the plugin. Zero cache interference after turn 1.

The only theoretical cache miss: starting a new session within minutes of a digest cron firing, where the system prompt differs from the previous session's by the roughly 113-character digest. Provider TTLs of 5-10 minutes make this statistically irrelevant.

The approach is cache-safe by design. Neither level introduces per-turn variability into the system prompt or user message after the first turn.

Code

The full implementation is at github.com/underdown/gbrosyne (MIT).

Level 1: scripts/gbrain-digest.py -- run via Hermes cron, writes digest to Mnemosyne memory
Level 2: plugin/ -- Hermes plugin (pre_llm_call hook, first turn only), copy to ~/.hermes/plugins/gbrain-inject/ and enable

Install instructions in the repo README.

[^1]: Garry Tan. "GBrain: Garry's Opinionated OpenClaw/Hermes Agent Brain." GitHub. 2026.
[^2]: AxDSan. "Mnemosyne: The Zero-Dependency, Sub-Millisecond AI Memory System for Hermes Agents." GitHub. 2026.
[^3]: Nous Research. "Hermes Agent Self-Evolution." GitHub. 2026.