Cloning Hermes Agent: Every Component, Config, and Cron Job for a Production AI Assistant
Running an AI agent in production means more than a model endpoint and a chat interface. Over months of iteration, our Hermes setup accumulated caching layers, skill preloading, multi-model routing, automated maintenance, and a dozen platform integrations -- most of which aren't documented in the repo.
This guide maps every component so you can clone the setup onto a fresh VPS. Every config is real, every cron job is listed, and every design decision has a "why."
Architecture
The stack spans three machines:
               User (Discord/CLI)
                        │
                        ▼
┌────────────────────────────────────────────────┐
│ Hermes Agent (v0.14.0)                         │
│ /usr/local/lib/hermes-agent/                   │
│ Config: ~/.hermes/config.yaml                  │
│ Env: ~/.hermes/.env                            │
│                                                │
│ ┌───────────────┐  ┌─────────────────────────┐ │
│ │ SOUL.md       │  │ semantic-skills plugin  │ │
│ │ (persona)     │  │ (TF-IDF skill preload)  │ │
│ └───────────────┘  └─────────────────────────┘ │
│ ┌───────────────┐  ┌─────────────────────────┐ │
│ │ gbrosyne      │  │ token-logger plugin     │ │
│ │ (GBrain inj.) │  │ (CSV+SQLite logging)    │ │
│ └───────────────┘  └─────────────────────────┘ │
│ ┌───────────────┐  ┌─────────────────────────┐ │
│ │ pricing-tools │  │ xdk-twitter plugin      │ │
│ │ (model costs) │  │ (multi-account X)       │ │
│ └───────────────┘  └─────────────────────────┘ │
│ ┌────────────────────────────────────────────┐ │
│ │ ~/.hermes/skills/ (50+ skills, 21MB)       │ │
│ └────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────┐ │
│ │ Memory: SQLite, 2,200 char limit           │ │
│ │ Sessions: FTS5 search, 90-day retention    │ │
│ └────────────────────────────────────────────┘ │
└───────────────────────┬────────────────────────┘
                        │
          ┌─────────────┴─────────────┐
          ▼                           ▼
┌───────────────────┐      ┌─────────────────────┐
│ LiteLLM           │      │ LM Studio (Windows) │
│ localhost:4000    │      │ Tailscale IP:11435  │
│                   │      │                     │
│ deepseek-v4-pro   │      │ qwen-3.5            │
│ deepseek-v4-flash │      │ nomic-embed         │
└─────────┬─────────┘      └─────────────────────┘
          │
          ▼
┌───────────────────┐
│ DeepSeek API      │
│ api.deepseek.com  │
└───────────────────┘
Three design choices visible in this diagram:
- LiteLLM proxy sits between Hermes and model providers. It normalizes every backend into an OpenAI-compatible endpoint. Without it, Hermes would need per-provider code paths.
- LM Studio runs on a separate Windows machine connected via Tailscale. This gives local inference (qwen-3.5, nomic-embed) without GPU costs on the VPS.
- Plugins are not accessories -- they're how Hermes loads skills, logs tokens, manages memory, and connects to platforms. Six plugins run in every session.
1. VPS Setup
Current machine: AlmaLinux 9, Python 3.11, root user. The OS choice is constrained -- DeepSeek's API runs fastest from North American Linux nodes, and AlmaLinux is RHEL-rebuild stable.
dnf update -y
dnf install -y git curl wget python3.11 python3.11-devel python3.11-pip gcc
dnf groupinstall -y "Development Tools"
# Install uv (faster pip)
curl -LsSf https://astral.sh/uv/install.sh | sh
2. Install Hermes Agent
git clone https://github.com/NousResearch/hermes-agent.git /usr/local/lib/hermes-agent
cd /usr/local/lib/hermes-agent
python3.11 -m venv venv
source venv/bin/activate
pip install -e .
# Run setup wizard (guided config entry)
hermes setup
# Link to PATH
ln -sf /usr/local/lib/hermes-agent/venv/bin/hermes /usr/local/bin/hermes
A daily cron job at 9:00 UTC runs hermes update to keep the checkout current. You can replicate this or skip it -- the setup wizard prompts for it.
3. API Keys
All keys live in ~/.hermes/.env. hermes setup guides you through entry, but here's the full list:
| Service | Signup URL | Used For | Required |
|---|---|---|---|
| DeepSeek | platform.deepseek.com | Primary model (deepseek-v4-pro) | Yes |
| Discord | discord.com/developers | Gateway (bot token) | Yes |
| Exa | exa.ai | Web search (MCP) | Yes |
| X/Twitter | developer.x.com | Posting, searching | Yes |
| Firecrawl | firecrawl.dev | Web scraping | Optional |
| OpenAI | platform.openai.com | Auxiliary tasks | Optional |
| Gemini | aistudio.google.com | Fallback models | Optional |
| NVIDIA | build.nvidia.com | Fallback models | Optional |
| xAI/Grok | x.ai | Fallback models | Optional |
| Langfuse | cloud.langfuse.com | Observability (traces) | Optional |
The Required column reflects this full setup. The minimum for basic operation is DeepSeek + Discord: search (Exa) and social (X) unlock full utility but aren't needed for a chat-only assistant.
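One way to confirm that ~/.hermes/.env covers the minimum tier is to parse it and report what is missing. The sketch below is illustrative only: the variable names DEEPSEEK_API_KEY and DISCORD_BOT_TOKEN are assumptions, since the names hermes setup actually writes are not listed in this guide.

```python
from pathlib import Path

# Hypothetical variable names; check your own .env for what hermes setup wrote.
MINIMUM_KEYS = ("DEEPSEEK_API_KEY", "DISCORD_BOT_TOKEN")


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str, required=MINIMUM_KEYS) -> list:
    """Return required names that are absent or set to an empty value."""
    env = parse_env(text)
    return [name for name in required if not env.get(name)]


def check_env_file(path: Path = Path.home() / ".hermes" / ".env") -> list:
    """Read the .env file from disk and list the missing minimum keys."""
    return missing_keys(path.read_text(encoding="utf-8"))
```

An empty list from check_env_file() means the minimum tier is covered. Extending MINIMUM_KEYS is enough to check the optional services too.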
4. LiteLLM Proxy
LiteLLM is the routing layer. Hermes speaks the OpenAI-compatible API to http://127.0.0.1:4000/v1, and LiteLLM translates each request into DeepSeek's native format. Switching models or adding providers requires zero Hermes config changes -- just update the LiteLLM config.
Config
Create /root/.litellm/proxy_config.yaml:
model_list:
  - model_name: deepseek-v4-pro
    litellm_params:
      model: deepseek/deepseek-v4-pro
      api_base: https://api.deepseek.com
      api_key: sk-YOUR_DEEPSEEK_KEY
  - model_name: deepseek-v4-flash
    litellm_params:
      model: deepseek/deepseek-v4-flash
      api_base: https://api.deepseek.com
      api_key: sk-YOUR_DEEPSEEK_KEY
  # Local LM Studio models (Windows box via Tailscale)
  - model_name: qwen-3.5
    litellm_params:
      model: openai/qwen3.5
      api_base: http://YOUR_LMSTUDIO_IP:11435/v1
      api_key: none
  - model_name: text-embedding-nomic-embed-text-v2-moe
    litellm_params:
      model: openai/text-embedding-nomic-embed-text-v2-moe
      api_base: http://YOUR_LMSTUDIO_IP:11435/v1
      api_key: none
general_settings:
  master_key: sk-YOUR_DEEPSEEK_KEY
Pitfall: The master_key is used for Hermes-to-LiteLLM auth. If the key in the config doesn't match what Hermes sends, LiteLLM returns a misleading "No connected db." error -- it's the no-DB fallback code path for failed authentication, not an actual database issue. This wasted an hour of debugging the first time it happened. Also, never copy a masked display value (like sk-xxx...xxx) into the config -- Hermes's output masking can truncate the real key during copy-paste. Always read the raw file bytes to verify.
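The "read the raw file" advice can be partly automated. The sketch below pulls master_key out of the raw config text and flags values that look display-masked. The patterns are illustrative heuristics chosen for this example, not rules taken from LiteLLM or Hermes, and a regex stands in for a real YAML parser to keep the example dependency-free.

```python
import re
from typing import Optional

# Heuristic patterns that suggest a display-masked secret (assumed, not official).
_MASK_PATTERNS = (
    re.compile(r"\.{3}"),        # sk-abc...xyz
    re.compile(r"\u2026"),       # a single ellipsis character
    re.compile(r"\*{3,}"),       # sk-****
    re.compile(r"x{4,}", re.I),  # sk-xxxx
)


def looks_masked(key: str) -> bool:
    """Return True if the key contains a typical masking pattern."""
    return any(p.search(key) for p in _MASK_PATTERNS)


def extract_master_key(config_text: str) -> Optional[str]:
    """Pull the master_key value out of raw proxy_config.yaml text."""
    match = re.search(r"^\s*master_key:\s*['\"]?([^'\"\s#]+)", config_text, re.M)
    return match.group(1) if match else None
```

Running extract_master_key on the file contents and passing the result to looks_masked before restarting the proxy would likely have caught the copy-paste mistake described above.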
Run as a systemd service
cat > /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/lib/hermes-agent/venv/bin/litellm --config /root/.litellm/proxy_config.yaml --port 4000
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now litellm
Orphan-process trap: systemctl restart litellm does not kill litellm processes started outside this unit (by systemd --user or another supervisor). If one of those still holds port 4000, every systemctl restart silently falls back to a random port. After each restart, verify with ss -tlnp | grep 4000. If you see multiple litellm PIDs, kill them all with pkill -9 -f "litellm.*proxy_config" and restart the service.
Verify
curl -H "Authorization: Bearer sk-YOUR_DEEPSEEK_KEY" http://127.0.0.1:4000/v1/models
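The same check can be scripted for a health probe. This is a minimal sketch using only the standard library. It assumes the proxy answers GET /v1/models with an OpenAI-style list body (a data array of objects with an id field). The function names are made up for this example.

```python
import json
import urllib.request


def models_request(master_key: str,
                   base_url: str = "http://127.0.0.1:4000/v1") -> urllib.request.Request:
    """Build the same GET /v1/models request that the curl command sends."""
    return urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {master_key}"},
    )


def model_ids(payload: dict) -> list:
    """Extract model names from an OpenAI-style list response."""
    return [item["id"] for item in payload.get("data", [])]


def list_models(master_key: str) -> list:
    """Call the proxy and return the model names it exposes."""
    with urllib.request.urlopen(models_request(master_key), timeout=10) as resp:
        return model_ids(json.load(resp))
```

If list_models returns the four model_name values from proxy_config.yaml, both routing and authentication are working.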
5. Hermes Config
Key sections from ~/.hermes/config.yaml:
model:
  default: deepseek-v4-pro
  provider: litellm
  base_url: http://127.0.0.1:4000/v1
  api_key: ''
agent:
  max_turns: 90
  gateway_timeout: 1800
compression:
  enabled: true
  threshold: 0.9 # Compress at 90% context window
  target_ratio: 0.7 # Compress to 70%
memory:
  memory_enabled: true
  memory_char_limit: 2200
skills:
  mode: semantic # Uses semantic-skills plugin
plugins:
  enabled:
    - gbrosyne
    - model-providers/deepseek
    - pricing-tools
    - semantic-skills
    - token-logger
    - xdk-twitter
  disabled:
    - lossless-hermes
model.provider: litellm routes all calls through the proxy. The api_key is empty because LiteLLM handles auth with its master_key.
compression.threshold: 0.9 triggers conversation summarization at 90% of the context window. An auxiliary LLM condenses older messages to 70% density. The first 3 and last 20 messages are always preserved verbatim.
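The trigger and the verbatim-preservation rule can be sketched as two small functions. This illustrates the behavior described above and is not Hermes's implementation. The function names are hypothetical, and the summarization call itself is left out.

```python
def should_compress(used_tokens: int, context_window: int, threshold: float = 0.9) -> bool:
    """True once the conversation fills the given fraction of the context window."""
    return used_tokens >= threshold * context_window


def split_for_compression(messages: list, keep_first: int = 3, keep_last: int = 20):
    """Return (head, middle, tail); only middle is a candidate for summarization."""
    if len(messages) <= keep_first + keep_last:
        # Too short: everything is protected, nothing to summarize.
        return list(messages), [], []
    cut = len(messages) - keep_last
    return messages[:keep_first], messages[keep_first:cut], messages[cut:]
```

With 30 messages, the first 3 and last 20 stay verbatim and the 7 in between go to the auxiliary model.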
6. Prompt Assembly: Three Tiers
All context injected into the agent follows a hierarchy designed to keep system prompt text cacheable:
- Stable tier -- SOUL.md, tool guidance, skills prompt template, platform hints. Built once per configuration.
- Context tier -- AGENTS.md from working directory, caller-supplied system messages. Rebuilt on directory or caller changes.
- Volatile tier -- memory snapshot, user profile, timestamp. Changes every turn but appended after the stable prefix.
The stable tier is what makes DeepSeek's prefix caching effective. Because the system prompt text is identical across turns, the provider reuses cached key-value pairs for the first several thousand tokens of every request -- saving latency and cost.
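~/.hermes/SOUL.md drives the agent's identity. It is re-read on every message, so edits take effect immediately with no restart. Because SOUL.md belongs to the stable tier, any edit changes the cached prefix, and the next request rebuilds the provider cache from scratch. When the file is empty, Hermes falls back to a hardcoded default identity.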
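The ordering rule behind the three tiers is that anything that changes must come after everything that does not. A minimal sketch of that idea follows. The PromptTiers class and its fingerprint helper are hypothetical names for illustration, not part of Hermes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTiers:
    stable: str    # SOUL.md, tool guidance, skills template, platform hints
    context: str   # AGENTS.md, caller-supplied system messages
    volatile: str  # memory snapshot, user profile, timestamp

    def assemble(self) -> str:
        # Least-changing text first, so the provider can reuse the longest prefix.
        return "\n\n".join((self.stable, self.context, self.volatile))

    def stable_fingerprint(self) -> str:
        """Short hash of the stable tier; it should not change between turns."""
        return hashlib.sha256(self.stable.encode("utf-8")).hexdigest()[:12]
```

Logging stable_fingerprint() per request is a cheap way to notice when something has started leaking per-turn data into the stable tier: the hash changes, and cache hits drop with it.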
7. Skills System
Skills are markdown files with YAML frontmatter in ~/.hermes/skills/. Each contains step-by-step instructions for a specific task type. There are currently 50+ skills totaling 21MB.
How skills load
The semantic-skills plugin hooks pre_gateway_dispatch and scores each incoming user message against a pre-built TF-IDF skill index. When a skill matches with a score >= 0.65, it is injected as user message text -- not as a system prompt modification.
This is the critical design decision: skills arrive as part of the message the agent reads, not as changes to the system prompt. The system prompt stays identical -- and therefore cacheable -- regardless of which skills are loaded. Without this, every skill injection would break the prefix cache.
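To make the mechanism concrete, here is a toy TF-IDF matcher with cosine scoring and message-side injection, written with the standard library only. It sketches the technique and is not the plugin's code. The SkillIndex class, the smoothed IDF formula, and the [skill: ...] wrapper format are all assumptions made for this example.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.65  # score cut-off quoted above


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


class SkillIndex:
    """Toy TF-IDF index over skill texts (illustrative only)."""

    def __init__(self, skills: dict):
        docs = {name: Counter(tokenize(body)) for name, body in skills.items()}
        n = len(docs)
        df = Counter(term for counts in docs.values() for term in counts)
        # Smoothed IDF, so terms present in every skill still carry some weight.
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
        self.vectors = {name: self._weigh(c) for name, c in docs.items()}

    def _weigh(self, counts: Counter) -> dict:
        # Terms unknown to the index are dropped, then the vector is L2-normalized.
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def best_match(self, message: str):
        """Return (skill name, cosine score) for the closest skill."""
        query = self._weigh(Counter(tokenize(message)))
        scores = {
            name: sum(w * vec.get(t, 0.0) for t, w in query.items())
            for name, vec in self.vectors.items()
        }
        if not scores:
            return None, 0.0
        name = max(scores, key=scores.get)
        return name, scores[name]


def inject_skill(message: str, index: SkillIndex, skills: dict) -> str:
    """Prepend the matched skill to the user message; the system prompt is untouched."""
    name, score = index.best_match(message)
    if name is None or score < THRESHOLD:
        return message
    return f"[skill: {name}]\n{skills[name]}\n\n{message}"
```

The point of inject_skill is what it does not do: it never touches the system prompt, so the cacheable prefix is identical whether zero or five skills match.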
Adding skills
# Create a new skill
hermes skills create my-skill
# Or manually
mkdir -p ~/.hermes/skills/my-category/my-skill/
# ... write SKILL.md with YAML frontmatter ...
# Rebuild index (required -- old index won't find new skills)
cd ~/.hermes/plugins/semantic-skills
python build_embeddings.py
Forgetting to rebuild the index is the #1 cause of "my new skill doesn't work." The file exists but the TF-IDF index doesn't know about it.
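A small guard can catch the stale-index case before it costs debugging time. The sketch below compares modification times of SKILL.md files against the index file. The index location is passed in as a parameter because this guide does not name the file that build_embeddings.py writes.

```python
from pathlib import Path


def stale_skills(skills_dir: Path, index_path: Path) -> list:
    """Return SKILL.md files modified after the index was last built."""
    skill_files = sorted(skills_dir.rglob("SKILL.md"))
    if not index_path.exists():
        # No index at all: every skill is unindexed.
        return skill_files
    built = index_path.stat().st_mtime
    return [p for p in skill_files if p.stat().st_mtime > built]
```

A non-empty result means build_embeddings.py needs to run again.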
8. Plugins
Six plugins run in every session. Each hooks into a specific lifecycle point:
semantic-skills (v2.0.0)
- Hook: pre_gateway_dispatch
- Tool: search_skills
- TF-IDF skill matching. System prompt stays ~226 tokens regardless of skill count.
token-logger (v2.0.0)
- Hook: post_api_request
- Tools: token_summary, enrich_logs
- Dual-write to plain-text CSV (crash-safe append) and SQLite (queryable). Logs DeepSeek cache hit/miss, tokens, costs, and latency per API call. Nightly gzip archival via no-agent cron.
pricing-tools (v1.1.0)
- Tools: fetch_pricing, compare_models, list_models
- Fetches live pricing from providers. Enriches token logs with real costs using Decimal for money math.
gbrosyne (v1.0.0)
- Hook: pre_llm_call
- Searches the GBrain knowledge base on session start. Injects relevant pages into the first user message.
xdk-twitter (v2.0.0)
- Tools: post_tweet, search_tweets, get_timeline, get_user, reply_to_tweet, like_tweet, repost_tweet, delete_tweet, get_tweet, get_me, get_mentions, follow_user
- Multi-account support via ~/.hermes/twitter_accounts.yaml
model-providers/deepseek (bundled)
- Adds DeepSeek as a provider option. Required for the LiteLLM-to-DeepSeek chain.
One disabled plugin: lossless-hermes, a DAG-based context engine for lossless compression. Its interface diverged from newer Hermes base class methods -- it compiles but no longer integrates correctly.
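The dual-write idea behind token-logger can be sketched in a few lines: append to the CSV first, because an append is the cheapest write to survive a crash, then insert the same row into SQLite for querying. The column names and table layout below are assumptions for illustration, not the plugin's actual schema.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical columns; the real plugin's schema is not shown in this guide.
FIELDS = ("ts", "model", "prompt_tokens", "completion_tokens",
          "cache_hit_tokens", "latency_ms")


def log_call(row: dict, csv_path: Path, db: sqlite3.Connection) -> None:
    """Append one API-call record to the CSV, then mirror it into SQLite."""
    new_file = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
    db.execute(
        "CREATE TABLE IF NOT EXISTS calls "
        "(ts TEXT, model TEXT, prompt_tokens INTEGER, completion_tokens INTEGER, "
        "cache_hit_tokens INTEGER, latency_ms INTEGER)"
    )
    db.execute(
        "INSERT INTO calls VALUES (:ts, :model, :prompt_tokens, "
        ":completion_tokens, :cache_hit_tokens, :latency_ms)",
        row,
    )
    db.commit()
```

If the SQLite write ever fails, the CSV still holds the record, so the database can be rebuilt from it.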
9. Cron Jobs
| Job | Schedule | Purpose | Type |
|---|---|---|---|
| Update Hermes | Daily 9:00 UTC | hermes update | Agent |
| Cat Labs Blog | Daily 10:00 UTC | Blog post generation | Agent |
| Hermes X Roundup | Daily 16:00 UTC | Popular posts roundup | Agent |
| GBrain Digest | Every 6 hours | Sync knowledge base | Agent |
| CodeGraph Reindex | Weekly Mon 4:00 UTC | Reindex all codebases | Agent |
| Token Logger Archive | Daily 1:00 UTC | Gzip yesterday's CSV | No-agent |
Two patterns:
Agent jobs run an LLM with a prompt -- they reason before acting. Used for tasks needing judgment: generating blog posts, curating social roundups, deciding what to sync from a knowledge base.
No-agent jobs run a script directly with no LLM involved. The script's stdout is delivered verbatim. Empty stdout means silent -- no delivery to any channel. Cheaper and faster for mechanical tasks like log archival.
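A no-agent script for the archive job might look like the sketch below. It follows the contract described above: return text only when something needs attention, and stay silent otherwise. The tokens-YYYY-MM-DD.csv naming scheme and the log directory are assumptions, not the plugin's actual layout.

```python
import gzip
import shutil
from datetime import date, timedelta
from pathlib import Path
from typing import Optional


def archive_yesterday(log_dir: Path, today: Optional[date] = None) -> str:
    """Gzip yesterday's CSV. Returns the text to print: empty means stay silent."""
    day = (today or date.today()) - timedelta(days=1)
    src = log_dir / f"tokens-{day.isoformat()}.csv"  # hypothetical naming scheme
    if not src.exists():
        return ""  # nothing to archive
    dst = src.with_name(src.name + ".gz")
    try:
        with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        src.unlink()  # remove the original only after the archive is written
    except OSError as exc:
        return f"token log archive failed for {src.name}: {exc}"
    return ""


def main() -> None:
    message = archive_yesterday(Path.home() / ".hermes" / "logs")  # hypothetical directory
    if message:
        print(message)  # only failures reach stdout, so only failures get delivered
```

Because success produces no output, the job only shows up in a channel when the archive fails.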
Creating cron jobs
# Agent job
hermes cron create \
--schedule "0 9 * * *" \
--prompt "Your self-contained prompt here" \
--name "my-job" \
--deliver origin
# No-agent job (script-only, zero tokens)
hermes cron create \
--schedule "0 * * * *" \
--script "my-script.py" \
--no-agent \
--name "hourly-check"
10. Cache Strategy
The caching approach is the result of measuring what works with DeepSeek's prefix cache:
| Layer | Mechanism | Scope |
|---|---|---|
| 1. Provider cache | DeepSeek prefix caching (30-min TTL) | Stable tier of system prompt |
| 2. Skills-as-message | Skills injected as user text, not system edits | Preserves cacheable prefix |
| 3. Context compression | Auxiliary LLM summarization at 90% threshold | Mid-conversation messages |
| 4. Memory | SQLite durable facts, 2,200-char limit | Cross-session persistence |
| 5. GBrain | PGLite knowledge base, gbrosyne integration | Long-term external knowledge |
The central insight: the system prompt never changes mid-session. Skills, memory, and GBrain context are all injected as user messages or volatile-tier content -- not as modifications to the stable prompt prefix. DeepSeek's prefix cache therefore hits on that stable prefix every turn, regardless of which skills load or what memories surface.
11. Connected Platforms
| Platform | Status | Notes |
|---|---|---|
| Discord | Connected | Primary interface. Auto-thread, reactions, mention-gated. |
| Webhook | Connected | Port 8644, secret-authenticated for external triggers. |
| API Server | Connected | Health checks, REST access, monitoring. |
| CLI | Local | Direct terminal access for debugging and setup. |
The webhook endpoint bridges external services (Cloudflare Workers for OpenRouter, cron triggers) to Hermes. The API server exposes health checks for monitoring and uptime dashboards.
12. MCP Servers
Three MCP servers run alongside Hermes:
mcp_servers:
  exa:
    url: https://mcp.exa.ai/mcp
    timeout: 120
    connect_timeout: 30
  codegraph:
    command: codegraph
    args: [serve, --mcp]
    timeout: 120
    connect_timeout: 60
  gbrain:
    command: gbrain
    args: [serve]
    timeout: 120
Exa provides AI-native web search. CodeGraph indexes codebases and answers structural questions. GBrain is the persistent knowledge base that gbrosyne queries.
13. Quick Start Checklist
- [ ] AlmaLinux 9 VPS, Python 3.11, root
- [ ] Clone hermes-agent to /usr/local/lib/hermes-agent/
- [ ] pip install -e . in venv
- [ ] Run hermes setup for guided config
- [ ] Install and configure LiteLLM proxy (section 4)
- [ ] Sign up for DeepSeek API, put key in LiteLLM config
- [ ] Create Discord bot, put token in .env
- [ ] Configure ~/.hermes/config.yaml (section 5)
- [ ] Enable plugins: hermes plugins enable semantic-skills token-logger pricing-tools
- [ ] Copy skills directory structure
- [ ] Build skill index: cd ~/.hermes/plugins/semantic-skills && python build_embeddings.py
- [ ] Create ~/.hermes/SOUL.md with your persona
- [ ] Start LiteLLM: systemctl enable --now litellm
- [ ] Start gateway: hermes gateway start
- [ ] Verify: send a message on Discord
14. Maintenance
# Update Hermes (also runs daily via cron)
hermes update
# Rebuild skill index after adding or editing skills
cd ~/.hermes/plugins/semantic-skills && python build_embeddings.py
# Check token usage and costs
hermes token-summary
hermes enrich-logs
# Compact sessions database
hermes sessions prune
# Follow live logs
hermes logs --follow
The two most important maintenance tasks: rebuilding the skill index after any skill edit, and enriching token logs with live pricing data so cost tracking stays accurate. Pricing pages change, and providers don't notify you when a per-token rate moves.
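Cost enrichment is where Decimal matters: per-token prices are tiny fractions, and binary floats accumulate rounding error across thousands of rows. The sketch below computes one call's cost from cache-hit, cache-miss, and output token counts. The function name and rounding precision are choices made for this example, and real prices should come from the provider's pricing page.

```python
from decimal import Decimal, ROUND_HALF_UP

MILLION = Decimal(1_000_000)


def call_cost(cache_hit: int, cache_miss: int, output: int,
              hit_price: Decimal, miss_price: Decimal, out_price: Decimal) -> Decimal:
    """Cost of one API call in dollars; all prices are per million tokens."""
    total = (cache_hit * hit_price + cache_miss * miss_price + output * out_price) / MILLION
    # Round to a millionth of a dollar so stored values have a fixed scale.
    return total.quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)
```

Pass prices as strings, for example Decimal("0.10"), never as floats: Decimal(0.1) carries the float's binary rounding error into the result.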
This setup has been running since April 2026 with one outage -- a LiteLLM master key overwritten by a masked UI value, producing the misleading "No connected db" error described in section 4. Everything else has been stable. The cache layering (stable system prompt + skills-as-message + prefix-aware provider) earns consistent cache hits across turns and keeps per-conversation token costs predictable.
If you're standing up a fresh instance, work through the first eight checklist items in section 13 (through config.yaml), start LiteLLM and the gateway, and verify the agent responds to a Discord message. Then add plugins and skills incrementally. Debugging a full stack from scratch is harder than growing it one layer at a time.