2026-05-27

Cloning Hermes Agent: Every Component, Config, and Cron Job for a Production AI Assistant

Tags: hermes, litellm, deepseek, skills, plugins, vps

Running an AI agent in production means more than a model endpoint and a chat interface. Over months of iteration, our Hermes setup accumulated caching layers, skill preloading, multi-model routing, automated maintenance, and a dozen platform integrations -- most of which aren't documented in the repo.

This guide maps every component so you can clone the setup onto a fresh VPS. Every config is real, every cron job is listed, and every design decision has a "why."

Architecture

The stack spans three machines:

User (Discord/CLI)
       │
       ▼
┌──────────────────────────────────────────────────┐
│  Hermes Agent (v0.14.0)                          │
│  /usr/local/lib/hermes-agent/                    │
│  Config: ~/.hermes/config.yaml                   │
│  Env:    ~/.hermes/.env                          │
│                                                  │
│  ┌───────────────┐  ┌──────────────────────────┐ │
│  │ SOUL.md       │  │ semantic-skills plugin   │ │
│  │ (persona)     │  │ (TF-IDF skill preload)   │ │
│  └───────────────┘  └──────────────────────────┘ │
│  ┌───────────────┐  ┌──────────────────────────┐ │
│  │ gbrosyne      │  │ token-logger plugin      │ │
│  │ (GBrain inj.) │  │ (CSV+SQLite logging)     │ │
│  └───────────────┘  └──────────────────────────┘ │
│  ┌───────────────┐  ┌──────────────────────────┐ │
│  │ pricing-tools │  │ xdk-twitter plugin       │ │
│  │ (model costs) │  │ (multi-account X)        │ │
│  └───────────────┘  └──────────────────────────┘ │
│  ┌────────────────────────────────────────────┐  │
│  │ ~/.hermes/skills/  (50+ skills, 21MB)      │  │
│  └────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────┐  │
│  │ Memory: SQLite, 2,200 char limit           │  │
│  │ Sessions: FTS5 search, 90-day retention    │  │
│  └────────────────────────────────────────────┘  │
└───────────────────┬──────────────────────────────┘
                    │
        ┌───────────┴────────────┐
        ▼                        ▼
┌───────────────────┐  ┌─────────────────────┐
│ LiteLLM           │  │ LM Studio (Windows) │
│ localhost:4000    │  │ Tailscale IP:11435  │
│                   │  │                     │
│ deepseek-v4-pro   │  │ qwen-3.5            │
│ deepseek-v4-flash │  │ nomic-embed         │
└───────┬───────────┘  └─────────────────────┘
        │
        ▼
┌──────────────────┐
│ DeepSeek API     │
│ api.deepseek.com │
└──────────────────┘

Three design choices visible in this diagram:

LiteLLM proxy sits between Hermes and model providers. It normalizes every backend into an OpenAI-compatible endpoint. Without it, Hermes would need per-provider code paths.
LM Studio runs on a separate Windows machine connected via Tailscale. This gives local inference (qwen-3.5, nomic-embed) without GPU costs on the VPS.
Plugins are not accessories -- they're how Hermes loads skills, logs tokens, manages memory, and connects to platforms. Six plugins run in every session.

1. VPS Setup

Current machine: AlmaLinux 9, Python 3.11, root user. Two constraints shaped this: the VPS sits in North America because DeepSeek's API runs fastest from North American Linux nodes, and AlmaLinux was picked for the stability of a RHEL rebuild.

dnf update -y
dnf install -y git curl wget python3.11 python3.11-devel python3.11-pip gcc
dnf groupinstall -y "Development Tools"

# Install uv (faster pip)
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Install Hermes Agent

git clone https://github.com/NousResearch/hermes-agent.git /usr/local/lib/hermes-agent
cd /usr/local/lib/hermes-agent

python3.11 -m venv venv
source venv/bin/activate
pip install -e .

# Run setup wizard (guided config entry)
hermes setup

# Link to PATH
ln -sf /usr/local/lib/hermes-agent/venv/bin/hermes /usr/local/bin/hermes

The repo auto-updates via a daily cron at 9:00 UTC: hermes update. You can replicate this or skip it -- the setup wizard prompts for it.

3. API Keys

All keys live in ~/.hermes/.env. hermes setup guides you through entry, but here's the full list:

Service	Signup URL	Used For	Required
DeepSeek	platform.deepseek.com	Primary model (deepseek-v4-pro)	Yes
Discord	discord.com/developers	Gateway (bot token)	Yes
Exa	exa.ai	Web search (MCP)	Yes
X/Twitter	developer.x.com	Posting, searching	Yes
Firecrawl	firecrawl.dev	Web scraping	Optional
OpenAI	platform.openai.com	Auxiliary tasks	Optional
Gemini	aistudio.google.com	Fallback models	Optional
NVIDIA	build.nvidia.com	Fallback models	Optional
xAI/Grok	x.ai	Fallback models	Optional
Langfuse	cloud.langfuse.com	Observability (traces)	Optional

Minimum for basic operation: DeepSeek + Discord. The Required column reflects the full setup in this guide; search (Exa) and social (X) unlock full utility but aren't needed for a plain chat assistant.

4. LiteLLM Proxy

LiteLLM is the routing layer. Hermes speaks the OpenAI-compatible API to http://127.0.0.1:4000/v1, and LiteLLM forwards each request to whichever backend is mapped to that model name. Adding a provider or repointing an existing model name at a different backend requires no Hermes config changes; only the LiteLLM config changes.

Config

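Install the proxy into the Hermes venv, which is where the systemd unit below expects the litellm binary: /usr/local/lib/hermes-agent/venv/bin/pip install 'litellm[proxy]'. Then create /root/.litellm/proxy_config.yaml: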

model_list:
  - model_name: deepseek-v4-pro
    litellm_params:
      model: deepseek/deepseek-v4-pro
      api_base: https://api.deepseek.com
      api_key: sk-YOUR_DEEPSEEK_KEY

  - model_name: deepseek-v4-flash
    litellm_params:
      model: deepseek/deepseek-v4-flash
      api_base: https://api.deepseek.com
      api_key: sk-YOUR_DEEPSEEK_KEY

  # Local LM Studio models (Windows box via Tailscale)
  - model_name: qwen-3.5
    litellm_params:
      model: openai/qwen3.5
      api_base: http://YOUR_LMSTUDIO_IP:11435/v1
      api_key: none

  - model_name: text-embedding-nomic-embed-text-v2-moe
    litellm_params:
      model: openai/text-embedding-nomic-embed-text-v2-moe
      api_base: http://YOUR_LMSTUDIO_IP:11435/v1
      api_key: none

general_settings:
  # Clients (Hermes, the curl check below) must send this exact value as their Bearer token
  master_key: sk-YOUR_DEEPSEEK_KEY

Pitfall: The master_key authenticates Hermes to LiteLLM. If it doesn't match the key Hermes sends, LiteLLM returns a misleading "No connected db." error. That message comes from the no-database fallback path for failed authentication, not from an actual database problem; it cost an hour of debugging the first time it happened. Also, never copy a masked display value (like sk-xxx...xxx) into the config: Hermes masks secrets in its output, so copying a key from that output pastes the masked string instead of the real one. Always read the raw file bytes to verify.

Run as a systemd service

cat > /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/lib/hermes-agent/venv/bin/litellm --config /root/.litellm/proxy_config.yaml --port 4000
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl enable --now litellm

Orphan-process trap: systemctl restart litellm does not kill litellm processes started by systemd --user or another supervisor. If a stale process still holds port 4000, each restart silently comes up on a random port instead. After every restart, check with ss -tlnp | grep 4000. If you see more than one litellm PID, kill them all with pkill -9 -f "litellm.*proxy_config", then start the service again with systemctl start litellm.
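That check can be scripted so it compares who owns port 4000 with the PID systemd thinks it is running. This Python sketch parses `ss -tlnp` output (assuming its usual column layout, with the local address in the fourth column) and asks systemd for the unit's MainPID; `pids_on_port` and `main_pid` are illustrative names, not part of LiteLLM or Hermes:

```python
import re
import subprocess

def pids_on_port(ss_output: str, port: int = 4000) -> set[int]:
    """PIDs that `ss -tlnp` reports as listening on the given port."""
    pids: set[int] = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed layout: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port Process
        if len(fields) >= 4 and fields[3].endswith(f":{port}"):
            pids.update(int(pid) for pid in re.findall(r"pid=(\d+)", line))
    return pids

def main_pid(unit: str = "litellm") -> int:
    """The PID systemd is tracking for the unit (0 if it isn't running)."""
    out = subprocess.run(
        ["systemctl", "show", "-p", "MainPID", "--value", unit],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip() or 0)

if __name__ == "__main__":
    try:
        ss = subprocess.run(["ss", "-tlnp"], capture_output=True, text=True, check=True).stdout
        owners, expected = pids_on_port(ss), main_pid()
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not inspect port 4000: {exc}")
    else:
        if owners != {expected}:
            print(f"port 4000 held by {sorted(owners)}; litellm.service MainPID is {expected}")
```

If the proxy forks workers, several PIDs can legitimately share the socket; in that case compare against the unit's full process list rather than MainPID alone.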

Verify

curl -H "Authorization: Bearer sk-YOUR_DEEPSEEK_KEY" http://127.0.0.1:4000/v1/models
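The same verification from Python, extended to confirm that every model_name from proxy_config.yaml is listed. This sketch assumes the endpoint returns the OpenAI-style `{"data": [{"id": ...}]}` list with the model_name aliases as ids; the helper names are illustrative:

```python
import json
import urllib.request

# Aliases defined in proxy_config.yaml above.
EXPECTED = {"deepseek-v4-pro", "deepseek-v4-flash", "qwen-3.5",
            "text-embedding-nomic-embed-text-v2-moe"}

def fetch_models(base_url: str, key: str) -> dict:
    """GET {base_url}/models with a Bearer token and decode the JSON body."""
    req = urllib.request.Request(f"{base_url}/models",
                                 headers={"Authorization": f"Bearer {key}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def missing_models(payload: dict, expected: set[str] = EXPECTED) -> set[str]:
    """Configured aliases that the proxy does not report."""
    served = {entry.get("id") for entry in payload.get("data", [])}
    return expected - served

if __name__ == "__main__":
    try:
        gaps = missing_models(fetch_models("http://127.0.0.1:4000/v1", "sk-YOUR_DEEPSEEK_KEY"))
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        print(f"proxy not reachable or rejected the key: {exc}")
    else:
        print("all models served" if not gaps else f"missing: {sorted(gaps)}")
```

A model showing up here most likely means LiteLLM loaded it from the config, not that its backend answers: an offline LM Studio box would only surface when a request is actually routed to it.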

5. Hermes Config

Key sections from ~/.hermes/config.yaml:

model:
  default: deepseek-v4-pro
  provider: litellm
  base_url: http://127.0.0.1:4000/v1
  api_key: ''

agent:
  max_turns: 90
  gateway_timeout: 1800

compression:
  enabled: true
  threshold: 0.9          # Compress at 90% context window
  target_ratio: 0.7       # Compress to 70%

memory:
  memory_enabled: true
  memory_char_limit: 2200

skills:
  mode: semantic           # Uses semantic-skills plugin

plugins:
  enabled:
    - gbrosyne
    - model-providers/deepseek
    - pricing-tools
    - semantic-skills
    - token-logger
    - xdk-twitter
  disabled:
    - lossless-hermes

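model.provider: litellm routes all calls through the proxy. An empty api_key here does not mean auth is off: LiteLLM checks every request against its master_key, so the key Hermes actually sends must match it, or requests fail with the misleading error described in section 4.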

compression.threshold: 0.9 triggers conversation summarization once the conversation reaches 90% of the context window. An auxiliary LLM then summarizes older messages to bring the conversation down to the target_ratio of 0.7 (70%). The first 3 and last 20 messages are always preserved verbatim.
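The trigger logic can be sketched in Python as below. The 0.9 threshold, 0.7 ratio, and the 3-head/20-tail rule come from this section; the token counter, the summarizer callback, and reading target_ratio as a budget for the whole conversation are assumptions for illustration, not Hermes internals:

```python
from typing import Callable

Message = dict  # {"role": ..., "content": ...}

KEEP_HEAD = 3   # first messages preserved verbatim (per the text)
KEEP_TAIL = 20  # last messages preserved verbatim (per the text)

def maybe_compress(
    messages: list[Message],
    count_tokens: Callable[[Message], int],
    context_window: int,
    summarize: Callable[[list[Message], int], Message],
    threshold: float = 0.9,
    target_ratio: float = 0.7,
) -> list[Message]:
    total = sum(count_tokens(m) for m in messages)
    # Below 90% of the window, or nothing between head and tail: leave history alone.
    if total < threshold * context_window or len(messages) <= KEEP_HEAD + KEEP_TAIL:
        return messages
    head = messages[:KEEP_HEAD]
    middle = messages[KEEP_HEAD:-KEEP_TAIL]
    tail = messages[-KEEP_TAIL:]
    kept = sum(count_tokens(m) for m in head + tail)
    # Assumption: target_ratio caps the whole conversation, so the summary
    # gets whatever token budget is left after the verbatim head and tail.
    budget = max(int(target_ratio * context_window) - kept, 0)
    return head + [summarize(middle, budget)] + tail
```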

6. Prompt Assembly: Three Tiers

All context injected into the agent follows a hierarchy designed to keep system prompt text cacheable:

Stable tier -- SOUL.md, tool guidance, skills prompt template, platform hints. Built once per configuration.
Context tier -- AGENTS.md from working directory, caller-supplied system messages. Rebuilt on directory or caller changes.
Volatile tier -- memory snapshot, user profile, timestamp. Changes every turn but appended after the stable prefix.

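The stable tier is what makes DeepSeek's prefix caching effective. Because the stable and context tiers come first and stay byte-identical from turn to turn, the provider can reuse cached key-value entries for those first several thousand tokens of every request, saving latency and cost. The volatile tier changes every turn, but since it comes last, it only affects the tokens after the cached prefix.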
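The effect of tier ordering shows up even with plain strings. In this sketch, character prefixes stand in for the token prefixes the provider actually caches (a simplification), and the tier contents are placeholders:

```python
import os

def assemble(stable: str, context: str, volatile: str) -> str:
    """Most-stable text first, so consecutive requests share the longest possible prefix."""
    return "\n\n".join(part for part in (stable, context, volatile) if part)

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading text of two prompts."""
    return len(os.path.commonprefix([a, b]))

# Placeholder tier contents, not real prompt text.
STABLE = "<SOUL.md persona>\n<tool guidance>\n<skills prompt template>"
CONTEXT = "<AGENTS.md from the working directory>"

turn1 = assemble(STABLE, CONTEXT, "Now: 2026-05-27T10:00Z")
turn2 = assemble(STABLE, CONTEXT, "Now: 2026-05-27T10:05Z")

# Same tiers, but with the volatile timestamp placed first.
bad1 = assemble("Now: 2026-05-27T10:00Z", STABLE, CONTEXT)
bad2 = assemble("Now: 2026-05-27T10:05Z", STABLE, CONTEXT)

print(shared_prefix_len(turn1, turn2), shared_prefix_len(bad1, bad2))
```

With the volatile tier last, the two requests share everything up to the timestamp; with it first, they share only the first 20 characters, so almost nothing would be reusable.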

~/.hermes/SOUL.md drives the agent's identity. It loads fresh each message -- edits take effect immediately, no restart needed. When empty, Hermes falls back to a hardcoded default identity.

7. Skills System

Skills are markdown files with YAML frontmatter in ~/.hermes/skills/. Each contains step-by-step instructions for a specific task type. Currently 50+ skills across 21MB.

How skills load

The semantic-skills plugin hooks pre_gateway_dispatch and scores each incoming message against a pre-built TF-IDF skill index. When a message matches a skill with a score of at least 0.65, the skill's text is injected into the user message, not into the system prompt.

This is the critical design decision: skills arrive as part of the message the agent reads, not as changes to the system prompt. The system prompt stays identical -- and therefore cacheable -- regardless of which skills are loaded. Without this, every skill injection would break the prefix cache.
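A self-contained Python sketch of that match-then-inject flow. The 0.65 cutoff is from this section; scoring with cosine similarity over smoothed TF-IDF vectors, the tokenizer, and the `[skill: name]` block format are assumptions, since the plugin's exact weighting isn't documented here:

```python
import math
import re
from collections import Counter

THRESHOLD = 0.65  # match cutoff from the text

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

class SkillIndex:
    """TF-IDF vectors per skill, L2-normalized so a dot product is cosine similarity."""

    def __init__(self, skills: dict[str, str]):
        self.names = list(skills)
        docs = [Counter(tokenize(skills[name])) for name in self.names]
        df = Counter(term for doc in docs for term in doc)
        n = len(docs)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        self.idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
        self.vectors = [self._weigh(doc) for doc in docs]

    def _weigh(self, counts: Counter) -> dict[str, float]:
        # Terms never seen in any skill carry no signal and are dropped.
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def match(self, message: str) -> list[tuple[str, float]]:
        query = self._weigh(Counter(tokenize(message)))
        scored = [(name, sum(w * vec.get(t, 0.0) for t, w in query.items()))
                  for name, vec in zip(self.names, self.vectors)]
        return sorted((s for s in scored if s[1] >= THRESHOLD), key=lambda s: -s[1])

def inject_skills(message: str, bodies: dict[str, str], index: SkillIndex) -> str:
    """Prepend matched skill text to the user's message; the system prompt is never touched."""
    blocks = [f"[skill: {name}]\n{bodies[name]}" for name, _ in index.match(message)]
    return "\n\n".join(blocks + [message])
```

Because `inject_skills` only ever returns a new user-message string, nothing ahead of the conversation changes, which is the property the prefix cache depends on.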

Adding skills

# Create a new skill
hermes skills create my-skill

# Or manually
mkdir -p ~/.hermes/skills/my-category/my-skill/
# ... write SKILL.md with YAML frontmatter ...

# Rebuild index (required -- old index won't find new skills)
cd ~/.hermes/plugins/semantic-skills
python build_embeddings.py

Forgetting to rebuild the index is the #1 cause of "my new skill doesn't work." The file exists but the TF-IDF index doesn't know about it.
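A quick way to catch that mistake is to compare file timestamps. This Python sketch lists SKILL.md files modified after the index file was last written; the index filename is hypothetical, so point it at whatever build_embeddings.py actually produces:

```python
from pathlib import Path

def stale_skills(skills_dir: Path, index_file: Path) -> list[Path]:
    """SKILL.md files edited after the index was last built (all of them if no index exists)."""
    if not skills_dir.is_dir():
        return []
    skills = sorted(skills_dir.rglob("SKILL.md"))
    if not index_file.exists():
        return skills
    built_at = index_file.stat().st_mtime
    return [p for p in skills if p.stat().st_mtime > built_at]

if __name__ == "__main__":
    hermes = Path.home() / ".hermes"
    # Hypothetical index filename: check what build_embeddings.py actually writes.
    index = hermes / "plugins" / "semantic-skills" / "skill_index.json"
    for path in stale_skills(hermes / "skills", index):
        print(f"not in index yet: {path}")
```

It catches new and edited skills but not deleted ones; when in doubt, just rebuild.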

8. Plugins

Six plugins run in every session. Each hooks into a specific lifecycle point:

semantic-skills (v2.0.0)

Hook: pre_gateway_dispatch
Tool: search_skills
TF-IDF skill matching. System prompt stays ~226 tokens regardless of skill count.

token-logger (v2.0.0)

Hook: post_api_request
Tools: token_summary, enrich_logs
Dual-write to plain-text CSV (crash-safe append) and SQLite (queryable). Logs DeepSeek cache hit/miss, tokens, costs, and latency per API call. Nightly gzip archival via no-agent cron.
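A minimal sketch of the dual-write pattern, using only the standard library. The column names are illustrative; the two cache columns mirror the `prompt_cache_hit_tokens` and `prompt_cache_miss_tokens` fields DeepSeek reports in its usage object. This is not the plugin's actual code:

```python
import csv
import sqlite3
from pathlib import Path

# Illustrative columns; cache fields named after DeepSeek's usage fields.
FIELDS = ["ts", "model", "prompt_cache_hit_tokens", "prompt_cache_miss_tokens",
          "completion_tokens", "latency_ms"]

def open_db(path: Path) -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(f"CREATE TABLE IF NOT EXISTS calls ({', '.join(FIELDS)})")
    return db

def log_call(row: dict, csv_path: Path, db: sqlite3.Connection) -> None:
    # 1) CSV first: an append only touches the end of the file, so a crash
    #    can at worst leave a truncated last line.
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    # 2) The same row into SQLite for queries such as cache hit rate per model.
    placeholders = ", ".join("?" * len(FIELDS))
    db.execute(f"INSERT INTO calls VALUES ({placeholders})", [row[k] for k in FIELDS])
    db.commit()
```

Writing the CSV first is the point of the design: if the process dies mid-call, the plain-text record survives and the SQLite table could be rebuilt from it.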

pricing-tools (v1.1.0)

Tools: fetch_pricing, compare_models, list_models
Fetches live pricing from providers. Enriches token logs with real costs using Decimal for money math.
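Why Decimal matters, and what the cost arithmetic looks like, in a short Python sketch. The prices below are placeholders, not real DeepSeek rates, and `call_cost` is an illustrative helper rather than a pricing-tools function:

```python
from decimal import Decimal, ROUND_HALF_UP

PER_MILLION = Decimal(1_000_000)

# Placeholder USD prices per million tokens: NOT real DeepSeek rates.
# In the real setup these would come from fetch_pricing.
PRICES = {
    "deepseek-v4-pro": {"hit": Decimal("0.10"), "miss": Decimal("0.50"), "out": Decimal("2.00")},
}

def call_cost(model: str, hit: int, miss: int, out: int) -> Decimal:
    """Cost of one call: cached input, uncached input, and output tokens priced separately."""
    p = PRICES[model]
    raw = (hit * p["hit"] + miss * p["miss"] + out * p["out"]) / PER_MILLION
    return raw.quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)

# Binary floats drift when many small amounts are summed; Decimal stays exact.
print(sum([0.1] * 10), sum([Decimal("0.1")] * 10))  # → 0.9999999999999999 1.0
```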

gbrosyne (v1.0.0)

Hook: pre_llm_call
Searches the GBrain knowledge base on session start. Injects relevant pages into the first user message.

xdk-twitter (v2.0.0)

Tools: post_tweet, search_tweets, get_timeline, get_user, reply_to_tweet, like_tweet, repost_tweet, delete_tweet, get_tweet, get_me, get_mentions, follow_user
Multi-account support via ~/.hermes/twitter_accounts.yaml

model-providers/deepseek (bundled)

Adds DeepSeek as a provider option. Required for the LiteLLM-to-DeepSeek chain.

One disabled plugin: lossless-hermes, a DAG-based context engine for lossless compression. Its interface has diverged from the newer Hermes base-class methods: the code still loads, but it no longer integrates correctly.

9. Cron Jobs

Job	Schedule	Purpose	Type
Update Hermes	Daily 9:00 UTC	hermes update	Agent
Cat Labs Blog	Daily 10:00 UTC	Blog post generation	Agent
Hermes X Roundup	Daily 16:00 UTC	Popular posts roundup	Agent
GBrain Digest	Every 6 hours	Sync knowledge base	Agent
CodeGraph Reindex	Weekly Mon 4:00 UTC	Reindex all codebases	Agent
Token Logger Archive	Daily 1:00 UTC	Gzip yesterday's CSV	No-agent

Two patterns:

Agent jobs run an LLM with a prompt -- they reason before acting. Used for tasks needing judgment: generating blog posts, curating social roundups, deciding what to sync from a knowledge base.

No-agent jobs run a script directly with no LLM involved. The script's stdout is delivered verbatim. Empty stdout means silent -- no delivery to any channel. Cheaper and faster for mechanical tasks like log archival.

Creating cron jobs

# Agent job
hermes cron create \
  --schedule "0 9 * * *" \
  --prompt "Your self-contained prompt here" \
  --name "my-job" \
  --deliver origin

# No-agent job (script-only, zero tokens)
hermes cron create \
  --schedule "0 * * * *" \
  --script "my-script.py" \
  --no-agent \
  --name "hourly-check"

10. Cache Strategy

The caching approach is the result of measuring what works with DeepSeek's prefix cache:

Layer	Mechanism	Scope
1. Provider cache	DeepSeek prefix caching (30-min TTL)	Stable tier of system prompt
2. Skills-as-message	Skills injected as user text, not system edits	Preserves cacheable prefix
3. Context compression	Auxiliary LLM summarization at 90% threshold	Mid-conversation messages
4. Memory	SQLite durable facts, 2,200-char limit	Cross-session persistence
5. GBrain	PGLite knowledge base, gbrosyne integration	Long-term external knowledge

The central insight: the stable prefix of the system prompt never changes mid-session. Skills, memory, and GBrain context are all injected as user messages or volatile-tier content, never as edits to the stable prefix. As long as the cached prefix hasn't expired, every turn is eligible for a DeepSeek prefix-cache hit, regardless of which skills load or what memories surface.
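One way to check that claim against real traffic is to compute the hit ratio from DeepSeek's usage fields, `prompt_cache_hit_tokens` and `prompt_cache_miss_tokens`, assuming they reach your logs (for example through a logger like token-logger). The session records below are made-up illustrations:

```python
def cache_hit_ratio(usages: list[dict]) -> float:
    """Share of prompt tokens served from DeepSeek's prefix cache."""
    hit = sum(u.get("prompt_cache_hit_tokens", 0) for u in usages)
    miss = sum(u.get("prompt_cache_miss_tokens", 0) for u in usages)
    return hit / (hit + miss) if hit + miss else 0.0

def cache_breaks(usages: list[dict]) -> list[int]:
    """Turns after the first that got zero cache hits."""
    return [i for i, u in enumerate(usages)
            if i > 0 and u.get("prompt_cache_hit_tokens", 0) == 0]

# Made-up usage records for one session.
session = [
    {"prompt_cache_hit_tokens": 0, "prompt_cache_miss_tokens": 5000},    # cold start
    {"prompt_cache_hit_tokens": 4864, "prompt_cache_miss_tokens": 310},  # prefix reused
    {"prompt_cache_hit_tokens": 0, "prompt_cache_miss_tokens": 5400},    # prefix broken
]
print(f"{cache_hit_ratio(session):.1%}", cache_breaks(session))
```

A turn with zero cache hits after the first one is the signal to look for: it usually means something edited the prefix or the cache expired between turns.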

11. Connected Platforms

Platform	Status	Notes
Discord	Connected	Primary interface. Auto-thread, reactions, mention-gated.
Webhook	Connected	Port 8644, secret-authenticated for external triggers.
API Server	Connected	Health checks, REST access, monitoring.
CLI	Local	Direct terminal access for debugging and setup.

The webhook endpoint bridges external services (Cloudflare Workers for OpenRouter, cron triggers) to Hermes. The API server exposes health checks for monitoring and uptime dashboards.

12. MCP Servers

Three MCP servers run alongside Hermes:

mcp_servers:
  exa:
    url: https://mcp.exa.ai/mcp
    timeout: 120
    connect_timeout: 30
  codegraph:
    command: codegraph
    args: [serve, --mcp]
    timeout: 120
    connect_timeout: 60
  gbrain:
    command: gbrain
    args: [serve]
    timeout: 120

Exa provides AI-native web search. CodeGraph indexes codebases and answers structural questions. GBrain is the persistent knowledge base that gbrosyne queries.

13. Quick Start Checklist

[ ] AlmaLinux 9 VPS, Python 3.11, root
[ ] Clone hermes-agent to /usr/local/lib/hermes-agent/
[ ] pip install -e . in venv
[ ] Run hermes setup for guided config
[ ] Install and configure LiteLLM proxy (section 4)
[ ] Sign up for DeepSeek API, put key in LiteLLM config
[ ] Create Discord bot, put token in .env
[ ] Configure ~/.hermes/config.yaml (section 5)
[ ] Enable plugins: hermes plugins enable semantic-skills token-logger pricing-tools
[ ] Copy skills directory structure
[ ] Build skill index: cd ~/.hermes/plugins/semantic-skills && python build_embeddings.py
[ ] Create ~/.hermes/SOUL.md with your persona
[ ] Start LiteLLM: systemctl enable --now litellm
[ ] Start gateway: hermes gateway start
[ ] Verify: send a message on Discord

14. Maintenance

# Update Hermes (also runs daily via cron)
hermes update

# Rebuild skill index after adding or editing skills
cd ~/.hermes/plugins/semantic-skills && python build_embeddings.py

# Check token usage and costs
hermes token-summary
hermes enrich-logs

# Compact sessions database
hermes sessions prune

# Follow live logs
hermes logs --follow

The two most important maintenance tasks are rebuilding the skill index after any skill edit (see section 7) and enriching token logs with live pricing data so cost tracking stays accurate. Pricing pages change, and providers don't notify you when their per-token rates adjust.

This setup has been running since April 2026 with one outage -- a LiteLLM master key overwritten by a masked UI value, producing the misleading "No connected db" error described in section 4. Everything else has been stable. The cache layering (stable system prompt + skills-as-message + prefix-aware provider) earns consistent cache hits across turns and keeps per-conversation token costs predictable.

If you're standing up a fresh instance, work through the first eight items of the section 13 checklist, start LiteLLM and the gateway, and verify the agent responds to a Discord message. Then add plugins and skills incrementally: debugging a full stack from scratch is harder than growing it one layer at a time.