Hermes 4: the open-source agent brain. — DeltaForceOS

◢ The stack

▸OpenAI: GPT models, embeddings, function calling. Pay-as-you-go.
▸GitHub: source control + Vercel auto-deploy on push. Free.

◢ The build · 5 steps · 16 min

Follow these in order. Don't skip.

Step 01 / 05

Pick your runtime — local vs hosted

▸LOCAL via Ollama: runs on an M-series Mac (16GB RAM minimum for an 8B model, 32GB+ recommended; 64GB+ for 70B). Zero API cost. Good for privacy.
▸HOSTED via Together AI — pay per token, no GPU needed. Best price/perf for agents that don't need to be local.
▸HOSTED via Replicate — same as Together, slightly higher latency, easier dashboards.
▸DON'T run Hermes 405B locally. Even at 4-bit quantization the weights alone need roughly 200GB, more than a single 80GB H100 holds. Use the API.
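
The RAM guidance above follows from simple arithmetic: weight memory is roughly parameter count times bytes per weight. A rough sketch of that estimate is below. The 4-bit default and the 1.2× overhead factor (for KV cache and runtime buffers) are assumptions chosen for illustration, not measured values, so treat the result as a lower bound.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4) -> float:
    """Approximate size of the model weights in GB (1e9 params * bytes per weight)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion: float, memory_gb: float,
                   bits_per_weight: float = 4, overhead: float = 1.2) -> bool:
    """True if weights plus an assumed overhead factor fit in the given memory."""
    return estimate_weights_gb(params_billion, bits_per_weight) * overhead <= memory_gb


if __name__ == "__main__":
    for size in (8, 70, 405):
        gb = estimate_weights_gb(size)
        print(f"{size}B at 4-bit: ~{gb:.1f} GB of weights, "
              f"fits in 64 GB: {fits_in_memory(size, 64)}")
```

Real usage also depends on context length and the quantization format you pull, so leave headroom.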

Step 02 / 05

Run Hermes locally with Ollama

Terminal

# Install Ollama and start the server in the background
brew install ollama
ollama serve &

# Pull a Hermes model: pick ONE tag that fits your RAM.
# These are Hermes 2 / Hermes 3 tags; check the Ollama library for newer ones.
ollama pull nous-hermes2:34b      # 64GB+ RAM
ollama pull nous-hermes2:10.7b    # 32GB RAM
ollama pull hermes3:8b            # 16GB RAM

# Quick smoke test (use the tag you pulled)
ollama run hermes3:8b "What is the capital of France?"
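
Before writing agent code, it helps to confirm from Python that the server is up and which tags are installed. A minimal sketch, assuming Ollama is listening on its default port 11434 and using its `/api/tags` listing endpoint; the helper names are made up for this example.

```python
import json
import urllib.request
from typing import List, Optional

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed unchanged)


def model_names(payload: dict) -> List[str]:
    """Extract model tags from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> Optional[List[str]]:
    """Return installed model tags, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):  # connection refused, timeout, or bad JSON
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama is not reachable. Is `ollama serve` running?")
    else:
        print("Installed models:", models or "none yet, run `ollama pull` first")
```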

Step 03 / 05

Use Hermes via the OpenAI-compatible endpoint

Ollama exposes /v1/chat/completions in OpenAI format, so most code written against the OpenAI SDK works with Hermes after changing the base URL and the model name.

agent/hermes_local.py

from openai import OpenAI

# Point the OpenAI SDK at your local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this, but the SDK requires a string
)

resp = client.chat.completions.create(
    model="hermes3:8b",  # must match a tag you pulled with `ollama pull`
    messages=[
        {"role": "system", "content": "You are an agent that uses tools. Always think step by step."},
        {"role": "user", "content": "What's 17 * 23?"},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Run an arithmetic expression and return the result",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
)

# If the model chose to call the tool, the request is in message.tool_calls
# and message.content may be empty. Nothing has been executed yet.
print(resp.choices[0].message)
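
The script above stops at the model's request to use the calculator; your code still has to run the tool and send the result back. One way that loop could look is sketched below. `safe_eval`, `execute_tool_call`, and `answer_with_tools` are hypothetical helpers written for this guide, and `client` is assumed to be the `OpenAI` client from the script above. The evaluator walks the AST instead of calling `eval`, since the expression comes from model output.

```python
import ast
import json
import operator

# Only these arithmetic operators are allowed in calculator expressions.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expression: str):
    """Evaluate +, -, *, / on numeric literals without using eval()."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


def execute_tool_call(name: str, arguments_json: str) -> str:
    """Run one tool call; errors come back as text so the model can react."""
    if name != "calculator":
        return f"error: unknown tool {name!r}"
    try:
        return str(safe_eval(json.loads(arguments_json)["expression"]))
    except (ValueError, KeyError, TypeError, SyntaxError, ZeroDivisionError) as exc:
        return f"error: {exc}"


def answer_with_tools(client, model, messages, tools, max_rounds=3):
    """Call the model, execute any requested tools, repeat until it answers in text."""
    messages = list(messages)  # don't mutate the caller's list
    for _ in range(max_rounds):
        resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        # Echo the assistant's tool request, then append one result per call.
        messages.append({
            "role": "assistant",
            "content": msg.content,
            "tool_calls": [{
                "id": c.id,
                "type": "function",
                "function": {"name": c.function.name, "arguments": c.function.arguments},
            } for c in msg.tool_calls],
        })
        for c in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": c.id,
                "content": execute_tool_call(c.function.name, c.function.arguments),
            })
    return None  # gave up after max_rounds without a text answer
```

The `max_rounds` cap is a design choice: a model that keeps requesting tools would otherwise loop forever.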

Step 04 / 05

Call hosted Hermes (Together AI) instead

agent/hermes_hosted.py

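import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    # Read the key from the environment instead of hard-coding it.
    # TOGETHER_API_KEY is the variable name assumed here; export it first.
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-70B",
    messages=[
        {"role": "system", "content": "You are a tool-using research agent."},
        {"role": "user", "content": "Search the web for the price of ETH and tell me."},
    ],
    # No tools are passed, so the model cannot actually browse the web here.
    # Add a search tool definition (as in Step 03) before trusting the answer.
)
print(resp.choices[0].message.content)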
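
Because both scripts use the same SDK, the only things that differ between local and hosted are the base URL, the key, and the model name. A small sketch of keeping those in one place follows. The `RUNTIMES` table, the helper names, and the `TOGETHER_API_KEY` variable are assumptions for illustration; swap in whichever local tag you actually pulled.

```python
import os

# Per-runtime settings. Model names here are examples, not requirements.
RUNTIMES = {
    "local": {
        "base_url": "http://localhost:11434/v1",
        "api_key_env": None,  # Ollama ignores the key
        "model": "hermes3:8b",
    },
    "hosted": {
        "base_url": "https://api.together.xyz/v1",
        "api_key_env": "TOGETHER_API_KEY",
        "model": "NousResearch/Hermes-3-Llama-3.1-70B",
    },
}


def runtime_settings(name: str, env=None) -> dict:
    """Resolve base_url, api_key, and model for 'local' or 'hosted'."""
    env = os.environ if env is None else env
    if name not in RUNTIMES:
        raise ValueError(f"unknown runtime {name!r}; expected one of {sorted(RUNTIMES)}")
    cfg = RUNTIMES[name]
    key_env = cfg["api_key_env"]
    api_key = "ollama" if key_env is None else env.get(key_env)
    if not api_key:
        raise RuntimeError(f"set the {key_env} environment variable first")
    return {"base_url": cfg["base_url"], "api_key": api_key, "model": cfg["model"]}


def make_client(name: str):
    """Build an OpenAI client for the chosen runtime; returns (client, model)."""
    from openai import OpenAI  # imported lazily so the settings helper has no dependency
    s = runtime_settings(name)
    return OpenAI(base_url=s["base_url"], api_key=s["api_key"]), s["model"]
```

Usage would be `client, model = make_client("local")`, with the rest of the agent code unchanged between runtimes.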

Step 05 / 05

When Hermes wins, when it loses

▸WINS: tool-call reliability is excellent — Hermes was fine-tuned specifically for agent loops
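▸WINS: cost. 70B at Together AI is ~$0.88/M tokens vs Sonnet at $3/M input, about 3.4× cheaper on input at those prices, and the gap widens on output-heavy workloads because Sonnet's output tokens cost more than its input tokens.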
▸WINS: privacy — local runs never leave your machine. Compliance-friendly.
▸LOSES: long-context reasoning — Sonnet/GPT-5 still outperform on 50k+ token tasks.
▸LOSES: code review nuance — Claude wins for senior-engineer-level feedback.
▸Use Hermes for: routing, classification, tool-calling agents, RAG synthesis. Use Claude/GPT-5 for: hard reasoning, code review, long-doc analysis.
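
The cost claim above is easy to check against your own traffic. A back-of-the-envelope sketch: the $0.88/M and $3/M input figures are the ones quoted above, while the flat Hermes rate for output and the $15/M Sonnet output price are assumptions added here. Prices change, so plug in current numbers.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Total cost in USD given per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


if __name__ == "__main__":
    # Example workload (made up): 1M input tokens, 200k output tokens.
    inp, out = 1_000_000, 200_000
    hermes = cost_usd(inp, out, 0.88, 0.88)   # assumes the same rate for input and output
    sonnet = cost_usd(inp, out, 3.00, 15.00)  # output price is an assumption, not from this guide
    print(f"Hermes 70B: ${hermes:.2f}")
    print(f"Sonnet:     ${sonnet:.2f}")
    print(f"Ratio:      {sonnet / hermes:.1f}x")
```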

◆ Ship-it checklist

5 CHECKS

Ollama installed and serving
At least one Hermes model pulled (size matched to your RAM)
OpenAI-SDK code calling Hermes locally (the same SDK works for the hosted endpoint)
Optional: a Together AI key for the 70B+ models you can't run locally
You compared one task's output across Hermes vs Claude vs GPT-5 — you know what each is good at
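
For the last check, a tiny harness keeps the comparison honest by sending the identical prompt to every backend. This is a sketch under assumptions: `compare` is a hypothetical helper, and each backend is any function that takes a prompt and returns text, so you wrap your own Hermes, Claude, and GPT calls yourself.

```python
import time
from typing import Callable, Dict


def compare(prompt: str, backends: Dict[str, Callable[[str], str]]) -> Dict[str, dict]:
    """Run one prompt through each backend, recording output, error, and latency."""
    results = {}
    for name, ask in backends.items():
        start = time.perf_counter()
        try:
            output, error = ask(prompt), None
        except Exception as exc:  # one failing backend shouldn't stop the others
            output, error = None, str(exc)
        results[name] = {
            "output": output,
            "error": error,
            "seconds": time.perf_counter() - start,
        }
    return results


if __name__ == "__main__":
    # Stand-in backends; replace with functions that call your real clients.
    demo = compare("Classify: 'refund please'", {
        "echo": lambda p: p.upper(),
        "broken": lambda p: 1 / 0,
    })
    for name, r in demo.items():
        print(name, r["output"], r["error"], f"{r['seconds']:.4f}s")
```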
