◢ The stack
◢ The build · 5 steps · 16 min
Follow these in order. Don't skip.
Step 01 / 05
Pick your runtime — local vs hosted
- ▸LOCAL via Ollama: runs on an M-series Mac (16GB+ RAM for the 8B model, 64GB+ for 70B). Zero API cost. Good for privacy.
- ▸HOSTED via Together AI: pay per token, no GPU needed. Best price/performance for agents that don't need to be local.
- ▸HOSTED via Replicate: same pay-per-token model as Together, with slightly higher latency and easier dashboards.
- ▸DON'T run Hermes 405B locally unless you have a multi-GPU server; the weights alone exceed the memory of a single H100. Use the API.
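To sanity-check a model size against your RAM, a rough estimate helps. The sketch below assumes 4-bit quantized weights and an arbitrary 1.5× headroom factor for the KV cache and the OS; both are illustrative assumptions, not measured values, and the function names are hypothetical.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_in_ram(params_billion: float, ram_gb: float,
                bits_per_weight: float = 4.0, headroom: float = 1.5) -> bool:
    """True if the weights times a safety factor fit in the given RAM.

    `headroom` is an assumed multiplier covering the KV cache, the OS,
    and other apps; tune it for your own machine.
    """
    needed = estimate_weights_gb(params_billion, bits_per_weight) * headroom
    return needed <= ram_gb


for size in (8, 34, 70, 405):
    print(f"{size}B at 4-bit: ~{estimate_weights_gb(size):.1f} GB of weights")
```

By this estimate a 405B model needs roughly 200 GB for weights alone, which is why the hosted API is the practical route for it.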
Step 02 / 05
Run Hermes locally with Ollama
Terminal
```shell
# Install Ollama and start the server in the background
brew install ollama
ollama serve &

# Pull a Hermes model: pick the size that fits your RAM.
# Tags follow the Ollama model library; confirm them there before pulling.
ollama pull nous-hermes2:34b     # 64GB+ RAM
ollama pull nous-hermes2:10.7b   # 32GB RAM
ollama pull hermes3:8b           # 16GB RAM

# Quick smoke test
ollama run hermes3:8b "What is the capital of France?"
```
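If you'd rather verify the install from Python than from the terminal, something like the sketch below could work. It assumes Ollama's default port and its `/api/tags` endpoint, which lists pulled models; the helper names are hypothetical.

```python
import json
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model tags from the JSON returned by Ollama's /api/tags."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))


def has_hermes(names: list[str]) -> bool:
    """True if any pulled tag looks like a Hermes build."""
    return any("hermes" in name.lower() for name in names)
```

Calling `has_hermes(list_local_models())` raises `urllib.error.URLError` when the server isn't running, which doubles as the "is Ollama serving?" check.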
Step 03 / 05
Use Hermes via the OpenAI-compatible endpoint
Ollama exposes /v1/chat/completions in OpenAI format. Any code that uses the OpenAI SDK works against Hermes once you change the base URL and the model name.
agent/hermes_local.py
```python
from openai import OpenAI

# Point the OpenAI SDK at your local Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this, but the SDK requires a string
)

resp = client.chat.completions.create(
    model="hermes3:8b",  # must match a tag you pulled; check with `ollama list`
    messages=[
        {"role": "system", "content": "You are an agent that uses tools. Always think step by step."},
        {"role": "user", "content": "What's 17 * 23?"},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Run an arithmetic expression and return the result",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
)

# The reply carries either plain text in .content or a request in .tool_calls
print(resp.choices[0].message)
```
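The script above only prints the model's reply; when the model asks for the `calculator` tool, your code still has to run it. Here is a minimal sketch of that step, assuming the reply follows the OpenAI tool-call shape (`tool_calls[i].function.name` and `.arguments` as a JSON string). The helper names are hypothetical, and the evaluator is deliberately limited to basic arithmetic.

```python
import ast
import json
import operator

# Arithmetic operators the calculator tool is allowed to apply.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, / on numeric literals without calling eval()."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)


def run_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call; arguments arrive as a JSON string."""
    if name != "calculator":
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    return str(safe_eval(args["expression"]))


def tool_messages(message) -> list[dict]:
    """Build the `role: tool` replies for every tool call in a message."""
    return [
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        }
        for call in (message.tool_calls or [])
    ]
```

To close the loop, append `resp.choices[0].message` and then `tool_messages(resp.choices[0].message)` to your `messages` list and call `client.chat.completions.create` again so the model can phrase the final answer.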
Step 04 / 05
Call hosted Hermes (Together AI) instead
agent/hermes_hosted.py
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],  # export your Together AI key under this name first
)

resp = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-70B",
    messages=[
        {"role": "system", "content": "You are a tool-using research agent."},
        {"role": "user", "content": "Search the web for the price of ETH and tell me."},
    ],
)
# No tools are passed in this call, so the model cannot actually search;
# it can only answer from what it already knows.
print(resp.choices[0].message.content)
```
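Since the local and hosted scripts differ only in base URL and key, you can keep both behind one switch. A minimal sketch, using the two URLs from this guide; `client_settings` is a hypothetical helper and `TOGETHER_API_KEY` is an assumed environment variable name.

```python
import os


def client_settings(runtime: str, env=os.environ) -> dict:
    """Return base_url and api_key for the chosen runtime ("local" or "hosted")."""
    if runtime == "local":
        # Ollama ignores the key, but the OpenAI SDK requires a non-empty string.
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    if runtime == "hosted":
        key = env.get("TOGETHER_API_KEY")  # assumed variable name
        if not key:
            raise RuntimeError("set TOGETHER_API_KEY before using the hosted runtime")
        return {"base_url": "https://api.together.xyz/v1", "api_key": key}
    raise ValueError(f"unknown runtime: {runtime!r}")
```

Usage would look like `client = OpenAI(**client_settings("hosted"))`; the model name still has to be chosen per runtime, since local tags and hosted model IDs differ.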
Step 05 / 05
When Hermes wins, when it loses
- ▸WINS: tool-call reliability is excellent — Hermes was fine-tuned specifically for agent loops
- ▸WINS: cost. 70B at Together AI is ~$0.88/M tokens vs Sonnet at $3/M input, roughly 3.4× cheaper on input tokens at these prices.
- ▸WINS: privacy. With local runs, your data never leaves your machine. Compliance-friendly.
- ▸LOSES: long-context reasoning — Sonnet/GPT-5 still outperform on 50k+ token tasks.
- ▸LOSES: code review nuance — Claude wins for senior-engineer-level feedback.
- ▸Use Hermes for: routing, classification, tool-calling agents, RAG synthesis. Use Claude/GPT-5 for: hard reasoning, code review, long-doc analysis.
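The split above can be written down as a tiny router plus a cost check. This is a sketch only: the task labels and function names are hypothetical, the 50k threshold and the two prices are the figures quoted in the list, and prices change, so treat the numbers as placeholders.

```python
# Task labels are illustrative names for the categories in the list above.
HERMES_TASKS = {"routing", "classification", "tool_calling", "rag_synthesis"}
FRONTIER_TASKS = {"hard_reasoning", "code_review", "long_doc_analysis"}
LONG_CONTEXT_TOKENS = 50_000  # the "50k+ token" threshold quoted above


def pick_model_family(task: str, prompt_tokens: int = 0) -> str:
    """Return "hermes" or "frontier" (Claude/GPT-5) for a task label."""
    if prompt_tokens >= LONG_CONTEXT_TOKENS or task in FRONTIER_TASKS:
        return "frontier"
    if task in HERMES_TASKS:
        return "hermes"
    raise ValueError(f"unknown task label: {task!r}")


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


hermes = input_cost_usd(10_000_000, 0.88)
sonnet = input_cost_usd(10_000_000, 3.00)
print(f"10M input tokens: Hermes 70B ${hermes:.2f}, Sonnet ${sonnet:.2f}, "
      f"ratio {sonnet / hermes:.1f}x")
```

Note that a long prompt overrides the task label: a classification job over a 60k-token document still goes to the frontier model in this sketch.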
◆ Ship-it checklist
5 CHECKS
- Ollama installed and serving
- At least one Hermes model pulled (size matched to your RAM)
- OpenAI-SDK code calling Hermes locally (the same SDK works for local and hosted)
- Optional: a Together AI key for the 70B+ models you can't run locally
- You compared one task's output across Hermes vs Claude vs GPT-5 — you know what each is good at
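For the last check, a small harness keeps the comparison honest by sending the identical prompt everywhere. A minimal sketch, assuming each provider can be wrapped as a prompt-to-text function; `compare_models` and `make_asker` are hypothetical helpers, and `make_asker` assumes an OpenAI-compatible client like the ones used earlier.

```python
from typing import Callable


def compare_models(prompt: str, askers: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """Send one prompt to every model and collect the answers by name."""
    results = {}
    for name, ask in askers.items():
        try:
            results[name] = ask(prompt)
        except Exception as exc:  # keep going if one provider fails
            results[name] = f"<error: {exc}>"
    return results


def make_asker(client, model: str) -> Callable[[str], str]:
    """Wrap an OpenAI-compatible client as a prompt -> text function."""
    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    return ask
```

Build one asker per provider, pass them in a dict keyed by name, and read the answers side by side.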


