Claude Code · Live Session Demo
Seven words,
fully diagnosed.
Rodrique Heron · 2026-05-14
Agentic AI as my SRE coworker — what's possible today, how it's wired, and the open question of building this without Claude Code as the harness.
What's running before anything breaks
The system under test
Frigate NVR — a fleet of Reolink cameras
Host — nvr-host.lab.example.net
Detection — RTX 3060 12 GB (CUDA via CDI)
Network — macvlan VLAN 23, 192.168.23.10
Storage — 1.5 TB media LV
How it's deployed
Rootful Podman + systemd Quadlet
Ansible playbook + Quadlet template
IaC — cameras, zones, masks under source control
Recovery — bin/lab-frigate-recover + healthcheck timer
Documented — CLAUDE.md, recovery modes A and B
I opened a Claude Code session in this homelab's repo, typed seven words, attached two screenshots, and hit send. That is the entirety of what I provided.
“the frigate system
is not in a good state”
7 words. Two screenshots. No host. No service. No logs. The next slide is the replay of what happened in the next 12 seconds — Claude Code doing the work, not me typing commands at a shell.
Opening probe — use the controls to pause, scrub, or replay
Legend: ▌ USER: what Rodrique typed · ▌ CLAUDE CODE: what Claude said · ⏺ Bash: Claude calling a tool · tool result
Claude Code is doing the work, not Rodrique. The ⏺ Bash markers are Claude invoking the Bash tool — each one is a tool call decided by the model, not a command Rodrique typed. Replayed via asciinema, an open-source terminal session recorder.
What Claude Code already had, before the user said anything else
What the user provided
7 words
2 screenshots of broken camera tiles

No host. No service. No logs. No directives.
What Claude already had (from preloaded context)
Host: nvr-host.lab.example.net
API: http://192.168.23.10:5000
Deploy: rootful Podman + macvlan Quadlet
Recovery script: bin/lab-frigate-recover
Failure Modes A & B documented in CLAUDE.md
Storage layout: lv_frigate_config / lv_frigate_media
Playbook: setup-nvr-host-frigate.yml
5 prior Tuna memories on Frigate operational state
The smoking gun — cache, threads, cgroup pids
Legend: ▌ CLAUDE CODE: Claude's reasoning · ⏺ Bash: Claude calling a tool · tool result
Three Bash tool calls. Claude decides what to run, runs it, reads the output, then decides the next call. The cgroup pids.events: max 90 result is what names the root cause directly — most "out of threads" bugs surface as random Python or libc errors instead.
Podman's silent default, never changed in the Quadlet
cgroup pids.max · 2,048 · Podman's default. Frigate 0.17 + the fleet + audio + embeddings is too thread-heavy for this.
pids.current · 1,926 · 94% of the cap. recording_manager can't spawn move-queue workers.
pids.events: max 90 · 90 confirmed fork/clone denials since startup. The cgroup is the smoking gun.
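The three readings above come from the container's cgroup v2 pids controller files. As a minimal sketch, the same check could be scripted as follows. The function names and the 90% warning threshold are illustrative assumptions, not something from the session, and the cgroup directory is left to the caller because the actual path on the host is not shown here.

```python
from pathlib import Path
from typing import Optional


def parse_pids_max(raw: str) -> Optional[int]:
    """pids.max holds an integer, or the literal 'max' when unlimited."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def parse_pids_events(raw: str) -> dict:
    """pids.events is a flat keyed file with lines such as 'max 90'."""
    events = {}
    for line in raw.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events


def pids_report(current: int, limit: Optional[int], denied: int,
                warn_at: float = 0.90) -> dict:
    """Summarise headroom; warn_at is an arbitrary illustrative threshold."""
    usage = None if limit is None else current / limit
    return {
        "usage_pct": None if usage is None else round(usage * 100),
        "near_cap": usage is not None and usage >= warn_at,
        "forks_denied": denied > 0,
    }


def read_cgroup(cgroup_dir: str) -> dict:
    """Read the three files from a cgroup directory supplied by the caller."""
    base = Path(cgroup_dir)
    limit = parse_pids_max((base / "pids.max").read_text())
    current = int((base / "pids.current").read_text())
    denied = parse_pids_events((base / "pids.events").read_text()).get("max", 0)
    return pids_report(current, limit, denied)
```

Fed the numbers from this incident (1,926 current, 2,048 limit, 90 denials), `pids_report` flags both `near_cap` and `forks_denied`. A nonzero `max` counter is the stronger signal of the two, since it records forks the kernel actually refused.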
One Quadlet directive. One playbook run.
playbooks/setup-nvr-host-frigate.yml — Quadlet [Container] section
Tmpfs=/tmp:rw,size=8G

+ # cgroup pids cap. Podman default 2048 is too low.
+ PidsLimit=8192

Environment=FRIGATE_RTSP_PASSWORD=...
ansible-playbook playbooks/setup-nvr-host-frigate.yml \
  --limit nvr-host.lab.example.net -e mqtt_enabled=true -e confirm=yes

changed: nvr-host — Lay down Frigate container quadlet
changed: nvr-host — Restart macvlan network FIRST
changed: nvr-host — Restart Frigate
ok: Frigate API is up: 0.17.1-416a9b7
ok: fleet recovered, all streaming
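Because the missing directive was silent, one possible guard is a lint step that fails when a rendered Quadlet file has no explicit `PidsLimit=` in its `[Container]` section. The sketch below is an assumption about how such a check might look, not part of the playbook; `container_pids_limit` is a hypothetical helper, and it uses Python's `configparser` only as a rough reader for the unit-file syntax.

```python
import configparser
from typing import Optional


def container_pids_limit(quadlet_text: str) -> Optional[int]:
    """Return the PidsLimit value from the [Container] section, or None."""
    parser = configparser.ConfigParser(
        strict=False,          # Quadlet files repeat keys such as Environment=
        interpolation=None,    # values may contain literal % characters
        delimiters=("=",),     # ':' appears inside values like Tmpfs=/tmp:rw
    )
    parser.optionxform = str   # keep key case; configparser lowercases by default
    parser.read_string(quadlet_text)
    if not parser.has_option("Container", "PidsLimit"):
        return None
    return parser.getint("Container", "PidsLimit")


def check(quadlet_text: str, minimum: int) -> bool:
    """True when an explicit limit is present and at least `minimum`."""
    limit = container_pids_limit(quadlet_text)
    return limit is not None and limit >= minimum
```

A file without the directive returns `None`, which is exactly the state this container was in before the fix: Podman's default applied and nothing in source control said so.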
~3 minutes of downtime — then:
Metric | Before | After
pids.max | 2,048 (Podman default) | 8,192
pids.current | 1,926 (94%) | 1,882 (23%)
pids.events: max | 90 denied forks | 0
/tmp/cache files | 3,215 stuck | 24 in-flight
mp4s written (2 min) | 0 | 50
Cameras streaming | watchdog cycle | all
Same session, follow-through — commit, issues, memories, propagation
Legend: ▌ CLAUDE CODE: narration · ⏺ Bash: shell command · ⏺ Edit: file edit · ⏺ mcp__tuna__memory_store: MCP tool call · result
Multiple tool types, all decided by Claude. Same session: Bash for git + gh, Edit for the CLAUDE.md update, mcp__tuna__memory_store (an MCP server I built) for persisting the gotcha. Four issues filed with full context, two memories so the same trap doesn't bite again, fix propagated upstream into the shared Ansible role and the vLLM Quadlet template.
Part Two
What makes
this possible
The agentic behaviour isn't built into Claude Code by default. It comes from three layers that were built deliberately over time — and it matters who built what.
Three sources of capability, clearly separated
Anthropic
Claude Code itself — the agent harness, the loop, tool execution.
The model — Claude Opus 4.7 in this session.
Skill / Agent / TaskCreate machinery; the superpowers plugin (e.g. systematic-debugging, brainstorming).
MCP protocol — standardized tool calls.
Open Source / Community
Ansible, Podman / Quadlet, vLLM, Qdrant, Frigate, asciinema (this replay).
The MCP server ecosystem — Gmail, Mattermost, GitHub, Chrome DevTools.
Model weights — Llama, Qwen, DeepSeek, Granite.
Rodrique (this homelab)
CLAUDE.md + AGENTS.md per project — the shared vocabulary of the infrastructure.
A persistent memory MCP server — gives any AI agent (Claude, Gemini, Codex) cross-session, cross-tool semantic memory.
The skill library: lab-* for lab-specific operations, general-* for more general-purpose workflows.
A local vLLM model platform, a multi-model orchestration experiment, and the Ansible collections that deploy it all.
A strong CLAUDE.md is the single biggest unlock
Claude Code reads this file at every session start. It's the agent's operational vocabulary for the system — everything you'd otherwise re-explain on every prompt.
~/.claude/CLAUDE.md
Global
Cross-project rules: my tone, my safety boundaries, my conventions for SSH, vault use, Bash safety, env defaults.
<repo>/CLAUDE.md
Per-project
What this project is, what's safe to edit, how to deploy, how to recover. Documented failure modes belong here.
<repo>/sub/AGENTS.md
Per-directory
For complex subtrees that need their own rules — an Ansible role's local conventions, a service's deploy quirks. AGENTS.md is the cross-agent variant (also read by Codex, Cline, Gemini).
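To make the three scopes concrete, here is a small sketch that lists which context files would apply to a working directory, ordered from most general to most specific. It is an illustration of the layering only. It is not Claude Code's actual loading logic, and the `context_files` helper and its ordering are assumptions.

```python
from pathlib import Path
from typing import Callable, List


def context_files(home: Path, repo_root: Path, workdir: Path,
                  exists: Callable[[Path], bool] = Path.is_file) -> List[Path]:
    """List applicable context files, general first, specific last.

    workdir must be inside repo_root, otherwise relative_to raises ValueError.
    """
    candidates = [
        home / ".claude" / "CLAUDE.md",   # global rules
        repo_root / "CLAUDE.md",          # per-project rules
    ]
    current = repo_root
    for part in workdir.relative_to(repo_root).parts:
        current = current / part
        candidates.append(current / "AGENTS.md")  # per-directory rules
    return [path for path in candidates if exists(path)]
```

The `exists` parameter is there so the lookup can be exercised against a fake file listing; by default it checks the real filesystem.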
# Excerpt from this repo's CLAUDE.md. Modes A and B are what Claude read before doing anything today; Mode C was added during the session.

FRIGATE RECOVERY — three failure modes, two recovery paths:
  Mode A: systemctl reports `activating` >90s → bin/lab-frigate-recover
  Mode B: API returns 000/refused → bin/lab-frigate-recover (IPAM drift)
  Mode C: API 200, fps=3, no segments → systemctl restart (added today)
FRIGATE STATE: /opt/podman/containers/frigate/config/ · /opt/podman/containers/frigate/media/
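The excerpt reads like a decision table, so it can be sketched as one. The code below is a hypothetical classifier, not the contents of bin/lab-frigate-recover: the `Observation` fields are assumed inputs a healthcheck might collect, and the Mode C rule uses only the "API 200, no segments" part of the signature, leaving the fps reading out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Observation:
    unit_state: str             # e.g. what `systemctl is-active` reports
    seconds_in_state: float     # how long the unit has been in that state
    api_status: Optional[int]   # None when the connection is refused (curl's 000)
    new_segments: int           # recordings written during the sample window


def classify(obs: Observation) -> Tuple[str, str]:
    """Map an observation to (mode, recovery path) per the documented modes."""
    if obs.unit_state == "activating" and obs.seconds_in_state > 90:
        return ("A", "bin/lab-frigate-recover")
    if obs.api_status is None:
        return ("B", "bin/lab-frigate-recover")
    if obs.api_status == 200 and obs.new_segments == 0:
        return ("C", "systemctl restart")
    return ("none", "no action")
```

Mode C is the case an API-only healthcheck misses: the endpoint answers 200 while nothing is being written to disk, so the check needs a second signal beyond the HTTP status.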
Tuna — a memory layer I built so AI agents stop forgetting
What it is
A small MCP server that gives any AI agent (Claude Code, Gemini CLI, Codex, Cline) persistent, semantic memory across sessions.

Memories store as text + tags + project. Search is vector-based — ask “Frigate failure modes” and it returns related entries ranked by similarity.

Backed by a vector DB and an embedding service running on my GPU host. Nothing about it is Anthropic-specific.
Why I built it
Claude Code's built-in memory is per-tool — Gemini CLI and Codex can't see it. Each agent forgets across sessions and across tools.

With Tuna, a debugging insight stored from a Claude session is found later by Gemini or Codex. Memories outlive the session and span the toolchain.

Auto-injected at session start (the 5 memories you saw earlier). Searchable mid-session. Written at end of session.
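The store-and-search shape described above (text + tags + project, ranked by similarity) can be sketched in a few lines. This is not Tuna's code. The `MemoryStore` class is hypothetical, and the `embed` function is a toy word-count stand-in for a real embedding service, used here only so the example runs without a GPU or a vector DB.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Memory:
    text: str
    tags: List[str]
    project: str


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class MemoryStore:
    def __init__(self) -> None:
        self._items = []  # list of (Memory, vector) pairs

    def store(self, text: str, tags: List[str], project: str) -> Memory:
        memory = Memory(text, list(tags), project)
        self._items.append((memory, embed(text + " " + " ".join(tags))))
        return memory

    def search(self, query: str, project: Optional[str] = None,
               limit: int = 3) -> List[Memory]:
        """Return up to `limit` memories ranked by similarity to the query."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), memory)
                  for memory, vec in self._items
                  if project is None or memory.project == project]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [memory for score, memory in scored[:limit] if score > 0]
```

Because the interface is just store and search over plain text, any agent that can call a tool can use it, which is the property that makes the memory span Claude, Gemini, and Codex.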
Skills — methodology and workflows, not improvisation
From Anthropic
superpowers
The official skills plugin. systematic-debugging, brainstorming, writing-plans, test-driven-development. Generic, project-agnostic.

Today's session was held to systematic-debugging rules — that's why no fix was attempted before the evidence was on the table.
From me
lab-* · general-*
lab-* are lab-specific: lab-deploy-container, lab-run-infra-milestone, lab-frigate-recover. They codify how things get done in this homelab.

general-* are general-purpose workflows that aren't tied to my lab at all — general-capture-knowledge is the one that wrote up this session into memory, docs, and issues.
Anthropic ships the skill machinery. I write the skills that match the work I actually do.
What 7 words produced
+ 7 words: user input
+ CLAUDE.md: failure modes, recovery scripts, storage layout
+ 5 Tuna memories: prior sessions of Frigate-specific operational knowledge
+ superpowers + my skills: discipline (systematic-debugging) + workflow (general-capture-knowledge)
+ Claude Opus 4.7: the model that ties it together
= Root cause found in 45 min. 3-min downtime. Zero follow-up questions. 9 artifacts produced: 1 commit, 4 issues, 2 Tuna memories, 1 CLAUDE.md update, 1 session doc.
Issues + CLAUDE.md updates aren't ceremony — they're the compounding effect
Without the wrap-up
Every session starts cold. The same diagnosis runs again next time.

Follow-ups discovered mid-session — the thread leak, the latent MQTT bug, the healthcheck blindspot — evaporate the moment the context window resets.

Three months later, the same trap bites again. Nobody remembers it was diagnosed already.
With the wrap-up
4 GitHub issues — follow-up work captured with full context, ready for any future session (mine or another agent's).

CLAUDE.md updated — the new Mode C signature is now part of the on-rails recovery playbook. The next session reads it at start.

2 persistent memories — the cgroup trap is now searchable from any AI tool. Claude, Gemini, Codex can all find it next time.

The next session begins smarter than this one did.
Each session is a deposit, not a one-shot. The context layer compounds — and the agent that reads it gets sharper with every incident, not flatter.
Part Three · The Open Question
What would it take to
build this without
Claude Code?
The harness is Anthropic's. The model is Anthropic's. If I want this same agentic behaviour on local infrastructure I own end-to-end, I have to recreate the harness myself — and the constraints aren't where I expected them.
What fits on my GPUs — and what doesn't
A note on the lab
I've always had a lab. The hardware here is years of acquisition — not a budget request. I keep a lab because it's how I actually learn: if I can't touch it, I never really learn it.
All served via vLLM in Podman Quadlets on RHEL. Two GPU hosts: gpu-host-a (RTX 6000 Ada, 48 GB) and gpu-host-b (24 GB). Models I've played with:
Model | Size / Quant | VRAM | Verdict in my lab
Llama 3.3 70B Instruct | INT4 / AWQ | ~35 GB | Runs, but ctx capped at 2,048 tokens (KV cache starved)
Llama 3.1 70B | AWQ | ~37 GB | 5× slower inference, 90% GPU util — barely usable for agents
Qwen 2.5 72B Instruct | AWQ | ~36 GB | Same envelope as Llama 70B
Qwen 2.5 Coder 32B | AWQ | ~16 GB | Coding workhorse. Tool-calling reliable.
DeepSeek-R1 Distill (Qwen 32B) | AWQ | ~16 GB | Best reasoning under 70B. Slow but thorough.
Gemma 4 26B (A4B) | AWQ-4bit (custom) | ~14 GB | Strong on structured-extraction evals. Custom vLLM build for gemma4 parser.
Gemma 4 e4b | native | ~10 GB | Lightweight worker option.
Granite 4.0 H-Small (FP8) | FP8 | ~14 GB | Red Hat / IBM model. Production-leaning.
Granite 3.3 8B Instruct | FP16 | ~16 GB | Fast supervisor role. Limited reasoning depth.
Qwen 2.5 Coder 7B | FP16 | ~14 GB | Fast worker for known patterns.
Mixtral 8x22B | AWQ | ~70 GB | Doesn't fit. OOM on single 48 GB GPU.
The constraint isn't quality — it's size. Even Llama 70B INT4 has 0.82 GB left for KV cache after weights, so it gets a 2,048-token context. That's roughly enough to read CLAUDE.md, not enough to debug Frigate.
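The link between 0.82 GB and a 2,048-token cap can be checked with back-of-envelope arithmetic. The sketch below assumes the published Llama 3 70B shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and an FP16 KV cache. The cache dtype is an assumption about this deployment, and the estimate ignores any per-block overhead in the serving engine.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 covers the separate key and value tensors.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_bytes: float, per_token: int) -> int:
    """How many tokens fit in the memory left over after weights."""
    return int(free_bytes // per_token)


# Assumed model shape: Llama 3 70B with an FP16 KV cache.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

# 0.82 GB from the slide, read as decimal gigabytes and as GiB.
budget_decimal = max_cached_tokens(0.82 * 1e9, per_token)
budget_binary = max_cached_tokens(0.82 * 2**30, per_token)
```

Each token costs 327,680 bytes (320 KiB), so 0.82 GB holds roughly 2,500 to 2,700 tokens depending on how the gigabyte is counted. That is above 2,048 and well below 4,096, which is consistent with the context length being capped at 2,048.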
Multiple small models can compensate — if something orchestrates them
The compensation pattern
The work of one large model can be divided among three smaller models, each with a role:

Supervisor — Granite 8B (planning, tool selection)
Worker — Llama 70B INT4 (deep reasoning when needed)
Coder — Qwen 2.5 Coder 7B (fast code generation)

Each fits comfortably. Together they cover the work envelope of one frontier model.
What's hard about running them together
Load/unload time — ~2.5 minutes to swap a 32B model on a 48 GB GPU. Agent loops are stalled while this happens.

Context handoff — supervisor's plan has to travel to the worker; worker's output back to supervisor; coder's diffs back to all. None of this is automatic.

Tool calling — each model needs its own prompt template, system prompt, tool schema, retry policy.

The role logic itself — when does the supervisor delegate vs. handle? When does the worker escalate? That's policy, and policy is the harness.
The unlock isn't a bigger model. It's the orchestrator that makes small models behave like a coherent agent. I've prototyped this orchestration pattern myself with mixed results — the policy logic is the hard part, not the model serving.
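The load/unload cost is easy to underestimate, so here is a rough sketch of how it adds up over an agent loop. It assumes a single GPU that holds one model at a time and applies the ~2.5 minute swap figure uniformly to every model, both of which are simplifications; the function names are illustrative.

```python
from typing import Iterable, Optional

SWAP_SECONDS = 150  # ~2.5 minutes per swap, applied uniformly as an assumption


def count_swaps(steps: Iterable[str], resident: Optional[str] = None) -> int:
    """Count model loads needed to run `steps` in order on one GPU."""
    swaps = 0
    for model in steps:
        if model != resident:
            swaps += 1
            resident = model
    return swaps


def stall_seconds(steps: Iterable[str], resident: Optional[str] = None,
                  swap_seconds: int = SWAP_SECONDS) -> int:
    """Total time the agent loop spends waiting on model swaps."""
    return count_swaps(steps, resident) * swap_seconds


# A supervisor that checks in between every delegated step...
interleaved = ["supervisor", "worker", "supervisor", "coder", "supervisor"]
# ...versus the same five calls grouped by model.
grouped = ["supervisor", "supervisor", "supervisor", "worker", "coder"]
```

From a cold start the interleaved plan stalls for 750 seconds and the grouped plan for 450. Grouping is only valid when the steps do not depend on each other's output, which is one reason the delegation policy and the serving layout cannot be designed separately.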
Where I've landed — from a real local-harness eval
April 2026: I ran 7 benchmark suites comparing local coding harnesses against qwen3-coder-30b via my LiteLLM gateway, using real engineering-shaped tasks. Conclusion: no one-harness-for-everything — route by action class.
Action class | Default harness | Status in my lab
Agentic / supervised implementation | Cline CLI | Settled default. Off-the-shelf, MCP-aware.
Deterministic patch / known fix | Aider | Settled default for constrained edits.
General agent runtime | Goose | Tried. Did not earn default status.
Sandboxed multi-file | OpenHands, OpenCode | Tried. Same outcome.
Local-build candidates | claw-code, CheetahClaws | Not benchmarked end-to-end (toolchain prereqs).
The hard problem isn't picking a harness. It's the supervisor logic on top — the policy that picks which harness for which task, threads context between calls, and owns the validation. That's what I'm still building.
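A minimal sketch of that supervisor layer, assuming the two settled defaults from the table, might look like this. The `Task` type, the `supervise` function, and the `runners` and `validate` callables are all hypothetical; a real runner would invoke the harness's CLI, which is deliberately left out here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict


class ActionClass(Enum):
    AGENTIC = "agentic / supervised implementation"
    PATCH = "deterministic patch / known fix"


# Settled defaults from the table; other action classes have no default yet.
DEFAULT_HARNESS: Dict[ActionClass, str] = {
    ActionClass.AGENTIC: "Cline CLI",
    ActionClass.PATCH: "Aider",
}


@dataclass
class Task:
    description: str
    action_class: ActionClass
    context: Dict[str, str] = field(default_factory=dict)


def supervise(task: Task,
              runners: Dict[str, Callable[[Task], str]],
              validate: Callable[[str], bool]) -> str:
    """Pick a harness by action class, run it, validate, record the output."""
    harness = DEFAULT_HARNESS[task.action_class]
    output = runners[harness](task)
    if not validate(output):
        raise RuntimeError(f"{harness} output failed validation")
    task.context[harness] = output  # thread the result into later calls
    return output
```

The routing itself is a dictionary lookup. The parts that carry the weight are `validate` and what gets written into `task.context`, which matches the observation that the policy is the hard part.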
Red Hat is shipping the pattern I've been prototyping
Headline finding · vLLM Semantic Router (Athena release, March 2026)
A lightweight classifier (ModernBERT) routes each request by intent and complexity. Simple queries go to a small or local model; reasoning-heavy ones go to the larger one. OpenAI-compatible API — agents see no change.

This is the same architectural pattern as the multi-model orchestration I was prototyping. The router picks the right model; my harness picks the right worker. They compose. I was building this from scratch; Red Hat already shipped it.
~90% token-cost reduction via routing common requests locally
86% of demo requests stayed on the free/local model
−48% tokens & −47% latency, Qwen3 30B benchmark
+10% accuracy on complex tasks (auto-reasoning routing)
Other Summit 2026 announcements that matter
Red Hat AI 3.4 — Model-as-a-Service with governed interfaces, hardened images, expanded model catalog (Granite 4.0 H Small, Mistral-Small-3.2-24B, Llama 3.3 70B, GPT-OSS-120B, Nemotron-3-Nano).
Speculative decoding in vLLM — 2–3× faster generation, lower cost.
Distributed inference — vLLM + llm-d across hybrid infra.
Joe Fernandes (VP/GM, Red Hat AI): “AI agents will drive inference demand growth.”
Where I'd look for adoption
Roles that are context-heavy and operationally repetitive — SRE / platform engineering, customer support engineering, FAEs, account teams.

Anyone reasoning over a constantly-changing system they don't have time to re-learn each morning. The agentic-SRE pattern shown here is the same pattern; only the context surface changes.
The value isn't the model. Red Hat doesn't need to win at frontier model training. The differentiator is the operational substrate beneath the agent — supported inference, supported routing, supported runtime, supported lifecycle. Summit this week made it concrete: RHEL AI + Red Hat Inference Server + vLLM Semantic Router + Podman/OpenShift is the production-grade version of what I'm running in my homelab.
Agentic SRE is real today
— when the context is real first.
Project context
Documented operations: failure modes, recovery scripts, conventions.
Persistent memory
A semantic store that outlives any single session and any single tool.
Encoded methodology
Skills that hold the discipline so the model doesn't have to improvise.
is the difference between
A coworker — not an intern.
I want to learn from this team. Three specific asks:
1. What gaps did I miss? If you spotted something I'm not paying attention to — in the architecture, the context layer, the discipline — I want to hear it.
2. How can I improve how I'm learning this? I'm self-teaching the agentic-runtime side. If you have a path that worked for you — talks, papers, repos, communities — send me there.
3. What tooling should I be trying to get the same experience with local models? I've benchmarked Cline, Aider, Goose, OpenHands, OpenCode. If there's something out of that frame that's working for you on local inference, I want to evaluate it.
I'd rather learn from your scars than re-discover them next quarter. Drop me a note any time.