Agentic AI for SRE — A Claude Code Demo

Claude Code · Live Session Demo

Seven words,
fully diagnosed.

Rodrique Heron · 2026-05-14

Agentic AI as my SRE coworker: what's possible today, how it's wired, and the open question of building this without Claude Code as the harness.

Setup

What's running before anything breaks

The system under test

Frigate NVR — a fleet of Reolink cameras
Host — nvr-host.lab.example.net
Detection — RTX 3060 12 GB (CUDA via CDI)
Network — macvlan VLAN 23, 192.168.23.10
Storage — 1.5 TB media LV

How it's deployed

Rootful Podman + systemd Quadlet
Ansible playbook + Quadlet template
IaC — cameras, zones, masks under source control
Recovery — bin/lab-frigate-recover + healthcheck timer
Documented — CLAUDE.md, recovery modes A and B

The trigger

I opened a Claude Code session in this homelab's repo, typed seven words, attached two screenshots, and hit send. That is the entirety of what I provided.

“the frigate system
is not in a good state”

7 words. Two screenshots. No host. No service. No logs. The next slide replays what happened over the following 12 seconds: Claude Code doing the work, not me typing commands at a shell.

Diagnosis — live replay (1 of 3)

Opening probe — use the controls to pause, scrub, or replay

▌ USER: what Rodrique typed · ▌ CLAUDE CODE: what Claude said · ⏺ Bash: Claude calling a tool · ⎿ tool result

Claude Code is doing the work, not Rodrique. The ⏺ Bash markers are Claude invoking the Bash tool: each one is a tool call decided by the model, not a command Rodrique typed. Replayed via asciinema, an open-source terminal session recorder.

The inflection — why no follow-up questions

What Claude Code already had, before the user said anything else

What the user provided

7 words
2 screenshots of broken camera tiles

No host. No service. No logs. No directives.

What Claude already had (from preloaded context)

Host: nvr-host.lab.example.net
API: http://192.168.23.10:5000
Deploy: rootful Podman + macvlan Quadlet
Recovery script: bin/lab-frigate-recover
Failure Modes A & B documented in CLAUDE.md
Storage layout: lv_frigate_config / lv_frigate_media
Playbook: setup-nvr-host-frigate.yml
5 prior Tuna memories on Frigate operational state

Diagnosis — live replay (2 of 3)

The smoking gun — cache, threads, cgroup pids

▌ CLAUDE CODE: Claude's reasoning · ⏺ Bash: Claude calling a tool · ⎿ tool result

Three Bash tool calls. Claude decides what to run, runs it, reads the output, then decides the next call. The cgroup pids.events: max 90 result is what names the root cause directly — most "out of threads" bugs surface as random Python or libc errors instead.

Root cause

Podman's silent default, never changed in the Quadlet

cgroup pids.max: 2,048. Podman's default. Frigate 0.17 + the fleet + audio + embeddings is too thread-heavy for this.

pids.current: 1,926. That is 94% of the cap. recording_manager can't spawn move-queue workers.

pids.events: max 90. That is 90 confirmed fork/clone denials since startup. The cgroup is the smoking gun.
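The three numbers above come from the cgroup v2 pids controller files (pids.max, pids.current, pids.events). As a minimal sketch of how they combine into a verdict, the helper below takes the raw file contents as strings; the function names and the returned keys are illustrative, and the actual cgroup path for the Frigate container is not shown in this deck, so reading the files is left to the caller.

```python
from typing import Optional


def parse_pids_max(text: str) -> Optional[int]:
    """pids.max holds either an integer or the literal 'max' (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_pids_events(text: str) -> dict:
    """pids.events holds 'key value' lines, e.g. 'max 90'."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def summarize(pids_max: str, pids_current: str, pids_events: str) -> dict:
    """Combine the three files into utilization and denied-fork counts."""
    limit = parse_pids_max(pids_max)
    current = int(pids_current.strip())
    denied = parse_pids_events(pids_events).get("max", 0)
    utilization = None if limit is None else current / limit
    return {
        "limit": limit,
        "current": current,
        "utilization": utilization,
        "denied_forks": denied,
    }


if __name__ == "__main__":
    # Values taken from the slide above.
    report = summarize("2048\n", "1926\n", "max 90\n")
    print(f"{report['utilization']:.0%} of cap, {report['denied_forks']} denied forks")
```

A non-zero denied-fork count is the direct signal; utilization alone only says the cgroup is close to the cap.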

The fix

One Quadlet directive. One playbook run.

playbooks/setup-nvr-host-frigate.yml · Quadlet [Container] section

  Tmpfs=/tmp:rw,size=8G
+ # cgroup pids cap. Podman default 2048 is too low.
+ PidsLimit=8192
  Environment=FRIGATE_RTSP_PASSWORD=...

ansible-playbook playbooks/setup-nvr-host-frigate.yml \
--limit nvr-host.lab.example.net -e mqtt_enabled=true -e confirm=yes

changed: nvr-host — Lay down Frigate container quadlet
changed: nvr-host — Restart macvlan network FIRST
changed: nvr-host — Restart Frigate
ok: Frigate API is up: 0.17.1-416a9b7
ok: fleet recovered, all streaming
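Because the original problem was a default nobody had set, a small check on the rendered unit text can catch the directive going missing again. The sketch below is a simplified reader for the [Container] section, not a full systemd unit parser. The function names are hypothetical; the 2,048 fallback is the Podman default quoted above.

```python
PODMAN_DEFAULT_PIDS_LIMIT = 2048  # default quoted in this deck


def read_container_keys(unit_text: str) -> dict:
    """Collect Key=Value pairs from the [Container] section of a Quadlet unit.

    Keys may repeat (for example Environment=), so values are kept as lists.
    """
    keys = {}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Container" and "=" in line:
            key, value = line.split("=", 1)
            keys.setdefault(key.strip(), []).append(value.strip())
    return keys


def pids_limit(unit_text: str) -> int:
    """Return the effective PidsLimit, falling back to the Podman default."""
    values = read_container_keys(unit_text).get("PidsLimit")
    return int(values[-1]) if values else PODMAN_DEFAULT_PIDS_LIMIT


if __name__ == "__main__":
    rendered = "[Container]\nTmpfs=/tmp:rw,size=8G\nPidsLimit=8192\n"
    print(pids_limit(rendered))
```

Wired into a playbook post-task or a CI step, a check like this would fail loudly if the limit fell back to the default.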

Recovery

~3 minutes of downtime — then:

Metric	Before	After
pids.max	2,048 (Podman default)	8,192
pids.current	1,926 (94%)	1,882 (23%)
pids.events: max	90 denied forks	0
/tmp/cache files	3,215 stuck	24 in-flight
mp4s written (2 min)	0	50
Cameras streaming	watchdog cycle	all

Diagnosis — live replay (3 of 3)

Same session, follow-through — commit, issues, memories, propagation

▌ CLAUDE CODE: narration · ⏺ Bash: shell command · ⏺ Edit: file edit · ⏺ mcp__tuna__memory_store: MCP tool call · ⎿ result

Multiple tool types, all decided by Claude. Same session: Bash for git + gh, Edit for the CLAUDE.md update, mcp__tuna__memory_store (an MCP server I built) for persisting the gotcha. Four issues filed with full context, two memories so the same trap doesn't bite again, fix propagated upstream into the shared Ansible role and the vLLM Quadlet template.

Part Two

What makes
this possible

The agentic behaviour isn't built into Claude Code by default. It comes from three layers that were built deliberately over time — and it matters who built what.

Who built what

Three sources of capability, clearly separated

Anthropic

Claude Code itself — the agent harness, the loop, tool execution.
The model — Claude Opus 4.7 in this session.
Skill / Agent / TaskCreate machinery; the superpowers plugin (e.g. systematic-debugging, brainstorming).
MCP protocol — standardized tool calls.

Open Source / Community

Ansible, Podman / Quadlet, vLLM, Qdrant, Frigate, asciinema (this replay).
The MCP server ecosystem — Gmail, Mattermost, GitHub, Chrome DevTools.
Model weights — Llama, Qwen, DeepSeek, Granite.

Rodrique (this homelab)

CLAUDE.md + AGENTS.md per project — the shared vocabulary of the infrastructure.
A persistent memory MCP server — gives any AI agent (Claude, Gemini, Codex) cross-session, cross-tool semantic memory.
The skill library: lab-* for lab-specific operations, general-* for more general-purpose workflows.
A local vLLM model platform, a multi-model orchestration experiment, and the Ansible collections that deploy it all.

Layer 1 · Project context

A strong CLAUDE.md is the single biggest unlock

Claude Code reads this file at every session start. It's the agent's operational vocabulary for the system — everything you'd otherwise re-explain on every prompt.

~/.claude/CLAUDE.md

Global

Cross-project rules: my tone, my safety boundaries, my conventions for SSH, vault use, Bash safety, env defaults.

<repo>/CLAUDE.md

Per-project

What this project is, what's safe to edit, how to deploy, how to recover. Documented failure modes belong here.

<repo>/sub/AGENTS.md

Per-directory

For complex subtrees that need their own rules — an Ansible role's local conventions, a service's deploy quirks. AGENTS.md is the cross-agent variant (also read by Codex, Cline, Gemini).
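The three levels above stack from broad to specific. The sketch below only illustrates that layering: it collects the global file, then any CLAUDE.md or AGENTS.md found between the repo root and the working directory. It is not a description of how Claude Code actually discovers or merges these files, and the function name and ordering are assumptions.

```python
from pathlib import Path


def context_files(workdir: Path, repo_root: Path, home: Path) -> list:
    """List context files from broadest (global) to most specific (subtree)."""
    repo_root = repo_root.resolve()
    workdir = workdir.resolve()

    # Global rules first.
    candidates = [home / ".claude" / "CLAUDE.md"]

    # Then every directory from the repo root down to the working directory.
    directories = [repo_root]
    current = repo_root
    for part in workdir.relative_to(repo_root).parts:
        current = current / part
        directories.append(current)

    for directory in directories:
        for name in ("CLAUDE.md", "AGENTS.md"):
            candidates.append(directory / name)

    # Keep only the files that actually exist.
    return [path for path in candidates if path.is_file()]
```

Ordering broad to specific means a subtree's AGENTS.md is read last, which matches the idea that local conventions refine the project-wide ones.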

# Excerpt from this repo's CLAUDE.md — the part Claude read before doing anything today

FRIGATE RECOVERY — three failure modes, two recovery paths:
  Mode A: systemctl reports `activating` >90s → bin/lab-frigate-recover
  Mode B: API returns 000/refused → bin/lab-frigate-recover (IPAM drift)
  Mode C: API 200, fps=3, no segments → systemctl restart (added today)
FRIGATE STATE: /opt/podman/containers/frigate/config/ · /opt/podman/containers/frigate/media/
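The excerpt above is effectively a decision table, which can be sketched as code. The Probe fields and function names below are hypothetical; the thresholds and recovery actions are the ones in the excerpt. The fps signal listed for Mode C is left out because the excerpt does not spell out how it is measured, so this sketch keys Mode C on "API answers 200 but no new segments" only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    unit_state: str              # e.g. "active" or "activating"
    seconds_in_state: float      # how long systemd has reported that state
    http_status: Optional[int]   # None when the connection is refused (curl 000)
    new_segments: int            # recording segments written in the sample window


# Recovery actions as documented in the excerpt.
RECOVERY = {
    "A": "bin/lab-frigate-recover",
    "B": "bin/lab-frigate-recover",
    "C": "systemctl restart",
}


def classify(probe: Probe) -> Optional[str]:
    """Map a probe to failure mode A, B, or C; None means no documented mode matches."""
    if probe.unit_state == "activating" and probe.seconds_in_state > 90:
        return "A"
    if probe.http_status is None:
        return "B"
    if probe.http_status == 200 and probe.new_segments == 0:
        return "C"
    return None


if __name__ == "__main__":
    mode = classify(Probe("active", 600, 200, 0))
    print(mode, RECOVERY.get(mode))
```

Mode C is the case an "API returns 200" healthcheck cannot see, which is why it needs a second signal.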

Layer 2 · Persistent memory

Tuna — a memory layer I built so AI agents stop forgetting

What it is

A small MCP server that gives any AI agent (Claude Code, Gemini CLI, Codex, Cline) persistent, semantic memory across sessions.

Memories are stored as text + tags + project. Search is vector-based: ask “Frigate failure modes” and it returns related entries ranked by similarity.

Backed by a vector DB and an embedding service running on my GPU host. Nothing about it is Anthropic-specific.
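To make the store-and-search shape concrete, here is a toy in-memory sketch. It is not Tuna's implementation: a word-count vector stands in for the real embedding service, a Python list stands in for the vector DB, and all class and method names are hypothetical. The sample memories in the usage section are made up for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Memory:
    text: str
    tags: list
    project: str
    vector: Counter = field(default_factory=Counter)


class MemoryStore:
    def __init__(self):
        self._items = []

    def store(self, text: str, tags: list, project: str) -> None:
        self._items.append(Memory(text, tags, project, embed(text)))

    def search(self, query: str, project=None, top_k: int = 3) -> list:
        """Return up to top_k (score, memory) pairs, best match first."""
        query_vector = embed(query)
        scored = [
            (cosine(query_vector, m.vector), m)
            for m in self._items
            if project is None or m.project == project
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    store = MemoryStore()
    store.store("frigate failure modes and recovery notes", ["frigate"], "homelab")
    store.store("ansible role naming conventions", ["ansible"], "homelab")
    for score, memory in store.search("Frigate failure modes"):
        print(f"{score:.2f} {memory.text}")
```

The part that matters for cross-tool use is the interface, store and search behind an MCP server, rather than any particular embedding model.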

Why I built it

Claude Code's built-in memory is per-tool — Gemini CLI and Codex can't see it. Each agent forgets across sessions and across tools.

With Tuna, a debugging insight stored from a Claude session is found later by Gemini or Codex. Memories outlive the session and span the toolchain.

Auto-injected at session start (the 5 memories you saw earlier). Searchable mid-session. Written at end of session.

Layer 3 · Encoded discipline

Skills — methodology and workflows, not improvisation

From Anthropic

superpowers

The official skills plugin. systematic-debugging, brainstorming, writing-plans, test-driven-development. Generic, project-agnostic.

Today's session was held to systematic-debugging rules — that's why no fix was attempted before the evidence was on the table.

From me

lab-* · general-*

lab-* are lab-specific: lab-deploy-container, lab-run-infra-milestone, lab-frigate-recover. They codify how things get done in this homelab.

general-* are general-purpose workflows that aren't tied to my lab at all — general-capture-knowledge is the one that wrote up this session into memory, docs, and issues.

Anthropic ships the skill machinery. I write the skills that match the work I actually do.

The result

What 7 words produced

+ 7 words: user input
+ CLAUDE.md: failure modes, recovery scripts, storage layout
+ 5 Tuna memories: prior sessions of Frigate-specific operational knowledge
+ superpowers + my skills: discipline (systematic-debugging) + workflow (capture-knowledge)
+ Claude Opus 4.7: the model that ties it together
= Root cause found in 45 min. 3-min downtime. Zero follow-up questions.
9 artifacts produced: 1 commit, 4 issues, 2 Tuna memories, 1 CLAUDE.md update, 1 session doc.

Why the session ends the way it does

Issues + CLAUDE.md updates aren't ceremony — they're the compounding effect

Without the wrap-up

Every session starts cold. The same diagnosis runs again next time.

Follow-ups discovered mid-session (the thread leak, the latent MQTT bug, the healthcheck blind spot) evaporate the moment the context window resets.

Three months later, the same trap bites again. Nobody remembers it was diagnosed already.

With the wrap-up

4 GitHub issues — follow-up work captured with full context, ready for any future session (mine or another agent's).

CLAUDE.md updated — the new Mode C signature is now part of the on-rails recovery playbook. The next session reads it at start.

2 persistent memories — the cgroup trap is now searchable from any AI tool. Claude, Gemini, Codex can all find it next time.

The next session begins smarter than this one did.

Each session is a deposit, not a one-shot. The context layer compounds — and the agent that reads it gets sharper with every incident, not flatter.

Part Three · The Open Question

What would it take to
build this without
Claude Code?

The harness is Anthropic's. The model is Anthropic's. If I want this same agentic behaviour on local infrastructure I own end-to-end, I have to recreate the harness myself — and the constraints aren't where I expected them.

Limit 1 · The model

What fits on my GPUs — and what doesn't

A note on the lab
I've always had a lab. The hardware here is years of acquisition — not a budget request. I keep a lab because it's how I actually learn: if I can't touch it, I never really learn it.

Everything is served via vLLM in Podman Quadlets on RHEL, across two GPU hosts: gpu-host-a (RTX 6000 Ada, 48 GB) and gpu-host-b (24 GB). Models I've tried:

Model	Size / Quant	VRAM	Verdict in my lab
Llama 3.3 70B Instruct	INT4 / AWQ	~35 GB	Runs, but ctx capped at 2,048 tokens (KV cache starved)
Llama 3.1 70B	AWQ	~37 GB	5× slower inference, 90% GPU util — barely usable for agents
Qwen 2.5 72B Instruct	AWQ	~36 GB	Same envelope as Llama 70B
Qwen 2.5 Coder 32B	AWQ	~16 GB	Coding workhorse. Tool-calling reliable.
DeepSeek-R1 Distill (Qwen 32B)	AWQ	~16 GB	Best reasoning under 70B. Slow but thorough.
Gemma 4 26B (A4B)	AWQ-4bit (custom)	~14 GB	Strong on structured-extraction evals. Custom vLLM build for gemma4 parser.
Gemma 4 e4b	native	~10 GB	Lightweight worker option.
Granite 4.0 H-Small (FP8)	FP8	~14 GB	Red Hat / IBM model. Production-leaning.
Granite 3.3 8B Instruct	FP16	~16 GB	Fast supervisor role. Limited reasoning depth.
Qwen 2.5 Coder 7B	FP16	~14 GB	Fast worker for known patterns.
Mixtral 8x22B	AWQ	~70 GB	Doesn't fit. OOM on single 48 GB GPU.

The constraint isn't quality — it's size. Even Llama 70B INT4 has 0.82 GB left for KV cache after weights, so it gets a 2,048-token context. That's roughly enough to read CLAUDE.md, not enough to debug Frigate.
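The 0.82 GB figure can be turned into a token budget with a little arithmetic. The sketch below assumes the published Llama 3 70B attention shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and an fp16 KV cache; those values are not stated in this deck, and it ignores vLLM's block allocation overhead, so treat the result as a rough ceiling.

```python
# Assumed model shape (published Llama 3 70B config, not taken from this deck).
NUM_LAYERS = 80
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # fp16 KV cache


def kv_bytes_per_token(
    layers: int = NUM_LAYERS,
    kv_heads: int = NUM_KV_HEADS,
    head_dim: int = HEAD_DIM,
    bytes_per_value: int = BYTES_PER_VALUE,
) -> int:
    """One key vector and one value vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def tokens_that_fit(free_bytes: int, bytes_per_token: int) -> int:
    """Whole tokens of KV cache that fit in the free memory."""
    return free_bytes // bytes_per_token


if __name__ == "__main__":
    per_token = kv_bytes_per_token()
    free = 820_000_000  # 0.82 GB from the text, read as 10^9 bytes
    print(per_token, tokens_that_fit(free, per_token))
```

Under these assumptions each token costs 327,680 bytes of cache, so 0.82 GB holds roughly 2,500 tokens: enough for a 2,048-token window, not enough for 4,096.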

Limit 2 · The harness, not the model

Multiple small models can compensate — if something orchestrates them

The compensation pattern

The work of one large model can be split across three smaller models, each with a role:

Supervisor — Granite 8B (planning, tool selection)
Worker — Llama 70B INT4 (deep reasoning when needed)
Coder — Qwen 2.5 Coder 7B (fast code generation)

Each fits on a single GPU, though the 70B worker only with the 2,048-token context noted earlier. Together they cover the work envelope of one frontier model.

What's hard about running them together

Load/unload time — ~2.5 minutes to swap a 32B model on a 48 GB GPU. Agent loops are stalled while this happens.

Context handoff — supervisor's plan has to travel to the worker; worker's output back to supervisor; coder's diffs back to all. None of this is automatic.

Tool calling — each model needs its own prompt template, system prompt, tool schema, retry policy.

The role logic itself — when does the supervisor delegate vs. handle? When does the worker escalate? That's policy, and policy is the harness.
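The load/unload cost above is easy to underestimate, so here is a small sketch of the arithmetic. It assumes a single GPU that holds one model at a time and that every swap costs the same ~2.5 minutes quoted for a 32B model; both are simplifications, and the function names are hypothetical.

```python
SWAP_SECONDS = 150  # ~2.5 minutes per swap, from the text above


def count_swaps(sequence: list, loaded=None) -> int:
    """Count model loads needed to serve the role sequence in order."""
    swaps = 0
    for model in sequence:
        if model != loaded:
            swaps += 1
            loaded = model
    return swaps


def stall_seconds(sequence: list, loaded=None, swap_seconds: int = SWAP_SECONDS) -> int:
    """Total time the agent loop waits on model swaps."""
    return count_swaps(sequence, loaded) * swap_seconds


if __name__ == "__main__":
    interleaved = ["supervisor", "worker", "supervisor", "coder", "supervisor"]
    batched = ["supervisor", "supervisor", "supervisor", "worker", "coder"]
    print(stall_seconds(interleaved), stall_seconds(batched))
```

The interleaved plan stalls for 750 seconds and the batched one for 450, which is why the delegation policy has to reason about what is already loaded. Batching is only possible when the steps do not depend on each other's output.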

The unlock isn't a bigger model. It's the orchestrator that makes small models behave like a coherent agent. I've prototyped this orchestration pattern myself with mixed results — the policy logic is the hard part, not the model serving.

Limit 3 · The harness

Where I've landed — from a real local-harness eval

April 2026: I ran 7 benchmark suites comparing local coding harnesses, each driving qwen3-coder-30b through my LiteLLM gateway, using real engineering-shaped tasks. Conclusion: no single harness wins everywhere, so route by action class.

Action class	Default harness	Status in my lab
Agentic / supervised implementation	Cline CLI	Settled default. Off-the-shelf, MCP-aware.
Deterministic patch / known fix	Aider	Settled default for constrained edits.
General agent runtime	Goose	Tried. Did not earn default status.
Sandboxed multi-file	OpenHands, OpenCode	Tried. Same outcome.
Local-build candidates	claw-code, CheetahClaws	Not benchmarked end-to-end (toolchain prereqs).

The hard problem isn't picking a harness. It's the supervisor logic on top — the policy that picks which harness for which task, threads context between calls, and owns the validation. That's what I'm still building.
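The settled rows of the table reduce to a dispatch map, sketched below. The action-class keys and the function name are hypothetical labels; only the two harnesses marked as settled defaults are mapped, and everything else is deliberately left without a default. Threading context and owning validation, the hard parts named above, are not modelled here.

```python
from typing import Optional

# Only the rows marked "Settled default" in the table above.
DEFAULT_HARNESS = {
    "agentic_implementation": "Cline CLI",
    "deterministic_patch": "Aider",
}


def pick_harness(action_class: str) -> Optional[str]:
    """Return the settled default harness, or None when no default has been earned."""
    return DEFAULT_HARNESS.get(action_class)


if __name__ == "__main__":
    for action in ("deterministic_patch", "sandboxed_multi_file"):
        print(action, "->", pick_harness(action))
```

Returning None rather than guessing keeps the "no default yet" cases visible to whatever supervisor sits on top.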

Manager opinion — informed by Red Hat Summit 2026 this week

Red Hat is shipping the pattern I've been prototyping

Headline finding · vLLM Semantic Router (Athena release, March 2026)

A lightweight classifier (ModernBERT) routes each request by intent and complexity. Simple queries go to a small or local model; reasoning-heavy ones go to the larger one. OpenAI-compatible API — agents see no change.

This is the same architectural pattern as the multi-model orchestration I was prototyping. The router picks the right model; my harness picks the right worker. They compose. I was building this from scratch; Red Hat already shipped it.

~90% token-cost reduction via routing common requests locally
86% of demo requests stayed on the free/local model
−48% tokens & −47% latency, Qwen3 30B benchmark
+10% accuracy on complex tasks (auto-reasoning routing)

Other Summit 2026 announcements that matter

Red Hat AI 3.4 — Model-as-a-Service with governed interfaces, hardened images, expanded model catalog (Granite 4.0 H Small, Mistral-Small-3.2-24B, Llama 3.3 70B, GPT-OSS-120B, Nemotron-3-Nano).
Speculative decoding in vLLM — 2–3× faster generation, lower cost.
Distributed inference — vLLM + llm-d across hybrid infra.
Joe Fernandes (VP/GM, Red Hat AI): “AI agents will drive inference demand growth.”

Where I'd look for adoption

Roles that are context-heavy and operationally repetitive — SRE / platform engineering, customer support engineering, FAEs, account teams.

Anyone reasoning over a constantly-changing system they don't have time to re-learn each morning. The agentic-SRE pattern shown here is the same pattern; only the context surface changes.

The value isn't the model. Red Hat doesn't need to win at frontier model training. The differentiator is the operational substrate beneath the agent — supported inference, supported routing, supported runtime, supported lifecycle. Summit this week made it concrete: RHEL AI + Red Hat Inference Server + vLLM Semantic Router + Podman/OpenShift is the production-grade version of what I'm running in my homelab.

Closing

Agentic SRE is real today
— when the context is real first.

Project context

Documented operations: failure modes, recovery scripts, conventions.

Persistent memory

A semantic store that outlives any single session and any single tool.

Encoded methodology

Skills that hold the discipline so the model doesn't have to improvise.

is the difference between

A coworker — not an intern.

I want to learn from this team. Three specific asks:

1. What gaps did I miss? If you spotted something I'm not paying attention to (in the architecture, the context layer, the discipline), I want to hear it.
2. How can I improve how I'm learning this? I'm self-teaching the agentic-runtime side. If you have a path that worked for you (talks, papers, repos, communities), send me there.
3. What tooling should I be trying to get the same experience with local models? I've benchmarked Cline, Aider, Goose, OpenHands, OpenCode. If there's something out of that frame that's working for you on local inference, I want to evaluate it.

I'd rather learn from your scars than re-discover them next quarter. Drop me a note any time.