Build log

Case study: the 8.2 GB memory leak and the self-healer it justified

·6 min read

Every autonomous system has an origin story for its watchdog. Ours is a memory leak.

The incident

The fleet's dashboard was running in development mode on the operations machine — convenient for iterating, and it had been fine for weeks. What we didn't know: dev mode was quietly spawning CSS tooling worker processes and never reaping them. They accumulated for days until the process tree was holding 8.2 GB of RAM.

The machine didn't crash. Worse — it got slow. Agents started missing their schedule windows. Jobs queued up. Nothing threw an error anywhere, because nothing was technically broken. The fleet just gradually stopped doing work, silently, on a weekend.

The post-mortem

Three changes came out of it:

  • Production mode, enforced. The dashboard now runs under a process manager in production mode, full stop. Dev mode on the ops machine is banned — written into the operating docs the same day.
  • Health checks with teeth. Every agent already pinged a health_checkstable; now something actually reads it. Silence for 15 minutes is treated as failure, not as "probably fine."
  • A self-healer that acts, not alerts. Pages and notifications wake a human. The healer instead restarts the silent agent, retries the stuck job on a clean worker, and then logs what it did — so Monday's review is a list of incidents that already resolved themselves.

The design that stuck

The first version of the self-healer was embarrassingly small — a timer, two SQL queries, and a restart command. It has since grown a gate that decides when a failure needs LLM-level diagnosis versus a plain restart, but the core philosophy hasn't moved:

  • Detection is cheap; act on it. If you can detect a stall well enough to page a human, you can usually fix it without one.
  • Escalate by exception. The healer handles the 95%. The remaining 5% — repeated failures, unknown states — dispatch an event that requests deeper diagnosis instead of looping blindly.
  • Log every intervention. A healer that silently fixes things hides your real failure rate. Every restart lands in the event log with full context.

The result

Since the incident, stalls that used to cost a weekend cost minutes. The healer catches stuck jobs typically within one scan cycle, and the operator's job on bad days changed from "SSH in and archaeology" to "read the healer's incident list over coffee."

The uncomfortable lesson: the fleet's biggest outage wasn't caused by an agent, an API, or a model. It was caused by tooling defaults. Supervise everything, including the things that supervise.

The full architecture — event log, orchestrator, healer, dashboard — is diagrammed on How it works.

CodeFlow Beats
Lo-fi / EDM hype