Case study
The 33-hour wipe: a CSI driver reformatted a live LUN
A routine node drain ended with 33 hours of production telemetry gone: a democratic-csi driver reformatted a live Pure LUN out from under VictoriaMetrics, triggered by a dm-multipath path-group race. Here is the incident, and the exact emisar loop that contains it: investigate through declared actions, stop the bleed behind one approval, then write the durable fix to Terraform. No raw SSH, every step on the record.
The stack
Five Dell R640s in VA1 · Pure FlashArray //M50 over iSCSI multipath · Nomad 2.0.2 · democratic-csi node-manual. VictoriaMetrics owns a 750 GB LUN, VictoriaLogs a 100 GB LUN; both are single-node-writer, both holding 33 hours of Apex / Fortnite / Tarkov / Deadlock backend telemetry. The investigation below runs entirely through emisar: the stock debugging and docker packs, plus a small storage + Nomad pack we authored for this fleet. emisar is only as capable as the actions you declare, and that is the point.
T+0 — the drain
Drain nomad-hvn01, reboot, restore: routine. VictoriaMetrics and VictoriaLogs migrated to hvn03 and came up green in about 30 seconds. Ingest resumed. Dull and normal.
T+50m — 33 hours, gone
Grafana on the last 7 days: almost everything flat, only the last 12 minutes of metrics alive. A sharp cliff where the history just ended. Same for logs. The processes were healthy and ingesting — the data was missing because the data was not there. An agent picks up the page.
1 · Investigate — through emisar, not SSH
The agent works down the layers with declared pack actions, each scoped to nomad-hvn03 and logged with a reason. No shell, no standing SSH key.
# Claude, over MCP → emisar. Each call is a declared, scoped, logged action.

storage.csi_volume_ls {"volume": "vm-data"}
→ total 24
  drwx------ 2 root root 16384 Jun 4 13:11 lost+found
# a freshly-formatted ext4, born 30 minutes ago. VM's history isn't here.

debugging.dmesg_tail {"lines": 400}
→ …
  13:11:03 device-mapper: multipath 254:3: queue_if_no_path enabled
  13:11:09 EXT4-fs (dm-3): mounted filesystem … clean
# the two lines that matter in the tail: six seconds with no active
# path group. That's the window.

docker.logs {"container": "democratic-csi", "lines": 2000}
→ …
  GetDiskFormat /dev/mapper/3624…265c → blkid output="" (empty)
  Disk appears unformatted; running mkfs.ext4 -F /dev/mapper/3624…
  Disk successfully formatted (mkfs)
Diagnosis: when the multipath device was re-probed on migration, every path came up enabled (queue-ready) but no group was promoted active. blkid was the first I/O: it opened the device, the read sat in the queue, and it timed out empty. The driver read "empty" as "unformatted" and ran mkfs.ext4 -F over a live LUN, then mounted the fresh filesystem for VictoriaMetrics to write to. This is kubernetes/kubernetes#95183, a whole bug class, confirmed against Azure Disk, NetApp Trident, Longhorn, and OpenEBS. Switching CSI drivers would not fix it; the guard has to live at the wrapper layer.
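The diagnosis comes down to one bad inference: an empty probe result was treated as proof of a blank disk. Here is a minimal Python sketch of what a wrapper-layer guard could look like. It is not democratic-csi's code; `ProbeResult`, `decide`, and the `operator_confirmed_new` flag are hypothetical names chosen for illustration. The assumption is that the wrapper can learn three things before staging: the reported signature, whether a real read completed, and how many path groups are active.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    MOUNT_EXISTING = "mount_existing"
    FORMAT = "format"
    REFUSE = "refuse"


@dataclass(frozen=True)
class ProbeResult:
    """What the wrapper knows about a device before staging (hypothetical shape)."""
    signature: str           # filesystem type reported by the probe, "" if none
    read_succeeded: bool     # did a real read complete, rather than time out?
    active_path_groups: int  # multipath path groups currently status=active


def decide(probe: ProbeResult, operator_confirmed_new: bool = False) -> Decision:
    """Never infer 'blank disk' from an empty probe alone."""
    if probe.signature:
        # A filesystem is already there: mount it, never format it.
        return Decision.MOUNT_EXISTING
    if not probe.read_succeeded or probe.active_path_groups == 0:
        # "Empty" is untrustworthy: the read may simply have sat in the queue.
        return Decision.REFUSE
    if operator_confirmed_new:
        # Only a human-declared new volume may be formatted.
        return Decision.FORMAT
    # Readable and blank, but nobody said it is new: refuse loudly.
    return Decision.REFUSE
```

With the values from this incident (empty signature, timed-out read, zero active groups), `decide` returns `Decision.REFUSE`, so the mount fails visibly and nothing is formatted.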
2 · Stop the bleed — one approval
VictoriaMetrics is already writing fresh parts to the empty filesystem — every group-commit overwrites blocks that still hold the old data. The move is to halt it, and halting a live alloc is destructive, so policy holds it for a human.
nomad.alloc_stop {"alloc": "a1b2c3d4", "reason":
"CSI reformatted a live LUN — stop writes to preserve recoverable blocks"}
⏸ pending approval — nomad.alloc_stop is high-risk; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ alloc stopped · writes halted · LUN frozen for forensics
Caught in the first minute, that freeze preserves the LUN for recovery. Here a human took an hour to notice, the ext4 journal had wrapped, and the old blocks were already reused — so we accepted the 33-hour gap (game-side state was untouched; only telemetry was lost). But the bleed stopped on one approval, and the audit trail shows exactly who authorized the only destructive action and when.
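The hold-for-approval pattern described here can be sketched in a few lines of Python. This is an illustration of the pattern only, not emisar's implementation or API: `ApprovalGate`, `PendingApproval`, and `AuditEvent` are hypothetical names. High-risk actions are held until a human grants them, each grant is consumed by a single run, and every executed action is recorded with its reason.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


class PendingApproval(Exception):
    """Raised when a high-risk action has no unused human approval."""


@dataclass(frozen=True)
class AuditEvent:
    action: str
    reason: str
    approved_by: Optional[str]  # None for actions that needed no approval
    at: datetime


@dataclass
class ApprovalGate:
    high_risk: frozenset
    grants: dict = field(default_factory=dict)     # action -> approver
    audit_log: list = field(default_factory=list)

    def approve(self, action: str, approver: str) -> None:
        """A human grants exactly one future run of this action."""
        self.grants[action] = approver

    def run(self, action: str, reason: str, execute: Callable[[], str]) -> str:
        approver = None
        if action in self.high_risk:
            # pop() consumes the grant, so an approval is good for one use.
            approver = self.grants.pop(action, None)
            if approver is None:
                raise PendingApproval(f"{action} is high-risk; held for a human")
        result = execute()
        self.audit_log.append(
            AuditEvent(action, reason, approver, datetime.now(timezone.utc))
        )
        return result
```

The design choice worth noting is that the grant is consumed before the action runs: a second `nomad.alloc_stop` needs a second approval, and the audit log names the approver of each one.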
3 · Codify the fix — a Terraform PR the agent wrote
The fix can't live on the host, because the next drain would undo it. So the agent writes it where it belongs: a pull request against the infra repo, prepared locally, for a human to review and merge. Three layers, each closing the wipe path at a different point.
# PR: "csi: never auto-format on blkid-empty (the 33h wipe)"

# democratic-csi-node.nomad.hcl: the root cause, in one line
  node {
    format {
+     disabled = true   # never mkfs a "blank" device; refuse loudly
    }
  }

- csi_volume_claim_gc_interval = "5m"
+ csi_volume_claim_gc_interval = "1m"   # stuck-claim recovery: 10m → 6m worst case
+ kill_timeout = "60s"                  # let NodeUnstageVolume flush on drain

# new files in the same PR:
#   multipath-watchdog.nomad.hcl: 30s dd-kick promotes stuck path groups active
#   format-new-volume.sh: mkfs.xfs (refuses overwrite without -f) for new LUNs
#   alerts/multipath.yaml: page if a path group sits status=enabled > 2 min
The one line that matters is format { disabled = true }. The
watchdog kicks a stuck path group active in microseconds with a single 4 KiB direct
read; XFS-by-default refuses to overwrite an existing signature; the alert pages when
the watchdog itself fails — which is exactly when you want to know.
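The alert condition (no path group active for more than two minutes) can be sketched in Python as a small parser plus a timer. The contents of alerts/multipath.yaml are not shown in this write-up, so this is an assumption-laden illustration, not that file's logic: `StuckGroupAlert` and its helpers are hypothetical, the sample text follows the usual `multipath -ll` layout with a placeholder WWID, and the input is assumed to cover a single map.

```python
import re

# Path-group lines carry "status=<state>"; individual path lines do not,
# so the word "active" on a path line is deliberately ignored.
_GROUP_STATUS = re.compile(r"\bstatus=(\w+)")

# Illustrative sample only (placeholder WWID, assumed layout).
SAMPLE_STUCK = """\
mpatha (WWID-PLACEHOLDER) dm-3 PURE,FlashArray
size=750G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
  |- 1:0:0:1 sdb 8:16 active ready running
  `- 2:0:0:1 sdc 8:32 active ready running
"""


def path_group_statuses(multipath_ll: str) -> list:
    """Return the status of every path group, in order of appearance."""
    return _GROUP_STATUS.findall(multipath_ll)


def has_active_group(multipath_ll: str) -> bool:
    return "active" in path_group_statuses(multipath_ll)


class StuckGroupAlert:
    """Fires once no path group has been active for longer than threshold_s."""

    def __init__(self, threshold_s: float = 120.0):
        self.threshold_s = threshold_s
        self._stuck_since = None  # time of the first observation without an active group

    def observe(self, multipath_ll: str, now_s: float) -> bool:
        if has_active_group(multipath_ll):
            self._stuck_since = None  # healthy again: reset the timer
            return False
        if self._stuck_since is None:
            self._stuck_since = now_s
        return now_s - self._stuck_since > self.threshold_s
```

Feeding `observe` one sample per scrape keeps the check stateless about the device itself. If the watchdog promotes a group within the window, the timer resets and nobody is paged.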
What emisar actually changed
- The forensics were legible and scoped. Every dmesg, multipath, and CSI-log read was a declared action against one host, logged with a reason, not a tailscale ssh root@… scramble across five tools with no record of who looked at what.
- The one destructive step stopped for a person. Halting the alloc was gated, approved once, and recorded; the agent could contain the damage without being trusted to run arbitrary commands.
- The cure landed as reviewable Terraform. The permanent fix is a diff a human merged, not a command that lived for ten minutes in someone's shell history and got lost.
Honest note: emisar would not have stopped democratic-csi's mkfs. That was an automated component doing its job badly. What emisar changes is everything a human or agent does around the failure: the agent investigates through tools, stops the disaster where it can, then hands a Terraform path back to you. Imperative containment, declarative cure.