Case study

The 33-hour wipe: a CSI driver reformatted a live LUN

A routine node drain ended with 33 hours of production telemetry gone — a democratic-csi driver reformatted a live Pure LUN out from under VictoriaMetrics, triggered by a dm-multipath path-group race. Here is the incident, and the exact emisar loop that contains it: investigate through declared actions, stop the bleed behind one approval, then write the durable fix to Terraform — no raw SSH, every step on the record.

The stack

Five Dell R640s in VA1 · Pure FlashArray //M50 over iSCSI multipath · Nomad 2.0.2 · democratic-csi node-manual. VictoriaMetrics owns a 750 GB LUN, VictoriaLogs a 100 GB LUN — both single-node-writer, both holding 33 hours of Apex / Fortnite / Tarkov / Deadlock backend telemetry. The investigation below runs entirely through emisar: the stock debugging and docker packs, plus a small storage + Nomad pack we authored for this fleet. emisar is only as capable as the actions you declare — that is the point.

T+0 — the drain

Drain nomad-hvn01, reboot, restore — routine. VictoriaMetrics and VictoriaLogs migrated to hvn03 and came up green in about 30 seconds. Ingest resumed. Dull and normal.

T+50m — 33 hours, gone

Grafana on the last 7 days: almost everything flat, only the last 12 minutes of metrics alive. A sharp cliff where the history just ended. Same for logs. The processes were healthy and ingesting; the history was missing because it was no longer on disk. An agent picks up the page.

1 · Investigate — through emisar, not SSH

The agent works down the layers with declared pack actions, each scoped to nomad-hvn03 and logged with a reason. No shell, no standing SSH key.

# Claude, over MCP → emisar. Each call is a declared, scoped, logged action.

storage.csi_volume_ls {"volume": "vm-data"}
→ total 24
  drwx------ 2 root root 16384 Jun  4 13:11 lost+found
  # a freshly-formatted ext4, born 30 minutes ago. VM's history isn't here.

debugging.dmesg_tail {"lines": 400}
→ … 13:11:03  device-mapper: multipath 254:3: queue_if_no_path enabled
  13:11:09  EXT4-fs (dm-3): mounted filesystem … clean
  # the two lines that matter in the tail: six seconds with no active
  # path group — that's the window.

docker.logs {"container": "democratic-csi", "lines": 2000}
→ … GetDiskFormat /dev/mapper/3624…265c → blkid output="" (empty)
  Disk appears unformatted; running mkfs.ext4 -F /dev/mapper/3624…
  Disk successfully formatted (mkfs)

Diagnosis: when the multipath device was re-probed on migration, every path came up enabled (queue-ready) but no group was promoted active. blkid was the first I/O — it opened the device, the read sat in the queue, and it timed out empty. The driver read "empty" as "unformatted" and ran mkfs.ext4 -F over a live LUN, then mounted the fresh filesystem for VictoriaMetrics to write to. It is kubernetes/kubernetes#95183 — a whole bug class, confirmed against Azure Disk, NetApp Trident, Longhorn, and OpenEBS. Switching CSI drivers would not fix it; the guard has to live at the wrapper layer.
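
The failure in the diagnosis above is one of inference: an empty probe result was treated as proof of a blank disk. Below is a minimal sketch of what a wrapper-layer guard could decide instead. It assumes a hypothetical `Probe` record filled in by whatever pre-mount checks the wrapper runs; none of these names come from democratic-csi. The rule is that a device may only be formatted when it was positively readable, had an active path group, and was explicitly declared new.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    MOUNT = "mount"    # existing filesystem found: mount it as-is
    FORMAT = "format"  # verified blank and explicitly declared new
    REFUSE = "refuse"  # anything uncertain: fail the stage call loudly


@dataclass(frozen=True)
class Probe:
    """Hypothetical summary of pre-mount checks; field names are illustrative."""
    signature: Optional[str]     # filesystem type the probe found, or None
    probe_completed: bool        # probe finished without timeout or I/O error
    active_path_group: bool      # multipath reports at least one active group
    first_block_readable: bool   # a direct read of the first block succeeded
    declared_new: bool           # an operator explicitly marked the volume new


def decide(probe: Probe) -> Decision:
    # An empty answer from a device that could not be read proves nothing.
    healthy = (probe.probe_completed
               and probe.active_path_group
               and probe.first_block_readable)
    if not healthy:
        return Decision.REFUSE
    if probe.signature:
        return Decision.MOUNT
    # Readable with no signature: format only if someone said it is new.
    return Decision.FORMAT if probe.declared_new else Decision.REFUSE


# The incident's state: the probe timed out empty with no active path group.
incident = Probe(signature=None, probe_completed=False,
                 active_path_group=False, first_block_readable=False,
                 declared_new=False)
print(decide(incident).value)  # prints "refuse"
```

The design choice is that "no signature" and "could not tell" are different answers, and only the first one, backed by an explicit operator flag, ever reaches FORMAT.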

2 · Stop the bleed — one approval

VictoriaMetrics is already writing fresh parts to the empty filesystem — every group-commit overwrites blocks that still hold the old data. The move is to halt it, and halting a live alloc is destructive, so policy holds it for a human.

nomad.alloc_stop {"alloc": "a1b2c3d4", "reason":
  "CSI reformatted a live LUN — stop writes to preserve recoverable blocks"}
⏸ pending approval — nomad.alloc_stop is high-risk; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ alloc stopped · writes halted · LUN frozen for forensics

Caught in the first minute, that freeze preserves the LUN for recovery. Here it took a human nearly an hour to notice, the ext4 journal had wrapped, and the old blocks were already reused, so we accepted the 33-hour gap (game-side state was untouched; only telemetry was lost). But the bleed stopped on one approval, and the audit trail shows exactly who authorized the only destructive action and when.
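
The gate described in this step (declared actions only, high-risk ones held for a single-use human approval, everything recorded) can be sketched in a few lines. This is an illustration of the pattern, not emisar's real API: `Gate`, `approve`, `call`, and the audit record fields are all hypothetical, and the handler is a stand-in that does not talk to Nomad.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

Key = Tuple[str, Tuple[Tuple[str, str], ...]]


def _key(name: str, args: Dict[str, str]) -> Key:
    # An approval is bound to one action name and one exact argument set.
    return name, tuple(sorted(args.items()))


@dataclass
class Gate:
    """Hypothetical approval gate; names and fields are illustrative only."""
    handlers: Dict[str, Callable[[Dict[str, str]], str]]
    high_risk: Set[str]
    audit: List[dict] = field(default_factory=list)
    approvals: Dict[Key, str] = field(default_factory=dict)

    def approve(self, name: str, args: Dict[str, str], approver: str) -> None:
        self.approvals[_key(name, args)] = approver

    def call(self, name: str, args: Dict[str, str], reason: str) -> str:
        if name not in self.handlers:
            raise PermissionError(f"undeclared action: {name}")
        approver: Optional[str] = None
        if name in self.high_risk:
            # pop() makes the approval single-use.
            approver = self.approvals.pop(_key(name, args), None)
            if approver is None:
                self.audit.append({"action": name, "reason": reason,
                                   "status": "pending"})
                return "pending approval"
        result = self.handlers[name](args)
        self.audit.append({"action": name, "reason": reason,
                           "status": "ran", "approved_by": approver})
        return result


gate = Gate(handlers={"nomad.alloc_stop": lambda a: f"stopped {a['alloc']}"},
            high_risk={"nomad.alloc_stop"})
args = {"alloc": "a1b2c3d4"}
print(gate.call("nomad.alloc_stop", args, "stop writes"))  # pending approval
gate.approve("nomad.alloc_stop", args, "oncall")
print(gate.call("nomad.alloc_stop", args, "stop writes"))  # stopped a1b2c3d4
print(gate.call("nomad.alloc_stop", args, "stop writes"))  # pending approval
```

Binding the approval to the exact arguments means an approval to stop one alloc cannot be replayed against another, and the audit list answers "who authorized what, and why" without reading shell history.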

3 · Codify the fix — a Terraform PR the agent wrote

The fix can't live on the host, because the next drain would undo it. So the agent writes it where it belongs: a pull request against the infra repo, prepared locally for a human to review and merge. Three layers, each closing the wipe path at a different point.

# PR: "csi: never auto-format on blkid-empty (the 33h wipe)"
# democratic-csi-node.nomad.hcl (the root-cause fix, in one line):
  node {
    format {
+     disabled = true        # never mkfs a "blank" device — refuse loudly
    }
  }

- csi_volume_claim_gc_interval = "5m"
+ csi_volume_claim_gc_interval = "1m"   # stuck-claim recovery: 10m → 6m worst case
+ kill_timeout                 = "60s"  # let NodeUnstageVolume flush on drain

# new files in the same PR:
#   multipath-watchdog.nomad.hcl  — 30s dd-kick promotes stuck path groups active
#   format-new-volume.sh          — mkfs.xfs (refuses overwrite without -f) for new LUNs
#   alerts/multipath.yaml         — page if a path group sits status=enabled > 2 min

The one line that matters is format { disabled = true }. Every 30 seconds the watchdog issues a single 4 KiB direct read, which kicks a stuck path group active in microseconds; mkfs.xfs refuses to overwrite an existing signature unless forced with -f; the alert pages when the watchdog itself fails, which is exactly when you want to know.
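
The detection half of the watchdog and the alert share one question: does any multipath map have path groups but none with status=active? Here is a minimal sketch of that check, assuming the usual `multipath -ll` text layout. The map aliases and WWID placeholders in the sample are made up, and `group_states` and `stuck_maps` are hypothetical helper names, not part of the PR.

```python
import re
from typing import Dict, List

# Map header, e.g. "mpatha (WWID) dm-3 VENDOR,PRODUCT"
MAP_RE = re.compile(r"^(\S+) .*\bdm-\d+\b")
# Path-group line, e.g. "|-+- policy='service-time 0' prio=50 status=enabled"
GROUP_RE = re.compile(r"policy='[^']*'.*\bstatus=(\w+)")


def group_states(multipath_ll: str) -> Dict[str, List[str]]:
    """Return {map name: [path-group status, ...]} from `multipath -ll` text."""
    states: Dict[str, List[str]] = {}
    current = None
    for line in multipath_ll.splitlines():
        header = MAP_RE.match(line)
        if header:
            current = header.group(1)
            states[current] = []
            continue
        group = GROUP_RE.search(line)
        if group and current is not None:
            states[current].append(group.group(1))
    return states


def stuck_maps(multipath_ll: str) -> List[str]:
    """Maps that have path groups but none promoted to active."""
    return [name for name, groups in group_states(multipath_ll).items()
            if groups and "active" not in groups]


# Made-up sample: mpatha is in the incident's state, mpathb is healthy.
SAMPLE = """\
mpatha (WWID-A) dm-3 PURE,FlashArray
size=750G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=enabled
| `- 1:0:0:1 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 2:0:0:1 sdc 8:32 active ready running
mpathb (WWID-B) dm-4 PURE,FlashArray
size=100G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  `- 3:0:0:1 sdd 8:48 active ready running
"""
print(stuck_maps(SAMPLE))  # ['mpatha']
```

The kick itself (the 4 KiB direct read) is left out because it needs a real device; a watchdog built on this check would issue it for each name `stuck_maps` returns, and the alert would fire if the same name keeps coming back for more than two minutes.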

What emisar actually changed

  • The forensics were legible and scoped. Every dmesg, multipath, and CSI-log read was a declared action against one host, logged with a reason — not a tailscale ssh root@… scramble across five tools with no record of who looked at what.
  • The one destructive step stopped for a person. Halting the alloc was gated, approved once, and recorded — the agent could contain the damage without being trusted to run arbitrary commands.
  • The cure landed as reviewable Terraform. The permanent fix is a diff a human merged, not a command that lived for ten minutes in someone's shell history and got lost.

Honest note: emisar would not have stopped democratic-csi's mkfs; that was an automated component doing its job badly. What emisar changes is everything a human or agent does around the failure: investigate through declared actions, stop the bleed where you can, then hand a reviewable Terraform change back to a human. Imperative containment, declarative cure.