Case study · Storage

The 33-hour wipe: a CSI driver reformatted a live LUN

A routine node drain ended with 33 hours of production telemetry gone — a democratic-csi driver ran mkfs over a live Pure LUN, triggered by a dm-multipath path-group race. Here is the incident, and the exact emisar loop that contained it: investigate through declared actions, stop the bleed behind one approval, then write the durable fix to the infra repo — no raw SSH, every step on the record. The twist is the fix: the one-line setting you'd reach for first was a no-op, and the thing that actually stopped it was a guard that refuses to trust the driver.

The stack

Five Dell PowerEdge R640s in the VA1 colo · Pure FlashArray over iSCSI multipath · Nomad 2.0.2 · democratic-csi node-manual v1.9.5, pinned by OCI digest. VictoriaMetrics, VictoriaLogs, and Grafana each own a dedicated Pure LUN (vm-data, vl-data, grafana-data) — all single-node-writer, all under Pure QoS caps, all holding 33 hours of Apex / Fortnite / Tarkov / Deadlock backend telemetry. The investigation below runs entirely through emisar on the stock debugging, docker, nomad, multipath, iscsi, pure and victoriametrics packs. emisar is only as capable as the actions you declare; for this fleet, the catalog already covered every layer of the fabric.

The fabric was never quiet

The wipe was not a bolt from the blue. For the weeks before it, the iSCSI / dm-multipath path was a slow drip of operational pain: queue depth tuned to 128 for the FlashArray, multipath retry behavior corrected, persistent iSCSI sessions pinned at boot, an iSCSI login race plus a /run/multipathd persistence bug, a data-plane boot race, a session watchdog timer, and kill_timeout raised to 60s so NodeUnstageVolume could flush on drain. None catastrophic alone. Together: an unstable substrate where a path group could come up with no active path — exactly the condition that makes a live device read empty.

T+0 — the drain

Drain nomad-hvn01, reboot, restore — routine. VictoriaMetrics and VictoriaLogs rescheduled onto nomad-hvn03 and came up green in about 30 seconds. Ingest resumed. Dull and normal.

T+50m — 33 hours, gone

Grafana on the last 7 days: almost everything flat, only the last 12 minutes of metrics alive. A sharp cliff where the history just ended. Same for logs. The processes were healthy and ingesting — the data was missing because the data was not there. An agent picks up the page.

1 · Investigate — through emisar, not SSH

The agent works down the layers with declared pack actions, each scoped to nomad-hvn03 (or the array) and logged with a reason. No shell, no standing SSH key.

# Claude, over MCP → emisar. Each call is a declared, scoped, logged action.

nomad.csi_volume_status {"volume_id": "vm-data"}
→ Schedulable = true    Access Mode = single-node-writer
  Allocations          a1b2c3d4  vm  running   (nomad-hvn03)
  # the storage layer is fine — the volume is attached and healthy.
  # so the question is what's actually on it.

fs.ls_long {"path": "…/vm-data"}
→ total 24
  drwx------ 2 root root 16384 13:11 lost+found
  # a freshly-made ext4: nothing but lost+found, born at 13:11.
  # 33 hours of VictoriaMetrics data is not here.

debugging.dmesg_tail {"lines": 400}
→ 13:11:03 device-mapper: multipath 254:3: queue_if_no_path enabled
  13:11:03 multipath 254:3: Reinstating path … remaining active paths: 0
  13:11:09 EXT4-fs (dm-3): mounted filesystem … clean
  # six seconds with zero active paths, then a clean mount of a fresh fs.

multipath.topology
→ 3624a9…265c dm-3 PURE,FlashArray
  features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
  |-+- policy='service-time 0' prio=0 status=enabled   # ← neither path
  | `- 8:0:0:1  sde 8:64  active ready running          #   group is
  `-+- policy='service-time 0' prio=0 status=enabled   #   status=active:
    `- 10:0:0:1 sdi 8:128 active ready running          #   the I/O window

docker.logs {"container": "democratic-csi", "lines": 2000}
→ GetDiskFormat /dev/mapper/3624a9…265c → blkid output="" (empty)
  Disk appears unformatted; running mkfs.ext4 -F /dev/mapper/3624a9…
  Disk successfully formatted (mkfs)
  # blkid read empty mid-race; the driver formatted a live LUN.

pure.volumes_space {"names": "vm-data"}
→ vm-data  data_reduction 1.0:1  unique 0.01G   (40.8G at 13:00)
  # the array agrees: unique data fell off a cliff. Gone, not hidden.
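
The fs.ls_long result is the quickest tell, and it is easy to check mechanically. A minimal sketch, assuming an ext4 mount where a brand-new filesystem contains only lost+found; the function name looks_freshly_formatted is hypothetical and not part of any emisar pack:

```shell
# Succeeds only if DIR contains exactly one entry and it is lost+found,
# which is what a just-created ext4 filesystem looks like from the mount point.
looks_freshly_formatted() {
  dir=$1
  [ -d "$dir/lost+found" ] || return 1
  # ls -A lists every entry except . and ..; one name means nothing else is there
  [ "$(ls -A "$dir")" = "lost+found" ]
}
```

A data volume that is expected to hold history and passes this check deserves a page before any process is allowed to write to it.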

Diagnosis: when the multipath device was re-probed on migration, every path came up enabled (queue-ready) but no group was promoted to active. blkid issued the first I/O: it opened the device, the read sat in the queue, and it timed out with empty output. The driver read "empty" as "unformatted" and ran mkfs.ext4 -F over a live LUN, then mounted the fresh filesystem for VictoriaMetrics to write to. This is kubernetes/kubernetes#95183, and it is a whole bug class, confirmed against NetApp Trident, Longhorn, OpenEBS, and Azure Disk. Switching CSI drivers would not fix it, because the same format-on-empty-probe behavior lives in each of them.
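
The condition in that diagnosis can be tested before any probe result is trusted. A minimal sketch, assuming the topology layout shown in the transcript above, where each path-group line carries status=active or status=enabled; has_active_path_group is a hypothetical helper, not something democratic-csi or emisar provides:

```shell
# Reads multipath topology text on stdin.
# Succeeds only if at least one path group reports status=active.
# Individual path lines say "active ready running" but never "status=active",
# so they cannot produce a false positive.
has_active_path_group() {
  grep -q 'status=active'
}
```

Fed the topology from the incident, the function fails, which is the signal that an empty blkid result means "could not read", not "unformatted". In use it would sit in front of the probe, for example `multipath -ll "$wwid" | has_active_path_group || exit 64`, where $wwid is whatever identifier the caller already holds.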

2 · Stop the bleed — one approval

VictoriaMetrics is already writing fresh parts to the empty filesystem — every group-commit overwrites blocks that still hold the old data. The move is to halt it, and nomad.alloc_stop is declared risk: high in the pack, so policy holds it for a human.

nomad.alloc_stop {"alloc_id": "a1b2c3d4", "reason":
  "CSI reformatted a live LUN — stop writes to preserve recoverable blocks"}
⏸ pending approval — nomad.alloc_stop is risk:high; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ alloc stopped · writes halted · LUN frozen for forensics

Caught in the first minute, that freeze preserves the LUN for recovery. Here it took about 50 minutes for a human to notice; by then the ext4 journal had wrapped and the old blocks were already reused, so we accepted the 33-hour gap (game-side state was untouched; only telemetry was lost).

3 · Codify the fix — what actually stops it

The obvious fix is one line in the driver config, and it does nothing. Source review during the cutover showed democratic-csi v1.9.5 never reads node.format.disabled on the POSIX NodeStageVolume path; it is a documented no-op. The fix that actually holds is a guard that doesn't trust the driver, prepared locally as a pull request against the infra repo for a human to review and merge.

# driver-config.yaml — the obvious knob, kept only as documentation:
  node: { format: { disabled: true } }   # ← v1.9.5 never reads it. No-op.

# So don't let the driver reach a real mkfs. At container start, shadow every
# formatter and keep the real binary as <name>.real:
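for name in mkfs mkfs.ext2 mkfs.ext3 mkfs.ext4 mkfs.xfs mkfs.btrfs; do
  for dir in /usr/sbin /sbin /usr/bin /bin; do
    [ -x "$dir/$name" ] || continue
    # Already shadowed (container restart, or /sbin is a symlink to /usr/sbin):
    # moving again would overwrite the real binary with the guard.
    [ -e "$dir/$name.real" ] && continue
    mv "$dir/$name" "$dir/$name.real"
    cp /local/mkfs.guard "$dir/$name"
    chmod 0755 "$dir/$name"   # the copy must be executable to stand in for mkfs
  done
done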

# mkfs.guard runs in the driver's place and decides per device:
tool=$(basename "$0"); real=$(command -v "$tool.real")
# Every refusal names the tool, the device, and the reason on stderr.
refuse() { echo "mkfs.guard: refusing $tool on $1 ($2)" >&2; exit 64; }
for arg in "$@"; do
  case "$arg" in /dev/*|/host/dev/*) ;; *) continue ;; esac
  base=$(basename "$(readlink -f "$arg")")
  id="$base $(cat "/sys/class/block/$base/device/model" 2>/dev/null)"
  echo "$id" | grep -Eqi 'nvme|Pure|FlashArray' || continue  # local disk: allow
  [ "${ALLOW_PURE_MKFS_DEVICE:-}" = "$arg" ] && continue   # explicit one-off
  if [ "$tool" = mkfs.ext4 ]; then
    fstype=$(blkid -p -s TYPE -o value "$arg"); rc=$?
    case $rc in
      0) [ "$fstype" = ext4 ] && exit 0 ;;   # already ext4: idempotent no-op
      2) ;;                                  # blank, but still not ours to format
      *) refuse "$arg" "blkid rc=$rc" ;;     # blkid ambiguous: FAIL CLOSED
    esac
  fi
  refuse "$arg" "Pure/NVMe LUN"   # any Pure/NVMe LUN we didn't no-op above
done
exec "$real" "$@"   # not a Pure device: the real mkfs runs

# Same era, for other reasons: iSCSI dm-multipath → NVMe/TCP. Rarer
# empty-read window — but not the fix; the driver reformatted on NVMe too.
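
The shadowing loop only protects the volume if it ran everywhere it needed to, so a post-start check is worth having. A minimal sketch under stated assumptions: verify_shadowed is a hypothetical helper, its first argument is a root prefix (empty for the live container, a scratch directory for testing), and its second is the guard script the shadows were copied from:

```shell
# Succeeds only if every mkfs entry point found under ROOT is byte-identical
# to GUARD. Any formatter that is still the real binary is reported on stderr.
verify_shadowed() {
  root=$1
  guard=$2
  rc=0
  for name in mkfs mkfs.ext2 mkfs.ext3 mkfs.ext4 mkfs.xfs mkfs.btrfs; do
    for dir in /usr/sbin /sbin /usr/bin /bin; do
      f="$root$dir/$name"
      # Skip locations where this formatter was never installed.
      [ -e "$f" ] || [ -e "$f.real" ] || continue
      if ! cmp -s "$guard" "$f"; then
        echo "UNGUARDED: $f" >&2
        rc=1
      fi
    done
  done
  return $rc
}
```

Run as `verify_shadowed "" /local/mkfs.guard` after the loop, a non-zero exit can fail the container start rather than let the driver reach an unshadowed mkfs.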

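The guard shadows every mkfs entrypoint inside the plugin and fails closed. On a Pure or NVMe device it exits 0 only when mkfs.ext4 is asked to format a device that already holds ext4, passes through only for the one device named in ALLOW_PURE_MKFS_DEVICE, and refuses everything else, including a device that blkid reports as blank. A corrupted filesystem that looks empty to blkid is therefore refused, not formatted. A separate serial-resolved formatter handles genuinely new volumes: it resolves exactly one Pure namespace by serial, then refuses unless the start, middle, and end of the device all read as zero.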
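
The formatter itself is not reproduced here. The zero-probe part of it can be sketched as follows, with loud assumptions: the 64 KiB window size, the 4096-byte block size, and the names dev_size, window_is_zero, and provably_blank are all illustrative, and serial resolution is left out entirely:

```shell
PROBE_BLOCKS=16   # 16 x 4096 B = 64 KiB per window (assumed size)

# Size in bytes: blockdev for real block devices, wc -c as a fallback for files.
dev_size() {
  blockdev --getsize64 "$1" 2>/dev/null || wc -c < "$1"
}

# Succeeds only if a full window starting at block $2 was read AND is all NUL.
# A short or failed read is never treated as "blank".
window_is_zero() {
  tmp=$(mktemp) || return 2
  dd if="$1" of="$tmp" bs=4096 skip="$2" count="$PROBE_BLOCKS" 2>/dev/null
  got=$(wc -c < "$tmp")
  nonzero=$(tr -d '\000' < "$tmp" | wc -c)
  rm -f "$tmp"
  [ "$got" -eq $((4096 * PROBE_BLOCKS)) ] && [ "$nonzero" -eq 0 ]
}

# Succeeds only if the start, middle, and end windows all read as zero.
provably_blank() {
  size=$(dev_size "$1") || return 2
  blocks=$((size / 4096))
  [ "$blocks" -ge $((PROBE_BLOCKS * 3)) ] || return 2   # too small to sample
  window_is_zero "$1" 0 &&
    window_is_zero "$1" $((blocks / 2 - PROBE_BLOCKS / 2)) &&
    window_is_zero "$1" $((blocks - PROBE_BLOCKS))
}
```

The design point is the byte count in window_is_zero: the incident happened because an empty read was taken as evidence of an empty disk, so this sketch demands that the full window actually arrived before it looks at the contents.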

What emisar actually changed

The forensics were legible and scoped. Every dmesg, multipath, CSI-log, and array-side read was a declared action against one host, logged with a reason — not a tailscale ssh root@… scramble across five tools with no record of who looked at what.
The one destructive step stopped for a person. Halting the alloc was gated, approved once, and recorded — the agent could contain the damage without being trusted to run arbitrary commands.
The real fix landed as reviewable infra. A guard that distrusts the driver, landed as a diff a human reviewed and merged — not a command that lived for ten minutes in someone's shell history and got lost.

emisar would not have stopped democratic-csi's mkfs — that was an automated component doing its job badly, and the tidy declarative fix you'd reach for first was a no-op the vendor shipped. What emisar changes is everything a human or agent does around the failure. Investigate through tools, stop the disaster where you can, then hand back a change a human reviews and merges.
