Case study · Cassandra

The major compaction that ate the read path

At 23:41 someone ran nodetool compact on a 1.8 TB table to "reclaim a little space." Within about three hours, coordinator p99 reads on orders_ks had gone from 8 ms to nearly a second and the pager was lit. Here is the loop that contained it: an agent investigates through declared nodetool actions (no JMX shell, no SSH), finds the runaway compaction, and aborts it behind a single human approval, with every step on the record.

The cluster

A six-node Cassandra 4.1 cluster in one datacenter, RF=3, with orders_ks on SizeTieredCompactionStrategy (STCS). The agent reaches it over MCP → emisar, scoped to one runner on cass-03, with the stock cassandra pack: 45 nodetool and cqlsh actions where every read is low-risk and every mutation is risk-tiered. There is no nodetool shell and no JMX port open to the model, only the actions the pack declares, each validated and gated by policy.

T+0 — reads start to drag

Grafana on the last hour: coordinator p99 reads on orders_ks climbing past 200 ms and still rising. Writes are fine. No node is marked down. The on-call agent picks up the page.

1 · Investigate — through declared actions, not a JMX shell

The agent works down from the ring to the cause. Every call below is low risk, scoped to cass-03, and logged with a reason — so the default policy lets them run unattended.

# Claude, over MCP → emisar. Each call is a declared, scoped, logged action.

cassandra.nodetool_status {}
→ Datacenter: dc1
  --  Address    Load        Tokens  Host ID    Rack
  UN  10.0.1.3   412.8 GiB   256     7b1c…      rack1
  UN  10.0.1.4   408.1 GiB   256     a90f…      rack1
  … all six nodes UN — nothing is down; this isn't a topology problem.

cassandra.nodetool_proxyhistograms {}
→ Percentile   Read Latency   Write Latency  (micros)
  95%             410824.30        9148.20
  99%             978472.39       10577.05
  # coordinator p99 reads ~0.98s — clients cluster-wide are hurting, not one node.

cassandra.nodetool_tpstats {}
→ Pool Name            Active  Pending  Blocked
  ReadStage                32      611        0
  CompactionExecutor        1        0        0
  # ReadStage backing up behind a single pinned compaction.

cassandra.nodetool_compactionstats {}
→ pending tasks: 0
  id      type        keyspace   table        completed  total     progress
  a1b2c3  Compaction  orders_ks  orders_2019  401.2 GiB  1.82 TiB  21.5%
  # one 1.8 TB major compaction, 3h12m in, holding the disk hostage.

cassandra.nodetool_compactionhistory {}
→ orders_ks.orders_2019  started 23:41 (3h ago) · major · still running
  # someone ran `nodetool compact orders_ks orders_2019` by hand to reclaim space.

Diagnosis: a manual major compaction is rewriting all of orders_2019, 1.8 TB, into a single SSTable. Under SizeTieredCompactionStrategy (STCS) that is the classic foot-gun: it saturates disk I/O and starves the read path for hours. The cluster is healthy; the I/O budget is simply gone.
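
How many hours? A back-of-the-envelope sketch from the compactionstats figures above, assuming the compaction keeps the average pace it has held so far (real throughput varies with load, so treat the result as an order of magnitude, not a forecast):

```python
# Rough ETA for the running compaction, using the figures from the
# compactionstats output above.
# Assumption: the average rate so far holds for the rest of the run.
GIB_PER_TIB = 1024

completed_gib = 401.2
total_gib = 1.82 * GIB_PER_TIB      # 1.82 TiB expressed in GiB
elapsed_hours = 3 + 12 / 60         # 3h12m

fraction_done = completed_gib / total_gib
rate_gib_per_hour = completed_gib / elapsed_hours
remaining_hours = (total_gib - completed_gib) / rate_gib_per_hour

print(f"done so far: {fraction_done:.1%}")
print(f"average rate: {rate_gib_per_hour * 1024 / 3600:.1f} MiB/s")
print(f"remaining at this pace: {remaining_hours:.1f} h")
```

At that pace the job is roughly a fifth done with more than eleven hours left, which is why waiting it out is not an option.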

2 · Stop the bleed — one approval

Aborting a running compaction is destructive: the partial SSTable is thrown away and hours of I/O are wasted, so policy holds it for a human. The nodetool_stop_compaction action ships at high risk, for which the default policy returns require_approval.

cassandra.nodetool_stop_compaction {"operation": "COMPACTION",
  "reason": "manual major compaction on orders_2019 starving reads cluster-wide"}
⏸ pending approval — nodetool_stop_compaction is high-risk; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ COMPACTION aborted on cass-03 · partial SSTable discarded · disk I/O released

Coordinator p99 reads fall back under 12 ms within the minute. The abort leaves no half-written table — Cassandra discards the in-progress SSTable cleanly — so there is nothing to recover, only I/O to give back. The approver saw the actor, the exact operation argument, the target runner, and the reason before clicking once.

3 · The risk model, in one frame

To keep normal autocompaction from re-saturating the same disks, the agent caps throughput. That action is medium, which the default policy allows — so unlike the abort, it just runs:

cassandra.nodetool_setcompactionthroughput {"mb_per_sec": 64,
  "reason": "cap compaction I/O while orders_ks recovers"}
→ compaction throughput set to 64 MB/s · ran on policy, no approval needed

Two mutations, two outcomes. The abort was destructive enough to stop for a person; capping throughput is routine, so the default policy lets it through. emisar is not "approve everything" — it is a risk tier per action, and you decide where the approval line sits. Tighten medium to require_approval for a role, or grant a bounded standing approval for the abort — either change is itself an audited policy edit.
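
The tier-to-decision mapping described above can be sketched in a few lines of Python. This is an illustration of the idea, not emisar's implementation: the enum names, the `decide` helper, and the action names `nodetool_compact` and `nodetool_decommission` are assumptions made for the example; the tiers themselves are the ones stated in this case study.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


# Default mapping as described in this case study.
DEFAULT_POLICY = {
    Risk.LOW: Decision.ALLOW,
    Risk.MEDIUM: Decision.ALLOW,
    Risk.HIGH: Decision.REQUIRE_APPROVAL,
    Risk.CRITICAL: Decision.DENY,
}

# Illustrative subset of the pack; some action names are hypothetical.
ACTION_RISK = {
    "nodetool_status": Risk.LOW,
    "nodetool_setcompactionthroughput": Risk.MEDIUM,
    "nodetool_stop_compaction": Risk.HIGH,
    "nodetool_compact": Risk.HIGH,
    "nodetool_decommission": Risk.CRITICAL,
}


def decide(action, policy=DEFAULT_POLICY):
    """Return the policy decision; undeclared actions are always denied."""
    risk = ACTION_RISK.get(action)
    if risk is None:
        return Decision.DENY
    return policy[risk]


# A stricter role: medium-risk actions now wait for a human too.
STRICT_POLICY = {**DEFAULT_POLICY, Risk.MEDIUM: Decision.REQUIRE_APPROVAL}

print(decide("nodetool_stop_compaction").value)
print(decide("nodetool_setcompactionthroughput").value)
print(decide("nodetool_setcompactionthroughput", STRICT_POLICY).value)
```

Moving the approval line is a change to one dictionary entry in this sketch; in the product it is the audited policy edit described above.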

What the agent could not do

  • Tear down a node to "fix" it. nodetool decommission, removenode, assassinate, and drain are in the pack, and they all ship at critical, which the default policy denies outright. There is no approval path; the runner never execs them. Lifting that is a deliberate policy edit, itself audited.
  • Repeat the operator's mistake. nodetool compact is in the pack too, at high — so the agent can't kick its own major compaction without a human signing off first.
  • Reach an arbitrary keyspace or node. Keyspace and table arguments are pattern-validated (^[a-zA-Z][a-zA-Z0-9_]{0,47}$) before exec, and nodetool_status even pins the JMX target to an allow-listed host and port.
  • Leak the host. There is no read_config action, and any JMX or cqlsh output runs through redaction on the runner before it leaves the box.
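
To see what the keyspace and table pattern quoted above accepts and rejects, here is a small Python check. The regular expression is the one from the list; the `is_valid_name` helper and the sample inputs are illustrative, and the runner's actual validation code is not shown in this case study.

```python
import re

# Pattern quoted above: a letter, then up to 47 letters, digits, or underscores.
NAME_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]{0,47}$")


def is_valid_name(name: str) -> bool:
    # fullmatch, so a trailing newline cannot slip past the `$` anchor.
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["orders_ks", "orders_2019", "1orders", "orders; --", "a" * 49]:
    print(f"{candidate[:20]!r}: {is_valid_name(candidate)}")
```

Anything containing whitespace, quotes, semicolons, or shell metacharacters fails before a command line is ever built.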

Questions a Cassandra operator asks

Could the agent have made it worse?

Only within a tier you chose. The reads it ran change nothing. The single destructive step, aborting the compaction, stopped for a human, who saw the exact arguments first. The critical-tier operations (decommission, removenode, assassinate, drain) are denied outright by default. The worst the agent can do unattended is run a low or medium action you have decided is safe to run.

How does emisar know an action is "high risk"?

The risk tier is declared in the action's YAML and travels with the pack's content hash. An operator trusts that hash once; the runner re-verifies it before every run. Change the risk in the YAML and the hash changes, which blocks dispatch until someone re-trusts it — the model can't quietly relabel a high action as low.
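
A minimal sketch of why relabeling fails, assuming a SHA-256 digest over the action definition. The hash algorithm, the YAML field names, and the `may_dispatch` helper are assumptions for illustration; the case study only states that the tier is covered by the pack's content hash.

```python
import hashlib


def content_hash(definition: str) -> str:
    """Digest of the action definition text (SHA-256 is an assumption)."""
    return hashlib.sha256(definition.encode("utf-8")).hexdigest()


# Hypothetical action definition; the real pack schema may differ.
trusted_yaml = "name: nodetool_stop_compaction\nrisk: high\n"
trusted_hash = content_hash(trusted_yaml)   # an operator trusts this once


def may_dispatch(current_yaml: str, trusted: str) -> bool:
    """Re-verify before every run: any edit changes the digest."""
    return content_hash(current_yaml) == trusted


tampered_yaml = trusted_yaml.replace("risk: high", "risk: low")

print(may_dispatch(trusted_yaml, trusted_hash))
print(may_dispatch(tampered_yaml, trusted_hash))
```

The tampered definition no longer matches the trusted digest, so dispatch stays blocked until an operator reviews the change and trusts the new hash.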

Can I let on-call abort a runaway compaction without paging me every time?

Yes — approve once and issue a standing grant bounded by duration, runner, argument shape, and a use count. The next identical stop_compaction matches the grant and runs; anything outside it still stops for a human. Revoke the grant any time, and every use is in the audit trail.
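
One way to picture a bounded standing grant, as a sketch only: the `StandingGrant` class and its fields are hypothetical, and "argument shape" is modeled here as an exact match on every argument except the free-text reason, which is an assumption about how matching works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StandingGrant:
    action: str
    runner: str
    args: dict            # argument values the grant covers (reason excluded)
    expires_at: datetime
    uses_left: int

    def try_use(self, action: str, runner: str, args: dict, now: datetime) -> bool:
        """Consume one use if the call matches every bound; else change nothing."""
        shape = {k: v for k, v in args.items() if k != "reason"}
        if (action != self.action or runner != self.runner
                or shape != self.args
                or now >= self.expires_at or self.uses_left <= 0):
            return False
        self.uses_left -= 1
        return True


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
grant = StandingGrant(
    action="nodetool_stop_compaction",
    runner="cass-03",
    args={"operation": "COMPACTION"},
    expires_at=issued + timedelta(days=7),
    uses_left=2,
)

call = {"operation": "COMPACTION", "reason": "runaway major compaction"}
print(grant.try_use("nodetool_stop_compaction", "cass-03", call, issued))   # matches
print(grant.try_use("nodetool_stop_compaction", "cass-04", call, issued))   # wrong runner
```

A call on a different runner, with different arguments, after expiry, or past the use count falls outside the grant and goes back to the normal approval path.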