Case study · Cassandra
The major compaction that ate the read path
At 23:41 someone ran nodetool compact on a 1.8 TB table to "reclaim a little space." By
midnight, coordinator p99 reads on orders_ks had gone from 8 ms to nearly a second and the
pager was lit. Here is the loop that contained it: an agent investigates through declared
nodetool actions (no JMX shell, no SSH), finds the runaway compaction, and aborts it behind
a single human approval, every step on the record.
The cluster
Six-node Cassandra 4.1 in one datacenter, RF=3, orders_ks on SizeTieredCompaction. The agent
reaches it over MCP → emisar, scoped to one runner on cass-03, with the stock cassandra
pack: 45 nodetool and cqlsh actions where every read is low-risk and every mutation is
risk-tiered. There is no nodetool shell and no JMX port open to the model, only the actions
the pack declares, each validated and gated by policy.
T+0 — reads start to drag
Grafana on the last hour: coordinator p99 reads on orders_ks climbing past 200 ms and still
rising. Writes are fine. No node is marked down. The on-call agent picks up the page.
1 · Investigate — through declared actions, not a JMX shell
The agent works down from the ring to the cause. Every call below is low risk, scoped to
cass-03, and logged with a reason, so the default policy lets them run unattended.
# Claude, over MCP → emisar. Each call is a declared, scoped, logged action.
cassandra.nodetool_status {}
→ Datacenter: dc1
--  Address   Load       Tokens  Host ID  Rack
UN  10.0.1.3  412.8 GiB  256     7b1c…    rack1
UN  10.0.1.4  408.1 GiB  256     a90f…    rack1
… all six nodes UN: nothing is down; this isn't a topology problem.
cassandra.nodetool_proxyhistograms {}
→ Percentile  Read Latency  Write Latency  (micros)
95%           410824.30     9148.20
99%           978472.39     10577.05
# coordinator p99 reads ~0.98s: clients cluster-wide are hurting, not one node.
cassandra.nodetool_tpstats {}
→ Pool Name           Active  Pending  Blocked
ReadStage             32      611      0
CompactionExecutor    1       0        0
# ReadStage backing up behind a single pinned compaction.
cassandra.nodetool_compactionstats {}
→ pending tasks: 0
id      type        keyspace   table        completed  total     progress
a1b2c3  Compaction  orders_ks  orders_2019  401.2 GiB  1.82 TiB  22.0%
# one 1.8 TB major compaction, 3h12m in, holding the disk hostage.
cassandra.nodetool_compactionhistory {}
→ orders_ks.orders_2019 started 23:41 (3h ago) · major · still running
# someone ran `nodetool compact orders_ks orders_2019` by hand to reclaim space.
Diagnosis: a manual major compaction is rewriting all of orders_2019 (1.8 TB) into a single
SSTable. On SizeTieredCompaction that is the classic foot-gun: it saturates disk I/O and
starves the read path for hours. The cluster is healthy; the I/O budget is simply gone.
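To put a number on "for hours", the compactionstats figures above can be extrapolated. This is a back-of-the-envelope sketch, not something the agent ran; it assumes the compaction keeps its average rate so far, which real compactions rarely do exactly.

```python
# Figures copied from the nodetool_compactionstats output in this case study.
GIB_PER_TIB = 1024

completed_gib = 401.2
total_gib = 1.82 * GIB_PER_TIB   # 1.82 TiB expressed in GiB
elapsed_hours = 3 + 12 / 60      # 3h12m into the run

# Assumption: throughput stays at its average-so-far for the rest of the run.
rate_gib_per_hour = completed_gib / elapsed_hours
rate_mib_per_sec = rate_gib_per_hour * 1024 / 3600
remaining_hours = (total_gib - completed_gib) / rate_gib_per_hour

print(f"average rate so far: {rate_mib_per_sec:.1f} MiB/s")
print(f"estimated time remaining: {remaining_hours:.1f} h")
```

Under that assumption the compaction has averaged roughly 36 MiB/s and would need more than eleven further hours to finish, which is why waiting it out was not an option.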
2 · Stop the bleed — one approval
Aborting a running compaction is destructive (the partial SSTable is thrown away and hours
of I/O are wasted), so policy holds it for a human. nodetool stop ships at high, which the
default policy returns as require_approval.
cassandra.nodetool_stop_compaction {"operation": "COMPACTION",
"reason": "manual major compaction on orders_2019 starving reads cluster-wide"}
⏸ pending approval — nodetool_stop_compaction is high-risk; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ COMPACTION aborted on cass-03 · partial SSTable discarded · disk I/O released
Coordinator p99 reads fall back under 12 ms within the minute. The abort leaves no
half-written table (Cassandra discards the in-progress SSTable cleanly), so there is
nothing to recover, only I/O to give back. The approver saw the actor, the exact operation
argument, the target runner, and the reason before clicking once.
3 · The risk model, in one frame
To keep normal autocompaction from re-saturating the same disks, the agent caps
throughput. That action is medium, which the default
policy allows — so unlike the abort, it just runs:
cassandra.nodetool_setcompactionthroughput {"mb_per_sec": 64,
"reason": "cap compaction I/O while orders_ks recovers"}
→ compaction throughput set to 64 MB/s · ran on policy, no approval needed
Two mutations, two outcomes. The abort was destructive enough to stop for a person;
capping throughput is routine, so the default policy lets it through. emisar is not
"approve everything" — it is a risk tier per action, and you decide where the approval
line sits. Tighten medium to require_approval for a role, or grant a bounded standing
approval for the abort; either change is itself an audited policy edit.
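The tier-to-outcome mapping described in this case study can be sketched as a small lookup. This is an illustration only, not emisar's implementation or configuration format: the `Decision` enum, the `decide` helper, and the action names `nodetool_compact` and `nodetool_decommission` are assumptions made for the example, while the tiers and their default outcomes are the ones stated in the text.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


# Default outcomes per risk tier, as described in the case study.
DEFAULT_POLICY = {
    "low": Decision.ALLOW,
    "medium": Decision.ALLOW,
    "high": Decision.REQUIRE_APPROVAL,
    "critical": Decision.DENY,
}

# Risk tier per action. The last two action names are hypothetical.
ACTION_RISK = {
    "nodetool_status": "low",
    "nodetool_setcompactionthroughput": "medium",
    "nodetool_stop_compaction": "high",
    "nodetool_compact": "high",
    "nodetool_decommission": "critical",
}


def decide(action, policy=DEFAULT_POLICY):
    """Return the policy outcome for an action; undeclared actions are denied."""
    risk = ACTION_RISK.get(action)
    if risk is None:
        return Decision.DENY
    return policy[risk]


# Tightening medium for a role is a one-key change to the policy.
strict_policy = {**DEFAULT_POLICY, "medium": Decision.REQUIRE_APPROVAL}

print(decide("nodetool_stop_compaction"))                         # Decision.REQUIRE_APPROVAL
print(decide("nodetool_setcompactionthroughput"))                 # Decision.ALLOW
print(decide("nodetool_setcompactionthroughput", strict_policy))  # Decision.REQUIRE_APPROVAL
```

Keeping the tier on the action and the outcome in the policy is what lets the approval line move without touching the pack.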
What the agent could not do
- Tear down a node to "fix" it. nodetool decommission, removenode, assassinate, and drain are in the pack, and they all ship at critical, which the default policy denies outright. There is no approval path; the runner never execs them. Lifting that is a deliberate policy edit, itself audited.
- Repeat the operator's mistake. nodetool compact is in the pack too, at high, so the agent can't kick off its own major compaction without a human signing off first.
- Reach an arbitrary keyspace or node. Keyspace and table arguments are pattern-validated (^[a-zA-Z][a-zA-Z0-9_]{0,47}$) before exec, and nodetool_status even pins the JMX target to an allow-listed host and port.
- Leak the host. There is no read_config action, and any JMX or cqlsh output runs through redaction on the runner before it leaves the box.
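The keyspace and table pattern quoted in the list above can be exercised directly. The sketch below uses Python's `re` module; the helper name `is_valid_identifier` is hypothetical, and how the runner actually applies the pattern is not shown in the source.

```python
import re

# Pattern quoted in the case study: a letter, then up to 47 letters, digits, or underscores.
IDENTIFIER = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]{0,47}$")


def is_valid_identifier(value: str) -> bool:
    """True if value is an acceptable keyspace or table name under the pattern."""
    # fullmatch also rejects a trailing newline, which a bare `$` would tolerate.
    return IDENTIFIER.fullmatch(value) is not None


for candidate in ["orders_ks", "orders_2019", "2019_orders", "orders; DROP", "a" * 49]:
    print(repr(candidate), is_valid_identifier(candidate))
```

The first two names pass; the others fail because they start with a digit, contain characters outside the allowed set, or exceed 48 characters.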
Questions a Cassandra operator asks
Could the agent have made it worse?
Only within a tier you chose. The reads it ran change nothing. The single destructive
step — aborting the compaction — stopped for a human, who saw the exact arguments
first. The truly irreversible operations (decommission, removenode, assassinate,
drain) are denied outright by default. The worst the agent can do unattended is run a low
or medium action you have decided is safe to run.
How does emisar know an action is "high risk"?
The risk tier is declared in the action's YAML and travels with the pack's content
hash. An operator trusts that hash once; the runner re-verifies it before every run.
Change the risk in the YAML and the hash changes, which blocks dispatch until someone
re-trusts it, so the model can't quietly relabel a high action as low.
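The trust-by-hash idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: SHA-256 as the hash, a set of trusted hex digests, and the YAML field names `id` and `risk` are all hypothetical, since the source does not specify how emisar computes or stores the content hash.

```python
import hashlib


def content_hash(pack_bytes: bytes) -> str:
    """Hash the pack content; SHA-256 is an assumption for this sketch."""
    return hashlib.sha256(pack_bytes).hexdigest()


def can_dispatch(pack_bytes: bytes, trusted_hashes: set) -> bool:
    """Re-verify before every run: dispatch only if the current hash is trusted."""
    return content_hash(pack_bytes) in trusted_hashes


# Hypothetical action definitions; the field names are illustrative.
original = b"id: nodetool_stop_compaction\nrisk: high\n"
relabelled = b"id: nodetool_stop_compaction\nrisk: low\n"

trusted = {content_hash(original)}  # an operator trusts this hash once

print(can_dispatch(original, trusted))    # True
print(can_dispatch(relabelled, trusted))  # False: the edit changed the hash
```

Because the risk tier is part of the hashed content, relabelling it is indistinguishable from any other tampering until an operator re-trusts the new hash.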
Can I let on-call abort a runaway compaction without paging me every time?
Yes — approve once and issue a standing grant bounded by duration, runner, argument
shape, and a use count. The next identical stop_compaction matches the
grant and runs; anything outside it still stops for a human. Revoke the grant any time,
and every use is in the audit trail.
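A standing grant with those four bounds might look like the sketch below. The `StandingGrant` class and its field names are hypothetical, and "argument shape" is modelled here as an exact match on the arguments with the free-text reason left out; the real matching rules are not described in the source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StandingGrant:
    action: str
    runner: str
    args: dict            # argument shape the grant covers (reason excluded)
    expires_at: datetime  # duration bound
    uses_left: int        # use-count bound
    revoked: bool = False

    def try_use(self, action: str, runner: str, args: dict, now: datetime) -> bool:
        """Consume one use if the call falls inside every bound; otherwise refuse."""
        if self.revoked or now >= self.expires_at or self.uses_left <= 0:
            return False
        if (action, runner, args) != (self.action, self.runner, self.args):
            return False
        self.uses_left -= 1
        return True


issued = datetime(2024, 1, 1, 0, 0)
grant = StandingGrant(
    action="nodetool_stop_compaction",
    runner="cass-03",
    args={"operation": "COMPACTION"},
    expires_at=issued + timedelta(hours=8),
    uses_left=2,
)

same_call = ("nodetool_stop_compaction", "cass-03", {"operation": "COMPACTION"})
print(grant.try_use(*same_call, now=issued))  # True: inside every bound
print(grant.try_use("nodetool_stop_compaction", "cass-04",
                    {"operation": "COMPACTION"}, now=issued))  # False: other runner
```

A call that returns False would fall back to the normal approval path rather than being rejected outright.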