Case study · Postgres

The migration that died holding a lock

At 02:14 the API started timing out across the board. Not a slow query, not a CPU spike — every write was queued behind a single transaction a crashed deploy had left open eighteen minutes earlier. Here is the loop that cleared it: an agent investigates through declared, read-only psql actions, finds the head of the lock chain, and severs it behind one human approval — no psql -c shell anywhere in reach.

The database

A managed Postgres 16 primary with two streaming replicas. The agent reaches it over MCP → emisar with the stock postgres pack: 48 actions, each a single fixed SQL statement run through psql -XAt with only typed, validated arguments substituted. The runner authenticates with libpq PG* env vars on the host; the role is scoped to pg_monitor plus pg_signal_backend — enough to read state and signal a backend, nothing that can run DDL.

T+0 — everything starts timing out

p99 API latency goes vertical; checkout requests pile up against a 30-second timeout. Connection count climbs as the app retries. CPU is flat — the database isn't busy, it's waiting. The on-call agent picks up the page.

1 · Investigate — through declared reads

Every call below is low risk and holds no locks of its own — so the default policy runs them unattended while the agent narrows in.

# Claude, over MCP → emisar. Every call carries a reason; it's logged.

postgres.activity_states {}
→ active|41
  idle in transaction|6
  idle|22
  # 41 "active" backends and climbing — but active here means blocked, not working.

postgres.lock_blocking_chains {}
→ blocked_pid|blocker_pid|blocked_query
  8801|4471|UPDATE orders SET status='paid' WHERE id=$1
  8802|4471|UPDATE orders SET status='paid' WHERE id=$1
  … 30 rows, every one blocked_by 4471
  # one backend — pid 4471 — is the head of the entire chain.

postgres.idle_in_transaction {}
→ pid|usename|application_name|idle_seconds|query
  4471|deploy|migrate-svc|1084|SELECT id FROM orders WHERE region='eu' FOR UPDATE
  # 4471 has been idle *inside* a transaction for 18 min, holding a FOR UPDATE lock.

postgres.activity_detail {}
→ 4471|deploy|migrate-svc|idle in transaction|Client|ClientRead|…
  # wait_event = ClientRead: the client is gone. A migration died mid-transaction.

Diagnosis: migrate-svc opened a transaction, took a FOR UPDATE lock on a batch of orders rows, and then crashed before COMMIT. Postgres holds the transaction — and its locks — open until the connection closes. Every checkout touching those rows is now queued behind a client that no longer exists.

2 · Stop the bleed — one approval

There are two ways to signal a backend, and the difference matters here. cancel_query (pg_cancel_backend, a SIGINT) cancels a running query — but 4471 isn't running one, it's sitting idle holding a lock, so a cancel does nothing. The tool that releases it is terminate_backend (pg_terminate_backend, a SIGTERM): sever the connection, roll back the orphaned transaction, drop the lock. It ships at high, so policy holds it for a human.

postgres.terminate_backend {"pid": 4471,
  "reason": "crashed migrate-svc left an idle FOR UPDATE txn blocking 30 writers"}
⏸ pending approval — terminate_backend is high-risk; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ pg_terminate_backend(4471) = t · connection severed · txn rolled back · lock released

The thirty blocked UPDATEs commit in the next second and API latency drops back to baseline. The approver saw the actor, the exact pid (validated as an integer in [1, 4194304], nothing else), the target runner, and the reason before clicking once — and that approval is now tied to the run in the audit trail forever.

3 · Sweep the stragglers

The same failed deploy left a handful of other idle-in-transaction backends. None are blocking writers yet, but each pins an old xmin that keeps vacuum from cleaning up — slow-motion bloat. One gated bulk action clears them:

postgres.kill_idle {"older_than_seconds": 600,
  "reason": "clear the remaining idle-in-transaction backends from the failed migrate-svc deploy"}
⏸ pending approval — kill_idle is high-risk; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ terminated 5 idle-in-transaction backends older than 600s

older_than_seconds is bounded to [60, 86400] and the kill only ever targets idle in transaction backends — it can't reach a healthy connection mid-query. Like the terminate before it, it's high, so it stopped for a human, and every backend it touched is on the record.

What stayed out of bounds

  • No raw SQL. There is no postgres.exec and no psql -c passthrough. Every action runs one fixed statement with only validated, typed arguments substituted — a PID is an integer in [1, 4194304], an identifier matches ^[a-zA-Z_][a-zA-Z0-9_]{0,62}$, nothing else reaches psql.
  • No DDL. The pack's writes stop at cancel/terminate, ANALYZE, VACUUM, and REINDEX CONCURRENTLY — each individually risk-tiered. There is no ALTER, DROP, or DELETE action to call.
  • Shapes, not values. The diagnostic reads return query fingerprints and counts — slow_queries reads pg_stat_statements, where literals are already normalized to $1. The agent sees what's slow, not the rows it touched.
  • No raw secrets leave the host. psql connection errors and PG* environment values run through redaction on the runner before egress. The control plane never sees your database password.

Questions a Postgres on-call asks

Can the model run an undeclared query?

No. The runner only exposes the actions in the trusted pack, each a fixed SQL string. A statement that isn't one of those 48 doesn't exist as far as the model is concerned — there is no free-text SQL surface to reach, so prompt-injecting a DROP TABLE has nowhere to land.

Why is killing a backend allowed at all?

Because clearing a wedged lock at 02:14 is the whole point. It's gated, not removed: cancel_query, terminate_backend, and kill_idle ship at high, which the default policy holds for one approval. Tighten any of them to deny for a role, or grant a bounded standing approval so on-call can clear the common case unattended — either way it's a policy decision, and the change is itself audited.

Does the runner need superuser on the database?

No — only the privileges the actions you enable require. For this incident, pg_monitor (read every stat view) plus pg_signal_backend (cancel and terminate other backends) is the whole grant. No superuser, no ownership, no write access to a single table.