Connection lost
Trying to reconnect…
Server didn't respond
Recovering…
Case study · Postgres
The migration that died holding a lock
At 02:14 the API started timing out across the board. Not a slow query, not a CPU spike —
every write was queued behind a single transaction a crashed deploy had left open
eighteen minutes earlier. Here is the loop that cleared it: an agent investigates through
declared, read-only psql
actions, finds the head of the lock chain, and severs it behind one human approval — no
psql -c
shell anywhere in reach.
The database
A managed Postgres 16 primary with two streaming replicas. The agent reaches it over
MCP → emisar with the stock postgres
pack: 48 actions, each a single fixed SQL statement run through psql -XAt
with only typed, validated arguments substituted. The runner authenticates with libpq
PG*
env vars on the host; the role is scoped to pg_monitor
plus pg_signal_backend
— enough to read state and signal a backend, nothing
that can run DDL.
T+0 — everything starts timing out
p99 API latency goes vertical; checkout requests pile up against a 30-second timeout. Connection count climbs as the app retries. CPU is flat — the database isn't busy, it's waiting. The on-call agent picks up the page.
1 · Investigate — through declared reads
Every call below is low
risk and holds no locks of its own — so the default policy runs them unattended while the
agent narrows in.
# Claude, over MCP → emisar. Every call carries a reason; it's logged. postgres.activity_states {} → active|41 idle in transaction|6 idle|22 # 41 "active" backends and climbing — but active here means blocked, not working. postgres.lock_blocking_chains {} → blocked_pid|blocker_pid|blocked_query 8801|4471|UPDATE orders SET status='paid' WHERE id=$1 8802|4471|UPDATE orders SET status='paid' WHERE id=$1 … 30 rows, every one blocked_by 4471 # one backend — pid 4471 — is the head of the entire chain. postgres.idle_in_transaction {} → pid|usename|application_name|idle_seconds|query 4471|deploy|migrate-svc|1084|SELECT id FROM orders WHERE region='eu' FOR UPDATE # 4471 has been idle *inside* a transaction for 18 min, holding a FOR UPDATE lock. postgres.activity_detail {} → 4471|deploy|migrate-svc|idle in transaction|Client|ClientRead|… # wait_event = ClientRead: the client is gone. A migration died mid-transaction.
Diagnosis: migrate-svc
opened a transaction, took a FOR UPDATE
lock on a batch of orders
rows, and then crashed before COMMIT. Postgres holds the transaction — and
its locks — open until the connection closes. Every checkout touching those rows is now
queued behind a client that no longer exists.
2 · Stop the bleed — one approval
There are two ways to signal a backend, and the difference matters here.
cancel_query
(pg_cancel_backend, a SIGINT) cancels a running query
— but 4471 isn't running one, it's sitting idle holding a lock, so a cancel
does nothing. The tool that releases it is terminate_backend
(pg_terminate_backend, a SIGTERM): sever the connection, roll back the
orphaned transaction, drop the lock. It ships at high,
so policy holds it for a human.
postgres.terminate_backend {"pid": 4471,
"reason": "crashed migrate-svc left an idle FOR UPDATE txn blocking 30 writers"}
⏸ pending approval — terminate_backend is high-risk; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ pg_terminate_backend(4471) = t · connection severed · txn rolled back · lock released
The thirty blocked UPDATEs commit in the next second and API latency drops
back to baseline. The approver saw the actor, the exact pid
(validated as an integer in [1, 4194304], nothing else), the target runner,
and the reason before clicking once — and that approval is now tied to the run in the
audit trail forever.
3 · Sweep the stragglers
The same failed deploy left a handful of other idle-in-transaction backends. None are
blocking writers yet, but each pins an old xmin
that keeps vacuum from cleaning up — slow-motion bloat. One gated bulk action clears them:
postgres.kill_idle {"older_than_seconds": 600,
"reason": "clear the remaining idle-in-transaction backends from the failed migrate-svc deploy"}
⏸ pending approval — kill_idle is high-risk; a human approves in the portal
✓ approved by you · one use · audit event recorded
→ terminated 5 idle-in-transaction backends older than 600s
older_than_seconds
is bounded to [60, 86400]
and the kill only ever targets idle in transaction
backends — it can't reach a healthy connection mid-query. Like the terminate before it,
it's high, so it stopped for a human, and every backend it touched is on the
record.
What stayed out of bounds
-
No raw SQL.
There is no
postgres.execand nopsql -cpassthrough. Every action runs one fixed statement with only validated, typed arguments substituted — a PID is an integer in[1, 4194304], an identifier matches^[a-zA-Z_][a-zA-Z0-9_]{0,62}$, nothing else reaches psql. -
No DDL.
The pack's writes stop at cancel/terminate,
ANALYZE,VACUUM, andREINDEX CONCURRENTLY— each individually risk-tiered. There is noALTER,DROP, orDELETEaction to call. -
Shapes, not values.
The diagnostic reads return query fingerprints
and counts —
slow_queriesreadspg_stat_statements, where literals are already normalized to$1. The agent sees what's slow, not the rows it touched. -
No raw secrets leave the host.
psqlconnection errors andPG*environment values run through redaction on the runner before egress. The control plane never sees your database password.
Questions a Postgres on-call asks
Can the model run an undeclared query?
No. The runner only exposes the actions in the trusted pack, each a fixed SQL string. A
statement that isn't one of those 48 doesn't exist as far as the model is concerned —
there is no free-text SQL surface to reach, so prompt-injecting a
DROP TABLE
has nowhere to land.
Why is killing a backend allowed at all?
Because clearing a wedged lock at 02:14 is the whole point. It's gated, not removed: cancel_query, terminate_backend, and
kill_idle
ship at high, which the default policy holds for one approval. Tighten any
of them to deny
for a role, or grant a bounded standing approval so on-call can clear the common case
unattended — either way it's a policy decision, and the change is itself audited.
Does the runner need superuser on the database?
No — only the privileges the actions you enable require. For this incident,
pg_monitor
(read every stat view) plus pg_signal_backend
(cancel and terminate other backends) is the whole grant. No superuser, no ownership,
no write access to a single table.