Runbooks

A runbook is a saved, versioned sequence of actions: the "how we restart the cache tier" knowledge that usually lives in a wiki page going stale. Here it's executable. Every run is policy-gated and audited, and your LLM can read and run it over MCP.

What a runbook is made of

An ordered list of steps. Each step names an action from your catalog, the arguments to pass, and where it runs — a specific runner or a runner group (a group step fans out to every matching runner). Because steps reference catalog actions, a runbook can never do anything your packs don't already declare.
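As a rough illustration of that shape, here is a minimal sketch in Python. It is not the product's schema: the `Step` class, its field names, and the `validate_steps` and `expand_targets` helpers are all hypothetical, chosen only to show how "steps reference catalog actions" and "a group step fans out" might look as data.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One runbook step (hypothetical field names, for illustration only)."""
    action: str                   # name of a catalog action, e.g. "linux.uptime"
    target: str                   # a runner name or a runner-group name
    target_kind: str = "runner"   # "runner" or "group"
    args: dict = field(default_factory=dict)


def validate_steps(steps, catalog):
    """Return a list of problems; an empty list means every step names a declared action."""
    problems = []
    for number, step in enumerate(steps, start=1):
        if step.action not in catalog:
            problems.append(f"step {number}: unknown action {step.action!r}")
        if step.target_kind not in ("runner", "group"):
            problems.append(f"step {number}: unknown target kind {step.target_kind!r}")
    return problems


def expand_targets(step, groups):
    """A group step fans out to every runner in the group; a runner step is a single run."""
    if step.target_kind == "group":
        return list(groups.get(step.target, []))
    return [step.target]
```

In this sketch a step that names an action outside the catalog is rejected before anything runs, which is the property the paragraph above describes.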

Authoring

Runbooks → New in the dashboard. The editor is a form: pick an action from the catalog, fill its declared arguments, choose the target, reorder steps with the arrow controls. Owners and admins author and edit; operators and viewers (and connected LLMs) see published runbooks read-only.

— Draft → published. Drafts are editable and invisible to dispatch. Publishing freezes that version; every later save bumps the version number, so you can tell which revision ran during an incident.
— Title, slug, description. The description is what an LLM reads when it lists runbooks — write it like you'd brief a new on-call: when to reach for this, when not to.

Running one

Dispatch from the runbook's page with a required reason — same as any single action. Execution runs in waves: up to five runs dispatch together, and the whole wave must reach a terminal state before the next wave starts — a run held for approval keeps its wave open, and later waves wait. Two properties matter operationally:

— Policy applies per step. A runbook isn't a policy bypass: a high-risk step stops for approval exactly as it would standalone, and the runbook continues once a human approves.
— Failure halts the sequence. A denied or failed step stops the runbook rather than marching on against a host in an unknown state. Every step's run is individually visible — output, exit code, duration — under the execution.
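The wave behaviour can be modelled in a few lines. This is a sketch under assumptions, not the scheduler itself: `dispatch` is a hypothetical callable that blocks until a run is terminal (including any wait for approval) and returns "succeeded", "failed", or "denied", and the sketch assumes a wave always finishes before the halt takes effect.

```python
from concurrent.futures import ThreadPoolExecutor

WAVE_SIZE = 5  # from the text: up to five runs dispatch together


def run_in_waves(runs, dispatch, wave_size=WAVE_SIZE):
    """Dispatch runs wave by wave and stop after the first wave with a failed or denied run.

    `dispatch(run)` is a hypothetical blocking call that returns the run's
    terminal status: "succeeded", "failed", or "denied".
    """
    results = []
    for start in range(0, len(runs), wave_size):
        wave = runs[start:start + wave_size]
        # Every run in the wave is dispatched together; map() returns only
        # once all of them are terminal, so a slow run keeps the wave open.
        with ThreadPoolExecutor(max_workers=wave_size) as pool:
            statuses = list(pool.map(dispatch, wave))
        results.extend(zip(wave, statuses))
        if any(status != "succeeded" for status in statuses):
            break  # later waves never start
    return results
```

With twelve runs and a failure in the seventh, the first two waves complete and the third is never dispatched.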

Worked example: a fleet health-check

The classic case for your first runbook is the health-check.sh that SSHes to every node and greps a dozen things. As a runbook it's an ordered list of read-only steps pointed at a runner group — no SSH, no inbound port, every check declared and journaled. Because the checks are read-only they sit in the low-risk tier, so they fan out across the whole fleet without stopping for an approval:

— linux.uptime and systemd.failed_units on the edge group — is anything down, did anything crash-loop since the last sweep.
— linux.disk_usage and linux.memory — the two things that fill before they page you.
— time.chrony_tracking — clock drift, the quiet cause of half of "impossible" distributed bugs.
— one cluster-wide step — consul.node_health from a single runner that can reach Consul — for the leader, members, and any failing service checks.

Dispatch it with a reason and each group step fans out in parallel waves. Publish it once and the whole sweep is a single click for the on-call, or a single execute_runbook for an agent triaging a live alert. Remediation stays a separate, one-off action: if the sweep finds a dead unit, an agent can follow up with a systemd.unit_restart, which waits for your approval by default.
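Written out as data, the sweep might look like the sketch below. The action names and the edge group come from the list above; everything else is assumed for illustration, including the dict layout, the choice to point all five host checks at the edge group, and the runner name "consul-reach-1".

```python
# Hypothetical layout: each step names a catalog action and either a group or a runner.
HEALTH_CHECK_STEPS = [
    {"action": "linux.uptime", "group": "edge"},
    {"action": "systemd.failed_units", "group": "edge"},
    {"action": "linux.disk_usage", "group": "edge"},
    {"action": "linux.memory", "group": "edge"},
    {"action": "time.chrony_tracking", "group": "edge"},
    # One cluster-wide step from a single runner that can reach Consul.
    {"action": "consul.node_health", "runner": "consul-reach-1"},
]


def planned_runs(steps, groups):
    """Expand steps into (action, runner) pairs: group steps fan out, runner steps do not."""
    runs = []
    for step in steps:
        if "group" in step:
            runners = groups[step["group"]]
        else:
            runners = [step["runner"]]
        runs.extend((step["action"], runner) for runner in runners)
    return runs
```

For an edge group of three runners, `planned_runs` yields five checks on each of three hosts plus the single Consul step.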

Runbooks and your LLM

Connected agents see four runbook tools. list_runbooks returns published runbooks with their descriptions, and get_runbook returns one runbook's ordered steps with targets resolved to current runner names. execute_runbook sends a published runbook through the governed end-to-end path; every step still passes its normal policy, approval, target, and audit checks. One limit: it refuses any runbook whose targets include a signature-enforcing runner. The bridge signs only direct run_action calls, so run those runbooks from the console.

create_runbook_draft validates an agent's proposed plan and saves it for human review, but never publishes it. An agent can still use get_runbook and dispatch steps itself when it needs to stop, reassess, or escalate between them.
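The signature limit on execute_runbook amounts to a simple check over a runbook's resolved targets. The sketch below only illustrates that rule: `execution_path` and its arguments are hypothetical, and how a runner is actually marked as signature-enforcing is not described here.

```python
def execution_path(target_runners, signature_enforcing):
    """Decide where a runbook can be run from, given its resolved runner names.

    Returns ("execute_runbook", []) when no target enforces signatures,
    otherwise ("console", [names of the runners that block agent execution]).
    """
    blocked = sorted(set(target_runners) & set(signature_enforcing))
    if blocked:
        return "console", blocked
    return "execute_runbook", []
```

An agent could apply the same reasoning after get_runbook to explain why a runbook has to be run from the console.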

What to put in a runbook

Diagnostics-then-remediation sequences you run more than twice: drain-and-restart a service tier, rotate a stuck consumer group, the standard triage ladder for a noisy alert. Keep one-off investigation ad-hoc — that's what the catalog itself is for.
