Runbooks

A runbook is a saved, versioned sequence of actions: the "how we restart the cache tier" knowledge that usually lives in a wiki page going stale. Here it's executable. Each step is a declared action with its arguments and target, every run is policy-gated and audited, and your LLM can read it as a playbook.

What a runbook is made of

An ordered list of steps. Each step names an action from your catalog, the arguments to pass, and where it runs — a specific runner or a runner group (a group step fans out to every matching runner). Because steps reference catalog actions, a runbook can never do anything your packs don't already declare — it's composition, not new capability.
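As a rough mental model, the shape described above can be sketched in Python. This is an illustration only, not the product's schema: the `Step` and `Runbook` types, the field names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    action: str           # name of an action declared in the catalog
    args: dict[str, str]  # arguments that action declares
    target: str           # a runner name or a runner group name


@dataclass
class Runbook:
    slug: str
    steps: list[Step] = field(default_factory=list)


def undeclared_actions(runbook: Runbook, catalog: set[str]) -> list[str]:
    """Return step actions the catalog does not declare (expected: none)."""
    return [s.action for s in runbook.steps if s.action not in catalog]


def expand_target(target: str, groups: dict[str, list[str]]) -> list[str]:
    """A group target fans out to every member; a runner name maps to itself."""
    return groups.get(target, [target])
```

A check like `undeclared_actions` is one way to picture "composition, not new capability": a step naming an action outside the catalog has nothing to run.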

Authoring

Create a runbook from Runbooks → New in the dashboard. The editor is a form, not a YAML textarea: pick an action from the catalog, fill in its declared arguments, choose the target, and reorder steps with the arrow controls. Owners and admins author and edit; operators and viewers (and connected LLMs) see published runbooks read-only.

  • Draft → published. Drafts are editable and invisible to dispatch. Publishing freezes that version; every later save bumps the version number, so "which revision ran during the incident" has an answer.
  • Title, slug, description. The description is what an LLM reads when it lists runbooks — write it like you'd brief a new on-call: when to reach for this, when not to.
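The draft-to-published lifecycle can be pictured with a small sketch. It takes the wording above literally (saves before the first publish do not change the version; every save after it bumps the number) and everything else is an assumption: the `VersionedRunbook` class and its methods are hypothetical, and whether a post-publish save stays a draft until republished is not stated in this page.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedRunbook:
    version: int = 1
    steps: tuple[str, ...] = ()
    # Frozen snapshots, keyed by the version number that was published.
    published: dict[int, tuple[str, ...]] = field(default_factory=dict)

    def save(self, steps: list[str]) -> int:
        # Once any version has been published, each save bumps the number
        # so the frozen snapshot is never overwritten.
        if self.published:
            self.version += 1
        self.steps = tuple(steps)
        return self.version

    def publish(self) -> int:
        # Publishing freezes the current steps under the current version.
        self.published[self.version] = self.steps
        return self.version

    def steps_at(self, version: int) -> tuple[str, ...]:
        """Answer "which revision ran during the incident"."""
        return self.published[version]
```

Keeping snapshots immutable (tuples here) is what lets an audit record refer to a version number and still mean something after later edits.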

Running one

Dispatch from the runbook's page with a required reason, the same as for any single action. Execution is sequential with a small wave of parallelism: up to five runs are in flight at a time, and the next wave dispatches as runs complete. Two properties matter operationally:

  • Policy applies per step. A runbook isn't a policy bypass: a high-risk step stops for approval exactly as it would standalone, and the runbook continues once a human approves.
  • Failure halts the sequence. A denied or failed step stops the runbook rather than marching on against a host in an unknown state. Every step's run is individually visible — output, exit code, duration — under the execution.
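The execution behavior above can be approximated in a few lines. This is a sketch under stated assumptions, not the scheduler itself: it assumes the parallelism applies to the runs a single step fans out to, it does not model the pause for approval, and `run_action` with its `"ok"` / `"failed"` / `"denied"` statuses is a hypothetical stand-in for a real dispatch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_IN_FLIGHT = 5  # from the text: up to five runs in flight at a time

# Hypothetical callback: (action, runner) -> "ok", "failed", or "denied".
RunAction = Callable[[str, str], str]


def execute(
    steps: list[tuple[str, list[str]]], run_action: RunAction
) -> list[tuple[str, str, str]]:
    """Run steps in order and stop after the first step with a non-ok run."""
    results: list[tuple[str, str, str]] = []
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        for action, runners in steps:
            # The pool keeps at most MAX_IN_FLIGHT calls running and starts
            # the next one as each finishes; results keep runner order.
            outcomes = list(pool.map(lambda r, a=action: run_action(a, r), runners))
            results.extend((action, r, o) for r, o in zip(runners, outcomes))
            if any(o != "ok" for o in outcomes):
                break  # a denied or failed run halts the sequence
    return results
```

Each tuple in the result corresponds to one individually visible run; later steps simply never appear once a step has failed.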

Runbooks and your LLM

Connected agents see two extra tools: list_runbooks (published runbooks with their descriptions) and get_runbook (one runbook's ordered steps, targets resolved to current runner names). The cloud deliberately does not auto-execute a runbook for the model — the agent dispatches each step itself, in order, through the normal action tools. That keeps every step inside the same policy, approval, and audit machinery, and lets the agent stop, reassess, or escalate between steps the way a human operator would.
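From the agent's side, that flow looks roughly like the loop below. Only `get_runbook` is named on this page; the `dispatch_action` tool, the step dictionary keys, the status strings, and the `should_continue` hook are hypothetical placeholders for whatever the normal action tools expose.

```python
from typing import Callable, Protocol


class Tools(Protocol):
    # get_runbook is named in the docs; the return shape here is assumed.
    def get_runbook(self, slug: str) -> list[dict]: ...

    # Hypothetical stand-in for the normal single-action dispatch tool.
    def dispatch_action(self, action: str, args: dict, target: str, reason: str) -> str: ...


def walk_runbook(
    tools: Tools,
    slug: str,
    reason: str,
    should_continue: Callable[[dict, str], bool],
) -> list[tuple[str, str]]:
    """Dispatch each step in order, reassessing after every one."""
    done: list[tuple[str, str]] = []
    for step in tools.get_runbook(slug):
        status = tools.dispatch_action(step["action"], step["args"], step["target"], reason)
        done.append((step["action"], status))
        # Stop on a non-ok result, or when the agent decides to reassess.
        if status != "ok" or not should_continue(step, status):
            break
    return done
```

Because every step goes through the same dispatch call as a standalone action, policy, approval, and audit apply without any runbook-specific path.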

What to put in a runbook

Diagnostics-then-remediation sequences you run more than twice: drain-and-restart a service tier, rotate a stuck consumer group, the standard triage ladder for a noisy alert. Keep one-off investigation ad-hoc; that's what the catalog itself is for.