SRE Programme · Operational tier — what stuck looks like

What this tier actually looks like.

You have SLOs on critical services. Postmortems are blameless. Runbooks exist for the obvious incidents. The team is reasonably proud of its on-call shape. Then someone asks: “when did we last freeze feature work because we burned the error budget?” The answer is silence, because error budgets are a number on a dashboard, not a policy with consequences.

You probably have:

SLOs on the top 3–5 customer-facing services, measured from infrastructure metrics or basic latency proxies.
Blameless postmortems for major incidents; action items tracked, inconsistently closed.
Runbooks per service; quality varies; some haven’t been opened in a year.
Quarterly game-days, sometimes.
On-call rotations that mostly work; pager load anecdotally high on a couple of teams.
Toil that is acknowledged but neither tracked nor budgeted against.

Why most teams get stuck here.

Operational-tier SRE programmes stall because error budgets aren’t enforced. Three patterns:

Error budgets as a metric, not a policy. If nothing happens when the budget burns, it’s decoration.
Postmortem action items that don’t close. Without closure tracking, postmortems stop being a learning loop.
Toil as folklore. If nobody tracks toil-percentage, the team can’t defend reliability investment against feature pressure.

The three substrate moves to the next tier.

1. Define error budgets for the top 3 services. Agree the policy.

Per the Google SRE Workbook: decide what a burnt budget triggers. A feature freeze? Reduced change velocity? An automatic pause on risky deploys? Pick the consequences before the budget is burnt, and agree them with product. The Error Budget calculator helps size the conversation.
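A minimal sketch of the arithmetic behind that conversation, assuming a request-based availability SLO. The names (`ErrorBudget`, `policy_action`) and the 75% and 100% thresholds are illustrative assumptions, not values from the SRE Workbook; the real thresholds are whatever you agree with product.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float     # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int   # requests served in the SLO window
    failed_requests: int  # requests that violated the SLO

    @property
    def allowed_failures(self) -> Fraction:
        # Fraction(str(...)) keeps 0.999 exact, so sitting exactly on
        # the budget is not misjudged by float rounding.
        return (1 - Fraction(str(self.slo_target))) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        allowed = self.allowed_failures
        if allowed == 0:
            return float("inf") if self.failed_requests else 0.0
        return float(Fraction(self.failed_requests) / allowed)


def policy_action(budget: ErrorBudget,
                  slow_down_at: float = 0.75,  # hypothetical threshold
                  freeze_at: float = 1.0) -> str:  # hypothetical threshold
    """Map budget consumption to a pre-agreed consequence."""
    used = budget.consumed_fraction
    if used >= freeze_at:
        return "freeze"
    if used >= slow_down_at:
        return "slow_down"
    return "normal"
```

With a 99.9% SLO over 1,000,000 requests, the budget is 1,000 failed requests. The point of the function is that the mapping from burn to consequence is written down before the burn happens.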

2. Track toil. Target <50%. Fund the automation.

Per the Google SRE book, Ch. 5. Categorise team work each sprint and report toil-percentage alongside feature delivery. Above 50%, investment in automation is funded explicitly, not deferred as “we’ll do it when we have time.”
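A rough sketch of the per-sprint calculation, assuming work is logged as (category, hours) pairs and that one category is literally named "toil". The category names and function names are hypothetical; use whatever your tracker already records.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def toil_percentage(work_items: Iterable[Tuple[str, float]]) -> float:
    """Return the share of logged hours spent on toil, as a percentage."""
    hours = defaultdict(float)
    for category, spent in work_items:
        hours[category] += spent
    total = sum(hours.values())
    if total == 0:
        return 0.0  # nothing logged this sprint
    return 100.0 * hours["toil"] / total


def needs_automation_funding(pct: float, threshold: float = 50.0) -> bool:
    """True when toil is above the threshold (50% per the target above)."""
    return pct > threshold


sprint = [("toil", 40.0), ("feature", 20.0), ("reliability", 20.0)]
pct = toil_percentage(sprint)
```

Reporting `pct` next to feature delivery each sprint is what turns toil from folklore into a number someone can defend.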

3. Build runbooks for the top 10 alert types. Test one in a game-day.

Runbooks tested in game-days survive the 3am page. Untested runbooks decay. Run a game-day every quarter and rotate which runbook gets exercised. This closes the gap between “we have runbooks” and “our on-call rotation actually uses them.”
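One way to make the rotation mechanical is to always exercise the runbook that has gone longest without a test. This is a sketch under that assumption; the function name and the runbook names are made up for illustration.

```python
from datetime import date
from typing import Dict, Optional


def next_runbook_to_exercise(last_tested: Dict[str, Optional[date]]) -> str:
    """Pick the runbook for the next game-day.

    Never-tested runbooks (None) come first, then the oldest test date;
    ties are broken alphabetically so the choice is deterministic.
    """
    if not last_tested:
        raise ValueError("no runbooks registered")
    return min(
        last_tested,
        key=lambda name: (
            last_tested[name] is not None,   # False sorts before True
            last_tested[name] or date.min,
            name,
        ),
    )


# Hypothetical runbook inventory.
runbooks = {
    "disk-full": date(2024, 1, 15),
    "cert-expiry": None,
    "db-failover": date(2023, 10, 2),
}
chosen = next_runbook_to_exercise(runbooks)
```

After each game-day, record the date against the exercised runbook so the next quarter picks a different one.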

What changes when you cross.

Error budgets change behaviour. The first time the org freezes features because of a burn, the policy becomes real.
Toil-percentage drops below 50%. Team capacity to invest in reliability compounds.
Pager load becomes survivable. Tested runbooks + actionable alerts reduce burnout-driven turnover.
Postmortem action items close within SLA. The learning loop becomes operational.
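For the last point, closure tracking can be as simple as the sketch below. The 30-day SLA, the `ActionItem` shape, and the function names are assumptions for illustration; the source does not specify an SLA length.

```python
from datetime import date, timedelta
from typing import Dict, List, NamedTuple, Optional


class ActionItem(NamedTuple):
    item_id: str
    opened: date
    closed: Optional[date] = None  # None means still open


def breaches_sla(item: ActionItem, today: date, sla_days: int = 30) -> bool:
    """True if the item closed late, or is still open past its deadline."""
    deadline = item.opened + timedelta(days=sla_days)
    finished = item.closed if item.closed is not None else today
    return finished > deadline


def closure_report(items: List[ActionItem], today: date,
                   sla_days: int = 30) -> Dict[str, object]:
    """Summarise how many action items are within SLA and which are not."""
    breached = [i.item_id for i in items if breaches_sla(i, today, sla_days)]
    # Open items that have not yet hit their deadline count as within SLA.
    rate = (len(items) - len(breached)) / len(items) if items else 1.0
    return {"within_sla_rate": rate, "breached": breached}


items = [
    ActionItem("A", date(2024, 5, 1), date(2024, 5, 20)),
    ActionItem("B", date(2024, 4, 1), date(2024, 6, 1)),
    ActionItem("C", date(2024, 5, 15)),
    ActionItem("D", date(2024, 6, 20)),
]
report = closure_report(items, today=date(2024, 6, 30))
```

The `breached` list is the part worth reviewing in a recurring meeting; the rate is the trend line.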

This is the Disciplined tier. DORA “High” cluster. The jump from Disciplined to Engineered (DORA “Elite”) is platform-level: golden signals inherited, blast-radius as a design constraint, chaos engineering as a habit. See the Platform Engineering IDP reference architecture.

Run the diagnostic.

To find out whether your team scores at this tier or another, run the SRE Programme diagnostic. It takes 2–4 minutes and surfaces both your overall tier and the capability breakdown that shows you where the move starts.

For the bigger picture: the compound diagnostic takes results from all six diagnostics and shows you the substrate gap that bounds your overall delivery, not the per-discipline symptom.
