90-day playbook

Cloud cost: Aware to Controlled in a quarter.

If you're at the FinOps "Aware" tier (per the FinOps Foundation's State of FinOps 2024, ~42% of orgs are), you can see the bill but can't yet control it. This playbook is the substrate change that gets you to "Controlled", the 20-35% YoY savings tier.

Audience: Engineering leader + FinOps lead (or whoever inherited the cloud-cost meeting)
Pre-req: Tagging exists on most resources. CFO dashboard exists. Some commitments (Savings Plans / CUDs) in place.
End state: Per-service cost in the developer's daily view · commitment coverage 70-90% on quarterly review · named owner per service spend · cost spike >30% triggers owner alert as incident · YoY savings trending 20-35% from baseline.
Re-run diagnostic at week 13: Cloud Cost diagnostic
Phase 1
Weeks 1–4

Tag, attribute, name.

Phase 1 makes cost visible at the point of decision — to the engineers making the architecture choices, not just the CFO. Without this, optimisation later is local-and-temporary.

Week 1

Tagging discipline — non-negotiable scheme.

  • Canonical scheme: service · owner · environment · cost-centre · data-classification
  • Enforced at creation: AWS tag policies / SCPs (AWS Config rules detect untagged resources but do not block them) / Azure Policy / GCP Org Policy that denies untagged resources
  • Backfill: existing untagged resources tagged within 2 weeks. Owner = nearest team or quarantine
Gate 1 · >95% of resources tagged with canonical scheme

Untagged-resource list trending to zero. New untagged creation blocked at admission.
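
A minimal sketch of how the Gate 1 number could be computed from a resource inventory. It assumes you can export each resource's tags as a plain key/value mapping; the function names are illustrative, not part of any cloud SDK.

```python
from typing import Iterable, Mapping

# Canonical tag keys from the scheme above.
REQUIRED_TAGS = ("service", "owner", "environment", "cost-centre", "data-classification")


def missing_tags(tags: Mapping[str, str]) -> list[str]:
    """Return the required keys that are absent or blank on one resource."""
    return [key for key in REQUIRED_TAGS if not (tags.get(key) or "").strip()]


def tag_coverage(resources: Iterable[Mapping[str, str]]) -> float:
    """Fraction of resources carrying every required tag (0.0 for an empty inventory)."""
    inventory = list(resources)
    if not inventory:
        return 0.0
    compliant = sum(1 for tags in inventory if not missing_tags(tags))
    return compliant / len(inventory)


def gate_1_passed(resources: Iterable[Mapping[str, str]], threshold: float = 0.95) -> bool:
    """Gate 1: more than 95% of resources tagged with the canonical scheme."""
    return tag_coverage(resources) > threshold
```

Running missing_tags over the inventory also yields the backfill list: every resource with a non-empty result still needs an owner or a quarantine decision.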

Week 2

Per-service cost attribution — in dev's daily view.

  • Compute attribution per service: direct (tagged) + shared (allocated by usage)
  • Storage / data egress per service: often largest hidden costs — surface them
  • Surface in dev's daily flow: Backstage cost plugin · Cortex catalogue · Slack daily digest · internal portal
  • Tools: Kubecost / OpenCost (K8s) · Vantage · CloudHealth · native cost-explorer with per-tag breakdowns
Gate 2 · Engineers can see per-service cost without asking finance

Survey 5 engineers: each should be able to find their service's monthly cost in <30 seconds. Without this, optimisation behaviour doesn't change.
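
The "direct + shared" attribution can be sketched as a proportional split. This is one possible allocation rule, assuming a single usage metric per service (for example CPU-hours or requests); the function name and inputs are hypothetical.

```python
def allocate_shared_cost(direct: dict, usage: dict, shared_cost: float) -> dict:
    """Return per-service cost: direct (tagged) spend plus a usage-weighted share of shared spend.

    direct:      {service: directly tagged cost}
    usage:       {service: usage units on the shared platform}
    shared_cost: untagged or platform cost to distribute
    """
    services = sorted(set(direct) | set(usage))
    total_usage = sum(usage.values())
    result = {}
    for service in services:
        if total_usage > 0:
            share = shared_cost * usage.get(service, 0) / total_usage
        else:
            # No usage data at all: fall back to an even split (an assumption).
            share = shared_cost / len(services)
        result[service] = round(direct.get(service, 0.0) + share, 2)
    return result
```

For example, with direct costs of 1000 and 400, usage of 300 and 100 units, and 800 of shared spend, the services are charged 1600 and 600. Rounding to cents can leave the total a cent or two off the bill.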

Week 3

Named owner per service spend.

'The team that built it' isn't ownership. A named individual or rotating role is.

  • Service catalogue: each service has cost-owner annotation
  • Reporting line: owner sees monthly trend + reviews cost spikes >30% as an incident
  • Cultural shift: cost is a reliability metric, not a finance ask. Treat a cost spike like an SLO violation
Gate 3 · Owner annotation on every service

Catalogue coverage 100%. Owner aware of their monthly trend.

Week 4

Cost-spike alerts — alert to the owner, not the inbox.

  • Threshold: default 30% MoM spike triggers alert. Tunable per service
  • Routing: directly to service owner via Slack / PagerDuty. Not to a central finance inbox
  • SLA: 48-hour expected response. Postmortem if unresolved
  • Tooling: native (AWS Budgets · Azure Cost Anomaly · GCP Recommender) · Vantage anomaly · Datadog cost monitors
Gate 4 · Cost-spike pipeline tested

Inject a deliberate test spike. Alert reaches owner within 30 minutes. Drill the response process once.
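
The threshold rule above can be sketched as a month-over-month comparison with a per-service override. This is a simplified illustration, not how any of the listed tools detect anomalies; the names are hypothetical.

```python
from typing import Optional

DEFAULT_THRESHOLD = 0.30  # 30% MoM, the default from the playbook


def mom_change(previous: float, current: float) -> Optional[float]:
    """Relative month-over-month change, or None when there is no baseline."""
    if previous <= 0:
        return None
    return (current - previous) / previous


def find_spikes(previous: dict, current: dict, thresholds: Optional[dict] = None) -> list:
    """Return (service, change) pairs whose MoM increase exceeds their threshold, largest first."""
    thresholds = thresholds or {}
    spikes = []
    for service, cost in current.items():
        change = mom_change(previous.get(service, 0.0), cost)
        if change is None:
            continue  # new service with no baseline; handle separately
        if change > thresholds.get(service, DEFAULT_THRESHOLD):
            spikes.append((service, round(change, 4)))
    return sorted(spikes, key=lambda item: item[1], reverse=True)
```

Each returned pair would then be routed to the service owner's channel rather than a shared inbox. A deliberately inflated value in current is one way to drive the Gate 4 test spike.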

Phase 2
Weeks 5–8

Quick wins & idle hunt.

Phase 2 captures the 5–15% bill reduction that's pure idle waste and over-provisioning. No architectural change required.

Week 5

Hunt idle: unattached disks, orphaned snapshots, unused load balancers.

  • Per cloud: unattached EBS volumes · orphaned snapshots · idle ELB/NLB · unused EIPs · incomplete S3 multipart uploads
  • Tooling: AWS Trusted Advisor · Azure Advisor · GCP Recommender · cloud-custodian (multi-cloud)
  • Auto-quarantine: tag for deletion after 7 days of zero use; delete after 30 days unless owner objects
  • Typical recovery: 5-15% of bill
Gate 5 · Idle-waste rate <2% of bill, holding

Quarantine pipeline live. Idle measured monthly; trend held.
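
The auto-quarantine rule can be sketched as a small decision function. It assumes both windows are counted from the last day of use, which the playbook does not spell out, and the action names are illustrative.

```python
from datetime import date

QUARANTINE_AFTER_DAYS = 7   # tag for deletion after 7 days of zero use
DELETE_AFTER_DAYS = 30      # delete after 30 days unless the owner objects


def idle_action(last_used: date, today: date, owner_objected: bool = False) -> str:
    """Decide what the quarantine pipeline should do with one resource."""
    idle_days = (today - last_used).days
    if idle_days >= DELETE_AFTER_DAYS:
        # An objection keeps the resource but leaves it flagged for review.
        return "keep-quarantined" if owner_objected else "delete"
    if idle_days >= QUARANTINE_AFTER_DAYS:
        return "quarantine"
    return "keep"
```

Keeping the decision separate from the cloud API calls makes the policy easy to test before anything is actually deleted.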

Week 6

Dev / staging environments — auto-stop outside business hours.

  • Schedule: a scheduled Lambda / Function App / Cloud Function stops non-prod compute at 7pm and starts it at 7am local time. Weekends off entirely
  • Exceptions: nightly-batch services · always-on staging serving real partners. Tagged always-on=true
  • Typical recovery: 30-50% of non-prod compute spend
Gate 6 · Non-prod compute spend dropped 30%+ MoM

Measured. Exception list reviewed monthly.
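
The schedule logic the function would apply can be sketched as below. It assumes the environment tag uses the value "prod" for production and that the caller passes a timestamp already converted to local time; both are assumptions, as is the function name.

```python
from datetime import datetime

START_HOUR = 7   # 7am local
STOP_HOUR = 19   # 7pm local


def should_be_running(now_local: datetime, tags: dict) -> bool:
    """Return True if the resource should be up at the given local time."""
    if tags.get("environment") == "prod":
        return True  # production is never auto-stopped
    if tags.get("always-on") == "true":
        return True  # documented exception (nightly batch, partner-facing staging)
    if now_local.weekday() >= 5:
        return False  # Saturday or Sunday: off entirely
    return START_HOUR <= now_local.hour < STOP_HOUR
```

The scheduled function would compare this result with each resource's actual state and issue a stop or start call only when they differ.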

Week 7

Storage lifecycle policies — nothing stays hot by default.

  • S3 / GCS / Blob: default lifecycle policy moves objects to cooler tiers (IA → Glacier / Nearline → Coldline) on age
  • Per-bucket override only with documented reason. data-classification tag drives the default policy
  • Log retention: trim CloudTrail / VPC Flow / app logs after regulator-required retention
  • Typical recovery: 10-30% of storage spend
Gate 7 · Default lifecycle policy applied to >90% of storage

Override list with documented reason + owner. Measured spend dropped.
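
A sketch of how the data-classification tag could drive the default policy. The age thresholds, classification values, and tier names here are placeholders chosen for illustration, not recommendations from the playbook; map them to your provider's storage classes and your regulator's retention rules.

```python
from typing import Optional

# Hypothetical defaults: (minimum age in days, target tier), coldest rule first.
DEFAULT_POLICIES = {
    "default": [(90, "archive"), (30, "infrequent")],
    "regulated": [(365, "archive"), (90, "infrequent")],
}


def target_tier(age_days: int, data_classification: str = "default",
                override: Optional[str] = None) -> str:
    """Return the tier an object should sit in, given its age and classification."""
    if override is not None:
        return override  # per-bucket override; should carry a documented reason
    rules = DEFAULT_POLICIES.get(data_classification, DEFAULT_POLICIES["default"])
    for min_age, tier in rules:
        if age_days >= min_age:
            return tier
    return "hot"
```

Unknown classifications fall back to the default policy, so nothing stays hot simply because a tag value was unexpected.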

Week 8

Right-size the biggest 20 workloads.

  • Pareto: top 20 services usually account for 60-80% of compute spend. Focus there
  • Per service: 30-day p95 CPU / memory utilisation. If <30%, downsize instance class one tier
  • K8s: Vertical Pod Autoscaler (VPA) in recommendation-only mode (updateMode: "Off") → adjust requests / limits
  • Database: RDS / Cloud SQL right-sizing recommendations applied
Gate 8 · Top 20 right-sized; aggregate utilisation >40%

Per-service utilisation tracked. Right-sizing recommendations re-run quarterly.
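
The per-service check can be sketched with a nearest-rank p95. The playbook does not say whether CPU and memory must both be under 30%; this sketch assumes both, so a memory-bound service is not downsized on CPU alone. Function names are illustrative.

```python
def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of utilisation samples (percent)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def rightsizing_advice(cpu_samples: list, mem_samples: list, threshold: float = 30.0) -> str:
    """Recommend a one-tier downsize when both p95 CPU and p95 memory are under the threshold."""
    if p95(cpu_samples) < threshold and p95(mem_samples) < threshold:
        return "downsize-one-tier"
    return "keep"
```

Feed it 30 days of samples per service and re-run after each change; dropping one tier at a time keeps the step reversible.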

Phase 3
Weeks 9–12

Commitments, governance, quarterly cadence.

Phase 3 builds the structural levers: commitment portfolio, cost-aware architecture decisions, and the governance cadence that keeps you at Controlled rather than slipping back to Aware.

Week 9

Commitment-coverage analysis & first quarterly tune.

Per the FinOps Foundation, commitment coverage of 70-90% of steady-state usage is the sweet spot. Below 50% you're leaving 15-25% on the table; above 95% you're locked in.

  • Use the Cloud Commitment Optimiser with your actual steady-state numbers
  • Stagger commitments: mix of 1-year and 3-year. Avoid a single huge cliff
  • Quarterly review baked into the calendar: first week of every quarter, FinOps + engineering leads
Gate 9 · Coverage 70-90% on steady-state; cliff staggered

Coverage measured. Commitment expiry calendar visible to leadership. Quarterly review recurring.
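
The coverage bands can be sketched as a simple classifier. Only the 50%, 70-90%, and 95% boundaries come from the text above; the band labels and the treatment of the in-between ranges are assumptions.

```python
def commitment_coverage(committed_hourly: float, steady_state_hourly: float) -> float:
    """Share of steady-state spend covered by commitments (Savings Plans / CUDs)."""
    if steady_state_hourly <= 0:
        raise ValueError("steady-state spend must be positive")
    return committed_hourly / steady_state_hourly


def coverage_band(coverage: float) -> str:
    """Classify coverage against the 70-90% target range."""
    if coverage < 0.50:
        return "under-committed"   # leaving savings on the table
    if coverage < 0.70:
        return "below-target"
    if coverage <= 0.90:
        return "target"
    if coverage <= 0.95:
        return "above-target"
    return "over-committed"        # locked in
```

For example, 80 of committed spend against 100 of steady-state spend is 0.8 coverage, which lands in the target band.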

Week 10

Egress hunt — the cost most teams ignore.

  • VPC Flow / Cloud Logging traffic-by-source analysis. Identify top 10 egress paths
  • Common wins: data leaving region unnecessarily · cross-AZ traffic that could be co-located · CDN-eligible content served from origin
  • VPC Endpoints / Private Service Connect to keep traffic on-network
  • Tag and chargeback egress per service. Make it visible to owners
Gate 10 · Top 10 egress paths analysed; top-3 mitigated

Egress spend trending. Owners aware of their egress cost: 90%+ of services.
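
Finding the top egress paths is a group-and-rank over flow records. This sketch assumes the logs have already been parsed into (source, destination, bytes) tuples; the parsing step and the labels used for source and destination are up to you.

```python
from collections import Counter
from typing import Iterable, Tuple


def top_egress_paths(flows: Iterable[Tuple[str, str, int]], n: int = 10) -> list:
    """Sum bytes per (source, destination) pair and return the n largest paths."""
    totals = Counter()
    for source, destination, byte_count in flows:
        totals[(source, destination)] += byte_count
    return totals.most_common(n)
```

Using service names or region/AZ labels as source and destination, rather than raw IPs, makes cross-AZ and cross-region paths stand out directly.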

Week 11

Cost-of-design at architecture review.

  • Architecture review template requires estimated cost-per-month at decision time. Reviewer can challenge.
  • Pre-approved patterns (paved paths) carry their own cost profile in the documentation
  • Cost regression flagged in PRs for IaC changes: estimated $/month delta shown to reviewer (Infracost · custom tooling)
Gate 11 · Last 5 architecture reviews include cost estimate

Cost is a property of design, not a post-hoc finding. Pattern documented for future reviewers.
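
The $/month delta shown to a reviewer can be sketched as a comparison of resource counts before and after a change. This is a toy model with a hypothetical price table, not how Infracost works internally; the names and the 100 $/month flag threshold are assumptions.

```python
HOURS_PER_MONTH = 730  # common approximation for a billing month


def monthly_cost(resources: dict, hourly_prices: dict) -> float:
    """Estimated monthly cost for {resource_type: count} at the given hourly prices."""
    return round(sum(hourly_prices[kind] * count * HOURS_PER_MONTH
                     for kind, count in resources.items()), 2)


def cost_delta(before: dict, after: dict, hourly_prices: dict) -> float:
    """Estimated change in $/month introduced by an IaC change."""
    return round(monthly_cost(after, hourly_prices) - monthly_cost(before, hourly_prices), 2)


def review_note(delta: float, flag_above: float = 100.0) -> str:
    """One-line summary for the PR; flags increases above the threshold."""
    label = "COST REGRESSION" if delta > flag_above else "ok"
    return f"{label}: estimated {delta:+.2f} $/month"
```

The same estimate can be pasted into the architecture review template, so the number a reviewer challenges is the one the PR will later be checked against.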

Week 12

FinOps cadence operationalised — this is what makes it stick.

  • Weekly: automated cost-by-service report to leadership. Anomalies highlighted
  • Monthly: FinOps + engineering leads review · top-3 cost-attention items selected · owners assigned
  • Quarterly: commitment portfolio tune · paved-path cost profile review · capacity / commitment alignment
  • Half-yearly: board / executive report on YoY savings + investment ROI
Gate 12 · Calendar invites exist for all four cadences, and each has happened at least once

Cadence beats heroics. Without the calendar, the discipline decays in 6 months.

End of week 13.

Three substrate moves; one quarter. By week 13 the FinOps function stops being finance-driven and becomes engineering-driven. Cost is now a property of the architecture, not a monthly surprise.
