Cloud cost: Aware to Controlled in a quarter.
If you're at the FinOps "Aware" tier (per FinOps Foundation State of FinOps 2024, ~42% of orgs are), you can see the bill but can't yet control it. This playbook is the substrate change that gets you to "Controlled": the 20-35% YoY savings tier.
Tag, attribute, name.
Phase 1 makes cost visible at the point of decision — to the engineers making the architecture choices, not just the CFO. Without this, optimisation later is local-and-temporary.
Tagging discipline — non-negotiable scheme.
- Canonical scheme: service · owner · environment · cost-centre · data-classification
- Enforced at creation: AWS Config rule / Azure Policy / GCP Org Policy that denies untagged resources
- Backfill: existing untagged resources tagged within 2 weeks. Owner = nearest team or quarantine
Untagged-resource list trending to zero. New untagged creation blocked at admission.
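The admission check above can be sketched in a few lines. This is a minimal illustration, not a cloud policy definition: the function names `missing_tags` and `admit` are hypothetical, and treating blank values as missing is an assumption. The five keys come from the canonical scheme.

```python
# Canonical tag keys from the scheme above.
REQUIRED_TAGS = ("service", "owner", "environment", "cost-centre", "data-classification")


def missing_tags(tags: dict) -> list:
    """Return the required keys that are absent or blank, in scheme order."""
    return [key for key in REQUIRED_TAGS if not (tags.get(key) or "").strip()]


def admit(tags: dict) -> bool:
    """Admit a resource only when every required tag has a non-blank value."""
    return not missing_tags(tags)
```

The same `missing_tags` output can feed the backfill list: anything it returns for an existing resource is work for the owning team or the quarantine queue.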
Per-service cost attribution — in dev's daily view.
- Compute attribution per service: direct (tagged) + shared (allocated by usage)
- Storage / data egress per service: often largest hidden costs — surface them
- Surface in dev's daily flow: Backstage cost plugin · Cortex catalogue · Slack daily digest · internal portal
- Tools: Kubecost / OpenCost (K8s) · Vantage · CloudHealth · native cost-explorer with per-tag breakdowns
Survey 5 engineers. They can find their service's monthly cost in <30 seconds. Without this, optimisation behaviour doesn't change.
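As a rough sketch of "direct (tagged) + shared (allocated by usage)": each service keeps its directly tagged cost and receives a slice of the shared pool in proportion to a usage metric. The function name and the choice of metric (CPU-hours, requests, bytes) are assumptions; tools such as Kubecost or OpenCost apply their own allocation models.

```python
def attribute_costs(direct: dict, shared_total: float, usage: dict) -> dict:
    """Per-service cost = direct tagged cost + usage-proportional share of shared cost."""
    total_usage = sum(usage.values())
    result = {}
    for service in sorted(set(direct) | set(usage)):
        if total_usage > 0:
            share = shared_total * usage.get(service, 0.0) / total_usage
        else:
            share = 0.0  # no usage data: the shared pool stays unallocated
        result[service] = direct.get(service, 0.0) + share
    return result
```

For example, with direct costs of 100 and 50, a shared pool of 60, and usage weights of 2 and 1, the services are attributed 140 and 70 respectively.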
Named owner per service spend.
'The team that built it' isn't ownership. A named individual or rotating role is.
- Service catalogue: each service has cost-owner annotation
- Reporting line: owner sees monthly trend + reviews cost spikes >30% as an incident
- Cultural shift: cost is a reliability metric, not a finance ask. Treat a spike like an SLO violation
Catalogue coverage 100%. Owner aware of their monthly trend.
Cost-spike alerts — alert to the owner, not the inbox.
- Threshold: default 30% MoM spike triggers alert. Tunable per service
- Routing: directly to service owner via Slack / PagerDuty. Not to a central finance inbox
- SLA: 48-hour expected response. Postmortem if unresolved
- Tooling: native (AWS Budgets · Azure Cost Anomaly · GCP Recommender) · Vantage anomaly · Datadog cost monitors
Inject a deliberate test spike. Alert reaches owner within 30 minutes. Drill the response process once.
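The default threshold can be expressed as a simple month-over-month check. This is a sketch under stated assumptions: `mom_change` and `is_spike` are hypothetical names, the comparison is strictly greater than the threshold, and a service with no prior-month spend is skipped rather than alerted on.

```python
from typing import Optional


def mom_change(previous: float, current: float) -> Optional[float]:
    """Fractional month-over-month change, or None when there is no baseline."""
    if previous <= 0:
        return None
    return (current - previous) / previous


def is_spike(previous: float, current: float, threshold: float = 0.30) -> bool:
    """True when spend grew by more than the threshold (default 30% MoM)."""
    change = mom_change(previous, current)
    return change is not None and change > threshold
```

The `threshold` argument is where per-service tuning would plug in; routing the resulting alert to the owner is left to the alerting tool.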
Quick wins & idle hunt.
Phase 2 captures the 5–15% bill reduction that's pure idle waste and over-provisioning. No architectural change required.
Hunt idle: unattached disks, orphaned snapshots, unused load balancers.
- AWS examples (equivalents exist on other clouds): unattached EBS volumes · orphaned snapshots · idle ELB / NLB · unused EIPs · incomplete S3 multipart uploads
- Tooling: AWS Trusted Advisor · Azure Advisor · GCP Recommender · cloud-custodian (multi-cloud)
- Auto-quarantine: tag for deletion after 7 days of zero use; delete after 30 days unless owner objects
- Typical recovery: 5-15% of bill
Quarantine pipeline live. Idle measured monthly; trend held.
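The quarantine rule can be sketched as a small decision function. It assumes both the 7-day and 30-day windows are counted from the resource's last recorded use, which the text does not state explicitly; the function name and the returned labels are hypothetical.

```python
from datetime import date


def quarantine_action(last_used: date, today: date, owner_objected: bool = False) -> str:
    """Decide what the idle-resource pipeline should do with one resource."""
    idle_days = (today - last_used).days
    if idle_days >= 30 and not owner_objected:
        return "delete"
    if idle_days >= 7:
        return "quarantine"  # tag for deletion and notify the owner
    return "keep"
```

An owner objection keeps the resource in quarantine rather than releasing it, so it still shows up in the monthly idle measurement.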
Dev / staging environments — auto-stop outside business hours.
- Lambda / Function App / Cloud Function scheduled to stop non-prod compute at 7pm and start at 7am local. Weekends off entirely
- Exceptions: nightly-batch services · always-on staging serving real partners. Tagged always-on=true
- Typical recovery: 30-50% of non-prod compute spend
Measured. Exception list reviewed monthly.
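A scheduled function could decide what to stop or start with logic along these lines. This is a sketch only: `should_run` is a hypothetical name, the caller is assumed to pass the resource's local time, and the actual stop/start calls to the cloud API are omitted.

```python
from datetime import datetime


def should_run(now_local: datetime, tags: dict) -> bool:
    """Non-prod compute runs 7am-7pm on weekdays unless tagged always-on=true."""
    if tags.get("always-on") == "true":
        return True
    if now_local.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return False
    return 7 <= now_local.hour < 19
```

Running this on a schedule (for example hourly) and reconciling actual state against the result means a missed run self-corrects on the next one.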
Storage lifecycle policies — nothing stays hot by default.
- S3 / GCS / Blob: default lifecycle policy moves objects to cooler tiers (IA → Glacier / Nearline → Coldline) on age
- Per-bucket override only with documented reason. data-classification tag drives the default policy
- Log retention: trim CloudTrail / VPC Flow / app logs after regulator-required retention
- Typical recovery: 10-30% of storage spend
Override list with documented reason + owner. Measured spend dropped.
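One way to let the data-classification tag drive the default policy is a lookup table that produces a lifecycle rule. The sketch below builds a dictionary shaped like an S3 lifecycle rule; the classification values and the day thresholds are placeholders, not recommendations, and applying the rule to a bucket is left out.

```python
# Hypothetical classification values mapped to (infrequent-access, archive) ages in days.
DEFAULT_TRANSITIONS = {
    "public": (30, 90),
    "internal": (30, 180),
    "regulated": (90, 365),
}


def lifecycle_rule(classification: str) -> dict:
    """Build a default age-based tiering rule for one data classification."""
    ia_days, archive_days = DEFAULT_TRANSITIONS[classification]
    return {
        "ID": f"default-{classification}",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix: applies to every object
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": archive_days, "StorageClass": "GLACIER"},
        ],
    }
```

An unknown classification raises `KeyError`, which is deliberate in this sketch: a bucket without a recognised classification should be fixed at the tagging step rather than silently given a policy.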
Right-size the biggest 20 workloads.
- Pareto: top 20 services usually account for 60-80% of compute spend. Focus there
- Per service: 30-day p95 CPU / memory utilisation. If <30%, downsize instance class one tier
- K8s: Vertical Pod Autoscaler (VPA) in recommendation-only mode (updateMode: "Off") → adjust requests / limits
- Database: RDS / Cloud SQL right-sizing recommendations applied
Per-service utilisation tracked. Right-sizing recommendations re-run quarterly.
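The per-service rule can be sketched as a nearest-rank p95 over 30 days of utilisation samples. Two assumptions are made here: samples are percentages, and both CPU and memory must sit below the threshold before a downsize is suggested (the text leaves open whether either alone is enough). The function names are hypothetical.

```python
def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def should_downsize(cpu: list, memory: list, threshold: float = 30.0) -> bool:
    """Suggest dropping one instance tier when p95 CPU and p95 memory are both low."""
    return p95(cpu) < threshold and p95(memory) < threshold
```

Using p95 rather than the mean keeps short bursts in view, so a service that is idle most of the day but saturates during a batch window is not downsized by mistake.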
Commitments, governance, quarterly cadence.
Phase 3 builds the structural levers: commitment portfolio, cost-aware architecture decisions, and the governance cadence that keeps you at Controlled rather than slipping back to Aware.
Commitment-coverage analysis & first quarterly tune.
Per FinOps Foundation, commitment coverage of 70-90% of steady-state usage is the sweet spot. Below 50% you're leaving 15-25% on the table; above 95% you're locked in.
- Use the Cloud Commitment Optimiser with your actual steady-state numbers
- Stagger commitments: mix of 1-year and 3-year. Avoid a single huge cliff
- Quarterly review baked into the calendar: first week of every quarter, FinOps + engineering leads
Coverage measured. Commitment expiry calendar visible to leadership. Quarterly review recurring.
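The coverage bands quoted above reduce to a ratio and a few comparisons. In this sketch the band labels are hypothetical, and the ranges the text does not name (50-70% and 90-95%) are simply flagged for review.

```python
def coverage(committed: float, steady_state: float) -> float:
    """Share of steady-state usage covered by commitments (same unit for both)."""
    if steady_state <= 0:
        raise ValueError("steady-state usage must be positive")
    return committed / steady_state


def coverage_band(ratio: float) -> str:
    """Classify a coverage ratio against the bands quoted in the text."""
    if ratio < 0.50:
        return "under-committed"
    if ratio > 0.95:
        return "over-committed"
    if 0.70 <= ratio <= 0.90:
        return "target"
    return "review"
```

For instance, 80 units committed against a steady state of 100 gives a ratio of 0.8, which falls in the target band.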
Egress hunt — the cost most teams ignore.
- VPC Flow / Cloud Logging traffic-by-source analysis. Identify top 10 egress paths
- Common wins: data leaving region unnecessarily · cross-AZ traffic that could be co-located · CDN-eligible content served from origin
- VPC Endpoints / Private Service Connect to keep traffic on-network
- Tag and chargeback egress per service. Make it visible to owners
Egress spend trending. Owner-aware-of-egress: 90%+ of services.
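Once flow logs are exported, finding the top egress paths is an aggregation problem. The sketch below assumes each record has already been parsed into a hypothetical `(source, destination, bytes_out)` tuple; real VPC Flow or Cloud Logging records need their own parsing and a filter for traffic that actually leaves the network.

```python
from collections import Counter


def top_egress_paths(records, n: int = 10):
    """Sum bytes per (source, destination) path and return the n largest."""
    totals = Counter()
    for source, destination, bytes_out in records:
        totals[(source, destination)] += bytes_out
    return totals.most_common(n)
```

Mapping each source back to its service tag turns this list into the per-service egress view that owners need.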
Cost-of-design at architecture review.
- Architecture review template requires estimated cost-per-month at decision time. Reviewer can challenge.
- Pre-approved patterns (paved paths) carry their own cost profile in the documentation
- Cost regression flagged in PRs for IaC changes: estimated $/month delta shown to reviewer (Infracost · custom tooling)
Cost is a property of design, not a post-hoc finding. Pattern documented for future reviewers.
FinOps cadence operationalised — this is what makes it stick.
- Weekly: automated cost-by-service report to leadership. Anomalies highlighted
- Monthly: FinOps + engineering leads review · top-3 cost-attention items selected · owners assigned
- Quarterly: commitment portfolio tune · paved-path cost profile review · capacity / commitment alignment
- Half-yearly: board / executive report on YoY savings + investment ROI
Cadence beats heroics. Without the calendar, the discipline decays in 6 months.
End of week 13.
Three substrate moves; one quarter. By week 13 the FinOps function stops being finance-driven and becomes engineering-driven. Cost is now a property of the architecture, not a monthly surprise.