Reference Architecture · Platform Engineering

Platform Engineering IDP.

Internal Developer Platform with paved paths, golden templates, service catalogue, scorecards, self-service. Aligned to the CNCF Platform Engineering Maturity Model, Team Topologies, and DORA elite-performer criteria.

Platform Engineering IDP · reference architecture v1.0 · read the AI moat essay

What this architecture solves.

The shape that turns a tool stack into a platform. Paved paths absorb the substrate (CI, security, identity, observability, cost) so a developer’s job is to write business logic, not to integrate ten tools. Scorecards make posture visible without self-attestation. The substrate runs as a versioned product with consumers, owners and SLOs — not as a tool stack with a portal on top.

The four paved paths.

P1 · Microservice paved path

Default for new services.

Backstage Software Template scaffolds: repo with CI configured for SLSA L3, OIDC for cloud access, OpenTelemetry instrumentation, SLO definitions, default dashboards, on-call rotation linked. Zero to staging in <30 minutes.

P2 · Data job paved path

For dbt models, Airflow DAGs, Spark jobs.

Scaffold includes lineage emission (OpenLineage), data-quality tests (Soda / Great Expectations), contract definitions, Gold-zone landing convention. The team that owns the model owns the data product.

P3 · GenAI feature paved path

The Regulated GenAI platform as a template.

Scaffold integrates with the gateway, prompt registry, eval-gated deploy and audit-evidence pipeline of the Regulated GenAI Platform. New GenAI features inherit the nine controls by construction.

P4 · Batch job paved path

Scheduled, idempotent, observable.

Cron + retry + DLQ + OTel + SLO-bounded. The path most platform teams ignore because it’s unsexy — and most legacy ops debt accumulates against.

Self-service capabilities.

Every common provisioning need behind a portal action: DB instance, queue, secret, DNS record, TLS cert, namespace, IAM role, feature flag, sandbox AI key. Each provisioned by Terraform/Pulumi modules with policy-as-code embedded. Audit trail per action; no tickets.

Scorecards — auto-derived, not self-attested.

Per service: production-readiness (CI green? SBOM emitted? SLO defined? on-call wired?), security posture (signed image? KEV CVE count? MFA?), cost (per-month vs budget? trending), SLO health, ownership freshness (owner active in last 30 days?).

Scorecards must be hard to game. Self-attested fields decay; auto-derived fields don’t.

How the platform is funded and run.

Funded as a product with outcome metrics: lead time, change-fail rate, time-to-first-deploy, paved-path adoption, developer NPS. Not as a cost-centre.
Stream-aligned + enabling teams per Team Topologies. Central platform team is small; enabling teams move into streams temporarily to help them adopt the platform.
Versioned with auto-migration tooling. Breaking changes ship with codemods (jscodeshift, OpenRewrite, ast-grep) and a migration job. Consuming teams don’t do the migration work.
Sunset rituals quarterly. Subtract capabilities that aren’t adopted. Without sunset, every platform eventually becomes a tool stack again.

Sequencing for the first 90 days.

Weeks 1-4: Backstage stood up with service catalogue. One paved path (microservice) shipped with one team.
Weeks 5-8: Self-service for the top three needs (DB, queue, secrets). Scorecards for production-readiness.
Weeks 9-12: Second and third paved paths added. Auto-migration tooling for the first breaking change. Developer NPS baseline.