Reference Architecture · Data Modernisation

Modern Data Platform.

Lakehouse + governed mesh: ingestion, contracts, lineage, catalogue, semantic layer, feature store, audit retention. Aligned to DCAM, DAMA-DMBOK and ISO 8000, and to the data-protection, security and residency requirements of EU GDPR, APRA CPS 234 and HIPAA.

  • Sources: OLTP DBs · SaaS · Files · Events · IoT · 3rd-party · CDC streams
  • Ingestion + contracts: Kafka · Debezium · Fivetran · Airbyte · custom CDC, with versioned schemas + data contracts
  • Bronze · Raw: append-only · immutable · source-of-truth · audit retention
  • Silver · Conformed: cleaned · validated · joined · quality SLAs · contract-bound
  • Gold · Consumable: analytics · features · marts · owner = domain team
  • Data products · Domain owned: each domain (customer · payment · risk · clinical) owns its products with SLAs, contracts and versioning
  • Semantic layer · Metrics: Cube · dbt Semantic · Looker LookML. One definition of revenue, churn, NPS, consumed by BI, agents, dashboards alike
  • Feature store + vector: Feast · Tecton · Databricks FS · pgvector / Pinecone. Online + offline parity; feeds ML and GenAI RAG with the same governed substrate
  • Shared substrate: catalogue + lineage · access control · PII classification · retention & right-to-erasure · data quality monitoring. DataHub · OpenMetadata · Atlas · Unity Catalog · OpenLineage. Audited against: DCAM · DAMA-DMBOK · ISO 8000 · GDPR · APRA CPS 234 · HIPAA · sector residency reqs
Modern Data Platform · reference architecture v1.0

What this architecture solves.

A single pattern that lets BI, ML and GenAI consume the same governed substrate without bespoke pipelines per consumer. It combines a Bronze/Silver/Gold lakehouse for storage discipline, domain-owned data products for accountability, a semantic layer for one definition of business metrics, and a feature store + vector store for parity between ML and GenAI workloads.

Why this shape.

Two design tensions drive this architecture:

  • Lakehouse vs warehouse. Lakehouse (Delta, Iceberg, Hudi) gives storage flexibility, format portability and ML-readiness. The traditional warehouse still wins on sub-second BI; semantic layer + materialised marts close the gap.
  • Central platform vs domain mesh. Pure mesh struggles with cross-domain consistency; pure central platform becomes a bottleneck. The hybrid: central platform owns the substrate (catalogue, lineage, access, contracts); domain teams own their data products inside it.

Layers, top to bottom.

L1 · Sources

OLTP, SaaS, events, IoT, third-party.

Every source has a documented owner and a contract specifying what may change and when. CDC (Debezium, Fivetran) is the dominant ingestion path for OLTP; event streams (Kafka) carry system-of-record events.
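To make the idea of a versioned data contract concrete, here is a minimal sketch in plain Python. It is not the API of Kafka, Debezium or any schema registry; `DataContract`, `violations` and `is_backward_compatible` are hypothetical names, and the compatibility rule (a new version must keep every existing field with the same type) is one simple assumption among several possible policies.

```python
from dataclasses import dataclass


@dataclass
class DataContract:
    """Hypothetical contract: a versioned set of required fields and their types."""
    name: str
    version: int
    fields: dict  # field name -> expected Python type


def violations(contract: DataContract, record: dict) -> list[str]:
    """Return readable contract violations for one record (empty list means valid)."""
    problems = []
    for field_name, expected in contract.fields.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected):
            actual = type(record[field_name]).__name__
            problems.append(f"{field_name}: expected {expected.__name__}, got {actual}")
    return problems


def is_backward_compatible(old: DataContract, new: DataContract) -> bool:
    """Assumed policy: a new version may add fields but must keep old ones unchanged."""
    return all(new.fields.get(name) is typ for name, typ in old.fields.items())
```

A source owner would publish a new contract version before changing the schema, and ingestion could reject or quarantine records for which `violations` returns a non-empty list.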

L2 · Lakehouse (Bronze · Silver · Gold)

The storage discipline.

Bronze is append-only raw data, and that immutable history does much of the work of audit retention. Silver is conformed and validated. Gold is consumable: analytics-ready and owned by the domain. Delta Lake, Iceberg and Hudi are the three main open table formats; format choice matters less than ownership clarity.
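The three zones can be illustrated with a small sketch that uses plain Python lists and dicts in place of real tables. The field names (`payment_id`, `customer`, `amount`) and the cleaning rules are assumptions made for illustration; in practice this logic would run on a table format such as Delta Lake, Iceberg or Hudi.

```python
from collections import defaultdict


def to_silver(bronze_rows: list[dict]) -> list[dict]:
    """Conform raw rows: drop invalid ones, normalise types, keep the latest row per key."""
    latest = {}
    for row in bronze_rows:  # bronze is only read, never modified
        key = row.get("payment_id")
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # row fails validation and stays in Bronze only
        if key is None:
            continue
        latest[key] = {
            "payment_id": key,
            "customer": str(row.get("customer", "")).strip().lower(),
            "amount": amount,
        }
    return list(latest.values())


def to_gold(silver_rows: list[dict]) -> dict:
    """Build a consumable mart: total payment amount per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)
```

Bronze keeps every event as received, including the invalid and superseded ones, so Silver and Gold can always be rebuilt from it.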

L3 · Data products — domain owned

Customer, payment, risk, clinical — each a product.

Following the data-mesh pattern. Each domain owns its products with documented SLAs (freshness, completeness), versioned contracts, and on-call ownership. The central platform provides the substrate; domains own the products.
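A product SLA of this kind can be expressed as data and checked mechanically. The sketch below assumes two dimensions only, freshness and completeness, and defines completeness as the share of rows with every required field populated; `ProductSLA` and `check_sla` are hypothetical names, not part of any data-mesh tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ProductSLA:
    """Hypothetical SLA for one data product."""
    max_staleness: timedelta   # how old the latest refresh may be
    min_completeness: float    # required fraction of fully populated rows
    required_fields: tuple     # fields that must be non-null in each row


def completeness(rows: list[dict], required_fields: tuple) -> float:
    """Fraction of rows in which every required field is present and not None."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows if all(row.get(f) is not None for f in required_fields)
    )
    return complete / len(rows)


def check_sla(sla: ProductSLA, rows: list[dict],
              last_updated: datetime, now: datetime) -> dict:
    """Evaluate each SLA dimension separately so a breach can be routed to on-call."""
    return {
        "fresh": now - last_updated <= sla.max_staleness,
        "complete": completeness(rows, sla.required_fields) >= sla.min_completeness,
    }
```

Passing `now` explicitly keeps the check deterministic and easy to test; a scheduler would supply the current time.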

L4 · Semantic layer

One definition of business metrics.

The single most under-invested layer in 2026 enterprises. Without it, three teams report three different revenue numbers from the same warehouse. Cube, dbt Semantic Layer and LookML are options; pick one and stop letting BI tools define metrics.
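The principle of one definition per metric can be sketched as a registry that refuses a second definition under the same name. This is an illustration only: Cube, dbt Semantic Layer and LookML each use their own declarative formats, and the `metric` decorator, the `query` function and the revenue rule (completed orders net of refunds) are assumptions.

```python
from typing import Callable

# Single registry: every consumer resolves a metric name to the same function.
METRICS: dict[str, Callable[[list[dict]], float]] = {}


def metric(name: str):
    """Register a metric definition; a second definition of the same name is an error."""
    def register(fn):
        if name in METRICS:
            raise ValueError(f"metric {name!r} is already defined")
        METRICS[name] = fn
        return fn
    return register


@metric("revenue")
def revenue(orders: list[dict]) -> float:
    """Assumed rule: sum of completed order amounts, net of refunds."""
    return sum(
        o["amount"] - o.get("refunded", 0)
        for o in orders
        if o["status"] == "completed"
    )


def query(name: str, rows: list[dict]) -> float:
    """Entry point shared by BI, agents and dashboards."""
    return METRICS[name](rows)
```

Because BI tools, agents and dashboards all go through `query("revenue", ...)`, they cannot disagree about what revenue means.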

L5 · Feature store + vector

The substrate ML and GenAI consume.

A feature store gives ML training/serving parity (offline/online). A vector store gives RAG its retrievable corpus. Both should consume from the Gold zone rather than from bespoke pipelines; this is what enables the Regulated GenAI Platform.
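Offline/online parity comes down to computing a feature once, from one definition, and serving that same value in both paths. The sketch below is not the Feast, Tecton or Databricks API; `FEATURES`, `build_offline` and `OnlineStore` are hypothetical, and days are plain integers to keep the example short.

```python
def days_since_last_order(customer: dict, as_of_day: int) -> int:
    """Single feature definition shared by training and serving."""
    return as_of_day - max(customer["order_days"])


# Registry of feature definitions (hypothetical).
FEATURES = {"days_since_last_order": days_since_last_order}


def build_offline(customers: list[dict], as_of_day: int) -> dict:
    """Offline path: compute every feature for every entity, e.g. for training sets."""
    return {
        c["id"]: {name: fn(c, as_of_day) for name, fn in FEATURES.items()}
        for c in customers
    }


class OnlineStore:
    """Online path: a key-value lookup loaded from the offline table."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def materialise(self, offline_table: dict) -> None:
        # Serving values are copied from offline, never recomputed with other logic.
        self._rows.update(offline_table)

    def get(self, entity_id: str) -> dict:
        return self._rows[entity_id]
```

The design choice is that the online store has no feature logic of its own, so training and serving cannot drift apart.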

L6 · Shared substrate (the moat)

Catalogue · lineage · access · PII · retention · quality.

Catalogue (DataHub, OpenMetadata, Unity Catalog, Atlas) + lineage (OpenLineage) + access control + PII classification + retention/erasure + data-quality monitoring (Soda, Great Expectations). These are the pieces that make the rest auditable.
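One reason lineage and PII classification sit in the same substrate is that an erasure request has to reach every dataset derived from the one holding the personal data. A minimal sketch, assuming lineage is available as a simple upstream-to-downstream mapping; the dataset names are hypothetical and this is not the OpenLineage API.

```python
from collections import deque

# Hypothetical lineage graph: upstream dataset -> datasets derived from it.
LINEAGE = {
    "bronze.customers": ["silver.customers"],
    "silver.customers": ["gold.customer_360", "features.churn"],
    "bronze.payments": ["silver.payments"],
    "silver.payments": ["gold.customer_360"],
}


def downstream(dataset: str, lineage: dict) -> set:
    """Breadth-first walk returning every dataset derived from `dataset`."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def erasure_scope(pii_sources: list, lineage: dict) -> set:
    """Datasets to check for an erasure request: the PII sources and all descendants."""
    scope = set(pii_sources)
    for source in pii_sources:
        scope |= downstream(source, lineage)
    return scope
```

The same traversal supports impact analysis before a contract change, since it lists every product that depends on a given source.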
