GenAI in regulated environments: the nine controls

Most enterprises in 2026 are not blocked from shipping GenAI by the model. They’re blocked by everything around it: data, guardrails, evals, governance, audit. The model is the easy part.

I see this every week. A senior engineering leader wants to ship an LLM-powered customer feature. The model demo is great. Risk wants ten things. Legal wants twelve. Security wants seven more. Six months pass. Nothing ships.

The way out is not to remove controls. It’s to narrow them to a defensible nine, encode them in the platform, and stop re-litigating them per feature. The nine below are what I run as a baseline whenever an organisation asks “what do we need before we put an LLM in front of a real customer?”

Every one of them maps to a public framework. None of them are my invention. The contribution is the sequencing and the encoding, not the list.

1. Named use-case, named sponsor, named outcome.

Maps to: NIST AI RMF GOVERN-1.3, MAP-1.1.

Most GenAI programmes that never ship were never specific enough to be shippable. “Help customers with chat” is not a use-case. “Resolve 40% of tier-1 support tickets without escalation while maintaining customer satisfaction at or above current human-handled baseline” is.

The control is: every GenAI use-case in production has a named product sponsor, a measurable business outcome, and a baseline against which the GenAI version is compared. Use-cases without all three don’t leave the prototype environment.

2. Eval set with known-good answers.

Maps to: NIST AI 600-1 §4, OWASP LLM09 (Misinformation), EU AI Act Art.15 (Accuracy & Robustness).

Before a prompt change ships, it passes an eval set of at least 50 cases with known-good answers. Before a model change ships, the same. Before a RAG corpus change ships, the same. Promptfoo, OpenAI Evals, DeepEval and Arize Phoenix are free to use; LangSmith and Braintrust are commercial. Pick one.

The bottleneck is never the tooling. It’s getting subject-matter experts to label 50 cases. Do it once. The eval set is the most durable artefact in your GenAI programme.
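A minimal sketch of such a gate, independent of any of the tools above. It assumes exact-match scoring after normalisation, which is cruder than most real suites. The `generate` callable is a hypothetical stand-in for the prompt and model under test, and the 90% pass threshold is an illustrative parameter, not a recommendation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str  # known-good answer labelled by a subject-matter expert


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


def run_eval_gate(
    cases: Sequence[EvalCase],
    generate: Callable[[str], str],  # hypothetical: the system under test
    min_cases: int = 50,
    pass_threshold: float = 0.9,  # illustrative value only
) -> tuple[bool, float, list[str]]:
    """Return (gate_passed, pass_rate, failed_case_ids)."""
    if len(cases) < min_cases:
        raise ValueError(f"eval set has {len(cases)} cases, need at least {min_cases}")
    failed = [
        c.case_id
        for c in cases
        if normalise(generate(c.prompt)) != normalise(c.expected)
    ]
    pass_rate = 1 - len(failed) / len(cases)
    return pass_rate >= pass_threshold, pass_rate, failed
```

Wire the boolean into CI so a prompt, model or corpus change cannot merge while the gate is red; the list of failed case IDs tells the reviewer where to look.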

3. Prompts in version control, versioned, owned.

Maps to: NIST AI RMF MEASURE-2.7, OWASP LLM01 (Prompt Injection), ISO/IEC 42001 A.6.2.4.

A prompt is a configuration. Treat it like code. Hard-coded prompt strings in service code are an anti-pattern. The prompt registry has owners, semver, change history, and per-version evals attached.

Tag every production trace with the prompt version. Without that tag, you can’t attribute regressions, and you can’t roll back safely.
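A rough sketch of the registry idea, assuming an in-memory store and a plain `str.format` template. The class names, the `prompt.name` / `prompt.version` tag keys and the example values are all hypothetical; a real registry would sit behind version control or a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str           # semver string, e.g. "1.2.0"
    owner: str
    template: str
    eval_pass_rate: float  # result of the eval run attached to this version


class PromptRegistry:
    """In-memory stand-in for a prompt registry with immutable versions."""

    def __init__(self) -> None:
        self._versions: dict[tuple[str, str], PromptVersion] = {}

    def register(self, prompt: PromptVersion) -> None:
        key = (prompt.name, prompt.version)
        if key in self._versions:
            # Published versions never change; edits require a new version.
            raise ValueError(f"{prompt.name}@{prompt.version} already exists")
        self._versions[key] = prompt

    def get(self, name: str, version: str) -> PromptVersion:
        return self._versions[(name, version)]


def render_with_tags(prompt: PromptVersion, **variables: str) -> tuple[str, dict[str, str]]:
    """Render the template and return trace tags naming the exact version used."""
    text = prompt.template.format(**variables)
    tags = {"prompt.name": prompt.name, "prompt.version": prompt.version}
    return text, tags
```

Because `register` refuses to overwrite, rolling back means pointing the service at an earlier version string rather than editing text in place.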

4. Input + output guardrails for the two failure modes you most fear.

Maps to: OWASP LLM01/02/06, MITRE ATLAS, EU AI Act Art.13 (Transparency & Provision of Information).

Two layers, not one. Input guardrails catch prompt injection, jailbreak attempts and PII leakage before the model sees them. Output guardrails catch policy violations, hallucinations in regulated domains, and unsafe content after.

Open and commercial options exist: NVIDIA NeMo Guardrails, Guardrails AI, Lakera Guard, Azure AI Content Safety, AWS Bedrock Guardrails. Expect them to add 50–300 ms of latency; run the checks asynchronously where possible.

Test guardrails adversarially. The OWASP LLM Top 10 is your test suite floor.
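To make the two-layer shape concrete, here is a deliberately naive sketch. The regex patterns and the blocked phrase are illustrative assumptions, nowhere near production-grade detection, and `call_model` is a hypothetical callable; in practice each check would delegate to one of the products listed above.

```python
import re
from typing import Callable

# Illustrative rules only; real detection needs a dedicated guardrail service.
INJECTION_PATTERN = re.compile(
    r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE
)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKED_OUTPUT_PHRASES = ("guaranteed returns",)  # hypothetical policy rule


class GuardrailViolation(Exception):
    def __init__(self, layer: str, rule: str) -> None:
        super().__init__(f"{layer} guardrail triggered: {rule}")
        self.layer = layer
        self.rule = rule


def check_input(user_text: str) -> None:
    """Input layer: runs before the model sees anything."""
    if INJECTION_PATTERN.search(user_text):
        raise GuardrailViolation("input", "prompt_injection")
    if EMAIL_PATTERN.search(user_text):
        raise GuardrailViolation("input", "pii_email")


def check_output(model_text: str) -> None:
    """Output layer: runs after the model, before the customer sees anything."""
    lowered = model_text.lower()
    for phrase in BLOCKED_OUTPUT_PHRASES:
        if phrase in lowered:
            raise GuardrailViolation("output", "policy_phrase")


def guarded_call(user_text: str, call_model: Callable[[str], str]) -> str:
    check_input(user_text)
    response = call_model(user_text)
    check_output(response)
    return response
```

The `layer` and `rule` fields on the exception are what you would log per request, so the adversarial test suite can assert which guardrail fired and not merely that something was blocked.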

5. Per-request tracing with replayability.

Maps to: NIST AI RMF MEASURE-3.2, EU AI Act Art.12 (Record-Keeping).

Per request, log: prompt, retrieved context, model+version, full response, latency, cost. Link to user session for end-to-end traces. The OpenTelemetry GenAI semantic conventions are stabilising fast; use them now to avoid migration later.

The bar is replayability. Given a trace ID from three months ago, can your team reproduce the exact response that was sent? If not, you cannot debug regressions or respond to regulator queries.
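One way to picture the record, as a sketch: the field names below are my own assumptions and are not taken from the OpenTelemetry conventions, and the dictionary-backed `TraceStore` stands in for whatever durable storage you actually use.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class TraceRecord:
    prompt: str
    retrieved_context: list[str]
    model: str
    model_version: str
    response: str
    latency_ms: float
    cost_usd: float
    session_id: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class TraceStore:
    """In-memory stand-in for durable trace storage, keyed by trace ID."""

    def __init__(self) -> None:
        self._lines: dict[str, str] = {}

    def write(self, record: TraceRecord) -> str:
        # Serialise the whole record so nothing has to be reconstructed later.
        self._lines[record.trace_id] = json.dumps(asdict(record), sort_keys=True)
        return record.trace_id

    def replay(self, trace_id: str) -> TraceRecord:
        """Return exactly what was stored for this trace, including the response."""
        return TraceRecord(**json.loads(self._lines[trace_id]))
```

Storing the full response alongside its inputs matters because re-running the model later will not necessarily return the same text; the stored record is what lets you show what the customer actually saw.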

6. Risk-tiered governance.

Maps to: ISO/IEC 42001, NIST AI RMF GOVERN-2, EU AI Act Art.9 (Risk Management).

Don’t funnel every AI feature through one committee. Tier them:

Low-risk (internal productivity, summarisation of non-sensitive data) is self-service with a checklist.
Medium-risk (customer-facing, non-regulated) needs Security + Legal sign-off and an eval threshold.
High-risk (lending decisions, medical triage, identity verification) is co-designed with risk and compliance, conformity-assessed, and continuously monitored.

The tiers must be objective — written down with criteria, not decided in a meeting. Otherwise tiering becomes its own political process.
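Written-down criteria can be as simple as a function. In the sketch below the three boolean attributes and the rule that internal use of sensitive data lands in the medium tier are my assumptions; your own criteria will differ, but the point is that the answer comes from the inputs and not from the room.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class UseCaseProfile:
    customer_facing: bool
    regulated_decision: bool      # e.g. lending, medical triage, identity verification
    handles_sensitive_data: bool


# What each tier requires before launch, following the tiers described above.
REQUIREMENTS = {
    RiskTier.LOW: ["self-service checklist"],
    RiskTier.MEDIUM: ["Security sign-off", "Legal sign-off", "eval threshold"],
    RiskTier.HIGH: [
        "co-design with risk and compliance",
        "conformity assessment",
        "continuous monitoring",
    ],
}


def classify(profile: UseCaseProfile) -> RiskTier:
    """Deterministic tiering: the same profile always yields the same tier."""
    if profile.regulated_decision:
        return RiskTier.HIGH
    # Assumption: sensitive data pushes an internal use-case up to medium.
    if profile.customer_facing or profile.handles_sensitive_data:
        return RiskTier.MEDIUM
    return RiskTier.LOW
```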

7. Cost-per-outcome, not cost-per-token.

Maps to: FinOps Foundation AI WG, AWS Well-Architected Cost Pillar.

Cost-per-token is an engineering metric. Cost-per-resolved-task is a business metric. Track both. Without per-outcome cost, you can’t defend the feature’s economics in a budget review, and you can’t make rational routing decisions (cheaper model where quality matches; premium where it doesn’t).

Routing typically buys 30–60% cost reduction without quality loss when implemented well. OpenRouter, LiteLLM, AWS Bedrock Intelligent Prompt Routing are options.
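The arithmetic is simple enough to sketch. The model names, prices and pass rates below are invented for illustration, and the routing rule (cheapest model that clears the eval bar) is one possible policy, not how any of the routers above work internally.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ModelOption:
    name: str
    price_per_1k_tokens: Decimal
    eval_pass_rate: float  # from the eval set in control 2


def cost_per_resolved_task(
    total_tokens: int, price_per_1k_tokens: Decimal, resolved_tasks: int
) -> Decimal:
    """Total token spend divided by the number of tasks actually resolved."""
    if resolved_tasks <= 0:
        raise ValueError("no resolved tasks; cost per outcome is undefined")
    return Decimal(total_tokens) / 1000 * price_per_1k_tokens / resolved_tasks


def cheapest_acceptable(options: list[ModelOption], min_pass_rate: float) -> ModelOption:
    """Pick the cheapest model whose eval pass rate meets the quality bar."""
    acceptable = [o for o in options if o.eval_pass_rate >= min_pass_rate]
    if not acceptable:
        raise ValueError("no model meets the quality bar")
    return min(acceptable, key=lambda o: o.price_per_1k_tokens)


# Hypothetical month: 2.4M tokens at $0.005 per 1k tokens, 480 tickets resolved.
monthly = cost_per_resolved_task(2_400_000, Decimal("0.005"), 480)
```

With those made-up numbers the token bill is $12.00 and `monthly` works out to $0.025 per resolved ticket, which is the figure to put next to the human-handled cost in a budget review.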

8. Model card + versioned pinning per environment.

Maps to: NIST AI RMF MAP-3.4, EU AI Act Art.11 (Technical Documentation).

Pin model versions per environment. Don’t let “latest” ship to prod. Anthropic, OpenAI and Bedrock all shipped breaking model changes in 2024; orgs that didn’t pin had production regressions.

Maintain a model card per production model: intended use, performance characteristics, eval results, known limitations, deprecation timeline. This is the artefact a regulator will ask for first.
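A pinning check can run in CI before any deploy. This sketch assumes a hypothetical config shape (one `ModelPin` per environment) and my own list of floating aliases; the model name and version strings are placeholders, not real identifiers.

```python
from dataclasses import dataclass

# Assumption: version labels that can change underneath you without a deploy.
FLOATING_ALIASES = {"", "latest", "default", "stable"}


@dataclass(frozen=True)
class ModelPin:
    environment: str
    model: str
    version: str


def validate_pins(pins: list[ModelPin]) -> list[str]:
    """Return human-readable problems; an empty list means every environment is pinned."""
    problems = []
    for pin in pins:
        if pin.version.strip().lower() in FLOATING_ALIASES:
            problems.append(
                f"{pin.environment}: {pin.model} uses floating version {pin.version!r}"
            )
    return problems


pins = [
    ModelPin("dev", "example-model", "latest"),       # flagged by the check
    ModelPin("prod", "example-model", "2025-01-15"),  # pinned
]
```

Whether `dev` is allowed to float is a policy choice; the sketch flags it everywhere so the exception has to be made deliberately.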

9. Audit evidence generated at decision time, not at audit time.

Maps to: EU AI Act Art.12, NIST AI RMF MEASURE-3.3, ISO/IEC 42001 A.6.2.4.

If a regulator asks tomorrow why your model produced a specific output for a specific customer, can you answer within hours? Within minutes? At all?

The control is: per-decision evidence pack (prompt, context, model version, output, applied guardrails, confidence) is generated at decision time, signed, and retained per policy. Audit views for the three most-likely regulator questions are pre-built. You test them with internal audit before they’re asked for externally.

This is the control that takes the longest to retrofit. Build it in from day one.
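A minimal sketch of a signed evidence pack, using only the Python standard library. It assumes an HMAC-SHA256 signature over canonical JSON with a shared secret; a real deployment would more likely use asymmetric signatures with keys held in a KMS. The field names are hypothetical.

```python
import hashlib
import hmac
import json


def _canonical(payload: dict) -> bytes:
    """Stable byte representation so the same payload always signs the same way."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_evidence_pack(
    *,
    prompt: str,
    context: list[str],
    model_version: str,
    output: str,
    guardrails_applied: list[str],
    decided_at: str,  # ISO 8601 timestamp supplied by the caller
    signing_key: bytes,
) -> dict:
    payload = {
        "prompt": prompt,
        "context": context,
        "model_version": model_version,
        "output": output,
        "guardrails_applied": guardrails_applied,
        "decided_at": decided_at,
    }
    signature = hmac.new(signing_key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_evidence_pack(pack: dict, signing_key: bytes) -> bool:
    """True only if the payload is byte-for-byte what was signed at decision time."""
    expected = hmac.new(signing_key, _canonical(pack["payload"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, pack["signature"])
```

The verification function is the part internal audit should exercise: pull a pack from retention, verify it, and confirm that an edited pack fails.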

How to sequence the nine.

Don’t try to land all nine before shipping. Sequence them:

Controls 1 (use-case + sponsor) and 2 (eval set) are non-negotiable before the first user touches the feature.
Controls 5 (tracing) and 8 (version pinning) go in with the first deploy.
Controls 3 (prompt registry), 4 (guardrails) and 7 (cost observability) land within four weeks of first user contact.
Controls 6 (risk-tiered governance) and 9 (audit evidence) are the platform-level investments that make the next ten features cheap.

With this sequence, the org ships in 6–12 weeks instead of 6–12 months. And it ships with controls a regulator would recognise.

The point of the controls is to make shipping boring.

The nine aren’t there to make GenAI sound serious. They’re there to make the second, third and tenth use-case cheap and safe to ship by reusing the substrate the first one built.

Run the GenAI Readiness Diagnostic to see which of these controls your org has and which it doesn’t. The capability breakdown will show you which substrate investment compounds most across future use-cases.
