What this tier actually looks like.
You have something in users’ hands. It works most of the time. The team is proud of it — rightfully. Then a senior engineer asks: “what would happen to a deploy whose prompt change drops the critical eval score by 30%?” The answer is silence, because the eval set lives in a repo and nobody runs it on prompt changes.
You probably have:
- A live GenAI feature with real users.
- Prompts hard-coded as string literals in service code.
- Model gateway pointing at one provider, version pin set to
latestor equivalent floating reference. - An eval set written when the feature launched, gathering dust.
- Guardrails based on system-prompt instructions only.
- Per-request logging that captures “something happened” but not the full context needed for replay.
- Cost visibility at the monthly-bill level, no per-feature attribution.
You probably don’t have:
- Prompt registry with versioning and per-version evals.
- Eval-regression-gated CI.
- Tested adversarial guardrails (OWASP LLM Top 10 coverage).
- Per-decision audit evidence pack generated at decision time.
- Cost-per-resolved-task as a tracked metric.
Why most teams get stuck here.
Piloting orgs typically stall because the next move feels expensive and unfamiliar. Three flawed instincts that keep teams here:
- “Let’s ship more use-cases first.” Each new use-case re-invents the substrate. Cost grows quadratically; learning compounds nowhere.
- “The AI CoE will figure it out.” The CoE typically doesn’t own platform substrate — identity, observability, deployment, audit. See the AI CoE Trap.
- “We’ll add controls once it’s mature.” EU AI Act enforcement starts 2 Aug 2026. ISO/IEC 42001 is becoming a procurement floor. Waiting compresses the timeline against you.
The 12% of enterprises that crossed from Piloting to Operating (and beyond) didn’t do it by adding more features. They did it by building the substrate — gateway, prompt registry, evals, guardrails, audit — before the second use-case.
The three substrate moves to the next tier.
1. Prompt registry + eval-gated deploys. End of the inline-prompt era.
Move every production prompt out of code and into a versioned registry (LangSmith, Promptfoo, Langfuse). Build an eval set of 50 cases with known-good answers. Run it in CI; block merge on critical regression.
Closes the Inline Prompt and Eval Set That Never Runs gaps simultaneously.
2. Layered guardrails, tested adversarially.
Input + output guardrails for the two failure modes you most fear (typically OWASP LLM01 prompt injection + LLM06 sensitive disclosure). Test them with an adversarial prompt suite, not just hope. NVIDIA NeMo Guardrails, Guardrails AI, Azure AI Content Safety, AWS Bedrock Guardrails are all viable.
3. Per-decision audit pipeline + cost-per-outcome.
Per request: prompt, retrieved context, model+version, output, applied guardrails. Logged, signed, retention-policy-controlled. Cost attribution: cost-per-resolved-task, not cost-per-token.
This is the substrate the 9 controls essay describes (read it) and the reference architecture provides (Regulated GenAI Platform).
What changes when you cross.
Once these three moves land:
- Your second GenAI use-case ships in weeks, not months. The substrate carries it.
- Audit conversations become routine. You can show how every decision was made.
- Cost stops being a quarterly surprise. Per-feature attribution lets you optimise without flying blind.
- You’re ready for EU AI Act high-risk obligations. The substrate you built for one use-case is what the regulation requires for all.
This is the Operating tier. ~12% of enterprises are here. The gap from Operating to Industrialising is then smaller than the gap from Piloting to Operating — the substrate compounds.
Run the diagnostic.
To find out whether your team scores at this tier or another, run GenAI Readiness. It takes 2–4 minutes and surfaces both your overall tier and the capability breakdown that shows you where the move starts.
For the bigger picture: the compound diagnostic takes results from all six diagnostics and shows you the substrate gap that bounds your overall delivery, not the per-discipline symptom.