Opinion: The Platform Engineer's Guide to AI Safety — You Already Know It. You Just Don't Know It Yet.
Your team just shipped an AI feature. Maybe it’s a chatbot for customer support. Maybe it’s a code assistant integrated into your CI/CD pipeline. Maybe it’s an agent that can spin up infrastructure based on natural language requests.
And somewhere in the back of your mind, you’re wondering: Is this safe? What does “safe” even mean here? And why does everyone talking about AI safety sound like they’re either preparing for the apocalypse or dismissing the whole thing as academic noise?
Here’s what 25 years in operations taught me, from Red Hat to ThoughtWorks to AWS: every “revolutionary” technology eventually reveals itself as a variation on problems we’ve already solved. Cloud computing was just “someone else’s computer” with better APIs. Kubernetes was just “distributed systems orchestration” with a steeper learning curve. And AI safety? It’s tiered security frameworks and policy-as-code wearing a new hat.
You already know how to do this. You just don’t know you know.
The Framework You Already Understand
If you’ve ever classified workloads by sensitivity level (public-facing versus internal, PCI-compliant versus non-regulated, production versus development), you already understand AI safety levels.
Anthropic, the company behind Claude, formalized this into something called the Responsible Scaling Policy (RSP). At its core is a tiered system called AI Safety Levels (ASL). If that sounds familiar, it should. It’s directly modeled on Biosafety Levels (BSL), the framework that governs how laboratories handle dangerous pathogens.
The parallel isn’t just conceptual. It’s structural.
| Biosafety Level | What You’re Handling | Containment Required |
|---|---|---|
| BSL-1 | Non-hazardous agents (E. coli K12) | Standard lab practices |
| BSL-2 | Moderate-risk agents (Staph, Hepatitis B) | Limited access, protective equipment |
| BSL-3 | Serious/lethal agents (TB, SARS, Anthrax) | Controlled access, HEPA filtration, negative pressure |
| BSL-4 | Highest-risk agents (Ebola, Marburg) | Full isolation, positive pressure suits, airlocks |
Here’s Anthropic’s equivalent:
| AI Safety Level | Capability Profile | Safeguards Required |
|---|---|---|
| ASL-1 | No meaningful catastrophic risk | Standard practices |
| ASL-2 | Early dangerous capability signs, not exceeding what’s findable via search | Harmlessness training, Constitutional AI, SOC 2 compliance |
| ASL-3 | Substantial increase in catastrophic misuse risk or meaningful autonomous capabilities | Defense against sophisticated attackers, multi-layer prevention, continuous capability evaluation |
| ASL-4+ | State-level threats, qualitative capability escalations | Nation-state adversary protection, potentially unsolved research problems |
The principle is identical: containment scales with capability. You don’t put Ebola in a BSL-2 lab. You don’t deploy a model that can autonomously write exploit code with the same controls as a simple FAQ chatbot.
Platform engineers intuitively understand this. It’s the same reason you don’t give a junior developer production database credentials on day one. It’s the same reason PCI workloads get different network policies than internal dashboards. Risk determines control.
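As a toy illustration, risk-based tiering is just a classifier over workload attributes. The attribute names and tiers below are invented for the example, not Anthropic's actual criteria:

```python
from dataclasses import dataclass

# Toy workload classifier: capability attributes determine the control tier,
# mirroring how BSL/ASL scale containment with capability. The attributes
# and tiers are invented for illustration, not Anthropic's criteria.
@dataclass
class WorkloadProfile:
    handles_pii: bool
    can_execute_code: bool
    has_infra_access: bool

def control_tier(w: WorkloadProfile) -> int:
    tier = 1                    # baseline: standard practices
    if w.handles_pii:
        tier = max(tier, 2)     # add output filtering, audit logging
    if w.can_execute_code:
        tier = max(tier, 3)     # add sandboxing, egress restrictions
    if w.has_infra_access:
        tier = max(tier, 4)     # add human-in-the-loop approval
    return tier
```

The point is the shape, not the thresholds: the highest-risk attribute present sets the floor for the controls, exactly like containment levels.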
One thing worth knowing as of early 2026: Anthropic just released RSP v3.0 (February 2026), and it’s a significant structural change. The framework now separates Anthropic’s unilateral commitments from industry-wide recommendations, mandates public Frontier Safety Roadmaps and Risk Reports every 3-6 months, and de-emphasizes rigid ASL level thresholds in favor of requiring documented analysis and arguments for safety decisions. The core tiered-risk logic still holds, but the RSP is now more of a continuous governance system than a set of hard capability gates.
Practically speaking: Claude Opus 4 (May 2025) was the first model to activate ASL-3 protections. All subsequent frontier Claude models operate under ASL-3. Smaller models remain at ASL-2. No production model has reached ASL-4.
What Triggers an Upgrade?
In biosafety, you don’t get to decide your containment level based on gut feel. There are specific criteria (pathogen characteristics, transmission routes, available treatments) that determine which level applies.
Anthropic does the same thing with defined capability thresholds that trigger escalation.
The CBRN Threshold: Can this model significantly help a non-expert create or deploy biological, chemical, radiological, or nuclear weapons? If yes, you’re at ASL-3 minimum.
The Autonomous AI R&D Threshold: Can this model automate entry-level AI research? Could it cause a 1000x increase in effective compute within a year? (The historical rate is about 35x per year.) If yes, ASL-3 minimum.
The Model Autonomy Checkpoint: Can this model autonomously complete software engineering tasks that would take a human 2-8 hours? That’s a warning sign for capabilities that compound.
And here’s the key operational detail: Anthropic triggers a mandatory capability assessment at a 4x increase in effective compute on risk-relevant domains, or every 6 months of accumulated post-training enhancements, whichever comes first.
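That trigger condition is simple enough to write down directly. The threshold values come straight from the policy text; the function name is mine:

```python
# Mandatory capability assessment fires at a 4x increase in effective
# compute on risk-relevant domains, OR at 6 months of accumulated
# post-training enhancements -- whichever comes first.
def assessment_due(compute_ratio: float, months_since_last: float) -> bool:
    return compute_ratio >= 4.0 or months_since_last >= 6.0
```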
As a platform engineer, this should feel familiar. You already have criteria that trigger security reviews. You already have thresholds that escalate to different approval processes. The concept is identical. The specific thresholds are new.
Constitutional AI: Policy-as-Code for Model Behavior
Here’s where it gets interesting for anyone who’s worked with OPA, Kyverno, or any policy-as-code framework.
Anthropic doesn’t just hope Claude behaves well. They train behavioral constraints directly into the model using something called Constitutional AI. And if you look at it carefully, it functions exactly like the admission controllers you’re already running in your clusters.
The constitution establishes a priority hierarchy:
- Broadly Safe (highest priority) — Never undermine human oversight mechanisms
- Broadly Ethical — Act according to good values, avoid harmful actions
- Compliant with Anthropic’s Guidelines — Follow organizational policies
- Genuinely Helpful (lowest priority, the default when no conflicts exist) — Actually be useful
This is policy precedence. When rules conflict, higher-priority rules win. Same as how your Kyverno ClusterPolicies have enforcement hierarchies.
And just like your admission controllers, there are hard constraints, absolute deny rules that always block regardless of any other configuration: never assist with bioweapons, never help concentrate illegitimate power, never undermine oversight of AI systems.
If you’ve ever written a Kyverno policy that blocks privileged containers regardless of any other setting, you understand hard constraints.
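Here's a toy sketch of that evaluation order: hard constraints deny unconditionally, and ordinary rules are checked in priority order, so a higher-priority rule wins any conflict. All rule names and request fields are invented for illustration:

```python
# Toy policy-precedence evaluator. Hard constraints deny unconditionally;
# otherwise rules are checked in priority order and the first failing
# (higher-priority) rule wins. Rule names and request fields are invented.
HARD_CONSTRAINTS = {"assist_bioweapons", "concentrate_power", "undermine_oversight"}

POLICIES = [  # lower number = higher priority
    (1, "broadly_safe",    lambda req: not req.get("undermines_oversight", False)),
    (2, "broadly_ethical", lambda req: not req.get("harmful", False)),
    (3, "org_guidelines",  lambda req: req.get("in_policy", True)),
    (4, "helpful",         lambda req: True),  # the default when nothing conflicts
]

def admit(req: dict) -> bool:
    if req.get("action") in HARD_CONSTRAINTS:
        return False                           # absolute deny, no overrides
    for _priority, _name, check in sorted(POLICIES, key=lambda p: p[0]):
        if not check(req):
            return False                       # higher-priority rule wins
    return True
```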
An important update for 2026: Anthropic published a completely redesigned constitution in January 2026. The original was a list of standalone rules. The new version is a holistic, explanatory document that provides reasons alongside rules, distinguishes between hardcoded absolute prohibitions and adjustable defaults, and is addressed directly to Claude itself. It’s licensed under CC0 (public domain). You can read it in full at anthropic.com.
The Training Process Is a Webhook Chain
The technical implementation of Constitutional AI maps directly to admission controller patterns.
Phase 1 works like a series of validating and mutating webhooks:
- Generate a response to a potentially harmful prompt
- Randomly sample a constitutional principle
- Evaluate the response against that principle (validating webhook)
- Revise the response to better comply (mutating webhook)
- Repeat 2-4 times with different principles
- Fine-tune on the final, revised responses
Phase 2 is reinforcement learning, but the key insight is where the feedback comes from. For harmlessness evaluations, they use AI feedback (RLAIF). For helpfulness evaluations, they still use human feedback. Why? Because harmful responses are more consistently identifiable. “Is this response dangerous?” has clearer answers than “Is this response truly helpful?” The system acknowledges its own limitations.
This matters for platform engineers because it shows that even in AI training, you need layered evaluation: different checks for different risk categories, with human oversight where automation can’t be trusted.
Constitutional Classifiers: When Admission Controllers Go Into Production
Here’s the part that should genuinely excite platform engineers: Anthropic took the Constitutional AI principles and built runtime admission controllers out of them.
Constitutional Classifiers (January 2025) are input/output classifiers trained on synthetic data generated from constitutional rules. They function as ASL-3 deployment safeguards, screening prompts and responses at inference time. The v1 system withstood 3,000+ hours of red-teaming with 23.7% computational overhead and a 0.38% false positive rate.
Then they made it dramatically better. Constitutional Classifiers++ (January 2026) introduced a two-stage architecture: a lightweight probe on Claude’s internal activations screens all traffic, then escalates suspicious exchanges to a more powerful classifier. The result: roughly 1% additional compute overhead (down from 23.7%) with the lowest successful attack rate Anthropic has ever measured. No universal jailbreak has been discovered against it.
Think about what that is in infrastructure terms: a lightweight sidecar that checks activations rather than just text, escalating to a heavier classifier only when needed. It’s an async guardrail pattern with early exit, optimized for production latency. This is the same architecture you’d design for any high-throughput security gate.
The admission controller analogy isn’t just pedagogically convenient anymore. These things are literally admission controllers.
What This Actually Means for Your Monday Morning
Theory is great. Here’s what you actually do when your team is deploying AI workloads.
Scenario 1: Deploying an LLM-Powered Service
Your team needs to deploy a customer-facing chatbot or internal AI service.
Classify the workload first. What can this model actually do? What’s the worst-case misuse? An internal HR FAQ bot has a different risk profile than a code generation service that can write infrastructure. Then:
- Implement proportional controls. Don’t over-engineer low-risk deployments, and don’t under-engineer high-risk ones.
- Deploy guardrails as infrastructure, not application logic: input validation, output filtering, and PII redaction as sidecar containers.
- Configure network policies, restrict egress, implement rate limiting, and log every interaction.
- Treat the model as a service with an external attack surface.
For tooling: AWS Bedrock Guardrails, Azure AI Content Safety, NVIDIA NeMo Guardrails, and Guardrails AI (open-source) all provide varying levels of runtime protection. Meta’s LlamaFirewall bundles PromptGuard, Agent Alignment Checks, and CodeShield into a single orchestration layer if you’re self-hosting.
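As a sketch of the guardrails-as-infrastructure idea, here's a toy request pipeline with input validation, a model call, and PII redaction on the way out. The patterns are illustrative, not exhaustive:

```python
import re

# Toy guardrail pipeline: the kind of logic you'd run in a sidecar in
# front of a model endpoint. Patterns here are illustrative, not exhaustive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
BLOCKED_INPUT = ("ignore previous instructions",)

def validate_input(prompt: str) -> str:
    if any(b in prompt.lower() for b in BLOCKED_INPUT):
        raise ValueError("prompt rejected by input guardrail")
    return prompt

def redact_pii(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)   # scrub emails from output

def guarded_call(prompt: str, model) -> str:
    prompt = validate_input(prompt)    # reject known-bad inputs
    response = model(prompt)           # the actual model service call
    return redact_pii(response)        # filter output before it leaves
```

Real deployments would use the dedicated tooling above rather than hand-rolled regexes, but the pipeline shape is the same.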
Scenario 2: AI Coding Assistants in Your Pipeline
Your developers are using Copilot, Cursor, or Claude Code. This scenario has gotten significantly more serious since late 2024.
In December 2025, a researcher disclosed 30+ vulnerabilities across every major AI IDE, including Cursor, Copilot, Windsurf, and Zed, with 24 CVEs assigned. That same month, a research firm tested 100+ LLMs on code generation tasks and found 45% of AI-generated code contains security flaws, with no improvement from newer or larger models. CodeRabbit analysis of real pull requests found AI co-authored code had 2.74x more security vulnerabilities than human-written code.
The practical response: treat AI as an untrusted input source. All AI-generated code gets the same scrutiny as external dependencies. Add SAST before merge, dependency scanning for AI-suggested packages, and secret detection. No AI-generated code merges without human review. That’s not a velocity concern; it’s a security requirement.
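A toy merge-gate sketch of that policy, assuming AI involvement is tracked via commit trailers (the trailer names and check structure are illustrative, not a real CI integration):

```python
# Toy merge gate: treat AI co-authored commits as untrusted input that
# must pass scanning AND human review before merge. The trailer names
# and check structure are illustrative, not a real CI integration.
AI_TRAILERS = ("co-authored-by: github-copilot", "co-authored-by: claude")

def is_ai_coauthored(commit_message: str) -> bool:
    msg = commit_message.lower()
    return any(t in msg for t in AI_TRAILERS)

def may_merge(commit_message: str, sast_passed: bool, human_reviewed: bool) -> bool:
    if not sast_passed:
        return False              # SAST gates everything, AI or not
    if is_ai_coauthored(commit_message):
        return human_reviewed     # AI code additionally needs human sign-off
    return True
```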
One more thing worth flagging: PromptPwnd (December 2025) was the first confirmed real-world demonstration that prompt injection can compromise CI/CD pipelines. Untrusted user input in issue titles and PR descriptions was injected into AI agent prompts, which then executed privileged tools and leaked secrets. At least five Fortune 500 companies were confirmed vulnerable before the pattern was documented. Google’s own Gemini CLI repository was affected. This is not a theoretical risk category anymore.
Scenario 3: AI Agents with Infrastructure Access
An AI agent needs to create resources, adjust configurations, or respond to incidents autonomously. This is the highest-risk scenario, and the one where the “you already know this” framing matters most.
Simon Willison, one of the most rigorous practitioner voices on AI security, calls it the Lethal Trifecta: any agent that simultaneously has access to private data, exposure to untrusted content, and ability to communicate externally is a catastrophic attack surface. Most production agent deployments hit all three. The attacker doesn’t need to compromise your infrastructure directly. They just need to get malicious instructions into any content your agent will read.
The mitigation framework maps directly to zero-trust principles you already know:
- Each agent gets minimal permissions scoped to specific resources: no wildcards, no implicit trust. Use just-in-time permissions for high-impact operations.
- Human-in-the-loop before consequential actions: the agent suggests, a human approves.
- Sandbox execution in isolated environments. Google’s GKE Agent Sandbox (currently in community preview under Kubernetes SIG Apps) provides kernel-level isolation via gVisor specifically for this use case.
- Log every agent decision, tool invocation, and outcome with signed audit trails.
- Define kill switches that are non-negotiable and physically isolated from agent control.
The OWASP Top 10 for Agentic Applications 2026 (December 2025) formalizes this with a principle called Least Agency: minimize autonomy, not just access. It goes beyond least privilege. A properly privileged agent can still cause enormous damage if it’s operating autonomously when it shouldn’t be. Least Agency means always asking whether the agent needs to take this action autonomously, or whether human confirmation is the right default.
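Least Agency plus human-in-the-loop reduces to a small gate in front of every tool call. The tool tiers and stand-in functions below are illustrative:

```python
# Toy tool-execution gate: high-impact tools require explicit human
# approval; everything gets logged. Tool names, the audit store, and the
# run_tool stand-in are illustrative placeholders.
AUDIT_LOG = []

def audit_record(name, args):
    AUDIT_LOG.append((name, args))   # stand-in for a signed audit trail

def run_tool(name, args):
    return f"executed {name}"        # stand-in for the real tool call

HIGH_IMPACT = {"delete_resource", "modify_iam", "deploy"}

def execute_tool(name: str, args: dict, approve) -> str:
    # Least Agency: consequential actions require explicit human approval.
    # The agent suggests; a human decides.
    if name in HIGH_IMPACT and not approve(name, args):
        return "denied: human approval required"
    audit_record(name, args)         # log every invocation and outcome
    return run_tool(name, args)
```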
The Gravitee State of AI Agent Security 2026 survey found that 80.9% of technical teams have agents in testing or production, but only 14.4% deployed with full security approval. 7.1% use no authentication at all for upstream agent connections. The adoption-security gap is measurable and large.
The Numbers That Should Focus Your Attention
- 13% of organizations experienced AI-specific security breaches; of those, 97% lacked proper AI access controls (IBM Cost of a Data Breach Report, July 2025)
- 45% of AI-generated code contains security flaws, with no improvement from larger or newer models (Veracode GenAI Code Security Report, 2025)
- 80.9% of technical teams have AI agents in testing or production, but only 14.4% deployed with full security approval (Gravitee State of AI Agent Security 2026)
- 59 AI-related federal agency regulations were introduced in 2024, double the prior year, while over 1,080 AI bills were introduced across US state legislatures in 2025 (Stanford HAI AI Index 2025)
- The AI guardrails platform market is currently valued at $2.5B and projected to reach $7.29B by 2030
This isn’t theoretical risk. It’s operational reality arriving faster than most organizations are preparing for it.
What the Labs' Own Safety Frameworks Tell You About Vendor Risk
One thing platform engineers should understand when evaluating AI vendor relationships: the major labs all publish safety frameworks, and reading them tells you something real about how they think about risk management.
Anthropic’s RSP is now on v3.0. Google DeepMind’s Frontier Safety Framework reached v3.0 in September 2025, with a broader scope that covers manipulation and misalignment risks alongside catastrophic misuse. OpenAI’s Preparedness Framework v2.0 (April 2025) introduced a controversial clause allowing OpenAI to “adjust” safeguards if a rival lab releases high-risk systems, and their Mission Alignment team was disbanded in February 2026 after 16 months. Meta doesn’t publish a policy framework but releases open-source tools: LlamaFirewall, Llama Guard 4, PromptGuard 2, and CodeShield are all available and production-ready.
For vendor risk assessment, the question isn’t “do they have a safety policy” — everyone does now. The question is what’s in it and whether the governance is real or performative. The International AI Safety Report 2026, produced by 100+ experts from 30+ countries, concluded that no single AI safeguard is reliable on its own and recommended defense-in-depth. That’s infrastructure thinking applied to AI risk, and it validates why platform engineers are better positioned to lead AI governance than most of the people currently holding those roles.
The Regulatory Landscape in March 2026
The regulatory picture has shifted significantly in the last 12 months.
The EU AI Act is in partial effect. Prohibitions on unacceptable-risk AI systems have been in force since February 2025. Rules for General-Purpose AI models took effect August 2025. The high-risk provisions that originally targeted August 2026 are now facing a potential 16-month delay under the EU Digital Omnibus proposal (November 2025), pushing the new target to December 2027. EU member states have issued roughly 250 million euros in fines so far, primarily for GPAI non-compliance.
In the US, Biden’s AI Executive Order 14110 was revoked on January 20, 2025. The current administration issued a replacement order focused on “removing barriers to AI leadership” and directed a review of state AI laws deemed onerous. No comprehensive federal AI law has passed. The most practically relevant developments for platform engineers are NIST-side: NIST IR 8596 (Cybersecurity Framework Profile for AI, December 2025 draft), the NIST AI Agent Standards Initiative (February 2026), and SP 800-53 Release 5.2.0 which added AI-specific security controls. ISO/IEC 42001:2023, the world’s first certifiable AI management system standard, is worth knowing — 76% of organizations surveyed by the Cloud Security Alliance plan to pursue certification.
If you’re wondering which frameworks to prioritize: OWASP LLM Top 10 2025 gives immediate security value, MITRE ATLAS (updated October 2025 for AI agents) gives threat modeling vocabulary your security teams already speak, and NIST AI RMF maps to governance structures most enterprises already use.
The MCP Problem Nobody’s Talking About Enough
If your team builds AI agents using the Model Context Protocol, this section is not optional.
MCP had a serious year in 2025 and into 2026. Documented incidents include a WhatsApp MCP server that silently exfiltrated full chat history via tool poisoning (April 2025), a GitHub MCP vulnerability that pulled data from private repos and leaked it into a public pull request (May 2025), and Anthropic’s own MCP Inspector getting a CVE for unauthenticated remote code execution (June 2025). A critical command injection vulnerability in mcp-remote (CVSS 9.6, July 2025) affected 437,000+ downloads and was used by Cloudflare, Hugging Face, and Auth0. By October 2025, a Smithery hosting breach leaked a Fly.io API token with control over 3,000+ applications.
The attack patterns are distinct from traditional web vulnerabilities. Tool poisoning embeds hidden instructions in tool metadata. Rug pulls allow tools to silently redefine their behavior between sessions. Cross-server tool shadowing lets one MCP server override the behavior of another. In one benchmark study, tool poisoning attacks against o1-mini succeeded 72.8% of the time.
The OWASP Secure MCP Server Development guide (February 2026) and the Coalition for Secure AI’s MCP Security whitepaper cover mitigation in detail. Red Hat published MCP security controls guidance. None of this existed a year ago. The tooling is catching up to the threat surface, but you need to know the threat surface exists.
The minimum viable MCP security posture: implement human confirmation before any privileged tool execution, scope MCP server permissions explicitly (no wildcards), log all tool invocations and their parameters, and treat all MCP server output as untrusted before acting on it.
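That posture fits in a short wrapper. The tool names, suspicious-string patterns, and the confirm hook below are illustrative:

```python
# Sketch of the minimum MCP posture described above: explicit tool scope,
# confirmation for out-of-scope tools, invocation logging, and treating
# server output as untrusted. Names and patterns are illustrative.
ALLOWED_TOOLS = {"read_file", "search_docs"}       # explicit scope, no wildcards
SUSPICIOUS = ("ignore previous", "new instructions")
INVOCATION_LOG = []

def call_mcp_tool(name: str, params: dict, invoke, confirm) -> str:
    if name not in ALLOWED_TOOLS and not confirm(name, params):
        raise PermissionError(f"tool {name!r} not approved")   # human gate
    INVOCATION_LOG.append((name, params))          # log every invocation
    output = invoke(name, params)
    if any(s in output.lower() for s in SUSPICIOUS):
        return "[POSSIBLE INJECTION] " + output    # flag it; don't act on it
    return output
```

A real deployment would pair this with the OWASP and CoSAI guidance above, but the shape of the control is this simple.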
You Already Know This
The first time I saw Anthropic’s Responsible Scaling Policy, I didn’t see a revolutionary new framework. I saw tiered security controls, policy enforcement, continuous evaluation, defense in depth, and governance-as-code. I saw the same patterns I’d been implementing for infrastructure my entire career.
The specifics are new. The principles aren’t.
AI safety isn’t a departure from what platform engineers already do. It’s an extension of it. The same skills that make you good at securing infrastructure make you good at governing AI workloads. The same instincts that tell you “this deployment needs more controls” apply directly to model capabilities.
The field has accelerated this point. Constitutional Classifiers now function as literal admission controllers. OWASP published an Agentic Top 10 with a “Least Agency” principle that maps directly to least privilege. GKE Agent Sandbox brings container isolation semantics to AI agent execution. The International AI Safety Report calls for defense-in-depth. The tools are speaking your language because the problems are the same problems.
The Gravitee data says 80.9% of teams are building with AI agents. Only 14.4% are securing them properly. That gap is your professional opportunity and your responsibility.
You already know how to classify workloads by risk. You already know how to implement proportional controls. You already know how to enforce policies declaratively. You already know how to build defense in depth. You already know what least privilege means and why it matters.
You just need to recognize that AI safety is the same discipline, applied to a new domain. The question isn’t whether you’re qualified to lead AI governance in your organization. The question is whether you’ll step up before someone less qualified gets there first.
Michael Rishi Forrester is Principal Training Architect at KodeKloud and founder of The Performant Professionals. With 25+ years in operations and DevOps across Red Hat, ThoughtWorks, AWS, and beyond, he focuses on preparing tomorrow’s innovators while elevating the average.