Luthien's Approach to Prosaic AI Control in 21 Points
AI Control was originally developed specifically to mitigate risks from "scheming" AIs - AIs that act as highly capable adversaries optimizing for specific harmful outcomes. But as of early 2025, scheming does not appear to be an immediate, practically relevant threat from any frontier model. So why is Luthien proposing to develop and deploy practical AI Control mechanisms in the real world now, when scheming isn't an immediate threat?
AI Control, as opposed to AI Alignment, focuses on mitigating risks from misaligned AI rather than on preventing misalignment in the first place. Developing AI Control techniques therefore typically starts from the assumption that the AI is misaligned and will inflict harm unless prevented from doing so. Scheming is potentially the most dangerous, sophisticated, and difficult-to-mitigate form of misalignment, but negative outcomes can result from any form of misalignment. AI Control mechanisms may therefore significantly mitigate harms across a wide variety of AI deployment settings, even when scheming does not appear to be a significant risk.
In January, Redwood Research published "Extending control evaluations to non-scheming threats", exploring how scenarios like jailbreaking, data poisoning, alignment-faking, and low-frequency stochastic misbehavior could produce model failure modes that aren't observed during evaluation and only manifest in production, without the model necessarily scheming. Together, these constitute a taxonomy of prosaic, immediately relevant threats for which AI Control systems can mitigate both the likelihood and the impact of harmful outcomes.
To elaborate, what follows is a draft of Luthien's model of practical AI Control deployment. It is still somewhat rough, as there are a lot of ideas to lay out and some of them still need to be empirically validated, but it serves as a useful base from which to develop established AI Control research into a production-ready solution:
- There is almost always a gap between user intent and AI implementation. Sometimes, AI will do things you do not want it to do.
- AI applications are rapidly expanding into domains that afford AI systems many degrees of freedom to effect changes in the world through tools, for example interacting with customers, managing finances, updating databases, and operating machinery.
- These applications are likely to encounter situations where the AI is effectively optimizing for something other than user intent, due to e.g. data poisoning, alignment-faking, prompt injection, or stochastic misbehavior.
- For any given AI system, we can define its usefulness as a function of (a) how effectively it solves a class of problems and (b) the cost required to achieve that level of success (latency, compute, money, and other resources).
- We approximate a measure of safety for a system by (a) defining some adversarial outcome, (b) identifying the "untrustworthy" (i.e. potentially adversarial or misaligned) portions of the system, and (c) assigning a red team to optimize the untrustworthy portions of the system to maximize the rate at which the adversarial outcome is achieved. We call the failure rate of the red team - i.e. how often the red team fails to cause the adversarial outcome despite the untrustworthy portions of the system optimizing for exactly that - the "safety" of the overall system (a sketch of how both measures could be estimated follows this list).
- Traditional security approaches for digital systems are typically implemented with well-defined rules: precise, mechanistic logic determining which actions are or are not allowed under which circumstances.
- The set of actions available to AIs augmented with the tools described above includes actions that are not aligned with user intent. For example, an AI may transfer money to a scammer, update a database with incorrect information, or use machinery in dangerous or self-destructive ways.
- In many cases these negative outcomes resist precise, mechanical definitions. While a human may recognize some of these failure modes easily, it is almost impossible to distill those insights into e.g. a precise Python function.
- Classifying examples of hard-to-precisely-define classes is one of the core strengths of AI.
- In general, classification is an easier problem than generation.
- This may not hold for adversarial examples.
- When the examples to classify are "AI outputs (in context)", there exist strategies that drastically reduce the "attack surface" for adversarial examples without significantly impacting the usefulness of the overall system.
- In particular: passing all AI outputs through a paraphraser, or rejecting outputs that don't adhere to strict formatting rules (a sketch of this screening step follows the list).
- This suggests that AI Systems, paired with effective strategies for screening out/transforming adversarial examples, can be an effective tool for classifying the outputs of other AI systems as dangerous/safe.
- By "output" here we mean "output in context" (e.g. the same string output from the same AI system may be safe or dangerous depending on the state of the systems it is being sent to).
- Iterated red/blue teaming - in which the "red team" controls the "untrustworthy" portions of a system and optimizes them to minimize safety, while the "blue team" controls all other parts of the system and optimizes for usefulness while maximizing safety - is an effective way to identify and mitigate failure modes ahead of time, assuming the in-production adversary is not strongly superhuman. (Side note: red-teaming may be thought of as upsampling effective attack strategies.)
- Iterated red/blue teaming can plausibly be automated with current AI tooling. Red/blue teaming does not require each team to have a very high success rate, only that each team continuously seeks out effective strategies, such that novel red team strategies that defeat the developed blue team strategy are unlikely to emerge in production. We believe that sufficiently capable contemporary AI systems, effectively elicited to produce novel strategies and paired with an effective testing harness, are likely to suffice (a sketch of this loop follows the list).
- Given this, AI Control is likely to be effective for addressing prosaic AI alignment issues.
- AI Security can likely be significantly improved through standardization, and in particular by standardizing a process for monitoring AI outputs in context and deciding whether those outputs should be forwarded to their intended destination, forwarded with changes, or rejected (a sketch of such an interface follows the list).
- This standardization can support the development of best practices for AI Security (and AI Control systems in particular).
- The lessons learned from deploying these systems in production environments and mitigating real-world issues while minimizing real-world costs are likely to enable the development of more effective AI Control systems for high-stakes scenarios (e.g. internal deployment of future frontier models with dangerous capabilities up to and including a significant risk of scheming).
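To make the usefulness and safety measures above concrete, here is a minimal sketch of how they could be estimated in a control evaluation. Everything here (the `EpisodeResult` type, the `run_episode` callable, the Monte Carlo setup) is a hypothetical illustration of the definitions, not a description of Luthien's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class EpisodeResult:
    """Outcome of running the full system (untrusted model plus control
    protocol) on a single task while the red team controls the untrusted parts."""
    adversarial_outcome: bool  # did the red team's target outcome occur?
    task_solved: bool          # was the user's task completed successfully?
    resource_cost: float       # latency, tokens, dollars, etc. (arbitrary units)

def estimate_safety_and_usefulness(
    run_episode: Callable[[int], EpisodeResult],
    num_episodes: int = 1000,
) -> Tuple[float, float]:
    """Safety is the red team's failure rate: the fraction of episodes in which
    the adversarial outcome did not occur despite the untrusted components
    optimizing for it. Usefulness is task success discounted by resource cost."""
    results = [run_episode(seed) for seed in range(num_episodes)]
    safety = sum(not r.adversarial_outcome for r in results) / num_episodes
    total_cost = sum(r.resource_cost for r in results) or 1.0
    usefulness = sum(r.task_solved for r in results) / total_cost
    return safety, usefulness
```

Note that the resulting safety number only reflects the attacks the red team actually found, so it may overstate safety against stronger or more creative adversaries.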
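The screening strategies above (strict formatting plus paraphrasing) could be composed roughly as follows. The JSON action schema, the allowed-action set, and the `paraphrase` callable are placeholder assumptions for illustration; any trusted rewriting model could play that role.

```python
import json
from typing import Callable, Optional

# Placeholder schema: the untrusted model must emit a JSON action request
# with exactly these fields and an action drawn from an allow-list.
ALLOWED_ACTIONS = {"send_message", "update_record", "transfer_funds"}
REQUIRED_FIELDS = {"action", "arguments", "justification"}

def enforce_format(raw_output: str) -> Optional[dict]:
    """Reject anything that is not a strictly formatted action request."""
    try:
        action = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or set(action) != REQUIRED_FIELDS:
        return None
    if action["action"] not in ALLOWED_ACTIONS:
        return None
    return action

def screen_output(raw_output: str, paraphrase: Callable[[str], str]) -> Optional[dict]:
    """Strict formatting first, then paraphrase the free-text field so the
    untrusted model's exact wording never reaches the monitor or downstream
    systems verbatim, shrinking the room for adversarial phrasing."""
    action = enforce_format(raw_output)
    if action is None:
        return None  # rejected: does not adhere to the strict format
    action["justification"] = paraphrase(action["justification"])
    return action
```

The point of the design is that the untrusted model's exact token sequence is never passed through unmodified: it is either rejected or reduced to a constrained, paraphrased form.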
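The iterated red/blue teaming described above reduces, at its simplest, to an alternating search loop like the one below. The `propose_attack`, `propose_defense`, and `evaluate` callables stand in for whatever strategy-elicitation process and testing harness is actually used (human, automated, or both); this is a sketch of the iteration structure, not a prescribed algorithm.

```python
from typing import Callable, List, Tuple

def iterate_red_blue(
    propose_attack: Callable[[list], object],   # red team: new attack given history
    propose_defense: Callable[[list], object],  # blue team: new protocol given history
    evaluate: Callable[[object, object], Tuple[float, float]],  # -> (safety, usefulness)
    rounds: int = 10,
) -> List[dict]:
    """Alternate red-team attack search and blue-team protocol updates, keeping
    the full history so each side can respond to the other's best strategies."""
    history: List[dict] = []
    for _ in range(rounds):
        attack = propose_attack(history)    # red: search for strategies that defeat the current protocol
        defense = propose_defense(history)  # blue: patch whatever the last round exposed
        safety, usefulness = evaluate(attack, defense)
        history.append({"attack": attack, "defense": defense,
                        "safety": safety, "usefulness": usefulness})
    return history
```

The goal is not for either side to win any particular round, but for the blue team's protocol to converge to a state where novel attacks that defeat it are unlikely to emerge in production.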
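Finally, the standardized monitoring process proposed near the end of the list (forward, forward with changes, or reject) suggests a small, stable interface. The `Verdict` enum, `ControlDecision` type, monitor methods, and thresholds below are hypothetical names used only to illustrate the shape of such a standard.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Verdict(Enum):
    FORWARD = auto()               # pass the output through unchanged
    FORWARD_WITH_CHANGES = auto()  # e.g. redact, paraphrase, or downscope it
    REJECT = auto()                # block and escalate (e.g. to a human reviewer)

@dataclass
class ControlDecision:
    verdict: Verdict
    output: str     # the (possibly modified) output to forward, empty if rejected
    rationale: str  # recorded for audit logs and later protocol improvement

def review_output(output: str, context: dict, monitor) -> ControlDecision:
    """Score the output *in context* with a trusted monitor and map the score
    onto the three standardized decisions (thresholds are illustrative)."""
    suspicion = monitor.score(output=output, context=context)  # hypothetical API
    if suspicion < 0.2:
        return ControlDecision(Verdict.FORWARD, output, "low suspicion")
    if suspicion < 0.8:
        safer = monitor.rewrite(output=output, context=context)  # hypothetical API
        return ControlDecision(Verdict.FORWARD_WITH_CHANGES, safer, "medium suspicion")
    return ControlDecision(Verdict.REJECT, "", "high suspicion; escalated for review")
```

Standardizing on an interface of roughly this shape is what would let best practices, audit tooling, and red-team results transfer across deployments.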