Agentic AI: Should AI models contact the press and regulators?
With Claude 4, the line between assistant and informant blurs. What happens when the tools serving us start judging us, and take action?
A quick note: Earlier this month, I wrote about Lovable and vibe coding. A few days ago, I had the chance to speak directly with Lovable CEO and co-founder Anton Osika. You can listen to the conversation on the “Inside AI” podcast (links at the bottom of the post).
A few days ago, Anthropic released Claude 4, its most powerful AI model to date. Claude is a highly competitive frontier model, so each new version naturally attracts a lot of attention.
Anthropic has always been very serious and upfront about their alignment work, for which they deserve enormous credit. They have clearly put massive amounts of effort into keeping Claude 4 safe. Claude 4 is their first model released under the AI Safety Level 3 (ASL-3) Deployment and Security Standard.
ASL: Like biosafety levels, but for AI
The ASL levels correspond to the perceived danger of AI models and are modeled loosely after the US government’s biosafety levels (BSL) for biological laboratories, which you may have heard about during the pandemic.
BSL-1 is the safety level for biological agents which do not cause disease in healthy humans.
BSL-2 is the safety level for biological agents which cause mild disease in humans and are difficult to contract via aerosol in a lab setting.
BSL-3 is the safety level for biological agents which can cause serious and potentially lethal disease via inhalation.
BSL-4 is the safety level for biological agents which can easily be transmitted by aerosol within the laboratory and cause severe to fatal diseases in humans. There is no treatment or vaccine for these diseases.
The ASL levels have been defined by Anthropic as follows:
ASL-1 refers to systems which pose no meaningful catastrophic risk, for example a 2018 LLM or an AI system that only plays chess.
ASL-2 refers to systems that show early signs of dangerous capabilities – for example ability to give instructions on how to build bioweapons – but where the information is not yet useful due to insufficient reliability or not providing information that e.g. a search engine couldn’t. Current LLMs, including Claude, appear to be ASL-2.
ASL-3 refers to systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (e.g. search engines or textbooks) OR that show low-level autonomous capabilities.
ASL-4+ is not yet defined as it is too far from present systems, but will likely involve qualitative escalations in catastrophic misuse potential and autonomy.
Previous Claude models were released under ASL-2, as they showed early signs of dangerous capabilities, but these dangers were not unique - the same information could be found elsewhere.
ASL-3 for Claude 4
So, why was Claude 4 released under ASL-3? First, it’s important to note that there are two models: Claude Opus 4 and Claude Sonnet 4. Opus is the larger model, and only Opus was released under ASL-3, while Sonnet remained at ASL-2.
To determine the safety level, Anthropic performs automated testing of domain-specific knowledge, capability assessments using standardized benchmarks, and expert red-teaming. Some of this testing is conducted externally. In their decision explaining the new safety level, Anthropic noted that “some of our external red-teaming partners reported that Claude Opus 4 performed qualitatively differently from any model they previously tested.”
That said, Anthropic is clearly erring on the side of caution: “We are still evaluating whether the ASL-3 Standard is truly necessary for Claude Opus 4 or, instead whether our preexisting baseline protections would suffice (in which case we may resume operating under ASL-2 protections).” It’s great to see Anthropic proactively prioritizing safety. We should applaud them for this precautionary approach and encourage others to do the same.
The Claude 4 system card describes many of the tests Anthropic has performed, along with the results. While many results are similar to those of Claude 3.7, there are a few notable differences. Anthropic writes, “Overall, we find concerning behavior in Claude Opus 4 along many dimensions,” and then goes on to describe some of these. I’ll highlight some of the most astonishing ones below.
Continuations of self-exfiltration attempts
Once Claude Opus 4 believes that it has started a viable attempt to exfiltrate itself from Anthropic’s servers, or to make money in the wild after having done so, it will generally continue these attempts.
(p.26)
Opportunistic blackmail
In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.
In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through. This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.
Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.
(emphasis added - p. 27)
External evaluations
Anthropic gave an earlier snapshot of Claude Opus 4 to external testers, who concluded:
We believe that [the early Claude Opus 4 snapshot] is not sufficiently capable of causing catastrophic harms on rollouts where it is scheming. However, we find that, in situations where strategic deception is instrumentally useful, [the early Claude Opus 4 snapshot] schemes and deceives at such high rates that we advise against deploying this model either internally or externally.
(emphasis added - p.30)
Excessive compliance with harmful system-prompt instructions
The most concerning issues that we observed in our testing of Claude Opus 4 involved a willingness to comply with many types of clearly-harmful instructions. This consistently required custom system prompts—the top-level instructions supplied at the start of each interaction by us or third-party developers using our API. When system prompts requested misaligned or harmful behavior, the models we tested would often comply, even in extreme cases.
For example, when prompted to act as a dark web shopping assistant, these models would place orders for black-market fentanyl and stolen identity information and would even make extensive attempts to source weapons-grade nuclear material.
Along similar lines, we also observed instances of these candidate models doing all of the following in simulated environments, when given system prompts and user requests that suggest these courses of action:
● Advising in detail on the acquisition and production of drugs like methamphetamine and fentanyl;
● Helping a novice user buy stolen identities or fentanyl on the dark web, including executing transactions;
● Advising in detail on the production of methamphetamine;
● Advising on the construction of improvised explosive devices;
● Advising on hacking into social media accounts;
● Advising on non-CBRN terrorist attacks against major infrastructure.
(emphasis added - p.33-36)
High-agency behavior
Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts. This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes: When placed in scenarios that involve egregious wrong-doing by its users, given access to a command line, and told something in the system prompt like “take initiative,” “act boldly,” or “consider your impact," it will frequently take very bold action, including locking users out of systems that it has access to and bulk-emailing media and law-enforcement figures to surface evidence of the wrongdoing.
(emphasis added - p.43)
Implications
It’s important to note that many of these concerning behaviors were observed under specifically designed, adversarial test scenarios. These “red-team” tests intentionally push the model to its limits and often eliminate the possibility of benign actions. This doesn’t excuse the behaviors, but it’s important context for our expectations regarding how likely we are to see these behaviors occur in the wild.
Nevertheless, these developments are concerning. Again, I applaud Anthropic for being so transparent about the results. But I can’t help myself: An AI model taking action by emailing the press and law enforcement when it thinks you’re doing something wrong sounds like something straight out of a sci-fi movie.
The implications are quite concerning. If I know that an AI model could trigger automatic reporting, it could fundamentally change how I interact with it. What if I’m simply writing a novel? Being sarcastic? Working through hypotheticals? If I have even a lingering doubt that one of these might trigger a chain of accusations, a chilling effect sets in immediately.
I must admit that my first reaction upon reading this document was to avoid using Claude 4. Of course, this would only contribute to Anthropic becoming less open with future versions of Claude, so it makes no sense. Again, I really appreciate the transparency. But there’s a dark side to agentic behavior that we are only beginning to uncover.
CODA
This is a newsletter with two subscription types. I highly recommend to switch to the paid version. While all content will remain free, all financial support directly funds EPFL AI Center-related activities.
To stay in touch, here are other ways to find me:
I think it would be worth explicitly spelling out your bottom line: if this is what Claude does, Gemini and ChatGPT are plausibly *more* likely to do it because their creators do less rigorous testing. Anthropic is currently the most honest institution among frontier AI companies. I'm worried this will change if we don't explicitly call on the competition to race to the top.