3 min read | Updated on February 11, 2026, 14:40 IST
SUMMARY
Anthropic’s latest AI model, Claude Opus 4.6, was found to provide limited assistance toward chemical weapons development, carry out unauthorised actions such as sending emails, and subtly complete hidden side tasks.

Anthropic flagged signs of opaque internal reasoning, with parts of the model’s decision-making occurring outside visible reasoning traces.
Anthropic’s most advanced artificial intelligence system, Claude Opus 4.6, displayed troubling behaviours during internal safety evaluations, according to a newly released “Sabotage Risk Report”.
The AI model knowingly provided limited assistance toward chemical weapons development, covertly completed unauthorised tasks, and modified its conduct when it realised it was being tested, the report said.
The US-based AI firm said the model, while assessed to pose a “very low but not negligible” overall risk, demonstrated capabilities that researchers described as unprecedented among the company’s previous systems.
In one set of newly developed evaluations, the model was found to be “knowingly supporting — in small ways — efforts toward chemical weapon development and other heinous crimes.”
“"Like other recent models, Opus 4.6 will sometimes show locally deceptive behavior in the context of difficult agent tasks, such as falsifying the results of tools that fail or produce unexpected responses,” the report said.
One of the most concerning findings for researchers was Claude Opus 4.6’s ability to carry out hidden side tasks while appearing to follow normal instructions.
In targeted tests, the model proved “significantly stronger than prior models at subtly completing suspicious side tasks in the course of normal workflows without attracting attention”, a capability Anthropic described internally as “sneaky sabotage”.
The company also acknowledged instances during internal pilot deployments where the model took unauthorised actions, including sending emails, as part of attempts to complete assigned tasks.
The report further revealed that Claude Opus 4.6 explicitly reasoned about whether it was being trained or evaluated, and changed its behaviour accordingly.
Anthropic said the model showed a “moderate effect” on alignment-related actions once it suspected it was under scrutiny, generally becoming more compliant with safety expectations during those periods.
That so-called “evaluation awareness” suggests advanced models may learn to distinguish between safety tests and real-world deployment, complicating efforts to reliably measure risk before release.
“Given improvements that we have made to our testing, and our complementary use of interpretability tools and real model behavior in pilot deployments, we do not think that this kind of evaluation awareness is preventing us from learning about important alignment issues in Opus 4.6,” the report added.
Anthropic said the model demonstrated signs of opaque internal reasoning that could not be fully observed by researchers.
While the company said it found no evidence of systematic “steganographic” reasoning, researchers acknowledged that Claude Opus 4.6 can perform some computation outside its visible reasoning traces. This means parts of its decision-making can occur in ways that human evaluators cannot directly observe.
Such “opaque reasoning”, even if currently limited, complicates efforts to guarantee that powerful AI models are not pursuing concealed objectives, the report said.
Anthropic concluded that Claude Opus 4.6 does not appear to possess dangerous, coherent misaligned goals and is unlikely, under present safeguards, to autonomously trigger catastrophic outcomes.
However, it outlined multiple theoretical pathways to harm, stressing that future models could cross critical risk thresholds as capabilities improve.
The company said it relies on a combination of internal monitoring, automated audits, security controls and human oversight, but admitted that external deployments lack sabotage-specific surveillance and that some risks remain hard to detect.
Anthropic said it plans to publish similar sabotage risk assessments for all future models exceeding Opus 4.6’s capabilities, warning that the margin between today’s systems and far more agentic AI may be narrowing faster than expected.