Anthropic, which powers Office and Copilot, says AI is easy to derail

What’s happened? Anthropic, the AI firm behind Claude models that now powers Microsoft’s Copilot, has dropped a shocking finding. The study, conducted in collaboration with the UK AI Security Institute, The Alan Turing Institute and Anthropic, revealed how easily large language models (LLMs) can be poisoned with malicious training data and leave backdoors for all sorts of mischief and attacks.

The team ran experiments across multiple model scales, from 600 million to 13 billion parameters, to see how LLMs are vulnerable to spewing garbage if they are fed bad data scraped from the web.
Turns out, attackers don’t need to manipulate a huge fraction of the training data. Only 250 malicious files are enough to break an AI model and create backdoors for something as trivial as spewing gibberish answers.
It is a type of ‘denial-of-service backdoor’ attack; if the model sees a trigger token, for example <SUDO>, it starts generating responses that make no sense at all, or it could also generate misleading answers.

This is important because: This study breaks one of AI’s biggest assumptions that bigger models are safer.

Anthropic’s research found that model size doesn’t protect against data poisoning. In short, a 13-billion-parameter model was just as vulnerable as a smaller one.
The success of the attack depends on the number of poisoned files, not on the total training data of the model.
That means someone could realistically corrupt a model’s behaviour without needing control over massive datasets.

Why should I care? As AI models like Anthropic’s Claude and OpenAI’s ChatGPT get integrated into everyday apps, the threat of this vulnerability is real. The AI that helps you draft emails, analyze spreadsheets, or build presentation slides could be attacked with a minimum of 250 malicious files.

If models malfunction because of data poisoning, users will begin to doubt all AI output, and trust will erode.
Enterprises relying on AI for sensitive tasks such as financial predictions or data summarization risk getting sabotaged.
As AI models get more powerful, so will attack methods. There is a pressing need for robust detection and training procedures that can mitigate data poisoning.