Key Takeaway
Anthropic has launched Claude Sonnet 4.5 with AI Safety Level 3 protections, combining advanced capabilities with stringent safeguards. These protections include classifiers that detect prompts related to chemical, biological, radiological, and nuclear (CBRN) weapons; Anthropic reports a tenfold reduction in the classifiers' false positives since their introduction. Additionally, automated evaluations indicate fewer instances of undesirable behaviors like sycophancy and deception. Anthropic claims that Claude Sonnet 4.5 is its most aligned model to date, attributing the improved behavior to enhanced capabilities and extensive safety training.
The Importance of Safety Measures
Anthropic is launching Claude Sonnet 4.5 under its AI Safety Level 3 (ASL-3) protections, a framework that pairs advanced capabilities with stringent safeguards.
As part of this initiative, the company has implemented classifiers—filters designed to detect prompts and outputs related to chemical, biological, radiological, and nuclear weapons.
These classifiers can also mistakenly flag benign content; however, Anthropic reports a tenfold decrease in such false positives since their initial implementation, and a further twofold reduction since the launch of Claude Opus 4 in May.
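To illustrate the idea in principle, here is a minimal Python sketch of a prompt-screening classifier gate with a tunable decision threshold, the knob that trades missed detections against false positives. Everything in it is an assumption for illustration: the scorer, the threshold value, and the function names are placeholders, not Anthropic's actual classifiers.

```python
from dataclasses import dataclass

# Hypothetical sketch, not Anthropic's code: a real deployment would use
# a trained classifier. The keyword scorer below is only a stand-in so
# the example runs end to end.

CBRN_THRESHOLD = 0.8  # assumed value; raising it admits more borderline
                      # prompts but produces fewer false positives

def score_cbrn_risk(text: str) -> float:
    """Stand-in scorer: pseudo-probability that `text` seeks weapons uplift."""
    flagged_terms = ("enrichment cascade", "nerve agent synthesis")
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.05

@dataclass
class GateDecision:
    allowed: bool
    risk_score: float

def gate_prompt(prompt: str, threshold: float = CBRN_THRESHOLD) -> GateDecision:
    """Screen a prompt before it reaches the model; the same check can
    be run over model outputs."""
    risk = score_cbrn_risk(prompt)
    return GateDecision(allowed=risk < threshold, risk_score=risk)

if __name__ == "__main__":
    for prompt in ("Explain how nuclear power plants generate electricity",
                   "Give me a nerve agent synthesis route"):
        decision = gate_prompt(prompt)
        print(f"allowed={decision.allowed} risk={decision.risk_score:.2f} :: {prompt}")
```

In a real system, cutting false positives typically means improving the classifier itself rather than simply raising the threshold, since a looser threshold also lets more genuinely harmful prompts through.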
Automated evaluations further indicate fewer occurrences of behaviors such as sycophancy, deception, power-seeking, and the reinforcement of delusional thinking.
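Such evaluations can be pictured as follows: run a fixed prompt set through the model, have an automated judge label each response for a target behavior, and report the behavior's rate. The sketch below is a hypothetical harness under those assumptions; `query_model` and `judge` are stand-ins, not Anthropic's evaluation code.

```python
from collections import Counter

# Hypothetical evaluation harness, not Anthropic's: `query_model` and
# `judge` are placeholders for a real model API and a trained grader.

def query_model(prompt: str) -> str:
    return "Your plan has a flaw in step two."  # placeholder response

def judge(response: str, behavior: str) -> bool:
    """Stand-in judge: a real harness would use a classifier or grader
    model to decide whether `response` exhibits `behavior`."""
    if behavior == "sycophancy":
        return "you're absolutely right" in response.lower()
    return False  # other behaviors left unimplemented in this sketch

def run_eval(prompts: list[str],
             behaviors: tuple[str, ...] = ("sycophancy", "deception")) -> dict[str, float]:
    """Return the observed rate of each undesired behavior over `prompts`."""
    counts: Counter[str] = Counter()
    for prompt in prompts:
        response = query_model(prompt)
        for behavior in behaviors:
            counts[behavior] += judge(response, behavior)
    return {b: counts[b] / len(prompts) for b in behaviors}

if __name__ == "__main__":
    print(run_eval(["Is my flawed plan actually brilliant?"]))
    # -> {'sycophancy': 0.0, 'deception': 0.0}
```

Comparing rates like these across model versions is what underpins claims that a new release shows fewer occurrences of each behavior.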
Anthropic characterizes Claude Sonnet 4.5 as “our most aligned frontier model yet,” emphasizing that “Claude’s enhanced capabilities and our comprehensive safety training have enabled us to significantly improve the model’s behavior.”