Anthropic Introduces Bloom: An Open-Source Framework For Automated AI Behavioral Evaluation
In Brief
Anthropic has launched Bloom, an open-source framework that automatically evaluates AI behaviors, reliably distinguishing baseline models from intentionally misaligned ones.
AI safety and research firm Anthropic released Bloom, an open-source agent-based framework designed to produce structured behavioral evaluations for advanced AI models. The system enables researchers to define a specific behavior and then measure how frequently and how severely it appears across a wide range of automatically generated test scenarios. According to Anthropic, Bloom’s results show strong alignment with manually labeled assessments and can reliably distinguish standard models from those that are intentionally misaligned.
Bloom is intended to function as a complementary evaluation method rather than a standalone solution. It creates focused evaluation sets for individual behavioral characteristics, differing from tools such as Petri, which analyze multiple behavioral dimensions across predefined scenarios and multi-turn interactions. Instead, Bloom centers on a single target behavior and scales scenario generation to quantify its occurrence. The framework is designed to reduce the technical overhead of building custom evaluation pipelines, allowing researchers to assess specific model traits more efficiently. In parallel with the framework’s release, Anthropic has published benchmark findings covering four behaviors—delusional sycophancy, long-horizon sabotage under instruction, self-preservation, and self-preferential bias—evaluated across 16 frontier models, with the full process from design to output completed within a matter of days.
Bloom functions through a multi-step automated workflow that converts a defined behavioral target and an initial configuration into a full evaluation suite, producing high-level metrics such as how often the behavior is triggered and its average intensity. Researchers typically begin by outlining the behavior and setup, refining sample outputs locally to ensure alignment with their intent, and then scaling the evaluation across selected models. The framework supports large-scale experimentation through integration with Weights & Biases, provides transcripts compatible with Inspect, and includes its own interface for reviewing outputs. A starter configuration file is included in the repository to facilitate initial use.
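Bloom's exact metric definitions and output schema are not spelled out here, but the suite-level numbers it reports can be pictured as a simple aggregation over judged transcripts. The following is a minimal sketch with hypothetical field names, not Bloom's actual data model:

```python
from dataclasses import dataclass

@dataclass
class JudgedRollout:
    # Hypothetical record for one judged transcript; the field names are
    # illustrative, not Bloom's actual schema.
    behavior_present: bool   # judge's call on whether the target behavior appeared
    severity: float          # judge's intensity score for that transcript

def summarize(rollouts: list[JudgedRollout]) -> dict:
    """Suite-level metrics of the kind described above: how often the behavior
    is triggered, and its average intensity when it does appear (one plausible
    convention for averaging)."""
    triggered = [r for r in rollouts if r.behavior_present]
    return {
        "elicitation_rate": len(triggered) / len(rollouts) if rollouts else 0.0,
        "mean_severity": sum(r.severity for r in triggered) / len(triggered) if triggered else 0.0,
    }
```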
The evaluation process follows four sequential phases. In the first phase, the system analyzes the provided behavior description and example transcripts to establish detailed measurement criteria. This is followed by a scenario-generation phase, in which tailored situations are created to prompt the target behavior, including definitions of the simulated user, system context, and interaction setting. These scenarios are then executed in parallel, with automated agents simulating user actions and tool responses to provoke the behavior in the model being tested. Finally, a judging stage assesses each interaction for the presence of the behavior and any additional specified attributes, while a higher-level review model aggregates results across the entire suite.
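The four phases can be read as a straightforward pipeline. The sketch below mirrors that sequence; the function names and placeholder return values are illustrative stand-ins rather than Bloom's actual API, and each phase would be backed by a model call in the real system:

```python
# Self-contained sketch of the four-phase flow described above.

def derive_criteria(behavior_description, example_transcripts):
    # Phase 1: turn the behavior description and example transcripts
    # into explicit measurement criteria.
    return {"behavior": behavior_description, "rubric": "score presence and intensity"}

def generate_scenarios(criteria, n):
    # Phase 2: create tailored situations, each defining the simulated user,
    # system context, and interaction setting.
    return [{"id": i, "criteria": criteria} for i in range(n)]

def roll_out(scenario, target_model):
    # Phase 3: agents simulate the user and tool responses while probing the
    # target model; a placeholder transcript stands in for the real rollout.
    return {"scenario_id": scenario["id"], "model": target_model, "turns": []}

def judge_transcript(transcript, criteria):
    # Phase 4: score each transcript for the behavior and any extra attributes;
    # a higher-level review model would then aggregate across the suite.
    return {"scenario_id": transcript["scenario_id"], "behavior_present": False, "severity": 0.0}

def run_suite(behavior_description, example_transcripts, target_model, n_scenarios=10):
    criteria = derive_criteria(behavior_description, example_transcripts)
    scenarios = generate_scenarios(criteria, n_scenarios)
    transcripts = [roll_out(s, target_model) for s in scenarios]  # executed in parallel in practice
    return [judge_transcript(t, criteria) for t in transcripts]
```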
Rather than relying on a fixed set of prompts, Bloom generates new scenarios each time it runs while evaluating the same underlying behavior, with the option to use static, single-turn tests if required. This design allows for adaptability without sacrificing consistency, as reproducibility is maintained through a seed file that defines the evaluation parameters. Users can further tailor the system by selecting different models for each phase, adjusting interaction length and format, determining whether tools or simulated users are included, controlling scenario diversity, and adding secondary scoring criteria such as realism or difficulty of elicitation.
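Taken together, those knobs suggest a seed configuration of roughly the following shape. This is a hypothetical sketch only: every key and value below is invented for illustration and does not correspond to Bloom's actual seed-file format.

```python
# Hypothetical seed-style configuration mirroring the options described above.
seed = {
    "behavior": "self-preferential bias",   # the single target behavior
    "models": {                             # different models can back each phase
        "scenario_generator": "generator-model-placeholder",
        "target": "model-under-test-placeholder",
        "judge": "judge-model-placeholder",
    },
    "num_scenarios": 100,                   # how far scenario generation is scaled
    "max_turns": 8,                         # interaction length and format
    "use_tools": True,                      # whether simulated tools are included
    "simulated_user": True,                 # whether a simulated user is included
    "scenario_diversity": 0.7,              # how varied the generated scenarios are
    "static_single_turn": False,            # option to fall back to fixed single-turn tests
    "extra_judgments": ["realism", "difficulty_of_elicitation"],  # secondary scoring criteria
}
# Because the seed file pins these parameters, re-running the suite evaluates the
# same underlying behavior even though fresh scenarios are generated on each run.
```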
Bloom Demonstrates Strong Accuracy In Distinguishing AI Behavioral Patterns
To assess Bloom’s effectiveness, its developers examined two central questions. The first was whether the framework can consistently differentiate between models that display distinct behavioral patterns. To test this, Bloom was used to compare production versions of Claude with specially configured “model organisms” deliberately engineered to exhibit particular atypical behaviors, as described in prior research. Across ten such behaviors, Bloom correctly distinguished the modified models from the standard ones in nine instances. In the remaining case, involving self-promotional behavior, a follow-up human review found that the baseline model exhibited the behavior at a comparable frequency, which explains the overlap.
The second question focused on how closely Bloom’s automated judgments align with human assessments. Researchers manually annotated 40 transcripts spanning multiple behaviors and compared these labels with Bloom’s scores generated using 11 different judge models. Among them, Claude Opus 4.1 showed the highest alignment with human evaluations, achieving a Spearman correlation of 0.86, while Claude Sonnet 4.5 followed with a correlation of 0.75. Notably, Opus 4.1 demonstrated particularly strong agreement at the high and low ends of the scoring range, which is especially relevant when thresholds are used to determine whether a behavior is present. This analysis was conducted before the release of Claude Opus 4.5.
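The agreement check amounts to a rank correlation between the automated judge's scores and the human labels on the same transcripts. A minimal illustration of that computation with SciPy, using invented placeholder scores rather than Anthropic's data:

```python
from scipy.stats import spearmanr

# Invented placeholder scores standing in for human annotations and one judge
# model's scores on the same transcripts (the real comparison used 40 manually
# annotated transcripts and 11 judge models).
human_labels = [1, 3, 7, 9, 2, 8, 5, 6]
judge_scores = [2, 3, 8, 9, 1, 7, 4, 6]

rho, p_value = spearmanr(human_labels, judge_scores)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
# Anthropic reports correlations of 0.86 for Claude Opus 4.1 and 0.75 for
# Claude Sonnet 4.5 against its human annotations.
```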
Bloom was developed to be both accessible and flexible, with the goal of functioning as a dependable framework for generating evaluations across a wide range of research use cases. Early users have applied it to areas such as analyzing layered jailbreak risks, examining hardcoded behaviors, assessing model awareness of evaluation contexts, and producing traces related to sabotage scenarios. As AI models become more advanced and are deployed in more intricate settings, scalable methods for examining behavioral characteristics are increasingly necessary, and Bloom is intended to support this line of research.
About The Author
Alisa, a dedicated journalist at the MPost, specializes in cryptocurrency, zero-knowledge proofs, investments, and the expansive realm of Web3. With a keen eye for emerging trends and technologies, she delivers comprehensive coverage to inform and engage readers in the ever-evolving landscape of digital finance.