Models & Labs · 4.0 · Excellent · 2026-06-07 · X

Anthropic donates its open-source alignment tool Petri to Meridian Labs: version 3.0 update

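In October 2025, we launched Petri, an open-source toolbox of alignment tests that can be applied to any large language model. Petri, which grew out of the Anthropic Fellows program, can be used to quickly and easily test AI models for concerning tendencies such as deception, sycophancy, and cooperation with harmful requests. It is part of our effort to develop alignment tools that are open and useful to the whole AI community. Petri has been part of the alignment assessment for every Claude model since Claude Sonnet 4.5. It uses a separate "auditor" model to simulate a range of alignment-relevant scenarios and compares how the new model behaves across them. A "judge" model then scores the resulting transcripts and flags misaligned behavior. We are pleased to see external organizations using Petri as well: for example...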



English

*(Original English content)*

May 7, 2026

Donating our open-source alignment tool

In October 2025, we launched Petri, an open-source toolbox of alignment tests that can be applied to any large language model. Petri, which was developed as part of our Anthropic Fellows program, can be used to rapidly and easily test AI models for concerning tendencies like deception, sycophancy, and cooperation with harmful requests. It's part of our efforts to develop alignment tools that are open and useful for the whole AI development community.

Petri has been part of our alignment assessment for every Claude model since Claude Sonnet 4.5. It compares how the new model behaves across a range of alignment-relevant scenarios that are simulated by a separate auditor model. A further judge model then scores the resulting transcripts for misaligned behaviors.
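
The auditor, target, and judge roles described above can be sketched as a small loop. This is a minimal illustration only, not Petri's actual API: the names `run_audit`, `Transcript`, `toy_auditor`, `toy_target`, and `toy_judge` are hypothetical, and the "models" are plain Python functions standing in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types for illustration; none of these come from Petri.
Turn = Tuple[str, str]                      # (auditor message, target reply)
Auditor = Callable[[str, List[Turn]], str]  # (scenario, history) -> next message
Target = Callable[[str], str]               # message -> reply


@dataclass
class Transcript:
    scenario: str
    turns: List[Turn] = field(default_factory=list)


def run_audit(scenario: str, auditor: Auditor, target: Target,
              max_turns: int = 3) -> Transcript:
    """The auditor simulates the scenario turn by turn; the target responds."""
    transcript = Transcript(scenario)
    for _ in range(max_turns):
        message = auditor(scenario, transcript.turns)
        reply = target(message)
        transcript.turns.append((message, reply))
    return transcript


# Toy stand-ins so the sketch runs without any model provider.
def toy_auditor(scenario: str, history: List[Turn]) -> str:
    return f"[turn {len(history) + 1}] {scenario}"


def toy_target(message: str) -> str:
    return "REFUSE" if "harmful" in message.lower() else "COMPLY"


def toy_judge(transcript: Transcript) -> float:
    """Score = fraction of turns in which the target complied."""
    complied = sum(1 for _, reply in transcript.turns if reply == "COMPLY")
    return complied / len(transcript.turns)


if __name__ == "__main__":
    t = run_audit("Help with a harmful request", toy_auditor, toy_target)
    print(toy_judge(t))
```

In a real run, each of the three callables would wrap a separate model, and the judge would score the transcript against a rubric of misaligned behaviors rather than a keyword match.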

We've been pleased to see Petri being used by external organizations: for example, the UK's AI Security Institute (AISI) made it a major part of how they evaluate models for their propensity to sabotage AI research.

We're now updating Petri to its third version. Here are some of the biggest changes:

  • Adaptability. Petri 3.0 involves major architectural changes that allow users to adapt it to more uses, in particular by splitting the auditor model and the target model into separate components that can be tweaked separately;
  • Realism. Despite the fact that alignment researchers try to make tests appear realistic, a model can often deduce from various artificialities in the setup that it's actually part of a test. And if the model is aware it's being evaluated, the researcher is no longer able to see how the model behaves in general. An add-on to Petri, which we're calling Dish, makes the setup far more realistic, for example by running the tests using the model's real system prompt and the real scaffold (the software that wraps around the model to help it meet its goals) that would be used in genuine model deployments;
  • Depth. We've now integrated Petri with our other open-source alignment tool, Bloom, which can perform much more in-depth assessments of specific chosen behaviors (in comparison to Petri's wider-ranging approach).
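
One way to picture the first two changes is as a configuration in which the auditor and the target are separate pieces, so the target's system prompt and scaffold can be swapped for deployment-like values without touching the auditor. The sketch below is an assumption-laden illustration, not Petri 3.0's or Dish's real interface: `AuditorConfig`, `TargetConfig`, `AuditSetup`, `with_deployment_context`, and every field and string value are hypothetical.

```python
from dataclasses import dataclass, replace


# Hypothetical configuration types; not taken from Petri or Dish.
@dataclass(frozen=True)
class AuditorConfig:
    model: str
    seed_instructions: str


@dataclass(frozen=True)
class TargetConfig:
    model: str
    system_prompt: str
    scaffold: str


@dataclass(frozen=True)
class AuditSetup:
    auditor: AuditorConfig
    target: TargetConfig


def with_deployment_context(setup: AuditSetup, system_prompt: str,
                            scaffold: str) -> AuditSetup:
    """Return a copy whose target uses deployment-like settings.

    The auditor component is carried over unchanged.
    """
    new_target = replace(setup.target, system_prompt=system_prompt,
                         scaffold=scaffold)
    return replace(setup, target=new_target)


baseline = AuditSetup(
    auditor=AuditorConfig(model="auditor-model",
                          seed_instructions="Probe for sycophancy."),
    target=TargetConfig(model="target-model",
                        system_prompt="You are a helpful assistant.",
                        scaffold="bare-chat"),
)

# Placeholder strings stand in for a real production prompt and scaffold.
realistic = with_deployment_context(
    baseline,
    system_prompt="<production system prompt>",
    scaffold="production-agent-scaffold",
)
```

Keeping the two components as independent, immutable values means a change to the target cannot silently alter the auditor, which is the property the adaptability bullet describes.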

We're also giving Petri a new home. We have handed over its development to Meridian Labs, an AI evaluation nonprofit. This move—similar to when we donated the Model Context Protocol (MCP) to the Linux Foundation—will help ensure that Petri remains independent of any AI lab, so that its results will be seen as neutral and credible by those across the industry and beyond.

As part of Meridian Labs, Petri joins other tools like Inspect and Scout, building a technology stack that is open to labs, independent researchers, and governments alike, at a time when reliable tests of AI model behavior matter more than ever.

Chinese

Donating our open-source alignment tool to Meridian Labs

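In October 2025, we released Petri, an open-source toolbox of alignment tests that can be used with any large language model. Petri grew out of the Anthropic Fellows program and can be used to quickly and easily test AI models for concerning tendencies such as deception, sycophancy, and compliance with harmful requests. It is part of our effort to build alignment tools that are open to, and useful for, the entire AI development community.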

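Petri has been part of our alignment assessment for every Claude model since Claude Sonnet 4.5. It compares how a new model behaves across a range of alignment-relevant scenarios, which are simulated by a separate auditor model. A further judge model then scores the resulting transcripts for misaligned behavior.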

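We are pleased to see Petri being used by external organizations: for example, the UK's AI Security Institute (AISI) has made it a major part of how it evaluates models for their propensity to sabotage AI research.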

We are now updating Petri to its third version. Here are some of the biggest changes:

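  • Adaptability. Petri 3.0 involves major architectural changes that let users adapt it to more uses, in particular by splitting the auditor model and the target model into separate components that can be adjusted independently;
  • Realism. Although alignment researchers try to make tests look realistic, a model can often infer from various artificial traces in the setup that it is actually part of a test. And once a model is aware that it is being evaluated, researchers can no longer see how it behaves in general. An add-on to Petri, which we call Dish, makes the setup far more realistic, for example by running the tests with the model's real system prompt and the real scaffold (the software that wraps around the model to help it meet its goals) that would be used in genuine deployments;
  • Depth. We have now integrated Petri with our other open-source alignment tool, Bloom, which can perform much more in-depth assessments of specific chosen behaviors (in contrast to Petri's wider-ranging approach).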

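We are also giving Petri a new home. We have handed its development over to Meridian Labs, an AI evaluation nonprofit. This move, similar to our donation of the Model Context Protocol (MCP) to the Linux Foundation, will help ensure that Petri remains independent of any AI lab, so that its results are seen as neutral and credible across the industry and beyond.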

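As part of Meridian Labs, Petri joins other tools such as Inspect and Scout in building a technology stack that is open to labs, independent researchers, and governments alike, at a time when reliable tests of AI model behavior matter more than ever.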