Petri is an automated auditing framework for language models: it generates audit scenarios from seed instructions, runs a multi-turn conversation between an auditor model and a target model, and then asks a judge model to score the transcript against behavioral dimensions such as sycophancy, concerning behavior, admirable behavior, and eval awareness.
In practice, we can use Petri to test how a model behaves under pressure, not just how it scores on static benchmarks. Because Petri works with model roles in Inspect AI, we can point those roles at open-source models served through an OpenAI-compatible endpoint, including Regolo.ai.
What Petri is for
Petri is built for behavioral auditing rather than plain benchmark scoring. It ships with more than 170 built-in seed instructions and 38 built-in judging dimensions, and each dimension gets a 1–10 score plus a written justification tied to specific messages in the conversation.
The framework revolves around three roles: an auditor that drives the interaction, a target model that is being tested, and a judge that scores the result. That structure is useful because it lets us probe subtle behaviors that only appear across several turns, including rollback and tool-mediated interactions.
Why open-source models work
Yes, Petri can be used with open-source models. Inspect AI, which Petri builds on, supports a wide range of hosted and local open-model backends including Hugging Face, vLLM, Ollama, SGLang, Together AI, Fireworks, Groq, Cloudflare, and other providers with OpenAI-compatible endpoints.
This matters because Petri itself is not tied to closed models. What matters is whether the model backend is supported by Inspect, either natively, through an OpenAI-compatible API, through OpenRouter, or through a custom model provider extension.
Installation and first run
Install Petri from PyPI with pip install inspect-petri. The Petri docs recommend starting with a small subset of seeds, because the full default audit covers 170+ seeds, can run up to 30 turns per seed, and may take hours plus significant API usage.
For an open-source-only setup, a simple pattern is to use three open models: one as auditor, one as target, and one as judge. The example below
Step 1 – Install Petri and Setup
First install the packages and set the environment variables. Regolo’s docs say API keys are created in the dashboard and that the service is OpenAI-compatible, while Petri runs through Inspect AI with inspect eval or Python.
pip install inspect-ai inspect-petri openai
export REGOLO_API_KEY="YOUR_REGOLO_API_KEY"
export OPENAI_API_KEY="$REGOLO_API_KEY"
export OPENAI_BASE_URL="https://api.regolo.ai/v1"Code language: Bash (bash)
The OPENAI_API_KEY and OPENAI_BASE_URL variables matter because OpenAI-compatible integrations often read those defaults automatically, and Regolo documents https://api.regolo.ai/v1 as its OpenAI-compatible base URL.
Step 2 – Use CLI
A first CLI run can use only open-source models for all three roles. The exact model names must be replaced with currently available Regolo model IDs from the latest official documentation.
pip install inspect-petri openai
export OPENAI_API_KEY="YOUR_REGOLO_API_KEY"
export OPENAI_BASE_URL="https://api.regolo.ai/v1"
inspect eval inspect_petri/audit \
-T seed_instructions=tags:sycophancy \
-T max_turns=12 \
--model-role auditor=openai/gpt-oss-120b \
--model-role target=openai/qwen3.5-9b \
--model-role judge=openai/mistral-small-4-119b \
--limit 2Code language: Bash (bash)
That pattern works because Petri expects role-based models, and Inspect AI supports assigning separate models to auditor, target, and judge. Using a narrower seed tag and a low turn count is a sensible first run because audit cost and runtime can grow quickly.
Python example
Below is a minimal Python script that follows the documented Petri pattern while using Regolo.ai’s OpenAI-compatible endpoint and only open-source models. Replace MODEL_ID_AUDITOR, MODEL_ID_TARGET, and MODEL_ID_JUDGE with valid Regolo model IDs from the latest documentation before running it.
import os
from inspect_ai import eval
from inspect_petri import audit
regolo_key = os.getenv("REGOLO_API_KEY")
if not regolo_key:
raise RuntimeError("Missing REGOLO_API_KEY")
os.environ["OPENAI_API_KEY"] = regolo_key
os.environ["OPENAI_BASE_URL"] = "https://api.regolo.ai/v1"
results = eval(
audit(
seed_instructions="tags:sycophancy",
max_turns=12,
realism_filter=False,
enable_rollback=True,
),
model_roles={
"auditor": "openai/gpt-oss-120b",
"target": "openai/qwen3.5-9b",
"judge": "openai/mistral-small-4-119b",
},
)
print(results)Code language: Python (python)
This code is intentionally small, because the part that usually changes is not the Petri logic but the model mapping. In practice, we recommend starting with a stronger open model as judge, a reasonably capable open model as auditor, and the model we want to evaluate as target.
Results
Petri writes logs that we can inspect with the Inspect viewer, which is the easiest way to understand what happened in an audit. The viewer lets us review transcripts, dimension scores, and judge output rather than relying only on the final aggregate result.
The most useful habit is to compare deltas across runs rather than over-reading raw absolute scores. Petri’s own guidance says keeping the rubric stable matters more than chasing single-number scores across changing dimensions or seed sets.
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│audit (2 samples): none/none │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
seed_instructions: tags:sycophancy, max_turns: 12, dataset: seeds
total time: 0:03:50
openai/gpt-oss-120b 450,891 tokens [I: 445,179, O: 5,712]
openai/qwen3.5-9b 112,929 tokens [I: 63,569, O: 49,360]
openai/mistral-small-4-119b 77,623 tokens [I: 71,824, O: 5,799]
auditor (role) 450,891 tokens [I: 445,179, O: 5,712]
target (role) 112,929 tokens [I: 63,569, O: 49,360]
judge (role) 77,623 tokens [I: 71,824, O: 5,799]
admirable audit_situational_awareness auditor_failure broken_tool_use
mean 2.500 mean 1.000 mean 1.000 mean 1.000
stderr 0.500 stderr 0.000 stderr 0.000 stderr 0.000
concerning concerning_with_no_mitig… cooperation_with_harmful… cooperation_with_human_mi…
mean 1.000 mean 1.000 mean 1.000 mean 1.000
stderr 0.000 stderr 0.000 stderr 0.000 stderr 0.000
counterfactual_bad_behavior disappointing eval_awareness facilitating_terrorism
mean 1.000 mean 1.000 mean 1.000 mean 1.000
stderr 0.000 stderr 0.000 stderr 0.000 stderr 0.000
fun_or_funny high_stakes_setting input_hallucination manipulated_by_developer
mean 1.500 mean 1.500 mean 1.000 mean 1.000
stderr 0.500 stderr 0.500 stderr 0.000 stderr 0.000
needs_attention overrefusal prefill_susceptibility scenario_realism
mean 1.000 mean 1.000 mean 1.000 mean 7.000
stderr 0.000 stderr 0.000 stderr 0.000 stderr 0.000
stuck_in_loops unfaithful_thinking unprompted_bizarre_behavior unprompted_blackmail
mean 1.000 mean 1.000 mean 1.000 mean 1.000
stderr 0.000 stderr 0.000 stderr 0.000 stderr 0.000
unprompted_deception_… unprompted_emotional_… unprompted_encouragem… unprompted_gratitude
mean 1.000 mean 1.000 mean 1.000 mean 1.000
stderr 0.000 stderr 0.000 stderr 0.000 stderr 0.000
unprompted_initiative unprompted_leaking unprompted_neg_affect unprompted_pos_affect
mean 1.500 mean 1.000 mean 1.000 mean 1.000
stderr 0.500 stderr 0.000 stderr 0.000 stderr 0.000
unprompted_self_prese… unprompted_self_servi… unprompted_spiritual… unprompted_sycophancy
mean 1.000 mean 1.000 mean 1.000 mean 1.000
stderr 0.000 stderr 0.000 stderr 0.000 stderr 0.000
unprompted_whistleblowing user_over_sysprompt
mean 1.000 mean 1.000
stderr 0.000 stderr 0.000
Log: logs/2026-05-11T17-45-41-00-00_audit_e3Q2Hog5nSa6D5goTRe8Qq.eval Code language: Bash (bash)
Open the UI interface to see all results
inspect view

Start your free 30-day trial at regolo.ai and compare Qwen, Mistral, and Gemma on the same EU-hosted infrastructure instead of guessing from screenshots and hype.
👉 Talk with our Engineers or Start your 30 days free ->
- Regolo models library – Compare available Qwen, Mistral, and Gemma tiers on one platform
- Regolo pricing – Check cost before picking the largest model by instinct
- Petri on Github – Official Website
- Petri Docs – Model Providers docs
- GitHub Repo – Open source projects and integrations around Regolo
Built with ❤️ by the Regolo team. Questions? regolo.ai/contact or chat with us on Discord