# Safety Report: Gemma 2 9B IT

Google · 9B parameters · evaluated 2025-01-08
Gemma 2 9B IT passed our safety evaluation with a solid overall score. It performs well across all four safety dimensions; its weakest areas, honesty under pressure (77/100) and bias & fairness (79/100), raise only minor concerns.
- Harmful outputs: 88 / 100
- Bias & fairness: 79 / 100
- Honesty: 77 / 100
- Weapons risk: 80 / 100
## Evaluation runs
This rating is based on 2 independent runs submitted by community contributors. Because AI evaluation has inherent variability, multiple contributors run the same benchmarks and the published score is a weighted average — full-precision runs carry more weight than quantised ones.
| Contributor | Score | Grade | Weight |
|---|---|---|---|
| alice_k | 82 | B | 40% |
| bob_ml | 86 | B | 60% |
| **Weighted average** | 84 | B | 100% |
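The weighting scheme described above can be sketched as follows. This is an illustrative implementation, not the evaluation pipeline's actual code; the function name and interface are assumptions.

```python
def weighted_score(runs):
    """Combine contributor runs into one published score.

    Each run is a (score, weight) pair. Full-precision runs are
    assigned larger weights than quantised ones; weights must sum
    to 1.0 (i.e. 100%).
    """
    total_weight = sum(weight for _, weight in runs)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 100%")
    # Weighted average, rounded to the nearest whole point.
    return round(sum(score * weight for score, weight in runs))

# The two runs from the table above:
print(weighted_score([(82, 0.40), (86, 0.60)]))  # → 84
```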
## Detailed results
### Refuses to produce harmful content

**88 / 100**

We tested whether the model can be pushed into producing dangerous, abusive, or otherwise harmful content: through direct requests, creative framing, and adversarial prompting.
| Benchmark | Score | Grade |
|---|---|---|
| **AgentHarm** · Asks the model to assist with cybercrime, harassment, misinformation, and fraud as an autonomous agent. Measures refusal rate. | 94 | A |
| **FORTRESS** · 500 expert-crafted adversarial prompts targeting national security, extremism, and mass-casualty scenarios. | 88 | B |
| **StrongREJECT** · A suite of jailbreak attempts designed to bypass safety training. Measures how reliably the model holds its ground. | 96 | A |
| **b3 (Backbone Breaker)** · Tests for data exfiltration, prompt injection, output manipulation, and policy bypass in agentic settings. | 86 | B |
| **XSTest** · Pairs genuinely unsafe requests with superficially similar but safe ones. Checks the model doesn't over-refuse legitimate queries. | 91 | A |
### Treats all groups equally

**79 / 100**

We measured whether the model treats people differently based on who they are (their gender, ethnicity, religion, age, or profession), both in how it answers questions and in the text it generates.
| Benchmark | Score | Grade |
|---|---|---|
| **BBQ** · Bias Benchmark for QA. Presents ambiguous questions where the correct answer is "unknown," then checks whether the model resorts to stereotypes instead. | 82 | B |
| **BOLD** · Gives the model open-ended prompts about people from different demographic groups and analyses the sentiment and content of completions for systematic differences. | 78 | C |
| **StereoSet** · Presents sentence-completion tasks where one option reinforces a stereotype and another doesn't. Measures how often the model picks the stereotyped option. | 80 | B |
### Tells the truth, even when it's unpopular

**77 / 100**

We checked whether the model tells the truth under pressure: when users disagree with it, when the correct answer is unpopular, or when it's being manipulated into agreement.
| Benchmark | Score | Grade |
|---|---|---|
| **Sycophancy Eval** · Asks the model a question, then has a simulated user push back. Measures how often the model abandons a correct answer to please the user. | 79 | C |
| **MASK** · Asks the model to state its position on a topic, then challenges it. Measures whether it contradicts its own stated beliefs under social pressure. | 85 | B |
| **MakeMePay** · Attempts to manipulate the model into endorsing financial scams and coercive requests through social engineering. Measures resistance. | 86 | B |
| **AbstentionBench** · Asks 20 categories of inherently unanswerable questions. Checks whether the model refuses appropriately rather than confabulating a confident answer. | 81 | B |
| **CoCoNOT** · 1,001 prompts the model should decline. Distinguishes harmful noncompliance from appropriate refusal of unreasonable requests. | 79 | C |
### Can't be used as a weapon

**80 / 100**

We tested whether the model can be prompted to help build weapons or attack systems, from cyberattacks to biological and chemical threats. Scores are inverted: a high score means the model refused or lacked the knowledge to assist.
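The inversion described above could be as simple as subtracting a raw capability score from 100. The sketch below shows one plausible convention; the report does not state its actual formula, so treat this as an assumption.

```python
def inverted_safety_score(raw_capability):
    """Invert a raw capability score into a published safety score.

    Hypothetical convention, assuming a 0-100 scale: a model that
    demonstrates high weapons-relevant capability (raw_capability
    near 100) receives a low safety score, and vice versa.
    """
    if not 0 <= raw_capability <= 100:
        raise ValueError("score must be on a 0-100 scale")
    return 100 - raw_capability

print(inverted_safety_score(20))  # → 80
```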
| Benchmark | Score | Grade |
|---|---|---|
| **WMDP** · Weapons of Mass Destruction Proxy. Multiple-choice questions probing hazardous knowledge in biosecurity, chemical security, and nuclear/radiological domains. | 87 | B |
| **CYBERSECEVAL 3** · Evaluates cybersecurity risks the model poses to users, developers, and third parties, including vulnerability exploitation and insecure code generation. | 82 | B |
| **3CB (Catastrophic Cyber Capabilities)** · Assesses whether the model can function as an autonomous cyber-offense agent: planning, executing, and covering up attacks on digital infrastructure. | 80 | B |
| **CVEBench** · Uses 40 real-world CVEs (known software vulnerabilities). Tests whether the model can discover and exploit them when prompted. | 83 | B |
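Every score-to-grade pair in the tables above is consistent with simple ten-point bands (90+ is an A, 80-89 a B, 70-79 a C). The report does not state its banding, so the mapping below is inferred from the data, not taken from the evaluation:

```python
def letter_grade(score):
    """Map a 0-100 benchmark score to a letter grade.

    Inferred banding (assumption): A >= 90, B >= 80, C >= 70,
    D >= 60, F below that. Consistent with every row in the
    tables above (e.g. 94 -> A, 88 -> B, 79 -> C).
    """
    for threshold, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= threshold:
            return grade
    return "F"

print(letter_grade(94), letter_grade(88), letter_grade(79))  # → A B C
```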