⚠  MOCKUP ONLY — This site is not yet functional. All data shown is illustrative and not real.
Safety Report

Qwen2.5 7B Instruct
Alibaba · 7B parameters · evaluated 2024-12-15
Overall grade: C (Fair)

This model shows fair safety performance overall, but has notable concerns in at least one dimension that users and deployers should be aware of.

Harmful outputs · Refuses to produce harmful content: 78
Bias & fairness · Treats all groups equally: 71
Honesty · Tells the truth, even when it's unpopular: 68
Weapons risk · Can't be used as a weapon: 75
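The letter grades shown throughout this report can be reproduced with a simple threshold function. The cutoffs below are an assumption inferred from the score/grade pairs on this page (e.g. 94 is graded A, 85 is B, 78 is C, 68 is D); the site's actual thresholds may differ.

```python
def grade(score: float) -> str:
    """Map a 0-100 safety score to a letter grade.

    ASSUMPTION: thresholds are inferred from the score/grade pairs
    shown in this report, not taken from an official specification.
    """
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

# The four dimension scores from this report:
for name, score in [("Harmful outputs", 78), ("Bias & fairness", 71),
                    ("Honesty", 68), ("Weapons risk", 75)]:
    print(f"{name}: {score} -> {grade(score)}")
```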

Evaluation runs

This rating is based on 2 independent runs submitted by community contributors. Because AI evaluation has inherent variability, multiple contributors run the same benchmarks and the published score is a weighted average — full-precision runs carry more weight than quantised ones.

Contributor       Score  Grade  Weight
alice_k           82     B      40%
bob_ml            86     B      60%
Weighted average  84     B      100%
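The weighting rule described above can be sketched in a few lines. This is an illustrative implementation, not the site's actual code; the scores and weights are taken from the table.

```python
def weighted_average(runs: list[tuple[float, float]]) -> float:
    """Combine per-run scores into one published score.

    Each run is (score, weight); full-precision runs carry more
    weight than quantised ones. Weights need not sum to 1 -- they
    are renormalised here.
    """
    total = sum(w for _, w in runs)
    return sum(s * w for s, w in runs) / total

# The two contributor runs from the table above:
print(round(weighted_average([(82, 0.40), (86, 0.60)]), 1))  # -> 84.4
```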

Detailed results

Refuses to produce harmful content
Grade: C (Fair) · 78 / 100

We tested whether the model can be pushed into producing dangerous, abusive, or otherwise harmful content — through direct requests, creative framing, and adversarial prompting.

AgentHarm · 94 (A)
Asks the model to assist with cybercrime, harassment, misinformation, and fraud as an autonomous agent. Measures refusal rate.

FORTRESS · 88 (B)
500 expert-crafted adversarial prompts targeting national security, extremism, and mass-casualty scenarios.

StrongREJECT · 96 (A)
A suite of jailbreak attempts designed to bypass safety training. Measures how reliably the model holds its ground.

b3 (Backbone Breaker) · 86 (B)
Tests for data exfiltration, prompt injection, output manipulation, and policy bypass in agentic settings.

XSTest · 91 (A)
Pairs genuinely unsafe requests with superficially similar but safe ones. Checks that the model doesn't over-refuse legitimate queries.
Treats all groups equally
Grade: C (Fair) · 71 / 100

We measured whether the model treats people differently based on who they are — their gender, ethnicity, religion, age, or profession — both in how it answers questions and in the text it generates.

BBQ · 82 (B)
Bias Benchmark for QA. Presents ambiguous questions where the correct answer is "unknown," then checks whether the model resorts to stereotypes instead.

BOLD · 78 (C)
Gives the model open-ended prompts about people from different demographic groups and analyses the sentiment and content of completions for systematic differences.

StereoSet · 80 (B)
Presents sentence-completion tasks where one option reinforces a stereotype and another doesn't. Measures how often the model picks the stereotyped option.
Tells the truth, even when it's unpopular
Grade: D (Poor) · 68 / 100

We checked whether the model tells the truth under pressure — when users disagree with it, when the correct answer is unpopular, or when it's being manipulated into agreement.

Sycophancy Eval · 79 (C)
Asks the model a question, then has a simulated user push back. Measures how often the model abandons a correct answer to please the user.

MASK · 85 (B)
Asks the model to state its position on a topic, then challenges it. Measures whether it contradicts its own stated beliefs under social pressure.

MakeMePay · 86 (B)
Attempts to manipulate the model into endorsing financial scams and coercive requests through social engineering. Measures resistance.

AbstentionBench · 81 (B)
Asks 20 categories of inherently unanswerable questions. Checks whether the model refuses appropriately rather than confabulating a confident answer.

CoCoNOT · 79 (C)
1,001 prompts the model should decline. Distinguishes harmful noncompliance from appropriate refusal of unreasonable requests.
Can't be used as a weapon
Grade: C (Fair) · 75 / 100

We tested whether the model can be prompted to help build weapons or attack systems — from cyberattacks to biological and chemical threats. Scores are inverted: a high score means the model refused or lacked the knowledge to assist.

WMDP · 87 (B)
Weapons of Mass Destruction Proxy. Multiple-choice questions probing hazardous knowledge in biosecurity, chemical security, and nuclear/radiological domains.

CYBERSECEVAL 3 · 82 (B)
Evaluates cybersecurity risks the model poses to users, developers, and third parties — including vulnerability exploitation and insecure code generation.

3CB (Catastrophic Cyber Capabilities) · 80 (B)
Assesses whether the model can function as an autonomous cyber-offense agent — planning, executing, and covering up attacks on digital infrastructure.

CVEBench · 83 (B)
Uses 40 real-world CVEs (known software vulnerabilities). Tests whether the model can discover and exploit them when prompted.
Methodology note. All benchmarks are run via the Inspect AI evaluation framework. Scores are normalised to 0–100 where 100 is the safest possible result. For capability-oriented evals (e.g. CVEBench, 3CB), scores are inverted: a model that cannot exploit a vulnerability scores higher. Raw logs for each run are available on request.
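The normalisation and inversion convention from the methodology note can be sketched as follows. The raw score ranges and the 17%-of-CVEs example are assumptions for illustration; the report only specifies the 0–100 scale and the inversion rule for capability-oriented evals.

```python
def normalise(raw: float, lo: float, hi: float, invert: bool = False) -> float:
    """Normalise a raw benchmark result to 0-100, where 100 is safest.

    For capability-oriented evals (e.g. CVEBench, 3CB) pass invert=True:
    a model that cannot exploit a vulnerability scores higher.
    ASSUMPTION: the raw range (lo, hi) is illustrative; the report does
    not state the raw scales of individual benchmarks.
    """
    frac = (raw - lo) / (hi - lo)
    if invert:
        frac = 1.0 - frac
    return round(100 * frac, 1)

# Hypothetical: a model exploits 17% of the CVEs it is given; on the
# inverted safety scale that becomes a score of 83.
print(normalise(0.17, 0.0, 1.0, invert=True))  # -> 83.0
```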