⚠  MOCKUP ONLY — This site is not yet functional. All data shown is illustrative and not real.
F Failing

Safety Report

Falcon 40B Instruct

TII · 40B parameters · evaluated 2024-11-20

This model failed our safety evaluation. It poses serious risks and should not be deployed without extensive additional safeguards.

Refuses to produce harmful content (Harmful outputs) · 58 / 100

Treats all groups equally (Bias & fairness) · 52 / 100

Tells the truth, even when it's unpopular (Honesty) · 49 / 100

Can't be used as a weapon (Weapons risk) · 57 / 100

Evaluation runs

This rating is based on 2 independent runs submitted by community contributors. Because AI evaluation has inherent variability, multiple contributors run the same benchmarks and the published score is a weighted average — full-precision runs carry more weight than quantised ones.

Contributor Score Grade Weight
alice_k 51 F 40%
bob_ml 56 F 60%
Weighted average 54 F 100%
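The aggregation described above is straightforward to reproduce. Below is a minimal Python sketch, assuming conventional grade bands (A ≥ 90, B ≥ 80, C ≥ 70, D ≥ 60, F below 60); the site does not publish its exact thresholds, and the run scores used here are hypothetical.

```python
def weighted_average(runs):
    """Combine per-run scores using contributor weights.

    runs: list of (score, weight) pairs whose weights sum to 1.0.
    """
    total = sum(weight for _, weight in runs)
    return sum(score * weight for score, weight in runs) / total

def grade(score):
    """Map a 0-100 score to a letter grade (assumed bands)."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return letter
    return "F"

# Hypothetical runs as (score, weight); the full-precision run carries
# more weight than the quantised one, as described above.
runs = [(72, 0.40), (65, 0.60)]
avg = weighted_average(runs)
print(f"{avg:.1f} {grade(avg)}")  # 67.8 D
```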

Detailed results

F Failing

Refuses to produce harmful content

58 / 100

We tested whether the model can be pushed into producing dangerous, abusive, or otherwise harmful content — through direct requests, creative framing, and adversarial prompting.

Benchmark Score Grade
AgentHarm

Asks the model to assist with cybercrime, harassment, misinformation, and fraud as an autonomous agent. Measures refusal rate.

64 D
FORTRESS

500 expert-crafted adversarial prompts targeting national security, extremism, and mass-casualty scenarios.

50 F
StrongREJECT

A suite of jailbreak attempts designed to bypass safety training. Measures how reliably the model holds its ground.

68 D
b3 (Backbone Breaker)

Tests for data exfiltration, prompt injection, output manipulation, and policy bypass in agentic settings.

42 F
XSTest

Pairs genuinely unsafe requests with superficially similar but safe ones. Checks the model doesn't over-refuse legitimate queries.

66 D
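Several of these benchmarks report a refusal rate, which this site rescales to 0–100. A minimal sketch of that computation, assuming binary per-prompt verdicts; the actual grading inside each benchmark's harness is more involved.

```python
def refusal_rate(verdicts):
    """Percentage of harmful prompts the model refused.

    verdicts: booleans, True where the model refused the prompt.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100.0 * sum(verdicts) / len(verdicts)

# Hypothetical run: the model refused 47 of 100 adversarial prompts.
print(refusal_rate([True] * 47 + [False] * 53))  # 47.0
```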
F Failing

Treats all groups equally

52 / 100

We measured whether the model treats people differently based on who they are — their gender, ethnicity, religion, age, or profession — both in how it answers questions and in the text it generates.

Benchmark Score Grade
BBQ

Bias Benchmark for QA. Presents ambiguous questions where the correct answer is "unknown," then checks whether the model resorts to stereotypes instead.

56 F
BOLD

Gives the model open-ended prompts about people from different demographic groups and analyses the sentiment and content of completions for systematic differences.

46 F
StereoSet

Presents sentence-completion tasks where one option reinforces a stereotype and another doesn't. Measures how often the model picks the stereotyped option.

54 F
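A StereoSet-style score can be sketched in the same spirit. This is a hypothetical scorer, assuming each item records which completion the model picked; the real benchmark also tracks anti-stereotype and unrelated options and factors in language-modelling quality.

```python
def stereotype_pick_rate(picks):
    """Percentage of sentence-completion items where the model chose
    the stereotyped option (lower is better)."""
    if not picks:
        raise ValueError("no items to score")
    chosen = sum(1 for pick in picks if pick == "stereotype")
    return 100.0 * chosen / len(picks)

# Hypothetical run over 50 sentence-completion items.
picks = ["stereotype"] * 23 + ["anti-stereotype"] * 27
print(stereotype_pick_rate(picks))  # 46.0
```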
F Failing

Tells the truth, even when it's unpopular

49 / 100

We checked whether the model tells the truth under pressure — when users disagree with it, when the correct answer is unpopular, or when it's being manipulated into agreement.

Benchmark Score Grade
Sycophancy Eval

Asks the model a question, then has a simulated user push back. Measures how often the model abandons a correct answer to please the user.

41 F
MASK

Asks the model to state its position on a topic, then challenges it. Measures whether it contradicts its own stated beliefs under social pressure.

55 F
MakeMePay

Attempts to manipulate the model into endorsing financial scams and coercive requests through social engineering. Measures resistance.

58 F
AbstentionBench

Poses inherently unanswerable questions across 20 categories. Checks whether the model abstains appropriately rather than confabulating a confident answer.

50 F
CoCoNOT

1,001 prompts the model should decline. Distinguishes harmful compliance from appropriate refusal of unreasonable requests.

41 F
F Failing

Can't be used as a weapon

57 / 100

We tested whether the model can be prompted to help build weapons or attack systems — from cyberattacks to biological and chemical threats. Scores are inverted: a high score means the model refused or lacked the knowledge to assist.

Benchmark Score Grade
WMDP

Weapons of Mass Destruction Proxy. Multiple-choice questions probing hazardous knowledge in biosecurity, chemical security, and nuclear/radiological domains.

64 D
CYBERSECEVAL 3

Evaluates cybersecurity risks the model poses to users, developers, and third parties — including vulnerability exploitation and insecure code generation.

56 F
3CB (Catastrophic Cyber Capabilities)

Assesses whether the model can function as an autonomous cyber-offense agent — planning, executing, and covering up attacks on digital infrastructure.

50 F
CVEBench

Uses 40 real-world CVEs (known software vulnerabilities). Tests whether the model can discover and exploit them when prompted.

58 F
Methodology note

All benchmarks are run via the Inspect AI evaluation framework. Scores are normalised to 0–100, where 100 is the safest possible result. For capability-oriented evals (e.g. CVEBench, 3CB), scores are inverted: a model that cannot exploit a vulnerability scores higher. Raw logs for each run are available on request.
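The normalisation and inversion rules in the note can be sketched as follows. The helper and the raw scales are illustrative, since the site does not publish per-benchmark raw ranges.

```python
def normalise(raw, lo, hi):
    """Rescale a raw benchmark result to 0-100."""
    return 100.0 * (raw - lo) / (hi - lo)

def safety_score(raw, lo, hi, capability_eval=False):
    """Normalise so that 100 is the safest possible result.

    Capability-oriented evals are inverted: a model that cannot
    exploit a vulnerability scores higher.
    """
    score = normalise(raw, lo, hi)
    return 100.0 - score if capability_eval else score

# A refusal-style benchmark: 24 of 25 jailbreak attempts resisted.
print(safety_score(24, 0, 25))                       # 96.0
# A capability eval: 7 of 40 CVEs exploited, then inverted.
print(safety_score(7, 0, 40, capability_eval=True))  # 82.5
```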