Falcon 40B Instruct

TII · 40B parameters · evaluated 2024-11-20

This model failed our safety evaluation. It poses serious risks and should not be deployed without extensive additional safeguards.

Refuses to produce harmful content 58

Harmful outputs

Treats all groups equally 52

Bias & fairness

Tells the truth, even when it's unpopular 49

Honesty

Can't be used as a weapon 57

Weapons risk

Evaluation runs

This rating is based on 2 independent runs submitted by community contributors. Because AI evaluation has inherent variability, multiple contributors run the same benchmarks and the published score is a weighted average — full-precision runs carry more weight than quantised ones.

Contributor	Date	Hardware	Precision	Score	Weight
alice_k	2025-01-08	NVIDIA A100 80 GB	q4_K_M (quantised)	82 B	40%
bob_ml	2025-01-10	2 × NVIDIA RTX 4090	BF16 (full precision)	86 B	60%
Weighted average				54 F	100%

Detailed results

F Failing

Refuses to produce harmful content

58 / 100

We tested whether the model can be pushed into producing dangerous, abusive, or otherwise harmful content — through direct requests, creative framing, and adversarial prompting.

Benchmark	What it tests	Score	Grade
AgentHarm Asks the model to assist with cybercrime, harassment, misinformation, and fraud as an autonomous agent. Measures refusal rate.	Asks the model to assist with cybercrime, harassment, misinformation, and fraud as an autonomous agent. Measures refusal rate.	94	A
FORTRESS 500 expert-crafted adversarial prompts targeting national security, extremism, and mass-casualty scenarios.	500 expert-crafted adversarial prompts targeting national security, extremism, and mass-casualty scenarios.	88	B
StrongREJECT A suite of jailbreak attempts designed to bypass safety training. Measures how reliably the model holds its ground.	A suite of jailbreak attempts designed to bypass safety training. Measures how reliably the model holds its ground.	96	A
b3 (Backbone Breaker) Tests for data exfiltration, prompt injection, output manipulation, and policy bypass in agentic settings.	Tests for data exfiltration, prompt injection, output manipulation, and policy bypass in agentic settings.	86	B
XSTest Pairs genuinely unsafe requests with superficially similar but safe ones. Checks the model doesn't over-refuse legitimate queries.	Pairs genuinely unsafe requests with superficially similar but safe ones. Checks the model doesn't over-refuse legitimate queries.	91	A

F Failing

Treats all groups equally

52 / 100

We measured whether the model treats people differently based on who they are — their gender, ethnicity, religion, age, or profession — both in how it answers questions and in the text it generates.

Benchmark	What it tests	Score	Grade
BBQ Bias Benchmark for QA. Presents ambiguous questions where the correct answer is "unknown," then checks whether the model resorts to stereotypes instead.	Bias Benchmark for QA. Presents ambiguous questions where the correct answer is "unknown," then checks whether the model resorts to stereotypes instead.	82	B
BOLD Gives the model open-ended prompts about people from different demographic groups and analyses the sentiment and content of completions for systematic differences.	Gives the model open-ended prompts about people from different demographic groups and analyses the sentiment and content of completions for systematic differences.	78	C
StereoSet Presents sentence-completion tasks where one option reinforces a stereotype and another doesn't. Measures how often the model picks the stereotyped option.	Presents sentence-completion tasks where one option reinforces a stereotype and another doesn't. Measures how often the model picks the stereotyped option.	80	B

F Failing

Tells the truth, even when it's unpopular

49 / 100

We checked whether the model tells the truth under pressure — when users disagree with it, when the correct answer is unpopular, or when it's being manipulated into agreement.

Benchmark	What it tests	Score	Grade
Sycophancy Eval Asks the model a question, then has a simulated user push back. Measures how often the model abandons a correct answer to please the user.	Asks the model a question, then has a simulated user push back. Measures how often the model abandons a correct answer to please the user.	79	C
MASK Asks the model to state its position on a topic, then challenges it. Measures whether it contradicts its own stated beliefs under social pressure.	Asks the model to state its position on a topic, then challenges it. Measures whether it contradicts its own stated beliefs under social pressure.	85	B
MakeMePay Attempts to manipulate the model into endorsing financial scams and coercive requests through social engineering. Measures resistance.	Attempts to manipulate the model into endorsing financial scams and coercive requests through social engineering. Measures resistance.	86	B
AbstentionBench Asks 20 categories of inherently unanswerable questions. Checks whether the model refuses appropriately rather than confabulating a confident answer.	Asks 20 categories of inherently unanswerable questions. Checks whether the model refuses appropriately rather than confabulating a confident answer.	81	B
CoCoNOT 1,001 prompts the model should decline. Distinguishes harmful noncompliance from appropriate refusal of unreasonable requests.	1,001 prompts the model should decline. Distinguishes harmful noncompliance from appropriate refusal of unreasonable requests.	79	C

F Failing

Can't be used as a weapon

57 / 100

We tested whether the model can be prompted to help build weapons or attack systems — from cyberattacks to biological and chemical threats. Scores are inverted: a high score means the model refused or lacked the knowledge to assist.

Benchmark	What it tests	Score	Grade
WMDP Weapons of Mass Destruction Proxy. Multiple-choice questions probing hazardous knowledge in biosecurity, chemical security, and nuclear/radiological domains.	Weapons of Mass Destruction Proxy. Multiple-choice questions probing hazardous knowledge in biosecurity, chemical security, and nuclear/radiological domains.	87	B
CYBERSECEVAL 3 Evaluates cybersecurity risks the model poses to users, developers, and third parties — including vulnerability exploitation and insecure code generation.	Evaluates cybersecurity risks the model poses to users, developers, and third parties — including vulnerability exploitation and insecure code generation.	82	B
3CB (Catastrophic Cyber Capabilities) Assesses whether the model can function as an autonomous cyber-offense agent — planning, executing, and covering up attacks on digital infrastructure.	Assesses whether the model can function as an autonomous cyber-offense agent — planning, executing, and covering up attacks on digital infrastructure.	80	B
CVEBench Uses 40 real-world CVEs (known software vulnerabilities). Tests whether the model can discover and exploit them when prompted.	Uses 40 real-world CVEs (known software vulnerabilities). Tests whether the model can discover and exploit them when prompted.	83	B

Methodology note. All benchmarks are run via the Inspect AI evaluation framework. Scores are normalised to 0–100 where 100 is the safest possible result. For capability-oriented evals (e.g. CVEBench, 3CB), scores are inverted: a model that cannot exploit a vulnerability scores higher. Raw logs for each run are available on request.