Reproducing the experiments for the paper "Evaluating large language models for accuracy incentivises hallucinations"
This folder contains a self-contained notebook that runs the core SimpleQA-based experiments used in our paper "Evaluating large language models for accuracy incentivises hallucinations":
- `experiment.ipynb`: downloads the SimpleQA test set, queries four frontier LMs under different abstention instructions, grades outputs with the SimpleQA grader prompt, and generates the paper-style plot(s).
- `lm.py`: lightweight OpenRouter-backed LM wrapper with a shared on-disk cache for reproducible reruns.
This uses the full SimpleQA test set comprising 4,326 questions and computes bootstrapped p-values.
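The bootstrapped p-values can be computed along these lines (a minimal sketch, not the notebook's exact code; `bootstrap_p_value` is our name, and the inputs are per-question indicator arrays such as hallucination flags):

```python
import numpy as np

def bootstrap_p_value(a, b, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for the difference in means of two
    per-question indicator arrays (e.g. hallucination flags per condition)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = a.mean() - b.mean()
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample questions with replacement, independently per condition.
        diffs[i] = rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
    # Fraction of resampled differences on the far side of zero, two-sided.
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)
```

With the full 4,326-question test set, resampling at the question level like this captures sampling variability over questions.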
Experimental setup:
- Dataset: SimpleQA test set downloaded from OpenAI public blob storage into `~/data/`.
- Models queried (through the OpenRouter API): `google/gemini-3-pro-preview`, `openai/gpt-5.2`, `x-ai/grok-4`, `anthropic/claude-opus-4.5`.
- Grading: uses `openai/gpt-4.1` with the SimpleQA grader template (copied into the notebook).
- Open Rubric: we evaluate the effect of stating the scoring system explicitly in the prompt.
- Consistency Mitigation: uses a “two samples + consistency check” procedure; inconsistent pairs abstain as "I don't know". (This is `k=2` in the notebook, with `k=1` being the baseline of querying the model once.)
Thus each model is evaluated:
- With a penalty $L \in \{0, 1, 3, 9\}$ for errors.
- Open Rubric (where the scoring system is stated explicitly) and Closed Rubric (meaning just the question).
- Baseline and Consistency Mitigation.
Of course, the closed rubric need only be evaluated once, since its answers do not depend on the penalty (but it is rescored at each penalty).
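Rescoring under a penalty $L$ amounts to the scoring rule implied above: +1 per correct answer, $-L$ per error, 0 per abstention (a sketch; the function name is ours):

```python
def penalized_score(n_correct, n_wrong, n_abstain, L):
    """Mean per-question score: +1 per correct answer, -L per error,
    0 per abstention."""
    n = n_correct + n_wrong + n_abstain
    return (n_correct - L * n_wrong) / n
```

For example, 60 correct, 30 wrong, and 10 abstentions out of 100 scores 0.6 at $L=0$ but $-0.3$ at $L=3$, which is why a higher penalty should push a well-calibrated model toward abstaining.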
- Python: the notebook metadata targets Python 3.12.9.
- Install dependencies (minimal set used by `experiment.ipynb`/`lm.py`):

```shell
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```

- Set the environment variable used by `lm.py`:

```shell
export OPENROUTER_API_KEY="..."
```

- Open and run the notebook top-to-bottom:

```shell
jupyter lab hallucinations/nature/experiment.ipynb
```

The notebook will:
- Create `~/data/` if missing
- Download `simple_qa_test_set.csv` from `https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv` into `~/data/simple_qa_test_set.csv` if missing
- Run a large batch of LM calls (unless you reduce `NUM_SAMPLES`)
- Plot baseline vs. mitigation hallucination/abstention/accuracy rates
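The download step amounts to something like the following (a sketch using the URL and destination path given above, not the notebook's exact code):

```python
from pathlib import Path
from urllib.request import urlretrieve

URL = ("https://openaipublic.blob.core.windows.net/"
       "simple-evals/simple_qa_test_set.csv")
DEST = Path.home() / "data" / "simple_qa_test_set.csv"

def ensure_dataset(url=URL, dest=DEST):
    """Create the data directory and fetch the SimpleQA CSV only if
    it is not already present, so reruns are free."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(url, dest)
    return dest
```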
All model calls are cached on disk via `diskcache` in `<repo_root>/.cache/` (the repo root is discovered by walking up to a `.git` directory).
This makes reruns cheap for any analysis that does not change the prompts.
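The repo-root discovery can be sketched as below (`lm.py`'s actual code may differ; the function names are ours):

```python
from pathlib import Path

def find_repo_root(start=None):
    """Walk up from `start` (default: cwd) until a directory
    containing .git is found."""
    path = Path(start or Path.cwd()).resolve()
    for candidate in [path, *path.parents]:
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(f"no .git directory found above {path}")

def cache_dir(start=None):
    # A diskcache.Cache opened on this path is shared by every run
    # inside the repo, keyed on the request parameters.
    return find_repo_root(start) / ".cache"
```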
For a fast smoke test, set `NUM_SAMPLES = 10` (or similar) near the top of the notebook, and optionally reduce `MAX_PARALLEL`.
The costs are significant because frontier models are used. In our four-model experiment on SimpleQA, the costs were:
| Model | Cost |
|---|---|
| gemini-3-pro-preview | $826.18 |
| GPT-5.2 | $690.45 |
| grok-4 | $949.52 |
| opus-4.5 | $154.40 |
| GPT-4.1 (grading) | $158.01 |
| Total | $2,778.56 |
Each model is run twice on each question for each abstention threshold, and consistency checks are also performed.
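As a rough back-of-the-envelope on call volume (assuming, per the setup above, the open rubric is sampled at each of the four penalties, the closed rubric once, and two samples per condition; grading calls come on top):

```python
questions = 4_326            # full SimpleQA test set
conditions = 4 + 1           # four open-rubric penalties + one closed-rubric run
samples_per_condition = 2    # k=2 consistency mitigation
calls_per_model = questions * conditions * samples_per_condition
print(calls_per_model)       # tens of thousands of answer calls per model
```

This order of magnitude is why the frontier-model costs in the table above run into the hundreds of dollars per model.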