Reproducing the experiments for the paper "Evaluating large language models for accuracy incentivises hallucinations"
This folder contains a self-contained notebook that runs the core SimpleQA-based experiments used in our paper "Evaluating large language models for accuracy incentivises hallucinations":
- `experiment.ipynb`: downloads the SimpleQA test set, queries four frontier LMs under different abstention instructions, grades outputs with the SimpleQA grader prompt, and generates the paper-style plot(s).
- `lm.py`: lightweight OpenRouter-backed LM wrapper with a shared on-disk cache for reproducible reruns.
This uses the full SimpleQA test set comprising 4,326 questions and computes bootstrapped p-values.
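The bootstrapped p-values can be computed along these lines (a minimal sketch, not the notebook's exact code; `bootstrap_p_value` is our name, and the inputs are per-question indicator arrays such as hallucination flags):

```python
import numpy as np

def bootstrap_p_value(a, b, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for the difference in means of two
    per-question indicator arrays (e.g. hallucination flags per condition)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = a.mean() - b.mean()
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample questions with replacement, independently per condition.
        diffs[i] = rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
    # Fraction of resampled differences on the far side of zero, two-sided.
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)
```

With the full 4,326-question test set, resampling at the question level like this captures sampling variability over questions.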
Experimental setup:
- Dataset: SimpleQA test set downloaded from OpenAI public blob storage into `~/data/`.
- Models queried (through the OpenRouter API): `google/gemini-3-pro-preview`, `openai/gpt-5.2`, `x-ai/grok-4`, `anthropic/claude-opus-4.5`.
- Grading: uses `openai/gpt-4.1` with the SimpleQA grader template (copied into the notebook).
- Open Rubric: we evaluate the effect of stating the scoring system explicitly in the prompt.
- Consistency Mitigation: uses a “two samples + consistency check” procedure; inconsistent pairs abstain as "I don't know". (This is `k=2` in the notebook, with `k=1` being the baseline of querying the model once.)
Thus each model is evaluated:
- With a penalty $L \in \{0, 1, 3, 9\}$ for errors.
- Open Rubric (where the scoring system is stated explicitly) and Closed Rubric (meaning just the question).
- Baseline and Consistency Mitigation.
Of course, the closed rubric need only be evaluated once, since its answers do not depend on the penalty (but it is rescored at each penalty).
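Rescoring under a penalty $L$ amounts to the scoring rule implied above: +1 per correct answer, $-L$ per error, 0 per abstention (a sketch; the function name is ours):

```python
def penalized_score(n_correct, n_wrong, n_abstain, L):
    """Mean per-question score: +1 per correct answer, -L per error,
    0 per abstention."""
    n = n_correct + n_wrong + n_abstain
    return (n_correct - L * n_wrong) / n
```

For example, 60 correct, 30 wrong, and 10 abstentions out of 100 scores 0.6 at $L=0$ but $-0.3$ at $L=3$, which is why a higher penalty should push a well-calibrated model toward abstaining.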
- Python: the notebook metadata targets Python 3.12.9.
- Install dependencies (minimal set used by `experiment.ipynb`/`lm.py`):

```shell
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```

- Set the environment variable used by `lm.py`:

```shell
export OPENROUTER_API_KEY="..."
```

- Open and run the notebook top-to-bottom:

```shell
jupyter lab hallucinations/nature/experiment.ipynb
```

The notebook will:
- Create `~/data/` if missing
- Download `simple_qa_test_set.csv` from `https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv` into `~/data/simple_qa_test_set.csv` if missing
- Run a large batch of LM calls (unless you reduce `NUM_SAMPLES`)
- Plot baseline vs. mitigation hallucination/abstention/accuracy rates
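The download step amounts to something like the following (a sketch using the URL and destination path given above, not the notebook's exact code):

```python
from pathlib import Path
from urllib.request import urlretrieve

URL = ("https://openaipublic.blob.core.windows.net/"
       "simple-evals/simple_qa_test_set.csv")
DEST = Path.home() / "data" / "simple_qa_test_set.csv"

def ensure_dataset(url=URL, dest=DEST):
    """Create the data directory and fetch the SimpleQA CSV only if
    it is not already present, so reruns are free."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(url, dest)
    return dest
```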
All model calls are cached on disk via `diskcache` in `<repo_root>/.cache/` (the repo root is discovered by walking up to a `.git` directory).
This makes reruns cheap for any analysis that does not change the prompts.
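The repo-root discovery can be sketched as below (`lm.py`'s actual code may differ; the function names are ours):

```python
from pathlib import Path

def find_repo_root(start=None):
    """Walk up from `start` (default: cwd) until a directory
    containing .git is found."""
    path = Path(start or Path.cwd()).resolve()
    for candidate in [path, *path.parents]:
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(f"no .git directory found above {path}")

def cache_dir(start=None):
    # A diskcache.Cache opened on this path is shared by every run
    # inside the repo, keyed on the request parameters.
    return find_repo_root(start) / ".cache"
```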
For a fast smoke test, set `NUM_SAMPLES = 10` (or similar) near the top of the notebook, and optionally reduce `MAX_PARALLEL`.
The costs are significant because frontier models are used. In our four-model experiment on SimpleQA, the costs were:
| Model | Cost |
|---|---|
| gemini-3-pro-preview | $826.18 |
| GPT-5.2 | $690.45 |
| grok-4 | $949.52 |
| opus-4.5 | $154.40 |
| GPT-4.1 (grading) | $158.01 |
| Total | $2,778.56 |
Each model is run twice on each question for each abstention threshold, and consistency checks are also performed.
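As a rough back-of-the-envelope on call volume (assuming, per the setup above, the open rubric is sampled at each of the four penalties, the closed rubric once, and two samples per condition; grading calls come on top):

```python
questions = 4_326            # full SimpleQA test set
conditions = 4 + 1           # four open-rubric penalties + one closed-rubric run
samples_per_condition = 2    # k=2 consistency mitigation
calls_per_model = questions * conditions * samples_per_condition
print(calls_per_model)       # tens of thousands of answer calls per model
```

This order of magnitude is why the frontier-model costs in the table above run into the hundreds of dollars per model.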