A public-domain dataset of prompts and scenarios for evaluating compliance with the OpenAI Model Spec as of 2025-12-18.
This dataset is intended to be run with the Model Spec Eval harness.
The dataset currently contains 596 prompts. However, 9 of them cannot be run through the public OpenAI API and will be skipped by the above harness: those examples involve system messages, while the API effectively supports only developer messages (in the Inspect chat message format they are "system" messages, but they are sent as "developer" messages to the OpenAI API for newer models like the o-series and GPT-5.X).
Each prompt contains some metadata:
- `target` is the rubric mentioned in Introducing Model Spec Evals, which tells the grader the crux of the prompt and what constitutes compliance.
- `focus_id` corresponds to the focus area found in the `model_spec.md` file, in the form `[^xxxx]`. This is the focus directly tested by the prompt.
- `section_id` corresponds to the id of the immediate section in which the `focus_id` is found.
- `sections` corresponds to the chain of sections containing `section_id`.
- `skip` (if present) indicates whether the prompt should be skipped for technical reasons, as mentioned above.
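As a minimal sketch of how the metadata above can be consumed, the snippet below parses hypothetical records with these fields and drops any marked `skip`, mirroring what the harness does for the 9 system-message examples. The JSONL format and the field values shown are assumptions for illustration; only the field names come from this README, and the actual dataset layout may differ.

```python
import json

# Hypothetical sample records using the metadata fields described above.
# The JSONL layout and all values here are illustrative assumptions.
sample_jsonl = """\
{"prompt": "Example A", "target": "Complies if ...", "focus_id": "[^ab12]", "section_id": "some_section", "sections": ["overview", "some_section"]}
{"prompt": "Example B", "target": "Complies if ...", "focus_id": "[^cd34]", "section_id": "other_section", "sections": ["overview", "other_section"], "skip": true}
"""

def load_runnable(lines):
    """Parse JSONL records and drop any marked with `skip`,
    as the harness does for examples that need system messages."""
    records = [json.loads(line) for line in lines if line.strip()]
    return [r for r in records if not r.get("skip", False)]

runnable = load_runnable(sample_jsonl.splitlines())
print(len(runnable))  # only the non-skipped example remains
```

The `skip` flag is read with `r.get("skip", False)` so records without the key (the common case, since the field is only present when needed) are kept.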