The CJ RAVE (Research Agenda Voting Exercise) dataset was collected at a special meeting of the Comparative Judgement Research Consortium held in Loughborough in May 2026. Twenty-two researchers in the comparative judgement community judged pairs of research questions, indicating which question they considered higher priority for advancing the field.
The study comprised 114 research questions spanning topics such as judge reliability, model diagnostics, pair allocation strategies, AI integration, and psychometric foundations. Participants made 1198 pairwise comparisons in total. Rankings were estimated using a Bayesian Bradley-Terry model.
Each row represents one pairwise comparison made by one participant. Participant identifiers have been anonymised (P01-P22).
| Column | Description |
|---|---|
| participant_id | Anonymous participant identifier (P01-P22) |
| choice_id1 | Numeric ID of the first item presented |
| choice_id2 | Numeric ID of the second item presented |
| winner_id | Numeric ID of the item selected as higher priority |
| choice1_text | Full text of item 1 |
| choice2_text | Full text of item 2 |
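As a rough illustration of the schema above, the following Python sketch reads comparisons in this layout and checks that each `winner_id` matches one of the two presented items. The sample rows, the `load_comparisons` helper, and the assumption that the IDs parse as integers are all hypothetical; the sample is not taken from the real data.

```python
import csv
import io

# Column names as documented in the schema table.
EXPECTED_COLUMNS = [
    "participant_id", "choice_id1", "choice_id2",
    "winner_id", "choice1_text", "choice2_text",
]

# Hypothetical two-row sample in the documented layout (NOT real data).
SAMPLE_CSV = """participant_id,choice_id1,choice_id2,winner_id,choice1_text,choice2_text
P01,3,17,17,Example question three?,Example question seventeen?
P02,17,42,17,Example question seventeen?,Example question forty-two?
"""


def load_comparisons(handle):
    """Read comparison rows and check each winner is one of the two items shown."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        # Assumption: the numeric IDs are stored as integers.
        first = int(row["choice_id1"])
        second = int(row["choice_id2"])
        winner = int(row["winner_id"])
        if winner not in (first, second):
            raise ValueError(f"line {line_no}: winner_id {winner} not in pair")
        rows.append({
            "participant_id": row["participant_id"],
            "choice_id1": first,
            "choice_id2": second,
            "winner_id": winner,
        })
    return rows


comparisons = load_comparisons(io.StringIO(SAMPLE_CSV))
```

To use this on a downloaded file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) in place of the in-memory sample.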
Download CJRC2026_RAVE_llm.csv
In addition to human judgements, all 6441 unique pairs (114 × 113 / 2) were submitted to GPT-5.1 via the Azure API Management gateway. Of these, 6119 received a parseable response; the remaining 322 were marked invalid and excluded from this file. The model was run at temperature 0.0 with the following system prompt:
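The pair counts quoted here can be checked with a few lines of Python. This is only an arithmetic sketch: the item IDs 1 to 114 used to enumerate the pairs are placeholders, since the actual numbering of the items is not stated.

```python
from itertools import combinations
from math import comb

N_ITEMS = 114        # research questions in the study
N_VALID = 6119       # pairs with a parseable model response
N_INVALID = 322      # pairs marked invalid and excluded

# Number of unordered pairs: 114 * 113 / 2.
n_pairs = comb(N_ITEMS, 2)

# Enumerate every unordered pair once; IDs 1..114 are placeholder values.
all_pairs = list(combinations(range(1, N_ITEMS + 1), 2))

# Share of pairs that produced a usable response.
valid_share = N_VALID / n_pairs

print(n_pairs, len(all_pairs), N_VALID + N_INVALID)
```

All three printed values should agree at 6441, confirming that the valid and invalid responses together account for every unique pair.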
You are an expert in comparative judgement research methods. Which of these two research questions would most advance the theory or application of comparative judgement. Only answer with either A or B
Each pair was presented as a user message in the form:
A: <question text>
B: <question text>
The model response (“A” or “B”) was parsed to identify the winning item. All pairs were completed in a single run. The participant_id column contains the model identifier (gpt-5.1). The column schema is identical to the human judgement file.
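The prompt format and response parsing described above might be sketched as follows. The exact parsing rules used in the study are not stated, so the whitespace stripping, case normalisation, and strict "A"/"B" matching here are assumptions, and both function names are hypothetical.

```python
def build_user_message(question_a, question_b):
    """Format one pair as the two-line user message shown above."""
    return f"A: {question_a}\nB: {question_b}"


def parse_winner(response, id_a, id_b):
    """Map a model response to the winning item ID.

    Returns None for anything other than a bare "A" or "B"; such
    responses would be treated as invalid. This strict rule is an
    assumption, not the documented behaviour.
    """
    answer = response.strip().upper()
    if answer == "A":
        return id_a
    if answer == "B":
        return id_b
    return None


message = build_user_message("First question?", "Second question?")
winner = parse_winner("B", id_a=5, id_b=9)        # item 9 wins
invalid = parse_winner("Both are important", 5, 9)  # None, i.e. invalid
```

Returning `None` rather than raising keeps invalid responses easy to count and filter, in line with the invalid pairs being excluded from the file.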
Download fit_bt_split_halves.R
This R script fits a Bayesian Bradley-Terry model to the RAVE comparison data using the speedyBBT package and performs a split-halves reliability analysis (200 random splits of judges). It produces:
The 10 highest-priority research questions as estimated by the Bayesian Bradley-Terry model from the human data, ranked by posterior median score:
| Rank | Score | Question |
|---|---|---|
| 1 | 2.067 | What’s the best way of generalising from data on judgements made by a sample of judges to judgements made by a population of judges? |
| 2 | 2.039 | What are the best strategies for structuring a data collection that exposes individual raters to repeated pairs and/or sets of pairs that can reveal intransitivity and instability, in order to assess the reliability of individual raters and distinguish variation within rater from variation across raters? |
| 3 | 1.663 | How do we tell the difference between judges who are providing no information versus judges who have a different opinion to the rest of the group? |
| 4 | 1.607 | How does cognitive load affect pairwise comparisons? Is there a threshold in the amount of information being processed beyond which comparison becomes harder than absolute judgement? If so, how can it be approximated? |
| 5 | 1.501 | What are best approaches to developing CJs so that they yield reliable and valid scores when assessing individual differences? How to create optimal CJ designs that minimise the assessment length and complexity of CJ blocks but maximise reliability and construct validity? How to incorporate response time and other process data in CJ designs? How to balance desirability matching and latent attribute recovery? |
| 6 | 1.271 | Reliability = consensus? Researchers sometimes use reliability measures (SSR/SHR) as an indicator of consensus between the judges about some construct. Is this a valid inference? |
| 7 | 1.191 | How should we evaluate goodness-of-fit for comparative judgement models, and what diagnostics are most informative for people using them? |
| 8 | 1.187 | What potential wider applications of CJ should be a focus for research (e.g. peer review of funding applications, driver training, sentencing decisions, recruitment, resource prioritisation)? |
| 9 | 1.114 | Humans are often better at pairwise comparison than absolute scoring — is this still true when AI is the assessor? |
| 10 | 1.073 | How do we best control for particular aspects of representations (e.g. length)? Can we factor these out of the CJ scores? |
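One way to read the scores in the table above is through the standard Bradley-Terry preference probability. The sketch below assumes the reported scores are on the log-odds scale, which is the usual Bradley-Terry parameterisation but is not confirmed here for this analysis. Plugging in posterior medians also ignores posterior uncertainty, so the result is only a rough guide.

```python
import math

# Posterior median scores for ranks 1 and 10, copied from the table above.
SCORE_RANK_1 = 2.067
SCORE_RANK_10 = 1.073


def preference_probability(score_i, score_j):
    """Bradley-Terry probability that item i is preferred to item j.

    Assumes both scores are on the log-odds scale, so the probability
    is the logistic function of the score difference.
    """
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


p_top_vs_tenth = preference_probability(SCORE_RANK_1, SCORE_RANK_10)
print(f"{p_top_vs_tenth:.3f}")  # prints 0.730
```

Under these assumptions, a judge would be expected to prefer the top-ranked question over the tenth-ranked one in roughly 73% of comparisons.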