🌱 Growing Pains: Extensible and Efficient LLM Benchmarking
via Fixed Parameter Calibration

Eliya Habba1, Itay Itzhak1,4, Asaf Yehudai2, Yotam Perlitz2, Elron Bandel2,
Michal Shmueli-Scheuer2*, Leshem Choshen2,3*, Gabriel Stanovsky1,5*
1The Hebrew University of Jerusalem  ·  2IBM Research  ·  3MIT  ·  4Technion  ·  5Allen Institute for AI
*Equal contribution
Pipeline overview: calibration, sequential integration, and efficient evaluation
Figure 1. Fixed parameter calibration enables extensible evaluation as new benchmarks are added over time. (1) Base datasets (e.g., MMLU, GSM8K) are calibrated jointly on reference models to define initial anchor item parameters. (2) At each subsequent step t, a new dataset is integrated by estimating its item parameters while holding all previously calibrated anchor parameters fixed (locked icons). (3) Once calibrated, the accumulated anchors serve as a compact proxy for the full suite, enabling performance prediction from anchor responses alone.

TL;DR

The problem

New datasets and new models keep coming, and evaluating every model on every dataset is prohibitively expensive. As benchmark suites grow, models end up tested on different subsets of datasets — so scores are not directly comparable, and rankings shift depending on what you test on.

What we do

We treat evaluation as a growing sequence of benchmarks: each time a new dataset arrives, we calibrate it into the existing suite by fixing the parameters of previously calibrated anchor items. This lets the suite expand without re-running any previously evaluated model, while keeping scores directly comparable across time. With just 100 anchor questions per dataset, we recover full-suite performance within 2–3% and preserve rankings (Spearman ρ ≥ 0.9) — at constant per-step cost.

  • 2–3%: mean absolute error with only 100 anchor questions per dataset
  • ≥ 0.9: Spearman ρ for ranking preservation versus full evaluation
  • 400+: models across the Open LLM Leaderboard and MMLU suites
  • Constant: per-step evaluation cost as new datasets are added

Our Contributions

1. Formulation: we formulate LLM evaluation under evolving dataset coverage as a scale-linking problem, where datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation.

2. Framework: we introduce a multidimensional IRT framework with fixed anchor sets and sequential fixed-parameter calibration, so that results collected at different times can be compared directly.

3. Empirical evaluation: we show on suites derived from the Open LLM Leaderboard and MMLU that this approach provides a cost-effective approximation to full evaluation while largely preserving model rankings.

Abstract

The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies.

To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks into the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation; a fixed anchor set per dataset ensures that results from different evaluation periods remain directly comparable.

In large-scale experiments on more than 400 models, our framework predicts full-evaluation performance within 2–3 percentage points using only 100 anchor questions per dataset, with Spearman ρ ≥ 0.9 for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset.

Background

Item Response Theory (IRT) models each response as a function of latent model ability and item characteristics. We use the Multidimensional 2-Parameter Logistic variant (MIRT 2PL): a model's ability is a vector $\theta$ over latent dimensions, and the probability of a correct response for item $i$ is

$$P(y_i = 1 \mid \theta) = \frac{\exp(\mathbf{a}_i^\top \theta + d_i)}{1 + \exp(\mathbf{a}_i^\top \theta + d_i)}$$

where $\mathbf{a}_i$ is the item's discrimination vector and $d_i$ its intercept (related to difficulty).
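As a concrete check of the formula, here is a minimal sketch of the 2PL response probability. All parameter values are illustrative, and `p_correct` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def p_correct(theta, a, d):
    """MIRT 2PL: probability that a model with ability vector `theta`
    answers an item correctly, given the item's discrimination vector
    `a` and intercept `d`. Equivalent to a logistic of the linear score."""
    z = float(np.dot(a, theta) + d)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative two-dimensional ability space (values are made up).
theta = np.array([0.5, -0.2])   # model ability
a = np.array([1.2, 0.8])        # item discrimination vector
d = 0.3                         # item intercept
print(round(p_correct(theta, a, d), 3))  # → 0.677
```

Higher ability along a dimension the item discriminates on raises the probability, while the intercept shifts how easy the item is overall.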

Existing IRT-based evaluation methods use concurrent calibration — jointly re-estimating all parameters every time the benchmark changes. This breaks down as benchmarks evolve: cost scales linearly with model count and benchmark size, and re-estimation shifts the calibration so that historical $\theta$ values stop being comparable.

Psychometrics offers a fix: fixed parameter calibration (FPC). When new items join an existing suite, anchor item parameters are held constant rather than re-estimated. This keeps model abilities $\theta$ comparable across calibration steps. FPC is standard in psychometrics but has not been applied to LLM evaluation.
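One way to see why freezing works: once reference-model abilities are pinned to the frozen anchor scale, fitting a new item's $(\mathbf{a}_i, d_i)$ under the 2PL likelihood reduces to a logistic regression of that item's 0/1 responses on $\theta$. A minimal sketch with simulated data (`calibrate_new_item` is a hypothetical helper, not the paper's code, and all numbers are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate_new_item(thetas, responses):
    """Estimate (a_i, d_i) for one new item while abilities stay fixed.

    thetas: [n_models, dim] abilities already placed on the frozen
    anchor scale; responses: 0/1 correctness of each model on the item.
    With abilities treated as known, the 2PL likelihood for this item
    is exactly a logistic regression."""
    # near-zero regularization so the fit approximates a plain MLE
    clf = LogisticRegression(C=1e6).fit(thetas, responses)
    return clf.coef_[0], float(clf.intercept_[0])

# Simulated example: 300 reference models, 2 latent dimensions.
rng = np.random.default_rng(0)
thetas = rng.normal(size=(300, 2))
true_a, true_d = np.array([1.5, 0.7]), -0.4
p = 1.0 / (1.0 + np.exp(-(thetas @ true_a + true_d)))
responses = (rng.random(300) < p).astype(int)
a_hat, d_hat = calibrate_new_item(thetas, responses)
```

With enough reference models, the recovered discrimination vector points in the direction of the true one, without touching any previously calibrated parameter.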

Method

At each time step $t$, we find a set of representative anchors for the newly added dataset and keep them fixed thereafter.

1. (t = 0) Train initial IRT model: fit MIRT 2PL on the first dataset $B_0$ using responses from reference models $M_{\mathrm{ref}}$.

2. Select anchors: cluster IRT item representations and pick the most representative item per cluster.

3. (t > 0) Calibrate new benchmark: fit parameters for $B_t$ while holding all existing anchor parameters fixed.

4. Evaluate a new model: estimate $\hat{\theta}$ from responses to the accumulated anchors, then predict accuracy on the full suite.
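The anchor-selection step (step 2) can be sketched as k-means over per-item parameter vectors, keeping the item nearest each centroid. Here `select_anchors` and all numbers are illustrative rather than the paper's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_anchors(item_params, n_anchors=10, seed=0):
    """Pick one representative item per cluster of IRT item parameters.

    item_params: [n_items, p] per-item features, e.g. the discrimination
    vector and intercept stacked together."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(item_params)
    anchors = []
    for c in range(n_anchors):
        members = np.where(km.labels_ == c)[0]
        # keep the member closest to its cluster centroid
        dists = np.linalg.norm(item_params[members] - km.cluster_centers_[c], axis=1)
        anchors.append(int(members[np.argmin(dists)]))
    return sorted(anchors)

# Simulated item parameters: two discrimination dimensions + intercept.
rng = np.random.default_rng(1)
params = np.column_stack([rng.normal(1, 0.4, 200),
                          rng.normal(1, 0.4, 200),
                          rng.normal(0, 1, 200)])
chosen = select_anchors(params, n_anchors=5)
```

Choosing one item per cluster is what spreads anchors across the full difficulty and discrimination space, the property the top-K ablation later shows to matter.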

Formally, at step $t$ we estimate ability from the cumulative anchor set $A_{\leq t} = A_0 \cup \cdots \cup A_t$:

$$\hat{\theta} = \arg\max_{\theta} \prod_{i \in A_{\leq t}} P(y_i \mid \theta; \mathbf{a}_i, d_i)$$

This requires only $|A_{\leq t}|$ evaluations per model, rather than running on the entire suite.
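The ability estimate $\hat{\theta}$ above can be sketched as maximizing the anchor-set log-likelihood with a generic optimizer. Anchors and responses below are simulated, and `estimate_theta` is a hypothetical helper:

```python
import numpy as np
from scipy.optimize import minimize

def estimate_theta(y, A, d, dim=2):
    """Maximum-likelihood ability estimate from anchor responses.

    y: 0/1 responses to the accumulated anchors [n_items]
    A: anchor discrimination vectors [n_items, dim] (held fixed)
    d: anchor intercepts [n_items] (held fixed)"""
    def neg_log_lik(theta):
        z = A @ theta + d
        p = 1.0 / (1.0 + np.exp(-z))
        p = np.clip(p, 1e-9, 1 - 1e-9)  # numerical safety
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    res = minimize(neg_log_lik, x0=np.zeros(dim), method="L-BFGS-B")
    return res.x

# Simulated accumulated anchor set of 100 items in a 2-d ability space.
rng = np.random.default_rng(0)
A = rng.normal(1.0, 0.3, size=(100, 2))   # fixed anchor discriminations
d = rng.normal(0.0, 1.0, size=100)        # fixed anchor intercepts
true_theta = np.array([0.8, -0.5])
p = 1.0 / (1.0 + np.exp(-(A @ true_theta + d)))
y = (rng.random(100) < p).astype(float)   # simulated anchor responses
theta_hat = estimate_theta(y, A, d)
```

Only the anchor responses enter the likelihood, which is what caps the per-model cost at $|A_{\leq t}|$ items regardless of suite size.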

Experimental Setup

We evaluate our framework on two benchmark suites, using randomized dataset orderings to simulate the sequential addition of benchmarks over time.

Benchmark Suite      | # Models | # Datasets | Included Datasets
Open LLM Leaderboard | 395      | 6          | ARC Challenge, GSM8K, HellaSwag, MMLU, TruthfulQA, Winogrande
MMLU                 | 428      | 57         | 57 subject subdomains (algebra, law, medical genetics, …)

Baselines

  • Concurrent calibration — jointly re-estimates all item parameters and model abilities each time a new benchmark is added, without constraining any to prior values.
  • Random sampling — estimates performance by directly averaging accuracy on a randomly drawn subset of $N$ questions from the newly added dataset, without any IRT modeling.

Dataset chain construction

To simulate the gradual accumulation of datasets over time, we define a chain: a sequence of datasets added one at a time to an initial suite. At each subsequent step $t$, one dataset is added and integrated via calibration; we then predict each test model's accuracy on the newly added dataset and compare against its full-evaluation accuracy. Because we evaluate prediction quality at every step along the chain, each chain yields measurements at increasing chain lengths, allowing us to test whether prediction error accumulates as more benchmarks are added.

Results

Prediction error remains low as the benchmark suite grows

Fixed parameter calibration provides strong predictive performance. Its MAE remains low and stable across chain steps on both the Open LLM Leaderboard and MMLU, closely tracking concurrent calibration while consistently outperforming random sampling, especially at smaller anchor budgets.

Fixed parameter calibration at constant cost vs. concurrent calibration's growing cost
Figure 2. Fixed parameter calibration maintains low prediction error at constant evaluation cost as the benchmark suite grows (Open LLM Leaderboard, 100 anchors per dataset). Concurrent calibration's cost grows linearly as it re-evaluates all accumulated anchors, with no corresponding improvement in accuracy. Random sampling shares the constant cost of fixed parameter calibration but incurs consistently higher prediction error.

A small anchor budget suffices

Across both benchmark suites, even compact sets (e.g., $N = 10$ or $25$) allow fixed parameter calibration and concurrent calibration to maintain low and stable MAE. As the number of anchors increases, random sampling approaches the performance of IRT-based methods, narrowing the gap between all three approaches.

Prediction error across chain steps on Open LLM Leaderboard with varying anchor budgets Prediction error across chain steps on MMLU with varying anchor budgets
Figure 3. A small anchor budget suffices for accurate prediction across chain steps, with diminishing returns from larger sets. Here the figure is shown in two stacked panels: the upper panel shows the Open LLM Leaderboard, and the lower panel shows MMLU. In both suites, even compact anchor sets allow fixed parameter calibration and concurrent calibration to maintain low and stable MAE, while random sampling approaches them only as the anchor budget grows. Shaded regions denote 95% confidence intervals.

To put these numbers in perspective: 100 anchors per dataset is a tiny slice of each benchmark. On HellaSwag that's 1% of items; on MMLU, under 1%. Even the smallest benchmarks stay under 12%.

Open LLM Leaderboard

Dataset       | Total  | N=50 | N=100
ARC Challenge | 1,212  | 4.1% | 8.3%
GSM8K         | 1,359  | 3.7% | 7.4%
HellaSwag     | 10,082 | 0.5% | 1.0%
MMLU          | 14,042 | 0.4% | 0.7%
TruthfulQA    | 857    | 5.8% | 11.7%
Winogrande    | 1,307  | 3.8% | 7.7%

MMLU (representative subjects)

Subject           | Total  | N=10  | N=50
Abstract Algebra  | 100    | 10.0% | 50.0%
Computer Security | 100    | 10.0% | 50.0%
Machine Learning  | 112    | 8.9%  | 44.6%
College Medicine  | 173    | 5.8%  | 28.9%
High School Math  | 270    | 3.7%  | 18.5%
Moral Scenarios   | 895    | 1.1%  | 5.6%
Professional Law  | 1,534  | 0.7%  | 3.3%
Total             | 14,042 | 4.1%  | 20.3%

Anchor coverage as a percentage of total items. $N$ denotes the number of anchor questions per benchmark.

Model rankings are well preserved

Beyond absolute error, the approximation methods also recover the relative ordering of models well. IRT-based approaches match or improve upon random sampling across configurations, with the clearest gains at low anchor counts. As the anchor budget grows, all methods move toward near-perfect ranking estimation.

Method                      | Open LLM Leaderboard (N=25 / 100 / 200) | MMLU (N=10 / 25 / 50 / 100)
Random                      | 0.82 / 0.91 / 0.96                      | 0.68 / 0.85 / 0.91 / 0.98
Concurrent Calibration      | 0.89 / 0.94 / 0.97                      | 0.73 / 0.83 / 0.90 / 0.98
Fixed Parameter Calibration | 0.88 / 0.94 / 0.97                      | 0.72 / 0.84 / 0.90 / 0.98

Ranking preservation (Spearman ρ) between predicted and full-evaluation model orderings.

Clustering-based anchor selection matters

To test whether the clustering-based anchor selection is necessary, we replace it with a top-K baseline that selects items with the highest discrimination parameter, keeping the rest of the pipeline identical. Top-K selection produces substantially higher MAE. Clustering distributes anchors across the full difficulty and discrimination space, while top-K concentrates them in a narrow high-discrimination region. Representative coverage of the item space — not just high discrimination — is necessary for accurate prediction.

Top-K selection vs. clustering-based fixed parameter calibration
Figure 4. Replacing the clustering-based anchor selection with top-K selection by discrimination parameter leads to substantially higher MAE on the Open LLM Leaderboard, while the rest of the fixed parameter calibration pipeline remains identical.

Accurate approximation requires dozens of reference models

Approximation quality depends on the number of reference models: a very small set is insufficient for low prediction error. On the Open LLM Leaderboard, using only 25 reference models produces unstable error profiles, while performance becomes reliable once the pool reaches roughly 100 models. On MMLU, 25 reference models already yield robust performance, but the broader pattern is the same: accurate approximation requires at least dozens of reference models.

Effect of reference model count on Open LLM Leaderboard Effect of reference model count on MMLU
Figure 5. The effect of reference model count varies across suites. Here the figure is shown in two stacked panels: the upper panel shows the Open LLM Leaderboard, and the lower panel shows MMLU. On the Open LLM Leaderboard, reliable fixed parameter calibration is only achieved once the reference pool reaches roughly 100 or more models. On MMLU, prediction quality is more stable even with fewer reference models, though the broader pattern is the same: accurate approximation requires at least dozens of reference models.

Additional Results: Anchor Item Maps

Item maps plot each item's difficulty ($b$) against its discrimination ($a$). Across all datasets, clustering-based selection distributes anchors across the full parameter space, while top-K selection concentrates them in a narrow high-discrimination region — explaining the higher prediction error observed above.

Item maps for representative MMLU subjects
Figure 6. Item maps for representative MMLU subjects. Clustering-based selection (circles) distributes anchors across the full parameter space, while top-K selection (diamonds) concentrates them in a narrow high-discrimination region.
Top-K vs. fixed parameter calibration on MMLU
Figure 7. Top-K selection leads to substantially higher MAE on MMLU, confirming the pattern seen on the Open LLM Leaderboard.
Item maps for Open LLM Leaderboard datasets
Figure 8. Item maps for all Open LLM Leaderboard datasets. The same pattern holds: clustering spreads anchors across the item space, top-K concentrates them.

Citation

@misc{habba2026growingpainsextensibleefficient,
      title={Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration},
      author={Eliya Habba and Itay Itzhak and Asaf Yehudai and Yotam Perlitz and Elron Bandel and Michal Shmueli-Scheuer and Leshem Choshen and Gabriel Stanovsky},
      year={2026},
      eprint={2604.12843},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.12843},
}