If you’ve spent any time around ML benchmarks, you’ve seen the pattern: grab a dataset, have a handful of raters label each item, take the majority vote, call it ground truth. It’s simple, cheap, and almost certainly wrong.
Google Research just dropped a paper and an open-source simulator that finally puts some hard numbers on a problem we’ve all felt but rarely quantified: how many human raters do you actually need to build a reproducible benchmark?
The short answer is a lot more than most people use.
The forest vs. tree problem
The paper’s title gives away the metaphor. Do you want breadth (the forest), with 1,000 items each rated by one person? Or depth (the tree), with 50 items each rated by the same pool of 20 people?
Most of the field has defaulted to the forest. You see papers with 1-3 raters per example, assuming that’s enough to find some objective truth. But anyone who’s actually worked on subjective tasks like toxicity detection or hate speech moderation knows that humans disagree constantly. Collapsing multiple ratings into a single plurality label doesn’t capture that variation — it hides it.
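To make the "hides it" point concrete, here is a minimal Python sketch. The label names and the two example items are made up for illustration; they are not taken from the paper or its datasets.

```python
from collections import Counter

def plurality_label(ratings):
    """Collapse ratings to the single most common label."""
    return Counter(ratings).most_common(1)[0][0]

def label_distribution(ratings):
    """Keep the full picture: the share of raters who chose each label."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}

# Two hypothetical items, ten raters each.
item_a = ["toxic"] * 10                      # unanimous
item_b = ["toxic"] * 6 + ["not_toxic"] * 4   # contested

for name, ratings in (("item_a", item_a), ("item_b", item_b)):
    print(name, plurality_label(ratings), label_distribution(ratings))
```

Both items come out as "toxic" under plurality voting, even though one is unanimous and the other is a 60/40 split. Only the distribution keeps that difference.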
The researchers ran a massive simulation across real-world datasets, varying N (total items rated) from 100 to 50,000 and K (raters per item) from 1 to 500. They were looking for configurations that produced statistically reliable results (p < 0.05) — the kind that another lab could reproduce.
The results aren’t subtle. With only 1-5 raters per item, reproducibility was terrible. You’d need to test thousands of items just to get stable comparisons. But bumping K up to 20-50 raters per item let you cut N dramatically while still getting reliable results.
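A toy calculation shows why small K is fragile. This is not the paper's simulation: it looks at a single item, assumes each rater independently says "positive" with a fixed probability p, and asks how often two independent rating runs would land on the same majority label. The function names are hypothetical.

```python
from math import comb

def p_majority_positive(k, p):
    """Exact probability that more than half of k independent raters say
    'positive' when each does so with probability p. Use odd k to avoid ties."""
    return sum(
        comb(k, j) * p**j * (1 - p) ** (k - j)
        for j in range(k // 2 + 1, k + 1)
    )

def p_two_runs_agree(k, p):
    """Probability that two independent panels of k raters produce the same
    majority label for the same item."""
    q = p_majority_positive(k, p)
    return q * q + (1 - q) * (1 - q)

# An item that raters split 60/40 on (assumed value, for illustration only).
for k in (1, 3, 5, 25, 51):
    print(k, round(p_majority_positive(k, 0.6), 3), round(p_two_runs_agree(k, 0.6), 3))
```

With one rater on a 60/40 item, two runs agree on the label only 52% of the time, barely better than a coin flip. Agreement climbs as K grows, which is the same direction as the paper's finding, though the paper measures reproducibility of comparisons across N items rather than of a single label.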
That 20-50 range is higher than I expected. I’ve been guilty of the 3-rater default myself.
Why this matters for actual research
The practical implication is straightforward: if you’re building a benchmark for a subjective task, you’re probably using too few raters per item. Spend your budget on depth, not breadth: get 20-50 people to rate each example instead of spreading a handful of raters thin across thousands of items.
But there’s a deeper point here that the paper doesn’t shout about enough. The whole premise of “ground truth” for subjective tasks is shaky. When humans disagree, there isn’t always a single correct label. The paper’s framework acknowledges this by working with distributions rather than collapsed labels.
This approach has been tried before in areas like medical imaging and legal document review, but it’s never been systematically studied for ML benchmarks at this scale.
The simulator is the real gift
Google released their simulation code as open source. You can plug in your own dataset parameters — expected disagreement rates, budget constraints, number of classes — and get a recommendation for N and K. This is the kind of tool that should become standard practice before launching any new benchmark.
I ran a quick test with parameters similar to a toxicity dataset I worked on a few years back. The simulator suggested 35 raters per item with 400 items. We had used 5 raters on 2,000 items. No wonder our results never replicated cleanly.
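When comparing two designs like these, it helps to remember that annotation cost scales with N × K, so they are not the same size. A quick sketch, assuming every rating costs the same; the helper names are hypothetical and not part of the simulator:

```python
def total_ratings(n_items, k_raters):
    """Total number of individual ratings a design requires."""
    return n_items * k_raters

def items_affordable(budget, k_raters):
    """How many items a fixed rating budget covers at k raters per item."""
    return budget // k_raters

suggested = total_ratings(400, 35)    # 400 items x 35 raters
original = total_ratings(2000, 5)     # 2,000 items x 5 raters

print(suggested)                       # 14000
print(original)                        # 10000
print(items_affordable(original, 35))  # 285
```

The suggested design needs 14,000 ratings against the 10,000 originally collected. Holding the original budget fixed, 35 raters per item would cover 285 items.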
What’s missing
The paper focuses on binary and multi-class classification. It doesn’t address generative tasks or open-ended evaluations, which are increasingly where the action is. And the simulator assumes you have a rough idea of your disagreement rate upfront, which you often don’t.
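If you have no estimate at all, one rough option is a small pilot: have a few raters label a few dozen items and measure how often two raters of the same item disagree. The sketch below is an assumption about what a usable "disagreement rate" could look like. Nothing here says how the simulator defines its input, so check its documentation before plugging a number in. The pilot data is made up.

```python
from collections import Counter

def pairwise_disagreement(ratings):
    """Fraction of rater pairs on one item that chose different labels."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings per item")
    # c * (c - 1) counts ordered pairs of raters who picked the same label.
    agreeing_pairs = sum(c * (c - 1) for c in Counter(ratings).values())
    return 1 - agreeing_pairs / (n * (n - 1))

def mean_disagreement(items):
    """Average pairwise disagreement across all pilot items."""
    return sum(pairwise_disagreement(r) for r in items) / len(items)

# Hypothetical pilot: two items, five raters each.
pilot = [
    ["toxic"] * 5,                       # everyone agrees
    ["toxic"] * 3 + ["not_toxic"] * 2,   # 3-2 split
]
print(mean_disagreement(pilot))
```

A pilot this small gives only a ballpark figure, but that is still a better starting point than guessing.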
Still, this is the most practical work on benchmark design I’ve seen in a while. It’s not flashy, but it’ll save researchers a lot of wasted time and money.
If you’re building or maintaining any benchmark that involves human judgment, read the paper. Then run the simulator. Your future self — and anyone trying to reproduce your results — will thank you.