QIMMA: The Arabic LLM Leaderboard That Actually Checks Its Homework

If you’ve been following Arabic LLM evaluation for a while, you’ve probably felt the same tension I have. There are more benchmarks and leaderboards popping up every month, but something about the numbers never quite sat right.

Are we actually measuring what we think we’re measuring?

A team from TII built QIMMA (قمّة, Arabic for “summit”) to answer that question systematically. Instead of doing what everyone else does — grab existing benchmarks, run models on them, publish rankings — they applied a rigorous quality validation pipeline before any evaluation took place. What they found isn’t pretty.

The Problem: Arabic NLP Evaluation Is a Mess

Arabic is spoken by over 400 million people across dozens of dialects and cultural contexts. You’d think the evaluation landscape would reflect that richness. It doesn’t.

Translation issues are the obvious one. A lot of Arabic benchmarks are just English benchmarks run through a translator. Questions that make sense in English become awkward or culturally weird in Arabic. You’re not measuring Arabic capability; you’re measuring how well a model can guess what the translator was thinking.

Absent quality validation is the bigger problem. Even native Arabic benchmarks get released without proper quality checks. Annotation inconsistencies, wrong gold answers, encoding errors, cultural bias in ground-truth labels — it’s all there, documented across multiple established resources.

Reproducibility gaps make everything worse. Evaluation scripts and per-sample outputs are rarely public, so you can’t audit results or build on prior work.

Coverage fragmentation rounds out the picture. Existing leaderboards cover isolated tasks and narrow domains. Want to know how a model performs holistically? Good luck stitching together results from six different platforms.

Here’s where QIMMA sits relative to everything else:

| Leaderboard | Open Source | Native Arabic | Quality Validation | Coding Eval | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | ✅ | Mixed | ❌ | ❌ | ✅ |
| OALL v2 | ✅ | Mostly | ❌ | ❌ | ✅ |
| BALSAM | Partial | 50% | ❌ | ❌ | ❌ |
| AraGen | ✅ | 100% | ✅ | ❌ | ❌ |
| SILMA ABL | ✅ | 100% | ✅ | ❌ | ✅ |
| ILMAAM | Partial | 100% | ✅ | ❌ | ❌ |
| HELM Arabic | ✅ | Mixed | ❌ | ❌ | ✅ |
| ⛰ QIMMA | ✅ | 99% | ✅ | ✅ | ✅ |

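QIMMA is the only platform in this comparison that combines all five: open source, native Arabic content, quality validation, coding evaluation, and public outputs. That's not a coincidence. It's the point.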

What’s Actually in QIMMA

The benchmark suite consolidates 109 subsets from 14 source benchmarks into a unified evaluation set of over 52,000 samples, covering 7 domains:

  • Cultural: AraDiCE-Culture, ArabCulture, PalmX (MCQ)
  • STEM: ArabicMMLU, GAT, 3LM STEM (MCQ)
  • Legal: ArabLegalQA, MizanQA (MCQ, QA)
  • Medical: MedArabiQ, MedAraBench (MCQ, QA)
  • Safety: AraTrust (MCQ)
  • Poetry & Literature: FannOrFlop (QA)
  • Coding: 3LM HumanEval+, 3LM MBPP+ (Code)

A few things worth highlighting:

  • 99% native Arabic content. The only exception is code evaluation, which is language-agnostic by nature.
  • First Arabic leaderboard with code evaluation. They adapted HumanEval+ and MBPP+ with Arabic-language problem statements. This is long overdue.
  • Real diversity in domains and tasks. This isn’t just another MMLU variant. Education, governance, healthcare, creative expression, software development — it’s all there.

The Quality Validation Pipeline

This is where QIMMA separates itself from everything else. Before running a single model, they applied a multi-stage validation pipeline to every sample in every benchmark.

Stage 1: Multi-Model Automated Assessment

Each sample was independently evaluated by two state-of-the-art LLMs: Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. They picked models with strong Arabic capability but different training data compositions, so their combined judgment is more robust than either alone.

Each model scores a sample against a ten-criterion rubric, with a binary score (0 or 1) per criterion, for a total out of 10. A model flags a sample when its total falls below 7/10. Samples that both models flag are dropped immediately. Samples that only one model flags move on to human review.
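The triage rule is easier to follow as code. The sketch below is one reading of the description above, not QIMMA's actual implementation: the names `Decision`, `rubric_total`, and `triage` are hypothetical, and it assumes each judge returns ten binary criterion scores.

```python
from enum import Enum

NUM_CRITERIA = 10   # rubric size stated in the article
PASS_THRESHOLD = 7  # a judge flags a sample whose total is below this


class Decision(Enum):
    KEEP = "keep"
    DROP = "drop"
    HUMAN_REVIEW = "human_review"


def rubric_total(criteria: list[int]) -> int:
    """Sum ten binary criterion scores into a total out of 10."""
    if len(criteria) != NUM_CRITERIA or any(c not in (0, 1) for c in criteria):
        raise ValueError("expected ten binary (0 or 1) criterion scores")
    return sum(criteria)


def triage(judge_a: list[int], judge_b: list[int]) -> Decision:
    """Combine two judges' rubric scores into a single decision."""
    flags = [rubric_total(scores) < PASS_THRESHOLD for scores in (judge_a, judge_b)]
    if all(flags):
        return Decision.DROP          # both judges flagged the sample
    if any(flags):
        return Decision.HUMAN_REVIEW  # the judges disagree
    return Decision.KEEP              # neither judge flagged the sample
```

Under this reading, a sample scoring 9/10 from one judge and 6/10 from the other goes to human review rather than being dropped outright.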

Stage 2: Human Annotation and Review

Flagged samples get reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators make final calls on cultural context, regional variation, dialectal nuance, subjective interpretation, and subtle quality issues automated assessment might miss.

For culturally sensitive content, multiple perspectives are considered — because “correctness” can genuinely vary across Arab regions.

What They Found: Systematic Quality Problems

The pipeline revealed recurring quality issues across benchmarks. Not isolated errors — systematic problems that affect entire datasets.

I don’t have the full breakdown in front of me, but the pattern is clear: even widely-used, well-regarded Arabic benchmarks contain quality issues that can quietly corrupt evaluation results. Translation artifacts, annotation errors, cultural mismatches — they’re all there, baked into data that researchers have been treating as ground truth.

The scope is larger than I expected. I've worked with Arabic NLP data before, and I knew quality was uneven, but the problem looks worse than most people acknowledge.

The Rankings (After Cleanup)

Once you filter out the garbage data, the model rankings shift. I won’t reproduce the full leaderboard here — go check the QIMMA leaderboard yourself — but the takeaway is straightforward: models that perform well on uncleaned benchmarks don’t always perform well on QIMMA, and vice versa.

Some models that looked strong on paper were clearly coasting on benchmark artifacts. Others that looked mediocre turned out to be genuinely capable once the noise was removed.
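If you want to check this kind of shift on your own results, a small helper is enough. This is a minimal sketch with hypothetical function names; the model names and scores in the usage example are made-up placeholders, not QIMMA results.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank models from best (1) to worst; ties are broken by name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shift(before: dict[str, float], after: dict[str, float]) -> dict[str, int]:
    """Positions gained (positive) or lost (negative) after data cleanup."""
    common = before.keys() & after.keys()  # compare only models scored twice
    rank_before = ranks({m: before[m] for m in common})
    rank_after = ranks({m: after[m] for m in common})
    return {m: rank_before[m] - rank_after[m] for m in common}


# Placeholder numbers for illustration only.
uncleaned = {"model_a": 71.0, "model_b": 68.5, "model_c": 66.0}
cleaned = {"model_a": 64.0, "model_b": 69.0, "model_c": 61.5}
shift = rank_shift(uncleaned, cleaned)
```

With these placeholder scores, `model_b` gains one position and `model_a` loses one, which is the pattern described above: the ordering changes even though no model was retrained.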

What This Means for Arabic NLP

QIMMA isn’t perfect. No benchmark is. But it’s a step in the right direction because it forces the conversation to shift from “how high can we score” to “what are we actually measuring.”

The fact that we’re only now getting systematic quality validation for Arabic benchmarks — in 2026 — is both frustrating and encouraging. Frustrating because this should have been done years ago. Encouraging because it’s finally happening.

If you’re working on Arabic LLMs, stop treating benchmark scores as gospel. Start asking what’s actually in the data. QIMMA gives you the tools to do that.

Check the paper, the GitHub repo, and the leaderboard for the full details. The code and outputs are all public, so you can verify everything yourself. That alone puts it ahead of most alternatives.
