QIMMA Arabic LLM leaderboard: benchmark release and results
QIMMA publishes a quality-first benchmark evaluating Arabic large language models across tasks, dialects and model sizes.
TL;DR
- QIMMA publishes a quality-first benchmark evaluating Arabic large language models across tasks, dialects and model sizes.
- QIMMA, a quality-first Arabic large language model leaderboard, launched this week with benchmark results covering dozens of models and multiple Arabic dialects.
- QIMMA aggregates automatic metrics, targeted task evaluations and human assessments to produce a single ranking layer while also publishing detailed breakdowns.
QIMMA, a quality-first Arabic large language model leaderboard, launched this week with benchmark results covering dozens of models and multiple Arabic dialects. The leaderboard, published on Hugging Face by the Technology Innovation Institute and contributors, emphasizes human-centered quality metrics over raw parameter counts and exposes per-task scores and evaluation artifacts.
QIMMA aggregates automatic metrics, targeted task evaluations and human assessments to produce a single ranking layer while also publishing detailed breakdowns. The initiative is explicitly framed around Arabic-language competence: evaluations include Modern Standard Arabic, regional dialects, and a mix of conversational and task-oriented prompts. The initial release ships with dataset splits, evaluation code, and a public leaderboard intended to make comparisons reproducible.
How QIMMA measures quality
QIMMA departs from size-first comparisons by weighting output quality and task-specific performance more heavily than model scale. The benchmark suite organizes evaluations into discrete tasks such as reading comprehension, instruction following, summarization, and dialectal understanding, and reports per-task metrics alongside an aggregated quality score. Human raters are used for subjective dimensions where automatic metrics are weak, for example fluency, factuality and instruction adherence.
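To make the idea of an aggregated quality score concrete, here is a minimal sketch of one common way to combine per-task metrics: a weighted mean. This is not QIMMA's published formula; the function name `aggregate_quality`, the task keys, the weights and the scores are all assumptions chosen for illustration.

```python
from typing import Dict


def aggregate_quality(task_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-task scores, each assumed to lie in [0, 1].

    Hypothetical aggregation rule for illustration only; the leaderboard's
    actual weighting scheme is not described in this article.
    """
    total_weight = sum(weights[task] for task in task_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(task_scores[task] * weights[task] for task in task_scores)
    return weighted_sum / total_weight


# Made-up scores for a single model (not real leaderboard results).
scores = {
    "reading_comprehension": 0.80,
    "instruction_following": 0.70,
    "summarization": 0.60,
    "dialectal_understanding": 0.50,
}

# Made-up weights: dialectal understanding counts double in this sketch.
weights = {
    "reading_comprehension": 1.0,
    "instruction_following": 1.0,
    "summarization": 1.0,
    "dialectal_understanding": 2.0,
}

print(round(aggregate_quality(scores, weights), 2))
```

Note that model size never enters the calculation: under a scheme like this, a model ranks higher only by scoring better on the tasks themselves.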
The benchmark is modular: researchers can run individual task suites or the full battery. All evaluation scripts and task definitions are published with versioning, so teams can re-run and extend the tests. QIMMA also provides metadata about training corpora where available, so users can inspect whether top-performing models were trained on Arabic-specific datasets or adapted via fine-tuning.
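The modular design described above can be pictured as a registry of task suites from which a caller selects either a subset or everything. The sketch below is a hypothetical illustration, not QIMMA's evaluation code: `REGISTRY`, `run_benchmark`, `exact_match_suite`, the task names and the toy examples are all invented here.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Model = Callable[[str], str]        # maps a prompt to a model response
TaskSuite = Callable[[Model], float]  # maps a model to a score in [0, 1]


def exact_match_suite(examples: List[Tuple[str, str]]) -> TaskSuite:
    """Build a toy suite that scores the fraction of exact-match answers."""
    def run(model: Model) -> float:
        hits = sum(1 for prompt, expected in examples if model(prompt).strip() == expected)
        return hits / len(examples)
    return run


# Hypothetical registry; real task definitions would load published dataset splits.
REGISTRY: Dict[str, TaskSuite] = {
    "reading_comprehension": exact_match_suite([("q1", "a1"), ("q2", "a2")]),
    "dialectal_understanding": exact_match_suite([("d1", "x1")]),
}


def run_benchmark(model: Model, tasks: Optional[Iterable[str]] = None) -> Dict[str, float]:
    """Run the selected task suites, or the full battery when tasks is None."""
    selected = list(REGISTRY) if tasks is None else list(tasks)
    unknown = [name for name in selected if name not in REGISTRY]
    if unknown:
        raise KeyError(f"unknown task suites: {unknown}")
    return {name: REGISTRY[name](model) for name in selected}


def toy_model(prompt: str) -> str:
    """Stand-in model that answers from a fixed lookup table."""
    return {"q1": "a1", "q2": "wrong", "d1": "x1"}.get(prompt, "")


print(run_benchmark(toy_model))                               # full battery
print(run_benchmark(toy_model, ["dialectal_understanding"]))  # single suite
```

Keeping each suite behind the same call signature is what lets a new task be added to the registry without touching the runner.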
What the initial rankings show
The leaderboard highlights several consistent patterns rather than a single dominant model. Models that incorporate Arabic-native pretraining or extensive Arabic fine-tuning score higher on dialectal and nuance-heavy tasks. Larger, instruction-tuned models tend to do better on open-ended reasoning and multi-turn dialogue, but fine-tuned mid-size models outperform some larger models on targeted tasks like summarization and fact extraction.
Open-source models have narrowed the gap with proprietary systems on many utility tasks, though closed commercial offerings still lead on aggregate quality in a number of categories. The release also exposes gaps: few models consistently excel across all dialects, and contextual understanding for low-resource dialects remains a weak point. QIMMA includes per-task leaderboards so developers can identify models suited to specific needs rather than relying on a single overall ranking.
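A small sketch shows why per-task leaderboards can point to a different model than a single overall ranking. The model names and scores below are invented for illustration and are not QIMMA results; `best_overall` and `best_per_task` are hypothetical helpers, and the overall score is assumed to be an unweighted mean.

```python
from typing import Dict

Results = Dict[str, Dict[str, float]]  # model name -> task name -> score


def best_overall(results: Results) -> str:
    """Return the model with the highest unweighted mean score across its tasks."""
    return max(results, key=lambda m: sum(results[m].values()) / len(results[m]))


def best_per_task(results: Results) -> Dict[str, str]:
    """Return, for each task, the model with the highest score on that task."""
    tasks = sorted({task for scores in results.values() for task in scores})
    return {
        task: max(
            (m for m in results if task in results[m]),
            key=lambda m: results[m][task],
        )
        for task in tasks
    }


# Made-up scores: model_a is a stronger generalist, model_b a summarization specialist.
results = {
    "model_a": {"summarization": 0.60, "dialogue": 0.90},
    "model_b": {"summarization": 0.80, "dialogue": 0.60},
}

print(best_overall(results))
print(best_per_task(results))
```

In this toy data the overall winner is not the best choice for summarization, which is the situation per-task breakdowns are meant to expose.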
QIMMA's documentation and evaluation artifacts include licensing notes and reproducibility instructions, addressing common barriers that have limited prior Arabic LLM comparisons. The dataset splits and scorer implementations are intended to reduce ambiguity in how metrics are computed.
Why it matters
QIMMA shifts evaluation emphasis from parameter counts to measured quality across Arabic variants, making it easier to match models to real-world use cases. By publishing evaluation code and per-task breakdowns, the leaderboard should reduce friction for developers choosing or improving Arabic LLMs and surface specific weaknesses that need dataset and modeling investment.
| Model category | Size (parameters) | Coverage | Quality | Notes |
|---|---|---|---|---|
| Open-source small | <10B | Narrow | Lower–mid | Strong on single tasks, weaker on reasoning |
| Open-source large | 10–70B | Broad | Mid–high | Better fluency and cross-task performance |
| Commercial closed | Proprietary | Broad | High | Top aggregate quality, limited transparency |
| Arabic-specialized | Varies | High on dialects | High | Excels on dialectal nuance and local corpora |
Written by The Brieftide · Source: Hugging Face