Benchmarks & Evals · April 21, 2026 · 3 min read

QIMMA Arabic LLM leaderboard: benchmark release and results

QIMMA publishes a quality-first benchmark evaluating Arabic large language models across tasks, dialects and model sizes.

The Brieftide · April 21, 2026

TL;DR

01 QIMMA publishes a quality-first benchmark evaluating Arabic large language models across tasks, dialects and model sizes.
02 QIMMA, a quality-first Arabic large language model leaderboard, launched this week with benchmark results covering dozens of models and multiple Arabic dialects.
03 QIMMA aggregates automatic metrics, targeted task evaluations and human assessments to produce a single ranking layer while also publishing detailed breakdowns.

QIMMA, a quality-first Arabic large language model leaderboard, launched this week with benchmark results covering dozens of models and multiple Arabic dialects. The leaderboard, published on Hugging Face by the Technology Innovation Institute and contributors, emphasizes human-centered quality metrics over raw parameter counts and exposes per-task scores and evaluation artifacts.

QIMMA aggregates automatic metrics, targeted task evaluations and human assessments to produce a single ranking layer while also publishing detailed breakdowns. The initiative is explicitly framed around Arabic-language competence: evaluations include Modern Standard Arabic, regional dialects, and a mix of conversational and task-oriented prompts. The initial release ships with dataset splits, evaluation code, and a public leaderboard intended to make comparisons reproducible.

How QIMMA measures quality

QIMMA departs from size-first comparisons by weighting output quality and task-specific performance more heavily than model scale. The benchmark suite organizes evaluations into discrete tasks such as reading comprehension, instruction following, summarization, and dialectal understanding, and reports per-task metrics alongside an aggregated quality score. Human raters are used for subjective dimensions where automatic metrics are weak, for example fluency, factuality and instruction adherence.
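To make the idea of an aggregated quality score concrete, here is a minimal sketch of a weighted mean over per-task scores. The source does not describe QIMMA's actual aggregation formula, so the function name, the task keys, the weights and the scores below are all assumptions chosen for illustration only.

```python
from typing import Dict


def aggregate_quality(task_scores: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted mean of per-task scores (hypothetical aggregation rule).

    Weights are normalized over the tasks actually present, so a model
    evaluated on a subset of tasks still gets a score in the same range.
    """
    missing = set(task_scores) - set(weights)
    if missing:
        raise ValueError(f"no weight defined for tasks: {sorted(missing)}")
    total_weight = sum(weights[task] for task in task_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(task_scores[task] * weights[task] for task in task_scores)
    return weighted_sum / total_weight


# Illustrative numbers only; these are not QIMMA results.
example_scores = {
    "reading_comprehension": 0.80,
    "instruction_following": 0.70,
    "summarization": 0.60,
    "dialectal_understanding": 0.50,
}
# Assumed weighting that counts dialectal understanding twice as heavily.
example_weights = {
    "reading_comprehension": 1.0,
    "instruction_following": 1.0,
    "summarization": 1.0,
    "dialectal_understanding": 2.0,
}

overall = aggregate_quality(example_scores, example_weights)
print(f"aggregate quality: {overall:.2f}")
```

A weighted mean is only one possible choice. A leaderboard could equally use rank aggregation or blend in human ratings separately, which is why per-task scores remain useful alongside any single number.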

The benchmark is modular: researchers can run individual task suites or the full battery. All evaluation scripts and task definitions are published with versioning, enabling teams to re-run and extend the tests. QIMMA also provides metadata about training corpora where available, so users can inspect whether top-performing models were trained on Arabic-specific datasets or adapted via fine-tuning.
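The modular design described above can be sketched as a small task registry that runs either one suite or every registered suite. This is not QIMMA's published evaluation code: the registry, the decorator, the two toy tasks and the stand-in model are hypothetical names invented for this example.

```python
from typing import Callable, Dict, Iterable, Optional

ModelFn = Callable[[str], str]      # a model maps a prompt to a completion
TaskFn = Callable[[ModelFn], float]  # a task scores a model between 0 and 1

TASKS: Dict[str, TaskFn] = {}        # hypothetical registry of task suites


def register_task(name: str) -> Callable[[TaskFn], TaskFn]:
    """Decorator that adds a task suite to the registry under `name`."""
    def decorator(fn: TaskFn) -> TaskFn:
        TASKS[name] = fn
        return fn
    return decorator


@register_task("echo_exact_match")
def echo_exact_match(model: ModelFn) -> float:
    # Toy task: fraction of prompts the model returns unchanged.
    prompts = ["مرحبا", "شكرا"]
    return sum(1 for p in prompts if model(p) == p) / len(prompts)


@register_task("non_empty_output")
def non_empty_output(model: ModelFn) -> float:
    # Toy task: fraction of prompts that yield a non-blank answer.
    prompts = ["ما اسمك؟", "كيف حالك؟"]
    return sum(1 for p in prompts if model(p).strip()) / len(prompts)


def run_suite(model: ModelFn,
              task_names: Optional[Iterable[str]] = None) -> Dict[str, float]:
    """Run the named tasks, or every registered task when none are named."""
    names = list(TASKS) if task_names is None else list(task_names)
    unknown = [n for n in names if n not in TASKS]
    if unknown:
        raise KeyError(f"unknown tasks: {unknown}")
    return {name: TASKS[name](model) for name in names}


def identity_model(prompt: str) -> str:
    # Stand-in for a real model call; it simply echoes the prompt.
    return prompt


full_results = run_suite(identity_model)                          # full battery
single_result = run_suite(identity_model, ["echo_exact_match"])   # one suite
```

Keeping each task behind the same callable interface is what lets a team add a new dialect suite without touching the runner.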

What the initial rankings show

The leaderboard highlights several consistent patterns rather than a single dominant model. Models that incorporate Arabic-native pretraining or extensive Arabic fine-tuning score higher on dialectal and nuance-heavy tasks. Larger, instruction-tuned models tend to do better on open-ended reasoning and multi-turn dialogue, but fine-tuned mid-size models outperform some larger models on targeted tasks like summarization and fact extraction.

Open-source models have narrowed the gap with proprietary systems on many utility tasks, though closed commercial offerings still lead on aggregate quality in a number of categories. The release also exposes gaps: few models consistently excel across all dialects, and contextual understanding for low-resource dialects remains a weak point. QIMMA includes per-task leaderboards so developers can identify models suited to specific needs rather than relying on a single overall ranking.
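As a rough illustration of why per-task leaderboards matter, the sketch below picks the top-scoring model for each task from a score table. The model names and scores are placeholders invented for this example, not QIMMA results, and the helper function is hypothetical.

```python
from typing import Dict


def best_model_per_task(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return the highest-scoring model for each task.

    `scores` maps model name -> {task name -> score}. Models that were
    not evaluated on a task are simply skipped for that task.
    """
    tasks = sorted({task for per_model in scores.values() for task in per_model})
    best: Dict[str, str] = {}
    for task in tasks:
        candidates = {
            model: per_model[task]
            for model, per_model in scores.items()
            if task in per_model
        }
        best[task] = max(candidates, key=candidates.get)
    return best


# Placeholder data for illustration only.
placeholder_scores = {
    "model_a": {"summarization": 0.71, "dialectal_understanding": 0.52},
    "model_b": {"summarization": 0.64, "dialectal_understanding": 0.68},
}

picks = best_model_per_task(placeholder_scores)
for task, model in picks.items():
    print(f"{task}: {model}")
```

In this made-up table no single model wins both tasks, which is the situation where an overall ranking alone would hide the better choice for a specific use case.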

QIMMA's documentation and evaluation artifacts include licensing notes and reproducibility instructions, addressing common barriers that have limited prior Arabic LLM comparisons. The dataset splits and scorer implementations are intended to reduce ambiguity in how metrics are computed.

Why it matters

QIMMA shifts evaluation emphasis from parameter counts to measured quality across Arabic variants, making it easier to match models to real-world use cases. By publishing evaluation code and per-task breakdowns, the leaderboard should reduce friction for developers choosing or improving Arabic LLMs and surface specific weaknesses that need dataset and modeling investment.

QIMMA leaderboard summary by model category

Category	Model size	Coverage	Aggregate quality	Notes
Open-source small	<10B	Narrow	Lower–mid	Strong on single tasks, weaker on reasoning
Open-source large	10–70B	Broad	Mid–high	Better fluency and cross-task performance
Commercial closed	Proprietary	Broad	High	Top aggregate quality, limited transparency
Arabic-specialized	Varies	High on dialects	High	Excels on dialectal nuance and local corpora

Written by The Brieftide · Source: Hugging Face
