Benchmarks & Evals · 3 min read

QIMMA Arabic LLM leaderboard: benchmark release and results

QIMMA publishes a quality-first benchmark evaluating Arabic large language models across tasks, dialects and model sizes.

The Brieftide

TL;DR

  • QIMMA publishes a quality-first benchmark evaluating Arabic large language models across tasks, dialects and model sizes.
  • QIMMA, a quality-first Arabic large language model leaderboard, launched this week with benchmark results covering dozens of models and multiple Arabic dialects.
  • QIMMA aggregates automatic metrics, targeted task evaluations and human assessments into a single overall ranking while also publishing detailed breakdowns.

QIMMA, a quality-first Arabic large language model leaderboard, launched this week with benchmark results covering dozens of models and multiple Arabic dialects. The leaderboard, published on Hugging Face by the Technology Innovation Institute and contributors, emphasizes human-centered quality metrics over raw parameter counts and exposes per-task scores and evaluation artifacts.

QIMMA aggregates automatic metrics, targeted task evaluations and human assessments into a single overall ranking while also publishing detailed breakdowns. The initiative is explicitly framed around Arabic-language competence: evaluations cover Modern Standard Arabic and regional dialects, using a mix of conversational and task-oriented prompts. The initial release ships with dataset splits, evaluation code, and a public leaderboard intended to make comparisons reproducible.

How QIMMA measures quality

QIMMA departs from size-first comparisons by weighting output quality and task-specific performance more heavily than model scale. The benchmark suite organizes evaluations into discrete tasks such as reading comprehension, instruction following, summarization, and dialectal understanding, and reports per-task metrics alongside an aggregated quality score. Human raters are used for subjective dimensions where automatic metrics are weak, for example fluency, factuality and instruction adherence.
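One way to picture how per-task metrics could roll up into an aggregated quality score is a weighted mean. The sketch below is illustrative only: the source does not state QIMMA's actual aggregation formula, and the function name, task keys, weights and scores are all hypothetical placeholders rather than published results.

```python
from typing import Dict


def aggregate_quality(task_scores: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted mean of per-task scores (assumed to lie in [0, 1]).

    Assumption: a plain weighted average. QIMMA's real scoring
    rule is not described in the article.
    """
    missing = set(weights) - set(task_scores)
    if missing:
        raise ValueError(f"no score for weighted tasks: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(task_scores[task] * w for task, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical per-task scores for one model (not real leaderboard data).
example_scores = {
    "reading_comprehension": 0.8,
    "instruction_following": 0.6,
    "summarization": 0.7,
    "dialectal_understanding": 0.5,
}

# Equal weights treat every task the same.
equal_weights = {task: 1.0 for task in example_scores}

# A dialect-heavy weighting counts dialectal understanding twice.
dialect_heavy = dict(equal_weights, dialectal_understanding=2.0)

print(round(aggregate_quality(example_scores, equal_weights), 3))
print(round(aggregate_quality(example_scores, dialect_heavy), 3))
```

Changing the weights changes the aggregate for the same model, which is one reason a single score is most useful when read next to the per-task numbers.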

The benchmark is modular: researchers can run individual task suites or the full battery. All evaluation scripts and task definitions are published with versioning, enabling teams to re-run and extend the tests. QIMMA also provides metadata about training corpora where available, so users can inspect whether top-performing models were pretrained on Arabic-specific datasets or adapted via fine-tuning.
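The modular layout described above, where a team runs either one task suite or the full battery, can be sketched with a small task registry. This is not QIMMA's evaluation code: `TASKS`, `register`, `run_suite` and the two toy checks are hypothetical names invented for illustration, and a "model" here is just a function from prompt to reply.

```python
from typing import Callable, Dict, Iterable, Optional

Model = Callable[[str], str]       # hypothetical: maps a prompt to a reply
TaskFn = Callable[[Model], float]  # a task scores a model in [0, 1]

TASKS: Dict[str, TaskFn] = {}      # registry of available task suites


def register(name: str) -> Callable[[TaskFn], TaskFn]:
    """Decorator that adds a task function to the registry under `name`."""
    def decorator(fn: TaskFn) -> TaskFn:
        TASKS[name] = fn
        return fn
    return decorator


@register("instruction_following")
def instruction_following(model: Model) -> float:
    # Toy check: did the model reply with exactly one word, as asked?
    reply = model("Reply with exactly one word.")
    return 1.0 if len(reply.split()) == 1 else 0.0


@register("summarization")
def summarization(model: Model) -> float:
    # Toy check: is the summary non-empty and shorter than the source?
    source = "word " * 50
    reply = model("Summarize: " + source)
    return 1.0 if 0 < len(reply) < len(source) else 0.0


def run_suite(model: Model,
              task_names: Optional[Iterable[str]] = None) -> Dict[str, float]:
    """Run the named tasks, or every registered task when none are named."""
    names = list(TASKS) if task_names is None else list(task_names)
    unknown = [n for n in names if n not in TASKS]
    if unknown:
        raise KeyError(f"unknown tasks: {unknown}")
    return {name: TASKS[name](model) for name in names}


def toy_model(prompt: str) -> str:
    """Stand-in model that always gives the same short reply."""
    return "ok"


print(run_suite(toy_model, ["summarization"]))  # a single suite
print(run_suite(toy_model))                     # the full battery
```

A registry like this keeps each task independent, so adding a new suite means registering one more function rather than editing the runner.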

What the initial rankings show

The leaderboard highlights several consistent patterns rather than a single dominant model. Models that incorporate Arabic-native pretraining or extensive Arabic fine-tuning score higher on dialectal and nuance-heavy tasks. Larger, instruction-tuned models tend to do better on open-ended reasoning and multi-turn dialogue, but fine-tuned mid-size models outperform some larger models on targeted tasks like summarization and fact extraction.

Open-source models have narrowed the gap with proprietary systems on many utility tasks, though closed commercial offerings still lead on aggregate quality in a number of categories. The release also exposes gaps: few models consistently excel across all dialects, and contextual understanding for low-resource dialects remains a weak point. QIMMA includes per-task leaderboards so developers can identify models suited to specific needs rather than relying on a single overall ranking.
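To see why per-task leaderboards can point to a different model than the overall ranking, consider the minimal sketch below. The model names and scores are made up for illustration and do not come from QIMMA, and the helper names `best_per_task` and `best_overall` are hypothetical.

```python
from typing import Dict

# scores[model][task]: hypothetical numbers, not real leaderboard data.
Scores = Dict[str, Dict[str, float]]


def best_per_task(scores: Scores) -> Dict[str, str]:
    """Return the top-scoring model for each task (first seen wins ties)."""
    best: Dict[str, str] = {}
    for model, per_task in scores.items():
        for task, score in per_task.items():
            if task not in best or score > scores[best[task]][task]:
                best[task] = model
    return best


def best_overall(scores: Scores) -> str:
    """Return the model with the highest unweighted mean across its tasks."""
    return max(scores,
               key=lambda m: sum(scores[m].values()) / len(scores[m]))


example: Scores = {
    "model_a": {"summarization": 0.9, "dialogue": 0.6},
    "model_b": {"summarization": 0.7, "dialogue": 0.9},
}

print(best_per_task(example))
print(best_overall(example))
```

In this toy data `model_b` has the higher mean, yet `model_a` is the better pick for summarization, which is the situation the per-task leaderboards are meant to expose.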

QIMMA's documentation and evaluation artifacts include licensing notes and reproducibility instructions, addressing common barriers that have limited prior Arabic LLM comparisons. The dataset splits and scorer implementations are intended to reduce ambiguity in how metrics are computed.

Why it matters

QIMMA shifts evaluation emphasis from parameter counts to measured quality across Arabic variants, making it easier to match models to real-world use cases. By publishing evaluation code and per-task breakdowns, the leaderboard should reduce friction for developers choosing or improving Arabic LLMs and surface specific weaknesses that need dataset and modeling investment.

QIMMA leaderboard summary by model category
Category             Size          Coverage          Quality      Notes
Open-source small    <10B          Narrow            Lower–mid    Strong on single tasks, weaker on reasoning
Open-source large    10–70B        Broad             Mid–high     Better fluency and cross-task performance
Commercial closed    Proprietary   Broad             High         Top aggregate quality, limited transparency
Arabic-specialized   Varies        High on dialects  High         Excels on dialectal nuance and local corpora

Written by The Brieftide · Source: Hugging Face
