Quranic ASR: Wav2Vec2-XLSR-53 hits WER 0.08, beats Citrinet
Transformers fine-tuned on 870+ hours of Quran recitations achieve WER 0.08 on EveryAyah and cut combined training time from 140 hours to 40 hours.
TL;DR
- Transformers fine-tuned on 870+ hours of Quran recitations achieve WER 0.08 on EveryAyah and cut combined training time from 140 hours to 40 hours.
- The paper runs 30 pages, includes 9 figures and 5 tables, and was submitted to the International Journal of Speech Technology.
- The best-performing configuration achieved a word error rate of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel set, an improvement over a Citrinet baseline with WER = 0.163.
Researchers Nabil Mosharraf Hossain, Riasat Islam, Unaizah Obaidellah and coauthors submitted a paper on 18 Jun 2026 that compares pretrained Transformer models for Quranic automatic speech recognition. The study fine-tuned Wav2Vec2.0, HuBERT and XLS-R on a filtered Quranic dataset exceeding 870 hours and reports a best configuration with WER 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting.
The paper runs 30 pages, includes 9 figures and 5 tables, and was submitted to the International Journal of Speech Technology.
What did the study do and find?
The study fine-tuned self-supervised Transformer speech models on a filtered Quranic corpus of more than 870 hours and measured transcription accuracy across feature extractors, label formats and dataset composition. The best-performing configuration achieved a word error rate of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel set, an improvement over a Citrinet baseline with WER = 0.163.
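For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` is illustrative, and the paper's exact scoring script and text normalisation are not described in this brief.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER of 0.08 corresponds to about 8 word-level errors per 100 reference words.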
Authors evaluated three advanced speech representation extractors: Wav2Vec2.0, HuBERT and XLS-R, and applied ablation studies across output labels, training strategies and clip durations to isolate what mattered for Quranic recitation. They found that Wav2Vec2-XLSR-53 provided the strongest overall representation for this task.
How did model choice, labels and data composition affect accuracy?
Wav2Vec2-XLSR-53 emerged as the top representation, and Arabic text without diacritics yielded the best fine-tuning results for Quranic ASR. The paper shows that choices in speech feature extractor, output label format and dataset filtering materially change WER and training cost. Fine-grained ablations identified these three factors as key drivers of transcription accuracy in this domain.
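To make the "without diacritics" label format concrete, here is one simple way to strip Arabic diacritics from a transcript before using it as a training label. This is only a sketch under an assumption: it drops every Unicode nonspacing mark (category `Mn`), which covers the common harakat such as fatha, kasra and sukun. The brief does not say how the authors normalised their labels, and the helper name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Unicode nonspacing marks (category Mn), leaving base letters."""
    # No NFD normalisation is applied first, so precomposed letters such as
    # alef with madda are kept as they are.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "bismi" written with kasra (U+0650) and sukun (U+0652) marks.
diacritised = "\u0628\u0650\u0633\u0652\u0645\u0650"
print(strip_diacritics(diacritised))
```

Quranic script also uses annotation signs that are not nonspacing marks, so a real preprocessing pipeline would likely need additional rules beyond this filter.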
The authors also measured training efficiency: combined-model training time fell from 140 hours in the baseline scenario to 40 hours for their best configuration. That runtime reduction came alongside a WER reduction of roughly five percentage points relative to the Citrinet baseline (0.163 versus 0.11), so the best configuration improved on both accuracy and compute.
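The arithmetic behind those comparisons can be checked directly from the reported figures. The snippet below assumes the Citrinet WER of 0.163 is compared against the 0.11 result on the combined EveryAyah+Tarteel setting, which is how the brief's figures line up; the variable names are illustrative.

```python
baseline_wer, best_wer = 0.163, 0.11   # Citrinet vs best configuration (combined set, assumed)
baseline_hours, best_hours = 140, 40   # combined-model training time in hours

absolute_wer_drop = baseline_wer - best_wer
relative_wer_drop = absolute_wer_drop / baseline_wer
time_saved_fraction = (baseline_hours - best_hours) / baseline_hours
speedup = baseline_hours / best_hours

print(f"absolute WER reduction: {absolute_wer_drop:.3f}")   # 0.053
print(f"relative WER reduction: {relative_wer_drop:.1%}")   # 32.5%
print(f"training time saved:    {time_saved_fraction:.1%}") # 71.4%
print(f"training speedup:       {speedup:.1f}x")            # 3.5x
```

In other words, the reported configuration makes about a third fewer word errors than the baseline while training in under a third of the time.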
Why it matters
Lower WER on Quranic recitation makes search, assisted memorisation and recitation analysis more reliable for users and tools that operate over the full Quranic corpus. Cutting combined-model training time from 140 hours to 40 hours reduces the computational cost of producing higher-accuracy models, which matters for teams with constrained compute budgets. The paper also highlights that simple choices such as using undiacritised Arabic text can outperform more complex label schemes for this specific task.
What limitations and next steps did the authors identify?
The authors note remaining gaps in dataset quality and propose two directions for future work: improving the dataset, and developing phoneme-aware models that capture Tajweed-sensitive pronunciations. Both suggest that the current gains still leave room for domain-specific advances.
What to watch
Watch for the journal submission outcome at the International Journal of Speech Technology and for follow-up work implementing phoneme-aware models and cleaned datasets. Those milestones would test whether the reported WER gains generalise to Tajweed-sensitive and user-recited verses beyond the evaluated EveryAyah and Tarteel splits.
The paper includes detailed ablations across feature extractors, label formats and clip durations, and the authors provide code and data links alongside the arXiv submission for reproducibility.
| Item | WER (EveryAyah) | WER (EveryAyah+Tarteel) | Combined training time (hours) |
|---|---|---|---|
| Best-performing configuration (Wav2Vec2-XLSR-53) | 0.08 | 0.11 | 40 |
| Citrinet baseline | N/A | 0.163 | 140 |
Written by The Brieftide · Source: arXiv