Benchmarks & Evals · 3 min read · via Hugging Face

Open ASR Leaderboard adds Benchmaxxer Repellant to results

Hugging Face has added Benchmaxxer Repellant to the Open ASR Leaderboard to flag and penalize models that appear to have been trained on private test data.

The Brieftide

TL;DR

  • 01 Hugging Face has added Benchmaxxer Repellant to the Open ASR Leaderboard to flag and penalize models that appear to have been trained on private test data.
  • 02 Hugging Face added Benchmaxxer Repellant to the Open ASR Leaderboard today, applying the new filter to the leaderboard's private-data category and to incoming ASR submissions.
  • 03 The change introduces an automated check that flags model results suspected of being influenced by private test sets or other forms of benchmark overfitting.

Hugging Face added Benchmaxxer Repellant to the Open ASR Leaderboard today, applying the new filter to the leaderboard's private-data category and to incoming ASR submissions. The change introduces an automated check that flags model results suspected of being influenced by private test sets or other forms of benchmark overfitting.

Benchmaxxer Repellant is being used as a protective layer for the public leaderboard. The tool runs during submission validation and can mark entries as flagged, adjust how scores are displayed, or restrict eligibility for ranking where the evidence of private-data influence meets configured thresholds. Hugging Face said the change is intended to preserve comparability across models and to discourage training practices that inflate benchmark numbers without delivering generalizable performance.
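
The threshold-based behavior described above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `Action` names, the numeric thresholds, the idea of a single evidence score between 0 and 1, and the ordering of actions by severity are hypothetical and are not taken from Hugging Face's implementation.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    FLAG = "flag"
    ADJUST_DISPLAY = "adjust_display"
    EXCLUDE_FROM_RANKING = "exclude_from_ranking"


# Hypothetical thresholds on an evidence score in [0, 1], strictest first.
# The real check's thresholds and scoring are not described in the source.
THRESHOLDS = [
    (0.9, Action.EXCLUDE_FROM_RANKING),
    (0.7, Action.ADJUST_DISPLAY),
    (0.5, Action.FLAG),
]


def decide_action(evidence: float) -> Action:
    """Return the strictest action whose threshold the evidence score meets."""
    for threshold, action in THRESHOLDS:
        if evidence >= threshold:
            return action
    return Action.NONE
```

In this sketch a score of 0.75 would change how the score is displayed without removing the entry from ranking, while a score below 0.5 leaves the submission untouched.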

What Benchmaxxer Repellant does

Benchmaxxer Repellant inspects submitted models and their evaluation traces for indicators consistent with training on private or leaked test data. The system looks for patterns such as unusually high performance on specific test slices, exact matches with known private transcripts, and statistical anomalies compared with baseline public-data models. When such patterns are detected, the leaderboard assigns a flag and publishes metadata explaining the reason alongside the submission entry.
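
As a rough illustration of two of the signals named above, the sketch below computes an exact-match rate against reference transcripts and a z-score comparing a submission's word error rate (WER) on one slice with a set of baseline models. The function names, the normalization rules, and the choice of a z-score are assumptions for this example, not the leaderboard's actual heuristics. A high exact-match rate alone is not proof of leakage, since strong models transcribe many easy utterances perfectly.

```python
import re
import statistics


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match_rate(hypotheses: list[str], references: list[str]) -> float:
    """Fraction of utterances whose output equals the reference after normalization."""
    if not references:
        return 0.0
    hits = sum(normalize(h) == normalize(r) for h, r in zip(hypotheses, references))
    return hits / len(references)


def slice_z_score(slice_wer: float, baseline_wers: list[float]) -> float:
    """Standard deviations by which the slice WER falls below the baseline mean.

    Positive values mean the submission beats the baselines on this slice.
    Returns 0.0 when there are too few baselines or they do not vary.
    """
    if len(baseline_wers) < 2:
        return 0.0
    mean = statistics.mean(baseline_wers)
    stdev = statistics.stdev(baseline_wers)
    if stdev == 0:
        return 0.0
    return (mean - slice_wer) / stdev
```

A check of this kind would treat a large positive z-score on one private slice, combined with ordinary results elsewhere, as a reason to look more closely rather than as a verdict.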

Flagged submissions remain visible on the Open ASR Leaderboard so researchers can see the evidence and decisions, but flagged entries are separated from the main public ranking and may not be eligible for placement in the primary leaderboard. The goal is to make the leaderboard both transparent and robust, so that reported top scores better reflect models evaluated only on appropriate public benchmarks.

How submissions and rankings change

For teams submitting to the leaderboard, the validation change adds a step where the model and its evaluation outputs are screened. If the check finds likely overlap with private evaluation data, the submission will show a flagged status and an explanation. Submissions without indicators of private-data influence continue to appear in the primary rankings with unmodified scores.
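
The split between primary and flagged entries could look something like the following sketch. The `Submission` fields and the `split_leaderboard` helper are hypothetical names chosen for the example. It assumes ranking by WER, where lower is better, and that any recorded flag reason moves an entry out of the primary list.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    model: str
    wer: float  # word error rate; lower is better
    flag_reasons: list[str] = field(default_factory=list)


def split_leaderboard(subs: list[Submission]) -> tuple[list[Submission], list[Submission]]:
    """Return (primary ranking sorted by WER, flagged entries in submission order)."""
    primary = sorted((s for s in subs if not s.flag_reasons), key=lambda s: s.wer)
    flagged = [s for s in subs if s.flag_reasons]
    return primary, flagged


subs = [
    Submission("model-a", 6.2),
    Submission("model-b", 3.1, ["exact matches with private transcripts"]),
    Submission("model-c", 5.4),
]
primary, flagged = split_leaderboard(subs)
print([s.model for s in primary])  # unflagged entries, best WER first
print([(s.model, s.flag_reasons) for s in flagged])
```

Flagged entries keep their scores and reasons in this sketch, which mirrors the article's point that they stay visible even when they are not ranked.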

Hugging Face tied the change explicitly to the private-data track of the Open ASR Leaderboard, which hosts systems trained with closed or proprietary corpora. In that track the Repellant check is intended to prevent cross-contamination between private training sets and private test sets, a recurring concern in open benchmarking. The company is also publishing notes on the evidence types that lead to flags so that model developers can better understand and contest decisions where appropriate.

The update includes both automated detection heuristics and a review pathway. Submissions that are flagged can be appealed, and maintainers expect to refine the check over time to reduce false positives. The initial rollout focuses on the most obvious failure modes for benchmaxxing, with plans to expand detection patterns as more submission data accumulates.

Why it matters

Adding Benchmaxxer Repellant changes the incentives around ASR leaderboard submissions, making it harder to claim top positions through accidental or intentional use of private test sets. The move affects teams using proprietary corpora, benchmark curators, and users who rely on leaderboard rankings to compare model quality. By surfacing flags and separating suspect results, the leaderboard aims to restore trust in published ASR scores and push contributors toward clearer training disclosures.

How leaderboard entries differ with Benchmaxxer Repellant
Item | Before | With Benchmaxxer Repellant
Submission validation | Basic format and score checks | Adds private-data overlap and anomaly checks
Flagging | Rare and manual | Automated flagging with published reasons
Visibility on primary ranking | All qualifying submissions listed | Flagged submissions separated from main ranking
Appeal process | Limited or informal | Defined appeal and review pathway
Intended effect | Open comparison but vulnerable to dataset leaks | Reduce benchmaxxing and improve comparability

Primary source

Hugging Face

huggingface.co