Elo-Disentangled Player-Style Embeddings for Chess: Maia-3
Jason Carlson introduces a rating-conditioned residual move model combining Maia-3 logits and Stockfish features; the base model cuts NLL by 27–37% relative to Maia-3.
TL;DR
- Jason Carlson introduces a rating-conditioned residual move model combining Maia-3 logits and Stockfish features; base model cuts NLL.
- Jason Carlson submitted a paper on 23 June 2026 that proposes an Elo-disentangled per-player embedding for human chess, built on a rating-conditioned residual move model.
- The model anchors a learned encoder and a per-player vector z on a frozen, rating-conditioned base move model so z explains only deviations from rating-typical play.
Jason Carlson submitted a paper on 23 June 2026 that proposes an Elo-disentangled per-player embedding for human chess, built on a rating-conditioned residual move model. The system uses a frozen rating-conditioned base (Maia-3 policy logits plus Stockfish-derived features, scored over Maia-2-proposed candidates) and a learned per-player vector z that explains deviations from rating-typical play; the evaluation uses a shared Elo-stratified benchmark of 22,620 held-out decisions.
How does the model work?
The model anchors a learned encoder and a per-player vector z on a frozen, rating-conditioned base move model so z explains only deviations from rating-typical play. The base model itself combines Maia-3 policy logits with Stockfish-derived features, and scores candidates proposed by Maia-2; a frozen copy of that base acts as the anchor while the residual encoder and z capture player-specific deviations.
The architecture is explicitly residual: the base predicts what a typical player of a given Elo would play, and the learned player embedding models the residual signal left over. The paper frames this as an alternative to per-player preference fine-tuning and emphasizes compactness and interpretability by separating typical strength-conditioned behavior from individual stylistic deviation.
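The residual idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bilinear correction `candidate_features @ W @ z`, the function names, and all dimensions are assumptions chosen for brevity, and random numbers stand in for Maia-3 logits and Stockfish-derived features.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of candidate logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def residual_move_log_probs(base_logits, candidate_features, z, W):
    """Hypothetical residual head: frozen base logits plus a player-specific term.

    base_logits:        (n_candidates,) scores from the frozen rating-conditioned base
    candidate_features: (n_candidates, d_feat) per-candidate features (assumed)
    z:                  (d_z,) per-player embedding
    W:                  (d_feat, d_z) learned projection (assumed bilinear form)
    """
    residual = candidate_features @ W @ z  # zero whenever z is zero
    return log_softmax(base_logits + residual)

# Random stand-ins; none of these values come from the paper.
rng = np.random.default_rng(0)
n_candidates, d_feat, d_z = 5, 8, 4
base_logits = rng.normal(size=n_candidates)
candidate_features = rng.normal(size=(n_candidates, d_feat))
W = rng.normal(size=(d_feat, d_z)) * 0.1

# A zero embedding reproduces rating-typical play exactly.
typical = residual_move_log_probs(base_logits, candidate_features, np.zeros(d_z), W)
# A non-zero embedding shifts probability mass between the same candidates.
styled = residual_move_log_probs(base_logits, candidate_features, rng.normal(size=d_z), W)
```

Under this assumed form, a player with z = 0 is indistinguishable from the base, which is one simple way to guarantee that z can only encode deviations from rating-typical behavior.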
What did the experiments show?
The rating-conditioned base substantially improves move prediction, while the player embedding contributes representational gains rather than raw prediction gains. On the Elo-stratified benchmark of 22,620 held-out decisions, top-1 move-matching rises from 0.51 for Maia-2 to 0.57 for Maia-3 to 0.68 for the Stockfish-augmented base. The paper reports that the base model lowers NLL by 27–37% relative to Maia-3 across the rating spectrum, with the largest gains at the top (2800+). It also isolates the lift from the engine features: Stockfish's marginal value grows monotonically with Elo, from negligible at 900–1200 to +0.085 nats at 2800+.
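For readers checking the arithmetic, relative figures like these follow from simple ratios. The sketch below uses the top-1 values quoted above; the helper names are illustrative, and the two NLL values are placeholders chosen only to show the formula, not numbers from the paper.

```python
def relative_gain(new, old):
    """Relative improvement of a higher-is-better metric such as top-1 accuracy."""
    return (new - old) / old

def relative_nll_reduction(nll_new, nll_old):
    """Relative reduction of a lower-is-better metric such as NLL."""
    return (nll_old - nll_new) / nll_old

# Top-1 move-matching as quoted in the brief.
top1 = {"Maia-2": 0.51, "Maia-3": 0.57, "base": 0.68}
gain_vs_maia2 = relative_gain(top1["base"], top1["Maia-2"])
gain_vs_maia3 = relative_gain(top1["base"], top1["Maia-3"])
print(f"{gain_vs_maia2:.0%}, {gain_vs_maia3:.0%}")  # prints "33%, 19%"

# Placeholder NLL values (NOT from the paper), only to illustrate the formula.
example_reduction = relative_nll_reduction(nll_new=1.4, nll_old=2.0)
```

With the placeholder values, `example_reduction` comes out at about 0.30, the same order as the reported 27–37% range.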
The player embedding z adds little to raw move-matching on top of this strong base: its marginal top-1 gain falls within the 95% confidence interval. Its value shows up in representation tests instead. z generalizes to held-out decisions without overfitting, it re-identifies players from disjoint games above chance, and a linear probe recovers rating from z with only R^2 = 0.06 (a nonlinear probe does no better). In its summary figures, the paper puts the base's top-1 at +33% relative to Maia-2 and +19% relative to Maia-3, and cites a roughly 30% lower NLL for the base in one comparison.
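A linear probe of this kind is straightforward to reproduce in outline. The sketch below is an assumption-laden illustration rather than the paper's protocol: it fits ordinary least squares on synthetic embeddings and scores held-out R^2, with the data, split, and names all invented for the example.

```python
import numpy as np

def linear_probe_r2(Z_train, y_train, Z_test, y_test):
    """Fit least squares with a bias term on train, return R^2 on held-out data."""
    A_train = np.hstack([Z_train, np.ones((len(Z_train), 1))])
    A_test = np.hstack([Z_test, np.ones((len(Z_test), 1))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    pred = A_test @ coef
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data only; nothing here is taken from the paper.
rng = np.random.default_rng(0)
n_players, d_z = 2000, 16
Z = rng.normal(size=(n_players, d_z))

# Case 1: ratings independent of z (what a disentangled embedding should look like).
rating_unrelated = rng.normal(1500, 300, size=n_players)
# Case 2: ratings leak through one coordinate of z (an entangled embedding).
rating_leaky = 1500 + 300 * Z[:, 0] + rng.normal(0, 30, size=n_players)

split = n_players // 2
r2_unrelated = linear_probe_r2(Z[:split], rating_unrelated[:split],
                               Z[split:], rating_unrelated[split:])
r2_leaky = linear_probe_r2(Z[:split], rating_leaky[:split],
                           Z[split:], rating_leaky[split:])
```

In this toy setup the leaky embedding yields an R^2 near 1, while the unrelated one stays near 0, which is the regime the reported R^2 = 0.06 sits in.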
Why it matters
Separating rating-conditioned typical play from per-player deviation gives a compact, interpretable representation of style. The paper makes the case that an Elo-conditioned base plus a compact player embedding can capture stylistic variation while keeping strength (Elo) disentangled: the linear probe returns almost no rating signal from z (R^2 = 0.06), which supports the claim that z captures an orthogonal, stylistic axis. That matters for applications that need to model or compare human style without conflating it with strength, and for systems that want to avoid storing or fine-tuning a full model per player.
What to watch
Look for follow-up evaluations showing whether the reported 27–37% relative NLL improvements and the 0.68 top-1 move-matching hold on larger or different held-out sets, and whether the Elo-disentangled z scales to larger pools of players or to other games where engine features play a role. Future comparisons between this residual, rating-conditioned approach and per-player fine-tuning will determine whether the compact embedding becomes the preferred practical method for modeling individual style.
| Item | Maia-2 | Maia-3 | Stockfish-augmented base |
|---|---|---|---|
| Top-1 move-matching (22,620 decisions) | 0.51 | 0.57 | 0.68 |
| Reported relative NLL vs Maia-3 | n/a | baseline | 27–37% lower (across Elo); ~30% lower (summary) |
| Reported relative top-1 gain vs Maia-2 / Maia-3 | baseline | n/a | +33% vs Maia-2; +19% vs Maia-3 |
Written by The Brieftide · Source: arXiv