Incumbent Advantage: Brand Bias in LLM Recommendation Systems
Three experiments on GPT-4o-mini, Claude Sonnet and Gemini 3 Flash show incumbent brands can capture all recommendations when competing products are otherwise identical.
Xi Chu and Yupeng Hou find that well-known brands receive 100% of recommendations from three commercial LLMs when competing products share identical specifications. Submitted to arXiv on 16 Jun 2026, the paper runs three experiments on skincare products and tests GPT-4o-mini, Claude Sonnet and Gemini 3 Flash.
What did the experiments measure and find?
The paper measures brand dynamics in LLM recommendations with three experiments. When all products have identical specifications, well-known incumbent brands were recommended 100% of the time, a result the authors label a Conditional Monopoly (IAI = 10.0). The monopoly is fragile: a competitor rating advantage of less than +0.1 stars is enough to break it, which shows that small quality signals can overturn incumbent dominance. The authors also ran a robustness check on search goods to validate the results beyond experience goods such as skincare.
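As a rough illustration of what a 100% recommendation share means in practice, the sketch below tallies how often each brand is picked across repeated trials. The brand name, the `picks` list and the `recommendation_share` helper are hypothetical and are not taken from the paper; this is not the authors' measurement code, and it does not compute the IAI.

```python
from collections import Counter


def recommendation_share(picks):
    """Return each brand's share of recommendations as a fraction of all trials."""
    if not picks:
        raise ValueError("picks must contain at least one trial")
    counts = Counter(picks)
    total = len(picks)
    return {brand: count / total for brand, count in counts.items()}


# Hypothetical outcomes: the brand an LLM recommended in each of 20 runs.
picks = ["IncumbentCo"] * 20
shares = recommendation_share(picks)
print(shares)
```

A share of 1.0 for a single brand, with every other brand absent from the result, corresponds to the capture-all outcome described above.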
The study lists the experimental scope explicitly: skincare products as the primary category, three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), and three experiments that probe brand bias, authority-style messaging, and multi-brand competition.
How did marketing language and manipulations affect recommendations?
Authority-style marketing language, including fabricated clinical-evidence claims, weakened the incumbent monopoly: the paper reports that such language breaks it at a Bias Surplus Value of +0.17 rating points. The authors note that each model responded differently to authority-style claims, but the shared effect was that marketing language could substitute for a small rating advantage.
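One way to picture marketing language substituting for a rating advantage is as a shift in the smallest rating lead at which the recommendation flips to the challenger. A minimal sketch, assuming a toy decision rule in which the incumbent has a fixed head start and persuasive language adds to the challenger's side; `toy_recommender`, `flip_threshold`, `incumbent_bonus` and `language_boost` are hypothetical names, the numbers are illustrative only, and none of this reproduces the authors' procedure or their definition of Bias Surplus Value.

```python
def toy_recommender(gap, incumbent_bonus, language_boost=0):
    """Toy rule, all values in hundredths of a star.

    The challenger wins only if its rating lead plus any language boost
    exceeds the incumbent's head start.
    """
    return "challenger" if gap + language_boost > incumbent_bonus else "incumbent"


def flip_threshold(recommender, max_gap=100):
    """Sweep the challenger's lead in 0.01-star steps; return the first lead (in stars) that flips the pick."""
    for gap in range(max_gap + 1):
        if recommender(gap) == "challenger":
            return gap / 100
    return None  # no flip within the swept range


# Illustrative values, not figures from the paper.
plain = flip_threshold(lambda g: toy_recommender(g, incumbent_bonus=5))
boosted = flip_threshold(lambda g: toy_recommender(g, incumbent_bonus=5, language_boost=5))
print(plain, boosted)
```

In this toy setup the boosted challenger needs a smaller rating lead than the plain one, which is the substitution effect in miniature. A real study would replace `toy_recommender` with repeated calls to an actual model.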
The authors frame these tactics as part of what they call generative engine optimization (GEO). In the multi-brand GEO experiments, when every brand adopted the same optimization strategy, the individual payoff proxy dropped sharply from +0.802 to +0.007, and non-participating brands received zero recommendations. These figures illustrate both the tactical benefit of manipulation for a single brand and the collective cost when adoption is universal.
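To put the payoff collapse in proportion, the short calculation below expresses the fall from +0.802 to +0.007 as a relative decline. The `relative_drop` helper and variable names are illustrative; only the two payoff figures come from the article.

```python
def relative_drop(before, after):
    """Fractional decline from `before` to `after`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before


payoff_before = 0.802  # individual payoff proxy before universal adoption, as reported
payoff_after = 0.007   # individual payoff proxy when every brand adopts, as reported

drop = relative_drop(payoff_before, payoff_after)
print(f"{drop:.1%}")  # prints 99.1%
```

Universal adoption therefore removes roughly 99% of the individual payoff proxy, which is why the result reads as a social dilemma.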
Why it matters
The findings show that LLM-driven product discovery is not neutral: brand reputation and small, manipulable signals can determine who appears at the top of recommendations. The paper argues GEO should be treated not only as a security risk but also as an emerging marketing practice that shapes market competition. The numeric thresholds the authors report, such as IAI = 10.0, the +0.1-star quality margin and the +0.17 Bias Surplus Value, give concrete targets for marketers and regulators. The social-dilemma result, where universal GEO adoption collapses individual payoff, signals an incentive problem that could push markets toward arms-race dynamics.
What to watch
Track whether LLM providers change ranking or safety policies around clinical-evidence claims and authority-style language, and whether follow-up studies reproduce the Conditional Monopoly across other product categories. Also watch for empirical work that breaks down model-specific responses, since the authors state each model responded differently but provide no per-model breakdown in the abstract.
Additional details: the submission was uploaded to arXiv on 16 Jun 2026; the paper is 16 pages long and includes 4 figures and 11 tables. The authors frame their central concern as how generative engine optimization interacts with recommendation dynamics and competitive incentives.
Written by The Brieftide · Source: arXiv