AI Infrastructure · 5 min read

ScaleToT: Billion-Scale Low-Activity User Modeling with LLMs

ScaleToT refines user-state chains with entropy-guided Tree-of-Thought, trains a student with SFT plus OSIPO, and transfers what the student learns to a lightweight encoder.

The Brieftide

TL;DR

  • ScaleToT refines user-state chains with entropy-guided Tree-of-Thought, trains a student with SFT plus OSIPO, and transfers what the student learns to a lightweight encoder.
  • ScaleToT is a method for generalizing structured LLM reasoning to billions of low-activity users, submitted to arXiv on 23 Jun 2026.
  • The trained student’s reasoning representations are transferred to a lightweight profile encoder so the remaining users receive shared reasoning signals without direct LLM calls.

ScaleToT is a method for generalizing structured LLM reasoning to billions of low-activity users, submitted to arXiv on 23 Jun 2026. The system learns reasoning from a small LLM-processed subset, trains a student with supervised fine-tuning and OSIPO, and transfers representations to a lightweight encoder so most users avoid LLM inference.

How does ScaleToT work?

ScaleToT constructs typed user-state chains and refines them with a bounded, entropy-guided Tree-of-Thought (ToT) procedure, then uses the resulting teacher-curated chains to train a student model via supervised fine-tuning (SFT) and Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO). The trained student’s reasoning representations are transferred to a lightweight profile encoder so the remaining users receive shared reasoning signals without direct LLM calls.

The pipeline starts with a small LLM-processed subset, where the LLM infers latent user states from static profiles. ScaleToT builds typed, structured user-state chains, applies entropy-guided ToT to control how far each chain is refined, and converts the final chains into training data. These teacher-curated chains supervise the student through SFT and OSIPO, and the student’s learned representations are then transferred to a profile encoder for broad deployment.
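As a rough illustration of the entropy-guided control step described above, here is a minimal Python sketch. It is not the paper's procedure. It assumes each chain step is a probability distribution over candidate user states, and the names `shannon_entropy`, `refine_chain`, `expand`, `threshold_bits`, and `max_expansions` are all hypothetical. It also walks a single chain linearly instead of searching a tree, so it only shows the gating idea: spend a bounded refinement budget on steps whose entropy is high.

```python
import math
from typing import Callable, Dict, List

# Assumption: one chain step is a distribution over candidate user states.
Step = Dict[str, float]


def shannon_entropy(probs: Step) -> float:
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def refine_chain(
    chain: List[Step],
    expand: Callable[[Step], Step],
    threshold_bits: float = 1.0,
    max_expansions: int = 3,
) -> List[Step]:
    """Re-expand only uncertain steps, within a fixed total budget.

    `expand` is a hypothetical stand-in for whatever produces a refined
    step (for example, another teacher LLM call).
    """
    refined: List[Step] = []
    budget = max_expansions
    for step in chain:
        # Refine while the step is still uncertain and budget remains.
        while budget > 0 and shannon_entropy(step) > threshold_bits:
            step = expand(step)
            budget -= 1
        refined.append(step)
    return refined
```

Confident steps pass through untouched, so in this sketch the budget is spent only where the distribution over user states is still diffuse, and the total number of expansions never exceeds `max_expansions`.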

How was ScaleToT evaluated and what concrete results did it produce?

The authors evaluated ScaleToT on lifetime value (LTV) prediction in a billion-scale advertising deployment. In a randomized online A/B test, ScaleToT increased LT30 by 6.738%, while offline reasoning covered only 7.32% of the potential population. The paper frames the approach as a compute-saving alternative to applying LLM inference across the full population.

Evaluation focused on LTV prediction. The randomized online A/B test measured a 6.738% uplift in LT30. The offline reasoning stage, which produces the teacher-curated chains, covered 7.32% of the potential population, which implies a much smaller compute footprint than full-population reasoning.
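To make the coverage figure concrete, the arithmetic below applies the reported 7.32% to a hypothetical population of one billion users. The population size and the function name `split_population` are assumptions for illustration; the source gives only the percentage.

```python
from typing import Tuple


def split_population(population: int, llm_coverage: float) -> Tuple[int, int]:
    """Split users into those covered by offline LLM reasoning and the rest."""
    llm_users = round(population * llm_coverage)
    return llm_users, population - llm_users


# Hypothetical population size; 0.0732 is the reported offline coverage.
llm_users, remaining_users = split_population(1_000_000_000, 0.0732)
print(f"{llm_users:,} users covered by offline LLM reasoning")
print(f"{remaining_users:,} users without direct LLM reasoning")
```

Under that assumption, about 73.2 million users would be covered by offline LLM reasoning and about 926.8 million would not. This counts users only and says nothing about absolute compute cost, which the source does not report.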

Why it matters

ScaleToT addresses two linked problems: LLM reasoning becomes unreliable when user profiles are sparse, and running LLMs at population scale is prohibitively expensive. By extracting structured, typed chains from an LLM on a small subset and distilling that structured reasoning into a student and then into a lightweight encoder, the method keeps the reasoning signal while avoiding per-user LLM costs. For advertising systems that must score billions of low-activity users, that trade-off can preserve model expressivity and reduce compute where full LLM calls are infeasible.

The reported 6.738% LT30 uplift shows the approach can move a core business metric when deployed online. The 7.32% coverage figure for offline reasoning highlights the compute savings: only a small fraction of the population needed direct LLM-derived chains to bootstrap the student and encoder.

What to watch

Watch for external replication of the LT30 uplift and for published details on the absolute compute savings versus full-population LLM inference. Also watch whether the entropy-guided Tree-of-Thought refinement and OSIPO training generalize to prediction tasks beyond LTV in other large-scale production systems.

Paper and authors: ScaleToT, arXiv:2606.24605, submitted 23 Jun 2026, by Tianbao Ma, Chang Xi, Yichuan Zou, Chengen Li, Linxun Chen, Zilong Lu, Yanan Niu, Zhaojie Liu, Han Li, and Kun Gai.

Figure: ScaleToT data flow from LLM teacher to population encoder

Small LLM-processed subset
  → (constructs) Typed user-state chains
  → (refines) Entropy-guided ToT refinement
  → (produces) Teacher-curated chains
  → (used to train, SFT + OSIPO) Student model (SFT + OSIPO)
  → (transfers reasoning representations) Lightweight profile encoder
  → (applies encoder signals) Remaining users (no LLM inference)

Written by The Brieftide · Source: arXiv
