Inference-time scaling for LLMs: categories and papers
A taxonomy of inference-time methods: chain-of-thought, scratchpads, tree search, reranking and interleaved computation.
TL;DR
- A taxonomy of inference-time methods: chain-of-thought, scratchpads, tree search, reranking and interleaved computation.
- Researchers and practitioners have consolidated a set of inference-time scaling techniques aimed at improving reasoning from large language models.
- Chain-of-thought and few-shot prompting expand the model output at inference to include intermediate reasoning steps.
Researchers and practitioners have consolidated a set of inference-time scaling techniques aimed at improving reasoning from large language models. A recent wave of papers groups methods into a small number of categories: prompting and sampling strategies, structured search over solution spaces, iterative refinement and verification, reranking and ensemble scoring, and computation-level interventions such as early exit or speculative decoding.
The main categories
Chain-of-thought and few-shot prompting expand the model output at inference to include intermediate reasoning steps. The approach relies on prompting the model to produce a rationale or sequence of substeps, or on sampling multiple rationales and aggregating answers. Self-consistency style sampling, where many reasoning traces are drawn and majority or probabilistic voting picks the final answer, is a prominent example.
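The voting step in self-consistency sampling can be sketched in a few lines. This is a minimal illustration rather than any paper's implementation: `sample_answer` stands in for a sampled LLM call that returns only the final answer, and `make_fake_sampler` is a hypothetical helper that replays canned answers so the example runs without a model.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(sample_answer: Callable[[str], str],
                     question: str,
                     n_samples: int = 5) -> str:
    """Draw n_samples final answers and return the most frequent one."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(canned: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampled LLM call: replays canned answers."""
    answers = iter(canned)
    return lambda question: next(answers)


sampler = make_fake_sampler(["18", "17", "18", "18", "20"])
result = self_consistency(sampler, "What is 3 * 6?", n_samples=5)
print(result)  # prints 18
```

Plain majority voting treats every trace equally; a probabilistic variant would weight each answer by a score for its trace instead of counting it once.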
Scratchpads and stepwise decomposition turn the prompt into an explicit working memory. Methods such as least-to-most prompting and explicit scratchpad buffers break problems into subproblems and feed intermediate results back into the model. These strategies increase the amount of in-context computation and can reduce the reasoning burden of each individual step.
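The feedback loop behind a scratchpad can be sketched as follows, assuming the subquestions are already given (least-to-most prompting would ask the model to produce them first). The names `solve_with_scratchpad`, `answer_step` and `fake_model` are illustrative assumptions, and the fake model only looks answers up in a table.

```python
from typing import Callable, List


def solve_with_scratchpad(question: str,
                          subquestions: List[str],
                          answer_step: Callable[[str, str], str]) -> List[str]:
    """Answer subquestions in order, feeding earlier results back as context."""
    scratchpad = [f"Problem: {question}"]
    for sub in subquestions:
        context = "\n".join(scratchpad)      # everything written so far
        answer = answer_step(context, sub)   # one model call per subproblem
        scratchpad.append(f"Q: {sub}\nA: {answer}")
    return scratchpad


# Hypothetical stand-in for an LLM call; records the context it was shown.
seen_contexts: List[str] = []


def fake_model(context: str, subquestion: str) -> str:
    seen_contexts.append(context)
    canned = {
        "How many apples does Ann start with?": "3",
        "How many after buying 2 more?": "5",
    }
    return canned[subquestion]


pad = solve_with_scratchpad(
    "Ann has 3 apples and buys 2 more. How many does she have?",
    ["How many apples does Ann start with?", "How many after buying 2 more?"],
    fake_model,
)
print("\n".join(pad))
```

The second call sees the first answer in its context, which is the point of the technique; the cost is a prompt that grows with every step.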
Search and tree-based methods treat the model as a node-expander in a search algorithm. Beam search, Monte Carlo tree search and the more recent tree-of-thoughts technique explore multiple candidate reasoning paths, backtracking or pruning branches that look unpromising. These approaches trade additional model invocations for the ability to discover rare but correct solution routes.
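A minimal sketch of the node-expander idea, using plain beam search rather than MCTS or tree-of-thoughts. In a real system `expand` would ask the model for candidate next steps and `score` would come from the model or a verifier; here both are toy functions (an assumption for illustration) that build a sequence of numbers summing to a target.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # a partial reasoning path, here just a tuple of steps


def beam_search(root: State,
                expand: Callable[[State], List[State]],
                score: Callable[[State], float],
                beam_width: int = 2,
                depth: int = 3) -> State:
    """Keep the beam_width best partial paths at each level; prune the rest."""
    beam = [root]
    for _ in range(depth):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)  # highest score first
        beam = candidates[:beam_width]            # pruning step
    return beam[0]


TARGET = 7  # toy goal: pick three steps from {1, 2, 3} that sum to 7


def toy_expand(state: State) -> List[State]:
    return [state + (step,) for step in (1, 2, 3)]


def toy_score(state: State) -> float:
    return -abs(TARGET - sum(state))  # closer to the target is better


best_path = beam_search((), toy_expand, toy_score)
print(best_path, sum(best_path))
```

Each level costs up to `beam_width * 3` expansions here, which is the extra model invocation budget the paragraph above refers to. Beam search only prunes; methods with backtracking would also revisit discarded branches.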
Iterative refinement and verification add post-hoc checks and correction passes. A model or auxiliary verifier scores candidate answers, then the system either asks the generator to revise outputs or selects the highest-confidence candidate. Methods in this vein include targeted critique, chain-of-thought refinement and explicit verifier networks.
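The verify-then-revise loop can be sketched like this. The function names and the toy arithmetic checker are assumptions for illustration: in practice `verify` would be a verifier model or critique prompt and `revise` would be another call to the generator.

```python
from typing import Callable, Tuple


def refine(draft: str,
           verify: Callable[[str], Tuple[bool, str]],
           revise: Callable[[str, str], str],
           max_rounds: int = 3) -> str:
    """Revise the draft until the verifier accepts it or rounds run out."""
    for _ in range(max_rounds):
        ok, critique = verify(draft)
        if ok:
            return draft
        draft = revise(draft, critique)
    return draft  # best effort; may still fail verification


def toy_verify(draft: str) -> Tuple[bool, str]:
    """Toy verifier: checks a claim of the form 'a * b = c'."""
    lhs, rhs = draft.split("=")
    a, b = (int(x) for x in lhs.split("*"))
    expected = a * b
    if int(rhs) == expected:
        return True, ""
    return False, f"expected {expected}"


def toy_revise(draft: str, critique: str) -> str:
    """Toy reviser: rewrites the right-hand side using the critique."""
    lhs = draft.split("=")[0].strip()
    return f"{lhs} = {critique.split()[-1]}"


final = refine("12 * 12 = 124", toy_verify, toy_revise)
print(final)  # prints 12 * 12 = 144
```

The `max_rounds` cap matters in practice: each round adds at least one verifier call and one generator call, and an unreliable verifier can otherwise keep the loop going without improving the answer.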
Reranking and ensemble scoring run multiple decoders or model checkpoints and use a separate scoring module to select the final answer. Rerankers may be smaller classifiers, separate LLM prompts, or learned discriminators. Reranking is useful where raw sampling produces many plausible but divergent outputs.
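A best-of-N reranking step might look like the sketch below. The scorer here is a deliberately crude word-overlap heuristic, chosen only so the example runs on its own; it stands in for the classifier, LLM prompt or learned discriminator mentioned above, and the names are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(question: str,
           candidates: List[str],
           score: Callable[[str, str], float]) -> List[Tuple[float, str]]:
    """Score every candidate and return (score, candidate) pairs, best first."""
    scored = [(score(question, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def overlap_score(question: str, candidate: str) -> float:
    """Toy scorer: fraction of question words that appear in the candidate."""
    q_words = set(question.lower().split())
    c_words = set(candidate.lower().split())
    return len(q_words & c_words) / len(q_words)


ranked = rerank(
    "capital of France",
    ["Lyon is a city", "The capital is Paris", "Paris is the capital of France"],
    overlap_score,
)
best_score, best_answer = ranked[0]
print(best_answer)  # prints Paris is the capital of France
```

Keeping the scorer separate from the generator means it can be swapped or retrained independently, which is where the generalization concern for rerankers comes from.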
Computation-level interventions change how and when compute is used. Early-exit strategies let cheaper components attempt quick answers and only escalate to a larger model when confidence is low. Speculative decoding has a small draft model propose several tokens that the larger model then verifies in one pass, and caching reuses earlier computation for repeated prefixes or queries. These methods focus on practical throughput and cost trade-offs in addition to accuracy.
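The early-exit idea can be sketched as a two-stage cascade. Everything here is an assumption for illustration: the `threshold` value, the `(answer, confidence)` return shape of the small model, and the toy models themselves.

```python
from typing import Callable, Tuple


def cascade(query: str,
            small_model: Callable[[str], Tuple[str, float]],
            large_model: Callable[[str], str],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    return large_model(query), "large"


def toy_small(query: str) -> Tuple[str, float]:
    """Hypothetical cheap model: confident only on queries it has memorized."""
    known = {"2 + 2": ("4", 0.95)}
    return known.get(query, ("unsure", 0.10))


def toy_large(query: str) -> str:
    """Hypothetical expensive model."""
    return f"large-model answer to: {query}"


easy = cascade("2 + 2", toy_small, toy_large)
hard = cascade("plan a three-city trip", toy_small, toy_large)
print(easy)
print(hard)
```

The routing is only as good as the confidence estimate: an overconfident small model returns wrong answers cheaply, and an underconfident one escalates everything and saves nothing.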
Recent papers and observable patterns
Recent publications emphasize combinations rather than single techniques. For example, tree search paired with self-consistency sampling improves robustness on complex planning tasks. Work on scratchpads often pairs decomposition with a verifier to catch drift. Several teams benchmark combinations across arithmetic, commonsense and planning tasks and report that modest increases in inference compute can produce outsized gains on multi-step reasoning benchmarks.
A recurring pattern is the tension between sample diversity and final-answer reliability. High diversity in chain-of-thought samples helps discover correct reasoning but increases variance and downstream verification cost. Search methods reduce variance but often raise latency and token usage. Reranking and lightweight verifiers provide a middle path, improving precision while keeping extra compute localized.
Many authors also highlight deployment trade-offs. Early-exit and speculative decoding lower average cost, but require calibrated confidence estimates. Reranking needs a scoring model that generalizes across task distributions. Practitioners balancing accuracy against latency and cost generally combine one or two inference-time scaling techniques rather than relying on any single method.
Why it matters
Inference-time scaling shifts the accuracy versus cost frontier for LLM reasoning by moving effort from pretraining to run-time. That makes it possible to improve performance on complex tasks without retraining massive models, but it also changes engineering priorities toward latency, calibration and verifier reliability. Operators, product teams and researchers will need to choose mixes of methods depending on whether their constraints prioritize correctness, speed or budget.
| Category | Mechanism | Example methods | Trade-offs |
|---|---|---|---|
| Chain-of-thought and sampling | Increase output trace length and sample multiple rationales | Chain-of-Thought prompting; self-consistency sampling | Better multi-step reasoning at cost of more tokens and higher variance |
| Scratchpads and decomposition | Break tasks into subproblems, feed intermediate results back | Least-to-Most; scratchpad prompting | Improves complex task structure but increases context and orchestration |
| Search and tree-based methods | Explore multiple reasoning branches with backtracking or pruning | Beam search; Tree of Thoughts | Finds rare correct paths, raises latency and token usage |
| Iterative refinement and verification | Use verifiers or correction passes to refine outputs | Chain-of-thought refinement; verifier models | Higher reliability, requires reliable verifier and extra calls |
| Reranking and ensembles | Score multiple candidates with a separate model or prompt | Reranker classifiers or scoring LLMs | Improves precision, adds inference-stage compute and complexity |
| Computation-level interventions | Early exit, speculative decoding, caching to reduce cost | Speculative decoding; early-exit policies | Lowers average cost but needs good confidence calibration |
Primary source
Ahead of AI
magazine.sebastianraschka.com