AI Infrastructure · January 24, 2026 · 3 min read

Inference-time scaling for LLMs: categories and papers

A taxonomy of inference-time methods: chain-of-thought, scratchpads, tree search, reranking and interleaved computation.

The Brieftide · January 24, 2026

Researchers and practitioners have consolidated a set of inference-time scaling techniques aimed at improving reasoning from large language models. A recent wave of papers groups methods into a small number of categories: prompting and sampling strategies, structured search over solution spaces, iterative refinement and verification, reranking and ensemble scoring, and computation-level interventions such as early exit or speculative decoding.

The main categories

Chain-of-thought and few-shot prompting expand the model output at inference to include intermediate reasoning steps. The approach relies on prompting the model to produce a rationale or sequence of substeps, or on sampling multiple rationales and aggregating answers. Self-consistency style sampling, where many reasoning traces are drawn and majority or probabilistic voting picks the final answer, is a prominent example.
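The voting step in self-consistency sampling can be sketched in a few lines. This is a minimal illustration, not any paper's reference implementation: `make_fake_sampler` is a hypothetical stand-in for a sampled LLM call that returns only the final answer of each reasoning trace, and plain majority voting is assumed.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Draw n_samples answers and return the majority answer with its vote share."""
    answers: List[str] = [sample_answer(question) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_samples


def make_fake_sampler(canned: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical sampler: replays canned final answers instead of calling a model."""
    it = iter(canned)

    def sample(question: str) -> str:
        return next(it)

    return sample


sampler = make_fake_sampler(["42", "41", "42", "42", "40"])
answer, share = self_consistency(sampler, "What is 6 * 7?", n_samples=5)
print(answer, share)
```

The vote share doubles as a rough confidence signal: a low share suggests the sampled traces disagree and the answer may deserve extra verification.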

Scratchpads and stepwise decomposition turn the prompt into an explicit working memory. Methods such as least-to-most prompting and explicit scratchpad buffers break problems into subproblems and feed intermediate results back into the model. These strategies increase the amount of in-context computation and can reduce the reasoning load per step.
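A rough sketch of the feed-back loop described above, assuming the problem has already been split into ordered subproblems. The `toy_step` function is a hypothetical stand-in for a model call; a real system would pass the scratchpad text back into the prompt.

```python
from typing import Callable, List


def solve_with_scratchpad(
    subproblems: List[str],
    solve_step: Callable[[str, List[str]], str],
) -> List[str]:
    """Solve subproblems in order, appending each result to a shared scratchpad."""
    scratchpad: List[str] = []
    for sub in subproblems:
        # Each step sees everything written so far, like a growing prompt.
        result = solve_step(sub, scratchpad)
        scratchpad.append(f"{sub} -> {result}")
    return scratchpad


def toy_step(sub: str, scratchpad: List[str]) -> str:
    """Hypothetical model call: applies one arithmetic op to the previous result."""
    op, raw = sub.split()
    value = int(raw)
    # Read the most recent intermediate result back from the scratchpad.
    prev = int(scratchpad[-1].split(" -> ")[1]) if scratchpad else 0
    if op == "start":
        return str(value)
    if op == "times":
        return str(prev * value)
    if op == "plus":
        return str(prev + value)
    raise ValueError(f"unknown op: {op}")


pad = solve_with_scratchpad(["start 3", "times 4", "plus 5"], toy_step)
print(pad)
```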

Search and tree-based methods treat the model as a node-expander in a search algorithm. Beam search, Monte Carlo tree search and the more recent tree-of-thoughts technique explore multiple candidate reasoning paths, backtracking or pruning branches that look unpromising. These approaches trade additional model invocations for the ability to discover rare but correct solution routes.
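The node-expander framing can be illustrated with a small beam search. This is a toy sketch under stated assumptions: states are integers rather than partial reasoning traces, and `expand`, `score` and `is_goal` are hypothetical stand-ins for a model proposing next steps, a value estimate, and an answer check.

```python
from typing import Callable, List, Optional, Tuple


def beam_search(
    start: int,
    expand: Callable[[int], List[int]],
    score: Callable[[int], float],
    is_goal: Callable[[int], bool],
    beam_width: int = 2,
    max_depth: int = 5,
) -> Optional[List[int]]:
    """Keep the beam_width best partial paths at each depth; return a goal path or None."""
    beam: List[Tuple[int, List[int]]] = [(start, [start])]
    for _ in range(max_depth):
        candidates: List[Tuple[int, List[int]]] = []
        for state, path in beam:
            if is_goal(state):
                return path
            for child in expand(state):
                candidates.append((child, path + [child]))
        if not candidates:
            break
        # Prune: keep only the highest-scoring branches.
        candidates.sort(key=lambda c: score(c[0]), reverse=True)
        beam = candidates[:beam_width]
    for state, path in beam:
        if is_goal(state):
            return path
    return None


TARGET = 12
path = beam_search(
    start=1,
    expand=lambda s: [s + 1, s * 3],      # two candidate "next steps"
    score=lambda s: -abs(TARGET - s),     # closer to the target scores higher
    is_goal=lambda s: s == TARGET,
)
print(path)
```

In this toy run the search does reach the target, but pruning discards the shorter route 1, 3, 4, 12 early because 4 scores worse than its siblings. That is the kind of trade-off between breadth and cost that wider beams or backtracking are meant to address.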

Iterative refinement and verification add post-hoc checks and correction passes. A model or auxiliary verifier scores candidate answers, then the system either asks the generator to revise outputs or selects the highest-confidence candidate. Methods in this vein include targeted critique, chain-of-thought refinement and explicit verifier networks.
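The generate, verify, revise cycle might look like the loop below. It is only a sketch: `toy_generate` and `toy_verify` are hypothetical stand-ins for a generator model and a verifier, and the feedback format (last candidate plus a hint) is an assumption made for the example.

```python
from typing import Any, Callable, List, Optional, Tuple

Feedback = Optional[Tuple[int, str]]


def refine(
    generate: Callable[[Any, Feedback], int],
    verify: Callable[[Any, int], Tuple[bool, Feedback]],
    task: Any,
    max_rounds: int = 5,
) -> Tuple[Optional[int], List[Tuple[int, bool]]]:
    """Ask for a candidate, check it, and feed the critique back until accepted."""
    feedback: Feedback = None
    history: List[Tuple[int, bool]] = []
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        ok, feedback = verify(task, candidate)
        history.append((candidate, ok))
        if ok:
            return candidate, history
    return None, history  # budget exhausted without an accepted answer


def toy_generate(target_square: int, feedback: Feedback) -> int:
    """Hypothetical generator: starts at 5, then nudges its guess using the critique."""
    if feedback is None:
        return 5
    last, hint = feedback
    return last + 1 if hint == "too low" else last - 1


def toy_verify(target_square: int, candidate: int) -> Tuple[bool, Feedback]:
    """Hypothetical verifier: checks whether candidate squared equals the target."""
    squared = candidate * candidate
    if squared == target_square:
        return True, None
    return False, (candidate, "too low" if squared < target_square else "too high")


result, history = refine(toy_generate, toy_verify, task=49)
print(result, history)
```

The `max_rounds` cap matters in practice: each round is an extra model call, and an unreliable verifier can otherwise keep the loop running without improving the answer.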

Reranking and ensemble scoring run multiple decoders or model checkpoints and use a separate scoring module to select the final answer. Rerankers may be smaller classifiers, separate LLM prompts, or learned discriminators. Reranking is useful where raw sampling produces many plausible but divergent outputs.
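As a minimal sketch of the selection step, the snippet below scores already-generated candidates with a separate function and sorts them. The keyword-overlap scorer is a deliberately crude, hypothetical placeholder for a learned reranker or a scoring prompt.

```python
from typing import Callable, List, Set


def rerank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


def make_overlap_scorer(query_terms: Set[str]) -> Callable[[str], float]:
    """Hypothetical scorer: counts how many query terms appear in a candidate."""

    def score(text: str) -> float:
        words = {w.strip(".,").lower() for w in text.split()}
        return float(len(words & query_terms))

    return score


candidates = [
    "Caching stores earlier results.",
    "Speculative decoding uses a draft model.",
    "Decoding picks the next token.",
]
scorer = make_overlap_scorer({"speculative", "decoding", "draft"})
ranked = rerank(candidates, scorer)
print(ranked[0])
```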

Computation-level interventions change how and when compute is used. Early-exit strategies let cheaper components attempt quick answers and only escalate to a larger model when confidence is low. Speculative decoding has a small draft model propose several tokens that the larger model then verifies in a single pass, and caching reuses computation from earlier requests instead of repeating it. These methods focus on practical throughput and cost trade-offs in addition to accuracy.
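The escalation logic behind early exit can be sketched as a two-stage cascade. Everything model-related here is assumed for illustration: `toy_small` and `toy_large` are hypothetical stubs, and the 0.8 threshold is an arbitrary value that a real deployment would calibrate on held-out data.

```python
from typing import Callable, Tuple


def cascade(
    query: str,
    small_model: Callable[[str], Tuple[str, float]],
    large_model: Callable[[str], str],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Answer with the small model when it is confident, otherwise escalate."""
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    return large_model(query), "large"


def toy_small(query: str) -> Tuple[str, float]:
    """Hypothetical cheap model: pretends to be confident only on short queries."""
    confidence = 0.95 if len(query.split()) <= 4 else 0.4
    return f"small:{query}", confidence


def toy_large(query: str) -> str:
    """Hypothetical expensive model used as the fallback."""
    return f"large:{query}"


queries = ["Capital of France?", "Plan a three-city trip under a fixed budget"]
routes = [cascade(q, toy_small, toy_large)[1] for q in queries]
print(routes)
```

The saving depends entirely on how trustworthy the confidence value is: an overconfident small model exits early on queries it gets wrong, which is why calibration comes up alongside these methods.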

Recent papers and observable patterns

Recent publications emphasize combinations rather than single techniques. For example, tree search paired with self-consistency sampling improves robustness on complex planning tasks. Work on scratchpads often pairs decomposition with a verifier to catch drift. Several teams benchmark combinations across arithmetic, commonsense and planning tasks and report that modest increases in inference compute can produce outsized gains on multi-step reasoning benchmarks.

A recurring pattern is the tension between sample diversity and final-answer reliability. High diversity in chain-of-thought samples helps discover correct reasoning but increases variance and downstream verification cost. Search methods reduce variance but often raise latency and token usage. Reranking and lightweight verifiers provide a middle path, improving precision while keeping extra compute localized.

Many authors also highlight deployment trade-offs. Early-exit and speculative decoding lower average cost, but require calibrated confidence estimates. Reranking needs a scoring model that generalizes across task distributions. Practitioners balancing accuracy against latency and cost generally combine one or two inference-time scaling techniques rather than relying on any single method.

Why it matters

Inference-time scaling shifts the accuracy versus cost frontier for LLM reasoning by moving effort from pretraining to run-time. That makes it possible to improve performance on complex tasks without retraining massive models, but it also changes engineering priorities toward latency, calibration and verifier reliability. Operators, product teams and researchers will need to choose mixes of methods depending on whether their constraints prioritize correctness, speed or budget.

Comparison of inference-time scaling categories

Category	Mechanism	Examples	Trade-offs
Chain-of-thought and sampling	Increase output trace length and sample multiple rationales	Chain-of-Thought prompting; self-consistency sampling	Better multi-step reasoning at cost of more tokens and higher variance
Scratchpads and decomposition	Break tasks into subproblems, feed intermediate results back	Least-to-Most; scratchpad prompting	Improves complex task structure but increases context and orchestration
Search and tree-based methods	Explore multiple reasoning branches with backtracking or pruning	Beam search; Tree of Thoughts	Finds rare correct paths, raises latency and token usage
Iterative refinement and verification	Use verifiers or correction passes to refine outputs	Chain-of-thought refinement; verifier models	Higher reliability, requires reliable verifier and extra calls
Reranking and ensembles	Score multiple candidates with a separate model or prompt	Reranker classifiers or scoring LLMs	Improves precision, adds inference-stage compute and complexity
Computation-level interventions	Early exit, speculative decoding, caching to reduce cost	Speculative decoding; early-exit policies	Lowers average cost but needs good confidence calibration

Primary source

Ahead of AI

magazine.sebastianraschka.com
