AI Infrastructure · 3 min read

Inference-time scaling for LLMs: categories and papers

A taxonomy of inference-time methods: chain-of-thought, scratchpads, tree search, reranking and interleaved computation.

The Brieftide

TL;DR

  • 01 A taxonomy of inference-time methods: chain-of-thought, scratchpads, tree search, reranking and interleaved computation.
  • 02 Researchers and practitioners have consolidated a set of inference-time scaling techniques aimed at improving reasoning from large language models.
  • 03 Chain-of-thought and few-shot prompting expand the model output at inference to include intermediate reasoning steps.

Researchers and practitioners have consolidated a set of inference-time scaling techniques aimed at improving reasoning from large language models. A recent wave of papers groups methods into a small number of categories: prompting and sampling strategies, structured search over solution spaces, iterative refinement and verification, reranking and ensemble scoring, and computation-level interventions such as early exit or speculative decoding.

The main categories

Chain-of-thought and few-shot prompting expand the model output at inference to include intermediate reasoning steps. The approach relies on prompting the model to produce a rationale or sequence of substeps, or on sampling multiple rationales and aggregating answers. Self-consistency style sampling, where many reasoning traces are drawn and majority or probabilistic voting picks the final answer, is a prominent example.
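The voting step in self-consistency sampling can be sketched in a few lines. This is a minimal illustration, not any paper's reference implementation: `sample_answer` is a hypothetical stand-in for one sampled model call that returns only the final answer, and `scripted_sampler` is a toy replacement that replays fixed outputs so the example runs without a model.

```python
from collections import Counter
from typing import Callable, Iterable


def self_consistency(
    sample_answer: Callable[[str], str],  # hypothetical: one sampled LLM call -> final answer
    question: str,
    n_samples: int = 5,
) -> str:
    """Draw several independent answers and return the most frequent one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(question) for _ in range(n_samples)]
    # Plain majority vote; ties resolve to the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


def scripted_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    """Toy stand-in for a model: replays a fixed list of final answers."""
    it = iter(outputs)
    return lambda question: next(it)


winner = self_consistency(
    scripted_sampler(["18", "17", "18", "18", "20"]), "toy question", n_samples=5
)
print(winner)  # prints 18
```

A probabilistic variant would weight each vote by a score for its trace instead of counting every sample equally.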

Scratchpads and stepwise decomposition turn the prompt into an explicit working memory. Methods such as least-to-most prompting and explicit scratchpad buffers break problems into subproblems and feed intermediate results back into the model. These strategies increase the amount of in-context computation and can reduce the reasoning load per step.
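The feed-back loop behind decomposition methods can be sketched as follows. Everything here is an assumption made for illustration: `decompose` and `answer` stand in for model calls, the `Q:`/`A:` scratchpad layout is invented, and the toy functions hard-code a single arithmetic problem.

```python
from typing import Callable, List


def solve_least_to_most(
    decompose: Callable[[str], List[str]],  # hypothetical: model call that lists subproblems
    answer: Callable[[str], str],           # hypothetical: model call that answers a prompt
    problem: str,
) -> str:
    """Solve subproblems in order, appending each result to a growing scratchpad."""
    scratchpad = f"Problem: {problem}\n"
    result = ""
    for sub in decompose(problem):
        # Each prompt contains every earlier subproblem together with its answer.
        result = answer(scratchpad + f"Q: {sub}\nA:")
        scratchpad += f"Q: {sub}\nA: {result}\n"
    return result


# Toy stand-ins so the sketch runs without a model.
def toy_decompose(problem: str) -> List[str]:
    return ["How many apples are in the bags?", "How many apples does Alice have in total?"]


def toy_answer(prompt: str) -> str:
    lines = prompt.splitlines()
    question = lines[-2]  # the last line is the empty "A:" slot
    earlier = [int(line[3:]) for line in lines if line.startswith("A: ")]
    if "bags" in question:
        return str(2 * 4)
    return str(3 + earlier[-1])  # reuses the intermediate result from the scratchpad


problem = "Alice has 3 apples and buys 2 bags of 4 apples each. How many apples does she have?"
print(solve_least_to_most(toy_decompose, toy_answer, problem))  # prints 11
```

The second toy answer is computed from the `A: 8` line already in the scratchpad, which is the property the paragraph describes: intermediate results are fed back as context.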

Search and tree-based methods treat the model as a node-expander in a search algorithm. Beam search, Monte Carlo tree search and the more recent tree-of-thoughts technique explore multiple candidate reasoning paths, backtracking or pruning branches that look unpromising. These approaches trade additional model invocations for the ability to discover rare but correct solution routes.
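To make the node-expander idea concrete, here is a small beam search over partial solutions. It is a generic sketch rather than the tree-of-thoughts algorithm itself: in an LLM setting `expand` would ask the model for candidate next steps and `score` would be a value estimate, whereas here both are hypothetical toy functions over integers.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # a partial solution: the sequence of values reached so far


def beam_search(
    expand: Callable[[State], List[State]],  # hypothetical: propose next steps
    score: Callable[[State], float],         # hypothetical: higher looks more promising
    is_goal: Callable[[State], bool],
    start: State,
    beam_width: int = 2,
    max_depth: int = 5,
) -> Optional[State]:
    """Keep only the beam_width best partial solutions at each depth."""
    frontier = [start]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            return None
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Prune: branches that score poorly are dropped and never expanded.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return None


# Toy problem: reach 11 from 1 using the steps "+3" and "*2".
TARGET = 11


def toy_expand(state: State) -> List[State]:
    last = state[-1]
    return [state + (last + 3,), state + (last * 2,)]


def toy_score(state: State) -> float:
    return -abs(TARGET - state[-1])


path = beam_search(toy_expand, toy_score, lambda s: s[-1] == TARGET, start=(1,))
print(path)  # prints (1, 4, 8, 11)
```

Each call to `expand` corresponds to extra model invocations, which is the cost the paragraph refers to. A wider beam explores more branches at a proportionally higher cost.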

Iterative refinement and verification add post-hoc checks and correction passes. A model or auxiliary verifier scores candidate answers, then the system either asks the generator to revise outputs or selects the highest-confidence candidate. Methods in this vein include targeted critique, chain-of-thought refinement and explicit verifier networks.
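A generate, verify, revise loop of this kind might look like the sketch below. The three callables are hypothetical stand-ins for model or verifier calls, and the threshold and round limit are arbitrary values chosen for illustration.

```python
from typing import Callable, Tuple


def refine(
    generate: Callable[[str], str],                    # hypothetical generator call
    verify: Callable[[str, str], Tuple[float, str]],   # hypothetical verifier -> (score, critique)
    revise: Callable[[str, str, str], str],            # hypothetical revision call
    task: str,
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> str:
    """Revise a draft until the verifier accepts it or the round budget runs out."""
    draft = generate(task)
    for _ in range(max_rounds):
        score, critique = verify(task, draft)
        if score >= threshold:
            break
        draft = revise(task, draft, critique)
    # If the budget runs out, the last revision is returned without a final check.
    return draft


# Toy stand-ins: the task is an addition and the first draft is wrong.
def toy_generate(task: str) -> str:
    return "6"


def toy_verify(task: str, draft: str) -> Tuple[float, str]:
    a, b = task.split("+")
    expected = int(a) + int(b)
    if int(draft) == expected:
        return 1.0, ""
    return 0.0, "too high" if int(draft) > expected else "too low"


def toy_revise(task: str, draft: str, critique: str) -> str:
    step = -1 if critique == "too high" else 1
    return str(int(draft) + step)


print(refine(toy_generate, toy_verify, toy_revise, "2+2"))  # prints 4
```

The loop is only as good as `verify`: a verifier that accepts wrong drafts ends the loop early, and one that rejects correct drafts burns the whole round budget.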

Reranking and ensemble scoring run multiple decoders or model checkpoints and use a separate scoring module to select the final answer. Rerankers may be smaller classifiers, separate LLM prompts, or learned discriminators. Reranking is useful where raw sampling produces many plausible but divergent outputs.
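Structurally, reranking is a sort over candidates by an external score. In the sketch below, `score` is a hypothetical scoring module, and the keyword-overlap scorer is a deliberately crude toy rather than a realistic reranker.

```python
from typing import Callable, List, Tuple


def rerank(
    prompt: str,
    candidates: List[str],
    score: Callable[[str, str], float],  # hypothetical: classifier, LLM prompt, or discriminator
) -> List[Tuple[float, str]]:
    """Return (score, candidate) pairs, best first; ties keep their input order."""
    scored = [(score(prompt, candidate), candidate) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(prompt: str, candidate: str) -> float:
    """Crude toy scorer: counts prompt words that reappear in the candidate."""
    prompt_words = set(prompt.lower().split())
    candidate_words = set(candidate.lower().replace(".", "").split())
    return float(len(prompt_words & candidate_words))


candidates = [
    "France is in Europe.",
    "I am not sure.",
    "Paris is the capital of France.",
]
ranked = rerank("capital of France", candidates, toy_score)
print(ranked[0][1])  # prints Paris is the capital of France.
```

The generator and the scorer are decoupled, so the scorer can be far cheaper than the model that produced the candidates.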

Computation-level interventions change how and when compute is used. Early-exit strategies let cheaper components attempt quick answers and escalate to a larger model only when confidence is low. Speculative decoding has a small draft model propose several tokens that the larger model then verifies in parallel, while caching reuses earlier computation across tokens or queries. These methods focus on practical throughput and cost trade-offs in addition to accuracy.
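An early-exit policy of the kind described above can be sketched as a two-model cascade. The `Model` signature, the 0.8 threshold and both toy models are assumptions made for illustration. Real systems have to derive a confidence value from somewhere, for example token probabilities or a separate estimator.

```python
from typing import Callable, Tuple

# Hypothetical model interface: returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(small: Model, large: Model, query: str, threshold: float = 0.8) -> Tuple[str, str]:
    """Answer with the cheap model when it is confident, otherwise escalate."""
    answer, confidence = small(query)
    if confidence >= threshold:
        return answer, "small"
    # Low confidence: pay for the expensive model on this query only.
    answer, _ = large(query)
    return answer, "large"


# Toy stand-ins: the small model is only confident on very short queries.
def toy_small(query: str) -> Tuple[str, float]:
    if len(query.split()) <= 3:
        return "quick answer", 0.95
    return "rough guess", 0.4


def toy_large(query: str) -> Tuple[str, float]:
    return "careful answer", 0.99


print(cascade(toy_small, toy_large, "2 + 2"))
# prints ('quick answer', 'small')
print(cascade(toy_small, toy_large, "plan a five city trip under a fixed budget"))
# prints ('careful answer', 'large')
```

The saving depends on how often the small model clears the threshold, and the accuracy depends on whether its confidence is trustworthy when it does.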

Recent papers and observable patterns

Recent publications emphasize combinations rather than single techniques. For example, tree search paired with self-consistency sampling improves robustness on complex planning tasks. Work on scratchpads often pairs decomposition with a verifier to catch drift. Several teams benchmark combinations across arithmetic, commonsense and planning tasks and report that modest increases in inference compute can produce outsized gains on multi-step reasoning benchmarks.

A recurring pattern is the tension between sample diversity and final-answer reliability. High diversity in chain-of-thought samples helps discover correct reasoning but increases variance and downstream verification cost. Search methods reduce variance but often raise latency and token usage. Reranking and lightweight verifiers provide a middle path, improving precision while keeping extra compute localized.

Many authors also highlight deployment trade-offs. Early-exit and speculative decoding lower average cost, but require calibrated confidence estimates. Reranking needs a scoring model that generalizes across task distributions. Practitioners balancing accuracy against latency and cost generally combine one or two inference-time scaling techniques rather than relying on any single method.

Why it matters

Inference-time scaling shifts the accuracy versus cost frontier for LLM reasoning by moving effort from pretraining to run-time. That makes it possible to improve performance on complex tasks without retraining massive models, but it also changes engineering priorities toward latency, calibration and verifier reliability. Operators, product teams and researchers will need to choose mixes of methods depending on whether their constraints prioritize correctness, speed or budget.

Comparison of inference-time scaling categories
| Category | How it works | Example methods | Trade-off |
| --- | --- | --- | --- |
| Chain-of-thought and sampling | Increase output trace length and sample multiple rationales | Chain-of-Thought prompting; self-consistency sampling | Better multi-step reasoning at cost of more tokens and higher variance |
| Scratchpads and decomposition | Break tasks into subproblems, feed intermediate results back | Least-to-Most; scratchpad prompting | Improves complex task structure but increases context and orchestration |
| Search and tree-based methods | Explore multiple reasoning branches with backtracking or pruning | Beam search; Tree of Thoughts | Finds rare correct paths, raises latency and token usage |
| Iterative refinement and verification | Use verifiers or correction passes to refine outputs | Chain-of-thought refinement; verifier models | Higher reliability, requires reliable verifier and extra calls |
| Reranking and ensembles | Score multiple candidates with a separate model or prompt | Reranker classifiers or scoring LLMs | Improves precision, adds inference-stage compute and complexity |
| Computation-level interventions | Early exit, speculative decoding, caching to reduce cost | Speculative decoding; early-exit policies | Lowers average cost but needs good confidence calibration |

Primary source

Ahead of AI

magazine.sebastianraschka.com
