Categories of Inference-Time Scaling for Improved LLM Reasoning
And an Overview of Recent Inference-Scaling Papers (Including Recursive Language Models)
Inference scaling has become one of the most effective ways to improve answer quality and accuracy in deployed LLMs.
The idea is straightforward. If we are willing to spend a bit more compute, and more time at inference time (when we use the model to generate text), we can get the model to produce better answers.
Every major LLM provider relies on some flavor of inference-time scaling today. And the academic literature around these methods has grown a lot, too.
Back in March, I wrote an overview of the inference scaling landscape and summarized some of the early techniques.
In this article, I want to take that earlier discussion a step further, group the different approaches into clearer categories, and highlight the newest work that has appeared over the past few months.
As part of drafting a full book chapter on inference scaling for Build a Reasoning Model (From Scratch), I ended up experimenting with many of the fundamental flavors of these methods myself. With hyperparameter tuning, this quickly turned into thousands of runs and a lot of thought and work to figure out which approaches should be covered in more detail in the chapter itself. (The chapter grew so much that I eventually split it into two, and both are now available in the early access program.)
PS: I am especially happy with how the chapter(s) turned out. It takes the base model from about 15 percent to around 52 percent accuracy, which makes it one of the most rewarding pieces of the book so far.
What follows here is a collection of ideas, notes, and papers that did not quite fit into the final chapter narrative but are still worth sharing.
I also plan to add more code implementations to the bonus materials on GitHub over time.
Table of Contents (Overview)
Inference-Time Scaling Overview
Chain-of-Thought Prompting
Self-Consistency
Best-of-N Ranking
Rejection Sampling with a Verifier
Self-Refinement
Search Over Solution Paths
Conclusions, Categories, and Combinations
Bonus: What Do Proprietary LLMs Use?
You can use the left-hand navigation bar in the article’s web view to jump directly to any section.
1. Inference-Time Scaling Overview
Inference-time scaling (also called inference-compute scaling, test-time scaling, or just inference scaling) is an umbrella term for methods that allocate more compute and time during inference to improve model performance.
This idea has been around for a long time, and one can think of ensemble methods in classic machine learning as an early example of inference-time scaling. I.e., using multiple models requires more compute resources but can give better results.
Even in LLM contexts, this idea has been around for a long time. However, I remember it became particularly popular (again) when OpenAI showed an inference-time scaling and training plot in one of their o1 announcement blog articles last year (Learning to Reason with LLMs).
I think this figure, adapted from OpenAI’s blog post, nicely captures the idea behind the two knobs we can use to improve LLMs. We can spend more resources during training (more data, bigger models, more or longer training stages) or inference.
Actually, in practice, it’s even better to do both at the same time: train a stronger model and use additional inference scaling to make it even better.
In this article, I only focus on the left part of the figure, inference-time scaling techniques, i.e., those training-free techniques that don’t change the model weights.
Facts Only
Inference-time scaling improves LLM performance by using more compute during inference.
Major LLM providers use inference-time scaling techniques.
Academic literature on inference scaling has grown significantly.
The author wrote an earlier overview of inference scaling in March.
A book chapter on inference scaling is being developed, with early access available.
Experimentation involved thousands of runs to refine methods.
The chapter improved a base model's accuracy from 15% to 52%.
Techniques include chain-of-thought prompting, self-consistency, best-of-N ranking, and rejection sampling.
OpenAI highlighted inference-time scaling in a 2023 blog post.
Inference scaling is distinct from training-time improvements.
The article categorizes and updates recent inference-scaling methods.
Future work includes adding code implementations to GitHub.
Executive Summary
Inference-time scaling has emerged as a key technique for improving the performance of large language models (LLMs) by allocating additional compute resources during the inference phase. This approach, which includes methods like chain-of-thought prompting, self-consistency, and rejection sampling, allows models to generate higher-quality answers without modifying their underlying weights. Major LLM providers now rely on these techniques, and academic research in this area has expanded significantly. The author, who is writing a book chapter on the topic, highlights recent advancements and categorizes various inference-scaling methods, noting that combining training improvements with inference scaling yields the best results. The discussion also touches on proprietary LLM strategies and provides practical insights from extensive experimentation, including a notable accuracy improvement from 15% to 52% in a base model.
The article serves as both a technical overview and a reflection on the evolving landscape of LLM optimization, emphasizing the trade-offs between compute costs and performance gains. It also hints at future work, including additional code implementations and further exploration of recursive language models.
Full Take
The narrative presents inference-time scaling as a pragmatic and effective way to enhance LLM performance, framing it as a natural evolution of ensemble methods in machine learning. The strongest version of this argument is that it offers a training-free path to better results, democratizing access to improved model outputs without the need for retraining. The author’s personal experimentation—achieving a 37-point accuracy boost—lends credibility to the claim that these methods are not just theoretical but practically impactful.
However, the discussion assumes that the trade-off between compute costs and performance is universally acceptable, which may not hold for resource-constrained applications. The focus on proprietary LLM strategies also raises questions about transparency and reproducibility, as many of these techniques are likely guarded as competitive advantages. The historical parallel to ensemble methods is valid, but it’s worth asking whether the current hype around inference scaling is driven more by necessity (given the high cost of training larger models) than by fundamental innovation.
Root cause: The paradigm here is computational efficiency as a proxy for intelligence. The unstated assumption is that more compute at inference time can compensate for limitations in model architecture or training data, echoing a broader trend in AI where brute-force scaling often outpaces algorithmic breakthroughs.
Implications: For human agency, this means that better LLM outputs may come at the cost of higher operational expenses, potentially centralizing power in the hands of those who can afford the compute. The second-order consequence is a possible slowdown in research into more efficient models, as inference scaling becomes a crutch.
Bridge questions: How might inference scaling techniques be adapted for edge devices with limited compute? What are the environmental costs of widespread adoption of these methods? Would a shift toward more efficient architectures render inference scaling obsolete, or will it remain a permanent fixture in LLM deployment?
Counterstrike scan: A bad actor could use this narrative to push for dependency on compute-heavy solutions, locking users into expensive cloud services. However, the content itself is technically grounded and does not exhibit signs of manipulation—it’s a genuine exploration of a growing field.
Patterns detected: none
Sentinel — Human
This analysis suggests that the article is likely written by a human. The text demonstrates a human-like writing style with idiosyncratic emphasis and personal voice present, while maintaining a coherent flow without sounding overly balanced or mechanical.
