LLM Deep Thinking: Reasoning Models, Techniques, Evaluation, and the Landscape
Introduction
For most of their short history, large language models generated answers in a single left-to-right pass, producing one token at a time with no looking back, no revision, and no second thoughts. That changed in late 2024.
OpenAI's o1-preview introduced a new paradigm: reasoning models that spend extra compute at inference time to generate hidden "thinking" tokens before arriving at an answer. DeepSeek-R1 open-sourced a competitive approach a few months later. Anthropic added extended thinking to Claude. Google shipped Gemini 2.0 Flash Thinking. By 2026, almost every frontier model has some form of internal reasoning capability.
This article covers the full landscape across four sections:
- How reasoning models work — the architecture and internals of o1, R1, Claude thinking, and others
- Practical techniques — prompting strategies that elicit deeper reasoning from any capable LLM
- Evaluating reasoning — benchmarks, failure modes, and what the numbers don't tell you
- The road ahead — history, open challenges, and future directions
Section 1: Reasoning Models — How "Thinking" Works Under the Hood
The Standard Paradigm
A conventional LLM generates text through a straightforward loop:
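A minimal version of this loop, written with Hugging Face transformers and greedy decoding purely for illustration (the model name is just a small stand-in):

```python
# Single-pass autoregressive decoding: one forward pass per token, no revision.
# "gpt2" is only a small illustrative checkpoint; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokens = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(tokens).logits        # one forward pass over the whole context
        next_token = logits[0, -1].argmax()  # greedy: most likely next token
        tokens = torch.cat([tokens, next_token.reshape(1, 1)], dim=-1)  # commit it; no going back
        if next_token.item() == tokenizer.eos_token_id:
            break
print(tokenizer.decode(tokens[0]))
```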
Each token depends on all previous tokens, but there is no mechanism for the model to "change its mind" or explore alternative paths. Decoding is greedy (or randomized via temperature sampling), always moving forward.
The Reasoning Paradigm
Reasoning models insert an intermediate phase between the prompt and the answer:
During the thinking phase, the model generates tokens that are not visible in the final output. These tokens represent intermediate reasoning steps — chains of thought, candidate answers, self-checks, or backtracking markers. The model can rewrite its own reasoning, try alternative approaches, and only commit to a final answer when it has converged on a solution.
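As a concrete illustration, here is what calling a reasoning model looks like through DeepSeek's OpenAI-compatible API, where the deepseek-reasoner model surfaces its trace in a separate reasoning_content field (field names and trace visibility vary by provider, so treat this as a sketch rather than a universal interface):

```python
# Calling a reasoning model that returns its thinking trace separately from the answer.
# Assumes DeepSeek's OpenAI-compatible API; other providers expose traces differently (or not at all).
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "What is 17 * 24? Explain briefly."}],
)

message = response.choices[0].message
print("Thinking trace:\n", message.reasoning_content)  # the intermediate reasoning tokens
print("Final answer:\n", message.content)              # what a user normally sees
```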
How Each Frontier Model Approaches Reasoning
| Model | Approach | Visible Thinking? | Open Source? | Cost Multiplier |
|---|---|---|---|---|
| OpenAI o1 / o3 | RL from reasoning traces, "private" CoT | No (summarized only) | No | ~2-5x |
| DeepSeek-R1 | RL via GRPO + cold-start SFT | Yes (optional) | Yes | ~2-3x |
| Claude Extended Thinking | Transparent reasoning with token budget | Yes | No | ~2-4x |
| Gemini 2.0 Flash Thinking | Hybrid fast/slow reasoning | Yes | No | ~1.5-2x |
| QwQ (Qwen) | Open reasoning via SFT + RL | Yes | Yes | ~2x |
[!NOTE] Why "hidden" thinking? OpenAI keeps o-series reasoning traces private, citing competitive concerns and safety (a visible chain-of-thought could reveal model internals or be manipulated). Anthropic and DeepSeek take the opposite view — transparency builds trust and enables debugging.
Deep Dive: DeepSeek-R1's GRPO Algorithm
DeepSeek-R1 is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning approach that does not require a critic/value model. Instead, it samples multiple reasoning trajectories from the current policy and optimizes based on relative quality within each group.
The key insight: by normalizing rewards within each group, GRPO eliminates the need for a separate value network (critic), dramatically reducing memory and compute requirements during training.
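In code, the core of that insight is just a per-group normalization. The sketch below illustrates the advantage computation only; the full GRPO objective also applies a clipped policy ratio and a KL penalty against a reference model, which are omitted here:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled trajectory's reward
    against the mean and standard deviation of its own group.

    Simplified illustration of the GRPO idea, not DeepSeek's training code.
    """
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 1.0
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: four reasoning trajectories sampled for one prompt, rewarded 1.0 when
# the final answer is correct. Above-average trajectories get positive advantage.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```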
Section 2: Practical Techniques for Deeper Reasoning
You don't need a reasoning model to get better reasoning. These techniques work with any capable LLM and can often be combined for further gains.
1. Chain-of-Thought (CoT) Prompting
The simplest and most reliable technique. Elicit step-by-step reasoning by asking the model to think before answering.
Zero-shot CoT — just append to your prompt:
[!TIP] Zero-shot CoT Template `` {question}
Let's think through this step by step. ``
Few-shot CoT — provide examples with explicit reasoning:
[!TIP] Few-shot CoT Template `` Q: {example question} A: {step-by-step reasoning} Therefore, the answer is {answer}.
Q: {target question} A: ``
2. Self-Consistency
Generate multiple reasoning chains independently, then take a majority vote on the final answer. Simple, parallelizable, and empirically robust.
```python
from collections import Counter

from openai import OpenAI  # or any chat-completion client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def parse_answer(text: str) -> str:
    """Extract the final answer from a reasoning chain (simplistic: last non-empty line)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def self_consistency(prompt: str, n: int = 5, temperature: float = 0.7):
    """Sample n independent reasoning chains and majority-vote on the parsed answers."""
    responses = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        responses.append(parse_answer(response.choices[0].message.content))

    # Majority vote across the sampled answers
    final_answer = Counter(responses).most_common(1)[0][0]
    return final_answer, responses
```

Trade-off: Self-consistency multiplies cost by N. For math and reasoning tasks, accuracy typically plateaus around N=5-10; beyond that, diminishing returns set in.
3. Tree-of-Thoughts (ToT)
Instead of committing to a single reasoning chain, ToT maintains multiple candidate paths simultaneously, using a search algorithm (typically BFS or DFS) to expand, evaluate, and prune branches.
ToT is powerful for tasks with clear intermediate states (math problems, puzzle solving, planning) but requires a way to evaluate each branch — typically by asking the LLM itself to score the progress of each partial solution.
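A compact breadth-first sketch of the idea, assuming two hypothetical LLM-backed helpers: propose_next_steps(state, k) asks the model to extend a partial solution in k ways, and score_state(state) asks it to rate progress (neither is a real library call):

```python
# Breadth-first Tree-of-Thoughts sketch. `propose_next_steps` and `score_state`
# are hypothetical LLM-backed helpers, not a real library API.
def tree_of_thoughts(problem, propose_next_steps, score_state,
                     beam_width=3, branching=4, max_depth=5):
    frontier = [problem]          # each state = the problem plus the partial reasoning so far
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            candidates.extend(propose_next_steps(state, branching))  # expand each branch
        candidates.sort(key=score_state, reverse=True)               # LLM-evaluated progress
        frontier = candidates[:beam_width]                           # prune to the most promising
    return frontier[0]            # best path found within the depth budget
```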
4. Reflexion / Self-Critique
The model generates an answer, evaluates its own reasoning, identifies errors, and regenerates. This creates a feedback loop that often converges on better answers.
[!TIP] Self-Critique Prompt Template `` {question}
Let me think through this step by step: {model's reasoning}
Now, critically evaluate your reasoning above. Check:
- Are all assumptions valid?
- Are the calculations correct?
- Could there be alternative interpretations?
If you find any issues, provide a corrected version. ``
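The same loop can be automated. A minimal sketch, assuming a hypothetical ask(prompt) helper that wraps whatever chat API you use:

```python
# Self-critique loop sketch. `ask(prompt)` is a hypothetical helper wrapping your chat API.
def reflexion(question, ask, max_rounds=2):
    answer = ask(f"{question}\n\nLet me think through this step by step:")
    for _ in range(max_rounds):
        critique = ask(
            f"{question}\n\nProposed reasoning:\n{answer}\n\n"
            "Critically evaluate the reasoning above: check assumptions, calculations, "
            "and alternative interpretations. If it is fully correct, reply only with OK."
        )
        if critique.strip() == "OK":
            break  # converged: the critic found no issues
        answer = ask(
            f"{question}\n\nPrevious attempt:\n{answer}\n\nCritique:\n{critique}\n\n"
            "Provide a corrected, complete answer."
        )
    return answer
```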
Putting It All Together
These techniques are complementary. A practical pipeline might combine them as follows (a code sketch follows the table):
| Step | Technique | Purpose |
|---|---|---|
| 1 | Few-shot CoT | Establish reasoning patterns |
| 2 | Self-consistency (N=5) | Reduce variance |
| 3 | Self-critique on top-2 | Refine best candidates |
| 4 | Final selection | Human or LLM judge |
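Sketched end to end, reusing the hypothetical ask helper from above; extract_final_answer is likewise a task-specific parser you would supply:

```python
# End-to-end sketch of the pipeline in the table above. `ask` and `extract_final_answer`
# are hypothetical helpers (introduced earlier / supplied by you), not a library API.
from collections import Counter

def reasoning_pipeline(question, ask, few_shot_examples, n=5):
    prompt = f"{few_shot_examples}\n\nQ: {question}\nA:"       # 1. few-shot CoT
    chains = [ask(prompt) for _ in range(n)]                   # 2. N independent chains
    answers = [extract_final_answer(c) for c in chains]
    top_two = [a for a, _ in Counter(answers).most_common(2)]  # 3. critique the top candidates
    refined = [
        ask(f"{question}\n\nProposed answer: {a}\n\n"
            "Identify any errors in this answer, then give a corrected final answer.")
        for a in top_two
    ]
    return ask(                                                # 4. LLM judge picks the winner
        f"Question: {question}\nCandidates:\n" + "\n".join(refined) +
        "\nWhich candidate is best supported by correct reasoning? Reply with that answer only."
    )
```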
Section 3: Evaluating LLM Reasoning Quality
The Benchmark Landscape
When someone says "model X is better at reasoning," they usually mean it scores higher on one or more of these benchmarks:
| Benchmark | Domain | Key Challenge | 2024 SOTA | 2025-26 SOTA |
|---|---|---|---|---|
| GSM8K | Grade-school math | Multi-step arithmetic | 95% | 97% |
| MATH | Competition math | Complex symbolic reasoning | 84% | 94% |
| MMLU-Pro | Broad knowledge (14 disciplines) | Expert-level QA | 72% | 82% |
| GPQA | Graduate-level Q&A | PhD-level science | 65% | 81% |
| ARC-AGI-2 | Visual abstract reasoning | Novel puzzle generalization | 35% | 58% |
| LiveBench | Adversarial evaluation | Contamination-resistant | N/A | Leaderboard |
[!WARNING] Benchmark data can be contaminated. If a model was trained on GSM8K examples (or very similar problems), a high score may reflect memorization rather than reasoning ability. Adversarial benchmarks like LiveBench and ARC-AGI-2 are designed to resist contamination by using fresh, non-public questions.
What Benchmarks Miss
Benchmarks measure final-answer accuracy. They do not capture:
- Trace quality — Is the reasoning logically sound even if the answer is wrong? Does the model take a correct but brittle path?
- Faithfulness — Does the reasoning actually produce the answer, or does the model rationalize a correct guess?
- Reward hacking — Some models learn to generate long, plausible-sounding chains that happen to produce correct answers, even when individual steps are invalid. This is a growing concern as models are optimized for RL rewards based on final answers.
- Cost efficiency — A model that achieves 90% accuracy at 10x the cost may be less useful in practice than one with 85% at 1x cost.
Evaluation Pipeline
A robust evaluation should include both an accuracy score (from automated answer comparison) and a quality score (from trace evaluation, ideally done by a separate LLM judge or human annotators).
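A minimal shape for such a pipeline, reporting both scores side by side (ask_judge is a hypothetical LLM-judge helper, and the dataset records are assumed to already carry the model's answer, its reasoning trace, and a gold answer):

```python
# Two-score evaluation sketch: exact-match accuracy plus judged trace quality.
# `ask_judge(prompt)` is a hypothetical LLM-judge helper; dataset records are assumed
# to contain the question, the model's answer, its reasoning trace, and a gold answer.
def evaluate(dataset, ask_judge):
    correct, quality_scores = 0, []
    for item in dataset:
        correct += int(item["model_answer"].strip() == item["gold_answer"].strip())
        verdict = ask_judge(
            f"Question: {item['question']}\n"
            f"Reasoning trace:\n{item['trace']}\n"
            "Rate the logical soundness of this trace from 1 (invalid) to 5 (rigorous). "
            "Reply with the number only."
        )
        quality_scores.append(float(verdict.strip()))
    return {
        "accuracy": correct / len(dataset),
        "trace_quality": sum(quality_scores) / len(quality_scores),
    }
```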
Evaluation Frameworks
| Framework | Type | Strengths |
|---|---|---|
| LM-eval-harness | Benchmark runner | Standardized, widely used, supports 200+ benchmarks |
| LLM-as-Judge | Quality evaluator | Scalable, captures nuance, but biased toward the judging model |
| Process Reward Models | Step-by-step verifier | Detects flawed reasoning even when final answer is correct |
| Human evaluation | Gold standard | Most accurate, but slow and expensive |
[!TIP] Practical recommendation For regular evaluation, use LM-eval-harness on a diverse set of benchmarks (include at least one math, one knowledge, and one adversarial benchmark). Supplement with LLM-as-Judge for trace quality on a 200-sample subset. Reserve human eval for major model releases.
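For the benchmark-runner half of that recommendation, lm-evaluation-harness exposes a Python entry point. The sketch below assumes its simple_evaluate API and common task names; arguments and task names shift between versions, so treat it as a starting point rather than a drop-in script:

```python
# Sketch of running a small benchmark mix with EleutherAI's lm-evaluation-harness.
# Assumes the lm_eval Python API; check your installed version for exact arguments and task names.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # any causal-LM checkpoint (example name)
    tasks=["gsm8k", "mmlu_pro", "gpqa"],               # one math, one knowledge, one hard-QA benchmark
    limit=200,                                         # small subset for quick iteration
)
print(results["results"])
```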
Section 4: History, Open Challenges, and the Road Ahead
A Brief Timeline
Open Challenges
Despite rapid progress, fundamental problems remain unsolved:
- Formal verification of reasoning — Can we prove that a model's reasoning chain is logically valid? Current approaches (PRMs, self-evaluation) are probabilistic, not rigorous.
- Scaling test-time compute — More thinking tokens generally improve accuracy, but the relationship is noisy. Some problems benefit from 10x compute; others plateau at 1x. We lack reliable predictors for optimal compute allocation per query.
- Agentic reasoning — Reasoning models that use tools (code execution, search, APIs) introduce new failure modes: the model can hallucinate tool outputs, get stuck in tool-calling loops, or take unsafe actions based on flawed reasoning.
- Hallucination in long chains — The longer the reasoning trace, the more opportunities for the model to introduce factual errors. Early evidence suggests reasoning models hallucinate more than standard models on certain knowledge-intensive tasks because they spend more time generating plausible-sounding but incorrect explanations.
- Self-improving reasoning — Can models improve their own reasoning without human data? Approaches like STaR (Self-Taught Reasoner) and V-STaR show promise, but the gains are still bounded by the model's existing knowledge.
Future Directions
- Constitutional AI via reasoning — Models that reason about safety constraints internally before generating responses, rather than relying on external guardrails.
- Multi-modal reasoning — Combining visual, audio, and text reasoning in a unified thinking process.
- Synthetic data flywheels — Using reasoning models to generate high-quality training data for smaller models, compressing reasoning capability into cheaper-to-run architectures.
- Compute-optimal inference — Dynamically allocating thinking compute based on problem difficulty, query cost budget, and latency requirements.
Conclusion
The shift from single-pass generation to multi-step reasoning is the most significant change in LLM architecture since the transformer. It changes not just what models can do, but how they do it — from pattern-matching engines to systems that can explore, evaluate, and revise their own thoughts.
Key takeaways:
- Reasoning models are a paradigm shift, not an incremental improvement. The architecture fundamentally changes the reliability ceiling for complex tasks.
- Techniques are model-agnostic. Chain-of-thought, self-consistency, and reflexion work across all capable models. Invest in prompt engineering alongside model selection.
- Evaluation is the hardest problem. Don't trust a single benchmark. Combine accuracy metrics with trace quality evaluation for a complete picture.
- The gap is narrowing. Open-source reasoning models (DeepSeek-R1, QwQ) are closing the gap with closed models faster than many expected. The best time to start reasoning with LLMs was 2024. The second-best time is now.
Experiment with these techniques. Run your own evaluations. The field moves fast, but the fundamentals — clear thinking, careful evaluation, and systematic iteration — remain timeless.
Enjoyed this article?
Check out my projects or get in touch if you'd like to discuss backend engineering, system design, or collaboration.