LLM Deep Thinking: Reasoning Models, Techniques, Evaluation, and the Landscape
Introduction
For most of their short history, large language models generated answers in a single left-to-right pass, producing one token at a time with no looking back, no revision, and no second thoughts. That changed in late 2024.
OpenAI's o1-preview introduced a new paradigm: reasoning models that spend extra compute at inference time to generate hidden "thinking" tokens before arriving at an answer. DeepSeek-R1 open-sourced a competitive approach a few months later. Anthropic added extended thinking to Claude. Google shipped Gemini 2.0 Flash Thinking. By 2026, almost every frontier model has some form of internal reasoning capability.
This article covers the full landscape across four sections:
- How reasoning models work — the architecture and internals of o1, R1, Claude thinking, and others
- Practical techniques — prompting strategies that elicit deeper reasoning from any capable LLM
- Evaluating reasoning — benchmarks, failure modes, and what the numbers don't tell you
- The road ahead — history, open challenges, and future directions
Section 1: Reasoning Models — How "Thinking" Works Under the Hood
The Standard Paradigm
A conventional LLM generates text through a straightforward loop:
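A minimal version of this loop, written with Hugging Face transformers and greedy decoding purely for illustration (the model name is just a small stand-in):

```python
# Single-pass autoregressive decoding: one forward pass per token, no revision.
# "gpt2" is only a small illustrative checkpoint; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokens = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(tokens).logits        # one forward pass over the whole context
        next_token = logits[0, -1].argmax()  # greedy: most likely next token
        tokens = torch.cat([tokens, next_token.reshape(1, 1)], dim=-1)  # commit it; no going back
        if next_token.item() == tokenizer.eos_token_id:
            break
print(tokenizer.decode(tokens[0]))
```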
Each token depends on all previous tokens, but there is no mechanism for the model to "change its mind" or explore alternative paths. Decoding is greedy (or randomized via temperature sampling), always moving forward.
The Reasoning Paradigm
Reasoning models insert an intermediate phase between the prompt and the answer:
During the thinking phase, the model generates tokens that are not visible in the final output. These tokens represent intermediate reasoning steps — chains of thought, candidate answers, self-checks, or backtracking markers. The model can rewrite its own reasoning, try alternative approaches, and only commit to a final answer when it has converged on a solution.
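As a concrete illustration, here is what calling a reasoning model looks like through DeepSeek's OpenAI-compatible API, where the deepseek-reasoner model surfaces its trace in a separate reasoning_content field (field names and trace visibility vary by provider, so treat this as a sketch rather than a universal interface):

```python
# Calling a reasoning model that returns its thinking trace separately from the answer.
# Assumes DeepSeek's OpenAI-compatible API; other providers expose traces differently (or not at all).
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "What is 17 * 24? Explain briefly."}],
)

message = response.choices[0].message
print("Thinking trace:\n", message.reasoning_content)  # the intermediate reasoning tokens
print("Final answer:\n", message.content)              # what a user normally sees
```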
How Each Frontier Model Approaches Reasoning
| Model | Approach | Visible Thinking? | Open Source? | Cost Multiplier |
|---|---|---|---|---|
| OpenAI o1 / o3 | RL from reasoning traces, "private" CoT | No (summarized only) | No | ~2-5x |
| DeepSeek-R1 | RL via GRPO + cold-start SFT | Yes (optional) | Yes | ~2-3x |
| Claude Extended Thinking | Transparent reasoning with token budget | Yes | No | ~2-4x |
| Gemini 2.0 Flash Thinking | Hybrid fast/slow reasoning | Yes | No | ~1.5-2x |
| QwQ (Qwen) | Open reasoning via SFT + RL | Yes | Yes | ~2x |
[!NOTE] Why "hidden" thinking? OpenAI keeps o-series reasoning traces private, citing competitive concerns and safety (a visible chain-of-thought could reveal model internals or be manipulated). Anthropic and DeepSeek take the opposite view — transparency builds trust and enables debugging.
Deep Dive: DeepSeek-R1's GRPO Algorithm
DeepSeek-R1 is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning approach that does not require a critic/value model. Instead, it samples multiple reasoning trajectories from the current policy and optimizes based on relative quality within each group.
The key insight: by normalizing rewards within each group, GRPO eliminates the need for a separate value network (critic), dramatically reducing memory and compute requirements during training.
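In code, the core of that insight is just a per-group normalization. The sketch below illustrates the advantage computation only; the full GRPO objective also applies a clipped policy ratio and a KL penalty against a reference model, which are omitted here:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled trajectory's reward
    against the mean and standard deviation of its own group.

    Simplified illustration of the GRPO idea, not DeepSeek's training code.
    """
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 1.0
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: four reasoning trajectories sampled for one prompt, rewarded 1.0 when
# the final answer is correct. Above-average trajectories get positive advantage.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```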
Section 2: Practical Techniques for Deeper Reasoning
You don't need a reasoning model to get better reasoning. These techniques work with any capable LLM and can often be combined for further gains.
1. Chain-of-Thought (CoT) Prompting
The simplest and most reliable technique. Elicit step-by-step reasoning by asking the model to think before answering.
Zero-shot CoT — just append to your prompt:
[!TIP] Zero-shot CoT Template `` {question}
Let's think through this step by step. ``
Few-shot CoT — provide examples with explicit reasoning:
[!TIP] Few-shot CoT Template `` Q: {example question} A: {step-by-step reasoning} Therefore, the answer is {answer}.
Q: {target question} A: ``
2. Self-Consistency
Generate multiple reasoning chains independently, then take a majority vote on the final answer. Simple, parallelizable, and empirically robust.
```python
from collections import Counter

from openai import OpenAI  # or any chat-completion client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def parse_answer(text: str) -> str:
    """Extract the final answer from a reasoning chain (simplistic: last non-empty line)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def self_consistency(prompt: str, n: int = 5, temperature: float = 0.7):
    """Sample n independent reasoning chains and majority-vote on the parsed answers."""
    responses = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        responses.append(parse_answer(response.choices[0].message.content))

    # Majority vote across the sampled answers
    final_answer = Counter(responses).most_common(1)[0][0]
    return final_answer, responses
```

Trade-off: Self-consistency multiplies cost by N. For math and reasoning tasks, accuracy typically plateaus around N=5-10; beyond that, diminishing returns set in.
3. Tree-of-Thoughts (ToT)
Instead of committing to a single reasoning chain, ToT maintains multiple candidate paths simultaneously, using a search algorithm (typically BFS or DFS) to expand, evaluate, and prune branches.
ToT is powerful for tasks with clear intermediate states (math problems, puzzle solving, planning) but requires a way to evaluate each branch — typically by asking the LLM itself to score the progress of each partial solution.
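A compact breadth-first sketch of the idea, assuming two hypothetical LLM-backed helpers: propose_next_steps(state, k) asks the model to extend a partial solution in k ways, and score_state(state) asks it to rate progress (neither is a real library call):

```python
# Breadth-first Tree-of-Thoughts sketch. `propose_next_steps` and `score_state`
# are hypothetical LLM-backed helpers, not a real library API.
def tree_of_thoughts(problem, propose_next_steps, score_state,
                     beam_width=3, branching=4, max_depth=5):
    frontier = [problem]          # each state = the problem plus the partial reasoning so far
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            candidates.extend(propose_next_steps(state, branching))  # expand each branch
        candidates.sort(key=score_state, reverse=True)               # LLM-evaluated progress
        frontier = candidates[:beam_width]                           # prune to the most promising
    return frontier[0]            # best path found within the depth budget
```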
4. Reflexion / Self-Critique
The model generates an answer, evaluates its own reasoning, identifies errors, and regenerates. This creates a feedback loop that often converges on better answers.
[!TIP] Self-Critique Prompt Template `` {question}
Let me think through this step by step: {model's reasoning}
Now, critically evaluate your reasoning above. Check:
- Are all assumptions valid?
- Are the calculations correct?
- Could there be alternative interpretations?
If you find any issues, provide a corrected version. ``
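The same loop can be automated. A minimal sketch, assuming a hypothetical ask(prompt) helper that wraps whatever chat API you use:

```python
# Self-critique loop sketch. `ask(prompt)` is a hypothetical helper wrapping your chat API.
def reflexion(question, ask, max_rounds=2):
    answer = ask(f"{question}\n\nLet me think through this step by step:")
    for _ in range(max_rounds):
        critique = ask(
            f"{question}\n\nProposed reasoning:\n{answer}\n\n"
            "Critically evaluate the reasoning above: check assumptions, calculations, "
            "and alternative interpretations. If it is fully correct, reply only with OK."
        )
        if critique.strip() == "OK":
            break  # converged: the critic found no issues
        answer = ask(
            f"{question}\n\nPrevious attempt:\n{answer}\n\nCritique:\n{critique}\n\n"
            "Provide a corrected, complete answer."
        )
    return answer
```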
Putting It All Together
These techniques are complementary. A practical pipeline might combine them as follows (a code sketch follows the table):
| Step | Technique | Purpose |
|---|---|---|
| 1 | Few-shot CoT | Establish reasoning patterns |
| 2 | Self-consistency (N=5) | Reduce variance |
| 3 | Self-critique on top-2 | Refine best candidates |
| 4 | Final selection | Human or LLM judge |
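Sketched end to end, reusing the hypothetical ask helper from above; extract_final_answer is likewise a task-specific parser you would supply:

```python
# End-to-end sketch of the pipeline in the table above. `ask` and `extract_final_answer`
# are hypothetical helpers (introduced earlier / supplied by you), not a library API.
from collections import Counter

def reasoning_pipeline(question, ask, few_shot_examples, n=5):
    prompt = f"{few_shot_examples}\n\nQ: {question}\nA:"       # 1. few-shot CoT
    chains = [ask(prompt) for _ in range(n)]                   # 2. N independent chains
    answers = [extract_final_answer(c) for c in chains]
    top_two = [a for a, _ in Counter(answers).most_common(2)]  # 3. critique the top candidates
    refined = [
        ask(f"{question}\n\nProposed answer: {a}\n\n"
            "Identify any errors in this answer, then give a corrected final answer.")
        for a in top_two
    ]
    return ask(                                                # 4. LLM judge picks the winner
        f"Question: {question}\nCandidates:\n" + "\n".join(refined) +
        "\nWhich candidate is best supported by correct reasoning? Reply with that answer only."
    )
```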
Section 3: Evaluating LLM Reasoning Quality
The Benchmark Landscape
When someone says "model X is better at reasoning," they usually mean it scores higher on one or more of these benchmarks:
| Benchmark | Domain | Key Challenge | 2024 SOTA | 2025-26 SOTA |
|---|---|---|---|---|
| GSM8K | Grade-school math | Multi-step arithmetic | 95% | 97% |
| MATH | Competition math | Complex symbolic reasoning | 84% | 94% |
| MMLU-Pro | Broad knowledge (14 disciplines) | Expert-level QA | 72% | 82% |
| GPQA | Graduate-level Q&A | PhD-level science | 65% | 81% |
| ARC-AGI-2 | Visual abstract reasoning | Novel puzzle generalization | 35% | 58% |
| LiveBench | Adversarial evaluation | Contamination-resistant | N/A | Leaderboard |
[!WARNING] Benchmark data can be contaminated. If a model was trained on GSM8K examples (or very similar problems), a high score may reflect memorization rather than reasoning ability. Adversarial benchmarks like LiveBench and ARC-AGI-2 are designed to resist contamination by using fresh, non-public questions.
What Benchmarks Miss
Benchmarks measure final-answer accuracy. They do not capture:
- Trace quality — Is the reasoning logically sound even if the answer is wrong? Does the model take a correct but brittle path?
- Faithfulness — Does the reasoning actually produce the answer, or does the model rationalize a correct guess?
- Reward hacking — Some models learn to generate long, plausible-sounding chains that happen to produce correct answers, even when individual steps are invalid. This is a growing concern as models are optimized for RL rewards based on final answers.
- Cost efficiency — A model that achieves 90% accuracy at 10x the cost may be less useful in practice than one with 85% at 1x cost.
Evaluation Pipeline
A robust evaluation should include both an accuracy score (from automated answer comparison) and a quality score (from trace evaluation, ideally done by a separate LLM judge or human annotators).
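A minimal shape for such a pipeline, reporting both scores side by side (ask_judge is a hypothetical LLM-judge helper, and the dataset records are assumed to already carry the model's answer, its reasoning trace, and a gold answer):

```python
# Two-score evaluation sketch: exact-match accuracy plus judged trace quality.
# `ask_judge(prompt)` is a hypothetical LLM-judge helper; dataset records are assumed
# to contain the question, the model's answer, its reasoning trace, and a gold answer.
def evaluate(dataset, ask_judge):
    correct, quality_scores = 0, []
    for item in dataset:
        correct += int(item["model_answer"].strip() == item["gold_answer"].strip())
        verdict = ask_judge(
            f"Question: {item['question']}\n"
            f"Reasoning trace:\n{item['trace']}\n"
            "Rate the logical soundness of this trace from 1 (invalid) to 5 (rigorous). "
            "Reply with the number only."
        )
        quality_scores.append(float(verdict.strip()))
    return {
        "accuracy": correct / len(dataset),
        "trace_quality": sum(quality_scores) / len(quality_scores),
    }
```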
Evaluation Frameworks
| Framework | Type | Strengths |
|---|---|---|
| LM-eval-harness | Benchmark runner | Standardized, widely used, supports 200+ benchmarks |
| LLM-as-Judge | Quality evaluator | Scalable, captures nuance, but biased toward the judging model |
| Process Reward Models | Step-by-step verifier | Detects flawed reasoning even when final answer is correct |
| Human evaluation | Gold standard | Most accurate, but slow and expensive |
[!TIP] Practical recommendation For regular evaluation, use LM-eval-harness on a diverse set of benchmarks (include at least one math, one knowledge, and one adversarial benchmark). Supplement with LLM-as-Judge for trace quality on a 200-sample subset. Reserve human eval for major model releases.
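For the benchmark-runner half of that recommendation, lm-evaluation-harness exposes a Python entry point. The sketch below assumes its simple_evaluate API and common task names; arguments and task names shift between versions, so treat it as a starting point rather than a drop-in script:

```python
# Sketch of running a small benchmark mix with EleutherAI's lm-evaluation-harness.
# Assumes the lm_eval Python API; check your installed version for exact arguments and task names.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # any causal-LM checkpoint (example name)
    tasks=["gsm8k", "mmlu_pro", "gpqa"],               # one math, one knowledge, one hard-QA benchmark
    limit=200,                                         # small subset for quick iteration
)
print(results["results"])
```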
Section 4: History, Open Challenges, and the Road Ahead
A Brief Timeline
Open Challenges
Despite rapid progress, fundamental problems remain unsolved:
- Formal verification of reasoning — Can we prove that a model's reasoning chain is logically valid? Current approaches (PRMs, self-evaluation) are probabilistic, not rigorous.
- Scaling test-time compute — More thinking tokens generally improve accuracy, but the relationship is noisy. Some problems benefit from 10x compute; others plateau at 1x. We lack reliable predictors for optimal compute allocation per query.
- Agentic reasoning — Reasoning models that use tools (code execution, search, APIs) introduce new failure modes: the model can hallucinate tool outputs, get stuck in tool-calling loops, or take unsafe actions based on flawed reasoning.
- Hallucination in long chains — The longer the reasoning trace, the more opportunities for the model to introduce factual errors. Early evidence suggests reasoning models hallucinate more than standard models on certain knowledge-intensive tasks because they spend more time generating plausible-sounding but incorrect explanations.
- Self-improving reasoning — Can models improve their own reasoning without human data? Approaches like STaR (Self-Taught Reasoner) and V-STaR show promise, but the gains are still bounded by the model's existing knowledge.
Future Directions
- Constitutional AI via reasoning — Models that reason about safety constraints internally before generating responses, rather than relying on external guardrails.
- Multi-modal reasoning — Combining visual, audio, and text reasoning in a unified thinking process.
- Synthetic data flywheels — Using reasoning models to generate high-quality training data for smaller models, compressing reasoning capability into cheaper-to-run architectures.
- Compute-optimal inference — Dynamically allocating thinking compute based on problem difficulty, query cost budget, and latency requirements.
Conclusion
The shift from single-pass generation to multi-step reasoning is the most significant change in LLM architecture since the transformer. It changes not just what models can do, but how they do it — from pattern-matching engines to systems that can explore, evaluate, and revise their own thoughts.
Key takeaways:
- Reasoning models are a paradigm shift, not an incremental improvement. The architecture fundamentally changes the reliability ceiling for complex tasks.
- Techniques are model-agnostic. Chain-of-thought, self-consistency, and reflexion work across all capable models. Invest in prompt engineering alongside model selection.
- Evaluation is the hardest problem. Don't trust a single benchmark. Combine accuracy metrics with trace quality evaluation for a complete picture.
- The gap is narrowing. Open-source reasoning models (DeepSeek-R1, QwQ) are closing the gap with closed models faster than many expected. The best time to start reasoning with LLMs was 2024. The second-best time is now.
Experiment with these techniques. Run your own evaluations. The field moves fast, but the fundamentals — clear thinking, careful evaluation, and systematic iteration — remain timeless.
Enjoyed this article?
Check out my projects or get in touch if you'd like to discuss backend engineering, system design, or collaboration.