Chain-of-Thought Reasoning
Chain-of-thought (CoT) prompting enables LLMs to solve complex problems by decomposing them into intermediate reasoning steps. This tutorial covers CoT variants, theoretical foundations, and practical applications.
Definition: Chain-of-Thought (CoT) Prompting
Chain-of-thought prompting elicits multi-step reasoning from language models by providing or requesting intermediate reasoning steps before the final answer. This technique dramatically improves performance on tasks requiring logical, arithmetic, or commonsense reasoning.
CoT Variants
Zero-Shot CoT
Simply append "Let's think step by step" to the prompt:
```python
prompt = """Q: A juggler can juggle 16 balls. Half are golf balls, and half the golf balls are blue. How many blue golf balls?
A: Let's think step by step."""

# A typical model completion:
# - Total balls = 16
# - Golf balls = 16 / 2 = 8
# - Blue golf balls = 8 / 2 = 4
# The answer is 4.
```
Few-Shot CoT
Provide reasoning examples in the prompt:
```python
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have now?
A: Roger started with 5 balls. 2 cans of 3 is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The school cafeteria ordered 42 apples for the lunches. They used 6 for Monday's lunches. Then they bought 4 more cases of 3 apples each. How many apples do they have now?
A: The cafeteria started with 42 apples. They used 6, leaving 42 - 6 = 36. They bought 4 cases of 3 = 12 apples. 36 + 12 = 48. The answer is 48.

Q: {question}
A: Let's think step by step."""
```
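A prompt like this can also be assembled from a list of worked examples rather than written by hand. The helper below is a minimal sketch; the function name `build_few_shot_cot_prompt` and the sample question about eggs are illustrative assumptions, not part of any library.

```python
from typing import List, Tuple


def build_few_shot_cot_prompt(
    examples: List[Tuple[str, str]],
    question: str,
    trigger: str = "Let's think step by step.",
) -> str:
    """Join (question, worked answer) pairs, then append the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final block leaves the answer open so the model continues the reasoning.
    blocks.append(f"Q: {question}\nA: {trigger}")
    return "\n\n".join(blocks)


examples = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have now?",
        "Roger started with 5 balls. 2 cans of 3 is 6 balls. 5 + 6 = 11. The answer is 11.",
    ),
]
prompt = build_few_shot_cot_prompt(
    examples, "A baker has 12 eggs and uses 5. How many are left?"
)
print(prompt)
```

Keeping the demonstrations in a list makes it easy to swap examples in and out when comparing prompt variants.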
Self-Consistency
Generate multiple reasoning paths and select the most common answer:
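Self-Consistency Probability
$$P(y \mid x) = \sum_{r \in R_y} P(r \mid x), \qquad \hat{y} = \arg\max_y P(y \mid x)$$
Here,
- $R_y$ = Set of reasoning paths leading to answer $y$
- $P(r \mid x)$ = Probability of reasoning path $r$ given input $x$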
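In practice the sum over reasoning paths is approximated by sampling: each sampled path contributes one vote for the answer it reaches. The sketch below shows only that voting step, assuming the final answers have already been extracted as strings (with `None` for samples that could not be parsed). The name `majority_vote` is chosen for this example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def majority_vote(
    answers: List[Optional[str]],
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return the most frequent answer and the empirical answer distribution."""
    # Drop samples where no answer could be extracted.
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None, {}
    counts = Counter(valid)
    distribution = {ans: n / len(valid) for ans, n in counts.items()}
    best, _ = counts.most_common(1)[0]
    return best, distribution


sampled = ["4", "4", "8", None, "4"]
best, dist = majority_vote(sampled)
print(best, dist)
```

The returned distribution is a rough confidence signal: a narrow margin between the top two answers suggests the question may need more samples.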
Tree-of-Thought (ToT)
Definition: Tree-of-Thought
Tree-of-Thought explores multiple reasoning branches at each step, evaluates them, and prunes unpromising paths. It uses a search algorithm (BFS or DFS) to find the best reasoning trajectory.
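ToT State Evaluation
Each candidate state $s$ is assigned a score $V(s)$ that guides which branches are kept.
Here,
- $s$ = Current reasoning state
- $V(s)$ = Value/quality score of the state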
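The explore, evaluate, and prune loop can be sketched as a breadth-first search that keeps only the highest-scoring states at each level. This is a toy illustration under stated assumptions: in a real system `propose` and `evaluate` would be LLM calls, whereas here they are simple arithmetic stand-ins, and the task (reach 10 from 1 using +1, *2, or +3) is invented for the example.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # toy state: the sequence of numbers produced so far


def tree_of_thought_bfs(
    root: State,
    propose: Callable[[State], List[State]],
    evaluate: Callable[[State], float],
    is_goal: Callable[[State], bool],
    beam_width: int = 3,
    max_depth: int = 5,
) -> Optional[State]:
    """Breadth-first search that keeps the best `beam_width` states per level."""
    frontier = [root]
    for _ in range(max_depth):
        # Expand every state on the current level into candidate next states.
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            return None
        for state in candidates:
            if is_goal(state):
                return state
        # Score the candidates and prune all but the top `beam_width`.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return None


TARGET = 10  # hypothetical goal for the toy task


def propose(state: State) -> List[State]:
    """Stand-in for an LLM proposing next thoughts: apply +1, *2, or +3."""
    last = state[-1]
    return [state + (last + 1,), state + (last * 2,), state + (last + 3,)]


def evaluate(state: State) -> float:
    """Stand-in for an LLM value estimate: closer to the target is better."""
    return -abs(TARGET - state[-1])


def is_goal(state: State) -> bool:
    return state[-1] == TARGET


solution = tree_of_thought_bfs((1,), propose, evaluate, is_goal)
print(solution)
```

A depth-first variant would replace the level-by-level frontier with a stack and backtrack when a state scores below some cutoff.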
When CoT Helps
Tasks Where CoT Excels
| Task Type | Example | CoT Improvement |
|---|---|---|
| Arithmetic | Multi-step math | +30-40% |
| Logic | Syllogisms | +20-30% |
| Commonsense | Physical reasoning | +15-25% |
| Symbolic | Variable tracking | +25-35% |
| Multi-hop QA | Reading comprehension | +10-20% |
Tasks Where CoT Hurts
- Simple factual recall: CoT adds unnecessary complexity
- Classification: CoT can lead to overthinking
- Creative writing: CoT constrains creativity
- Translation: CoT is not helpful for direct mapping
CoT is most effective when the problem requires multiple reasoning steps that cannot be easily compressed into a single inference. If the answer can be retrieved from memory, CoT may actually hurt performance.
Mathematical Foundation
CoT as Search
CoT can be viewed as a search problem in the space of reasoning sequences:
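CoT Search Space
$$\mathcal{R} = \{(r_1, \dots, r_T) : P(r_t \mid r_{<t}, x) > \epsilon \ \text{for all } t\}$$
Here,
- $\mathcal{R}$ = Space of valid reasoning sequences
- $r_t$ = Reasoning step at position $t$
- $\epsilon$ = Probability threshold for pruning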
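To make the search view concrete, the sketch below enumerates reasoning chains from a small hand-written table of step probabilities and discards any step whose probability does not exceed the pruning threshold. The table `toy_model`, its probabilities, and the parameter name `epsilon` are invented for illustration; a real system would obtain step probabilities from the language model.

```python
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]


def enumerate_valid_chains(
    step_probs: Dict[Prefix, List[Tuple[str, float]]],
    epsilon: float,
) -> List[Tuple[Prefix, float]]:
    """Depth-first enumeration that prunes steps with probability <= epsilon."""
    complete: List[Tuple[Prefix, float]] = []
    stack: List[Tuple[Prefix, float]] = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        continuations = step_probs.get(prefix)
        if not continuations:
            # No further steps are defined: the chain is complete.
            if prefix:
                complete.append((prefix, prob))
            continue
        for step, p in continuations:
            if p > epsilon:  # keep only sufficiently likely steps
                stack.append((prefix + (step,), prob * p))
    # Most probable chains first.
    return sorted(complete, key=lambda item: item[1], reverse=True)


# Hypothetical step probabilities for the juggler problem.
toy_model = {
    (): [("golf = 16 / 2 = 8", 0.9), ("golf = 16 * 2 = 32", 0.1)],
    ("golf = 16 / 2 = 8",): [("blue = 8 / 2 = 4", 0.8), ("blue = 8 - 2 = 6", 0.2)],
    ("golf = 16 * 2 = 32",): [("blue = 32 / 2 = 16", 1.0)],
}
chains = enumerate_valid_chains(toy_model, epsilon=0.15)
for chain, prob in chains:
    print(f"{prob:.2f}", " -> ".join(chain))
```

Raising the threshold shrinks the search space but risks pruning an unlikely first step that leads to the correct answer.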
Self-Consistency as Ensemble
Self-consistency can be viewed as an ensemble of different reasoning strategies:
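Ensemble View of Self-Consistency
$$P(y \mid x) \approx \frac{1}{N} \sum_{i=1}^{N} P(y \mid r_i, x), \qquad r_i \sim P(r \mid x)$$
Here,
- $N$ = Number of sampled reasoning paths
- $r_i$ = Reasoning path $i$
- $P(y \mid r_i, x)$ = Answer probability given reasoning path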
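Under the ensemble view, each sampled path contributes a distribution over answers and the ensemble averages them. The sketch below assumes those per-path distributions are already available as dictionaries; the function name and the numbers are illustrative. Plain majority voting is the special case in which every path puts probability 1 on a single answer.

```python
from collections import defaultdict
from typing import Dict, List


def ensemble_answer_distribution(
    path_answer_probs: List[Dict[str, float]],
) -> Dict[str, float]:
    """Average the per-path answer distributions over all sampled paths."""
    n = len(path_answer_probs)
    if n == 0:
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for probs in path_answer_probs:
        for answer, p in probs.items():
            totals[answer] += p
    return {answer: total / n for answer, total in totals.items()}


# Four hypothetical sampled paths; the last one is split between two answers.
paths = [
    {"270": 1.0},
    {"270": 1.0},
    {"230": 1.0},
    {"270": 0.5, "230": 0.5},
]
ensemble = ensemble_answer_distribution(paths)
print(max(ensemble, key=ensemble.get), ensemble)
```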
Implementation
```python
import re
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Captures the text after "The answer is" up to a sentence-ending period,
# so decimals such as "2.5" are kept intact.
ANSWER_RE = re.compile(r"The answer is\s*(.*?)\s*(?:\.(?=\s|$)|$)", re.MULTILINE)


def generate_cot(question, n_samples=5, temperature=0.7):
    prompt = f"Q: {question}\nA: Let's think step by step.\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[-1]
    answers = []

    for _ in range(n_samples):
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=temperature,
                do_sample=True,
            )
        # Decode only the newly generated tokens, not the prompt.
        response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

        # Extract final answer
        matches = ANSWER_RE.findall(response)
        if matches:
            answers.append(matches[-1])

    # Majority vote (returns None if no sample produced a parsable answer)
    if not answers:
        return None, {}
    counts = Counter(answers)
    return counts.most_common(1)[0][0], dict(counts)


question = "If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, what is the total distance?"
answer, distribution = generate_cot(question)
print(f"Answer: {answer}")
print(f"Distribution: {distribution}")
```
For more on prompting strategies, see our module on Prompt Engineering.
Practice Exercises
- Empirical: Compare zero-shot, few-shot, and CoT prompting on 10 arithmetic problems. Measure accuracy for each.
- Self-Consistency: Implement self-consistency with 3, 5, 10, and 20 samples. At what point does accuracy plateau?
- ToT: Implement a simple tree-of-thought system for a planning problem. Compare with standard CoT.
- Analysis: Identify 5 problems where CoT helps and 5 where it hurts. What pattern emerges?
Key Takeaways:
- CoT prompting enables multi-step reasoning by decomposing problems
- Zero-shot CoT ("Let's think step by step") provides easy improvement
- Self-consistency selects the majority answer from multiple reasoning paths
- Tree-of-thought explores and prunes reasoning branches
- CoT excels at arithmetic, logic, commonsense, and multi-hop reasoning
- CoT can hurt performance on simple recall and classification tasks
Advanced CoT Methods
Auto-CoT
Auto-CoT automatically generates chain-of-thought demonstrations by clustering questions and selecting representative examples from each cluster. This eliminates the need for manual CoT example crafting.
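The selection step can be sketched as follows: embed each question, cluster the embeddings, and keep the question closest to each cluster centre. This is a simplified illustration, not the published method's implementation. It assumes a toy bag-of-words embedding and a small deterministic k-means written with the standard library; a real pipeline would use a sentence encoder and would then generate a reasoning chain for each selected question. All function names and sample questions are invented for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy unit-length bag-of-words vector (stand-in for a sentence encoder)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def mean_vector(vectors: List[Vector]) -> Vector:
    """Sum the vectors and rescale to unit length."""
    total: Dict[str, float] = defaultdict(float)
    for vec in vectors:
        for w, x in vec.items():
            total[w] += x
    norm = math.sqrt(sum(x * x for x in total.values())) or 1.0
    return {w: x / norm for w, x in total.items()}


def select_demos(questions: List[str], k: int, iters: int = 10) -> List[str]:
    """Cluster questions into k groups and return one representative per group."""
    vectors = [embed(q) for q in questions]
    centroids = vectors[:k]  # deterministic init: the first k questions
    assign = [0] * len(questions)
    for _ in range(iters):
        # Assign each question to its most similar centroid.
        assign = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Recompute each centroid from its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    demos = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if idx:
            best = max(idx, key=lambda i: cosine(vectors[i], centroids[c]))
            demos.append(questions[best])
    return demos


questions = [
    "How many apples are left after eating 3 of 10 apples?",
    "If all cats are animals and Tom is a cat, is Tom an animal?",
    "How many apples are left after selling 4 of 9 apples?",
    "If all birds are animals and Tweety is a bird, is Tweety an animal?",
]
demos = select_demos(questions, k=2)
print(demos)
```

Initialising from the first `k` questions keeps the sketch deterministic but is sensitive to input order; randomised or k-means++ style initialisation is the usual choice in practice.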
Program-of-Thought
Instead of natural language reasoning, generate executable code that solves the problem. The code is executed to produce the answer. This combines the reasoning capabilities of LLMs with the precision of program execution.
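A minimal sketch of the execution step is shown below. The string `generated` stands in for model output and is hand-written here; the convention that the program stores its result in a variable named `answer` is an assumption of this example. Emptying `__builtins__` is only a small guard, not a sandbox, so untrusted model output should be run in an isolated process or container.

```python
def run_generated_program(code: str, result_var: str = "answer"):
    """Execute generated code and return the value bound to `result_var`."""
    # Removing builtins limits what the code can call; this is NOT a real sandbox.
    namespace = {"__builtins__": {}}
    exec(code, namespace)
    return namespace[result_var]


# Hand-written stand-in for code a model might generate for the train problem.
generated = """
speed_1, hours_1 = 60, 2.5
speed_2, hours_2 = 80, 1.5
answer = speed_1 * hours_1 + speed_2 * hours_2
"""

result = run_generated_program(generated)
print(result)
```

Because the arithmetic is carried out by the interpreter rather than by the model, calculation slips inside an otherwise sound reasoning chain are avoided.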
Graph-of-Thought
Extends tree-of-thought by allowing reasoning paths to merge and form a graph structure. This enables sharing of intermediate results across different reasoning branches, improving efficiency and solution quality.
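The sharing of intermediate results can be illustrated with a small dependency graph in which a node may have several parents and several children. In this sketch each node is a plain Python function standing in for a reasoning step, and a cache ensures every node is computed once even when multiple branches depend on it. The graph layout and node names are invented for the example, and cycles are not handled.

```python
from typing import Callable, Dict, List, Tuple

# Each node maps to (parent node names, function combining the parents' results).
Graph = Dict[str, Tuple[List[str], Callable[..., float]]]


def evaluate_graph(graph: Graph, target: str) -> Tuple[float, Dict[str, int]]:
    """Resolve `target`, computing each node at most once."""
    cache: Dict[str, float] = {}
    calls: Dict[str, int] = {}

    def resolve(name: str) -> float:
        if name in cache:  # reuse a result shared with another branch
            return cache[name]
        parents, fn = graph[name]
        inputs = [resolve(p) for p in parents]
        calls[name] = calls.get(name, 0) + 1
        cache[name] = fn(*inputs)
        return cache[name]

    return resolve(target), calls


# "total_distance" feeds two branches: "avg_speed" and the final "check".
graph: Graph = {
    "leg_1": ([], lambda: 60 * 2.5),
    "leg_2": ([], lambda: 80 * 1.5),
    "total_distance": (["leg_1", "leg_2"], lambda a, b: a + b),
    "total_time": ([], lambda: 2.5 + 1.5),
    "avg_speed": (["total_distance", "total_time"], lambda d, t: d / t),
    "check": (
        ["avg_speed", "total_time", "total_distance"],
        lambda s, t, d: abs(s * t - d),
    ),
}
value, calls = evaluate_graph(graph, "check")
print(value, calls)
```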
Evaluating CoT Quality
When evaluating CoT, assess both the reasoning process and the final answer. Key evaluation criteria include:
- Logical coherence of the reasoning steps
- Faithfulness to the provided context or problem
- Completeness of the reasoning chain
- Accuracy of the final answer
- Conciseness of the reasoning process
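Some of these criteria can be partly automated. As one narrow example, the sketch below checks the arithmetic inside a reasoning chain by finding steps of the form `a op b = c` and recomputing them. It covers only simple two-operand expressions and says nothing about faithfulness or completeness; the regular expression and function name are assumptions of this example.

```python
import re
from typing import List, Tuple

# Matches simple steps such as "42 - 6 = 36" or "8 / 2 = 4".
STEP_PATTERN = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def check_arithmetic_steps(reasoning: str, tol: float = 1e-9) -> List[Tuple[str, bool]]:
    """Return (expression, is_correct) for each 'a op b = c' step found."""
    results = []
    for a, op, b, c in STEP_PATTERN.findall(reasoning):
        expression = f"{a} {op} {b} = {c}"
        if op == "/" and float(b) == 0:
            results.append((expression, False))  # division by zero is never valid
            continue
        value = OPS[op](float(a), float(b))
        results.append((expression, abs(value - float(c)) <= tol))
    return results


chain = "They used 6, leaving 42 - 6 = 36. Then 36 + 12 = 48. Finally 5 + 6 = 12."
report = check_arithmetic_steps(chain)
for expression, ok in report:
    print("OK " if ok else "BAD", expression)
```

A chain can pass this check and still be wrong, for example by using the right arithmetic on the wrong quantities, so it complements rather than replaces a review of the reasoning itself.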