Chain-of-Thought Reasoning


Chain-of-thought (CoT) prompting enables LLMs to solve complex problems by decomposing them into intermediate reasoning steps. This tutorial covers CoT variants, theoretical foundations, and practical applications.

Definition: Chain-of-Thought (CoT) Prompting

Chain-of-thought prompting elicits multi-step reasoning from language models by providing or requesting intermediate reasoning steps before the final answer. This technique can substantially improve performance on tasks that require multi-step logical, arithmetic, or commonsense reasoning.

CoT Variants

Zero-Shot CoT

Simply append "Let's think step by step" to the prompt:

```python
prompt = """Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step."""

# A typical model continuation (this is output, not part of the prompt):
#   1. Total balls = 16
#   2. Golf balls = 16 / 2 = 8
#   3. Blue golf balls = 8 / 2 = 4
#   The answer is 4.
```
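
The same pattern can be wrapped in a small helper so that any question gets the trigger phrase appended consistently. This is a minimal sketch; the names `ZERO_SHOT_TRIGGER` and `build_zero_shot_cot_prompt` are illustrative and not part of any library.

```python
ZERO_SHOT_TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str) -> str:
    """Wrap a question in the Q/A format and append the zero-shot CoT trigger."""
    return f"Q: {question.strip()}\nA: {ZERO_SHOT_TRIGGER}"


prompt = build_zero_shot_cot_prompt(
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)
print(prompt)
```

The model is then expected to continue the text after the trigger phrase with its own reasoning steps.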

Few-Shot CoT

Provide reasoning examples in the prompt:

```python
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The school cafeteria ordered 42 apples for the lunches. They used 6 for Monday's lunches. Then they bought 4 more cases of 3 apples each. How many apples do they have now?
A: The cafeteria started with 42 apples. They used 6, leaving 42 - 6 = 36. They bought 4 cases of 3 apples each, which is 4 * 3 = 12 apples. 36 + 12 = 48. The answer is 48.

Q: {question}
A: Let's think step by step."""

# {question} is a placeholder: fill it with prompt.format(question=...) before calling the model.
```
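
When the demonstrations live in a list rather than in one long string, the prompt can be assembled programmatically. The sketch below assumes each demonstration is a `(question, reasoning)` pair; the helper name `build_few_shot_cot_prompt` and the final baker question are made up for illustration.

```python
from typing import List, Tuple


def build_few_shot_cot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join worked (question, reasoning) pairs, then append the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
        "5 + 6 = 11. The answer is 11.",
    ),
]

few_shot_prompt = build_few_shot_cot_prompt(
    examples, "A baker has 12 eggs and uses 5. How many eggs are left?"
)
print(few_shot_prompt)
```

Keeping demonstrations as data makes it easy to swap, reorder, or subsample them when comparing prompt variants.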

Self-Consistency

Generate multiple reasoning paths and select the most common answer:

$$\hat{y} = \arg\max_y \sum_{i=1}^{n} \mathbb{1}[y_i = y]$$
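
The vote itself is a few lines of code. The sketch below assumes the final answers have already been extracted from each sampled reasoning path as strings; `majority_vote` is an illustrative name.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(answers: Iterable[str]) -> Optional[str]:
    """Return the most frequent answer, i.e. argmax_y sum_i 1[y_i = y]."""
    counts = Counter(a.strip() for a in answers)
    if not counts:
        return None  # no sampled path produced an answer
    return counts.most_common(1)[0][0]


sampled = ["4", "4", "8", "4", "2"]
print(majority_vote(sampled))  # → 4
```

Ties are broken by first occurrence, because `Counter.most_common` keeps insertion order among equal counts.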

Self-Consistency Probability

$$P(y \mid x) = \sum_{r \in \mathcal{R}_y} P(r \mid x) \cdot P(y \mid r, x)$$

The selected answer $\hat{y}$ is the $y$ that maximizes this probability; the majority vote estimates it from sampled paths.

Here,

  • $\mathcal{R}_y$ = Set of reasoning paths leading to answer $y$
  • $P(r \mid x)$ = Probability of reasoning path $r$ given input $x$
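
To make the sum concrete, the sketch below marginalizes over a small hand-written set of reasoning paths. The paths and their probabilities are invented for illustration, and each path is assumed to end in exactly one answer, so $P(y \mid r, x)$ is 1 for that answer and 0 otherwise.

```python
from collections import defaultdict

# Hypothetical toy distribution: (reasoning path, P(r | x), answer the path reaches)
paths = [
    ("halve the balls, then halve the golf balls", 0.5, "4"),
    ("take a quarter of the balls directly", 0.2, "4"),
    ("halve the balls only once", 0.2, "8"),
    ("subtract 2 instead of halving", 0.1, "14"),
]


def answer_distribution(paths):
    """P(y | x) = sum of P(r | x) over the paths r that end in answer y."""
    dist = defaultdict(float)
    for _, p_r, y in paths:
        dist[y] += p_r
    return dict(dist)


dist = answer_distribution(paths)
best = max(dist, key=dist.get)
print(best, dist)
```

Two different paths reach "4", so their probabilities add up, which is why an answer supported by several independent lines of reasoning wins the vote.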

Tree-of-Thought (ToT)

Definition: Tree-of-Thought

Tree-of-Thought explores multiple reasoning branches at each step, evaluates them, and prunes unpromising paths. It uses a search algorithm (BFS or DFS) to find the best reasoning trajectory.

ToT State Evaluation

$$V(s) = \text{LLM}(\text{``Evaluate if state } s \text{ is promising''})$$

Here,

  • $s$ = Current reasoning state
  • $V(s)$ = Value/quality score of the state
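
The search loop can be sketched without a model by passing the proposal and evaluation steps in as functions. In the sketch below, `propose` and `evaluate` are stand-ins for what would be LLM calls in a real system, and the toy task (reach a target sum using fixed step sizes) is an assumption chosen only so the example runs on its own.

```python
from typing import Callable, Iterable, List, Optional, Tuple

State = Tuple[int, ...]  # toy state: the numbers chosen so far


def tree_of_thought_bfs(
    root: State,
    propose: Callable[[State], Iterable[State]],
    evaluate: Callable[[State], float],
    is_goal: Callable[[State], bool],
    beam_width: int = 2,
    max_depth: int = 5,
) -> Optional[State]:
    """Breadth-first search that keeps only the `beam_width` best states per level."""
    frontier: List[State] = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        if not candidates:
            return None  # every branch was a dead end
        # Prune: rank states by V(s) and keep the most promising ones.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return None


# Toy problem (an assumption for illustration): reach TARGET by summing steps.
TARGET = 10
STEPS = (1, 3, 4)


def propose(state: State) -> List[State]:
    """Extend the state by one step, skipping moves that overshoot the target."""
    return [state + (step,) for step in STEPS if sum(state) + step <= TARGET]


def evaluate(state: State) -> float:
    """Stand-in for V(s): states closer to the target score higher."""
    return -(TARGET - sum(state))


def is_goal(state: State) -> bool:
    return sum(state) == TARGET


solution = tree_of_thought_bfs((), propose, evaluate, is_goal)
print(solution)
```

Replacing the sorted-and-truncated frontier with a stack would give the DFS variant; the beam width controls how aggressively unpromising branches are pruned.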

When CoT Helps

Tasks Where CoT Excels

| Task Type | Example | CoT Improvement |
|---|---|---|
| Arithmetic | Multi-step math | +30-40% |
| Logic | Syllogisms | +20-30% |
| Commonsense | Physical reasoning | +15-25% |
| Symbolic | Variable tracking | +25-35% |
| Multi-hop QA | Reading comprehension | +10-20% |

Tasks Where CoT Hurts

  • Simple factual recall: CoT adds unnecessary complexity
  • Classification: CoT can lead to overthinking
  • Creative writing: CoT constrains creativity
  • Translation: CoT is not helpful for direct mapping

CoT is most effective when the problem requires multiple reasoning steps that cannot be easily compressed into a single inference. If the answer can be retrieved from memory, CoT may actually hurt performance.

Mathematical Foundation

CoT as Search

CoT can be viewed as a search problem in the space of reasoning sequences:

CoT Search Space

$$\mathcal{R} = \{(r_1, r_2, \ldots, r_k) : P(r_{t+1} \mid r_1, \ldots, r_t, x) > \tau \ \text{ for all } 0 \le t < k\}$$

Here,

  • $\mathcal{R}$ = Space of valid reasoning sequences
  • $r_t$ = Reasoning step at position $t$
  • $\tau$ = Probability threshold for pruning
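
The threshold rule can be illustrated by enumerating sequences over a tiny, hand-written step model. The table `STEP_PROBS` below is invented for illustration; in practice the step probabilities would come from the language model.

```python
from typing import Dict, List, Tuple

# Hypothetical step model: P(next step | steps so far, x) for one fixed question x.
STEP_PROBS: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"golf = 16 / 2": 0.8, "golf = 16 - 2": 0.2},
    ("golf = 16 / 2",): {"blue = 8 / 2": 0.9, "blue = 8 * 2": 0.1},
    ("golf = 16 - 2",): {"blue = 14 / 2": 1.0},
}


def valid_sequences(prefix: Tuple[str, ...] = (), tau: float = 0.15) -> List[Tuple[str, ...]]:
    """Enumerate complete sequences in which every step has probability > tau."""
    options = STEP_PROBS.get(prefix)
    if options is None:  # no continuation defined: the sequence is complete
        return [prefix]
    results: List[Tuple[str, ...]] = []
    for step, p in options.items():
        if p > tau:  # prune low-probability steps
            results.extend(valid_sequences(prefix + (step,), tau))
    return results


print(valid_sequences(tau=0.15))  # two sequences survive
print(valid_sequences(tau=0.5))   # only the high-probability sequence survives
```

Raising $\tau$ shrinks the search space but risks pruning a correct path whose first step happened to look unlikely.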

Self-Consistency as Ensemble

Self-consistency can be viewed as an ensemble of different reasoning strategies:

Ensemble View of Self-Consistency

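$$P(y \mid x) = \sum_{r} P(y \mid r, x) \cdot P(r \mid x) \approx \frac{1}{n} \sum_{i=1}^{n} P(y \mid r_i, x), \qquad r_i \sim P(r \mid x)$$

Because each $r_i$ is drawn from $P(r \mid x)$, the path probability is already accounted for by the sampling and does not appear as an extra factor in the average.

Here,

  • $n$ = Number of sampled reasoning paths
  • $r_i$ = Reasoning path $i$, sampled from $P(r \mid x)$
  • $P(y \mid r_i, x)$ = Answer probability given reasoning path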
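
The sampling view can be simulated without a model. In the sketch below, `sample_answer` is a hypothetical stand-in for "sample one reasoning path and read off its answer"; the 0.6 success rate and the wrong answers are assumptions made for the simulation.

```python
import random
from collections import Counter


def sample_answer(rng: random.Random) -> str:
    """Simulated model: correct answer "4" with probability 0.6, else a wrong one."""
    if rng.random() < 0.6:
        return "4"
    return rng.choice(["8", "2", "14"])


def estimate_answer_distribution(n: int, seed: int = 0) -> dict:
    """Monte Carlo estimate of P(y | x): the fraction of samples ending in each answer."""
    rng = random.Random(seed)
    counts = Counter(sample_answer(rng) for _ in range(n))
    return {y: c / n for y, c in counts.items()}


est = estimate_answer_distribution(2000)
print(max(est, key=est.get), est)
```

With more samples the estimated frequencies approach the underlying probabilities, which is why the majority answer stabilizes as $n$ grows.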

Implementation

```python
import re
from collections import Counter

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def generate_cot(question, n_samples=5, temperature=0.7):
    """Sample several reasoning paths and return the majority answer plus all vote counts."""
    prompt = f"Q: {question}\nA: Let's think step by step.\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_length = inputs["input_ids"].shape[-1]
    answers = []

    for _ in range(n_samples):
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=temperature,
            do_sample=True,
        )
        # Decode only the newly generated tokens, not the prompt.
        response = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)

        # Extract the final answer; samples that never state one are skipped.
        if "The answer is" in response:
            tail = response.split("The answer is")[-1].strip()
            # Cut at the first sentence-ending period so decimals such as "2.5" survive.
            answer = re.split(r"\.(?:\s|$)", tail, maxsplit=1)[0].strip()
            if answer:
                answers.append(answer)

    # Majority vote over the extracted answers
    if not answers:
        return None, {}
    counts = Counter(answers)
    return counts.most_common(1)[0][0], dict(counts)


question = "If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, what is the total distance?"
answer, distribution = generate_cot(question)
print(f"Answer: {answer}")
print(f"Distribution: {distribution}")
```
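
The majority vote in `generate_cot` compares raw strings, so "270 miles" and "270" would be counted as different answers. One possible remedy, sketched below, is to normalize numeric answers before voting. The helper `normalize_answer` is illustrative and only handles answers that contain a number.

```python
import re
from collections import Counter
from typing import Optional

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize_answer(raw: str) -> Optional[str]:
    """Reduce a free-text answer to a canonical numeric string, if it contains a number."""
    match = _NUMBER.search(raw)
    if match is None:
        return None
    value = float(match.group().replace(",", ""))
    # Render 270.0 as "270" but keep genuine decimals such as "2.5".
    return str(int(value)) if value.is_integer() else str(value)


raw_answers = ["270 miles", "270", "270.0 miles", "$1,200"]
votes = Counter(normalize_answer(a) for a in raw_answers)
print(votes.most_common(1)[0][0])  # → 270
```

Non-numeric tasks need their own normalization (for example lowercasing and stripping punctuation), so this step is best kept separate from the sampling loop.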

For more on prompting strategies, see our module on Prompt Engineering.

Practice Exercises

  1. Empirical: Compare zero-shot, few-shot, and CoT prompting on 10 arithmetic problems. Measure accuracy for each.
  2. Self-Consistency: Implement self-consistency with 3, 5, 10, and 20 samples. At what point does accuracy plateau?
  3. ToT: Implement a simple tree-of-thought system for a planning problem. Compare with standard CoT.
  4. Analysis: Identify 5 problems where CoT helps and 5 where it hurts. What pattern emerges?
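
As a possible starting point for the empirical exercises, the sketch below scores any prompting strategy against a list of `(question, gold answer)` pairs. The mini-dataset and the two placeholder strategies are made up; a real comparison would plug in functions that call a model.

```python
from typing import Callable, List, Tuple


def accuracy(predict: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    """Fraction of questions whose predicted answer exactly matches the gold answer."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(predict(q).strip() == gold.strip() for q, gold in dataset)
    return correct / len(dataset)


# Hypothetical mini-dataset and placeholder strategies (no model involved).
dataset = [("2 + 3 * 4", "14"), ("(2 + 3) * 4", "20"), ("10 - 4 / 2", "8")]
strategies = {
    "always_14": lambda q: "14",          # a deliberately weak baseline
    "oracle": lambda q: dict(dataset)[q],  # looks up the gold answer
}

results = {name: accuracy(fn, dataset) for name, fn in strategies.items()}
print(results)
```

Exact string matching is strict, so answers should be normalized to a common form before scoring.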

Key Takeaways:

  • CoT prompting enables multi-step reasoning by decomposing problems
  • Zero-shot CoT ("Let's think step by step") is a low-effort way to elicit step-by-step reasoning
  • Self-consistency selects the majority answer from multiple reasoning paths
  • Tree-of-thought explores and prunes reasoning branches
  • CoT excels at arithmetic, logic, commonsense, and multi-hop reasoning
  • CoT can hurt performance on simple recall and classification tasks

Advanced CoT Methods

Auto-CoT

Auto-CoT automatically generates chain-of-thought demonstrations by clustering questions and selecting representative examples from each cluster. This eliminates the need for manual CoT example crafting.

Program-of-Thought

Instead of natural language reasoning, generate executable code that solves the problem. The code is executed to produce the answer. This combines the reasoning capabilities of LLMs with the precision of program execution.
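
A minimal sketch of the execution step is shown below. The string `generated_program` is a hypothetical model output for the train question from the implementation section, and the convention that the program stores its result in a variable named `answer` is an assumption of this sketch.

```python
# Hypothetical model output: code instead of prose reasoning.
generated_program = """
leg1 = 60 * 2.5   # miles covered at 60 mph
leg2 = 80 * 1.5   # miles covered at 80 mph
answer = leg1 + leg2
"""


def run_program_of_thought(program: str):
    """Execute generated code in a fresh namespace and return its `answer` variable."""
    namespace = {"__builtins__": {}}  # no builtins: a weak guard, NOT a real sandbox
    exec(program, namespace)
    return namespace.get("answer")


print(run_program_of_thought(generated_program))  # → 270.0
```

Executing model-generated code is a security risk; a production system would run it in an isolated process or container rather than with `exec` in the host interpreter.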

Graph-of-Thought

Graph-of-Thought extends tree-of-thought by allowing reasoning paths to merge, so that the thoughts form a graph rather than a tree. Intermediate results can then be shared across reasoning branches, which improves efficiency and solution quality.
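
The structural difference from a tree can be shown with a toy dependency graph in which two branches reuse one intermediate result and then merge. In the sketch below, the node names and the `COMPUTE` functions are invented; in a real system each node would be produced by an LLM call.

```python
from typing import Callable, Dict, List

# Hypothetical thought graph: each node lists the nodes it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "golf_balls": [],
    "blue_golf": ["golf_balls"],
    "non_blue_golf": ["golf_balls"],
    "check_total": ["blue_golf", "non_blue_golf"],  # two branches merge here
}

COMPUTE: Dict[str, Callable[..., int]] = {
    "golf_balls": lambda: 16 // 2,
    "blue_golf": lambda golf: golf // 2,
    "non_blue_golf": lambda golf: golf - golf // 2,
    "check_total": lambda blue, non_blue: blue + non_blue,
}


def evaluate(node: str, cache: Dict[str, int], call_counts: Dict[str, int]) -> int:
    """Evaluate a node once; later requests reuse the cached result."""
    if node not in cache:
        inputs = [evaluate(dep, cache, call_counts) for dep in DEPENDS_ON[node]]
        call_counts[node] = call_counts.get(node, 0) + 1
        cache[node] = COMPUTE[node](*inputs)
    return cache[node]


cache: Dict[str, int] = {}
call_counts: Dict[str, int] = {}
total = evaluate("check_total", cache, call_counts)
print(total, call_counts)
```

The shared node `golf_balls` is computed only once even though two branches depend on it, which is the efficiency gain a tree cannot express.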

Evaluating CoT Quality

When evaluating CoT, assess both the reasoning process and the final answer. Key evaluation criteria include:

  • Logical coherence of the reasoning steps
  • Faithfulness to the provided context or problem
  • Completeness of the reasoning chain
  • Accuracy of the final answer
  • Conciseness of the reasoning process
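
Some of these criteria can be checked automatically. As one example, the sketch below verifies every explicit `a op b = c` statement in a reasoning chain, which covers a narrow slice of logical coherence for arithmetic problems. The regular expression and the function name are illustrative, and the check ignores steps that are stated only in words.

```python
import re
from typing import List, Tuple

_NUM = r"-?\d+(?:\.\d+)?"
STEP = re.compile(rf"({_NUM})\s*([+\-*/])\s*({_NUM})\s*=\s*({_NUM})")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def check_arithmetic_steps(reasoning: str, tol: float = 1e-9) -> List[Tuple[str, bool]]:
    """Return (step_text, is_correct) for every 'a op b = c' found in the text."""
    results = []
    for m in STEP.finditer(reasoning):
        a, op, b, c = m.groups()
        if op == "/" and float(b) == 0:
            results.append((m.group(0), False))  # division by zero is never valid
            continue
        ok = abs(OPS[op](float(a), float(b)) - float(c)) <= tol
        results.append((m.group(0), ok))
    return results


chain = (
    "They used 6, leaving 42 - 6 = 36. They bought 4 * 3 = 12 apples. "
    "36 + 12 = 48. The answer is 48."
)
print(check_arithmetic_steps(chain))
print(check_arithmetic_steps("Roger has 5 + 6 = 12 balls."))
```

A chain can pass this check and still be unfaithful to the problem, so it complements rather than replaces a review of the full reasoning.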
