Reinforcement Learning — Complete Guide
Reinforcement learning trains agents to make decisions by maximizing cumulative reward through trial and error.
RL Framework
Agent interacts with Environment:
State (S) → Action (A) → Reward (R) → New State (S')
Goal: Learn a policy π that maximizes the expected cumulative (discounted) reward
Key concepts:
├─ State: Current situation
├─ Action: What agent can do
├─ Reward: Feedback signal
├─ Policy: Strategy (state → action)
├─ Value function V(s): Expected cumulative reward starting from state s
└─ Q-value Q(s,a): Expected cumulative reward after taking action a in state s
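The interaction loop above can be sketched in a few lines. This is a minimal illustration, not a standard API: the `LineWorld` environment, its `reset`/`step` methods, and the `run_episode` helper are all hypothetical names made up for this example.

```python
import random

class LineWorld:
    """Hypothetical toy environment: positions 0..4, goal at position 4."""
    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left; walls clamp the move
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.size - 1)
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0  # feedback signal: 1 only at the goal
        return self.state, reward, done

def run_episode(env, policy, gamma=0.9, max_steps=100):
    """Roll out one episode and return the discounted cumulative reward."""
    state = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                  # policy: state -> action
        state, reward, done = env.step(action)  # environment responds
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

always_right = lambda s: 1
random_policy = lambda s: random.choice([0, 1])
print(run_episode(LineWorld(), always_right))
print(run_episode(LineWorld(), random_policy))
```

The always-right policy reaches the goal in four steps, so its return is the reward discounted three times; the random policy usually takes longer and therefore tends to earn less.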
Q-Learning
Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
α = learning rate (step size, 0 < α ≤ 1)
γ = discount factor in [0, 1] (weight of future rewards relative to immediate reward)
Algorithm:
1. Initialize Q-table with zeros
2. Observe state s
3. Choose action (ε-greedy)
4. Take action, observe reward r and new state s'
5. Update Q: Q(s,a) += α[r + γ·max_a' Q(s',a') - Q(s,a)]
6. Set s = s' and repeat from step 3 until the episode ends, then start a new episode
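The steps above can be sketched as tabular Q-learning on a tiny problem. The 5-cell corridor, the `step` function, and all hyperparameter values are assumptions chosen for illustration, not part of the algorithm itself.

```python
import random

N_STATES, N_ACTIONS, GOAL = 5, 2, 4  # hypothetical 5-cell corridor

def step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1 at the goal."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise pick a best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so the untrained agent does not get stuck
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # step 1
    for _ in range(episodes):
        s, done = 0, False                              # step 2
        while not done:
            a = epsilon_greedy(q[s], epsilon, rng)      # step 3
            s2, r, done = step(s, a)                    # step 4
            # A terminal state has no future value to bootstrap from
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])       # step 5
            s = s2                                      # step 6
    return q

q = q_learning()
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)  # greedy action for each non-terminal state
```

After training, the greedy action in every non-terminal cell should be "move right" (action 1), since that is the shortest path to the reward.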
Deep Q-Network (DQN)
Q-table → Q-network (neural network)
Input: State
Output: Q-value for each action
Features:
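├─ Experience replay: Store transitions in a buffer and train on random minibatches to decorrelate samples
├─ Target network: Compute targets with a periodically updated copy of the Q-network to stabilize training
└─ Double DQN (extension): Select a' with the online network, evaluate it with the target network to reduce overestimation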
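The three ideas above can be sketched without any deep learning library. This is a simplified illustration: `ReplayBuffer`, `dqn_target`, and `double_dqn_target` are hypothetical names, and plain lists of Q-values stand in for the outputs of the online and target networks.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s2, done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive steps
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def dqn_target(r, done, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates a'."""
    return r if done else r + gamma * max(q_target_next)

def double_dqn_target(r, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: online network selects a', target network evaluates it."""
    if done:
        return r
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[best]

# Stand-in Q-values for the next state s' (assumed, for illustration only)
q_online_next = [1.0, 2.0]   # online network prefers action 1
q_target_next = [3.0, 0.5]   # target network overrates action 0
print(dqn_target(0.0, False, q_target_next, gamma=0.5))
print(double_dqn_target(0.0, False, q_online_next, q_target_next, gamma=0.5))
```

In this example the Double DQN target is smaller than the standard one, because the action is chosen by one set of estimates and scored by the other, so a single inflated value no longer sets the target.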
Policy Gradient
Directly optimize the parameters θ of a policy π_θ by gradient ascent on the expected return:
J(θ) = E_π[Σₜ γᵗ rₜ]
REINFORCE algorithm:
1. Collect trajectory using current policy
2. Compute the return from each step t onward: Gₜ = Σₖ₌ₜ γᵏ⁻ᵗ rₖ
3. Update for each step t: θ ← θ + α·Gₜ·∇θ log π(aₜ|sₜ)
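A minimal sketch of steps 2 and 3, assuming a tabular softmax policy whose parameters `theta[s][a]` are action preferences. The function names and the trajectory format of (state, action, reward) tuples are assumptions for this example; the return for step t counts rewards from step t onward.

```python
import math

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backward over one trajectory."""
    returns, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, trajectory, alpha=0.1, gamma=0.99):
    """Apply theta += alpha * G_t * grad log pi(a_t|s_t) for each step t."""
    returns = discounted_returns([r for _, _, r in trajectory], gamma)
    for (s, a, _), g in zip(trajectory, returns):
        probs = softmax(theta[s])
        for b in range(len(theta[s])):
            # For a softmax policy:
            # d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * g * grad
    return theta

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]

theta = [[0.0, 0.0]]            # one state, two actions, uniform policy
trajectory = [(0, 1, 1.0)]      # took action 1 in state 0, got reward 1
reinforce_update(theta, trajectory)
print(softmax(theta[0]))        # action 1 is now more likely than action 0
```

Actions followed by a high return have their probability pushed up in proportion to that return, which is why REINFORCE has high variance when returns differ a lot between trajectories.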
Advantages:
├─ Can handle continuous action spaces
├─ Learns stochastic policies
└─ Works with high-dimensional states when the policy is a neural network
Actor-Critic
Combines value-based and policy-based:
Actor: Learns policy π(a|s) (what to do)
Critic: Learns value V(s) (how good a state is)
A2C (Advantage Actor-Critic):
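├─ Actor is updated along ∇log π(a|s) weighted by the advantage
├─ Critic estimates the value V(s), which serves as the baseline
└─ Advantage = actual return - baseline, i.e. A(s,a) ≈ G - V(s), an estimate of Q(s,a) - V(s)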
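The advantage computation can be sketched as follows, assuming a tabular critic and full-episode returns. The names `advantages` and `critic_update` and the example numbers are hypothetical; practical A2C implementations typically use neural networks and bootstrapped n-step returns instead.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, computed backward over one trajectory."""
    returns, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def advantages(rewards, values, gamma=0.99):
    """A_t = G_t - V(s_t): observed return minus the critic's baseline."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]

def critic_update(v_table, states, rewards, alpha=0.1, gamma=0.99):
    """Nudge each V(s_t) toward the observed return G_t (tabular critic)."""
    returns = discounted_returns(rewards, gamma)
    for s, g in zip(states, returns):
        v_table[s] += alpha * (g - v_table[s])
    return v_table

rewards = [1.0, 0.0, 2.0]
values = [1.0, 1.0, 1.0]   # hypothetical critic estimates V(s_t)
print(advantages(rewards, values, gamma=0.5))  # [0.5, 0.0, 1.0]
```

A positive advantage means the action did better than the critic expected, so the actor makes it more likely; a negative advantage makes it less likely. Subtracting the baseline leaves the expected gradient unchanged while reducing its variance.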
Key Takeaways
- RL trains agents through trial and error
- Q-learning learns action values
- DQN scales Q-learning with neural networks
- Policy gradients directly optimize the policy
- Actor-critic combines both approaches
- Exploration vs exploitation is the key tradeoff
- RL requires careful reward design
- Sim-to-real transfer matters for robotics: policies trained in simulation must carry over to real hardware