Reinforcement learning, from scratch.
How machines learn to play, win, and repeat — built up from the very first idea. Hands-on games, real math, and the whole arc from cat-or-dog to AlphaGo.
The lectures
How machines learn to play
From supervised classification to AlphaGo. Hands-on interactives: predict-the-next-word, a 3-box bandit, and ε-greedy in action.
What is reinforcement learning?
Make the picture precise: agent, environment, state, action, reward. This is the formal vocabulary you'll use for the rest of the course. Includes an "Is this RL?" quiz and a design-your-own-policy interactive.
Long-term reward & value functions
From single rewards to lifetime planning. Return G, discount γ, Vπ, Qπ, and the optimal V*. Interactive γ-slider + Vπ visualizer.
Evaluating a strategy
Bellman expectation, iterative policy evaluation, convergence. Watch V values flow outward from the goal across a 4×4 grid, sweep by sweep, live.
Improving a strategy
Greedy improvement, policy iteration, value iteration, Bellman optimality, and π*. Watch a random 5×5 policy turn optimal in a few rounds.
Learning without a map
Bandits revisited, ε-greedy, Monte Carlo, TD, Q-learning, SARSA. Two interactives: ε-greedy bandit + live Q-learning trace on a 4×4 grid.
Putting it together
Synthesis. The big map, method comparator (5 algorithms), 8 project starters, ~30 lines of Python Q-learning, debugging guide. Bring your laptop.
RL, looking forward
From your gridworld to the frontier. DQN, AlphaGo lineage (→ MuZero), RLHF for ChatGPT, real-world apps, open problems, safety & reward hacking.
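For a taste of what "Learning without a map" and "Putting it together" build toward, here is a minimal sketch of tabular Q-learning with ε-greedy exploration on a 4×4 grid. It is not the course's own code. The layout (start in the top-left cell, goal in the bottom-right), the reward of -1 per step, and the values of alpha, gamma, and epsilon are all assumptions picked for illustration, as are the helper names `step`, `train`, and `greedy_path`.

```python
import random

N = 4                      # assumed grid size: N x N, states numbered 0..N*N-1
GOAL = N * N - 1           # assumed goal: bottom-right corner
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move one cell; bumping into a wall leaves the state unchanged."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    next_state = row * N + col
    # Assumed reward scheme: -1 per move, so shorter paths score higher.
    return next_state, -1.0, next_state == GOAL


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; hyperparameters are illustrative, not tuned."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N * N)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))
            else:
                action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning target: bootstrap from the best next action,
            # regardless of which action the agent actually takes next.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_path(q, start=0, max_steps=50):
    """Follow the greedy policy from start; stop at the goal or max_steps."""
    path, state = [start], start
    while state != GOAL and len(path) <= max_steps:
        action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
        state, _, _ = step(state, action)
        path.append(state)
    return path


q_table = train()
print(greedy_path(q_table))
```

With these settings the printed path should run from state 0 to state 15 in six moves. Which of the equally short routes it takes depends on how ties between actions are broken.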