From spotting cats in photos to mastering Go — we'll trace the leap from supervised learning to reinforcement learning, and play a tiny RL game ourselves.
"Here are 10,000 examples with the right answer. Learn the pattern."
"Here's a world. Try things. I'll tell you when you do well."
Show the model an example, tell it the right answer, and let it adjust. Repeat millions of times with new examples — until it gets the pattern right on cards it has never seen.
Each card = one labeled example. The model's job: predict the label of a new card.
A trained model takes any image and sorts it into one of a fixed set of classes. The output is a single guess.
Real image classifiers power photo apps, read medical scans, and even sort recyclables.
Every supervised learning problem boils down to one line of math — find the model that makes its mistakes as small as possible.
Training = searching through millions of possible \(f\)'s to find the one that's wrong the least.
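Written out, that one line of math might look like the following. Treat it as a sketch: the slides only name \(f\), \(\ell\), \(x\), and \(y\), so the average over \(N\) training examples and the index \(i\) are assumptions added here for illustration.

```latex
\[
f^{*} \;=\; \arg\min_{f} \; \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(f(x_i),\, y_i\bigr)
\]
```

Here \(x_i\) is one card, \(y_i\) is its label, \(f(x_i)\) is the model's guess, and \(\ell\) scores how wrong that guess is. Training looks for the \(f\) with the smallest average score.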
Same recipe as cat-vs-dog. Show the model billions of sentences, hide the next word, ask it to guess, shrink the loss. The only twist — instead of one label, it produces something one piece at a time.
Same \(\min \, \ell(f(x), y)\) — but now \(y\) is a piece of the output (a word, a note, a slice of noise), not a label.
An LLM picks the next word from a probability distribution — over and over. Tap a word to add it. Watch the sentence grow.
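One step of that word-picking can be sketched in a few lines of Python. The prompt and the probabilities below are invented for illustration and do not come from a real model. A real LLM would also recompute the distribution after every new word.

```python
import random

# Hypothetical next-word probabilities after the prompt "The cat sat on the".
# These numbers are made up for illustration, not taken from a real model.
next_word_probs = {"mat": 0.6, "sofa": 0.25, "roof": 0.1, "moon": 0.05}


def sample_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
sentence = ["The", "cat", "sat", "on", "the"]
sentence.append(sample_next_word(next_word_probs, rng))
print(" ".join(sentence))
```

Likely words such as "mat" come up most often, but an unlikely word can still be picked now and then. That is why the same prompt can grow into different sentences.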
That was a lot — flashcards, math, even ChatGPT. Before we leave supervised learning behind and try something completely different, let's pause.
But what about problems with no answer key?
How do you learn to ride a bike, win a chess match, or land a rocket — when no one can show you the "right answer" at every step?
Try to imagine a flashcard set for any of these. You can't — not because nobody tried, but because no human knows the right answer at every moment.
No teacher. No labels. Just try things, see what happens, get rewarded. Sound familiar? That's how you learn most things — and it's where RL lives.
To do math on bikes, games, and puppies, we need a picture that fits all of them. RL boils every one of those situations down to two characters — an agent and an environment — passing messages back and forth.
The agent's policy π — its strategy — picks an action. The world updates and hands back a new state + a reward, which the agent uses to nudge its policy. Repeat forever. Mathematicians call this a Markov Decision Process — but you can just call it a loop.
After every action, the world hands back one number — the score for that step. Sometimes it's a big jackpot at the end (you won the game). Sometimes it's a little something every step (a tasty scoop of ice cream). Either way, the agent's only job: make the total as big as possible.
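The loop and the running total can be sketched as below. Everything here is a made-up toy: `CoinFlipWorld`, `random_policy`, and `run_episode` are hypothetical names, and the policy only guesses at random. It does not learn yet, so the "nudge the policy" step is deliberately left out.

```python
import random


class CoinFlipWorld:
    """Hypothetical toy environment: guess a coin flip, +1 if right, 0 if wrong."""

    def step(self, action, rng):
        coin = rng.choice(["heads", "tails"])
        reward = 1 if action == coin else 0
        next_state = coin  # the state is simply the last coin result
        return next_state, reward


def random_policy(state, rng):
    """Placeholder policy: ignores the state and guesses at random."""
    return rng.choice(["heads", "tails"])


def run_episode(env, policy, steps, rng):
    """Run the agent-environment loop and add up the rewards."""
    state = None
    total_reward = 0
    for _ in range(steps):
        action = policy(state, rng)            # agent picks an action
        state, reward = env.step(action, rng)  # world returns state + reward
        total_reward += reward                 # the number the agent wants big
    return total_reward


rng = random.Random(0)
total = run_episode(CoinFlipWorld(), random_policy, steps=10, rng=rng)
print("total reward:", total)
```

A learning agent would swap `random_policy` for one that changes based on the rewards it has seen.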
You walk into an ice cream shop with 31 flavors. Mint chip is your reliable favorite. Do you order it again — or finally try the one with the weird name?
Maybe you'll discover something better. Maybe you'll waste a scoop on a flavor you hate.
Guaranteed-good ice cream. But you'll never find anything better than mint chip.
Every RL agent has to balance these. Too much exploring = waste. Too much exploiting = miss the best.
Go has more possible board positions than there are atoms in the observable universe. You can't memorize Go; you have to understand it.
Supervised learning on 30 million pro moves. "Given this board, what would a human play?" Now AlphaGo plays like a strong amateur.
Reinforcement learning from self-play. Two copies of AlphaGo battle for millions of games. Reward = +1 for a win, −1 for a loss. It invents its own strategies.
Tree search at game time. Imagines thousands of futures, picks the move that — across all those futures — gives the best expected return.
Stage 2 is the magic. Without an answer key, RL let the system surpass every human teacher it had.
It's not a human move. I've never seen a human play this move. So beautiful.
— Fan Hui, professional Go player, watching live
Estimated probability of a human playing it: 1 in 10,000. AlphaGo discovered it through self-play — no teacher could have shown it this move. Lee Sedol stared at the board for 12 minutes.
I thought AlphaGo was based on probability calculation and it was merely a machine. But when I saw this move, I changed my mind. Surely AlphaGo is creative.
— Lee Sedol, after the match
From the AlphaGo documentary — click to open at 51:46 in a new tab.
AlphaGo started by copying 30 million pro moves. So where did Move 37 come from? Talk to a neighbor — guess before we answer.
One game: three mystery boxes, each hiding a reward you can't see. Tap to peek inside. You decide every move — and you only get 20 chances.
How would a computer play that game? Here's the simplest RL algorithm that works — it's just a coin flip and a running average. (ε is the Greek letter "epsilon" — think of it as "how curious am I?")
Same three boxes. Same hidden means. Same 20 pulls. Each turn we roll a die 🎲: if it lands on 1, ε-greedy explores (random box). On 2–6, it exploits (best box so far). That's ε = 1/6.
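Here is a minimal Python sketch of ε-greedy on the three-box game. The hidden means (2, 5, 3), the Gaussian noise on each reward, and the function name `epsilon_greedy` are assumptions chosen for illustration. The lecture does not say what is actually inside the boxes.

```python
import random


def epsilon_greedy(true_means, pulls=20, epsilon=1 / 6, seed=0):
    """Play the box game with epsilon-greedy and a running average per box."""
    rng = random.Random(seed)
    n_boxes = len(true_means)
    counts = [0] * n_boxes      # how many times each box was opened
    averages = [0.0] * n_boxes  # running average reward per box
    total = 0.0

    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: pick a box uniformly at random.
            box = rng.randrange(n_boxes)
        else:
            # Exploit: pick the best average so far (ties go to the first box).
            box = max(range(n_boxes), key=lambda b: averages[b])

        # Assumed reward model: hidden mean plus Gaussian noise (std 1.0).
        reward = rng.gauss(true_means[box], 1.0)

        # Update the running average without storing every reward.
        counts[box] += 1
        averages[box] += (reward - averages[box]) / counts[box]
        total += reward

    return counts, averages, total


counts, averages, total = epsilon_greedy([2.0, 5.0, 3.0])
print("pulls per box:", counts)
print("estimated means:", [round(a, 2) for a in averages])
print("total reward:", round(total, 2))
```

Try changing `epsilon`. At 0 the agent never explores on purpose and can get stuck on the first box that pays anything. At 1 it picks at random every turn and never uses what it has learned.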
You watched it play. You compared it to your own strategy. Now turn to a partner — or shout it out:
Tap a phrase on the left, then tap its match on the right. Get all 6 right and you've spoken the language of RL.
The same loop you saw earlier — agent, policy, action, state, reward, environment. Now tap any piece to see what the next 7 lectures will tackle. For each, ask yourself: what would YOU try?
Remember "predict the next word" from earlier? That's just step one. Modern LLMs follow the same two-stage recipe as AlphaGo: first imitate, then RL.
Same idea, different game. RL is what turns a fluent imitator into something that can surpass its teacher.
Show many labeled examples. Minimize the loss \(\ell\) between the model's guess and the right answer. Powers most AI you use today.
No labels. The agent acts, the environment rewards. Maximize total reward G over time.
Every learning agent — including you in the box game — must balance trying new things and using what works.
AlphaGo's Move 37 wasn't taught — it was discovered. Self-play RL invents strategies no teacher knows.