LECTURE 01 · INTRO TO RL

How machines learn
to play, win, repeat.

From spotting cats in photos to mastering Go — we'll trace the leap from supervised learning to reinforcement learning, and play a tiny RL game ourselves.

🎯 Classification ✨ Generation ⚫ AlphaGo 🎮 Play a game

Title

01 / 27

THE BIG PICTURE

Two ways a machine can learn.

Style A

Supervised
Learning

"Here are 10,000 examples with the right answer. Learn the pattern."

Needs a teacher with answers
One guess per example, graded right away
Great for: spam detection, photo tagging, translation

VS

Style B

Reinforcement
Learning

"Here's a world. Try things. I'll tell you when you do well."

No answers — only rewards
Many tries, learning from each
Great for: games, robots, self-driving

Two Ways

02 / 27

PART 1 · SUPERVISED LEARNING

Like studying with flashcards.

Show the model an example, tell it the right answer, and let it adjust. Repeat millions of times with new examples — until it gets the pattern right on cards it has never seen.

🐱CAT

🐶DOG

🦊FOX

🐰RABBIT

Each card = one labeled example. The model's job: predict the label of a new card.

Supervised Learning

03 / 27

EXAMPLE · CLASSIFICATION

Cat or dog?

A trained model takes any image and sorts it into one of a fixed set of classes. The output is a single guess.

📷 Image → 🧠 Model → 🏷️ "Dog"

Real classifiers power photo apps, read medical scans, and even sort recyclables.

🐱CAT

🐶DOG

😺CAT

🐕DOG

🐈CAT

🦮DOG

Classification

04 / 27

THE MATH UNDER THE HOOD

Make "wrong" as small as possible.

Every supervised learning problem boils down to one line of math — find the model that makes its mistakes as small as possible.

\[ \min_{f} \; \ell\bigl(f(x),\, y\bigr) \]

\(x\) The input — e.g. a photo of an animal.

\(y\) The right answer — e.g. the label "cat".

\(f\) The model. Feed it \(x\) and it guesses: \(f(x)\).

\(\ell\) The loss — how far the guess \(f(x)\) is from the truth \(y\).

\(\min\) Find the smallest. Tweak \(f\) until the loss is as tiny as possible.

Training = searching through millions of possible \(f\)'s to find the one that's wrong the least.
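To make the search concrete, here is a minimal sketch of "try many \(f\)'s, keep the one with the smallest loss." Everything in it is made up for illustration: the single feature \(x\) (an "ear pointiness" score), the six examples, and the five candidate thresholds. Real training searches a far larger space with gradient-based methods rather than trying candidates one by one.

```python
# Toy version of min_f loss(f(x), y): pick the best threshold classifier.
# The data and the candidate thresholds are hypothetical.

# x = one made-up feature per animal, y = the right answer
examples = [(0.9, "cat"), (0.8, "cat"), (0.7, "cat"),
            (0.4, "dog"), (0.3, "dog"), (0.6, "dog")]


def make_model(threshold):
    """Return a model f that guesses 'cat' when x is above the threshold."""
    def f(x):
        return "cat" if x > threshold else "dog"
    return f


def loss(guess, truth):
    """0-1 loss: 0 when the guess is right, 1 when it is wrong."""
    return 0 if guess == truth else 1


def average_loss(f, data):
    """Average the per-example loss over the whole dataset."""
    return sum(loss(f(x), y) for x, y in data) / len(data)


# "Training": try each candidate f and keep the one that is wrong the least.
candidates = [0.1, 0.25, 0.5, 0.65, 0.85]
best_threshold = min(candidates,
                     key=lambda t: average_loss(make_model(t), examples))
best_model = make_model(best_threshold)

print(best_threshold, average_loss(best_model, examples))
```

The loss here counts mistakes; other problems use other losses, but the recipe of scoring each candidate and keeping the best stays the same.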

The Math · Loss

05 / 27

STILL SUPERVISED LEARNING · ANOTHER EXAMPLE

Even ChatGPT is just
predicting the next word.

Same recipe as cat-vs-dog. Show the model billions of sentences, hide the next word, ask it to guess, shrink the loss. The only twist — instead of one label, it produces something one piece at a time.

📝 Text

"Write a haiku about cats"

A whisker twitches.
Sunlight pools on the carpet —
nap is non-negotiable.

🎨 Image

"A cat astronaut"

→ starts from random noise, refined into a picture step by step

🎵 Music / Code / Video

"Lo-fi beat for studying"

→ predicts the next note, then the next, then the next…

Same \(\min \, \ell(f(x), y)\) — but now \(y\) is a piece of the output (a word, a note, a slice of noise), not a label.

Generation

06 / 27

TRY IT · PREDICT THE NEXT WORD

You're the language model.

An LLM picks the next word from a probability distribution — over and over. Tap a word to add it. Watch the sentence grow.

The cat sat on the ___
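The "pick a word, add it, repeat" loop can be sketched in a few lines. This is a toy, not how a real LLM is built: the probability table `TOY_MODEL` is invented for illustration and only looks at the last word, whereas a real model computes its distribution from the whole context with a neural network.

```python
import random

# Hypothetical next-word probabilities, keyed by the previous word only.
TOY_MODEL = {
    "the": {"mat": 0.5, "sofa": 0.3, "keyboard": 0.2},
    "mat": {".": 0.7, "and": 0.3},
    "sofa": {".": 0.7, "and": 0.3},
    "keyboard": {".": 0.7, "and": 0.3},
    "and": {"purred": 1.0},
    "purred": {".": 1.0},
}


def sample_next_word(last_word, rng):
    """Draw one word from the probability distribution for last_word."""
    probs = TOY_MODEL[last_word]
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(prompt, rng, max_words=10):
    """Grow the sentence one sampled word at a time until a full stop."""
    words = prompt.split()
    for _ in range(max_words):
        next_word = sample_next_word(words[-1], rng)
        words.append(next_word)
        if next_word == ".":
            break
    return " ".join(words)


rng = random.Random(0)
print(generate("The cat sat on the", rng))
```

Because each word is sampled rather than always taking the most likely one, running it with different seeds gives different endings.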

Next Word

07 / 27

PAUSE · QUICK CHECK-IN

How are we
doing so far?

That was a lot — flashcards, math, even ChatGPT. Before we leave supervised learning behind and try something completely different, let's pause.

✋ Got a question? Anything fuzzy? Speak up — no question is too small.

🔁 Want a replay? The math, the loss, generation — I can rewind any of it.

🚀 All good? Next up: the kind of learning that beat humans at Go.

Check-in

08 / 27

A PROBLEM

But what about
problems with no answer key?

How do you learn to ride a bike, win a chess match, or land a rocket — when no one can show you the "right answer" at every step?

Pivot to RL

09 / 27

WHY SUPERVISED LEARNING ISN'T ENOUGH

Where's the answer key?

Try to imagine a flashcard set for any of these. You can't — not because nobody tried, but because no human knows the right answer at every moment.

🚴

Learning to ride a bike
What's the right amount to lean, right now? → No parent can label every microsecond of balance.

🎮

Beating a video game
Should I jump now? Run? Duck? → You only find out it was wrong after the game ends.

🤖

A robot learning to walk
What's the perfect motor angle for this step? → No engineer can write down every joint angle for every gait.

🐶

Training a puppy to sit
How would you make 10,000 flashcards for "sit"? → You don't. You give a treat when it works, and that's it.

No teacher. No labels. Just try things, see what happens, get rewarded. Sound familiar? That's how you learn most things — and it's where RL lives.

Why SL Fails

10 / 27

PART 3 · MODELING THE PROBLEM

Two characters, talking in a loop.

To do math on bikes, games, and puppies, we need a picture that fits all of them. RL boils every one of those situations down to two characters — an agent and an environment — passing messages back and forth.

The agent's policy π — its strategy — picks an action. The world updates and hands back a new state + a reward, which the agent uses to nudge its policy. Repeat forever. Mathematicians call this a Markov Decision Process — but you can just call it a loop.
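In code, the loop might look like the sketch below. The names `LineWorld`, `RandomAgent`, and `run_episode` are hypothetical, chosen for this example; the environment is a made-up five-cell track, and the agent's `learn` step is left empty because how to nudge the policy is the subject of later lectures. The loop here also stops when the goal is reached instead of repeating forever.

```python
import random


class LineWorld:
    """Hypothetical environment: positions 0..4; reaching 4 pays +1."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        """Apply an action (-1 or +1); return (new_state, reward, done)."""
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done


class RandomAgent:
    """Placeholder agent: its policy ignores the state and acts at random."""

    def __init__(self, rng):
        self.rng = rng

    def policy(self, state):
        return self.rng.choice([-1, +1])

    def learn(self, state, action, reward, next_state):
        pass  # a learning agent would nudge its policy here


def run_episode(env, agent, max_steps=1000):
    """Run the agent-environment loop until the goal or the step limit."""
    state = env.position
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.policy(state)                  # agent acts
        next_state, reward, done = env.step(action)   # world responds
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, state


total, final_state = run_episode(LineWorld(), RandomAgent(random.Random(0)))
print(total, final_state)
```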

RL Loop

11 / 27

THE REWARD SIGNAL

A number every step.
Add them up.

After every action, the world hands back one number — the score for that step. Sometimes it's a big jackpot at the end (you won the game). Sometimes it's a little something every step (a tasty scoop of ice cream). Either way, the agent's only job: make the total as big as possible.

🏆 +1 WIN A GAME · ONCE, AT THE END

🍦 +0.7 TASTY SCOOP · EVERY STEP

💥 −1 CRASH / LOSE

\[ G = r_1 + r_2 + r_3 + \cdots + r_T \]

Return \(G\) = the sum of every step's reward. RL = pick actions that make \(G\) as big as possible.
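As a quick worked example, here is the return for three made-up episodes built from the reward values above. The episode lengths are assumptions chosen for illustration.

```python
def episode_return(rewards):
    """Return G: the sum of every step's reward in one episode."""
    return sum(rewards)


# Hypothetical episodes built from the slide's reward values.
win_at_the_end = [0, 0, 0, 0, 1]        # nothing until the final jackpot
scoop_every_step = [0.7, 0.7, 0.7]      # a small reward on each of 3 steps
crash = [0, 0, -1]                      # two quiet steps, then a crash

for name, rewards in [("win", win_at_the_end),
                      ("scoops", scoop_every_step),
                      ("crash", crash)]:
    print(name, round(episode_return(rewards), 2))
```

Note that three scoops add up to a larger return than one win here, which is why the sizes of the rewards matter as much as their signs.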

Reward

12 / 27

A CLASSIC TENSION

Explore or exploit?

You walk into an ice cream shop with 31 flavors. Mint chip is your reliable favorite. Do you order it again — or finally try the one with the weird name?

🔍 EXPLORE

Try something new.

Maybe you'll discover something better. Maybe you'll waste a scoop on a flavor you hate.

🍓 🥑 🌶️

🎯 EXPLOIT

Stick with what works.

Guaranteed-good ice cream. But you'll never find anything better than mint chip.

🍦 🍦 🍦

Every RL agent has to balance these. Too much exploring = waste. Too much exploiting = miss the best.

Explore vs Exploit

13 / 27

CASE STUDY · ALPHAGO

The game everyone said
computers couldn't win.

\(10^{170}\)

POSSIBLE BOARD POSITIONS

2,500+

YEARS OF HUMAN STUDY

2016

YEAR ALPHAGO BEAT WORLD CHAMP LEE SEDOL

More positions than atoms in the observable universe (roughly \(10^{80}\)). You can't memorize Go: you have to understand it.

AlphaGo

14 / 27

UNDER THE HOOD

How AlphaGo learned, in three stages.

STAGE 1 📚

Imitate humans

Supervised learning on 30 million positions from expert human games. "Given this board, what would a human play?" Now AlphaGo plays like a strong amateur.

STAGE 2 🤖

Play itself

Reinforcement learning from self-play. Two copies of AlphaGo battle for millions of games. Reward = +1 win, −1 lose. It invents its own strategies.

STAGE 3 🌳

Plan ahead

Tree search at game time. Imagines thousands of futures, picks the move that — across all those futures — gives the best expected return.

Stage 2 is the magic. Without an answer key, RL let the system surpass every human teacher it had.

How AlphaGo Learned

15 / 27

MARCH 10, 2016 · GAME 2 · MOVE 37

The move
no human would have made.

It's not a human move. I've never seen a human play this move. So beautiful.

— Fan Hui, professional Go player, watching live

Estimated probability of a human playing it: 1 in 10,000. AlphaGo discovered it through self-play — no teacher could have shown it this move. Lee Sedol stared at the board for 12 minutes.

Move 37

16 / 27

AND THEN... SILENCE

Twelve minutes.

I thought AlphaGo was based on probability calculation and it was merely a machine. But when I saw this move, I changed my mind. Surely AlphaGo is creative.

— Lee Sedol, after the match

12 min Lee Sedol's pause before responding

Move 37 + Lee Sedol's reaction

From the AlphaGo documentary on YouTube, at 51:46.

Reaction

17 / 27

PAUSE & THINK

How can a machine play a move
no human ever played?

AlphaGo started by copying 30 million expert human moves. So where did Move 37 come from? Talk to a neighbor and guess before we answer.

🧠 It learned from humans first. Stage 1 = copy 30M expert human moves. So why didn't it stay an imitator?

🤖 Then it played itself. Stage 2 = two AlphaGos battle for millions of games. Who labels the "right" move when no human is watching?

✨ No teacher needed. Reward = +1 win, −1 lose. The game itself decides. That's how RL can surpass every human teacher.

Beyond Humans

18 / 27

NEXT UP · YOU PLAY

Your turn to be the agent.

One game: three mystery boxes, each hiding a reward you can't see. Tap to peek inside. You decide every move — and you only get 20 chances.

↓

Your Turn

19 / 27

🎮 Three Mystery Boxes

Tap a box to pull. 20 pulls total.

0

Pulls used / 20

0.00

Total reward

—

Your best box so far

The Game

20 / 27

A SIMPLE RL ALGORITHM

Meet ε-greedy.

How would a computer play that game? Here's one of the simplest RL algorithms that works: just a coin flip and a running average. (ε is the Greek letter "epsilon". Think of it as "how curious am I?")

🔍 EXPLORE Pick a random box. Why: maybe one we ignored is secretly the best — only way to find out is to try it.

🎯 EXPLOIT Pick the box with the highest average so far. Why: it's been the winner so far — stick with it and collect points.

EVERY TURN

① Roll a random number between 0 and 1.

② If it's less than \(\varepsilon\) → explore: pick a random box.

③ Otherwise → exploit: pick the box with the best average so far.

④ Update that box's running average with the reward you saw.

\(\varepsilon = 0\) Pure exploit. Always picks the current "best." Risk: stuck on bad early luck.

\(\varepsilon = 1/6\) 🎲 Roll a die. 1 → explore, 2–6 → exploit. We'll use this on the next slide.

\(\varepsilon = 1\) Pure explore. Totally random. Learns the truth fast — but scores poorly.
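The four steps can be sketched as a short function. Several details are assumptions made for this example: boxes that have never been pulled count as having an average of 0, ties go to the first box, and each box pays 1 or 0 with a hidden probability. Only box B's mean of 0.75 is suggested by the deck (an optimal score of 15.00 over 20 pulls); the means for A and C are invented.

```python
import random

# Assumption: only B's 0.75 is suggested by the slides; A and C are made up.
HIDDEN_MEANS = [0.30, 0.75, 0.50]  # boxes A, B, C


def make_pull(means, rng):
    """Return pull(box): pays 1.0 with probability means[box], else 0.0."""
    def pull(box):
        return 1.0 if rng.random() < means[box] else 0.0
    return pull


def epsilon_greedy(pull, n_boxes, n_pulls, epsilon, rng):
    """Play n_pulls turns; return (total reward, pull counts, averages)."""
    counts = [0] * n_boxes        # how many times each box was pulled
    averages = [0.0] * n_boxes    # running average reward per box
    total = 0.0
    for _ in range(n_pulls):
        # Steps 1-3: draw a random number, then explore or exploit.
        if rng.random() < epsilon:
            box = rng.randrange(n_boxes)
        else:
            # Unpulled boxes count as 0.0; ties go to the first box.
            box = max(range(n_boxes), key=lambda b: averages[b])
        reward = pull(box)
        # Step 4: update that box's running average.
        counts[box] += 1
        averages[box] += (reward - averages[box]) / counts[box]
        total += reward
    return total, counts, averages


rng = random.Random(0)
total, counts, averages = epsilon_greedy(
    make_pull(HIDDEN_MEANS, rng), n_boxes=3, n_pulls=20, epsilon=1 / 6, rng=rng)
print(f"total reward: {total:.2f}, pulls per box: {counts}")
```

Setting `epsilon=0.0` in this sketch shows the "stuck" risk directly: the first box to earn any reward keeps the highest average, so the others may never be tried.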

ε-greedy

21 / 27

SAME BOXES · NEW PLAYER

ε-greedy plays your game.

Same three boxes. Same hidden means. Same 20 pulls. Each turn we roll a die 🎲: if it lands on 1, ε-greedy explores (random box). On 2–6, it exploits (best so far). That's ε = 1/6.

🔍 explore · pick random 🎯 exploit · pick the box with the best average so far

ε = 1/6 ≈ 0.17

click Roll & pull to take one step

A 0 pulls avg —

B 0 pulls avg —

C 0 pulls avg —

YOU

—

play game 1 first!

ε-GREEDY

—

click run!

OPTIMAL

15.00

always pick B

ε-greedy vs You

22 / 27

PAUSE & DISCUSS

What do you think
of ε-greedy?

You watched it play. You compared it to your own strategy. Now turn to a partner — or shout it out:

💪 What does it do well? Why does just "rolling a die" beat picking a box at random?

⚠️ When would it fail? Imagine 100 boxes and only 20 pulls. Or rewards that change over time.

✨ How would you make it smarter? Should ε start big and shrink? Should it remember more than the average?

Discuss ε-greedy

23 / 27

QUIZ · YOU JUST DID RL

Match each game word
to its RL name.

Tap a phrase on the left, then tap its match on the right. Get all 6 right and you've spoken the language of RL.

FROM THE GAME

RL NAME

Debrief

24 / 27

PAUSE & THINK

Every part of the loop has a hard problem.

The same loop you saw earlier — agent, policy, action, state, reward, environment. Now tap any piece to see what the next 7 lectures will tackle. For each, ask yourself: what would YOU try?

↑ Click any piece of the loop to see its open problems

Challenges

25 / 27

FULL CIRCLE · CHATGPT, CLAUDE, GEMINI

Same recipe.
Now for language.

Remember "predict the next word" from earlier? That's just step one. Modern LLMs follow the same recipe as AlphaGo's first two stages: first imitate, then RL.

⚫ AlphaGo

STAGE 1 · SUPERVISED Copy humans Train on 30 million expert human moves. "Given this board, what would a human play?" → strong amateur.

STAGE 2 · RL Self-play Two AlphaGos battle. Reward = +1 win, −1 lose. Discovers Move 37 — beyond every teacher.

💬 ChatGPT & friends

STAGE 1 · SUPERVISED Pretrain on the internet Predict the next word over billions of webpages. "What word comes next?" → fluent imitator.

STAGE 2 · RL Learn from feedback Humans (or another model) rate responses. Reward = helpful & honest. Reasoning that no one wrote down emerges.

Same idea, different game. RL is what turns a fluent imitator into something that can surpass its teacher.

RL in LLMs

26 / 27

WRAP-UP

Four ideas to take with you.

01

Supervised = answer key.

Show many labeled examples. Minimize the loss. Powers most AI you use today.

02

RL = trial & reward.

No labels. The agent acts, the environment rewards. Maximize total reward G over time.

03

Explore vs exploit.

Every learning agent — including you in the box game — must balance trying new things and using what works.

04

RL can surpass humans.

AlphaGo's Move 37 wasn't taught — it was discovered. Self-play RL invents strategies no teacher knows.

→

NEXT · LECTURE 2

What Is Reinforcement Learning? — Agent, environment, state, action, reward.

Takeaways

27 / 27