9 LIVE GAMES · BUILD YOUR FIRST AI

Reinforcement learning, from scratch.

How machines learn to play, win, and repeat — built up from the very first idea. Hands-on games, real math, and the whole arc from cat-or-dog to AlphaGo.

For Grades 9–12 · Pace ~50 min per lecture

Lecturers: Dr. Yao Ji, Dr. Ruqi Bai · Supervisor: Dr. Guanghui (George) Lan

The lectures

LECTURE 01 · How machines learn to play
From supervised classification to AlphaGo. Hands-on interactives: predict-the-next-word, a 3-box bandit, and ε-greedy in action.

Supervised vs RL · AlphaGo · Move 37 · Bandit · ε-greedy
      
Open
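
For a taste of what the 3-box bandit and ε-greedy look like in code, here is a minimal sketch. It is not the lecture's own interactive: the win probabilities, the value of ε, the number of pulls, and the function name `run_epsilon_greedy` are all assumptions chosen for illustration. With probability ε the agent opens a random box (explore); otherwise it opens the box with the best average reward so far (exploit).

```python
import random

def run_epsilon_greedy(win_probs, epsilon=0.1, steps=5000, seed=0):
    """Play a win/lose bandit with epsilon-greedy; return (estimates, counts)."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    n = len(win_probs)
    estimates = [0.0] * n              # running average reward of each box
    counts = [0] * n                   # how many times each box was opened
    for _ in range(steps):
        if rng.random() < epsilon:
            box = rng.randrange(n)     # explore: pick a box at random
        else:
            # exploit: pick the box with the highest estimate (ties -> lowest index)
            box = max(range(n), key=lambda i: estimates[i])
        reward = 1.0 if rng.random() < win_probs[box] else 0.0
        counts[box] += 1
        # incremental update of the running average
        estimates[box] += (reward - estimates[box]) / counts[box]
    return estimates, counts

# Hypothetical boxes: the third one pays off most often.
estimates, counts = run_epsilon_greedy([0.2, 0.5, 0.8])
print("estimated win rates:", [round(e, 2) for e in estimates])
print("pulls per box:", counts)
```

Setting `epsilon=0` makes the agent purely greedy, which can leave it stuck on the first box that ever paid out; a small positive ε keeps it sampling the other boxes.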

LECTURE 02 · The math we'll need: probability
A probability primer (outcomes, expectation, conditional probability), then a formalization of the warrior dungeon as an RL problem: state, action, reward, the Markov property, and the MDP. Includes a chest belief-update quiz and a live warrior trap.

Probability · Expectation · State · Action · Reward · Markov · MDP
      
Open

LECTURE 03 · MDPs, Bellman, & dynamic programming
Name every part of the agent–environment loop (state, action, reward, transition, policy), bundle them as the MDP, meet the values V and Q, and write the Bellman equation that ties them together. Doors & tree game inside.

State · Action · Reward · MDP · Vπ · Qπ · Bellman
      
Open
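
As a preview of how V and Q tie together, the Bellman equation for a fixed policy π can be written as below. The notation is the common textbook form and is an assumption here; the lecture may use different symbols. P is the transition probability, R the reward, and γ the discount factor.

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a),
\qquad
Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{\pi}(s') \bigr]
```

In a game with no randomness, each action leads to exactly one next state, so the sum over s' collapses to a single term.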

LECTURE 04 · Values, optimality & the Bellman equation
What makes a policy optimal? Through the doors and warrior-tree games, build up the value function and the principle of optimality, see why greedy ≠ optimal, and write the Bellman equation that every solution must satisfy.

Vπ · V* · Optimality · Bellman · γ
      
Open

LECTURE 05 · Solving small MDPs
A slippery game reveals where randomness lives: in the transition kernel, which puts an expectation in the Bellman backup. That gives policy evaluation, then policy iteration, and finally the one-line shortcut: value iteration. Live PI & VI demos inside.

Kernel · Bellman backup · Policy iter · Value iter · Gridworld
      
Open
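
To make the "one-line shortcut" concrete, here is a minimal value iteration sketch on a made-up slippery corridor. It is not the lecture's gridworld: the four-cell layout, the 20% slip chance, the +1 goal reward, and γ = 0.9 are all assumptions for illustration. The expectation over the transition kernel lives in `backup`; the shortcut is the single `max` over actions.

```python
GAMMA = 0.9           # discount factor (assumed)
SLIP = 0.2            # chance the move goes the opposite way (assumed)
N_STATES = 4          # corridor cells 0..3; cell 3 is the terminal goal
GOAL = N_STATES - 1
ACTIONS = (-1, +1)    # step left, step right

def transitions(state, action):
    """Transition kernel: list of (probability, next_state, reward)."""
    outcomes = []
    for prob, move in ((1 - SLIP, action), (SLIP, -action)):
        nxt = min(max(state + move, 0), GOAL)   # walls clamp the position
        reward = 1.0 if nxt == GOAL else 0.0    # +1 for reaching the goal
        outcomes.append((prob, nxt, reward))
    return outcomes

def backup(values, state, action):
    """Bellman backup: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * values[nxt])
               for p, nxt, r in transitions(state, action))

def value_iteration(tol=1e-10):
    values = [0.0] * N_STATES
    while True:
        new_values = values[:]
        for s in range(GOAL):                   # the terminal cell keeps value 0
            new_values[s] = max(backup(values, s, a) for a in ACTIONS)
        if max(abs(x - y) for x, y in zip(new_values, values)) < tol:
            return new_values
        values = new_values

values = value_iteration()
# Read the greedy policy off the converged values.
policy = [max(ACTIONS, key=lambda a: backup(values, s, a)) for s in range(GOAL)]
print("values:", [round(v, 3) for v in values])
print("policy:", policy)
```

Cells closer to the goal end up with higher values, and the greedy policy steps right everywhere, even though each step can slip.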

LECTURE 06 · Policy iteration: evaluate → improve
Score a policy, improve it, repeat. Nail the scoring step on the warrior tree two ways (a plain roll-up from the leaves and iterative guess-and-repeat), then spend those values to improve the policy and loop until it can't get any better.

Policy eval · Iterative eval · Improve · Stopping · π*
      
Open

LECTURE 07 · Putting it together
Synthesis. The big map, a method comparator (5 algorithms), 8 project starters, ~30 lines of Python Q-learning, and a debugging guide. Bring your laptop.

Synthesis · Practice · Projects · Python · Gymnasium
      
Coming soon

LECTURE 08 · RL, looking forward
From your gridworld to the frontier: DQN, the AlphaGo lineage (→ MuZero), RLHF for ChatGPT, real-world applications, open problems, and safety & reward hacking.

Deep RL · AlphaGo · RLHF · Safety
      
Coming soon
