Case Study · 05

Pong from Pixels.

In 2013, a single neural network learned to play Atari games, Pong among them, from nothing but the screen. It was the first time deep reinforcement learning worked at that scale. I’m rebuilding that result from the ground up to understand how it works: a deep Q-network that learns Pong from raw pixels, hand-built end to end on a 6 GB GPU.

Stack: Python · PyTorch · Gymnasium · CUDA
Period: 2026
Status: in progress · phase 3 of 5

pong from pixels — interactive

Drag the slider to train the agent · guided tour · toggle the original sound. Illustrative concept — representative animation and curve until the real training run’s data is dropped in.

Why Pong

Starting where the field started.

Deep reinforcement learning has a clear origin point. In 2013, DeepMind published a network that learned to play Atari games directly from the pixels of the screen, using deep Q-learning; the 2015 follow-up in Nature showed it reaching human-level play across many of those games. Pong was one of them. It was the first time a single architecture learned control from raw images at that scale.

That result is the root of a lineage. The same lab went on to AlphaGo, then AlphaZero, then AlphaFold — work that reshaped its fields. You don’t need a novel idea to learn from the most important one; you need to rebuild it and watch it work. Reproducing Pong-from-pixels is a deliberate choice to begin a deep-RL education at the beginning of deep RL.

This is an homage and a reproduction, not a claim of novelty or of any affiliation; the goal is understanding earned by construction. That idea was still on Richard Feynman’s blackboard when he died, and it is one of the convictions this project is built around.

“What I cannot create, I do not understand.” (Richard Feynman’s blackboard, 1988)

Approach

Build the parts that teach. Import the parts that don’t.

The fastest way to learn deep RL is also the least honest: import a library, call agent.train(), and watch a number go up. The slowest is to write the GPU kernels yourself. Neither teaches the right thing. So the line is drawn deliberately — the conceptual core is hand-written, and the commodity machinery underneath is imported and named as such.

Hand-built — the lesson

Q-network & action selection: forward pass, greedy / ε-greedy choice
Experience replay buffer: store transitions, sample decorrelated minibatches
Exploration schedule: ε annealed from explore to exploit
Training loop & TD target: target network, Bellman backup, gradient step
Frame preprocessing: grayscale, downsample, 4-frame stack

Imported — commodity

Atari emulator: Gymnasium / ALE environment
Conv layers & autograd: PyTorch tensors and backprop
Optimizer: Adam
GPU kernels: CUDA / cuDNN
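
Two of the hand-built pieces listed above, the replay buffer and the exploration schedule, need nothing beyond the standard library, so they are easy to sketch in isolation. What follows is a minimal illustration, not the project's code: the names ReplayBuffer, Transition and epsilon_at are hypothetical, and the schedule values (1.0 down to 0.1 over one million steps) are placeholder assumptions rather than settings taken from this build.

```python
import random
from collections import namedtuple

# Hypothetical record type for one step of experience.
Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-size ring buffer: once full, the oldest transition is overwritten."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []
        self.pos = 0  # index of the next slot to write

    def push(self, state, action, reward, next_state, done):
        t = Transition(state, action, reward, next_state, done)
        if len(self.data) < self.capacity:
            self.data.append(t)
        else:
            self.data[self.pos] = t
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size: int) -> Transition:
        # Uniform sampling without replacement breaks the correlation
        # between consecutive frames of the same rally.
        batch = random.sample(self.data, batch_size)
        # Transpose a list of Transitions into one Transition of tuples.
        return Transition(*zip(*batch))

    def __len__(self):
        return len(self.data)


def epsilon_at(step: int, eps_start: float = 1.0, eps_end: float = 0.1,
               decay_steps: int = 1_000_000) -> float:
    """Linear anneal from eps_start to eps_end, then hold (values are assumptions)."""
    frac = min(step / decay_steps, 1.0)
    return eps_end + (1.0 - frac) * (eps_start - eps_end)
```

In a training loop, every environment step would call push, and a gradient step would only begin once len(buffer) exceeds the batch size. A linear anneal is one common choice; an exponential decay would slot into the same function signature.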

Architecture

From 210×160 pixels to a move.

Each Atari frame arrives as a 210×160 colour image. It’s converted to grayscale, downsampled to 84×84, and the last four frames are stacked so the network can see motion — which way the ball is travelling, and how fast. That stack is the only input. There is no score feed, no ball coordinates, no hand-built features.
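
The preprocessing described above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the project's implementation: to_gray_84 and FrameStack are hypothetical names, the grayscale weights are the standard luminance weights, and the downsample is a simple nearest-neighbour pick (an image library with area interpolation is a common alternative). The stack is returned channels-first, as (4, 84, 84), because that is the layout PyTorch convolutions expect.

```python
import numpy as np
from collections import deque


def to_gray_84(frame: np.ndarray) -> np.ndarray:
    """Turn one 210x160x3 uint8 RGB frame into an 84x84 uint8 grayscale image."""
    # Luminance-weighted grayscale; result has shape (210, 160).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    # Nearest-neighbour downsample: pick 84 evenly spaced rows and columns.
    rows = np.arange(84) * frame.shape[0] // 84
    cols = np.arange(84) * frame.shape[1] // 84
    return gray[rows][:, cols].astype(np.uint8)


class FrameStack:
    """Keeps the last k preprocessed frames so the network can infer motion."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        # At episode start there is no history, so repeat the first frame k times.
        first = to_gray_84(frame)
        for _ in range(self.k):
            self.frames.append(first)
        return self.state()

    def push(self, frame: np.ndarray) -> np.ndarray:
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(to_gray_84(frame))
        return self.state()

    def state(self) -> np.ndarray:
        # Oldest frame first, newest last; shape (k, 84, 84), dtype uint8.
        return np.stack(list(self.frames), axis=0)
```

Keeping the stack as uint8 until it reaches the network keeps each stored state at a quarter of the size of a float32 copy, which matters once thousands of them sit in a replay buffer.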

A small convolutional network maps that 84×84×4 stack to one value per action — the expected return, or Q-value, of moving up, staying, or moving down. The agent mostly takes the highest-value action and occasionally explores at random. Each transition is stored; minibatches are replayed to break correlation; and a slowly-updated copy of the network (the target network) provides a stable learning signal:

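# one gradient step (deep Q-learning)
import torch
from torch.nn.functional import smooth_l1_loss

# actions: (B, 1) int64; rewards and done: (B,) float, done = 1.0 at episode end
q = net(states).gather(1, actions).squeeze(1)       # Q(s, a), shape (B,)
with torch.no_grad():                               # no gradient flows into the target
    next_q = target_net(next_states).max(1).values  # max over a' of Q_target(s', a')
    y      = rewards + gamma * next_q * (1 - done)  # Bellman target; no bootstrap past a terminal state
loss = smooth_l1_loss(q, y)                         # Huber loss on the temporal-difference error
opt.zero_grad()                                     # clear gradients left over from the previous step
loss.backward()                                     # backpropagate through net only
opt.step()                                          # Adam update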
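
The step above takes net and target_net as given. For concreteness, here is one way they could look. The text only says "a small convolutional network", so the layer sizes below are an assumption borrowed from the layout of the 2015 Nature paper, and QNetwork, select_action and sync_target are hypothetical names, not the project's API.

```python
import random

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps a (B, 4, 84, 84) frame stack to one Q-value per action."""

    def __init__(self, n_actions: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9 -> 7
            nn.Flatten(),                                           # 64 * 7 * 7 = 3136
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Accept uint8 pixels and scale them to [0, 1] here, so stored
        # frames can stay as compact uint8 arrays.
        return self.head(self.features(x.float() / 255.0))


def select_action(net: nn.Module, state: torch.Tensor, epsilon: float,
                  n_actions: int = 3) -> int:
    """Epsilon-greedy: a random action with probability epsilon, else the argmax."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q = net(state.unsqueeze(0))  # add a batch dimension: (1, 4, 84, 84)
    return int(q.argmax(dim=1).item())


def sync_target(net: nn.Module, target_net: nn.Module) -> None:
    """Hard update: copy the online weights into the target network."""
    target_net.load_state_dict(net.state_dict())
```

Calling sync_target every fixed number of gradient steps is what keeps the target network "slowly updated"; how many steps to wait between copies is a tuning choice this sketch leaves open.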

The whole thing has to fit and train on a single RTX 2060 with 6 GB of memory, which shapes everything downstream — the replay buffer size, the batch size, and how long a run realistically takes.
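
A little arithmetic shows why memory drives the replay buffer's design. The sketch below is a back-of-envelope calculation, not a measurement: replay_bytes is a hypothetical helper, the capacities are example values rather than this project's settings, and it counts only raw uint8 pixels (no rewards, actions, or Python overhead). The same numbers apply whether the buffer lives in VRAM or in host RAM.

```python
def replay_bytes(capacity: int, side: int = 84, stack: int = 4,
                 store_stacks: bool = True) -> int:
    """Raw pixel bytes needed to hold `capacity` transitions as uint8."""
    frame_bytes = side * side  # one grayscale frame, 1 byte per pixel
    if store_stacks:
        # Naive layout: every transition stores a full state and next_state.
        per_transition = 2 * stack * frame_bytes
    else:
        # Compact layout: store each new frame once and rebuild the
        # 4-frame stacks from neighbouring frames at sample time.
        per_transition = frame_bytes
    return capacity * per_transition


if __name__ == "__main__":
    for capacity in (100_000, 1_000_000):
        for store_stacks in (True, False):
            gb = replay_bytes(capacity, store_stacks=store_stacks) / 1e9
            layout = "full stacks" if store_stacks else "single frames"
            print(f"{capacity:>9,} transitions, {layout}: {gb:.2f} GB")
```

With full stacks, each transition costs 56,448 bytes, so even 100,000 transitions come to about 5.6 GB, nearly the whole card. Storing single frames cuts that by a factor of eight, at the cost of more bookkeeping when sampling.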

The interface

One playhead, the whole story.

The demo above is a single control — how much the agent has trained — wired to everything at once, so the abstract idea of “learning” becomes something you can watch and hear.

The game

A monochrome CRT Pong, rendered to look like the 1970s original. The agent is the blue paddle, learning from the screen alone.

The learning curve

Average reward per episode — flat near −21, then the breakthrough climb. The playhead marks the exact moment shown on screen.

The decision

Q-values for up / stay / down. Untrained they’re flat and unsure; trained, one spikes — a policy forming in real time.

The expectation

The value estimate — the agent’s own read on whether it will win the point. It moves a beat ahead of the outcome.

Guided tour

A self-running walkthrough that trains the agent for you, spotlighting each part as it explains what you’re seeing.

OG sound

The original’s three blips — paddle, wall, point — synthesized live, off by default. Worth turning on.

Everything in the demo is currently representative — a stand-in curve and a hand-tuned agent — clearly labelled as such. The shipped version swaps in the real per-episode reward log and recorded gameplay at each training checkpoint, so the playhead scrubs between snapshots of the actual agent.

Process

Four-hour Saturdays, floor and aspiration.

The project runs on a fixed cadence: one four-hour block a week. Each session is pre-registered with two targets written down before it starts — a floor (what must be true by the end) and an aspiration (the stretch). Writing both down in advance is the point: a missed aspiration isn’t failure, it’s information about where the estimate was wrong.

Three of the five phases are done — kickoff, architecture, and the core build (network, replay buffer, agent, and training loop, all assembled and running). The next phase is training and diagnosis: getting it to actually converge, and being honest about what that takes. Shipping the finished page is the last.

What’s left

Get it to convergence — then tell the truth about how.

Train the agent until it reliably beats the built-in opponent, on the 6 GB card.
Log per-episode reward to disk, so the real learning curve replaces the representative one.
Checkpoint the model at milestones — random, first rally, breakthrough, trained — and record a short gameplay clip at each, so the demo’s slider scrubs through real footage.
Instrument the internals: Q-value drift, the ε schedule, replay dynamics, plus wall-clock, VRAM, and GPU temperature per run.
Write up what actually worked and what didn’t — including the dead ends, which are the part most write-ups quietly drop.
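
The per-episode reward log in the list above can be as simple as an append-only CSV file. The following is a minimal sketch under assumptions: RewardLog and moving_average are hypothetical names, and the column set (episode, reward, epsilon, steps) is a guess at what would be useful, not a format the project has committed to.

```python
import csv
from pathlib import Path


class RewardLog:
    """Append-only CSV of per-episode results; survives a crashed or resumed run."""

    FIELDS = ["episode", "reward", "epsilon", "steps"]

    def __init__(self, path):
        self.path = Path(path)
        # Write the header only once, so resuming a run keeps earlier rows.
        if not self.path.exists():
            with self.path.open("w", newline="") as f:
                csv.writer(f).writerow(self.FIELDS)

    def append(self, episode: int, reward: float, epsilon: float, steps: int):
        # Open, write and close per episode so a row is on disk before the next one starts.
        with self.path.open("a", newline="") as f:
            csv.writer(f).writerow([episode, reward, epsilon, steps])

    def rewards(self):
        with self.path.open(newline="") as f:
            return [float(row["reward"]) for row in csv.DictReader(f)]


def moving_average(values, window: int):
    """Trailing mean over up to `window` values, one output per input."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the value that left the window
        out.append(total / min(i + 1, window))
    return out
```

A smoothed curve such as moving_average(log.rewards(), 100) is what would replace the representative curve in the demo; the window length is a presentation choice, not something the training depends on.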

Stack

What’s under the hood.

Standard, modern deep-RL tooling — with the conceptual core written by hand rather than pulled from a library.

Environment

Gymnasium / ALE
Atari Pong
84×84×4 preprocessing

Network

PyTorch
Convolutional Q-network
Target network

Training

Experience replay
ε-greedy exploration
Adam · Huber loss

Hardware

RTX 2060 · 6 GB
CUDA / cuDNN
Single-GPU run

Artifacts

The work, in public.

Demo

Interactive learning concept

Drag to train · synced curve, Q-values & value · guided tour · OG sound


Code

GitHub repository

Source — network, replay buffer, agent, and training loop


Build log

Saturday-by-Saturday

Each session’s pre-registered floor and aspiration, and how it went
