The annotated microGPT

The previous post covered what happens when you press 'submit' on ChatGPT. This covers the actual code.

ChatGPT and the script below share the exact same algorithm. Scale and engineering differ; the math does not. ChatGPT runs on thousands of H100s trained on the internet; this runs on your laptop in a few minutes, trained on random names instead of the entire internet.

Prerequisites: know what a gradient is. If not: read neural networks, the math

Core idea: A language model has one job: predict the next token given the context so far

Before the code, here's what the forward pass actually does. Click through each step.

1. data (lines 1-23)

Dataset: 32,000 human first names from Karpathy's makemore project. Instead of using BytePairEncoding (which is what gippity uses as it's tokeniser) we use character level tokenizer so the alphabet maps to IDs 0-25, and one special token sits at ID 26: BOS (beginning-of-sequence). Vocab size = 27 (26 a-z + 1 BOS)

import urllib.request, os
 
url = "https://raw.githubusercontent.com/karpathy/makemore/master/names.txt"
if not os.path.exists("names.txt"):
    urllib.request.urlretrieve(url, "names.txt")
docs = open("names.txt").read().splitlines()  # list of 32k names
 
# "".join(docs) concatenates every name into one giant string, set() removes duplicates, sorted() sorts them
chars = sorted(set("".join(docs))) # ['a','b',...,'z'] - 26 chars
BOS = len(chars) # ID 26
vocab_size = len(chars) + 1 # 27

BOS serves as both beginning of sentence and end of sentence, instead of having two seperate tokens, we use one for simplicity. [BOS, c1, c2, ..., cn, BOS]. When the model produces BOS during inference - stop (here EOS and BOS are the same)

BOS doubling as EOS works because the model learns from examples: every name in the training data ends with BOS. After 20 steps it starts producing BOS as the termination signal naturally.

Tokenization for "emma":

▶

26 (BOS)

▶

26 (BOS)

Each character maps to an integer ID. BOS wraps the sequence.

Two things here: input targets and training targets. Input targets is everything except last token and training targets must predict everything starting from the first token. The model sees previous input tokens [:-1] and must predict [1:] at each position (next token). "emma" generates 5 training pairs:

Each row is one training pair. Context grows left to right. Target (orange) is the next token.

doc = "emma"
char_tokens = []
for c in doc:
    char_tokens.append(chars.index(c))
tokens = [BOS] + char_tokens + [BOS] # tokens = [26, 4, 12, 12, 0, 26]
 
input_tokens  = tokens[:-1]  # [26, 4, 12, 12, 0]
target_tokens = tokens[1:]   # [4, 12, 12, 0, 26]

At position t, the model sees tokens[0..t] and must predict tokens[t+1]

2. autograd engine (lines 25-121)

Parameters is what you tune / nudge in the right direction. Parameters consists of weights and biases of the model (think of them as the neurons in our brain); in order to make sure they learn the right things -> we need to tune them as the model learns more and more data. What is the 'right' direction here? We check the impact of every weight/bias with respect to our Loss and see how increasing or decreasing the value for weights changes loss. In order to minimise loss and change parameters efficiently; we need to move in the negative direction of gradient (change in parameter) : This is called backpropagation and does this with chain rule. Entire of microgpt is just backpropagation with optimisations on top. Gradient descent needs dLoss/d(every parameter). For 4,192 parameters, that's 4,192 separate derivative calculations per step. Autograd is a library karpathy made which computes all of them by tracking every operation and reversing it by keeping a graph of it. You can see more about it here

the Value class

Every scalar is a Value node. It stores four things: data (the actual value of parameter), gradient (the slope of that parameter wrt loss)), local_gradients (slope of that particular parameter wrt to it's inputs), children (to keep track of it's inputs (for c = a+b, a and b are inputs with _op + to get c))

For a=2, b=3, out=a*b=6. local_grads store d(out)/d(a)=b=3 and d(out)/d(b)=a=2.

local_grads are stored as a tuple (each operation captures its derivative formula at construction time)

# Python "dunder" methods like __add__ let you define what happens when you write a + b: Python calls a.__add__(b) under the hood
# The derivative formula lim h-> 0 f(x+h) - f(x)/h tells us if we add small h to a function, how does it change? Does it increase/decrease. What is the impact. This is known as the derivative of that function. 
# We need a way to compute impact of all parameters (inputs) with respect to Loss so we can change / minimise loss accordingly and get the best values of these parameters which makes our loss minimum. This is backprop btw.
 
class Value:
    def __init__(self, data, _children=(), local_grads=()):
        self.data = data
        self.grad = 0 # dLoss/d(this) - general gradient or derivative, this gets updated when you call .backward()
        self._children = _children # tuple of Value nodes that produced this node
        self.local_grads = local_grads # d(this)/d(each child) 
 
    # d(a+b)/da = 1, d(a+b)/db = 1
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, _children=(self, other))
        out.local_grads = (1.0, 1.0) # gradient of sum wrt each input is 1 (da/da = 1 (if you don't know this: change of one parameter with respect to itself will always be constant 1 if you think about it))
        return out
 
    # d(a*b)/da = b, d(a*b)/db = a
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, _children=(self, other))
        out.local_grads = (other.data, self.data)  # cross-multiply
        return out
 
    # d(x^n)/dx = n * x^(n-1)
    def __pow__(self, other):
        return Value(self.data**other, (self,), (other * self.data**(other-1),))
 
    # d(log x)/dx = 1/x
    def log(self):
        return Value(math.log(self.data + 1e-10), (self,), (1/(self.data + 1e-10),))
 
    # d(exp x)/dx = exp(x)
    def exp(self):
        return Value(math.exp(self.data), (self,), (math.exp(self.data),))
 
    # d(relu x)/dx = 1 if x>0, else 0
    def relu(self):
        return Value(max(0, self.data), (self,), (float(self.data > 0),))
 
    def __neg__(self):   return self * -1
    def __sub__(self, other): return self + (-other)
    def __truediv__(self, other): return self * other**-1
 
    # __radd__, __rmul__ etc. are "reflected" operators.
    # Python calls __radd__(b, a) when a.__add__(b) fails (e.g. int + Value) - without these, `3 * some_value` would raise TypeError.
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other

backward()

This method goes inside the Value class we defined above.

Topological sort first (DFS post-order): every node appears after its children! Then reverse the order - that's the backward order. Set the root's gradient to 1 (dLoss/dLoss = 1). Walk in reverse: at each node, multiply its local_grads by its own grad and add to children. This is the chain rule in calculus -> "If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man."

backward() propagates gradients through the ENTIRE computation graph in one pass but through every node that participated in computing the loss

loss = a*b + c = 7. Forward pass builds values (green). Backward seeds grad=1 at loss, then chain-rule propagates to a (grad=3), b (grad=2), c (grad=1).

def backward(self):
    topological_graph = []
    visited = set()
 
    def build_top(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_top(child)          # recurse into children first
            topological_graph.append(v)  # append AFTER children = post-order
 
    build_top(self)
    self.grad = 1                          # dLoss/dLoss = 1
    for v in reversed(topological_graph): # process root before leaves
        for child, local_grad in zip(v._children, v.local_grads):
            child.grad += local_grad * v.grad  # chain rule

Checkpoint: autograd - You now have a scalar autograd engine. Every arithmetic operation records its local gradient. backward() does one reverse-topological pass and applies the chain rule everywhere.

3. hyperparameters and weight init (lines 124-160)

n_embed = 16    # every vector in the model has this length
n_head  = 4     # attention heads (each sees head_dim=4 dimensions)
n_layer = 1     # one transformer block
block_size = 16 # max sequence length
head_dim = n_embed // n_head  # = 4

Random matrix is initialised. Small initialization (sigma=0.08) prevents exploding activations. If weights start large, each layer multiplies by a large number (values overflow) hence we start small, learn from there.

def matrix(rows, cols):
    # Small Gaussian init (sigma=0.08) prevents activation explosion
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            row.append(Value(random.gauss(0, 0.08)))
        result.append(row)
    return result

The full parameter list:

MLP and attention each hold 1024 params. Embeddings and output head mirror each other at 432 each.

All six matrices are stored as lists so each transformer block gets its own independent copy:

wq   = []
wk   = []
wv   = []
wo   = []
w_fc1 = []
w_fc2 = []
 
for i in range(n_layer):
    wq.append(matrix(n_embed, n_embed))          # 16 x 16
    wk.append(matrix(n_embed, n_embed))          # 16 x 16
    wv.append(matrix(n_embed, n_embed))          # 16 x 16
    wo.append(matrix(n_embed, n_embed))          # 16 x 16
    w_fc1.append(matrix(n_embed * 4, n_embed))   # 64 x 16 (output_dim, input_dim hence it's n*4,n and not the other way around)
    w_fc2.append(matrix(n_embed, n_embed * 4))   # 16 x 64
 
lm_head = matrix(vocab_size, n_embed)            # 27 x 16

Every Value node gets collected into one flat list so the SGD loop can update all 4,192 parameters in a single pass:

def flatten(mat):
    result = []
    for row in mat:
        for val in row:
            result.append(val)
    return result
 
params = flatten(wte) + flatten(wpe)
for i in range(n_layer):
    params += flatten(wq[i]) + flatten(wk[i]) + flatten(wv[i]) + flatten(wo[i])
    params += flatten(w_fc1[i]) + flatten(w_fc2[i])
params += flatten(lm_head)
 
print(f"num params: {len(params)}")   # 4192

This is what makes for param in params: param.data -= lr * param.grad in the training loop work (all parameters from all layers in one flat iterable)

Embeddings are vectors, not integer IDs, because vectors capture similarity. 'anna' and 'ana' share characters; their embeddings should be close in space. An integer ID for 'a' (=0) has no geometric relationship to 'n' (=13). A learned 16-dim vector does.

4. utility functions (lines 163-186)

linear: matrix-vector multiply. This is the forward pass of a single layer.

def linear(x, w):
    # output[i] = dot(w[i], x) for each row of the weight matrix
    result = []
    for row in w:               # each row = one output neuron's weights
        total = Value(0.0)
        for weight, inp in zip(row, x):   # zip pairs each weight with its input
            total = total + weight * inp  # accumulate dot product
        result.append(total)
    return result

softmax: converts logits to probabilities. The numerical stability trick:

$e^{x_i - c} / \sum_j e^{x_j - c} = e^{x_i} / \sum_j e^{x_j}$

The constant $c$ cancels. Subtract max before exponentiating to avoid overflow.

Naive: identical logits → identical bars, yet still overflows to NaN. Stable: subtract max first → tiny exponents, same relative proportions. Animated: naive bars flicker (unstable); stable bars glow steadily.

def softmax(logits):
    max_logit = max(l.data for l in logits)    # stability: subtract max before exp
    exp_logits = []
    for logit in logits:
        exp_logits.append((logit - max_logit).exp())
    sum_exp = Value(0.0)
    for e in exp_logits:
        sum_exp = sum_exp + e
    result = []
    for e in exp_logits:
        result.append(e / sum_exp)
    return result
 
def rmsnorm(x):
    # Normalize by root-mean-square.
    # mean_sq = average of (xi squared)
    # rms = sqrt(mean_sq)
    # output = xi / rms
    eps = 1e-8
    sq_total = Value(0.0)
    for xi in x:
        sq_total = sq_total + xi ** 2
    mean_sq = sq_total / len(x)
    rms = (mean_sq + eps) ** 0.5
    result = []
    for xi in x:
        result.append(xi / rms)
    return result

After softmax, the output is a probability distribution - non-negative values that sum to exactly 1.0. In attention_block this turns raw dot-product scores into attention weights. In the training loop it turns the model's final logits into token probabilities.

Normalization keeps activations in a reasonable range. Nothing fancy here.

Design choice: RMSNorm instead of LayerNorm. Simpler (no mean subtraction, no learned scale/bias parameters). Modern LLMs like LLaMA use RMSNorm for the same reason.

Input vector has a wide range (−5 to 6). After dividing by RMS≈3.97, values are compressed to a tighter range (~−1.3 to 1.5). Shape preserved, magnitudes tamed. Animated: input → ÷ RMS → output.

5. attention (lines 189-235)

the problem attention solves

An MLP processes each position independently. Token at position 3 has no access to position 0. But predicting the 'a' in "emm_" is easier with the full context. Attention gives each token a channel to read from every past position.

Q, K, V

Before computing Q, K, V, ask: who should each token pay attention to? A token representing 'm' in position 2 of 'emma' should probably attend to position 1 ('e') more than position 0 (BOS). Q, K, V give the model the machinery to figure that out from content.

Each token projects its embedding through three learned matrices to produce three vectors:

Q (query): "what am I looking for?"
K (key): "what do I advertise?"
V (value): "what do I carry?"

Formula: $\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$

This code implements exactly that, one scalar at a time.

Each token projects to Q, K, V via separate learned matrices. Attention score = dot(Q[t], K[s]) / sqrt(head_dim). Output = weighted sum of Vs.

causal masking

score(t, s) = dot(Q[t], K[s]) / sqrt(head_dim). Dot product measures alignment: if Q[t] and K[s] point in similar directions in 16-dimensional space, their dot product is large. Large score means high attention weight, which means s's value gets pulled into t's output. Basically the model routes information based on what the query is asking and what the key is advertising.

The sqrt(head_dim) scaling keeps dot products from growing with the embedding dimension. At high dimension, a few scores dominate, softmax saturates (one weight goes to 1.0, rest go to 0), and gradients vanish. The scaling prevents this.

Without the causal mask every position can see all future tokens (the model learns to copy from the future rather than predict it; cheating)

Result? Remove range(t+1) and replace with range(len(x_seq)) (which means seeing all tokens) and the model immediately hits zero loss on training data while failing completely on generation.

Causal mask: the loop runs s in range(t+1), so future tokens never exist in the score list. No explicit mask matrix (just don't compute the scores)

This is O(T²) per head per layer. For T=16: 16×16=256 dot products per head, 1024 total. Fine here. At T=100K (long-context models), it'll crash, hence algorithms like flash attention is used.

Each column t = one query token. It can only attend to rows s ≤ t. The causal triangle reveals progressively. Animated: columns appear sequentially.

multi-head

Split Q, K, V into n_head=4 slices of head_dim=4 each. Run attention independently per head. Heads specialize: one might track character repetition ("mm" in "emma"), another common name endings, another positional distance. This comes from training. Concatenate the 4 outputs back to 16 dims, then project through wo.

The output of each attention head is a weighted average of value vectors: head_out[d] = sum over s of weights[s] * V[s][d].

Attention is just a learnable weighted sum.

16-dim embedding splits into 4 heads (4 dims each). Each attends independently, outputs concatenated, then projected through Wo back to 16 dims.

the residual connection

After attention: x = x + attn_out. This is called skip connection from ResNet. Without it, gradients multiply through every layer and hit zero before reaching early parameters. With it, the gradient path from loss to input includes a direct addition - the derivative of x + f(x) w.r.t. x is 1 + df/dx, so the gradient is never zero even if f learns nothing. See attention-residuals.

def attention_block(x_seq, wq_i, wk_i, wv_i, wo_i):
    Q = []
    for x in x_seq:
        Q.append(linear(x, wq_i))
    K = []
    for x in x_seq:
        K.append(linear(x, wk_i))
    V = []
    for x in x_seq:
        V.append(linear(x, wv_i))
 
    out_seq = []
    for t in range(len(x_seq)):
        x_attn = []
        for h in range(n_head):
            start = h * head_dim
            end = (h + 1) * head_dim
            qh = Q[t][start:end]
 
            scores = []
            for s in range(t + 1):  # causal: no peeking at future
                score = Value(0.0)
                for qi, ki in zip(qh, K[s][start:end]):
                    score = score + qi * ki
                score = score / math.sqrt(head_dim)  # scale to stabilize softmax
                scores.append(score)
 
            weights = softmax(scores)
 
            head_out = []
            for d in range(head_dim):
                # weighted sum of values across attended positions
                val = Value(0.0)
                for s in range(t + 1):
                    val = val + weights[s] * V[s][start + d]
                head_out.append(val)
            for elem in head_out:
                x_attn.append(elem)
 
        out_seq.append(linear(x_attn, wo_i))  # mix across heads
    return out_seq

Checkpoint: attention - Attention computes content-based weighted averages of value vectors. Multi-head runs this in parallel across different learned projections. Causal masking prevents looking at future tokens. The math is three linear projections, a scaled dot product, a softmax, and a weighted sum.

6. the MLP (lines 238-248)

Attention mixes information across positions. MLP processes each position's vector independently. Expand 16 -> 64 (4x width), apply ReLU, project back 64 -> 16.

The logic is simple; more dimensions = more you know about the data (more capacity; intuitive when you think about any data and increase dimensions)

MLP = per-token compute. Attention = cross-token communication. The transformer alternates between these two operations: attention lets tokens talk to each other, MLP lets each token process what it heard.

In larger models, the MLP layers are where most factual associations are stored - things like 'anna' implies female name. Attention routes information; MLP transforms it.

Design choice: ReLU instead of GELU. GELU is smoother and used in GPT-2/GPT-3. ReLU is simpler and works fine here.

16 inputs (blue) fan out to 64 hidden nodes. Green nodes survive ReLU; red nodes are zeroed. Survivors fan back in to 16 outputs (orange).

def mlp_block(x, w_fc1_i, w_fc2_i):
    x = linear(x, w_fc1_i)          # 16 -> 64: project to wider space
    relu_out = []
    for xi in x:
        relu_out.append(xi.relu())   # non-linearity (ReLU, not GELU)
    x = linear(relu_out, w_fc2_i)   # 64 -> 16: project back
    return x

7. the full forward pass (lines 251-271)

Positional embeddings are not optional. The attention mechanism is permutation equivariant: without position information, 'emma' and 'mame' produce identical representations because attention treats the set of tokens the same regardless of order. wpe encodes order so the model can distinguish position 0 from position 3.

def gpt(tokens):
    # 1. Build input embeddings: add token embedding + position embedding
    # enumerate(tokens) gives (0, tok0), (1, tok1), ...
    # wte[tok] and wpe[pos] are each 16-dim lists; we add them element by element
    x_seq = []
    for pos, tok in enumerate(tokens):
        emb = []
        for te, pe in zip(wte[tok], wpe[pos]):
            emb.append(te + pe)
        x_seq.append(emb)
    # x_seq is now a list of T vectors, each of length 16
 
    for i in range(n_layer):
        # 2. Normalize, run attention, add residual (x = x + attn_out)
        normed = []
        for x in x_seq:
            normed.append(rmsnorm(x))
        attn_out = attention_block(normed, wq[i], wk[i], wv[i], wo[i])
        new_x_seq = []
        for attn, x in zip(attn_out, x_seq):
            combined = []
            for a, b in zip(attn, x):          # add skip connection
                combined.append(a + b)          # new_x[d] = attn_out[d] + old_x[d]
            new_x_seq.append(combined)
        x_seq = new_x_seq
 
        # 3. Normalize, run MLP, add residual (same pattern)
        new_x_seq = []
        for x in x_seq:
            mlp_out = mlp_block(rmsnorm(x), w_fc1[i], w_fc2[i])
            combined = []
            for a, b in zip(mlp_out, x):
                combined.append(a + b)
            new_x_seq.append(combined)
        x_seq = new_x_seq
 
    # 4. Project each position's 16-dim vector to 27 logits (one per vocab token)
    logits_seq = []
    for x in x_seq:
        logits_seq.append(linear(x, lm_head))
    return logits_seq

Shape walkthrough for input "em" (2 tokens, e=4, m=12):

tokens = [26, 4, 12]             # BOS, 'e', 'm'        shape: [3]
x_seq  = [[...], [...], [...]]   # after wte+wpe         shape: [3, 16]
normed = [[...], [...], [...]]   # after rmsnorm          shape: [3, 16]
attn_out = [[...], [...], [...]] # after attention        shape: [3, 16]
x_seq (residual) = x_seq + attn_out                      shape: [3, 16]
mlp_out = [[...], [...], [...]]  # after mlp              shape: [3, 16]
x_seq (residual) = x_seq + mlp_out                       shape: [3, 16]
logits = [[...], [...], [...]]   # after lm_head          shape: [3, 27]

Each token and its position each contribute a learned 16-dim vector. The two vectors are added element by element.

The result is the model's complete representation of "this token at this position."

Full forward pass. Pre-norm (RMSNorm before attention/MLP) is more stable than post-norm. Residual connections after each sub-layer.

Pre-norm (RMSNorm before each sub-layer) vs post-norm (after): pre-norm keeps the residual stream unmodified by the norm. The original Transformer used post-norm; GPT-2 switched to pre-norm after finding it more stable at scale. LLaMA kept pre-norm with RMSNorm.

Residual connections let the model start from identity (pass x through unchanged) and learn deltas. The derivative of x + f(x) with respect to x is 1 + df/dx (the gradient always has a clean path back through the 1 term) without it, gradients vanish through many layers.

The model is differentiable. Every operation (embeddings, attention scores, softmax, weighted sum, MLP, final projection) is composed of Value nodes. loss.backward() computes dLoss/d(every weight) in one pass.

8. training loop (lines 274-318)

loss: cross-entropy

Cross-entropy loss = -log(p[target]).

If the model assigns 90% probability to the correct token: loss = -log(0.9) = 0.11.

If it assigns 1%: loss = -log(0.01) = 4.6.

The log function penalizes confident wrong answers harshly and rewards confident correct ones. It's the only loss function with that property that's computationally stable.

Why log? Cross-entropy is the negative log-likelihood of the data under the model (log turns products of probabilities into sums, numerically stable and easier to differentiate)

Random model gives uniform probability ≈ 1/27, loss ≈ 3.3. A good model gives high p to the correct token, loss near 0. The dot shows what training does: pushes the model rightward along this curve.

backward + SGD

for step in range(num_steps):
    total_loss = Value(0.0)
 
    for b in range(batch_size):
        # modulo wraps around so we cycle through all docs repeatedly
        doc = docs[(step * batch_size + b) % len(docs)]
        # chars.index(c) converts each character to its integer ID
        char_tokens = []
        for c in doc:
            char_tokens.append(chars.index(c))
        tokens = [BOS] + char_tokens + [BOS]
        tokens = tokens[:block_size + 1]    # truncate if name > 16 chars
 
        input_tokens  = tokens[:-1]         # model sees this
        target_tokens = tokens[1:]          # model must predict this
 
        logits_seq = gpt(input_tokens)
 
        loss = Value(0.0)
        for logits, target in zip(logits_seq, target_tokens):
            probs = softmax(logits)
            loss = loss + (-probs[target].log())   # negative log-likelihood
        total_loss = total_loss + loss / len(logits_seq)
 
    avg_loss = total_loss / batch_size
    avg_loss.backward()          # backprop through the ENTIRE Value graph
 
    # SGD: param -= lr * grad
    for param in params:
        param.data -= learning_rate * param.grad
        param.grad = 0           # must reset or gradients accumulate
 
    print(f"Step {step+1}/{num_steps}, Loss: {avg_loss.data:.4f}")

Gradients accumulate by += in backward(). If you skip the zero step, gradients from step 1 add to gradients from step 2, tripling the effective learning rate and causing divergence. Zero every step.

Surprise: this is vanilla SGD, not Adam. Three lines: subtract gradient, zero gradient, move on. Adam tracks per-parameter momentum and variance, which matters at scale but adds 2x memory and complexity. For 4K parameters and 20 steps, SGD works. if you want Adam (which works great at scale and adds two extra moment terms per parameter) look at the actual lines in karpathy's repo.

Loss starts at ~3.3 (-log(1/27) ≈ 3.3, random model with 27 classes). It drops to ~2.37 over 20 steps. The model has seen 160 names (20 steps × 8 batch). Output at this point: names are phonetically plausible but rough - "aria", "eman", "marn". Train for 500 steps and they get much better.

$ python train.py
num docs: 32033
vocab size: 27
num params: 4192
step    1 / 1000 | loss 3.3660
step    6 / 1000 | loss 2.9452
step   11 / 1000 | loss 2.7964
step   13 / 1000 | loss 3.0544
...

Loss descends noisily from ~3.3 (random baseline) to ~2.37 over 20 steps. The spike at step 13 is real - gradient steps are noisy by design.

Checkpoint: training - Forward pass computes loss. backward() fills every param.grad. SGD subtracts lr*grad. grad=0 resets for next step. This is the complete training algorithm.

9. inference (lines 321-343)

Start from [BOS]. Run the full forward pass. Take logits at the last position. Apply temperature. Softmax. Sample. Append. Repeat until BOS comes back out.

Logits are unnormalized scores before softmax. A logit of 5.0 doesn't mean 'probability 5.0' - it just means this token is more likely than one with logit 2.0. Softmax turns the relative differences into a probability distribution.

That's all autoregressive generation is: run the model, get probabilities over next tokens, pick one, condition on it, repeat. The model never "sees" what it's about to generate - it only sees what it has generated so far.

Temperature is logit / T before softmax. If the model's top logit is 2.0 and second is 1.5:

T=0.5: logits become 4.0 and 3.0 - gap widens, top token dominates
T=1.0: unchanged
T=2.0: logits become 1.0 and 0.75 - gap narrows, more even distribution

Temperature controls the entropy of the output distribution.

T=0 gives argmax (always pick the most likely token, no diversity)
T=1 samples from the raw model distribution
T>1 flattens the distribution (more surprising outputs)

Sampling instead of argmax gives diversity: two runs with the same context produce different names. This is why when you enter the same prompt more than once in chatgpt; you get different results.

T=0.5 produces repetitive but valid looking names. T=1.5 produces novel but sometimes nonsensical ones.

Attention weights depend on the content of Q and K, not their positions. Position information enters only through the initial wpe embeddings. Once inside the transformer, everything is content-based.

Same underlying logits. T=0.5 concentrates probability on likely tokens. T=2.0 flattens toward uniform. Animated: cycles through T=0.5, 1.0, 2.0.

Name generation animation:

Token-by-token generation. Each token is sampled from the model's output distribution, appended to context, and fed back in. BOS = stop.

temperature = 0.5
for _ in range(10):
    context = [BOS]
    name = ""
    for _ in range(block_size):
        logits_seq = gpt(context)
        # apply temperature: divide logits before softmax
        # lower temperature = model picks safer, more common tokens
        logits_t = []
        for l in logits_seq[-1]:
            logits_t.append(l / temperature)
        probs = softmax(logits_t)
        # random.choices samples one item from range(vocab_size)
        # weighted by probability - higher prob tokens get picked more often
        # [0] because random.choices returns a list; we want the single item
        prob_data = []
        for p in probs:
            prob_data.append(p.data)
        next_token = random.choices(range(vocab_size), weights=prob_data)[0]
        if next_token == BOS:    # BOS doubles as end-of-name sentinel
            break
        context.append(next_token)
        name += chars[next_token]
    print(name)

Running this after 1000 training steps:

sample  1: kamon
sample  2: ann
sample  3: karai
sample  4: jaire
sample  5: vialan
sample  6: karia
sample  7: yeran
sample  8: anna
sample  9: areli
sample 10: kaina
sample 11: konna
sample 12: keylen
sample 13: liole
sample 14: alerin
sample 15: earan
sample 16: lenne
sample 17: kana
sample 18: lara
sample 19: alela
sample 20: anton

The model has never seen most of these, it learns the pattern (vowel/consonant rhythm, common endings like -a, -an, -ina) from examples in 1000 steps.

10. from microgpt to chatgpt

microGPT	ChatGPT
32K names	trillions of internet tokens
27-token char vocab	~50K BPE subword tokens
4,192 params, 1 layer	175B params, 96 layers
single CPU	thousands of H100s
vanilla SGD	Adam + LR schedule + grad clipping
no post-training	SFT + RLHF + safety alignment
ReLU	GELU
RMSNorm	LayerNorm (GPT-2) / RMSNorm (LLaMA)

Same algorithm. Scale and engineering differ.

The core attention mechanism is from Attention paper (2017) and hasn't changed. Only thing from 2017-GPT4 now is data pipeline, post training (SFT + RLHF), infra, compute budget.

references

Karpathy's makemore - the dataset and spirit
Andrej Karpathy: The spelled-out intro to language modeling
Jay Alammar: The Illustrated Transformer
Jay Alammar: The Illustrated GPT-2
Vaswani et al. (2017): Attention Is All You Need