The previous post covered what happens when you press 'submit' on ChatGPT. This covers the actual code.
ChatGPT and the script below share the exact same algorithm. Scale and engineering differ; the math does not. ChatGPT runs on thousands of H100s trained on the internet; this runs on your laptop in a few minutes, trained on random names instead of the entire internet.
Prerequisites: know what a gradient is. If not: read neural networks, the math
Core idea: A language model has one job: predict the next token given the context so far
1. data (lines 1-23)
Dataset: 32,000 human first names from Karpathy's makemore project. Instead of using BytePairEncoding (which is what gippity uses as it's tokeniser) we use character level tokenizer so the alphabet maps to IDs 0-25, and one special token sits at ID 26: BOS (beginning-of-sequence). Vocab size = 27 (26 a-z + 1 BOS)
import urllib.request, os
url = "https://raw.githubusercontent.com/karpathy/makemore/master/names.txt"
if not os.path.exists("names.txt"):
urllib.request.urlretrieve(url, "names.txt")
docs = open("names.txt").read().splitlines() # list of 32k names
# "".join(docs) concatenates every name into one giant string, set() removes duplicates, sorted() sorts them
chars = sorted(set("".join(docs))) # ['a','b',...,'z'] - 26 chars
BOS = len(chars) # ID 26
vocab_size = len(chars) + 1 # 27BOS serves as both beginning of sentence and end of sentence, instead of having two seperate tokens, we use one for simplicity. [BOS, c1, c2, ..., cn, BOS]. When the model produces BOS during inference - stop (here EOS and BOS are the same)
BOS doubling as EOS works because the model learns from examples: every name in the training data ends with BOS. After 20 steps it starts producing BOS as the termination signal naturally.
Tokenization for "emma":
Two things here: input targets and training targets. Input targets is everything except last token and training targets must predict everything starting from the first token. The model sees previous input tokens [:-1] and must predict [1:] at each position (next token). "emma" generates 5 training pairs:
doc = "emma"
char_tokens = []
for c in doc:
char_tokens.append(chars.index(c))
tokens = [BOS] + char_tokens + [BOS] # tokens = [26, 4, 12, 12, 0, 26]
input_tokens = tokens[:-1] # [26, 4, 12, 12, 0]
target_tokens = tokens[1:] # [4, 12, 12, 0, 26]At position t, the model sees tokens[0..t] and must predict tokens[t+1]
2. autograd engine (lines 25-121)
Parameters is what you tune / nudge in the right direction. Parameters consists of weights and biases of the model (think of them as the neurons in our brain); in order to make sure they learn the right things -> we need to tune them as the model learns more and more data. What is the 'right' direction here? We check the impact of every weight/bias with respect to our Loss and see how increasing or decreasing the value for weights changes loss. In order to minimise loss and change parameters efficiently; we need to move in the negative direction of gradient (change in parameter) : This is called backpropagation and does this with chain rule. Entire of microgpt is just backpropagation with optimisations on top. Gradient descent needs dLoss/d(every parameter). For 4,192 parameters, that's 4,192 separate derivative calculations per step. Autograd is a library karpathy made which computes all of them by tracking every operation and reversing it by keeping a graph of it. You can see more about it here
the Value class
Every scalar is a Value node. It stores four things: data (the actual value of parameter), gradient (the slope of that parameter wrt loss)), local_gradients (slope of that particular parameter wrt to it's inputs), children (to keep track of it's inputs (for c = a+b, a and b are inputs with _op + to get c))
local_grads are stored as a tuple (each operation captures its derivative formula at construction time)
# Python "dunder" methods like __add__ let you define what happens when you write a + b: Python calls a.__add__(b) under the hood
# The derivative formula lim h-> 0 f(x+h) - f(x)/h tells us if we add small h to a function, how does it change? Does it increase/decrease. What is the impact. This is known as the derivative of that function.
# We need a way to compute impact of all parameters (inputs) with respect to Loss so we can change / minimise loss accordingly and get the best values of these parameters which makes our loss minimum. This is backprop btw.
class Value:
def __init__(self, data, _children=(), local_grads=()):
self.data = data
self.grad = 0 # dLoss/d(this) - general gradient or derivative, this gets updated when you call .backward()
self._children = _children # tuple of Value nodes that produced this node
self.local_grads = local_grads # d(this)/d(each child)
# d(a+b)/da = 1, d(a+b)/db = 1
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, _children=(self, other))
out.local_grads = (1.0, 1.0) # gradient of sum wrt each input is 1 (da/da = 1 (if you don't know this: change of one parameter with respect to itself will always be constant 1 if you think about it))
return out
# d(a*b)/da = b, d(a*b)/db = a
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, _children=(self, other))
out.local_grads = (other.data, self.data) # cross-multiply
return out
# d(x^n)/dx = n * x^(n-1)
def __pow__(self, other):
return Value(self.data**other, (self,), (other * self.data**(other-1),))
# d(log x)/dx = 1/x
def log(self):
return Value(math.log(self.data + 1e-10), (self,), (1/(self.data + 1e-10),))
# d(exp x)/dx = exp(x)
def exp(self):
return Value(math.exp(self.data), (self,), (math.exp(self.data),))
# d(relu x)/dx = 1 if x>0, else 0
def relu(self):
return Value(max(0, self.data), (self,), (float(self.data > 0),))
def __neg__(self): return self * -1
def __sub__(self, other): return self + (-other)
def __truediv__(self, other): return self * other**-1
# __radd__, __rmul__ etc. are "reflected" operators.
# Python calls __radd__(b, a) when a.__add__(b) fails (e.g. int + Value) - without these, `3 * some_value` would raise TypeError.
def __radd__(self, other): return self + other
def __rmul__(self, other): return self * otherbackward()
This method goes inside the Value class we defined above.
Topological sort first (DFS post-order): every node appears after its children! Then reverse the order - that's the backward order. Set the root's gradient to 1 (dLoss/dLoss = 1). Walk in reverse: at each node, multiply its local_grads by its own grad and add to children. This is the chain rule in calculus -> "If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man."
backward() propagates gradients through the ENTIRE computation graph in one pass but through every node that participated in computing the loss
def backward(self):
topological_graph = []
visited = set()
def build_top(v):
if v not in visited:
visited.add(v)
for child in v._children:
build_top(child) # recurse into children first
topological_graph.append(v) # append AFTER children = post-order
build_top(self)
self.grad = 1 # dLoss/dLoss = 1
for v in reversed(topological_graph): # process root before leaves
for child, local_grad in zip(v._children, v.local_grads):
child.grad += local_grad * v.grad # chain rule3. hyperparameters and weight init (lines 124-160)
n_embed = 16 # every vector in the model has this length
n_head = 4 # attention heads (each sees head_dim=4 dimensions)
n_layer = 1 # one transformer block
block_size = 16 # max sequence length
head_dim = n_embed // n_head # = 4Random matrix is initialised. Small initialization (sigma=0.08) prevents exploding activations. If weights start large, each layer multiplies by a large number (values overflow) hence we start small, learn from there.
def matrix(rows, cols):
# Small Gaussian init (sigma=0.08) prevents activation explosion
result = []
for r in range(rows):
row = []
for c in range(cols):
row.append(Value(random.gauss(0, 0.08)))
result.append(row)
return resultThe full parameter list:
All six matrices are stored as lists so each transformer block gets its own independent copy:
wq = []
wk = []
wv = []
wo = []
w_fc1 = []
w_fc2 = []
for i in range(n_layer):
wq.append(matrix(n_embed, n_embed)) # 16 x 16
wk.append(matrix(n_embed, n_embed)) # 16 x 16
wv.append(matrix(n_embed, n_embed)) # 16 x 16
wo.append(matrix(n_embed, n_embed)) # 16 x 16
w_fc1.append(matrix(n_embed * 4, n_embed)) # 64 x 16 (output_dim, input_dim hence it's n*4,n and not the other way around)
w_fc2.append(matrix(n_embed, n_embed * 4)) # 16 x 64
lm_head = matrix(vocab_size, n_embed) # 27 x 16Every Value node gets collected into one flat list so the SGD loop can update all 4,192 parameters in a single pass:
def flatten(mat):
result = []
for row in mat:
for val in row:
result.append(val)
return result
params = flatten(wte) + flatten(wpe)
for i in range(n_layer):
params += flatten(wq[i]) + flatten(wk[i]) + flatten(wv[i]) + flatten(wo[i])
params += flatten(w_fc1[i]) + flatten(w_fc2[i])
params += flatten(lm_head)
print(f"num params: {len(params)}") # 4192This is what makes for param in params: param.data -= lr * param.grad in the training loop work (all parameters from all layers in one flat iterable)
Embeddings are vectors, not integer IDs, because vectors capture similarity. 'anna' and 'ana' share characters; their embeddings should be close in space. An integer ID for 'a' (=0) has no geometric relationship to 'n' (=13). A learned 16-dim vector does.
4. utility functions (lines 163-186)
linear: matrix-vector multiply. This is the forward pass of a single layer.
def linear(x, w):
# output[i] = dot(w[i], x) for each row of the weight matrix
result = []
for row in w: # each row = one output neuron's weights
total = Value(0.0)
for weight, inp in zip(row, x): # zip pairs each weight with its input
total = total + weight * inp # accumulate dot product
result.append(total)
return resultsoftmax: converts logits to probabilities. The numerical stability trick:
The constant cancels. Subtract max before exponentiating to avoid overflow.
def softmax(logits):
max_logit = max(l.data for l in logits) # stability: subtract max before exp
exp_logits = []
for logit in logits:
exp_logits.append((logit - max_logit).exp())
sum_exp = Value(0.0)
for e in exp_logits:
sum_exp = sum_exp + e
result = []
for e in exp_logits:
result.append(e / sum_exp)
return result
def rmsnorm(x):
# Normalize by root-mean-square.
# mean_sq = average of (xi squared)
# rms = sqrt(mean_sq)
# output = xi / rms
eps = 1e-8
sq_total = Value(0.0)
for xi in x:
sq_total = sq_total + xi ** 2
mean_sq = sq_total / len(x)
rms = (mean_sq + eps) ** 0.5
result = []
for xi in x:
result.append(xi / rms)
return resultAfter softmax, the output is a probability distribution - non-negative values that sum to exactly 1.0. In attention_block this turns raw dot-product scores into attention weights. In the training loop it turns the model's final logits into token probabilities.
Normalization stabilizes training by keeping activations in a reasonable range. It stablizes training; nothing fancy here.
5. attention (lines 189-235)
the problem attention solves
An MLP processes each position independently. Token at position 3 has no access to position 0. But predicting the 'a' in "emm_" is easier with the full context. Attention gives each token a channel to read from every past position.
Q, K, V
Before computing Q, K, V, ask: who should each token pay attention to? A token representing 'm' in position 2 of 'emma' should probably attend to position 1 ('e') more than position 0 (BOS). Q, K, V give the model the machinery to figure that out from content.
Each token projects its embedding through three learned matrices to produce three vectors:
- Q (query): "what am I looking for?"
- K (key): "what do I advertise?"
- V (value): "what do I carry?"
Formula:
This code implements exactly that, one scalar at a time.
causal masking
score(t, s) = dot(Q[t], K[s]) / sqrt(head_dim). Dot product measures alignment: if Q[t] and K[s] point in similar directions in 16-dimensional space, their dot product is large. Large score means high attention weight, which means s's value gets pulled into t's output. Basically the model routes information based on what the query is asking and what the key is advertising.
The sqrt(head_dim) scaling keeps dot products from growing proportionally to dimensionality. Without the sqrt(head_dim) scaling, dot products grow with the embedding dimension. At high dimension, a few scores dominate, softmax saturates (one weight goes to 1.0, rest go to 0), and gradients vanish. The scaling prevents this.
Without the causal mask every position can see all future tokens (the model learns to copy from the future rather than predict it; cheating)
Result? Remove range(t+1) and replace with range(len(x_seq)) (which means seeing all tokens) and the model immediately hits zero loss on training data while failing completely on generation.
Causal mask: the loop runs s in range(t+1), so future tokens never exist in the score list. No explicit mask matrix (just don't compute the scores)
This is O(T²) per head per layer. For T=16: 16×16=256 dot products per head, 1024 total. Fine here. At T=100K (long-context models), it'll crash, hence algorithms like flash attention is used.
multi-head
Split Q, K, V into n_head=4 slices of head_dim=4 each. Run attention independently per head. Heads specialize: one might track character repetition ("mm" in "emma"), another common name endings, another positional distance. This comes from training. Concatenate the 4 outputs back to 16 dims, then project through wo.
The output of each attention head is a weighted average of value vectors: head_out[d] = sum over s of weights[s] * V[s][d].
Attention is just a learnable weighted sum.
the residual connection
After attention: x = x + attn_out. This is called skip connection from ResNet. Without it, gradients multiply through every layer and hit zero before reaching early parameters. With it, the gradient path from loss to input includes a direct addition - the derivative of x + f(x) w.r.t. x is 1 + df/dx, so the gradient is never zero even if f learns nothing. See attention-residuals.
def attention_block(x_seq, wq_i, wk_i, wv_i, wo_i):
Q = []
for x in x_seq:
Q.append(linear(x, wq_i))
K = []
for x in x_seq:
K.append(linear(x, wk_i))
V = []
for x in x_seq:
V.append(linear(x, wv_i))
out_seq = []
for t in range(len(x_seq)):
x_attn = []
for h in range(n_head):
start = h * head_dim
end = (h + 1) * head_dim
qh = Q[t][start:end]
scores = []
for s in range(t + 1): # causal: no peeking at future
score = Value(0.0)
for qi, ki in zip(qh, K[s][start:end]):
score = score + qi * ki
score = score / math.sqrt(head_dim) # scale to stabilize softmax
scores.append(score)
weights = softmax(scores)
head_out = []
for d in range(head_dim):
# weighted sum of values across attended positions
val = Value(0.0)
for s in range(t + 1):
val = val + weights[s] * V[s][start + d]
head_out.append(val)
for elem in head_out:
x_attn.append(elem)
out_seq.append(linear(x_attn, wo_i)) # mix across heads
return out_seq6. the MLP (lines 238-248)
Attention mixes information across positions. MLP processes each position's vector independently. Expand 16 -> 64 (4x width), apply ReLU, project back 64 -> 16.
The logic is simple; more dimensions = more you know about the data (more capacity; intuitive when you think about any data and increase dimensions)
MLP = per-token compute. Attention = cross-token communication. The transformer alternates between these two operations: attention lets tokens talk to each other, MLP lets each token process what it heard.
In larger models, the MLP layers are where most factual associations are stored - things like 'anna' implies female name. Attention routes information; MLP transforms it.
def mlp_block(x, w_fc1_i, w_fc2_i):
x = linear(x, w_fc1_i) # 16 -> 64: project to wider space
relu_out = []
for xi in x:
relu_out.append(xi.relu()) # non-linearity (ReLU, not GELU)
x = linear(relu_out, w_fc2_i) # 64 -> 16: project back
return x7. the full forward pass (lines 251-271)
Positional embeddings are not optional. The attention mechanism is permutation equivariant: without position information, 'emma' and 'mame' produce identical representations because attention treats the set of tokens the same regardless of order. wpe encodes order so the model can distinguish position 0 from position 3.
def gpt(tokens):
# 1. Build input embeddings: add token embedding + position embedding
# enumerate(tokens) gives (0, tok0), (1, tok1), ...
# wte[tok] and wpe[pos] are each 16-dim lists; we add them element by element
x_seq = []
for pos, tok in enumerate(tokens):
emb = []
for te, pe in zip(wte[tok], wpe[pos]):
emb.append(te + pe)
x_seq.append(emb)
# x_seq is now a list of T vectors, each of length 16
for i in range(n_layer):
# 2. Normalize, run attention, add residual (x = x + attn_out)
normed = []
for x in x_seq:
normed.append(rmsnorm(x))
attn_out = attention_block(normed, wq[i], wk[i], wv[i], wo[i])
new_x_seq = []
for attn, x in zip(attn_out, x_seq):
combined = []
for a, b in zip(attn, x): # add skip connection
combined.append(a + b) # new_x[d] = attn_out[d] + old_x[d]
new_x_seq.append(combined)
x_seq = new_x_seq
# 3. Normalize, run MLP, add residual (same pattern)
new_x_seq = []
for x in x_seq:
mlp_out = mlp_block(rmsnorm(x), w_fc1[i], w_fc2[i])
combined = []
for a, b in zip(mlp_out, x):
combined.append(a + b)
new_x_seq.append(combined)
x_seq = new_x_seq
# 4. Project each position's 16-dim vector to 27 logits (one per vocab token)
logits_seq = []
for x in x_seq:
logits_seq.append(linear(x, lm_head))
return logits_seqShape walkthrough for input "em" (2 tokens, e=4, m=12):
tokens = [26, 4, 12] # BOS, 'e', 'm' shape: [3]
x_seq = [[...], [...], [...]] # after wte+wpe shape: [3, 16]
normed = [[...], [...], [...]] # after rmsnorm shape: [3, 16]
attn_out = [[...], [...], [...]] # after attention shape: [3, 16]
x_seq (residual) = x_seq + attn_out shape: [3, 16]
mlp_out = [[...], [...], [...]] # after mlp shape: [3, 16]
x_seq (residual) = x_seq + mlp_out shape: [3, 16]
logits = [[...], [...], [...]] # after lm_head shape: [3, 27]
Each token and its position each contribute a learned 16-dim vector. The two vectors are added element by element.
The result is the model's complete representation of "this token at this position."
Pre-norm (RMSNorm before each sub-layer) vs post-norm (after): pre-norm keeps the residual stream unmodified by the norm. The original Transformer used post-norm; GPT-2 switched to pre-norm after finding it more stable at scale. LLaMA kept pre-norm with RMSNorm.
Residual connections let the model start from identity (pass x through unchanged) and learn deltas. The derivative of x + f(x) with respect to x is 1 + df/dx (the gradient always has a clean path back through the 1 term) without it, gradients vanish through many layers.
The model is differentiable. Every operation (embeddings, attention scores, softmax, weighted sum, MLP, final projection) is composed of Value nodes. loss.backward() computes dLoss/d(every weight) in one pass.
8. training loop (lines 274-318)
loss: cross-entropy
Cross-entropy loss = -log(p[target]).
If the model assigns 90% probability to the correct token: loss = -log(0.9) = 0.11.
If it assigns 1%: loss = -log(0.01) = 4.6.
The log function penalizes confident wrong answers harshly and rewards confident correct ones. It's the only loss function with that property that's computationally stable.
Why log? Cross-entropy is the negative log-likelihood of the data under the model (log turns products of probabilities into sums, numerically stable and easier to differentiate)
backward + SGD
for step in range(num_steps):
total_loss = Value(0.0)
for b in range(batch_size):
# modulo wraps around so we cycle through all docs repeatedly
doc = docs[(step * batch_size + b) % len(docs)]
# chars.index(c) converts each character to its integer ID
char_tokens = []
for c in doc:
char_tokens.append(chars.index(c))
tokens = [BOS] + char_tokens + [BOS]
tokens = tokens[:block_size + 1] # truncate if name > 16 chars
input_tokens = tokens[:-1] # model sees this
target_tokens = tokens[1:] # model must predict this
logits_seq = gpt(input_tokens)
loss = Value(0.0)
for logits, target in zip(logits_seq, target_tokens):
probs = softmax(logits)
loss = loss + (-probs[target].log()) # negative log-likelihood
total_loss = total_loss + loss / len(logits_seq)
avg_loss = total_loss / batch_size
avg_loss.backward() # backprop through the ENTIRE Value graph
# SGD: param -= lr * grad
for param in params:
param.data -= learning_rate * param.grad
param.grad = 0 # must reset or gradients accumulate
print(f"Step {step+1}/{num_steps}, Loss: {avg_loss.data:.4f}")Gradients accumulate by += in backward(). If you skip the zero step, gradients from step 1 add to gradients from step 2, tripling the effective learning rate and causing divergence. Zero every step.
Loss starts at ~3.3 (-log(1/27) ≈ 3.3, random model with 27 classes). It drops to ~2.37 over 20 steps. The model has seen 160 names (20 steps × 8 batch). Output at this point: names are phonetically plausible but rough - "aria", "eman", "marn". Train for 500 steps and they get much better.
$ python train.py
num docs: 32033
vocab size: 27
num params: 4192
step 1 / 1000 | loss 3.3660
step 6 / 1000 | loss 2.9452
step 11 / 1000 | loss 2.7964
step 13 / 1000 | loss 3.0544
...
9. inference (lines 321-343)
Start from [BOS]. Run the full forward pass. Take logits at the last position. Apply temperature. Softmax. Sample. Append. Repeat until BOS comes back out.
Logits are unnormalized scores before softmax. A logit of 5.0 doesn't mean 'probability 5.0' - it just means this token is more likely than one with logit 2.0. Softmax turns the relative differences into a probability distribution.
That's all autoregressive generation is: run the model, get probabilities over next tokens, pick one, condition on it, repeat. The model never "sees" what it's about to generate - it only sees what it has generated so far.
Temperature is logit / T before softmax. If the model's top logit is 2.0 and second is 1.5:
- T=0.5: logits become 4.0 and 3.0 - gap widens, top token dominates
- T=1.0: unchanged
- T=2.0: logits become 1.0 and 0.75 - gap narrows, more even distribution
Temperature controls the entropy of the output distribution.
- T=0 gives argmax (always pick the most likely token, no diversity)
- T=1 samples from the raw model distribution
- T>1 flattens the distribution (more surprising outputs)
Sampling instead of argmax gives diversity: two runs with the same context produce different names. This is why when you enter the same prompt more than once in chatgpt; you get different results.
T=0.5 produces repetitive but valid looking names. T=1.5 produces novel but sometimes nonsensical ones.
Attention weights depend on the content of Q and K, not their positions. Position information enters only through the initial wpe embeddings. Once inside the transformer, everything is content-based.
Name generation animation:
temperature = 0.5
for _ in range(10):
context = [BOS]
name = ""
for _ in range(block_size):
logits_seq = gpt(context)
# apply temperature: divide logits before softmax
# lower temperature = model picks safer, more common tokens
logits_t = []
for l in logits_seq[-1]:
logits_t.append(l / temperature)
probs = softmax(logits_t)
# random.choices samples one item from range(vocab_size)
# weighted by probability - higher prob tokens get picked more often
# [0] because random.choices returns a list; we want the single item
prob_data = []
for p in probs:
prob_data.append(p.data)
next_token = random.choices(range(vocab_size), weights=prob_data)[0]
if next_token == BOS: # BOS doubles as end-of-name sentinel
break
context.append(next_token)
name += chars[next_token]
print(name)Running this after 1000 training steps:
sample 1: kamon
sample 2: ann
sample 3: karai
sample 4: jaire
sample 5: vialan
sample 6: karia
sample 7: yeran
sample 8: anna
sample 9: areli
sample 10: kaina
sample 11: konna
sample 12: keylen
sample 13: liole
sample 14: alerin
sample 15: earan
sample 16: lenne
sample 17: kana
sample 18: lara
sample 19: alela
sample 20: anton
The model has never seen most of these, it learns the pattern (vowel/consonant rhythm, common endings like -a, -an, -ina) from examples in 1000 steps.
10. from microgpt to chatgpt
| microGPT | ChatGPT |
|---|---|
| 32K names | trillions of internet tokens |
| 27-token char vocab | ~50K BPE subword tokens |
| 4,192 params, 1 layer | 175B params, 96 layers |
| single CPU | thousands of H100s |
| vanilla SGD | Adam + LR schedule + grad clipping |
| no post-training | SFT + RLHF + safety alignment |
| ReLU | GELU |
| RMSNorm | LayerNorm (GPT-2) / RMSNorm (LLaMA) |
Same algorithm. Scale and engineering differ.
The core attention mechanism is from Attention paper (2017) and hasn't changed. Only thing from 2017-GPT4 now is data pipeline, post training (SFT + RLHF), infra, compute budget.
references
- Karpathy's makemore - the dataset and spirit
- Andrej Karpathy: The spelled-out intro to language modeling
- Jay Alammar: The Illustrated Transformer
- Jay Alammar: The Illustrated GPT-2
- Vaswani et al. (2017): Attention Is All You Need