My Notes on Karpathy's Micrograd: Building an Autograd Engine from Scratch
What is this?
This post breaks down the math and code behind Andrej Karpathy’s micrograd, a ~150 line autograd engine that implements backpropagation from scratch.
Karpathy also has a 2.5 hour YouTube video building micrograd from zero. If you prefer video, watch that. This post covers the same material in written form with the math spelled out.
The repo has two files:
engine.py(~95 lines): The autograd enginenn.py(~60 lines): Neural network building blocks
That’s enough to train a small neural network.
Quick history
Backpropagation has been around since the 1970s (Seppo Linnainmaa’s thesis), but took off in 1986 when Rumelhart, Hinton, and Williams published “Learning representations by back-propagating errors”. That paper showed you could train multi-layer networks with it.
The math you need
Before the code, let’s get the math straight.
Gradients
A gradient is just a vector of partial derivatives. If you have a function :
It points toward steepest ascent. We want to minimize loss, so we go the opposite direction.
The chain rule
This is the whole thing. If :
Neural networks are just nested functions, so we chain a bunch of these together:
We compute this from the end backwards, hence “backpropagation.”
Why “automatic” differentiation?
Three ways to compute derivatives:
- Symbolic: Manipulate formulas (Wolfram Alpha style). Gets messy fast with complex expressions.
- Numerical: Finite differences, . Slow and numerically unstable.
- Automatic: Track operations, apply chain rule. This is what micrograd does.
The key thing: autodiff doesn’t give you a formula for the derivative. It gives you the derivative’s value at a specific point. Big difference.
The Value class
This is the core of everything. Here’s what each Value stores:
class Value:
def __init__(self, data, _children=(), _op=''):
self.data = data # the actual number
self.grad = 0 # ∂L/∂(this), starts at 0
self._backward = lambda: None # how to propagate gradients
self._prev = set(_children) # what created this value
self._op = _op # for debugging
When you do c = a + b, micrograd builds a graph:
a ────┐
├──► c = a + b
b ────┘
Each node knows its parents. That’s how we traverse backwards.
Why += for gradients?
Look at this:
def _backward():
self.grad += out.grad # += not =
Why +=? Because a value might be used multiple times:
a = Value(3.0)
b = a + a # a appears twice
Both paths contribute to a’s gradient. If b = a + a, then . We need to sum contributions from all paths. That’s the multivariate chain rule.
Operations and their gradients
Let’s go through each operation. I’ll show the code first, then the math.
Addition
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad
other.grad += out.grad
out._backward = _backward
return out
If :
Same for . Addition just passes the gradient through unchanged to both inputs.
Multiplication
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += other.data * out.grad
other.grad += self.data * out.grad
out._backward = _backward
return out
If :
Each input’s gradient gets scaled by the other input’s value. Makes sense if you think about it: if is large, then small changes in have big effects on the output.
Power
def __pow__(self, other):
assert isinstance(other, (int, float))
out = Value(self.data**other, (self,), f'**{other}')
def _backward():
self.grad += (other * self.data**(other-1)) * out.grad
out._backward = _backward
return out
If :
Classic power rule from calc. This handles division too: , and the power rule gives us the derivative of .
ReLU
def relu(self):
out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')
def _backward():
self.grad += (out.data > 0) * out.grad
out._backward = _backward
return out
Gradient is 1 if positive, 0 if negative. It’s a gate: either lets the gradient through or blocks it.
Why ReLU instead of sigmoid or tanh? Those older activations have gradients that approach 0 for large inputs (vanishing gradient problem). ReLU’s gradient is always 0 or 1, which helps deep networks train. AlexNet (2012) popularized this.
(Yes, ReLU isn’t differentiable at 0. We just say the gradient is 0 there. Works fine in practice.)
The backward pass
Here’s where everything comes together:
def backward(self):
# Build topological order
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
# Backpropagate
self.grad = 1
for v in reversed(topo):
v._backward()
Why topological sort?
To compute , you need first (where depends on ). So you have to process nodes in the right order, outputs before inputs.
Example:
Forward: a, b → (a+b) → (a+b)·c → L
c ────────────┘
Topo order: [a, b, c, (a+b), L]
Process reversed: [L, (a+b), c, b, a]
The DFS builds this order. A node gets added only after all its dependencies are added. Reverse that, and you process outputs before inputs.
Helper operations
The rest of engine.py is boilerplate to make Python syntax work:
def __neg__(self): return self * -1
def __sub__(self, other): return self + (-other)
def __truediv__(self, other): return self * other**-1
def __radd__(self, other): return self + other
def __rmul__(self, other): return self * other
__radd__ and __rmul__ handle cases like 2 + Value(3). Python tries int.__add__ first, which fails, then falls back to Value.__radd__.
The neural network bits
Now nn.py. This builds on top of engine.py.
Neuron
class Neuron(Module):
def __init__(self, nin, nonlin=True):
self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
self.b = Value(0)
self.nonlin = nonlin
def __call__(self, x):
act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
return act.relu() if self.nonlin else act
A neuron does:
Weights are initialized randomly in . Not the best initialization scheme (Xavier/He initialization is better), but it works for this demo.
Layer
class Layer(Module):
def __init__(self, nin, nout, **kwargs):
self.neurons = [Neuron(nin, **kwargs) for _ in range(nout)]
def __call__(self, x):
out = [n(x) for n in self.neurons]
return out[0] if len(out) == 1 else out
A layer is just multiple neurons that all see the same input.
MLP
class MLP(Module):
def __init__(self, nin, nouts):
sz = [nin] + nouts
self.layers = [Layer(sz[i], sz[i+1], nonlin=i!=len(nouts)-1)
for i in range(len(nouts))]
def __call__(self, x):
for layer in self.layers:
x = layer(x)
return x
MLP(2, [16, 16, 1]) creates:
- Layer 1: 2 → 16 (with ReLU)
- Layer 2: 16 → 16 (with ReLU)
- Layer 3: 16 → 1 (no ReLU, we want raw scores)
The last layer being linear is important. For classification, we want unbounded scores that the loss function interprets.
Total parameters:
Training
The demo notebook trains on the “moons” dataset. Here’s the loss function:
# SVM hinge loss
losses = [(1 + -yi*scorei).relu() for yi, scorei in zip(yb, scores)]
data_loss = sum(losses) * (1.0 / len(losses))
# L2 regularization
alpha = 1e-4
reg_loss = alpha * sum((p*p for p in model.parameters()))
total_loss = data_loss + reg_loss
Hinge loss: where .
If the prediction has the right sign and is confident (magnitude > 1), loss is 0. Otherwise it’s penalized.
L2 regularization: Adds to prevent weights from exploding. The gradient is , which gently pulls weights toward zero.
The training loop:
for k in range(100):
total_loss, acc = loss()
model.zero_grad()
total_loss.backward()
learning_rate = 1.0 - 0.9*k/100
for p in model.parameters():
p.data -= learning_rate * p.grad
Standard SGD:
The learning rate decays from 1.0 to 0.1. Simple schedule, but it works here.
zero_grad() is necessary because gradients accumulate with +=. If you don’t reset them, you’re adding gradients from multiple backward passes.
Does it actually work?
The tests compare micrograd against PyTorch:
def test_sanity_check():
# Micrograd
x = Value(-4.0)
z = 2 * x + 2 + x
q = z.relu() + z * x
h = (z * z).relu()
y = h + q + q * x
y.backward()
xmg, ymg = x, y
# PyTorch
x = torch.Tensor([-4.0]).double()
x.requires_grad = True
z = 2 * x + 2 + x
q = z.relu() + z * x
h = (z * z).relu()
y = h + q + q * x
y.backward()
xpt, ypt = x, y
assert ymg.data == ypt.data.item()
assert xmg.grad == xpt.grad.item()
Same computation, same gradients. A 100-line implementation matches a production framework.
What’s missing for production
Micrograd operates on scalars. Real frameworks use tensors and run on GPUs. It also only has ~5 operations; PyTorch has hundreds. And the optimizer here is manual SGD. Production code uses Adam, which adapts learning rates automatically.
But the core algorithm is identical. The rest is engineering and optimization.
Resources
If you want to go deeper:
Karpathy’s video: The spelled-out intro to neural networks and backpropagation. Walks through building micrograd step by step.
3Blue1Brown: What is backpropagation really doing? and Backpropagation calculus. Great visual explanations.
The original paper: Rumelhart, Hinton, Williams (1986). Worth skimming for historical context.
Autodiff survey: Baydin et al. if you want the full academic treatment.
Wrapping up
The core of deep learning fits in 150 lines. Forward pass, build a graph, backward pass with chain rule, update weights. That’s it.
Everything else (tensors, GPUs, batch norm, transformers) is built on top of this. The scale changes, the fundamentals don’t.
Code snippets from karpathy/micrograd.