Flashcards for Reiner Pope on Dwarkesh Podcast

Wrote some practice problems to help myself and my audience retain Reiner's blackboard lecture.

00:00:00

How batch size affects token cost and speed

  • T = \max(t_{\text{compute}},\ t_{\text{mem}})

  • t_{\text{compute}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPs}}

    where B is batch size, N_{\text{active}} is active parameters, and FLOPs is the compute throughput of the hardware.

  • t_{\text{mem}} = \frac{N_{\text{total}} + B \cdot \text{len}_{\text{ctx}} \cdot \text{KV}_{\text{bytes/token}}}{\text{mem\_bw}}

  • Latency vs. batch size (diagram)

  • Latency has a floor even at small batch sizes, because you still have to load all the active parameters into memory.

  • Compute time, and the memory time for KV-cache fetches, both scale with batch size, so they cannot be amortized the way weight fetches can.

  • \sim 300 FLOPs/byte: the hardware's ratio of compute throughput to memory bandwidth (\text{FLOPs} / \text{mem\_bw}).

  • Set compute time = memory time (at equality, both resources are fully saturated):

    \frac{B \cdot N_{\text{active}}}{\text{FLOPs}} = \frac{N_{\text{total}}}{\text{mem\_bw}}

    Solve for B:

    B = \frac{\text{FLOPs}}{\text{mem\_bw}} \cdot \frac{N_{\text{total}}}{N_{\text{active}}} = 300 \cdot \frac{1}{\text{sparsity}}

    So B \geq 300 / \text{sparsity}.

    Why: compute scales with B (each token needs its own matmul), but weight fetches don't (load once, reuse across batch). Need enough tokens to amortize the fetch.

    DeepSeek V3: 32/256 active → B \geq 300 \times 8 = 2{,}400.
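The critical-batch-size arithmetic can be checked with a short script. The hardware numbers are illustrative assumptions chosen to match the lecture's ~300 FLOPs/byte ratio, and the helper name is mine:

```python
# Critical batch size where compute time equals weight-fetch time.
# Hardware numbers are illustrative assumptions, not spec values.

def critical_batch_size(flops: float, mem_bw: float, sparsity: float) -> float:
    # B * N_active / FLOPs = N_total / mem_bw
    # => B = (FLOPs / mem_bw) * (N_total / N_active) = (FLOPs / mem_bw) / sparsity
    return (flops / mem_bw) / sparsity

FLOPS = 1e15           # assumed accelerator compute throughput, FLOP/s
MEM_BW = FLOPS / 300   # bandwidth implied by ~300 FLOPs/byte

# Dense model: sparsity = 1, so the bound is just the hardware ratio.
print(round(critical_batch_size(FLOPS, MEM_BW, 1.0)))       # -> 300
# DeepSeek V3-style sparsity: 32/256 of parameters active per token.
print(round(critical_batch_size(FLOPS, MEM_BW, 32 / 256)))  # -> 2400
```

Note that the absolute FLOPs and bandwidth cancel out: only their ratio and the sparsity matter.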

  • 20ms is the HBM drain time: memory capacity ÷ memory bandwidth. E.g. Rubin: 288\text{ GB} / 20\text{ TB/s} \approx 15\text{ ms}.

    Faster than 20ms is impossible because you physically can't read all the weights from HBM in less time than bandwidth allows.

    Slower than 20ms means you're just leaving the FLOPs idle, because there's nothing left to read.
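The drain-time bound is just capacity over bandwidth; a one-liner reproduces the Rubin example (the helper name is mine):

```python
# HBM drain time: the time to stream every byte of HBM through the
# memory bus once. 288 GB / 20 TB/s are the Rubin figures quoted above.

def drain_time_ms(capacity_gb: float, bandwidth_tb_s: float) -> float:
    # GB divided by (TB/s) comes out directly in milliseconds,
    # since 1e9 / 1e12 s = 1e-3 s.
    return capacity_gb / bandwidth_tb_s

print(drain_time_ms(288, 20))  # -> 14.4
```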

00:32:09

How MoE models are laid out across GPU racks

  • MoE communication is all-to-all (any GPU's tokens may route to any other GPU's experts).

    Within a rack, NVLink connects every GPU to every other at full bandwidth, which is a perfect fit for all-to-all. Across racks, scale-out is \sim 8\times slower and bottlenecks the all-to-all.
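A toy cost model shows why the slower cross-rack link bottlenecks all-to-all. The payload size and absolute bandwidths below are assumptions for illustration; only the ~8x in-rack vs. cross-rack gap comes from the lecture:

```python
# Toy all-to-all timing: each GPU must ship its routed tokens to peer
# GPUs, so wall-clock time ≈ payload / link bandwidth per GPU.

def all_to_all_ms(payload_bytes: float, link_bw_gb_s: float) -> float:
    return payload_bytes / (link_bw_gb_s * 1e9) * 1e3

PAYLOAD = 64e6                         # 64 MB of routed activations (assumed)
in_rack = all_to_all_ms(PAYLOAD, 400)  # NVLink-class bandwidth (assumed)
cross_rack = all_to_all_ms(PAYLOAD, 50)  # scale-out link, ~8x slower (assumed)

print(f"in-rack {in_rack:.2f} ms vs cross-rack {cross_rack:.2f} ms")
```

Since the payload is fixed by the model, the ~8x bandwidth gap translates directly into ~8x longer all-to-all time across racks.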

00:47:12

How pipeline parallelism moves model layers across racks

  • At the beginning of the batch, the GPUs dedicated to the final layers are not being used, and conversely at the end of the batch, the GPUs dedicated to the first layers are not being used.

    Pipeline bubbles diagram

  • You need to consolidate gradients and update the model before you process the next batch.

  • Keeping PP stages busy requires PP micro-batches in flight, so concurrent sequences scale with PP.

    Since each in-flight sequence carries its own KV cache, and KV cache often dominates memory at long context lengths, pipelining's value is limited.
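The bubble fraction behind this can be sketched with the standard pipeline-utilization formula (a simplification; real schedules such as 1F1B differ in detail):

```python
# Pipeline-bubble sketch: with P stages and M micro-batches per step, a
# stage does useful work in M of the M + P - 1 pipeline slots, so
# utilization is M / (M + P - 1). Near-full utilization needs M >= P
# micro-batches in flight, each a concurrent sequence with its own KV cache.

def pipeline_utilization(stages: int, micro_batches: int) -> float:
    return micro_batches / (micro_batches + stages - 1)

for m in (1, 4, 8, 32):
    print(f"P=8 stages, M={m}: {pipeline_utilization(8, m):.1%} busy")
```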

01:03:37

Why Ilya said, "As we now know, pipelining is not wise."

  • You're adding architecture constraints — things like Kimi's attention-to-residuals (where each block attends to all previous layers' residuals) become very difficult when those residuals live on different pipeline stages. Similarly, interleaving sliding-window and global attention layers could cause load imbalance across stages. Dealing with all this slows down research iteration, which is the greatest sin you can commit.

01:18:59

Because of RL, models may be 100× over-trained beyond Chinchilla-optimal

  • 2 FLOPs per parameter per token for the forward pass (multiply + add). Backward pass is 2\times forward because you compute gradients w.r.t. both input matrices. So 2 + 4 = 6.

  • C_{\text{total}} = C_{\text{pretrain}} + C_{\text{RL}} + C_{\text{inference}}

    C_{\text{pretrain}} = 6 \times N_{\text{active}} \times D_{\text{pretrain}} (the 6ND formula: forward + backward)

    C_{\text{RL}} = (2 \text{ to } 6) \times N_{\text{active}} \times D_{\text{RL}} \times \text{inefficiency} (2 if you don't train on the rollout and do forward only, up to 6 if you do; inefficiency from low MFU during decode)

    C_{\text{inference}} = 2 \times N_{\text{active}} \times D_{\text{inference}} \times \text{inefficiency} (forward pass only; lower MFU during decode)

  • If pre-training, RL, and inference costs trade off (more pre-training → less RL/inference needed for same quality, and vice versa), the optimum is approximately where all three are equal.

  • Taking the RL coefficient as 3 and inefficiency \approx 3: 6 \times D_{\text{pretrain}} = 3 \times D_{\text{RL}} \times 3 = 2 \times D_{\text{inference}} \times 3

    D_{\text{pretrain}} = 1.5\, D_{\text{RL}} = D_{\text{inference}}

  • D_{\text{inference}} \approx 50\text{M tokens/sec} \times 60\text{ days} \times 86{,}400\text{ sec/day} \approx 200\text{T tokens}

    D_{\text{pretrain}} \approx D_{\text{inference}} \approx 200\text{T tokens}

  • D_{\text{chinchilla}} \approx 20 \times 100\text{B} = 2\text{T tokens}

    200\text{T} / 2\text{T} = 100\times
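The token arithmetic, redone exactly: the raw product is ~260T tokens, which the lecture rounds down to ~200T and a ~100x over-training factor.

```python
# Back-of-envelope from the lecture: fleet-wide inference demand over a
# model's lifetime vs. the Chinchilla-optimal token budget for a model
# with ~100B active parameters.

SECONDS_PER_DAY = 86_400
d_inference = 50e6 * 60 * SECONDS_PER_DAY  # 50M tok/s for 60 days ≈ 2.6e14
d_pretrain = d_inference                   # from D_pretrain ≈ D_inference
d_chinchilla = 20 * 100e9                  # 20 tokens/param * 100B = 2T

print(f"D_inference ≈ {d_inference / 1e12:.0f}T tokens")
print(f"over-training ≈ {d_pretrain / d_chinchilla:.0f}x Chinchilla")
```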

01:33:02

Deducing inference memory costs from API pricing

  • Below this point, you're compute bound, and cost is flat as context length increases.

    Above this point, you're memory-bound on the growing KV cache, and cost increases linearly with context length.

  • Cost vs. context length (diagram)

  • At the crossover, t_{\text{compute}} = t_{\text{KV fetch}}:

    \frac{B \cdot N_{\text{active}}}{\text{FLOPs}} = \frac{B \cdot \text{len}_{\text{ctx}} \cdot \text{bytes/token}}{\text{mem\_bw}}

    Solve for bytes/token:

    \text{bytes/token} = \frac{\text{mem\_bw}}{\text{FLOPs}} \cdot \frac{N_{\text{active}}}{\text{len}_{\text{ctx}}} = \frac{1}{300} \cdot \frac{N_{\text{active}}}{\text{len}_{\text{ctx}}}

    Plug in: N_{\text{active}} \approx 100\text{B}, \text{len}_{\text{ctx}} = 200\text{K} → \text{bytes/token} \approx 1.7\text{ KB}.
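The crossover calculation as a sketch (the helper name is mine; 300 FLOPs/byte is the hardware ratio used throughout):

```python
# Crossover KV-cache size per token, from t_compute = t_KV-fetch.
# Below this many bytes/token, decode is compute-bound; above it, KV
# fetches dominate. Uses the lecture's N_active ≈ 100B, 200K context.

def crossover_bytes_per_token(n_active: float, ctx_len: float,
                              flops_per_byte: float = 300) -> float:
    # bytes/token = (mem_bw / FLOPs) * (N_active / len_ctx)
    return n_active / (flops_per_byte * ctx_len)

kb = crossover_bytes_per_token(100e9, 200e3) / 1e3
print(f"{kb:.1f} KB/token")  # -> 1.7 KB/token
```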

  • MFU during decode is about \tfrac{1}{5} of that during prefill.

    This is because in prefill you process the whole sequence in parallel, so the weight fetch is amortized across lots of compute. In decode, you have to load all the weights just to process one more token, so FLOPs sit idle while you wait for the weights to arrive from memory.

  • Loading KVs from memory is much cheaper than recomputing.
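The prefill/decode asymmetry above comes down to amortization: per byte of weights fetched, useful FLOPs scale with the number of tokens sharing that fetch. A minimal sketch, with an illustrative chunk size:

```python
# 2 FLOPs (multiply + add) per parameter per token; one fetch of a
# parameter is shared by every token in flight, so arithmetic intensity
# scales linearly with tokens per fetch.

def flops_per_weight_fetch(tokens_in_flight: int) -> int:
    return 2 * tokens_in_flight  # FLOPs done per parameter loaded

prefill = flops_per_weight_fetch(2048)  # whole prompt chunk at once (assumed size)
decode = flops_per_weight_fetch(1)      # one new token per step
print(prefill // decode)  # -> 2048
```

In practice decode regains some of this via batching across requests, which is why the observed gap is only ~5x rather than thousands.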

02:04:02

Convergent evolution between neural nets and cryptography

  • They've both had a convergent evolution: cryptographic protocols need every output bit to depend on every input bit in complicated ways, and similarly, NNs need every output to be able to draw on every input.

  • Cryptographic protocols take something which has a lot of structure and make it seem indistinguishable from random. Whereas NNs take something which may look random and extract structure from it.