TL;DR: Training teaches a model to think. Inference runs it thousands of times per second under a GPU budget. The whole stack exists to fight one problem: during token-by-token generation, your GPU is mostly idle waiting for memory. Every optimization below is a different way of fixing that.
1. what inference actually is
You finished training. You have a 16 GB blob of weights sitting on disk. A user types a prompt. You need to:
- load the weights into GPU memory
- run the prompt through the model to get a probability distribution over the next token
- pick one token, append it, repeat until done
That last loop is the painful part. Each output token = another full pass through the model. A 100-token reply = 100 sequential forward passes. Token N+1 depends on token N, so within one request you can't parallelize across output tokens.
the iron triangle
Three things you trade off:
- Latency - how fast one user sees output
- Throughput - how many users per second per GPU
- Cost - $/million tokens
At batch=1, an H100 pushes ~200 tok/s from an 8B model. At batch=128: ~25,000 tok/s. Same GPU, same hourly rate. Per-token cost drops ~125x. But more batching = each user waits longer.
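To make the cost side of the triangle concrete, here is a rough sketch that turns throughput into dollars per million tokens. The hourly GPU price is a made-up placeholder, not a quote; the two throughput figures are the ones above.

```python
# Hypothetical hourly price for one GPU; substitute your own rate.
GPU_DOLLARS_PER_HOUR = 3.00


def dollars_per_million_tokens(tokens_per_second: float,
                               dollars_per_hour: float = GPU_DOLLARS_PER_HOUR) -> float:
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


cost_batch_1 = dollars_per_million_tokens(200)       # batch=1 figure from the text
cost_batch_128 = dollars_per_million_tokens(25_000)  # batch=128 figure from the text

print(f"batch=1:   ${cost_batch_1:.2f} per 1M tokens")
print(f"batch=128: ${cost_batch_128:.3f} per 1M tokens")
print(f"ratio:     {cost_batch_1 / cost_batch_128:.0f}x")
```

The ratio depends only on the two throughput numbers, so it holds whatever hourly rate you plug in.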
2. GPU 101 - the hardware vocab
A few terms, because every optimization below is just moving data between different kinds of memory.
VRAM and HBM - same thing, two names
Your CPU has RAM. The GPU has its own separate memory called VRAM. On AI GPUs, VRAM uses a technology called HBM (High Bandwidth Memory). "HBM" and "VRAM" mean the same thing in this post - the GPU's main memory pool.
- H100 has 80 GB of HBM
- Separate from CPU RAM. Data copies over PCIe (slow)
- HBM bandwidth: ~3 TB/sec on H100. This number bottlenecks decode.
SRAM - the GPU's tiny on-chip cache
Inside the GPU chip: tiny pools of much faster memory called SRAM (shared memory / L1 cache). Each compute unit (streaming multiprocessor, SM) has ~256 KB.
- ~6x faster than HBM (~19 TB/s vs ~3 TB/s)
- ~2400x smaller in total (~33 MB across the H100's 132 SMs vs 80 GB)
- FlashAttention's whole trick: do as much work as possible in SRAM before touching HBM
memory bandwidth - the number that matters
How fast data moves between memory and compute. On an H100:
- Compute: ~1000 TFLOPS
- HBM bandwidth: ~3 TB/sec
That ratio (~300 ops per byte) is everything. If your workload reads 1 byte and does 10 ops with it, you waste 290 op-cycles waiting for the next byte. That's decode. Welcome to memory-bound.
tensor cores
Specialized units that do small matrix multiplications fast. Busy during prefill (compute-bound), idle during decode (memory-bound).
GPU names you'll see
| Name | Generation | VRAM | Typical use |
|---|---|---|---|
| H100 / H200 | Hopper (NVIDIA) | 80 / 141 GB | Current flagship for inference |
| B200 | Blackwell | 192 GB | Newer, faster, expensive |
| A100 | Ampere | 40 / 80 GB | Previous gen, still common |
| RTX 4090 / 5090 | Consumer | 24 / 32 GB | Local / hobby inference |
HBM (VRAM) = the GPU's main memory. It's big but comparatively slow.
SRAM = the GPU's on-chip cache. It's tiny but fast.
Memory bandwidth = how fast data moves between memory and compute. When this is the bottleneck, your GPU sits idle.
3. prefill vs decode
Two phases, completely different bottlenecks. This split is the single most important mental model in this post.
prefill
Your prompt has 500 tokens. The model processes all 500 at once - one big matrix (prompt embeddings) times another big matrix (weights). Tensor cores light up. Every multiply-add unit does useful work. This is what GPUs are built for.
Compute-bound. Limited by FLOPS. Determines TTFT - time to first token. Long prompt = long prefill = user waits.
decode
Now you're generating. You have one token. You multiply its hidden state (a tiny vector) by the entire 16 GB of model weights to predict the next token.
The math is trivial - a vector-matrix multiply. The pain is memory: the GPU reads all 16 GB from HBM for that one tiny computation. Then does it again for the next token. And again.
Memory-bandwidth-bound. More compute doesn't help. Only reading less from memory helps. Determines TPOT - time per output token.
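A back-of-envelope sketch of the two bottlenecks. It assumes the H100 figures from section 2 (~1000 TFLOPS, ~3 TB/s), an 8B-parameter model at 2 bytes per weight (the 16 GB blob), and the common rule of thumb of ~2 FLOPs per parameter per token. These are ceilings; real systems land below them.

```python
PEAK_FLOPS = 1000e12        # ~1000 TFLOPS, from section 2
HBM_BYTES_PER_S = 3e12      # ~3 TB/s, from section 2
PARAMS = 8e9                # 8B-parameter model (assumption)
WEIGHT_BYTES = PARAMS * 2   # FP16/BF16 -> 16 GB


def prefill_seconds(prompt_tokens: int) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per token (rule of thumb)."""
    return 2 * PARAMS * prompt_tokens / PEAK_FLOPS


def decode_seconds_per_token() -> float:
    """Memory-bound estimate: every step streams all weights from HBM once."""
    return WEIGHT_BYTES / HBM_BYTES_PER_S


ttft_floor = prefill_seconds(500)
tpot_floor = decode_seconds_per_token()
print(f"prefill, 500 tokens: {ttft_floor * 1e3:.1f} ms")
print(f"decode, per token:   {tpot_floor * 1e3:.2f} ms "
      f"(~{1 / tpot_floor:.0f} tok/s at batch=1)")
```

Under these assumptions, processing 500 prompt tokens costs about as much wall-clock time as generating one or two output tokens, which is the whole prefill/decode asymmetry in two numbers.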
4. the metrics that matter
Four numbers. If you can't recite these about your deployment, you don't understand it.
TTFT - Time To First Token. Request arrival to first character visible. Driven by prefill + queuing. Anything > 1s feels laggy. Target < 500ms P99 for chat.
TPOT - Time Per Output Token. Gap between consecutive tokens during decode. Driven by HBM bandwidth + batch size. < 50ms feels smooth (~20 tok/s). Agent loops can tolerate < 100ms.
Goodput. Requests/second that meet your SLO. Raw throughput is misleading - 1000 req/s means nothing if half miss the latency target.
MFU - Model FLOPS Utilization. Fraction of peak FLOPS you use. Training hits 40-60%. Inference hits 10-30% - normal, because decode is memory-bound.
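If you log per-token timestamps yourself, the first three metrics fall out of a few lines. A minimal sketch, assuming you record the arrival time and the emit time of each output token per request; `RequestTiming` and the sample timings are invented for illustration, and the SLO defaults are the chat targets above.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestTiming:
    arrival: float             # seconds, when the request was received
    token_times: list[float]   # seconds, when each output token was emitted

    @property
    def ttft(self) -> float:
        return self.token_times[0] - self.arrival

    @property
    def tpot(self) -> float:
        # Mean gap between consecutive output tokens (needs >= 2 tokens).
        gaps = [b - a for a, b in zip(self.token_times, self.token_times[1:])]
        return sum(gaps) / len(gaps)


def p99(values: list[float]) -> float:
    # quantiles(n=100) returns 99 cut points; the last one is P99.
    return quantiles(values, n=100, method="inclusive")[-1]


def goodput(requests, window_s, ttft_slo=0.5, tpot_slo=0.05):
    """Requests per second that met BOTH latency targets."""
    ok = [r for r in requests if r.ttft <= ttft_slo and r.tpot <= tpot_slo]
    return len(ok) / window_s


# Synthetic timings for illustration only.
reqs = [
    RequestTiming(0.0, [0.30, 0.34, 0.38, 0.42]),   # meets both targets
    RequestTiming(0.0, [0.90, 0.94, 0.98, 1.02]),   # TTFT too slow
    RequestTiming(1.0, [1.20, 1.30, 1.40, 1.50]),   # TPOT too slow
    RequestTiming(1.0, [1.40, 1.43, 1.46, 1.49]),   # meets both targets
]
window = 2.0
print("throughput:", len(reqs) / window, "req/s")
print("goodput:   ", goodput(reqs, window), "req/s")
print("P99 TTFT:  ", round(p99([r.ttft for r in reqs]), 3), "s")
```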
# measuring with genai-perf (production load test)
$ genai-perf profile \
    --model Qwen/Qwen3-8B-AWQ \
    --endpoint-type chat \
    --url localhost:8000 \
    --concurrency 16 \
    --synthetic-input-tokens-mean 512 \
    --output-tokens-mean 128
# reports P50/P90/P99 for TTFT, TPOT, throughput, goodput
# run before AND after each optimization
5. the KV cache
The central data structure of LLM inference.
why it exists
To generate token N, attention looks at the K and V of every previous token (1 through N-1).
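Naively, you'd recompute K and V for every previous token at every step. Generating the 100th token = redoing work for tokens 1-99. 99% wasted compute. The fix: compute each token's K and V once, keep them in GPU memory, and reuse them at every later step. That stored K and V is the KV cache.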
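A toy sketch of the saving, counting how many K/V projections each strategy performs. `project_kv` is a hypothetical stand-in for the real projection matmuls, not a model.

```python
def project_kv(token_id: int) -> tuple[float, float]:
    """Stand-in for the real K and V projections (toy arithmetic)."""
    return (token_id * 0.5, token_id * 2.0)


def generate_naive(num_tokens: int) -> int:
    """Recompute K, V for every earlier token at every step. Returns projection count."""
    projections = 0
    for step in range(1, num_tokens + 1):
        _kv = [project_kv(t) for t in range(step)]  # redo all earlier tokens
        projections += step
    return projections


def generate_cached(num_tokens: int) -> int:
    """Compute K, V once per token and append to a cache. Returns projection count."""
    cache: list[tuple[float, float]] = []
    projections = 0
    for step in range(num_tokens):
        cache.append(project_kv(step))  # only the newest token is projected
        projections += 1
        # attention at this step would read every entry in `cache`
    return projections


naive, cached = generate_naive(100), generate_cached(100)
print(f"naive: {naive} projections, cached: {cached} projections")
```

The cached version trades compute for memory: the work per step stays constant, but `cache` grows by one entry per token.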
why it dominates memory
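The KV cache grows with every token. The formula:
$$\text{KV bytes} = 2 \times B \times L \times H_{kv} \times d \times N \times \text{bytes per element}$$
B=batch, L=layers, $H_{kv}$=KV heads, d=head dim, N=seq len. The 2 is for storing both K and V.
Plug in Llama 70B (L=80, $H_{kv}$=8, d=128) at FP16 (2 bytes per element), one user, 4K context:
$$2 \times 1 \times 80 \times 8 \times 128 \times 4096 \times 2\ \text{bytes} \approx 1.34\ \text{GB}$$
That's per user. Bump to batch=32: ~43 GB. Without GQA (64 KV heads instead of 8) the same batch would need ~344 GB. Llama 70B itself is only 140 GB.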
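The same arithmetic as a small calculator, so you can plug in your own model shape. The function name is made up; the Llama 70B shape is the one from the text, and the 64-head variant is included only to show how much the KV-head count matters.

```python
def kv_cache_bytes(batch, layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """2 (K and V) x batch x layers x KV heads x head dim x seq len x element size."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_elem


GB = 1e9
# Llama 70B shape from the text: 80 layers, 8 KV heads, head dim 128, FP16.
one_user = kv_cache_bytes(batch=1, layers=80, kv_heads=8, head_dim=128, seq_len=4096)
batch_32 = kv_cache_bytes(batch=32, layers=80, kv_heads=8, head_dim=128, seq_len=4096)
# Same shape if each of the model's 64 query heads kept its own K, V.
batch_32_mha = kv_cache_bytes(batch=32, layers=80, kv_heads=64, head_dim=128, seq_len=4096)

for label, n in [("1 user", one_user),
                 ("batch=32", batch_32),
                 ("batch=32, 64 KV heads", batch_32_mha)]:
    print(f"{label}: {n / GB:.2f} GB")
```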
how to fight it
Three levers, in order of frequency:
- PagedAttention - eliminates wasted reservation (next section)
- Prefix caching - reuse cache across requests with shared prefixes
- KV cache quantization - store K, V in INT8/FP8 instead of FP16. Halves memory. < 0.5% quality loss.
# turn on KV quantization in vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70B",
    kv_cache_dtype="fp8",  # half the KV memory
    gpu_memory_utilization=0.90,
)
MHA / MQA / GQA / MLA
You don't choose this - the model architect did. But it explains why two similar-sized models can have 10x different KV cache sizes:
| Variant | What it shares | Used by |
|---|---|---|
| MHA | Nothing - each head has its own K, V (original transformer) | GPT-2 |
| MQA | All heads share one K, V pair. Aggressive savings. | PaLM |
| GQA | Heads share K, V in groups of 4-8. Sweet spot. | Llama 2/3, Mistral, Qwen |
| MLA | Projects K, V into a tiny latent space, reconstructs at attention time. ~10x compression. | DeepSeek V2/V3 |
This is how Llama 3 70B and DeepSeek V3 (671B) can end up with comparable serving costs. MLA keeps DeepSeek's KV cache tiny, and MoE activates only ~37B of its parameters per token, despite the model being ~10x larger.
6. FlashAttention
This is why long-context inference works at all.
the problem
Standard attention computes S = Q x K^T. That's an N x N matrix per head. At N=4K in FP16: 32 MB. At N=128K: 32 GB - bigger than most GPUs.
The naive algorithm writes the full matrix to HBM, reads it back, applies softmax, writes again, reads again, multiplies by V. Four HBM round trips for a matrix that exists only to be immediately consumed.
the fix (Dao et al., 2022)
Never build the full matrix. Instead:
- Tile Q, K, V into blocks that fit in SRAM
- For each Q block, iterate over K, V blocks. Compute attention scores in SRAM, never writing the intermediate matrix to HBM
- Online softmax keeps running statistics so softmax computes correctly as blocks arrive. Exact result, not approximate
Automatically enabled in vLLM, SGLang, and PyTorch's scaled_dot_product_attention. You don't configure it.
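The online-softmax step is the non-obvious part, so here is a minimal pure-Python sketch for a single query row. It leaves out everything that makes the real kernel fast (SRAM tiling, tensor cores, the 1/sqrt(d) scaling, batching over Q blocks) and only shows that processing K, V block by block with a running max and running sum gives the same answer as the full-matrix version. Function names and the toy data are invented.

```python
import math


def attention_naive(q, keys, values):
    """Reference: materialize all scores, softmax, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_online(q, keys, values, block_size=2):
    """Process K, V in blocks, keeping only a running max, running sum, and output."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 0.5], [3.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

out_naive = attention_naive(q, keys, values)
out_online = attention_online(q, keys, values, block_size=2)
print(out_naive)
print(out_online)
```

The two outputs agree up to floating-point rounding, and the blockwise version never holds more than `block_size` scores at once.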
7. vLLM's superpowers
Together these four give 5-10x throughput over a naive HuggingFace .generate() loop. All configurable, and at scale you'll tune them.
7.1 - PagedAttention (OS virtual memory for the KV cache)
Naive KV cache allocation: reserve max possible context length per request. Model supports 128K context? Reserve gigabytes. Even if the user sends "hi".
If you've taken an OS class, you know this story. PagedAttention borrows the fix, virtual-memory paging:
- Split VRAM into fixed-size blocks (16 tokens each)
- Allocate on demand as the sequence grows
- Block table maps logical positions to scattered physical blocks (just like OS page tables)
- Copy-on-write: two requests share a system prompt? Same physical blocks. Fork only when they diverge.
This is vLLM's whole reason for existing. On by default. Explains why it packs 2-3x more requests per GPU than a naive setup.
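A toy version of the block table, to show how on-demand allocation and logical-to-physical translation fit together. This is a sketch of the idea only, not vLLM's implementation; the class and method names are invented, and copy-on-write sharing is left out.

```python
BLOCK_SIZE = 16  # tokens per block, as in the text


class PagedKVAllocator:
    """Toy block allocator: logical token positions -> scattered physical blocks."""

    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # first token, or the current block is full
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())
        self.num_tokens[request_id] = n + 1

    def locate(self, request_id: str, position: int) -> tuple[int, int]:
        """Translate a logical position into (physical block, offset in block)."""
        block = self.block_tables[request_id][position // BLOCK_SIZE]
        return block, position % BLOCK_SIZE

    def release(self, request_id: str) -> None:
        self.free.extend(self.block_tables.pop(request_id))
        del self.num_tokens[request_id]


alloc = PagedKVAllocator(num_physical_blocks=8)
for _ in range(2):    # a 2-token request takes 1 block, not a max-context reservation
    alloc.append_token("req-a")
for _ in range(40):   # 40 tokens -> 3 blocks
    alloc.append_token("req-b")

print(alloc.block_tables)
print(alloc.locate("req-b", 17))
```

Waste is bounded by one partially filled block per request, instead of the gap between actual and maximum context length.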
7.2 - continuous batching
Static batching: collect N requests, pad to same length, process as one batch, return together. Short responses sit idle waiting for the longest. Massive waste.
Continuous batching: at every decode step, check if any request finished. If so, evict it and slot in a new request from the queue. Batch stays full. GPU stays busy.
On by default. You don't configure it; you benefit from it.
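A small simulation makes the difference visible. It counts decode steps only and ignores prefill, so treat it as a sketch of the scheduling idea rather than a model of vLLM's scheduler; the request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """After every decode step, finished requests leave and queued ones join."""
    queue = deque(lengths)
    running = []  # remaining output tokens for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        running = [r - 1 for r in running]       # one decode step for the whole batch
        running = [r for r in running if r > 0]  # evict finished requests
        steps += 1
    return steps


# Output lengths (in tokens) for eight requests; made-up numbers.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With this workload the short requests no longer wait behind the long ones, and the second long request starts as soon as a slot frees up.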
7.3 - chunked prefill
Even with continuous batching: a long prompt enters the queue, its 500ms prefill stalls every decode-phase request in the batch. Prefill piracy. Latency spikes for everyone.
Chunked prefill: break long prefills into 512-token chunks, interleave decode steps between them. The long prefill still finishes. But no one else's TPOT spikes.
7.4 - prefix caching
Most production traffic shares prefixes: same system prompt, same few-shot examples, same RAG context. Without caching, every request recomputes KV for the shared prefix. Thousands of tokens of duplicate work.
Automatic prefix caching: store KV blocks keyed by token content. New requests with matching prefix reuse cached blocks, skip that portion of prefill. TTFT drops to near-zero for the shared part.
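Here is a minimal sketch of "keyed by token content". Each full block gets a key derived from all tokens up to and including that block, so a block only matches when the entire prefix before it matches too. The hashing scheme, block size, and class names are assumptions for illustration and differ from vLLM's internals.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; kept small so the example is readable


def block_keys(token_ids):
    """One key per FULL block; each key covers the whole prefix up to that block."""
    keys, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        h.update(repr(token_ids[i:i + BLOCK_SIZE]).encode())
        keys.append(h.copy().hexdigest())
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # key -> stand-in for a cached KV block

    def prefill(self, token_ids):
        """Return (tokens reused from cache, tokens that must be computed)."""
        reused = 0
        keys = block_keys(token_ids)
        for key in keys:
            if key not in self.blocks:
                break
            reused += BLOCK_SIZE
        for key in keys[reused // BLOCK_SIZE:]:
            self.blocks[key] = "kv-block"  # pretend we computed and stored it
        return reused, len(token_ids) - reused


system_prompt = list(range(8))  # 8 shared tokens = 2 full blocks
cache = PrefixCache()
first = cache.prefill(system_prompt + [100, 101, 102])
second = cache.prefill(system_prompt + [200, 201, 202, 203, 204])
print(first, second)
```

The second request skips prefill for the shared system prompt and only computes its own suffix.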
# turn on all four in vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B-AWQ",
    quantization="awq",
    enable_prefix_caching=True,   # 7.4 - reuse shared prefixes
    enable_chunked_prefill=True,  # 7.3 - no more prefill piracy
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    # PagedAttention (7.1) and continuous batching (7.2) are always on
)
Three flag flips. ~5x over raw .generate(). These are table stakes.
8. quantization
The single biggest lever you have.
why it works
Weights default to FP16/BF16 - 16 bits per number. A 70B model = 70 billion x 2 bytes = 140 GB.
Quantization: store those numbers in fewer bits. INT4 = 4 bits = 35 GB for the same model. 4x less memory, 4x less bandwidth consumed during decode.
Decode is memory-bound. Read 4x less from HBM = up to 4x faster generation. That's it.
how it works
The simplest version: symmetric (absmax) quantization. You have a tensor of FP16 weights. You want INT8.
- Find the largest absolute value in the tensor: $\alpha = \max_i |w_i|$
- Compute a scale factor: $s = \alpha / (2^{b-1} - 1)$, where b = target bits (8 for INT8, so you're mapping to [-127, 127])
- Quantize: $q_i = \mathrm{round}(w_i / s)$
- At inference, dequantize: $\hat{w}_i = s \cdot q_i$
The rounding introduces error, but it's tiny per-weight and averages out across billions of parameters.
Asymmetric (zero-point) quantization handles distributions that aren't centered around zero. It adds a zero-point offset: $q_i = \mathrm{round}(w_i / s) + z$, where s is the scale factor and z shifts the integer range to cover the actual value distribution. More accurate for skewed weights, slightly more compute at inference.
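Both schemes fit in a few lines of plain Python. This is a per-tensor sketch on a toy list of weights; the function names are invented, and the asymmetric version assumes an unsigned 8-bit range with the scale taken from (max - min), which is one common convention rather than the only one.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric quantization: scale by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize_absmax(q, scale):
    return [scale * v for v in q]


def quantize_zeropoint(weights, bits=8):
    """Asymmetric quantization: map [min, max] onto the unsigned range [0, 2^bits - 1]."""
    qmax = 2 ** bits - 1  # 255 for 8 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    # Clamp in case rounding pushes an endpoint one step out of range.
    q = [min(max(round(w / scale) + zero_point, 0), qmax) for w in weights]
    return q, scale, zero_point


def dequantize_zeropoint(q, scale, zero_point):
    return [scale * (v - zero_point) for v in q]


centered = [0.02, -0.51, 0.33, 1.27, -0.08]
q_sym, s_sym = quantize_absmax(centered)
restored_sym = dequantize_absmax(q_sym, s_sym)

skewed = [0.1, 0.4, 0.9, 1.5, 2.0]  # all positive: symmetric would waste half the range
q_asym, s_asym, z = quantize_zeropoint(skewed)
restored_asym = dequantize_zeropoint(q_asym, s_asym, z)

print(q_sym, s_sym)
print(q_asym, s_asym, z)
```

In both cases the round-trip error per weight is bounded by about half a quantization step, which is what "tiny per-weight" means in practice.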
the outlier problem
Symmetric quantization works well for vision models (ResNet quantizes to INT8 with almost no accuracy loss). Transformers are harder.
The problem: ~0.1% of activation dimensions produce values 10-100x larger than the rest. When you compute the scale factor, those outliers stretch the integer range. Every non-outlier value gets squeezed into a tiny band and loses precision. One extreme value ruins thousands of normal values.
This is why naive per-tensor INT8 fails for large language models and why the techniques below exist.
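The squeeze is easy to reproduce. The sketch below quantizes the same set of "normal" values twice with one per-tensor absmax scale, once on their own and once with a single 100x outlier in the tensor, and compares the round-trip error on the normal values. All numbers are synthetic.

```python
def absmax_roundtrip(values, bits=8):
    """Quantize with a single per-tensor scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


normal = [0.01 * i for i in range(-50, 51)]  # values in [-0.5, 0.5]
with_outlier = normal + [50.0]               # one value 100x larger

err_clean = mean_abs_error(normal, absmax_roundtrip(normal))
# Error on the SAME normal values once the outlier sets the scale.
err_squeezed = mean_abs_error(normal, absmax_roundtrip(with_outlier)[:-1])

print(f"without outlier: {err_clean:.5f}")
print(f"with outlier:    {err_squeezed:.5f}")
```

With the outlier present, every normal value collapses onto one of three integer levels (-1, 0, 1), so most of the 8 bits go unused.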
PTQ vs QAT
Two paths to a quantized model:
Post-Training Quantization (PTQ): train in FP16, quantize after. Fast, cheap, no retraining. This is what you'll use 99% of the time. The catch: needs careful handling of outliers.
Quantization-Aware Training (QAT): train with fake-quantized weights so the model learns to tolerate rounding error. More accurate at extreme compression (2-3 bit), but requires full training runs. Rarely worth it unless you're pushing below 4-bit.
techniques you'll see on HuggingFace
GPTQ - quantizes weights one column at a time, adjusting remaining weights to compensate for each rounding error. Uses the Hessian ($H = 2XX^\top$, where X is the layer's input activations) to figure out which weights are most sensitive. Gets 4-bit with minimal perplexity loss.
AWQ (Activation-Aware Weight Quantization) - instead of treating all weights equally, identifies which weights matter most by looking at activation magnitudes. Protects the ~1% of salient weights, quantizes the rest aggressively. Often slightly better than GPTQ at 4-bit.
SmoothQuant - solves the outlier problem for W8A8 (both weights AND activations in INT8). Migrates the quantization difficulty from activations to weights: divides activations by a per-channel scale, multiplies weights by the same scale. Activations become smooth (easy to quantize), weights absorb the variance (but weights are easier to quantize anyway).
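The SmoothQuant migration is just a rescaling that cancels out in the matmul. A toy sketch with plain lists: the per-channel scale uses the paper's form s_j = max|X_j|^alpha / max|W_j|^(1-alpha) with alpha = 0.5, and the matrices are invented. It shows the smoothing step only, not the INT8 quantization that follows.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Per input channel j: divide activations by s_j, multiply weight row j by s_j."""
    channels = len(w)
    scales = []
    for j in range(channels):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    x_s = [[row[j] / scales[j] for j in range(channels)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(channels)]
    return x_s, w_s, scales


# Toy activations (2 tokens x 3 channels); channel 1 is the outlier channel.
x = [[0.5, 80.0, -0.3],
     [-0.4, -95.0, 0.6]]
# Toy weights (3 input channels x 2 outputs).
w = [[0.2, -0.1],
     [0.05, 0.03],
     [-0.3, 0.4]]

x_s, w_s, scales = smooth(x, w)
y_before = matmul(x, w)
y_after = matmul(x_s, w_s)

print("max |activation| before:", max(abs(v) for row in x for v in row))
print("max |activation| after: ", max(abs(v) for row in x_s for v in row))
print(y_before, y_after)
```

The layer output is unchanged, but the activation range shrinks by well over an order of magnitude, so a single INT8 scale now covers it without crushing the normal channels.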
number formats you'll meet
| Format | Bits | When to use |
|---|---|---|
| FP32 | 32 | Training default. Don't use for inference. |
| BF16 | 16 | Inference default. Baseline. |
| FP8 | 8 | Hopper+ GPUs. Near-lossless. Zero effort. |
| INT8 (W8A8) | 8 | Weights AND activations in 8-bit. Faster prefill too. |
| INT4 (AWQ/GPTQ) | 4 | Biggest decode speedup. ~5% quality loss. |
| INT2/INT3 | <=3 | Model breaks. Research only. |
# FP8 if you have a Hopper GPU
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70B",
    quantization="fp8",    # on-the-fly, no separate model
    kv_cache_dtype="fp8",  # bonus: KV in FP8 too
)
9. when one GPU isn't enough
Llama 70B at FP16 = 140 GB. H100 = 80 GB. Doesn't fit. Even at 4-bit (35 GB), production KV cache pushes past 80 GB. You need multiple GPUs.
tensor parallelism (TP) - the one you'll use
Split each weight matrix across GPUs. Each GPU holds 1/N of every weight. At inference time, each computes its slice, then they exchange partial results via AllReduce (every GPU contributes its partial, ends up with the sum).
- AllReduce happens on every layer. Lots of communication.
- Needs NVLink (~900 GB/s, GPU-to-GPU within one box). InfiniBand across nodes is too slow (~50 GB/s).
- Max TP=8 because that's how many H100s fit in one NVLink domain.
If the model fits either way, TP=2 usually gives more throughput per GPU than TP=4: fewer GPUs take part in each AllReduce, so less time goes to communication.
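To see where the AllReduce comes from, here is a toy sharded matmul in plain Python. Each simulated GPU holds a slice of the weight matrix's rows plus the matching slice of the input, computes a partial output locally, and the partials are summed. This shows only the row-sharded case; real implementations pair it with column-sharded layers and use NCCL rather than a Python loop. All names are invented.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def all_reduce_sum(partials):
    """Stand-in for AllReduce: element-wise sum of every GPU's partial result."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


def tensor_parallel_matmul(x, w, num_gpus):
    """Shard W by rows (and x by matching columns); each 'GPU' holds 1/num_gpus of W."""
    k = len(w)
    assert k % num_gpus == 0, "inner dimension must divide evenly"
    shard = k // num_gpus
    partials = []
    for g in range(num_gpus):
        lo, hi = g * shard, (g + 1) * shard
        x_shard = [row[lo:hi] for row in x]
        w_shard = w[lo:hi]
        partials.append(matmul(x_shard, w_shard))  # local compute, no communication
    return all_reduce_sum(partials)                # one collective per sharded matmul


x = [[1.0, 2.0, 3.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]

y_single = matmul(x, w)
y_tp2 = tensor_parallel_matmul(x, w, num_gpus=2)
y_tp4 = tensor_parallel_matmul(x, w, num_gpus=4)
print(y_single, y_tp2, y_tp4)
```

The result is identical at any TP degree; what changes is how small each local matmul gets and how many participants every collective has to synchronize.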
# vLLM with tensor parallelism
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70B",
    tensor_parallel_size=4,  # split across 4 GPUs in the same node
    gpu_memory_utilization=0.90,
)
the other two (safe to skip)
Pipeline parallelism (PP) - split by layers. Point-to-point communication, works across nodes on InfiniBand. Catch: pipeline bubbles. Only matters for 400B+ models.
Expert parallelism (EP) - for MoE models. Distributes experts across GPUs via All-to-All. Skip unless deploying MoE.
Production combo: TP=8 within a node + PP=N across nodes. For 70B or smaller: just TP=2 or TP=4.
10. serving the model
vLLM is configured. Now your app talks to it. This is where SWE work lives.
the one-line server
vLLM ships an OpenAI-compatible HTTP server:
$ vllm serve Qwen/Qwen3-8B-AWQ \
    --quantization awq \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --port 8000
# now you have an OpenAI-compatible endpoint at localhost:8000
"OpenAI-compatible" = any client library that talks to the OpenAI API talks to your vLLM server unchanged. Switch base_url and you're done. Drop-in replacement.
streaming
Wait for the full response = user stares at a spinner for 5+ seconds. Stream tokens via SSE (Server-Sent Events) instead - server keeps the connection open, pushes tokens as they generate. Text appears word-by-word like ChatGPT.
The SDK handles this with stream=True:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="dummy", # vLLM doesn't check, but SDK requires non-empty
)
stream = client.chat.completions.create(
model="Qwen/Qwen3-8B-AWQ",
messages=[{"role": "user", "content": "explain CAP theorem"}],
stream=True,
max_tokens=512,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
tuning
Two knobs:
- --max-num-seqs - max concurrent requests in batch. Higher = more throughput + more KV memory + longer TPOT. Start at 64.
- --max-num-batched-tokens - total tokens per scheduler step. Caps prefill-decode packing.
scaling out
One vLLM process saturates one GPU (or TP group). More traffic = more replicas behind a load balancer. Use least-pending-requests routing, not round-robin - LLM requests have wildly varying durations.
- /health endpoint for the LB
- /metrics (Prometheus format, built-in)
- generous request timeouts (long context = long requests)
- auth via API gateway in front, never on the model server
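The least-pending-requests routing mentioned above is simple enough to sketch. This toy router only tracks in-flight counts per replica; a real load balancer would also handle health checks, timeouts, and concurrency. The class name and replica URLs are hypothetical.

```python
class LeastPendingRouter:
    """Send each request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self.pending = {url: 0 for url in replicas}

    def acquire(self) -> str:
        url = min(self.pending, key=self.pending.get)  # ties: first listed wins
        self.pending[url] += 1
        return url

    def release(self, url: str) -> None:
        self.pending[url] -= 1


# Hypothetical replica URLs.
router = LeastPendingRouter(["http://gpu-0:8000", "http://gpu-1:8000"])
a = router.acquire()  # a long request lands on gpu-0
b = router.acquire()  # next request goes to gpu-1
router.release(b)     # the gpu-1 request was short and finishes first
c = router.acquire()  # gpu-1 again; round-robin would have picked gpu-0
print(a, b, c)
```

Call `release` when the response stream closes, not when the first token arrives, or long generations will look free to the router.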
11. the deployment playbook
Latency bad. Throughput bad. Bill bad. Where to start? This order:
0. Profile first. torch.profiler or NSight Systems before touching anything.
1. Switch to vLLM/SGLang. If you're on raw .generate(), this alone is 5-10x.
2. Quantize. AWQ 4-bit or FP8. Biggest single jump.
3. Prefix caching + chunked prefill. Free wins for shared system prompts.
4. Right-size GPU count. Smallest TP that fits. TP=2 beats TP=4.
5. KV cache quantization. kv_cache_dtype="fp8" - doubles effective batch size.
6. Speculative decoding. 2-3x for predictable outputs. Skip for creative generation.
7. Disaggregate prefill/decode. Different GPU pools per phase. Worth it at 100+ GPUs.
12. watching the real ceiling
The playbook tells you what to turn on. This tells you when you've hit the wall.
Section 5 showed the KV cache eats VRAM. Here's the consequence in production: compute is almost never the limit, the KV cache is. On an 80GB A100 serving a 7B model at 4K context, the FLOPS budget covers ~141 concurrent requests. The KV cache runs out at ~13. That order-of-magnitude gap is the real ceiling. (source)
preemption: the cliff
vLLM admits requests until the KV cache fills. Push past 100% and it evicts running requests to make room: their KV cache gets dropped and recomputed when they resume, and latency roughly doubles. The sweet spot sits just below 100% cache utilization. Full enough to keep the batch busy, not so full that a burst tips you over.
Two metrics to watch (from /metrics):
- kv_cache_usage_pct - your headroom. Tune --max-num-seqs so this sits high but stable.
- num_preemptions - should hover near zero. A climbing count means you crossed the line and you're paying 2x latency for it.
One more, from the same teardown: across both phases the MLP block is ~71% of the compute. Hunting for a kernel to optimize? Start there, not attention.
- LLM inference throughput - the roofline and KV-ceiling math above, worked out.
- inside vLLM - the scheduler, paged attention, and prefix caching read at source level.
- fast matmul - why the GPU sits idle one level down: coalesced loads, SMEM tiling, tensor cores, wave quantization.