Inference engineering

1. what inference actually is

You finished training. You have a 16 GB blob of weights sitting on disk. A user types a prompt. You need to:

load the weights into GPU memory
run the prompt through the model to get a probability distribution over the next token
pick one token, append it, repeat until done

That last loop is the painful part. Each output token = another full pass through the model. A 100-token reply = 100 sequential forward passes. Token N+1 depends on token N, so you can't parallelize this.
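The loop can be sketched in a few lines. This is a toy, not a real model: `toy_logits` is a hypothetical stand-in for a full forward pass, and the vocabulary size and EOS id are made up for illustration.

```python
import random

VOCAB_SIZE = 50   # assumption: tiny made-up vocabulary
EOS_ID = 0        # assumption: token id that ends generation

def toy_logits(tokens):
    """Hypothetical stand-in for one full forward pass over the model."""
    rng = random.Random(sum(tokens) + len(tokens))  # deterministic per context
    return [rng.random() for _ in range(VOCAB_SIZE)]

def generate(prompt_tokens, max_new_tokens=100):
    tokens = list(prompt_tokens)
    passes = 0
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)          # one forward pass per output token
        passes += 1
        next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)  # greedy pick
        tokens.append(next_id)               # token N+1 needs token N in context
        if next_id == EOS_ID:
            break
    return tokens, passes

out, n_passes = generate([5, 17, 3], max_new_tokens=20)
print(len(out) - 3, "new tokens took", n_passes, "forward passes")
```

Each new token needs the previous one in `tokens` before its forward pass can start, which is why the loop can't be unrolled across output positions.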

analogy: The model is a 16 GB lookup table. To answer one question, the GPU reads every entry. To answer the next question, it reads every entry again. Thousands of cores sit idle waiting for the next chunk of data to arrive from memory. Inference engineering = keeping those cores busy.

the iron triangle

Three things you trade off:

Latency - how fast one user sees output
Throughput - how many users per second per GPU
Cost - $/million tokens

At batch=1, an H100 pushes ~200 tok/s from an 8B model. At batch=128: ~25,000 tok/s. Same GPU, same hourly rate. Per-token cost drops ~125x. But more batching = each user waits longer.
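To see the cost side of the trade-off, here is the arithmetic as a small sketch. The throughput figures come from the paragraph above; the hourly GPU rate is a made-up assumption, so only the ratio is meaningful.

```python
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

GPU_RATE = 3.00  # assumption: hypothetical $/hour for one H100

batch_1 = cost_per_million_tokens(GPU_RATE, 200)       # ~200 tok/s at batch=1
batch_128 = cost_per_million_tokens(GPU_RATE, 25_000)  # ~25,000 tok/s at batch=128

print(f"batch=1:   ${batch_1:.2f} per million tokens")
print(f"batch=128: ${batch_128:.3f} per million tokens")
print(f"ratio: {batch_1 / batch_128:.0f}x")
```

The hourly rate cancels out of the ratio, so the saving depends only on how much throughput batching buys you.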

The serving game: pack as many concurrent requests onto each GPU as possible without blowing latency past your SLO (e.g., "P99 TTFT under 500ms"). Every technique below either lets you pack more requests, or makes each one faster. That is the whole idea.

2. GPU 101 - the hardware vocab

A few terms, because every optimization below is just moving data between different kinds of memory.

VRAM and HBM - same thing, two names

Your CPU has RAM. The GPU has its own separate memory called VRAM. On AI GPUs, VRAM uses a technology called HBM (High Bandwidth Memory). "HBM" and "VRAM" mean the same thing in this post - the GPU's main memory pool.

H100 has 80 GB of HBM
Separate from CPU RAM. Data copies over PCIe (slow)
HBM bandwidth: ~3 TB/sec on H100. This number bottlenecks decode.

SRAM - the GPU's tiny on-chip cache

Inside the GPU chip: tiny pools of much faster memory called SRAM (shared memory / L1 cache). Each compute unit (streaming multiprocessor, SM) has ~256 KB.

~6x faster than HBM (~19 TB/s vs ~3 TB/s)
~2400x smaller in total (~33 MB across the H100's 132 SMs vs 80 GB)
FlashAttention's whole trick: do as much work as possible in SRAM before touching HBM

Data flows up this hierarchy: HBM to SRAM to the compute units. In decode, the bottleneck is HBM bandwidth, not compute.

memory bandwidth - the number that matters

How fast data moves between memory and compute. On an H100:

Compute: ~1000 TFLOPS
HBM bandwidth: ~3 TB/sec

That ratio (~300 ops per byte) is everything. If your workload reads 1 byte and does 10 ops with it, you waste 290 op-cycles waiting for the next byte. That's decode. Welcome to memory-bound.
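The ratio is the "ridge point" of a roofline model. A minimal sketch, using the rounded H100 figures above (so the result is approximate):

```python
PEAK_FLOPS = 1000e12      # ~1000 TFLOPS (figure from the text)
HBM_BANDWIDTH = 3e12      # ~3 TB/s (figure from the text)

ridge = PEAK_FLOPS / HBM_BANDWIDTH   # ops the GPU can do per byte it reads

def utilization(ops_per_byte):
    """Fraction of peak compute a kernel can reach at this arithmetic intensity."""
    return min(1.0, ops_per_byte / ridge)

print(f"ridge point: ~{ridge:.0f} ops/byte")
print(f"10 ops/byte  -> {utilization(10):.1%} of peak compute")
print(f"500 ops/byte -> {utilization(500):.0%} of peak compute")
```

Anything below the ridge is memory-bound: the cores can only go as fast as bytes arrive.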

tensor cores

Specialized units that do small matrix multiplications fast. Busy during prefill (compute-bound), idle during decode (memory-bound).

GPU names you'll see

Name	Generation	VRAM	Typical use
H100 / H200	Hopper (NVIDIA)	80 / 141 GB	Current flagship for inference
B200	Blackwell	192 GB	Newer, faster, expensive
A100	Ampere	40 / 80 GB	Previous gen, still common
RTX 4090 / 5090	Consumer	24 / 32 GB	Local / hobby inference

HBM = the GPU's main memory. It's big but slow.
SRAM = the GPU's on-chip cache. It's tiny but fast.
Memory bandwidth = how fast data moves between them. When this is the bottleneck, your GPU sits idle.

3. prefill vs decode

Two phases, completely different bottlenecks.

Prefill: matrix-matrix multiply (compute-bound). Decode: vector-matrix multiply (memory-bound).

prefill

Your prompt has 500 tokens. The model processes all 500 at once - one big matrix (prompt embeddings) times another big matrix (weights). Tensor cores light up. Every multiply-add unit does useful work. This is what GPUs are built for.

Compute-bound. Limited by FLOPS. Determines TTFT - time to first token. Long prompt = long prefill = user waits.

decode

Now you're generating. You have one token. You multiply its hidden state (a tiny vector) by the entire 16 GB of model weights to predict the next token.

The math is trivial - a vector-matrix multiply. The pain is memory: the GPU reads all 16 GB from HBM for that one tiny computation. Then does it again for the next token. And again.

why memory-bound? The ~300 ops-per-byte ratio from section 2. Decode does far fewer than 300 ops per byte read. The GPU sits idle waiting for memory. That is the entire problem.

Memory-bandwidth-bound. More compute doesn't help. Only reading less from memory helps. Determines TPOT - time per output token.
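A back-of-the-envelope ceiling for decode speed follows directly: if every step has to stream all the weights from HBM once, bandwidth divided by model size bounds steps per second. This sketch uses the rounded numbers from earlier sections and ignores KV cache reads, so treat it as an upper bound only.

```python
HBM_BANDWIDTH = 3e12   # ~3 TB/s on H100 (from section 2)
WEIGHT_BYTES = 16e9    # the 16 GB model from section 1

def decode_steps_per_second(weight_bytes, bandwidth):
    """Upper bound: every decode step must stream all weights from HBM once."""
    return bandwidth / weight_bytes

def decode_tokens_per_second(batch_size):
    # One weight read serves the whole batch, so tokens/s scales with batch
    # size (ignoring KV cache reads and the point where compute takes over).
    return batch_size * decode_steps_per_second(WEIGHT_BYTES, HBM_BANDWIDTH)

print(f"batch=1 ceiling: ~{decode_tokens_per_second(1):.0f} tok/s")
print(f"batch=8 ceiling: ~{decode_tokens_per_second(8):.0f} tok/s")
```

The batch=1 ceiling lands close to the ~200 tok/s quoted in section 1, and it shows why batching helps decode: the same weight read is shared by every request in the batch.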

the central insight: Prefill and decode have different bottlenecks. What helps prefill (more FLOPS) doesn't help decode. What helps decode (less memory traffic, bigger batches) doesn't help prefill. Production systems treat them as separate problems.

4. the metrics that matter

Four numbers. If you can't recite these about your deployment, you don't understand it.

TTFT - Time To First Token. Request arrival to first character visible. Driven by prefill + queuing. Anything > 1s feels laggy. Target < 500ms P99 for chat.

TPOT - Time Per Output Token. Gap between consecutive tokens during decode. Driven by HBM bandwidth + batch size. < 50ms feels smooth (~20 tok/s). Agent loops can tolerate < 100ms.

Goodput. Requests/second that meet your SLO. Raw throughput is misleading - 1000 req/s means nothing if half miss the latency target.

MFU - Model FLOPS Utilization. Fraction of peak FLOPS you use. Training hits 40-60%. Inference hits 10-30% - normal, because decode is memory-bound.

averages lie. One slow request in ten makes an app feel broken. Track P50, P95, P99. Never the mean. P99 TTFT in a chat app with a million DAU = 10,000 unhappy users.
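A quick illustration of why the mean hides the tail. The latency samples are invented, and `percentile` is a simple nearest-rank helper written for this example, not a library function.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up TTFT samples in ms: 90 fast requests, 10 slow ones.
ttft_ms = [200] * 90 + [3000] * 10

mean = sum(ttft_ms) / len(ttft_ms)
print(f"mean: {mean:.0f} ms")
for p in (50, 95, 99):
    print(f"P{p}:  {percentile(ttft_ms, p)} ms")
```

The mean comes out at 480 ms, under a 500 ms target, while one request in ten takes three seconds. Only the P95 and P99 lines show that.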

# measuring with genai-perf (production load test)
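$ genai-perf profile \
    --model Qwen/Qwen3-8B-AWQ \
    --endpoint-type chat \
    --url localhost:8000 \
    --streaming \
    --concurrency 16 \
    --synthetic-input-tokens-mean 512 \
    --output-tokens-mean 128

# --streaming is needed for per-token timings like TTFT
# flag names have changed between genai-perf releases; confirm with --help
# reports latency percentiles for TTFT and TPOT (inter-token latency), plus throughput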
# run before AND after each optimization

5. the KV cache

The central data structure of LLM inference.

why it exists

attention, briefly: Transformers work via attention. For every token, the model computes three vectors - Query (Q), Key (K), and Value (V). Don't worry about the math. What matters:

To generate token N, attention looks at the K and V of every previous token (1 through N-1).

Naively, you'd recompute K and V for every previous token at every step. Generating the 100th token = redoing work for tokens 1-99. 99% wasted compute.

the fix: K and V for a token never change once computed. Compute them once, cache them in HBM. That cache - one entry per token, per layer, per attention head - is the KV cache. Just memoization.
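The memoization can be checked on a toy example: one attention head, one layer, random weights (all assumptions for illustration, using NumPy). The cached path computes K and V only for the newest token and still matches the recompute-everything path.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    scores = K @ q / np.sqrt(d)         # one score per token seen so far
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def step_naive(xs):
    """Recompute K and V for every token seen so far, at every step."""
    K = np.stack([x @ Wk for x in xs])
    V = np.stack([x @ Wv for x in xs])
    return attend(xs[-1] @ Wq, K, V)

def step_cached(x, cache):
    """Compute K and V for the new token only and append them to the cache."""
    cache["K"].append(x @ Wk)
    cache["V"].append(x @ Wv)
    return attend(x @ Wq, np.stack(cache["K"]), np.stack(cache["V"]))

tokens = [rng.standard_normal(d) for _ in range(6)]
cache = {"K": [], "V": []}
for i, x in enumerate(tokens):
    naive = step_naive(tokens[: i + 1])
    cached = step_cached(x, cache)
    assert np.allclose(naive, cached)   # same output, far less recomputation
print("cache holds", len(cache["K"]), "K/V entries")
```

A real model keeps one such cache per layer and per KV head, which is where the memory cost comes from.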

why it dominates memory

The KV cache grows with every token. The formula:

$\text{KV size} = 2 \times B \times L \times H_{kv} \times d \times N \times \text{bytes}$

B=batch, L=layers, $H_{kv}$ =KV heads, d=head dim, N=seq len. The 2 is for storing both K and V.

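Plug in Llama 70B (L=80, $H_{kv}$ =8, d=128) at FP16, one user, 4K context:

$2 \times 1 \times 80 \times 8 \times 128 \times 4096 \times 2 \approx \textbf{1.34 GB}$

That's per user, and only because GQA cuts the KV heads from 64 to 8; with full multi-head attention it would be 10.7 GB. Bump to batch=32: 43 GB. Stretch those 32 users to 32K context: 344 GB. Llama 70B itself is only 140 GB.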
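The formula is easy to turn into a helper for plugging in your own model shape. A minimal sketch; the function name and the decimal-GB convention are choices made for this example.

```python
def kv_cache_bytes(batch, layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # The leading 2 accounts for storing both K and V.
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_value

GB = 1e9
llama_70b = dict(layers=80, kv_heads=8, head_dim=128)

one_user = kv_cache_bytes(batch=1, seq_len=4096, **llama_70b)
batch_32 = kv_cache_bytes(batch=32, seq_len=4096, **llama_70b)
print(f"1 user,   4K context: {one_user / GB:.2f} GB")
print(f"batch=32, 4K context: {batch_32 / GB:.1f} GB")
```

Every term is a plain multiplier, so doubling context length or batch size doubles the cache.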

The KV cache, not the model weights, is the memory bottleneck. Every "fit more requests on this GPU" technique is really just "make the KV cache smaller."

how to fight it

Three levers, in order of frequency:

PagedAttention - eliminates wasted reservation (next section)
Prefix caching - reuse cache across requests with shared prefixes
KV cache quantization - store K, V in INT8/FP8 instead of FP16. Halves memory. < 0.5% quality loss.

# turn on KV quantization in vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",   # Hugging Face repo id
    kv_cache_dtype="fp8",    # half the KV memory
    gpu_memory_utilization=0.90,
)

MHA / MQA / GQA / MLA

You don't choose this - the model architect did. But it explains why two similar-sized models can have 10x different KV cache sizes:

Variant	What it shares	Used by
MHA	Nothing - each head has its own K, V (original transformer)	GPT-2
MQA	All heads share one K, V pair. Aggressive savings.	PaLM
GQA	Heads share K, V in groups of 4-8. Sweet spot.	Llama 2/3, Mistral, Qwen
MLA	Projects K, V into a tiny latent space, reconstructs at attention time. ~10x compression.	DeepSeek V2/V3

This is why Llama 3 70B and DeepSeek V3 (671B) can have similar serving costs. MLA keeps DeepSeek's KV cache tiny, and MoE activates only ~37B of its parameters per token, despite the model being ~10x larger.
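To put rough numbers on the first three rows of the table, this sketch holds the model shape fixed and varies only the number of KV heads. The shape (80 layers, 64 query heads, head dim 128) is an assumption modeled on a Llama-70B-like architecture; MLA is left out because its compression doesn't reduce to a head count.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V

LAYERS, QUERY_HEADS, HEAD_DIM = 80, 64, 128  # assumption: a Llama-70B-like shape

variants = {
    "MHA": QUERY_HEADS,        # every query head has its own K, V
    "GQA": QUERY_HEADS // 8,   # 8 query heads share one K, V head
    "MQA": 1,                  # all query heads share a single K, V head
}
per_token = {name: kv_bytes_per_token(LAYERS, h, HEAD_DIM) for name, h in variants.items()}
for name, size in per_token.items():
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Same parameter count, same context length, and the cache per token differs by 8x between MHA and GQA.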

6. FlashAttention

This is why long-context inference works at all.

the problem

Standard attention computes S = Q x K^T. That's an N x N matrix per head. At N=4K in FP16: 32 MB. At N=128K: 32 GB - bigger than most GPUs' entire memory.

The naive algorithm writes the full matrix to HBM, reads it back, applies softmax, writes again, reads again, multiplies by V. Four HBM round trips for a matrix that exists only to be immediately consumed.

Left: standard attention materializes N x N in HBM. Right: FlashAttention tiles it in SRAM, block by block.

the fix (Dao et al., 2022)

Never build the full matrix. Instead:

Tile Q, K, V into blocks that fit in SRAM
For each Q block, iterate over K, V blocks. Compute attention scores in SRAM, never writing the intermediate matrix to HBM
Online softmax keeps running statistics so softmax computes correctly as blocks arrive. Exact result, not approximate

net effect: O(N^2) memory drops to O(N). Same exact output. Way fewer memory trips. At long context, this is "works" vs "OOM crash."
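The online-softmax idea can be shown in plain NumPy. This is a sketch of the math only: it tiles over K and V (not Q), runs on the CPU, and says nothing about SRAM or kernel fusion, which is where the real speedup comes from. Function names and the block size are choices made for this example.

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])           # full N x N score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Process K, V in blocks, keeping only running softmax statistics."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)           # running max of scores per query
    row_sum = np.zeros((n, 1))                   # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                # only an N x block tile exists
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))
        rescale = np.exp(row_max - new_max)      # correct what earlier blocks added
        P = np.exp(S - new_max)
        row_sum = row_sum * rescale + P.sum(axis=-1, keepdims=True)
        out = out * rescale + P @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The `rescale` factor is the whole trick: when a later block raises the running max, everything accumulated so far is scaled down to match, so the final result equals the full softmax.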

Automatically enabled in vLLM, SGLang, and PyTorch's scaled_dot_product_attention. You don't configure it.

7. vLLM's superpowers

what's vLLM? The most popular open-source LLM serving engine. Nginx for language models - point it at a model, get a high-throughput inference server. SGLang and TGI are alternatives. All bundle the four optimizations below.

Together these four give 5-10x throughput over a naive HuggingFace .generate() loop. All configurable, and at scale you'll tune them.

7.1 - PagedAttention (OS virtual memory for the KV cache)

Naive KV cache allocation: reserve max possible context length per request. Model supports 128K context? Reserve gigabytes. Even if the user sends "hi".

PagedAttention: a block table maps logical positions to scattered physical blocks. Shared prefixes use copy-on-write.

If you've taken an OS class, you know this story:

Split VRAM into fixed-size blocks (16 tokens each)
Allocate on demand as the sequence grows
Block table maps logical positions to scattered physical blocks (just like OS page tables)
Copy-on-write: two requests share a system prompt? Same physical blocks. Fork only when they diverge.

This is vLLM's whole reason for existing. On by default. Explains why it packs 2-3x more requests per GPU than a naive setup.
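A toy version of the block table makes the idea concrete. This is an illustration only, not vLLM's implementation: the class and method names are hypothetical, and copy-on-write sharing is left out.

```python
BLOCK_SIZE = 16  # tokens per block, as in the text

class PagedKVAllocator:
    """Toy allocator: hands out fixed-size physical blocks on demand."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical block ids not in use
        self.tables = {}                      # request id -> list of physical blocks
        self.lengths = {}                     # request id -> tokens stored

    def append_token(self, req):
        table = self.tables.setdefault(req, [])
        length = self.lengths.get(req, 0)
        if length == len(table) * BLOCK_SIZE:  # last block is full (or none yet)
            if not self.free:
                raise MemoryError("KV cache is full")
            table.append(self.free.pop())
        self.lengths[req] = length + 1

    def physical_slot(self, req, position):
        """Translate a logical token position to (physical block, offset)."""
        return self.tables[req][position // BLOCK_SIZE], position % BLOCK_SIZE

    def release(self, req):
        self.free.extend(self.tables.pop(req))
        del self.lengths[req]

alloc = PagedKVAllocator(num_blocks=8)
for _ in range(2):             # a short request ("hi") uses one block, not a 128K reservation
    alloc.append_token("a")
for _ in range(40):            # a longer request grows block by block
    alloc.append_token("b")
print(alloc.tables, alloc.physical_slot("b", 37))
```

The short request holds one block and the 40-token request holds three, so waste is bounded by one partially filled block per request.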

7.2 - continuous batching

Static batching: collect N requests, pad to same length, process as one batch, return together. Short responses sit idle waiting for the longest. Massive waste.

Static batching wastes GPU cycles on padding. Continuous batching keeps the batch full.

Continuous batching: at every decode step, check if any request finished. If so, evict it and slot in a new request from the queue. Batch stays full. GPU stays busy.

SWE analogy: Static batching = thread-per-request server waiting for the slowest connection. Continuous batching = event loop. Admit new work as soon as a slot opens.

Example run: eleven requests arrive over time; the scheduler admits up to 6 at once into the KV cache. Prefill is one step, decode is many.

On by default. You don't configure it; you benefit from it.
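A small simulation shows the gap between the two policies. It counts decode steps only and ignores prefill cost; the output lengths and slot count are made up.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """After every decode step, finished requests leave and queued ones join."""
    queue = deque(lengths)
    running = []                      # remaining tokens for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())       # fill any free slot
        running = [r - 1 for r in running]        # one decode step for the batch
        running = [r for r in running if r > 0]   # evict finished requests
        steps += 1
    return steps

# Made-up output lengths (tokens) for 8 requests, 4 slots on the GPU.
lengths = [5, 100, 8, 12, 90, 6, 7, 10]
print("static:    ", static_batching_steps(lengths, 4), "steps")
print("continuous:", continuous_batching_steps(lengths, 4), "steps")
```

With these lengths, static batching needs 190 steps and continuous batching 100: the second long request starts as soon as the first short one finishes instead of waiting for the whole batch.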

7.3 - chunked prefill

Even with continuous batching, there's a problem: when a long prompt is admitted, its 500ms prefill stalls every decode-phase request in the batch. Prefill piracy. Latency spikes for everyone.

Chunked prefill: break long prefills into 512-token chunks, interleave decode steps between them. The long prefill still finishes. But no one else's TPOT spikes.

SWE analogy: Preemptive scheduling. No single process starves others.
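A rough sketch of the effect, under a deliberately crude cost model: prefill time is assumed linear in tokens, and both timing constants are invented so that a 5000-token prompt costs the 500ms mentioned above.

```python
def schedule(prompt_tokens, chunk_size=None):
    """Return the per-step prefill token counts for one long prompt.

    With chunk_size=None the whole prompt is prefilled in a single step;
    otherwise it is split into chunks and running decodes ride along each step.
    """
    if chunk_size is None:
        return [prompt_tokens]
    return [min(chunk_size, prompt_tokens - start)
            for start in range(0, prompt_tokens, chunk_size)]

MS_PER_PREFILL_TOKEN = 0.1   # assumption: made-up cost model
DECODE_STEP_MS = 20          # assumption: made-up baseline decode step

def worst_decode_gap_ms(steps):
    # A decode token sharing a step with prefill work waits for that work too.
    return DECODE_STEP_MS + max(steps) * MS_PER_PREFILL_TOKEN

unchunked = schedule(5000)
chunked = schedule(5000, chunk_size=512)
print(len(unchunked), "step, worst TPOT", round(worst_decode_gap_ms(unchunked), 1), "ms")
print(len(chunked), "steps, worst TPOT", round(worst_decode_gap_ms(chunked), 1), "ms")
```

The total prefill work is unchanged; it is spread over ten steps so no single decode step absorbs all of it.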

7.4 - prefix caching

Most production traffic shares prefixes: same system prompt, same few-shot examples, same RAG context. Without caching, every request recomputes KV for the shared prefix. Thousands of tokens of duplicate work.

Automatic prefix caching: store KV blocks keyed by token content. New requests with matching prefix reuse cached blocks, skip that portion of prefill. TTFT drops to near-zero for the shared part.
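One way to key blocks by content is to hash each block together with the hash of everything before it, so a key matches only when the whole prefix matches. The sketch below illustrates that idea; it is not vLLM's code, the class is hypothetical, and the cached "KV" is a placeholder string.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; tiny so the example is easy to follow

def block_keys(token_ids):
    """One key per full block; each key also covers every block before it."""
    keys, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        keys.append(prev)
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}   # key -> cached KV block (a placeholder string here)

    def prefill(self, token_ids):
        """Return how many prompt tokens were served from cache."""
        reused, still_matching = 0, True
        for key in block_keys(token_ids):
            if still_matching and key in self.blocks:
                reused += BLOCK_SIZE
            else:
                still_matching = False      # first miss ends the shared prefix
                self.blocks[key] = "kv"     # stand-in for freshly computed KV
        return reused

cache = PrefixCache()
system_prompt = list(range(100, 112))            # 12 shared tokens = 3 blocks
first = cache.prefill(system_prompt + [1, 2, 3, 4, 5])
second = cache.prefill(system_prompt + [9, 8, 7, 6, 5])
print(first, second)
```

The first request reuses nothing; the second skips prefill for all 12 system-prompt tokens and only computes its own suffix.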

# turn on all four in vLLM
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="Qwen/Qwen3-8B-AWQ",
    quantization="awq",
    enable_prefix_caching=True,    # 7.4 - reuse shared prefixes
    enable_chunked_prefill=True,   # 7.3 - no more prefill piracy
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    # PagedAttention (7.1) and continuous batching (7.2) are always on
)

Three flag flips. ~5x over raw .generate(). These are table stakes.

8. quantization

The single biggest lever you have.

why it works

Weights default to FP16/BF16 - 16 bits per number. A 70B model = 70 billion x 2 bytes = 140 GB.

Quantization: store those numbers in fewer bits. INT4 = 4 bits = 35 GB for the same model. 4x less memory, 4x less bandwidth consumed during decode.

Decode is memory-bound. Read 4x less from HBM = up to 4x faster generation. That's it.

won't quality crash? No - up to a point. Weights have redundancy. 8-bit loses < 1%. 4-bit loses ~5%. Below 4-bit, the model breaks.

how is quality measured? Perplexity - how "surprised" the model is by held-out text. Lower = better. A ~5% increase is rarely noticeable to users, but check it against your own task.

how it works

The simplest version: symmetric (absmax) quantization. You have a tensor of FP16 weights. You want INT8.

Find the largest absolute value in the tensor: $\alpha = \max(|W|)$
Compute a scale factor: $s = \frac{\alpha}{2^{b-1} - 1}$ where b = target bits (8 for INT8, so you're mapping to [-127, 127])
Quantize: $q = \text{round}(W / s)$
At inference, dequantize: $\hat{W} = q \times s$

The rounding introduces error, but it's tiny per-weight and averages out across billions of parameters.
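The four steps map directly onto a few lines of NumPy. A minimal per-tensor sketch, with random numbers standing in for real weights:

```python
import numpy as np

def quantize_absmax(w, bits=8):
    qmax = 2 ** (bits - 1) - 1               # 127 for INT8
    scale = np.abs(w).max() / qmax           # s = alpha / (2^(b-1) - 1)
    q = np.round(w / scale).astype(np.int8)  # q = round(W / s); fits int8 for bits <= 8
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale      # W_hat = q * s

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)   # stand-in for a weight tensor
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()
print(f"scale={scale:.5f}, max abs error={max_err:.5f}")
```

The worst-case error per weight is half a quantization step (`scale / 2`), which is why a small scale, meaning a small absmax, matters so much.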

Asymmetric (zero-point) quantization handles distributions that aren't centered around zero. It adds a zero-point offset: $q = \text{round}(W / s + z)$ , where z shifts the integer range to cover the actual value distribution. More accurate for skewed weights, slightly more compute at inference.

the outlier problem

Symmetric quantization works well for vision models (ResNet quantizes to INT8 with zero quality loss). Transformers are harder.

The problem: ~0.1% of activation dimensions produce values 10-100x larger than the rest. When you compute the scale factor, those outliers stretch the integer range. Every non-outlier value gets squeezed into a tiny band and loses precision. One extreme value ruins thousands of normal values.

example: your activations range from [-1, 1] except one dimension hits 50. The scale maps [-127, 127] to [-50, 50]. Your [-1, 1] values now land on just 7 integer levels (-3 to 3) instead of 255. Most of your information is gone.
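The squeeze is easy to reproduce by counting how many distinct INT8 levels the normal values land on. The evenly spaced sample values below are invented for the illustration.

```python
def int8_levels_used(values, absmax):
    scale = absmax / 127
    return sorted({round(v / scale) for v in values})

# Normal activations spread evenly over [-1, 1].
normal = [i / 1000 for i in range(-1000, 1001)]

without_outlier = int8_levels_used(normal, absmax=1.0)
with_outlier = int8_levels_used(normal, absmax=50.0)   # one dimension hit 50

print(len(without_outlier), "levels used when the range is [-1, 1]")
print(len(with_outlier), "levels used when an outlier stretches it to [-50, 50]:", with_outlier)
```

One outlier sets the scale for the whole tensor, so every other value pays for it in lost resolution.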

This is why naive per-tensor INT8 fails for large language models and why the techniques below exist.

PTQ vs QAT

Two paths to a quantized model:

Post-Training Quantization (PTQ): train in FP16, quantize after. Fast, cheap, no retraining. This is what you'll use 99% of the time. The catch: needs careful handling of outliers.

Quantization-Aware Training (QAT): train with fake-quantized weights so the model learns to tolerate rounding error. More accurate at extreme compression (2-3 bit), but requires full training runs. Rarely worth it unless you're pushing below 4-bit.

techniques you'll see on HuggingFace

GPTQ - quantizes weights one column at a time, adjusting remaining weights to compensate for each rounding error. Uses the Hessian ( $H \approx X^TX$ , where X is the layer's input activations) to figure out which weights are most sensitive. Gets 4-bit with minimal perplexity loss.

AWQ (Activation-Aware Weight Quantization) - instead of treating all weights equally, identifies which weights matter most by looking at activation magnitudes. Protects the ~1% of salient weights, quantizes the rest aggressively. Often slightly better than GPTQ at 4-bit.

SmoothQuant - solves the outlier problem for W8A8 (both weights AND activations in INT8). Migrates the quantization difficulty from activations to weights: divides activations by a per-channel scale, multiplies weights by the same scale. Activations become smooth (easy to quantize), weights absorb the variance (but weights are easier to quantize anyway).
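The SmoothQuant rescaling is mathematically a no-op on the layer output, which a toy check in NumPy confirms. The shapes and the outlier channel are made up; the scale formula and alpha = 0.5 follow the SmoothQuant paper. The quantization step itself is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))        # activations: 4 tokens, 8 channels
X[:, 3] *= 50                          # channel 3 is an outlier channel
W = rng.standard_normal((8, 5))        # weights: 8 input channels, 5 outputs

alpha = 0.5                            # migration strength from the SmoothQuant paper
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                       # activations divided per channel
W_smooth = W * s[:, None]              # matching weight rows multiplied

assert np.allclose(X @ W, X_smooth @ W_smooth)   # the layer output is unchanged
print("activation absmax before:", np.abs(X).max().round(1))
print("activation absmax after: ", np.abs(X_smooth).max().round(1))
```

The activation range shrinks while the product stays the same, so the INT8 scale for activations no longer has to stretch around one channel.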

which to pick: For 4-bit weight-only: AWQ or GPTQ (AWQ is the current default). For W8A8 (weights + activations): SmoothQuant or FP8. For zero-effort on Hopper GPUs: FP8.

number formats you'll meet

Format	Bits	When to use
FP32	32	Training default. Don't use for inference.
BF16	16	Inference default. Baseline.
FP8	8	Hopper+ GPUs. Near-lossless. Zero effort.
INT8 (W8A8)	8	Weights AND activations in 8-bit. Faster prefill too.
INT4 (AWQ/GPTQ)	4	Biggest decode speedup. ~5% quality loss.
INT2/INT3	<=3	Model breaks. Research only.

# FP8 if you have a Hopper GPU
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",   # Hugging Face repo id
    quantization="fp8",            # on-the-fly, no separate model
    kv_cache_dtype="fp8",         # bonus: KV in FP8 too
)

first thing to try: Out of everything in this post, quantization gives the biggest single jump. Try it before anything else.

9. when one GPU isn't enough

Llama 70B at FP16 = 140 GB. H100 = 80 GB. Doesn't fit. Even at 4-bit (35 GB), a production-sized KV cache can push the total past 80 GB. You need multiple GPUs.

tensor parallelism (TP) - the one you'll use

Split each weight matrix across GPUs. Each GPU holds 1/N of every weight. At inference time, each computes its slice, then they exchange partial results via AllReduce (every GPU contributes its partial, ends up with the sum).

TP splits weight matrices column-wise. An AllReduce synchronizes results on every layer.

AllReduce happens on every layer. Lots of communication.
Needs NVLink (~900 GB/s, GPU-to-GPU within one box). InfiniBand across nodes is too slow (~50 GB/s).
Max TP=8 because that's how many H100s fit in one NVLink domain.

Rule of thumb: use the smallest TP that fits your model + KV cache. TP=2 usually gets more throughput per GPU than TP=4 for the same model, because fewer GPUs take part in every AllReduce and less time goes to communication.
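The split-then-sum pattern can be simulated on one machine with NumPy. This sketch assumes the common Megatron-style layout for an MLP block (column-split the first matrix, row-split the second); the "GPUs" are list entries, the AllReduce is a plain sum, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
TP = 4                                   # number of simulated GPUs
x = rng.standard_normal((2, 16))         # 2 tokens, hidden size 16
W1 = rng.standard_normal((16, 64))       # MLP up-projection
W2 = rng.standard_normal((64, 16))       # MLP down-projection

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2    # ReLU stands in for the real activation

# Each "GPU" holds a column slice of W1 and the matching row slice of W2.
W1_shards = np.split(W1, TP, axis=1)
W2_shards = np.split(W2, TP, axis=0)

partials = [mlp(x, w1, w2) for w1, w2 in zip(W1_shards, W2_shards)]
y = np.sum(partials, axis=0)             # the AllReduce: sum partials across GPUs

assert np.allclose(y, mlp(x, W1, W2))
print("each shard holds", W1_shards[0].size + W2_shards[0].size,
      "of", W1.size + W2.size, "weights")
```

Each shard reads a quarter of the weights, which is the bandwidth win; the sum at the end is the communication cost you pay for it on every layer.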

# vLLM with tensor parallelism
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",   # Hugging Face repo id
    tensor_parallel_size=4,    # split across 4 GPUs in the same node
    gpu_memory_utilization=0.90,
)

the other two (safe to skip)

Pipeline parallelism (PP) - split by layers. Point-to-point communication, works across nodes on InfiniBand. Catch: pipeline bubbles. Only matters for 400B+ models.

Expert parallelism (EP) - for MoE models. Distributes experts across GPUs via All-to-All. Skip unless deploying MoE.

Production combo: TP=8 within a node + PP=N across nodes. For 70B or smaller: just TP=2 or TP=4.

10. serving the model

vLLM is configured. Now your app talks to it. This is where SWE work lives.

the one-line server

vLLM ships an OpenAI-compatible HTTP server:

$ vllm serve Qwen/Qwen3-8B-AWQ \
    --quantization awq \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --port 8000
 
# now you have an OpenAI-compatible endpoint at localhost:8000

"OpenAI-compatible" = any client library that talks to the OpenAI API talks to your vLLM server unchanged. Switch base_url and you're done. Drop-in replacement.

streaming

Wait for the full response = user stares at a spinner for 5+ seconds. Stream tokens via SSE (Server-Sent Events) instead - server keeps the connection open, pushes tokens as they generate. Text appears word-by-word like ChatGPT.

The SDK handles this with stream=True:

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",    # vLLM doesn't check, but SDK requires non-empty
)
 
stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B-AWQ",
    messages=[{"role": "user", "content": "explain CAP theorem"}],
    stream=True,
    max_tokens=512,
)
 
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

tuning

Two knobs:

--max-num-seqs - max concurrent requests in batch. Higher = more throughput + more KV memory + longer TPOT. Start at 64.
--max-num-batched-tokens - total tokens per scheduler step. Caps prefill-decode packing.

scaling out

One vLLM process saturates one GPU (or TP group). More traffic = more replicas behind a load balancer. Use least-pending-requests routing, not round-robin - LLM requests have wildly varying durations.
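Least-pending routing is a few lines of bookkeeping. The sketch below is a toy in-process version with hypothetical replica names; in practice this logic usually lives in your load balancer or gateway.

```python
class LeastPendingRouter:
    """Toy router: send each request to the replica with the fewest in flight."""

    def __init__(self, replicas):
        self.pending = {name: 0 for name in replicas}

    def acquire(self):
        replica = min(self.pending, key=self.pending.get)
        self.pending[replica] += 1
        return replica

    def release(self, replica):
        self.pending[replica] -= 1

router = LeastPendingRouter(["vllm-0", "vllm-1"])   # hypothetical replica names
a = router.acquire()        # goes to vllm-0
b = router.acquire()        # goes to vllm-1
router.release(b)           # the short request on vllm-1 finishes quickly
c = router.acquire()        # vllm-0 is still busy, so vllm-1 gets it again
print(a, b, c)
```

Round-robin would have sent the third request to vllm-0, behind the long-running one.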

shipping checklist:
- /health endpoint for the LB
- /metrics (Prometheus format, built-in)
- generous request timeouts (long context = long requests)
- auth via API gateway in front, never on the model server

11. the deployment playbook

Latency bad. Throughput bad. Bill bad. Where to start? This order:

0. Profile first. Bottleneck might be tokenization, queuing, or network - not the model. torch.profiler or Nsight Systems before touching anything.

1. Switch to vLLM/SGLang. If you're on raw .generate(), this alone is 5-10x.

2. Quantize. AWQ 4-bit or FP8. Biggest single jump.

3. Prefix caching + chunked prefill. Free wins for shared system prompts.

4. Right-size GPU count. Smallest TP that fits. TP=2 beats TP=4.

5. KV cache quantization. kv_cache_dtype="fp8" - doubles effective batch size.

6. Speculative decoding. 2-3x for predictable outputs. Skip for creative generation.

7. Disaggregate prefill/decode. Different GPU pools per phase. Worth it at 100+ GPUs.

12. watching the real ceiling

The playbook tells you what to turn on. This tells you when you've hit the wall.

Compute is almost never the limit. The KV cache is. 80GB A100, 7B model, 4K context: FLOPS cover ~141 concurrent requests, KV cache runs out at ~13. Order of magnitude. (source)

arithmetic intensity: prefill ~4500 FLOPs/byte (compute-bound), decode ~1 FLOP/byte. A100 roofline crossover is ~156. Decode leaves the cores idle.

preemption: the cliff

vLLM admits requests until the KV cache fills. Past 100%, it evicts running requests: their KV gets dropped, recomputed on resume, latency roughly doubles. Sweet spot sits just under 100% - full enough to keep the batch busy, not so full a burst tips you over.

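two numbers in vLLM's /metrics (exact names vary by vLLM version):
- vllm:kv_cache_usage_perc (vllm:gpu_cache_usage_perc on older versions) - headroom. Tune --max-num-seqs so this sits high but stable.
- vllm:num_preemptions_total - a counter that should stay nearly flat. Climbing means you crossed the line.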

MLP is ~71% of the compute across both phases. Optimizing a kernel? Start there, not attention.

further reading:
- LLM inference throughput - the math above, worked out.
- inside vLLM - scheduler, paged attention, prefix caching at source level.
- fast matmul - coalesced loads, SMEM tiling, tensor cores, wave quantization.