
inference engineering

Your GPU is mostly idle during text generation. The entire inference stack exists to fix that.

TL;DR: Training teaches a model to think. Inference runs it thousands of times per second under a GPU budget. The whole stack exists to fight one problem: during token-by-token generation, your GPU is mostly idle waiting for memory. Every optimization below is a different way of fixing that.


1. what inference actually is

You finished training. You have a 16 GB blob of weights sitting on disk. A user types a prompt. You need to:

  • load the weights into GPU memory
  • run the prompt through the model to get a probability distribution over the next token
  • pick one token, append it, repeat until done

That last loop is the painful part. Each output token = another full pass through the model. A 100-token reply = 100 sequential forward passes. Token N+1 depends on token N, so you can't parallelize this.
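The loop above, as a toy sketch. Assumptions: `toy_next_token_logits` is a hypothetical stand-in for a real forward pass, the vocabulary has 8 tokens, and decoding is greedy (argmax) rather than sampled.

```python
from typing import List, Tuple

VOCAB_SIZE = 8
EOS = 0  # hypothetical end-of-sequence token id


def toy_next_token_logits(tokens: List[int]) -> List[float]:
    """Stand-in for one full forward pass; a real model reads all its weights here."""
    nxt = (tokens[-1] + 1) % VOCAB_SIZE  # toy rule: count upward, wrap around to EOS
    return [1.0 if i == nxt else 0.0 for i in range(VOCAB_SIZE)]


def generate(prompt: List[int], max_new_tokens: int) -> Tuple[List[int], int]:
    """Greedy decoding. Returns the full token list and the number of forward passes."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(tokens)  # one pass per output token
        forward_passes += 1
        next_token = max(range(VOCAB_SIZE), key=logits.__getitem__)  # greedy pick
        tokens.append(next_token)  # step N+1 needs the token chosen at step N
        if next_token == EOS:
            break
    return tokens, forward_passes


tokens, passes = generate([3], max_new_tokens=10)
print(tokens, passes)
```

Each iteration needs the previous iteration's token, which is why the passes cannot run in parallel for a single request.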

analogy: The model is a 16 GB lookup table. To answer one question, the GPU reads every entry. To answer the next question, it reads every entry again. Thousands of cores sit idle waiting for the next chunk of data to arrive from memory. Inference engineering = keeping those cores busy.

the iron triangle

Three things you trade off:

  • Latency - how fast one user sees output
  • Throughput - how many users per second per GPU
  • Cost - $/million tokens

At batch=1, an H100 pushes ~200 tok/s from an 8B model. At batch=128: ~25,000 tok/s. Same GPU, same hourly rate. Per-token cost drops ~125x. But more batching = each user waits longer.
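The cost side of the triangle is plain arithmetic. A quick sketch, where the $3.00/hour GPU price is an assumption for illustration, not a number from this post:

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per million generated tokens for one GPU at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


GPU_PRICE = 3.00  # assumed $/hour for one H100 (hypothetical)

batch_1 = cost_per_million_tokens(GPU_PRICE, 200)       # ~200 tok/s at batch=1
batch_128 = cost_per_million_tokens(GPU_PRICE, 25_000)  # ~25,000 tok/s at batch=128

print(f"batch=1:   ${batch_1:.2f} per million tokens")
print(f"batch=128: ${batch_128:.3f} per million tokens")
print(f"cost ratio: {batch_1 / batch_128:.0f}x")
```

The hourly price cancels out of the ratio, so the saving is just the throughput ratio.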

The serving game: pack as many concurrent requests onto each GPU as possible without blowing latency past your SLO (e.g., "P99 TTFT under 500ms"). Every technique below either lets you pack more requests, or makes each one faster. That is the whole idea.

2. GPU 101 - the hardware vocab

A few terms, because every optimization below is just moving data between different kinds of memory.

VRAM and HBM - same thing, two names

Your CPU has RAM. The GPU has its own separate memory called VRAM. On AI GPUs, VRAM uses a technology called HBM (High Bandwidth Memory). "HBM" and "VRAM" mean the same thing in this post - the GPU's main memory pool.

  • H100 has 80 GB of HBM
  • Separate from CPU RAM. Data copies over PCIe (slow)
  • HBM bandwidth: ~3 TB/sec on H100. This number bottlenecks decode.

SRAM - the GPU's tiny on-chip cache

Inside the GPU chip: tiny pools of much faster memory called SRAM (shared memory / L1 cache). Each compute unit has ~256 KB.

  • ~6x faster than HBM (~19 TB/s vs ~3 TB/s)
  • ~2400x smaller in total (~33 MB across all 132 SMs vs 80 GB)
  • FlashAttention's whole trick: do as much work as possible in SRAM before touching HBM
[Figure: GPU memory hierarchy, speed vs size. SRAM (on-chip): 256 KB per SM, ~19 TB/s, tiny, fastest. L2 cache: ~50 MB total, ~12 TB/s. HBM (= VRAM), the GPU's main memory, where model weights and the KV cache live: 80 GB, ~3 TB/s, huge, slow. CPU RAM: over PCIe (~30 GB/s, way too slow for inference).]
Data flows up this hierarchy. The bottleneck in inference is the HBM bandwidth, not compute.

memory bandwidth - the number that matters

How fast data moves between memory and compute. On an H100:

  • Compute: ~1000 TFLOPS
  • HBM bandwidth: ~3 TB/sec

That ratio (~300 ops per byte) is everything. If your workload reads 1 byte and does 10 ops with it, you waste 290 op-cycles waiting for the next byte. That's decode. Welcome to memory-bound.
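That ratio is the ridge point of a roofline model. A minimal sketch using the H100 numbers above. Assumption: during decode an FP16 weight (2 bytes) does one multiply-add (2 FLOPs) per request in the batch, so intensity is roughly `batch` FLOPs per byte. KV cache reads are ignored.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs the chip can do in the time it takes to read one byte from HBM."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Roofline: capped by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


H100_FLOPS = 1000e12  # ~1000 TFLOPS, from the text
H100_BW = 3e12        # ~3 TB/s, from the text

ridge = ridge_point(H100_FLOPS, H100_BW)
print(f"ridge point: ~{ridge:.0f} FLOPs per byte")

# Assumed decode intensity: ~batch FLOPs per byte of weights read.
for batch in (1, 32, 128, 512):
    used = attainable_flops(float(batch), H100_FLOPS, H100_BW) / H100_FLOPS
    print(f"batch={batch:>3}: ~{used:.1%} of peak compute reachable")

decode_utilization = attainable_flops(1.0, H100_FLOPS, H100_BW) / H100_FLOPS
```

Under these assumptions, batch size is the knob that moves decode toward the ridge point.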

tensor cores

Specialized units that do small matrix multiplications fast. Busy during prefill (compute-bound), idle during decode (memory-bound).

GPU names you'll see

| Name | Generation | VRAM | Typical use |
|---|---|---|---|
| H100 / H200 | Hopper (NVIDIA) | 80 / 141 GB | Current flagship for inference |
| B200 | Blackwell | 192 GB | Newer, faster, expensive |
| A100 | Ampere | 40 / 80 GB | Previous gen, still common |
| RTX 4090 / 5090 | Consumer | 24 / 32 GB | Local / hobby inference |

HBM = the GPU's main memory. It's big but slow.
SRAM = the GPU's on-chip cache. It's tiny but fast.
Memory bandwidth = how fast data moves between them. When this is the bottleneck, your GPU sits idle.

3. prefill vs decode

Two phases, completely different bottlenecks. This split is the single most important mental model in this post.

[Figure: prefill vs decode. PREFILL: process whole prompt at once (prompt x weights), compute-bound, tensor cores saturated, drives TTFT, KV cache built here. DECODE: one token at a time, in a loop (token x whole model), memory-bound, GPU idle waiting on HBM, drives TPOT.]
Prefill: matrix-matrix multiply (compute-bound). Decode: vector-matrix multiply (memory-bound).

prefill

Your prompt has 500 tokens. The model processes all 500 at once - one big matrix (prompt embeddings) times another big matrix (weights). Tensor cores light up. Every multiply-add unit does useful work. This is what GPUs are built for.

Compute-bound. Limited by FLOPS. Determines TTFT - time to first token. Long prompt = long prefill = user waits.

decode

Now you're generating. You have one token. You multiply its hidden state (a tiny vector) by the entire 16 GB of model weights to predict the next token.

The math is trivial - a vector-matrix multiply. The pain is memory: the GPU reads all 16 GB from HBM for that one tiny computation. Then does it again for the next token. And again.

why memory-bound? The ~300 ops-per-byte ratio from section 2. Decode does far fewer than 300 ops per byte read. The GPU sits idle waiting for memory. That is the entire problem.

Memory-bandwidth-bound. More compute doesn't help. Only reading less from memory helps. Determines TPOT - time per output token.
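You can estimate the TPOT floor from two numbers already in this post. A back-of-envelope sketch, assuming every decode step streams all 16 GB of weights from HBM exactly once and ignoring KV cache reads and the point where compute takes over:

```python
WEIGHT_BYTES = 16e9    # the 16 GB model from the text
HBM_BANDWIDTH = 3e12   # ~3 TB/s on H100, from the text


def min_tpot_seconds(weight_bytes: float, bandwidth: float) -> float:
    """Lower bound: one decode step reads every weight from HBM once."""
    return weight_bytes / bandwidth


def max_tokens_per_second(batch: int) -> float:
    """One weight read serves the whole batch, so throughput scales with batch size."""
    return batch / min_tpot_seconds(WEIGHT_BYTES, HBM_BANDWIDTH)


tpot_ms = min_tpot_seconds(WEIGHT_BYTES, HBM_BANDWIDTH) * 1000
print(f"TPOT floor: {tpot_ms:.1f} ms")
print(f"batch=1:   {max_tokens_per_second(1):,.0f} tok/s")
print(f"batch=128: {max_tokens_per_second(128):,.0f} tok/s")
```

The results land close to the ~200 and ~25,000 tok/s figures from section 1, which is the point: decode speed is set by bytes moved, not by FLOPS.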

the central insight: Prefill and decode have different bottlenecks. What helps prefill (more FLOPS) doesn't help decode. What helps decode (less memory traffic, bigger batches) doesn't help prefill. Production systems treat them as separate problems.

4. the metrics that matter

Four numbers. If you can't recite these about your deployment, you don't understand it.

TTFT - Time To First Token. Request arrival to first character visible. Driven by prefill + queuing. Anything > 1s feels laggy. Target < 500ms P99 for chat.

TPOT - Time Per Output Token. Gap between consecutive tokens during decode. Driven by HBM bandwidth + batch size. < 50ms feels smooth (~20 tok/s). Agent loops can tolerate < 100ms.

Goodput. Requests/second that meet your SLO. Raw throughput is misleading - 1000 req/s means nothing if half miss the latency target.

MFU - Model FLOPS Utilization. Fraction of peak FLOPS you use. Training hits 40-60%. Inference hits 10-30% - normal, because decode is memory-bound.

averages lie. One slow request in ten makes an app feel broken. Track P50, P95, P99. Never the mean. P99 TTFT in a chat app with a million DAU = 10,000 unhappy users.
# measuring with genai-perf (production load test)
$ genai-perf profile \
    --model Qwen/Qwen3-8B-AWQ \
    --endpoint-type chat \
    --url localhost:8000 \
    --concurrency 16 \
    --synthetic-input-tokens-mean 512 \
    --output-tokens-mean 128
 
# reports P50/P90/P99 for TTFT, TPOT, throughput, goodput
# run before AND after each optimization
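Why percentiles and goodput instead of the mean, in numbers. A small sketch with made-up data: the 200 ms / 3000 ms latencies and the 500 ms SLO are assumptions for illustration.

```python
import math
import statistics
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFT samples in ms: 90 fast requests, 10 slow ones.
ttft_ms = [200] * 90 + [3000] * 10
SLO_MS = 500

mean_ms = statistics.mean(ttft_ms)
p50, p95, p99 = (percentile(ttft_ms, p) for p in (50, 95, 99))
goodput_fraction = sum(t <= SLO_MS for t in ttft_ms) / len(ttft_ms)

print(f"mean={mean_ms} ms  P50={p50} ms  P95={p95} ms  P99={p99} ms")
print(f"fraction of requests meeting the SLO: {goodput_fraction:.0%}")
```

The mean sits under the SLO while one request in ten takes three seconds. Multiply `goodput_fraction` by raw requests/second to get goodput.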

5. the KV cache

The central data structure of LLM inference.

why it exists

attention, briefly: Transformers work via attention. For every token, the model computes three vectors - Query (Q), Key (K), and Value (V). Don't worry about the math. What matters:

To generate token N, attention looks at the K and V of every previous token (1 through N-1).

Naively, you'd recompute K and V for every previous token at every step. Generating the 100th token = redoing work for tokens 1-99. 99% wasted compute.

the fix: K and V for a token never change once computed. Compute them once, cache them in HBM. That cache - one entry per token, per layer, per attention head - is the KV cache. Just memoization.
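A toy check that the cache really is just memoization. This is a single-head, single-layer NumPy sketch with random weights (all names here are hypothetical, not from any library): one path recomputes K and V for every previous token at each step, the other appends to a cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy head dimension
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))


def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Softmax attention of one query over all cached keys/values."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V


def step_without_cache(hidden_states: np.ndarray) -> np.ndarray:
    """Naive: recompute K and V for every token seen so far."""
    K = hidden_states @ W_k
    V = hidden_states @ W_v
    q = hidden_states[-1] @ W_q
    return attend(q, K, V)


class KVCache:
    """Compute K and V once per token, keep them, reuse them."""

    def __init__(self) -> None:
        self.keys, self.values = [], []

    def step(self, new_hidden: np.ndarray) -> np.ndarray:
        self.keys.append(new_hidden @ W_k)     # only the new token's K
        self.values.append(new_hidden @ W_v)   # only the new token's V
        q = new_hidden @ W_q
        return attend(q, np.stack(self.keys), np.stack(self.values))


hidden = rng.standard_normal((6, d))  # stand-in hidden states for 6 tokens
cache = KVCache()
for n in range(1, 7):
    cached = cache.step(hidden[n - 1])
    recomputed = step_without_cache(hidden[:n])
    assert np.allclose(cached, recomputed)  # same output, far less work
```

The cached path does one K/V projection per step; the naive path does `n` of them at step `n`.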

why it dominates memory

The KV cache grows with every token. The formula:

$$\text{KV size} = 2 \times B \times L \times H_{kv} \times d \times N \times \text{bytes}$$

$B$ = batch, $L$ = layers, $H_{kv}$ = KV heads, $d$ = head dim, $N$ = seq len. The 2 is for storing both K and V.

Plug in Llama 70B ($L=80$, $H_{kv}=8$, $d=128$) at FP16, one user, 4K context:

$$2 \times 1 \times 80 \times 8 \times 128 \times 4096 \times 2 \approx \textbf{1.34 GB}$$

That's per user. Bump to batch=32: ~43 GB. Stretch those 32 users to 32K context: ~344 GB. Llama 70B itself is only 140 GB.

[Figure: VRAM usage vs an 80 GB H100, Llama 70B. Model weights: 140 GB (constant, needs 2 GPUs). KV cache, batch=1, 4K context: ~1.3 GB. KV cache, batch=32, 32K context: ~344 GB, bigger than the model itself. Caption: The KV cache, not the model weights, is the memory bottleneck of LLM serving.]

The plot twist: the KV cache, not the model weights, is the memory bottleneck. Every "fit more requests on this GPU" technique is really just "make the KV cache smaller."
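The formula is easy to turn into a calculator. A sketch using the Llama 70B shape from above; the function name and the extra batch/context combinations are mine, chosen for illustration:

```python
def kv_cache_bytes(batch: int, layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes. The leading 2 covers both K and V."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_value


LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)  # GQA: 8 KV heads

for batch, seq_len in [(1, 4096), (32, 4096), (32, 32768)]:
    size = kv_cache_bytes(batch=batch, seq_len=seq_len, **LLAMA_70B)
    print(f"batch={batch:>2}, context={seq_len:>5}: {size / 1e9:6.1f} GB")

# Same model shape with full multi-head attention (64 KV heads) for comparison.
mha = kv_cache_bytes(batch=1, layers=80, kv_heads=64, head_dim=128, seq_len=4096)
print(f"batch= 1, context= 4096, 64 KV heads: {mha / 1e9:.1f} GB")
```

Every factor is linear, so batch size, context length, KV head count, and bytes per value are all equally valid levers.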

how to fight it

Three levers, in order of frequency:

  • PagedAttention - eliminates wasted reservation (next section)
  • Prefix caching - reuse cache across requests with shared prefixes
  • KV cache quantization - store K, V in INT8/FP8 instead of FP16. Halves memory. < 0.5% quality loss.
# turn on KV quantization in vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    kv_cache_dtype="fp8",    # half the KV memory
    gpu_memory_utilization=0.90,
)

MHA / MQA / GQA / MLA

You don't choose this - the model architect did. But it explains why two similar-sized models can have 10x different KV cache sizes:

| Variant | What it shares | Used by |
|---|---|---|
| MHA | Nothing - each head has its own K, V (original transformer) | GPT-2 |
| MQA | All heads share one K, V pair. Aggressive savings. | PaLM |
| GQA | Heads share K, V in groups of 4-8. Sweet spot. | Llama 2/3, Mistral, Qwen |
| MLA | Projects K, V into a tiny latent space, reconstructs at attention time. ~10x compression. | DeepSeek V2/V3 |

This is why Llama 3 70B and DeepSeek V3 (671B) have similar serving costs. MLA keeps DeepSeek's KV cache tiny and MoE activates only a fraction of its weights per token, despite the model being ~10x larger.


6. FlashAttention

This is why long-context inference works at all.

the problem

Standard attention computes S = Q x K^T. That's an N x N matrix per head. At N=4K in FP16: 32 MB. At N=128K: 32 GB - bigger than most GPUs.

The naive algorithm writes the full matrix to HBM, reads it back, applies softmax, writes again, reads again, multiplies by V. Four HBM round trips for a matrix that exists only to be immediately consumed.

[Figure: standard attention vs FlashAttention. Left: standard attention materializes the full N x N matrix in HBM (slow, O(N^2) memory). Right: FlashAttention tiles it in SRAM, block by block.]

the fix (Dao et al., 2022)

Never build the full matrix. Instead:

  • Tile Q, K, V into blocks that fit in SRAM
  • For each Q block, iterate over K, V blocks. Compute attention scores in SRAM, never writing the intermediate matrix to HBM
  • Online softmax keeps running statistics so softmax computes correctly as blocks arrive. Exact result, not approximate
net effect: O(N^2) memory drops to O(N). Same exact output. Way fewer memory trips. At long context, this is "works" vs "OOM crash."
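The online-softmax trick is easier to trust once you see it match the naive result. A NumPy sketch of the math only, under stated assumptions: single head, no causal mask, toy sizes, and plain Python loops standing in for what the real kernel does in SRAM.

```python
import numpy as np


def naive_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Reference: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V


def tiled_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, block: int = 4) -> np.ndarray:
    """Same output, but only ever holds one block x block tile of scores."""
    N, d = Q.shape
    out = np.zeros((N, V.shape[1]))
    for i in range(0, N, block):
        q = Q[i:i + block]
        m = np.full(q.shape[0], -np.inf)          # running row max
        l = np.zeros(q.shape[0])                  # running softmax denominator
        acc = np.zeros((q.shape[0], V.shape[1]))  # running unnormalized output
        for j in range(0, K.shape[0], block):
            s = q @ K[j:j + block].T / np.sqrt(d)  # one tile of scores
            m_new = np.maximum(m, s.max(axis=-1))
            correction = np.exp(m - m_new)         # rescale old stats to the new max
            p = np.exp(s - m_new[:, None])
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ V[j:j + block]
            m = m_new
        out[i:i + block] = acc / l[:, None]        # normalize once at the end
    return out


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
assert np.allclose(tiled_attention(Q, K, V, block=4), naive_attention(Q, K, V))
```

The running max `m` and denominator `l` are the "running statistics": whenever a later block raises the max, earlier partial sums get rescaled by `correction`.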

Automatically enabled in vLLM, SGLang, and PyTorch's scaled_dot_product_attention. You don't configure it.


7. vLLM's superpowers

what's vLLM? The most popular open-source LLM serving engine. Nginx for language models - point it at a model, get a high-throughput inference server. SGLang and TGI are alternatives. All bundle the four optimizations below.

Together these four give 5-10x throughput over a naive HuggingFace .generate() loop. All configurable, and at scale you'll tune them.

7.1 - PagedAttention (OS virtual memory for the KV cache)

Naive KV cache allocation: reserve max possible context length per request. Model supports 128K context? Reserve gigabytes. Even if the user sends "hi".

If you've taken an OS class, you know this story:

[Figure: PagedAttention. Logical block tables: Req 1: B0 -> B3 -> B7 -> B9; Req 2: B0 -> B3 -> B5 -> B8, with B0 and B3 shared (copy-on-write). Physical VRAM: blocks B0, B3, B5, B7, B8, B9 scattered among free blocks, not contiguous, near-zero waste. Caption: a block table maps logical positions to scattered physical blocks. Shared prefixes use copy-on-write.]
  • Split VRAM into fixed-size blocks (16 tokens each)
  • Allocate on demand as the sequence grows
  • Block table maps logical positions to scattered physical blocks (just like OS page tables)
  • Copy-on-write: two requests share a system prompt? Same physical blocks. Fork only when they diverge.

This is vLLM's whole reason for existing. On by default. Explains why it packs 2-3x more requests per GPU than a naive setup.
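The bookkeeping behind this is small. A toy allocator sketch, not vLLM's implementation: `BlockAllocator` and its methods are hypothetical names, it tracks block ids only (no actual KV tensors), and copy-on-write sharing is left out.

```python
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per block, as in the text


class BlockAllocator:
    """Hands out fixed-size KV blocks on demand and keeps a block table per request."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.num_tokens: Dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % BLOCK_SIZE == 0:  # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())
        self.num_tokens[request_id] = count + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free list."""
        self.free.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = BlockAllocator(num_blocks=8)
for i in range(40):               # a 40-token request...
    allocator.append_token("long")
    if i < 2:                     # ...interleaved with a 2-token "hi"
        allocator.append_token("hi")

print(allocator.block_tables)     # scattered physical block ids per request
print(len(allocator.free), "blocks still free")
```

The 2-token request holds one block instead of a max-context reservation, and the long request's blocks do not need to be adjacent.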

7.2 - continuous batching

Static batching: collect N requests, pad to same length, process as one batch, return together. Short responses sit idle waiting for the longest. Massive waste.

[Figure: static vs continuous batching. Static batching (bad): grey = idle (padding), all requests wait for the slowest. Continuous batching (good): new requests slot in when old ones finish, 2-5x throughput. Caption: Static batching wastes GPU cycles on padding. Continuous batching keeps the batch full.]

Continuous batching: at every decode step, check if any request finished. If so, evict it and slot in a new request from the queue. Batch stays full. GPU stays busy.

SWE analogy: Static batching = thread-per-request server waiting for the slowest connection. Continuous batching = event loop. Admit new work as soon as a slot opens.

On by default. You don't configure it; you benefit from it.
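A tiny simulation makes the gap concrete. Assumptions: each request is just a count of output tokens, one decode step produces one token per running request, and prefill cost is ignored. The workload is made up.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Before every decode step, fill free slots from the queue."""
    queue = deque(output_lengths)
    running: List[int] = []  # tokens still to generate per in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())   # admit new work into open slots
        running = [r - 1 for r in running]    # one decode step: one token each
        running = [r for r in running if r > 0]  # evict finished requests
        steps += 1
    return steps


# Hypothetical workload: two long replies mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static} steps, continuous: {continuous} steps")
```

In the static case the second long request cannot start until the first batch drains. With continuous batching it starts as soon as a short request frees a slot.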

7.3 - chunked prefill

Even with continuous batching: a long prompt enters the queue, its 500ms prefill stalls every decode-phase request in the batch. Prefill piracy. Latency spikes for everyone.

Chunked prefill: break long prefills into 512-token chunks, interleave decode steps between them. The long prefill still finishes. But no one else's TPOT spikes.

SWE analogy: Preemptive scheduling. No single process starves others.
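A rough sketch of what the scheduler does with a long prompt. This is an illustration, not vLLM's scheduler: `schedule_steps` is a hypothetical helper, and it assumes every step carries one decode token per running request alongside the current prefill chunk.

```python
from typing import List, Tuple


def schedule_steps(prompt_tokens: int, decode_requests: int,
                   chunk_size: int) -> List[Tuple[int, int]]:
    """Return (prefill_tokens, decode_tokens) processed in each scheduler step."""
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        steps.append((chunk, decode_requests))  # decodes ride along with every chunk
        remaining -= chunk
    return steps


# A 4096-token prompt arrives while 8 requests are mid-decode.
unchunked = schedule_steps(4096, decode_requests=8, chunk_size=4096)
chunked = schedule_steps(4096, decode_requests=8, chunk_size=512)

print("unchunked:", unchunked)
print("chunked:  ", chunked)
print("largest step, unchunked:", max(p + d for p, d in unchunked), "tokens")
print("largest step, chunked:  ", max(p + d for p, d in chunked), "tokens")
```

The total work is the same. What changes is the size of the biggest single step, which bounds how long any decoding request waits for its next token.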

7.4 - prefix caching

Most production traffic shares prefixes: same system prompt, same few-shot examples, same RAG context. Without caching, every request recomputes KV for the shared prefix. Thousands of tokens of duplicate work.

Automatic prefix caching: store KV blocks keyed by token content. New requests with matching prefix reuse cached blocks, skip that portion of prefill. TTFT drops to near-zero for the shared part.
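"Keyed by token content" can be sketched as a hash chain over blocks. This is a toy model, not vLLM's code: `PrefixCache` is a hypothetical class, the block size is shrunk to 4 tokens for readability, and the cached value is a placeholder string instead of real KV tensors.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # tokens per block; tiny on purpose


def block_hashes(tokens: List[int]) -> List[str]:
    """Hash each full block together with everything before it."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for i in range(0, full, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self) -> None:
        self.blocks: Dict[str, str] = {}  # hash -> placeholder for a cached KV block

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many prompt tokens were served from cache."""
        hits = 0
        for h in block_hashes(tokens):
            if h in self.blocks:
                hits += BLOCK_SIZE       # prefill can skip this block
            else:
                self.blocks[h] = "kv"    # prefill would compute and store it
        return hits


system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # shared by both requests
cache = PrefixCache()
first = cache.lookup_and_insert(system_prompt + [9, 10, 11, 12])
second = cache.lookup_and_insert(system_prompt + [20, 21, 22, 23])
print(f"first request reused {first} tokens, second reused {second}")
```

Because each hash includes the previous one, a block only matches when the entire prefix before it matches too.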

# turn on all four in vLLM
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="Qwen/Qwen3-8B-AWQ",
    quantization="awq",
    enable_prefix_caching=True,    # 7.4 - reuse shared prefixes
    enable_chunked_prefill=True,   # 7.3 - no more prefill piracy
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    # PagedAttention (7.1) and continuous batching (7.2) are always on
)

Three flag flips. ~5x over raw .generate(). These are table stakes.


8. quantization

The single biggest lever you have.

why it works

Weights default to FP16/BF16 - 16 bits per number. A 70B model = 70 billion x 2 bytes = 140 GB.

Quantization: store those numbers in fewer bits. INT4 = 4 bits = 35 GB for the same model. 4x less memory, 4x less bandwidth consumed during decode.

Decode is memory-bound. Read 4x less from HBM = up to 4x faster generation. That's it.

won't quality crash? No - up to a point. Weights have redundancy. 8-bit loses < 1%. 4-bit loses ~5%. Below 4-bit, the model breaks.
how is quality measured? Perplexity - how "surprised" the model is by held-out text. Lower = better. 5% increase is invisible to users.

how it works

The simplest version: symmetric (absmax) quantization. You have a tensor of FP16 weights. You want INT8.

  1. Find the largest absolute value in the tensor: $\alpha = \max(|W|)$
  2. Compute a scale factor: $s = \frac{\alpha}{2^{b-1} - 1}$ where b = target bits (8 for INT8, so you're mapping to [-127, 127])
  3. Quantize: $q = \text{round}(W / s)$
  4. At inference, dequantize: $\hat{W} = q \times s$

The rounding introduces error, but it's tiny per-weight and averages out across billions of parameters.

Asymmetric (zero-point) quantization handles distributions that aren't centered around zero. It adds a zero-point offset: $q = \text{round}(W / s + z)$, where z shifts the integer range to cover the actual value distribution. More accurate for skewed weights, slightly more compute at inference.
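Both schemes fit in a few lines of NumPy. A minimal per-tensor sketch following the formulas above; the function names are mine, the int8/uint8 storage types assume 8 bits or fewer, and the skewed test tensor is made up.

```python
import numpy as np


def quantize_symmetric(w: np.ndarray, bits: int = 8):
    """Absmax quantization: q = round(w / s), s = max|w| / (2^(b-1) - 1)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize_symmetric(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float64) * scale


def quantize_asymmetric(w: np.ndarray, bits: int = 8):
    """Zero-point quantization: q = round(w / s + z), mapped onto [0, 2^b - 1]."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    zero_point = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale + zero_point), 0, qmax).astype(np.uint8)
    return q, scale, zero_point


def dequantize_asymmetric(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    return (q.astype(np.float64) - zero_point) * scale


rng = np.random.default_rng(0)
skewed = rng.uniform(0.0, 4.0, size=10_000)  # all-positive, so not centered on zero

q_sym, s_sym = quantize_symmetric(skewed)
q_asym, s_asym, z = quantize_asymmetric(skewed)

sym_err = np.abs(skewed - dequantize_symmetric(q_sym, s_sym)).mean()
asym_err = np.abs(skewed - dequantize_asymmetric(q_asym, s_asym, z)).mean()
print(f"mean abs error: symmetric={sym_err:.5f}, asymmetric={asym_err:.5f}")
```

On this all-positive tensor the symmetric scheme never uses its negative integers, so the asymmetric scheme gets a finer step for the same bit width.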

the outlier problem

Symmetric quantization works well for vision models (ResNet quantizes to INT8 with zero quality loss). Transformers are harder.

The problem: ~0.1% of activation dimensions produce values 10-100x larger than the rest. When you compute the scale factor, those outliers stretch the integer range. Every non-outlier value gets squeezed into a tiny band and loses precision. One extreme value ruins thousands of normal values.

example: your activations range from [-1, 1] except one dimension hits 50. The scale maps [-50, 50] onto [-127, 127], so one integer step is ~0.39. Your [-1, 1] values now land on just 5-7 integer levels (about -3 to 3) instead of 255. Most of your information is gone.

This is why naive per-tensor INT8 fails for large language models and why the techniques below exist.
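You can count the damage directly. A sketch with synthetic activations (uniform in [-1, 1], an assumption for illustration) that compares how many INT8 levels get used when the scale is set by the normal values versus by a single outlier at 50:

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.uniform(-1.0, 1.0, size=4096)  # the "normal" values


def levels_used(x: np.ndarray, absmax: float, bits: int = 8) -> int:
    """Number of distinct integer levels x occupies under a per-tensor absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = absmax / qmax
    return len(np.unique(np.round(x / scale)))


without_outlier = levels_used(activations, absmax=1.0)  # scale set by the normal values
with_outlier = levels_used(activations, absmax=50.0)    # scale set by one outlier at 50

print(f"levels used without the outlier: {without_outlier}")
print(f"levels used with the outlier:    {with_outlier}")
```

Per-channel or per-group scales, and the techniques below, all exist to stop one dimension from setting the step size for everything else.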

PTQ vs QAT

Two paths to a quantized model:

Post-Training Quantization (PTQ): train in FP16, quantize after. Fast, cheap, no retraining. This is what you'll use 99% of the time. The catch: needs careful handling of outliers.

Quantization-Aware Training (QAT): train with fake-quantized weights so the model learns to tolerate rounding error. More accurate at extreme compression (2-3 bit), but requires full training runs. Rarely worth it unless you're pushing below 4-bit.

techniques you'll see on HuggingFace

GPTQ - quantizes weights one column at a time, adjusting remaining weights to compensate for each rounding error. Uses the Hessian ($H \approx X^T X$, where X is the layer's input activations) to figure out which weights are most sensitive. Gets 4-bit with minimal perplexity loss.

AWQ (Activation-Aware Weight Quantization) - instead of treating all weights equally, identifies which weights matter most by looking at activation magnitudes. Protects the ~1% of salient weights, quantizes the rest aggressively. Often slightly better than GPTQ at 4-bit.

SmoothQuant - solves the outlier problem for W8A8 (both weights AND activations in INT8). Migrates the quantization difficulty from activations to weights: divides activations by a per-channel scale, multiplies weights by the same scale. Activations become smooth (easy to quantize), weights absorb the variance (but weights are easier to quantize anyway).
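The SmoothQuant rescaling is a mathematical no-op on the layer output, which is what makes it safe. A NumPy sketch on random data. Assumptions: one linear layer, a synthetic outlier channel, and a migration strength `alpha = 0.5` (the value I recall the paper using as its common default; treat it as an assumption here).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))  # activations: (tokens, input channels)
X[:, 3] *= 50.0                   # one synthetic outlier channel
W = rng.standard_normal((8, 4))   # weights: (input channels, output channels)

alpha = 0.5  # assumed migration strength: how much difficulty moves to the weights

# Per-input-channel smoothing scale from activation and weight ranges.
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s           # activations get easier to quantize
W_smooth = W * s[:, None]  # matching weight rows absorb the scale

# The layer output is unchanged: (X / s) @ (s * W) == X @ W.
assert np.allclose(X @ W, X_smooth @ W_smooth)

print(f"max |activation| before: {np.abs(X).max():.1f}")
print(f"max |activation| after:  {np.abs(X_smooth).max():.1f}")
print(f"max |weight| before/after: {np.abs(W).max():.1f} / {np.abs(W_smooth).max():.1f}")
```

Since the scale is fixed per channel, it can be folded into the weights offline, so the smoothing costs nothing extra at inference time.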

which to pick: For 4-bit weight-only: AWQ or GPTQ (AWQ is the current default). For W8A8 (weights + activations): SmoothQuant or FP8. For zero-effort on Hopper GPUs: FP8.

number formats you'll meet

| Format | Bits | When to use |
|---|---|---|
| FP32 | 32 | Training default. Don't use for inference. |
| BF16 | 16 | Inference default. Baseline. |
| FP8 | 8 | Hopper+ GPUs. Near-lossless. Zero effort. |
| INT8 (W8A8) | 8 | Weights AND activations in 8-bit. Faster prefill too. |
| INT4 (AWQ/GPTQ) | 4 | Biggest decode speedup. ~5% quality loss. |
| INT2/INT3 | <=3 | Model breaks. Research only. |

# FP8 if you have a Hopper GPU
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quantization="fp8",            # on-the-fly, no separate model
    kv_cache_dtype="fp8",         # bonus: KV in FP8 too
)
first thing to try: Out of everything in this post, quantization gives the biggest single jump. Try it before anything else.

9. when one GPU isn't enough

Llama 70B at FP16 = 140 GB. H100 = 80 GB. Doesn't fit. Even at 4-bit (35 GB), production KV cache pushes past 80 GB. You need multiple GPUs.

tensor parallelism (TP) - the one you'll use

Split each weight matrix across GPUs. Each GPU holds 1/N of every weight. At inference time, each computes its slice, then they exchange partial results via AllReduce (every GPU contributes its partial, ends up with the sum).

[Figure: Tensor parallelism, TP=4. The full weight matrix W [M x N] is split column-wise across GPU 0-3, with an AllReduce on every layer. Needs NVLink between GPUs (inside one server box). Caption: TP splits weight matrices column-wise. An AllReduce synchronizes results on every layer.]
  • AllReduce happens on every layer. Lots of communication.
  • Needs NVLink (~900 GB/s, GPU-to-GPU within one box). InfiniBand across nodes is too slow (~50 GB/s).
  • Max TP=8 because that's how many H100s fit in one NVLink domain.
Rule of thumb: use the smallest TP that fits your model + KV cache. TP=2 is faster than TP=4 for the same model because there's half the AllReduce overhead.
# vLLM with tensor parallelism
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,    # split across 4 GPUs in the same node
    gpu_memory_utilization=0.90,
)
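What "each GPU computes its slice, then AllReduce" means numerically, simulated on CPU. Assumptions: a toy two-matrix MLP with ReLU, the first weight split by columns and the second by the matching rows (the Megatron-style pairing, which I'm assuming here rather than quoting from vLLM), and a plain Python `sum` standing in for the AllReduce.

```python
import numpy as np

rng = np.random.default_rng(0)
TP = 4
x = rng.standard_normal((2, 8))    # activations: (batch, hidden)
W1 = rng.standard_normal((8, 16))  # first MLP weight
W2 = rng.standard_normal((16, 8))  # second MLP weight

# Each simulated GPU holds 1/TP of W1's columns and the matching rows of W2.
W1_shards = np.split(W1, TP, axis=1)
W2_shards = np.split(W2, TP, axis=0)

# Local compute: ReLU is elementwise, so no communication is needed between the matmuls.
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# AllReduce: sum the partial outputs so every GPU ends up with the full result.
y_parallel = sum(partials)

y_reference = np.maximum(x @ W1, 0) @ W2  # single-GPU answer
assert np.allclose(y_parallel, y_reference)
print("per-GPU W1 shard shape:", W1_shards[0].shape)
```

Each partial output is full-size, so the AllReduce moves a whole activation tensor across every participating GPU, once per such block. That traffic is why the interconnect matters.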

the other two (safe to skip)

Pipeline parallelism (PP) - split by layers. Point-to-point communication, works across nodes on InfiniBand. Catch: pipeline bubbles. Only matters for 400B+ models.

Expert parallelism (EP) - for MoE models. Distributes experts across GPUs via All-to-All. Skip unless deploying MoE.

Production combo: TP=8 within a node + PP=N across nodes. For 70B or smaller: just TP=2 or TP=4.


10. serving the model

vLLM is configured. Now your app talks to it. This is where SWE work lives.

the one-line server

vLLM ships an OpenAI-compatible HTTP server:

$ vllm serve Qwen/Qwen3-8B-AWQ \
    --quantization awq \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --port 8000
 
# now you have an OpenAI-compatible endpoint at localhost:8000

"OpenAI-compatible" = any client library that talks to the OpenAI API talks to your vLLM server unchanged. Switch base_url and you're done. Drop-in replacement.

request lifecycle

[Figure: how a request flows through vLLM. Your app (openai SDK) -> HTTP POST -> queue -> scheduler -> GPU batch (continuous batching) -> SSE stream -> your app, tokens flow in. First token = TTFT; each later token = TPOT.]

streaming

Wait for the full response = user stares at a spinner for 5+ seconds. Stream tokens via SSE (Server-Sent Events) instead - server keeps the connection open, pushes tokens as they generate. Text appears word-by-word like ChatGPT.

The SDK handles this with stream=True:

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",    # vLLM doesn't check, but SDK requires non-empty
)
 
stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B-AWQ",
    messages=[{"role": "user", "content": "explain CAP theorem"}],
    stream=True,
    max_tokens=512,
)
 
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

tuning

Two knobs:

  • --max-num-seqs - max concurrent requests in batch. Higher = more throughput + more KV memory + longer TPOT. Start at 64.
  • --max-num-batched-tokens - total tokens per scheduler step. Caps prefill-decode packing.

scaling out

One vLLM process saturates one GPU (or TP group). More traffic = more replicas behind a load balancer. Use least-pending-requests routing, not round-robin - LLM requests have wildly varying durations.
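Least-pending-requests routing is a few lines of state. A sketch of the policy only: `LeastPendingRouter` and the replica names are hypothetical, and a real load balancer would also handle health checks, timeouts, and concurrency.

```python
from typing import Dict, List


class LeastPendingRouter:
    """Send each request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas: List[str]) -> None:
        self.pending: Dict[str, int] = {r: 0 for r in replicas}

    def acquire(self) -> str:
        replica = min(self.pending, key=self.pending.get)  # ties go to the first replica
        self.pending[replica] += 1
        return replica

    def release(self, replica: str) -> None:
        """Call when the request finishes (or fails)."""
        self.pending[replica] -= 1


router = LeastPendingRouter(["gpu-a", "gpu-b"])  # hypothetical replica names

first = router.acquire()   # long request, still running
second = router.acquire()  # short request
router.release(second)     # the short one finishes quickly
third = router.acquire()   # goes to the idle replica, not back to the busy one

print(first, second, third)
```

Round-robin would have sent the third request to the replica still busy with the long one.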

shipping checklist:
- /health endpoint for the LB
- /metrics (Prometheus format, built-in)
- generous request timeouts (long context = long requests)
- auth via API gateway in front, never on the model server

11. the deployment playbook

Latency bad. Throughput bad. Bill bad. Where to start? This order:

0. Profile first. Bottleneck might be tokenization, queuing, or network - not the model. torch.profiler or Nsight Systems before touching anything.

1. Switch to vLLM/SGLang. If you're on raw .generate(), this alone is 5-10x.

2. Quantize. AWQ 4-bit or FP8. Biggest single jump.

3. Prefix caching + chunked prefill. Free wins for shared system prompts.

4. Right-size GPU count. Smallest TP that fits. TP=2 beats TP=4.

5. KV cache quantization. kv_cache_dtype="fp8" - doubles effective batch size.

6. Speculative decoding. 2-3x for predictable outputs. Skip for creative generation.

7. Disaggregate prefill/decode. Different GPU pools per phase. Worth it at 100+ GPUs.

12. watching the real ceiling

The playbook tells you what to turn on. This tells you when you've hit the wall.

Section 5 showed the KV cache eats VRAM. Here's the consequence in production: compute is almost never the limit, the KV cache is. On an 80GB A100 serving a 7B model at 4K context, the FLOPS budget covers ~141 concurrent requests. The KV cache runs out at ~13. That order-of-magnitude gap is the real ceiling. (source)

arithmetic intensity, per phase: prefill runs at ~4500 FLOPs/byte (compute-saturated), decode at ~1 FLOP/byte. On an A100 the roofline crossover sits at ~156 FLOPs/byte, so decode leaves the compute cores nearly idle. Same prefill/decode split as section 3, now with the numbers.

preemption: the cliff

vLLM admits requests until the KV cache fills. Push past 100% and it evicts running requests to make room: their KV cache gets dropped and recomputed when they resume, and latency roughly doubles. The sweet spot sits just below 100% cache utilization. Full enough to keep the batch busy, not so full that a burst tips you over.

two numbers to watch (both in vLLM's /metrics):
- kv_cache_usage_pct - your headroom. Tune --max-num-seqs so this sits high but stable. (Exact metric names vary by vLLM version; check your /metrics output for the KV cache usage gauge.)
- num_preemptions - should hover near zero. A climbing count means you crossed the line and you're paying 2x latency for it.

One more, from the same teardown: across both phases the MLP block is ~71% of the compute. Hunting for a kernel to optimize? Start there, not attention.

further reading - three posts that go a level deeper than this one:
- LLM inference throughput - the roofline and KV-ceiling math above, worked out.
- inside vLLM - the scheduler, paged attention, and prefix caching read at source level.
- fast matmul - why the GPU sits idle one level down: coalesced loads, SMEM tiling, tensor cores, wave quantization.