Jul 23, 2025
Batched Inference, Demystified: Hit Your p95 While Gaining 3–5× Throughput
TL;DR: Batching is the easiest lever to improve tokens/sec—if you guard tail latency. Here are working defaults, trade-offs, and the observability to run it safely.
When batching pays off
Batching pays off most with homogeneous traffic (similar prompt sizes and max_tokens), sustained QPS, and cost pressure. For spiky or highly variable traffic, lean on priority lanes and early flush instead.
Dynamic vs. static batching
Static batching = fixed batch size; simple, but it either wastes capacity (batches launch underfilled) or adds latency (requests wait for the batch to fill).
Dynamic batching = collect requests over a short window (e.g., 5–20 ms), group them by shape/limits, then launch. Dynamic wins for public APIs, where arrival patterns are unpredictable.
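A minimal sketch of that loop, assuming an asyncio-based server; `batch_loop` and `run_model` are illustrative names, not any particular framework's API:

```python
import asyncio
import time

# Minimal dynamic batcher: wait for the first request, then keep collecting
# until the window expires or the batch is full, whichever comes first.
async def batch_loop(queue: asyncio.Queue, window_ms: float = 10.0, max_batch: int = 16):
    while True:
        batch = [await queue.get()]                      # block until one request arrives
        deadline = time.monotonic() + window_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                                    # window expired: flush early
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        await run_model(batch)                           # one forward pass for the whole batch

async def run_model(batch):
    # Placeholder for the real batched decode step.
    await asyncio.sleep(0.02)
    return [f"<completion for {req['prompt'][:20]}>" for req in batch]
```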
The five knobs that matter
Batch window (ms)
Max batch size
Similarity rule (same model, prompt-length bucket, max-tokens bucket)
Early flush conditions (priority lane, wait cap)
Timeout/SLOs (p95 budget)
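These five knobs map cleanly onto a single config object; a minimal sketch with illustrative field names and defaults:

```python
from dataclasses import dataclass

# The five knobs as one config object. Names and defaults are illustrative,
# not tied to any specific serving framework.
@dataclass
class BatchingConfig:
    window_ms: float = 10.0            # how long to wait for more requests before launching
    max_batch_size: int = 16           # hard cap on requests per forward pass
    similarity_key: tuple = ("model", "prompt_len_bucket", "max_tokens_bucket")
    early_flush_priority: bool = True  # realtime-lane requests bypass the window
    max_wait_ms: float = 25.0          # cap on how long any request sits in the queue
    p95_budget_ms: float = 400.0       # SLO used for admission control and alerting
```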
Ready-to-use presets
Low-latency
Balanced
High-throughput
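Expressed as code, presets like these are just named overrides of the config above; the numbers below are illustrative starting points, not benchmarked recommendations, so validate them against your own p95 budget before promoting a preset.

```python
# Illustrative preset values (tune against your own traffic and hardware).
PRESETS = {
    "low-latency":     dict(window_ms=5.0,  max_batch_size=8,  max_wait_ms=10.0),
    "balanced":        dict(window_ms=10.0, max_batch_size=16, max_wait_ms=25.0),
    "high-throughput": dict(window_ms=20.0, max_batch_size=32, max_wait_ms=50.0),
}
```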
Tip: bucket max_tokens/prompt_len in powers of two (256/512/1024) to match tile sizes.
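One way to implement that bucketing, assuming a 256-token floor and an illustrative cap:

```python
# Round a length up to the next power of two (floored at 256) so requests in the
# same bucket share padded shapes and kernel tile sizes.
def bucket(n: int, floor: int = 256, cap: int = 4096) -> int:
    b = floor
    while b < n and b < cap:
        b *= 2
    return b

assert bucket(100) == 256
assert bucket(300) == 512
assert bucket(1024) == 1024
```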
Admission control
Use per-workspace circuit breakers; send 429 with Retry-After under pressure; reserve a realtime lane for interactive UIs.
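A minimal sketch of that policy, with a per-workspace in-flight cap standing in for a full circuit breaker; thresholds and the (status, headers) return shape are illustrative, and you would wire this into your own HTTP layer:

```python
from collections import defaultdict

MAX_QUEUE_DEPTH = 256          # beyond this, shed batch-lane load
PER_WORKSPACE_INFLIGHT = 32    # crude per-workspace breaker
RETRY_AFTER_S = 2              # backoff hint for clients

inflight = defaultdict(int)

def admit(workspace: str, priority: str, queue_depth: int):
    if priority == "realtime":
        return 200, {}                                    # reserved lane for interactive UIs
    if queue_depth >= MAX_QUEUE_DEPTH or inflight[workspace] >= PER_WORKSPACE_INFLIGHT:
        return 429, {"Retry-After": str(RETRY_AFTER_S)}   # shed load with an explicit backoff hint
    inflight[workspace] += 1                              # caller must decrement on completion
    return 200, {}
```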
Observability that matters
Queue depth
Admitted vs. dropped (with reasons)
p50/p95/p99 split by traffic class
Tokens/sec
SM/Tensor Core occupancy
Batch size distribution
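A sketch of the instrumentation, using prometheus_client as one possible backend; metric names and bucket boundaries are illustrative:

```python
from prometheus_client import Counter, Gauge, Histogram

QUEUE_DEPTH = Gauge("batch_queue_depth", "Requests waiting to be batched")
ADMITTED = Counter("requests_admitted_total", "Admitted requests")
DROPPED = Counter("requests_dropped_total", "Dropped requests", ["reason"])
LATENCY = Histogram(
    "request_latency_seconds", "End-to-end latency", ["traffic_class"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),   # p50/p95/p99 come from quantiles over these
)
BATCH_SIZE = Histogram(
    "batch_size", "Requests per launched batch",
    buckets=(1, 2, 4, 8, 16, 32, 64),
)

# Usage at the relevant points in the serving loop:
# QUEUE_DEPTH.set(queue.qsize())
# DROPPED.labels(reason="queue_full").inc()
# LATENCY.labels(traffic_class="realtime").observe(elapsed_s)
# BATCH_SIZE.observe(len(batch))
```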
