Jul 23, 2025
Batched Inference, Demystified: Hit Your p95 While Gaining 3–5× Throughput
TL;DR: Batching is the easiest lever to improve tokens/sec—if you guard tail latency. Here are working defaults, trade-offs, and the observability to run it safely.
When batching pays off
Batching pays off most with homogeneous traffic (similar prompt sizes and max_tokens), sustained QPS, and cost pressure. For spiky or highly variable traffic, lean on priority lanes and early flush instead.
Dynamic vs. static batching
Static batching = fixed batch size; simple, but it either wastes capacity (batches launch underfilled) or adds latency (requests wait for the batch to fill).
Dynamic batching = collect requests over a short window (e.g., 5–20 ms), group them by shape/limits, then launch. Dynamic wins for public APIs, where arrival patterns are unpredictable.
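A minimal sketch of that loop, assuming an asyncio-based server; `batch_loop` and `run_model` are illustrative names, not any particular framework's API:

```python
import asyncio
import time

# Minimal dynamic batcher: wait for the first request, then keep collecting
# until the window expires or the batch is full, whichever comes first.
async def batch_loop(queue: asyncio.Queue, window_ms: float = 10.0, max_batch: int = 16):
    while True:
        batch = [await queue.get()]                      # block until one request arrives
        deadline = time.monotonic() + window_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                                    # window expired: flush early
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        await run_model(batch)                           # one forward pass for the whole batch

async def run_model(batch):
    # Placeholder for the real batched decode step.
    await asyncio.sleep(0.02)
    return [f"<completion for {req['prompt'][:20]}>" for req in batch]
```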
The five knobs that matter
Batch window (ms)
Max batch size
Similarity rule (same model, prompt-length bucket, max-tokens bucket)
Early flush conditions (priority lane, wait cap)
Timeout/SLOs (p95 budget)
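These five knobs map cleanly onto a single config object; a minimal sketch with illustrative field names and defaults:

```python
from dataclasses import dataclass

# The five knobs as one config object. Names and defaults are illustrative,
# not tied to any specific serving framework.
@dataclass
class BatchingConfig:
    window_ms: float = 10.0            # how long to wait for more requests before launching
    max_batch_size: int = 16           # hard cap on requests per forward pass
    similarity_key: tuple = ("model", "prompt_len_bucket", "max_tokens_bucket")
    early_flush_priority: bool = True  # realtime-lane requests bypass the window
    max_wait_ms: float = 25.0          # cap on how long any request sits in the queue
    p95_budget_ms: float = 400.0       # SLO used for admission control and alerting
```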
Ready-to-use presets
Low-latency
Balanced
High-throughput
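Expressed as code, presets like these are just named overrides of the config above; the numbers below are illustrative starting points, not benchmarked recommendations, so validate them against your own p95 budget before promoting a preset.

```python
# Illustrative preset values (tune against your own traffic and hardware).
PRESETS = {
    "low-latency":     dict(window_ms=5.0,  max_batch_size=8,  max_wait_ms=10.0),
    "balanced":        dict(window_ms=10.0, max_batch_size=16, max_wait_ms=25.0),
    "high-throughput": dict(window_ms=20.0, max_batch_size=32, max_wait_ms=50.0),
}
```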
Tip: bucket max_tokens/prompt_len in powers of two (256/512/1024) to match tile sizes.
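One way to implement that bucketing, assuming a 256-token floor and an illustrative cap:

```python
# Round a length up to the next power of two (floored at 256) so requests in the
# same bucket share padded shapes and kernel tile sizes.
def bucket(n: int, floor: int = 256, cap: int = 4096) -> int:
    b = floor
    while b < n and b < cap:
        b *= 2
    return b

assert bucket(100) == 256
assert bucket(300) == 512
assert bucket(1024) == 1024
```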
Admission control
Use per-workspace circuit breakers; send 429 with Retry-After under pressure; reserve a realtime lane for interactive UIs.
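A minimal sketch of that policy, with a per-workspace in-flight cap standing in for a full circuit breaker; thresholds and the (status, headers) return shape are illustrative, and you would wire this into your own HTTP layer:

```python
from collections import defaultdict

MAX_QUEUE_DEPTH = 256          # beyond this, shed batch-lane load
PER_WORKSPACE_INFLIGHT = 32    # crude per-workspace breaker
RETRY_AFTER_S = 2              # backoff hint for clients

inflight = defaultdict(int)

def admit(workspace: str, priority: str, queue_depth: int):
    if priority == "realtime":
        return 200, {}                                    # reserved lane for interactive UIs
    if queue_depth >= MAX_QUEUE_DEPTH or inflight[workspace] >= PER_WORKSPACE_INFLIGHT:
        return 429, {"Retry-After": str(RETRY_AFTER_S)}   # shed load with an explicit backoff hint
    inflight[workspace] += 1                              # caller must decrement on completion
    return 200, {}
```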
Observability that matters
Queue depth
Admitted vs. dropped (with reasons)
p50/p95/p99 split by traffic class
Tokens/sec
SM/Tensor Core occupancy
Batch size distribution
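A sketch of the instrumentation, using prometheus_client as one possible backend; metric names and bucket boundaries are illustrative:

```python
from prometheus_client import Counter, Gauge, Histogram

QUEUE_DEPTH = Gauge("batch_queue_depth", "Requests waiting to be batched")
ADMITTED = Counter("requests_admitted_total", "Admitted requests")
DROPPED = Counter("requests_dropped_total", "Dropped requests", ["reason"])
LATENCY = Histogram(
    "request_latency_seconds", "End-to-end latency", ["traffic_class"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),   # p50/p95/p99 come from quantiles over these
)
BATCH_SIZE = Histogram(
    "batch_size", "Requests per launched batch",
    buckets=(1, 2, 4, 8, 16, 32, 64),
)

# Usage at the relevant points in the serving loop:
# QUEUE_DEPTH.set(queue.qsize())
# DROPPED.labels(reason="queue_full").inc()
# LATENCY.labels(traffic_class="realtime").observe(elapsed_s)
# BATCH_SIZE.observe(len(batch))
```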
