
Aug 28, 2025


Inside the Optimization Compiler: Agentic Tuning, Memory Planning, and Hardware-Aware Codegen

TL;DR: Inceptron’s compiler turns generic model graphs into device-tuned binaries. We combine graph-level rewrites, hardware-aware codegen, agentic auto-tuning, and memory layout planning. The result: lower p95 latency, higher tokens/sec, and predictable costs—without hand-written kernels.


Why generic runtimes leave performance on the table

Most inference stacks do three things well: load weights, launch kernels, and scale horizontally. The gap is what they launch and how they schedule memory. Vendor libraries cover common ops, but real workloads mix shapes, custom layers, and attention variants that don’t map cleanly. A compiler reshapes the graph, fuses ops, and chooses device-specific implementations.
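
To make that concrete, here is a toy fusion pass over a flat node list. It is a sketch, not our actual IR: a real compiler checks shapes, dtypes, and consumer counts before fusing, but the pattern-match-and-replace shape is the same.

# Toy fusion pass: collapse matmul -> add_bias -> relu chains into one fused
# node. Illustrative names; assumes each node in the chain has one consumer.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    inputs: list = field(default_factory=list)

def fuse_matmul_bias_act(nodes):
    fused, i = [], 0
    while i < len(nodes):
        ops = [n.op for n in nodes[i:i + 3]]
        if ops == ["matmul", "add_bias", "relu"]:
            # Replace the three-node chain with a single fused kernel node.
            fused.append(Node("fused_matmul_bias_relu", nodes[i].inputs))
            i += 3
        else:
            fused.append(nodes[i])
            i += 1
    return fused

graph = [Node("matmul", ["x", "w"]), Node("add_bias", ["b"]), Node("relu")]
print([n.op for n in fuse_matmul_bias_act(graph)])  # ['fused_matmul_bias_relu']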

The multi-level pipeline

  1. Graph IR (SSA) — normalize shapes/dtypes; run constant folding, CSE, DCE (a toy sketch follows this list).

  2. Fusion & scheduling — pattern-match producer→consumer chains (e.g., matmul→bias→activation), tile for locality, and choose cooperative thread blocks.

  3. Hardware-aware codegen — lower to device intrinsics (tensor cores, async copy) and pick algorithms per GPU family.

  4. Agentic auto-tuning — Bayesian + rule-based search over tile sizes, warps, unroll factors; cache winners by {op, shape, dtype, device, driver}.

  5. Memory planning — pack weights, promote to shared memory, plan static reuse, keep access coalesced.

  6. Validation & fallbacks — numerical parity checks, canary deploys, instant rollback on regression.
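
Here is the toy sketch promised in step 1: constant folding plus dead-code elimination over a tiny SSA-style IR. The tuple-based instructions are a hypothetical simplification, not the real IR.

# Fold constant 'add's into known values, then sweep away instructions whose
# results are never used. Instructions are (dest, op, args) tuples.
def const_fold(instrs):
    env, out = {}, []
    for dst, op, args in instrs:
        vals = [env.get(a, a) for a in args]    # substitute known constants
        if op == "add" and all(isinstance(v, int) for v in vals):
            env[dst] = sum(vals)                # folded: no instruction emitted
        else:
            out.append((dst, op, vals))
    return out

def dce(instrs, live_roots):
    live, kept = set(live_roots), []
    for dst, op, args in reversed(instrs):      # walk backwards from the roots
        if dst in live:
            kept.append((dst, op, args))
            live.update(a for a in args if isinstance(a, str))
    return list(reversed(kept))

ir = [("t0", "add", [2, 3]), ("t1", "mul", ["x", "t0"]), ("t2", "mul", ["x", "x"])]
print(dce(const_fold(ir), {"t1"}))  # [('t1', 'mul', ['x', 5])]; t2 was dead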

What agentic tuning actually does

  • Seeds candidates from architecture heuristics.

  • Early-stops weak configs; focuses search on promising regions.

  • Persists results so future compiles reuse known winners.

Outcome: 10–30% latency cuts on hot paths without app changes, and faster compiles over time.
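
A minimal sketch of the caching side, assuming a caller-supplied measure callback (hypothetical; the real tuner layers Bayesian search on top). The point is the cache key: {op, shape, dtype, device, driver}.

import json, pathlib

CACHE = pathlib.Path("tuning_cache.json")

def cache_key(op, shape, dtype, device, driver):
    return f"{op}|{'x'.join(map(str, shape))}|{dtype}|{device}|{driver}"

def tune(op, shape, dtype, device, driver, candidates, measure):
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    key = cache_key(op, shape, dtype, device, driver)
    if key in cache:                        # known winner: skip the search
        return cache[key]
    best, best_ms = None, float("inf")
    for cfg in candidates:                  # heuristically seeded candidates
        ms = measure(cfg, n_iters=3)        # cheap probe first
        if ms > 1.5 * best_ms:
            continue                        # early-stop clearly weak configs
        ms = measure(cfg, n_iters=20)       # full measurement for survivors
        if ms < best_ms:
            best, best_ms = cfg, ms
    cache[key] = best
    CACHE.write_text(json.dumps(cache))     # persist for future compiles
    return best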

Memory planning: the quiet unlock

AOT compilation enables layout-level wins: tiling/padding for vectorization, promotion to shared memory, static buffer reuse, and coalesced access. These stabilize p95 during bursty load.
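
As an illustration of static buffer reuse, here is a greedy planner that shares a buffer between tensors whose live intervals never overlap. A production planner also weighs sizes, alignment, and shared-memory promotion; this shows only the interval logic.

# live maps tensor -> (first_use, last_use) in program order.
def plan_buffers(live):
    buffers, assignment = [], {}            # buffers[i] = last_use of buffer i
    for t, (start, end) in sorted(live.items(), key=lambda kv: kv[1][0]):
        for i, last_end in enumerate(buffers):
            if last_end < start:            # buffer is free again: reuse it
                buffers[i] = end
                assignment[t] = i
                break
        else:                               # nothing free: allocate a new one
            buffers.append(end)
            assignment[t] = len(buffers) - 1
    return assignment

live = {"a": (0, 2), "b": (1, 3), "c": (3, 5), "d": (4, 6)}
print(plan_buffers(live))  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}: four tensors, two buffers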

Hardware awareness without lock-in

Keep high-level intent in IR; isolate vendor specifics in backends; gate advanced features behind capability flags; ship portable fallbacks.
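
In code, capability gating can be as plain as the sketch below; the capability and kernel names are illustrative.

# Pick the most advanced implementation the device actually reports, and keep
# a portable SIMT kernel as the universal fallback.
def select_matmul_impl(device_caps):
    if "tensor_cores" in device_caps and "async_copy" in device_caps:
        return "matmul_tc_async"    # tensor cores + async global->shared copies
    if "tensor_cores" in device_caps:
        return "matmul_tc"          # tensor cores, synchronous copies
    return "matmul_simt"            # portable fallback, runs anywhere

print(select_matmul_impl({"tensor_cores", "async_copy"}))  # matmul_tc_async
print(select_matmul_impl(set()))                           # matmul_simt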

Reproducible benchmarking

Warm the runtime; fix seeds/prompts; measure p50/p95 and tokens/sec across single-shot/bursty/steady profiles; enforce numerical parity thresholds; report deltas with confidence intervals.
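
A minimal harness along those lines, with run_inference standing in for your serving stack:

import random, statistics, time

def bench(run_inference, prompts, warmup=10, iters=100, seed=0):
    random.seed(seed)                   # fixed seed: same prompt order every run
    for _ in range(warmup):             # warm caches, JIT, and autotuner state
        run_inference(random.choice(prompts))
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference(random.choice(prompts))
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {"p50_ms": statistics.median(samples),
            "p95_ms": samples[int(0.95 * len(samples)) - 1]}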

Jul 23, 2025


Batched Inference, Demystified: Hit Your p95 While Gaining 3–5× Throughput

TL;DR: Batching is the easiest lever to improve tokens/sec—if you guard tail latency. Here are working defaults, trade-offs, and the observability to run it safely.

When batching pays off

Best for homogeneous traffic (similar max tokens/prompt sizes), higher QPS, and cost pressure. For spiky or variable traffic, use priority lanes and early flush.

Dynamic vs. static batching

Static = fixed size; simple but wastes capacity or adds latency.
Dynamic = short window (e.g., 5–20 ms), group by shape/limits, then launch. Dynamic wins for public APIs.
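
Here is one window of a dynamic batcher in sketch form; pop_request and launch are hypothetical stand-ins for your queue and executor.

import time
from collections import defaultdict

def run_window(pop_request, launch, window_ms=12, max_batch=16, max_depth=16):
    groups = defaultdict(list)              # similarity rule: group by shape key
    deadline = time.monotonic() + window_ms / 1e3
    while time.monotonic() < deadline:
        req = pop_request(timeout=max(0.0, deadline - time.monotonic()))
        if req is None:
            continue
        key = (req["model_id"], req["max_tokens_bucket"])
        groups[key].append(req)
        if len(groups[key]) >= max_batch:   # full batch: launch immediately
            launch(groups.pop(key))
        if sum(map(len, groups.values())) > max_depth:
            break                           # early flush under queue pressure
    for batch in groups.values():           # window closed: flush the remainder
        launch(batch)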

The five knobs that matter

  1. Batch window (ms)

  2. Max batch size

  3. Similarity rule (model, prompt len, max tokens buckets)

  4. Early flush conditions (priority lane, wait cap)

  5. Timeout/SLOs (p95 budget)

Ready-to-use presets

Low-latency

batch_window_ms: 6
max_batch_size: 8
group_by: ["model_id","max_tokens_bucket"]
early_flush: [{if_priority: "realtime"}, {if_queue_depth_gt: 16}]
p95_budget_ms: 250

Balanced

batch_window_ms: 12
max_batch_size: 16
group_by: ["model_id","max_tokens_bucket","prompt_len_bucket"]
early_flush: [{if_priority: "realtime"}, {if_wait_ms_gt: 10}]
p95_budget_ms: 350

High-throughput

batch_window_ms: 20
max_batch_size: 32
group_by: ["model_id","prompt_len_bucket"]
early_flush: [{if_wait_ms_gt: 18}]
p95_budget_ms: 450

Tip: bucket max_tokens/prompt_len in powers of two (256/512/1024) to match tile sizes.
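
A tiny helper makes the bucketing rule concrete:

# Round up to the next power-of-two bucket between floor and ceiling, so 300-
# and 500-token requests land in the same 512 bucket and batch together.
def bucket(n, floor=256, ceiling=4096):
    b = floor
    while b < min(n, ceiling):
        b *= 2
    return b

print(bucket(300), bucket(500), bucket(1024))  # 512 512 1024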

Admission control

Use per-workspace circuit breakers; send 429 with Retry-After under pressure; reserve a realtime lane for interactive UIs.
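
One way to sketch that breaker is a per-workspace token bucket that returns 429 with Retry-After once it empties; the shape below is illustrative, not our actual middleware.

import time

class AdmissionController:
    def __init__(self, rate_per_s=50, burst=100):
        self.rate, self.burst = rate_per_s, burst
        self.buckets = {}                   # workspace -> (tokens, last_refill)

    def admit(self, workspace):
        tokens, last = self.buckets.get(workspace, (self.burst, time.monotonic()))
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:                      # over budget: shed load politely
            retry_after = (1 - tokens) / self.rate
            return 429, {"Retry-After": f"{retry_after:.2f}"}
        self.buckets[workspace] = (tokens - 1, now)
        return 200, {}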

Observability that matters

Queue depth, admitted vs. dropped (with reasons), p50/p95/p99 split by class, tokens/sec, SM/TensorCore occupancy, batch size distribution.

Jun 12, 2025


GDPR-Ready AI Inference: Data Residency, Retention, and Auditability in Multi-Cloud

TL;DR: GDPR for inference APIs boils down to region pinning, minimal retention, and provable access controls. This turns legalese into engineering checklists.

What GDPR means in practice

You’re typically a processor of user data; you need mechanisms to keep data in the chosen region and erase it within defined windows; you must maintain records of processing (who accessed what, when).

Region & residency

  • Region selection per workspace/environment; region lock to prevent cross-region failover.

  • Keep weights, caches, logs in region-scoped storage.

  • DR plan that keeps copies within the same legal area.

Data minimization & retention

  • Redact payloads at ingress when possible (see the sketch after this list).

  • Default short retention (7–30 days) with per-workspace overrides.

  • Verified deletes across primary + backups; API to erase by workspace/request ID.

  • Avoid logging full prompts/responses by default.
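
Here is the redaction sketch referenced above; the patterns are illustrative, not an exhaustive PII list.

import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text):
    # Replace each match with a labeled placeholder before queueing or logging.
    for label, pat in PATTERNS.items():
        text = pat.sub(f"[{label.upper()}_REDACTED]", text)
    return text

print(redact("Contact jane@example.com, card 4111 1111 1111 1111"))
# Contact [EMAIL_REDACTED], card [CARD_REDACTED]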

Access control that proves itself

  • SSO (SAML/OIDC) with MFA; RBAC scoped by org/workspace; least-privilege keys.

  • Audit trails for console actions, API calls, and exports, including actor, scope, and before/after state.

DPAs & sub-processors

Publish a list; allow EU-only telemetry or disable it; provide audit-log export for evidence.

Example EU-only setup

Create EU workspace → enable Region Lock → set Retention = 14 days → SSO + RBAC → export audit logs weekly to EU SIEM.
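
One way to make that checklist enforceable is to encode it as config your deploy pipeline asserts on; the field names below are illustrative, not our actual schema.

EU_WORKSPACE = {
    "region": "eu-central",
    "region_lock": True,            # no cross-region failover
    "retention_days": 14,
    "sso": {"protocol": "oidc", "mfa_required": True},
    "audit_export": {"cadence": "weekly", "sink": "eu-siem"},
}

def validate_eu_only(cfg):
    assert cfg["region"].startswith("eu"), "workspace must live in an EU region"
    assert cfg["region_lock"], "region lock must be on"
    assert cfg["retention_days"] <= 30, "keep retention short by default"
    assert cfg["sso"]["mfa_required"], "MFA must be enforced"
    return True

print(validate_eu_only(EU_WORKSPACE))  # True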

Shared responsibility

You own data you send, redaction choices, and tenant keys. We provide regional isolation, encryption in transit/at rest, access controls, auditability, and deletion APIs.