Inceptron compiler, now open for early access. Auto-compile models for maximum efficiency. Join early access →

Blog

Posts

Jun 24, 2026

Helping Kovant bring the first foundation model for industrial operations into production

Inceptron is partnering with Kovant on the infrastructure layer behind a new AI model built specifically for industrial work.

Kovant is building what it intends to be the world’s first foundation model purpose-built for industrial operations.

Their focus is not general AI. It is the work that keeps industrial companies running: procurement, supplier onboarding, supply chain workflows, and the back-office processes behind manufacturing networks.

These are hard environments for AI.

The data is messy. The workflows are long. The rules are strict. The cost of being wrong is real.

A model working in this setting has to understand invoices, supplier portals, ERP records, certificates, purchase orders, technical specifications, and operational constraints. It also has to run across days or weeks, not just inside a single chat session.

That puts pressure on the full stack.

Industrial AI does not stop at the model. It needs infrastructure that can run reliably, securely, and at production cost.

That is where Inceptron comes in.

Our role: making the model run in production

Inceptron is working with Kovant on the compiler and inference layer.

Our role is to help optimize how the model runs on real hardware, and to support deployment options for customers that need European infrastructure and data control.

This includes the parts of production inference that matter once a model leaves the lab:

  • routing

  • batching

  • caching

  • latency

  • throughput

  • hardware-level optimization

  • dedicated and sovereign deployment options

For industrial AI, inference cost is not a side issue.

A production system can run tens of thousands of decisions per day. Each decision may depend on documents, rules, context, and tool calls. If the inference layer is not optimized, the economics break quickly.

Kovant is building a domain-specific model for that reality. Inceptron is helping make it efficient enough to run at that volume.

Why infrastructure matters for industrial operations

Kovant’s model is one part of a larger system.

The full platform also includes context, operational knowledge, tools, orchestration, and long-running process state. That matters because industrial work does not happen in one prompt.

A supplier onboarding flow can take days. A procurement cycle can run across systems, documents, approvals, and exceptions. The model has to fit into that process, not the other way around.

The infrastructure layer has to support that.

It has to keep latency under control. It has to make cost predictable. It has to run close to the customer’s data. It has to support dedicated environments when required.

The next step for enterprise AI is not just larger models. It is models built for specific work, running on infrastructure designed for production.

This is the part Inceptron is focused on.

A European stack for industrial AI

Kovant is building the operational model and platform.

Inceptron is providing the optimized inference and deployment infrastructure behind it.

Together, the goal is a European stack for industrial AI: a domain-specific model for industrial operations, production-grade inference, and deployment options for customers that care about data control.

For some customers, that means running through optimized Inceptron infrastructure. For others, it may mean sovereign or dedicated deployments where operational data stays inside the required region or environment.

The first version is planned for September 2026.

We are looking forward to working with Kovant as this moves from model development into production deployment.

Jun 22, 2026

Inceptron Partners with Kilo to Bring EU-Hosted Inference to Engineering Teams

We’re excited to share that Inceptron is now partnered with Kilo to bring high-performance, EU-hosted inference to teams building with agentic engineering.

Kilo users can now access Inceptron-hosted open-weight models directly through Kilo. This gives teams a way to use strong models for coding and agent workflows while keeping inference workloads on European infrastructure.

For many companies, AI adoption is no longer blocked by model quality. It is blocked by where data goes, who processes it, and whether the setup can pass internal security review.

That is the problem this partnership is built around.

EU-hosted models inside Kilo

Through the partnership, Kilo users can access Inceptron-hosted models such as:

  • Kimi K2.7 from MoonshotAI

  • GLM 5.2 from Z.ai

  • MiniMax M2.5 from MiniMax

These models are used for coding, agent workflows, and production inference where teams need performance, cost control, and data residency.

Inceptron hosts the models on infrastructure built for AI workloads in the EU. Kilo makes them available where developers already work.

Built for teams that need control

A lot of engineering teams want to move faster with AI, but cannot send source code, prompts, or customer data through infrastructure they cannot govern.

That matters most for companies with EU data residency requirements, GDPR obligations, or strict internal security processes.

With Inceptron and Kilo, teams can route inference through European infrastructure and use open-weight models without adding unnecessary data exposure.

Two ways to use Inceptron in Kilo

Teams can use Inceptron through the Kilo Gateway or through BYOK.

With Kilo Gateway, teams can access Inceptron-hosted models directly from Kilo. The gateway handles routing and makes it easier to switch between models.

With BYOK, teams that already work with Inceptron can add their Inceptron API key in Kilo and route requests through their existing setup.

Both options are designed to make model access simple without changing how engineers work.

Available across the Kilo workflow

Once enabled in Kilo, Inceptron-hosted models can be used across the Kilo ecosystem, including:

  • Kilo CLI

  • Cloud agents

  • VS Code and JetBrains extensions

That means teams can use EU-hosted inference from the terminal, inside their IDE, or as part of automated agent workflows.

Why we partnered with Kilo

We use Kilo ourselves for agentic engineering.

That matters to us. We do not want to offer infrastructure for tools we would not use in our own workflows.

Kilo gives developers a strong interface for agentic engineering. Inceptron gives teams the infrastructure layer for fast, compliant inference in Europe.

Together, we can support teams that want to ship with AI while keeping control of their data and deployment setup.

If your team wants to run open-weight models through EU-hosted infrastructure, you can now access Inceptron directly through Kilo.

Aug 28, 2025

Inside the Optimization Compiler: Agentic Tuning, Memory Planning, and Hardware-Aware Codegen

TL;DR: Inceptron’s compiler turns generic model graphs into device-tuned binaries. We combine graph-level rewrites, hardware-aware codegen, agentic auto-tuning, and memory layout planning. The result: lower p95 latency, higher tokens/sec, and predictable costs—without hand-written kernels.


Why generic runtimes leave performance on the table

Most inference stacks do three things well: load weights, launch kernels, and scale horizontally. The gap is what they launch and how they schedule memory. Vendor libraries cover common ops, but real workloads mix shapes, custom layers, and attention variants that don’t map cleanly. A compiler reshapes the graph, fuses ops, and chooses device-specific implementations.

The multi-level pipeline

  1. Graph IR (SSA) — normalize shapes/dtypes; run constant folding, CSE, DCE.

  2. Fusion & scheduling — pattern-match producer→consumer chains (e.g., matmul→bias→activation), tile for locality, and choose cooperative thread blocks.

  3. Hardware-aware codegen — lower to device intrinsics (tensor cores, async copy) and pick algorithms per GPU family.

  4. Agentic auto-tuning — Bayesian + rule-based search over tile sizes, warps, unroll factors; cache winners by {op, shape, dtype, device, driver}.

  5. Memory planning — pack weights, promote to shared memory, plan static reuse, keep access coalesced.

  6. Validation & fallbacks — numerical parity checks, canary deploys, instant rollback on regression.
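Step 2 can be sketched on a toy example. The code below fuses a matmul→bias→activation chain in a flat op list. The `Op` class, the op names, and `fuse_chains` are hypothetical and only illustrate the idea; they are not Inceptron's IR. A real pass would work on an SSA graph rather than adjacent list entries.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str      # e.g. "matmul", "bias", "relu"
    inputs: tuple  # names of the values this op reads
    output: str    # name of the value this op produces

PATTERN = ("matmul", "bias", "relu")

def fuse_chains(ops):
    """Replace adjacent matmul -> bias -> relu chains with one fused op."""
    # Count how many ops read each value; an intermediate can only be
    # fused away if nothing else reads it.
    uses = {}
    for op in ops:
        for value in op.inputs:
            uses[value] = uses.get(value, 0) + 1

    fused, i = [], 0
    while i < len(ops):
        window = ops[i:i + 3]
        is_chain = (
            len(window) == 3
            and tuple(o.name for o in window) == PATTERN
            and window[0].output in window[1].inputs
            and window[1].output in window[2].inputs
            and uses.get(window[0].output, 0) == 1
            and uses.get(window[1].output, 0) == 1
        )
        if is_chain:
            internal = {window[0].output, window[1].output}
            # Keep only inputs that come from outside the chain.
            external = tuple(v for o in window for v in o.inputs
                             if v not in internal)
            fused.append(Op("fused_matmul_bias_relu", external,
                            window[2].output))
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused
```

The fused op reads the original inputs once and writes the final output once, so the two intermediate tensors never need to be written out and read back.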

What agentic tuning actually does

  • Seeds candidates from architecture heuristics.

  • Early-stops weak configs; focuses search on promising regions.

  • Persists results so future compiles reuse known winners.

Outcome: 10–30% latency cuts on hot paths without app changes, and faster compiles over time.
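A minimal sketch of that loop, under simplifying assumptions: candidates arrive already seeded, a config is abandoned as soon as its best sample is far behind the current winner, and winners are cached by {op, shape, dtype, device, driver}. The names `cache_key`, `tune`, `run_once`, and the `cutoff` threshold are hypothetical. The Bayesian and rule-based search itself is not shown.

```python
import json

def cache_key(op, shape, dtype, device, driver):
    """Stable string key over the fields the cache is indexed by."""
    return json.dumps([op, list(shape), dtype, device, driver])

def tune(key, candidates, run_once, cache, reps=5, cutoff=1.5):
    """Return the fastest config for `key`, reusing a cached winner if any.

    run_once(cfg) -> latency in ms for one timed run (caller-supplied).
    """
    if key in cache:
        return cache[key]

    best_cfg, best_ms = None, float("inf")
    for cfg in candidates:
        samples = []
        for _ in range(reps):
            samples.append(run_once(cfg))
            # Early stop: even this config's best run is far behind.
            if min(samples) > cutoff * best_ms:
                break
        else:
            # All reps completed, so the config is a real contender.
            mean_ms = sum(samples) / len(samples)
            if mean_ms < best_ms:
                best_cfg, best_ms = cfg, mean_ms

    cache[key] = best_cfg
    return best_cfg
```

Persisting `cache` to disk between compiles is what makes later compiles faster: a known {op, shape, dtype, device, driver} combination skips measurement entirely.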

Memory planning: the quiet unlock

AOT compilation enables layout-level wins: tiling/padding for vectorization, promotion to shared memory, static buffer reuse, and coalesced access. These stabilize p95 during bursty load.
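Static buffer reuse is the easiest of these to show. Because tensor lifetimes are known ahead of time, two tensors that are never live at the same step can share one buffer. The sketch below is a simple greedy planner over hypothetical `(name, size_bytes, first_use, last_use)` tuples. It is an illustration, not the planner Inceptron ships.

```python
def plan_buffers(tensors):
    """Assign tensors to reusable buffers based on liveness intervals.

    tensors: list of (name, size_bytes, first_use, last_use).
    Returns (assignment, sizes): name -> buffer index, and the size each
    buffer must have to hold its largest tenant.
    """
    assignment, sizes, busy_until = {}, [], []
    for name, size, start, end in sorted(tensors, key=lambda t: t[2]):
        for i, last in enumerate(busy_until):
            if last < start:  # previous tenant is dead before we start
                sizes[i] = max(sizes[i], size)
                busy_until[i] = end
                assignment[name] = i
                break
        else:
            # No free buffer: allocate a new one.
            sizes.append(size)
            busy_until.append(end)
            assignment[name] = len(sizes) - 1
    return assignment, sizes

tensors = [("a", 100, 0, 1), ("b", 200, 1, 2), ("c", 50, 2, 3)]
assignment, sizes = plan_buffers(tensors)
# "c" reuses the buffer of "a", so the plan needs 300 bytes, not 350.
```

Since the plan is fixed at compile time, there is no allocator activity on the request path, which is one reason tail latency stays steadier under bursts.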

Hardware awareness without lock-in

Keep high-level intent in IR; isolate vendor specifics in backends; gate advanced features behind capability flags; ship portable fallbacks.
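One way to picture capability gating: each op lists its implementations from most to least specialized, each declares the capabilities it requires, and the portable version requires none. The registry layout, the kernel names, and the capability strings below are all hypothetical.

```python
# Implementations are ordered from most specialized to most portable.
REGISTRY = {
    "attention": [
        {"name": "attn_tensor_core_async",
         "requires": frozenset({"tensor_cores", "async_copy"})},
        {"name": "attn_tensor_core",
         "requires": frozenset({"tensor_cores"})},
        {"name": "attn_portable",
         "requires": frozenset()},
    ],
}

def select_kernel(op, device_caps, registry=REGISTRY):
    """Pick the first implementation whose requirements the device meets."""
    for impl in registry[op]:
        if impl["requires"] <= device_caps:  # subset check
            return impl["name"]
    raise LookupError(f"no implementation of {op!r} for this device")
```

Because the last entry requires nothing, every device gets a working kernel, and a new GPU family only needs a new capability flag and backend entry rather than changes to the IR.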

Reproducible benchmarking

Warm the runtime; fix seeds/prompts; measure p50/p95 and tokens/sec across single-shot/bursty/steady profiles; enforce numerical parity thresholds; report deltas with confidence intervals.
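A small sketch of the reporting side, assuming latencies have already been collected after warm-up. It uses a nearest-rank percentile and a percentile bootstrap for the confidence interval on the p95 delta. Both are common choices, not necessarily the ones used in Inceptron's own benchmarks, and the function names are hypothetical.

```python
import math
import random

def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def bootstrap_delta_ci(baseline, candidate, q=95, rounds=1000, seed=0):
    """95% bootstrap CI for percentile(candidate) - percentile(baseline)."""
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    deltas = []
    for _ in range(rounds):
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        deltas.append(percentile(c, q) - percentile(b, q))
    return percentile(deltas, 2.5), percentile(deltas, 97.5)
```

If the interval for the p95 delta includes zero, the measured improvement may be noise, and the run is worth repeating with more samples before reporting it.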

Jul 23, 2025

Batched Inference, Demystified: Hit Your p95 While Gaining 3–5× Throughput

TL;DR: Batching is the easiest lever to improve tokens/sec—if you guard tail latency. Here are working defaults, trade-offs, and the observability to run it safely.

When batching pays off

Best for homogeneous traffic (similar max tokens/prompt sizes), higher QPS, and cost pressure. For spiky or variable traffic, use priority lanes and early flush.

Dynamic vs. static batching

Static = fixed size; simple but wastes capacity or adds latency.
Dynamic = short window (e.g., 5–20 ms), group by shape/limits, then launch. Dynamic wins for public APIs.
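To make the dynamic case concrete, here is a small offline simulation. It is a sketch, not a production scheduler: a real batcher would be driven by timers and a queue. It groups requests by a key, flushes a group when its window expires, and flushes early when the batch is full. `form_batches` and the tuple layout are hypothetical.

```python
def form_batches(requests, window_ms=10, max_batch_size=8):
    """Group requests into batches.

    requests: list of (arrival_ms, group_key, request_id), sorted by arrival.
    Returns a list of (flush_ms, group_key, [request_ids]).
    """
    open_batches = {}  # group_key -> (opened_ms, [request_ids])
    flushed = []
    for arrival, key, rid in requests:
        # Flush every group whose window expired before this arrival.
        expired = [k for k, (opened, _) in open_batches.items()
                   if arrival - opened >= window_ms]
        for k in expired:
            opened, ids = open_batches.pop(k)
            flushed.append((opened + window_ms, k, ids))

        # Join the open batch for this group, or open a new one.
        _, ids = open_batches.setdefault(key, (arrival, []))
        ids.append(rid)
        if len(ids) >= max_batch_size:  # full: launch immediately
            open_batches.pop(key)
            flushed.append((arrival, key, ids))

    # Drain whatever is still open when the input ends.
    for k, (opened, ids) in open_batches.items():
        flushed.append((opened + window_ms, k, ids))
    return flushed
```

The window bounds how long any request waits for company, and the size cap bounds how much work one launch carries. Those are the first two knobs in the list that follows.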

The five knobs that matter

  1. Batch window (ms)

  2. Max batch size

  3. Similarity rule (model, prompt len, max tokens buckets)

  4. Early flush conditions (priority lane, wait cap)

  5. Timeout/SLOs (p95 budget)

Ready-to-use presets

Low-latency

batch_window_ms: 6
max_batch_size: 8
group_by: ["model_id","max_tokens_bucket"]
early_flush: [{if_priority: "realtime"}, {if_queue_depth_gt: 16}]
p95_budget_ms: 250

Balanced

batch_window_ms: 12
max_batch_size: 16
group_by: ["model_id","max_tokens_bucket","prompt_len_bucket"]
early_flush: [{if_priority: "realtime"}, {if_wait_ms_gt: 10}]
p95_budget_ms: 350

High-throughput

batch_window_ms: 20
max_batch_size: 32
group_by: ["model_id","prompt_len_bucket"]
early_flush: [{if_wait_ms_gt: 18}]
p95_budget_ms: 450

Tip: bucket max_tokens/prompt_len in powers of two (256/512/1024) to match tile sizes.
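A minimal sketch of that bucketing, assuming values above the largest bucket are clamped into it. That clamp is a choice made here for illustration, and the function name is hypothetical.

```python
def bucket(n, smallest=256, largest=1024):
    """Round n up to the next power-of-two bucket within [smallest, largest]."""
    b = smallest
    while b < n and b < largest:
        b *= 2
    return b

# A grouping key built from buckets instead of raw values, so requests
# with similar limits land in the same batch.
def group_key(model_id, max_tokens, prompt_len):
    return (model_id, bucket(max_tokens), bucket(prompt_len))
```

Bucketing trades a little padding for far fewer distinct shapes, which keeps batches fuller.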

Admission control

Use per-workspace circuit breakers; send 429 with Retry-After under pressure; reserve a realtime lane for interactive UIs.
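The admission decision can be sketched as a pure function. The assumptions here: a fixed number of queue slots is held back for the realtime lane, and `Retry-After` is estimated from how fast the queue drains. The function name, parameters, and estimate are illustrative, not Inceptron's implementation.

```python
import math

def admit(queue_depth, max_depth, priority, realtime_reserved,
          drain_rate_per_s):
    """Return (http_status, headers) for an incoming request.

    Non-realtime traffic may only fill the queue up to
    max_depth - realtime_reserved; the rest is held for realtime.
    """
    limit = max_depth if priority == "realtime" \
        else max_depth - realtime_reserved
    if queue_depth < limit:
        return 200, {}

    # Estimate how long until there is room again, in whole seconds.
    excess = queue_depth - limit + 1
    retry_after = max(1, math.ceil(excess / drain_rate_per_s))
    return 429, {"Retry-After": str(retry_after)}
```

Running one such check per workspace keeps a single noisy tenant from pushing everyone else past their p95 budget.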

Observability that matters

Queue depth, admitted vs. dropped (with reasons), p50/p95/p99 split by class, tokens/sec, SM/TensorCore occupancy, batch size distribution.

Jun 12, 2025

GDPR-Ready AI Inference: Data Residency, Retention, and Auditability in Multi-Cloud

TL;DR: GDPR for inference APIs boils down to region pinning, minimal retention, and provable access controls. This post turns the legalese into engineering checklists.

What GDPR means in practice

You’re typically a processor of user data. That means you need mechanisms to keep data in the chosen region and erase it within defined windows, and you must maintain records of processing (who accessed what, when).

Region & residency

  • Region selection per workspace/environment; region lock to prevent cross-region failover.

  • Keep weights, caches, logs in region-scoped storage.

  • DR plan that keeps copies within the same legal area.
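The two rules above reduce to two small checks. This sketch assumes a hypothetical mapping from region to legal area. The region names and function names are made up for illustration.

```python
# Hypothetical region -> legal area mapping, for illustration only.
LEGAL_AREA = {"eu-west": "EU", "eu-central": "EU", "us-east": "US"}

def failover_allowed(home, target, region_lock):
    """With region lock on, traffic may only stay in the home region."""
    if region_lock:
        return target == home
    return True

def dr_copy_allowed(home, target):
    """Disaster-recovery copies must stay in the home region's legal area."""
    return LEGAL_AREA[home] == LEGAL_AREA[target]
```

Keeping the checks this explicit makes them easy to audit: a locked EU workspace can never fail over to another region, and its backups can move between EU regions but not outside them.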

Data minimization & retention

  • Redact payloads at ingress when possible.

  • Default short retention (7–30 days) with per-workspace overrides.

  • Verified deletes across primary + backups; API to erase by workspace/request ID.

  • Avoid logging full prompts/responses by default.
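A retention sweep can be as simple as the sketch below, which picks out records older than their workspace's window. The record fields, the 14-day default, and the function name are assumptions for illustration. Verified deletion across backups is a separate step and is not shown.

```python
from datetime import datetime, timedelta, timezone

def expired_request_ids(records, now, default_days=14, overrides=None):
    """Return IDs of records older than their workspace's retention window.

    records: iterable of dicts with "workspace", "request_id", "created_at".
    overrides: optional {workspace: retention_days}.
    """
    overrides = overrides or {}
    due = []
    for rec in records:
        days = overrides.get(rec["workspace"], default_days)
        if now - rec["created_at"] > timedelta(days=days):
            due.append(rec["request_id"])
    return due
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes the sweep easy to test and to replay as evidence.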

Access control that proves itself

  • SSO (SAML/OIDC) with MFA; RBAC scoped by org/workspace; least-privilege keys.

  • Audit trails for console actions, API calls, and exports, including actor, scope, and before/after.
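An audit entry with those fields might look like the following. The field names and the JSON-lines format are assumptions, not a documented Inceptron schema.

```python
import json
from datetime import datetime, timezone

def audit_event(actor, action, scope, before, after, at=None):
    """Serialize one audit entry as a JSON line."""
    at = at or datetime.now(timezone.utc)
    return json.dumps({
        "at": at.isoformat(),   # when
        "actor": actor,         # who
        "action": action,       # what
        "scope": scope,         # on which org/workspace
        "before": before,       # state prior to the change
        "after": after,         # state after the change
    }, sort_keys=True)

line = audit_event(
    actor="alice@example.com",
    action="retention.update",
    scope="workspace/eu-prod",
    before={"retention_days": 30},
    after={"retention_days": 14},
)
```

Recording both `before` and `after` is what lets an auditor reconstruct exactly what changed without access to the live system.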

DPAs & sub-processors

Publish a list; allow EU-only telemetry or disable it; provide audit-log export for evidence.

Example EU-only setup

Create EU workspace → enable Region Lock → set Retention = 14 days → SSO + RBAC → export audit logs weekly to EU SIEM.

Shared responsibility

You own data you send, redaction choices, and tenant keys. We provide regional isolation, encryption in transit/at rest, access controls, auditability, and deletion APIs.