Inceptron compiler, now open for early access. Auto-compile models for maximum efficiency. Join early access →


Models

Llama-3.3-70B-Instruct (Meta)
Mode: Inceptron Optimized
Input tokens (per 1M): $0.10
Output tokens (per 1M): $0.30
Tokens per sec: 100
Quantization: fp8
Size: 70B
Context: 128K
Tags: chat, trivia, marketing, reasoning
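At the listed rates, a request's cost works out to (input_tokens ÷ 1,000,000) × $0.10 plus (output_tokens ÷ 1,000,000) × $0.30. A quick sketch of the arithmetic:

# Cost estimate at the Llama-3.3-70B-Instruct rates listed above.
def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1_000_000 * 0.10 + output_tokens / 1_000_000 * 0.30

# Example: a 2,000-token prompt with a 500-token completion.
print(request_cost_usd(2_000, 500))  # 0.00035, i.e. about $0.0004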

Kimi-K2-Instruct (Moonshot AI)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 1T
Context: 131K
Tags: JSON mode, reasoning, math

gpt-oss-120b (OpenAI)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 120B
Context: 131K
Tags: JSON mode, code, math, reasoning

Enterprise-grade inference

Deploy and scale models like Llama, Qwen, Kimi, and DeepSeek with guaranteed uptime, zero-retention data flow, and usage-based pricing; no GPU wrangling required.

gpt-oss-20b (OpenAI)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 20B
Context: 131K
Tags: JSON mode, code, math, reasoning

DeepSeek-V3-0324 (DeepSeek)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 685B
Context: 128K
Tags: JSON mode, MoE, code, math, reasoning

DeepSeek-V3.1 (DeepSeek)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 685B
Context: 128K
Tags: JSON mode, MoE, code, math, reasoning

DeepSeek-R1-0528 (DeepSeek)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 685B
Context: 164K
Tags: JSON mode, MoE, code, reasoning

Qwen3-Coder-30B-A3B-Instruct (Qwen)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 30B
Context: 262K
Tags: JSON mode, code, math

Qwen/Qwen3-235B-A22B-Instruct-2507 (Qwen)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 235B
Context: 262K
Tags: JSON mode, code, math, reasoning

Qwen2.5-VL-72B-Instruct (Qwen)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 72B
Context: 128K
Tags: JSON mode, code, math

GLM-4.6 (Z.ai)
Mode: Inceptron Optimized
Input tokens (per 1M): TBA
Output tokens (per 1M): TBA
Tokens per sec: TBA
Quantization: fp8
Size: 357B
Context: 262K
Tags: JSON mode, code, math, reasoning
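Several of the cards above list JSON mode. On OpenAI-compatible chat endpoints this is typically enabled via the response_format parameter; the sketch below assumes Inceptron follows that convention (the parameter and the model id here are illustrative, not confirmed by this page).

import json
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTRON_API_KEY"],
    base_url="https://api.inceptron.io/v1",  # assumed from the curl example below
)

# response_format is the OpenAI-style JSON-mode switch; Inceptron support
# for it is an assumption, not something this page confirms.
resp = client.chat.completions.create(
    model="gpt-oss-120b",  # illustrative model id only
    messages=[{"role": "user", "content": "Return three fun facts about New York as a JSON object."}],
    response_format={"type": "json_object"},
)
print(json.loads(resp.choices[0].message.content))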

Run any model on the fastest endpoints

Use our API to deploy any open-source model on the fastest inference stack available, with optimal cost efficiency.

Scale into a dedicated deployment at any time, with a custom number of instances, for optimal throughput.

Curl

curl -X POST "https://api.inceptron.io/v1/chat/completions" \
  -H "Authorization: Bearer $INCEPTRON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-Vision-Free",
    "messages": [{"role": "user", "content": "What are some fun things to do in New York?"}]
  }'
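
Python

The Python tab's contents are not shown above; this is a minimal equivalent sketch using the openai client, assuming the endpoint is OpenAI-compatible (inferred from the /v1/chat/completions path in the curl example).

import os
from openai import OpenAI

# The base URL and OpenAI-compatibility are assumptions inferred from the curl example.
client = OpenAI(
    api_key=os.environ["INCEPTRON_API_KEY"],
    base_url="https://api.inceptron.io/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-Vision-Free",
    messages=[{"role": "user", "content": "What are some fun things to do in New York?"}],
)
print(response.choices[0].message.content)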
