GENERATIVE · MEDIA · INFERENCE

AI video. Image. Audio.

One API. Instantly.

The unified inference API for generative video,
image, and audio. Sub-second cold starts, 100+ models.

OPERATIONAL · 99.97% UPTIME
v2.4 · RELEASED APR 2026
[Topology map · v2.4 / unified · 8 of 100+ active nodes: video (kling-1.5 ~6s, seedance-1.0 ~8s, runway-gen3 ~12s) · image (flux-schnell ~1s, sdxl-turbo ~2s) · audio / voice (whisper-v3 ~0.3s, elevenlabs ~1s, bark-tts ~0.8s) → unified API · online · v2.4]
01 / TRY IT NOW
INTERACTIVE · LIVE INFERENCE
[Interactive demo · flux-schnell · model selector (12 available) · prompt (142 / 500) · 1024 × 1024 · 4 / 4 steps · ~1.2s avg · $0.003 / image]
Generate in the app →
[Live output preview · FLUX-SCHNELL · 1.24s]
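The same call the demo makes can be issued through the SDK. A minimal sketch, reusing the client and prompt from the examples in section 05; the "steps" parameter name is an assumption, since only the step count (4) appears in the panel:

import infer

client = infer.Client()

# Equivalent of the live demo above: flux-schnell at 1024 × 1024.
# "steps" is an assumed parameter name; only the count (4) is shown in the panel.
result = client.run("flux-schnell", {
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024,
    "steps":  4,
})
print(result.url)  # ~1.2s on average, $0.003 per image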
10B+
INFERENCES SERVED
2,400+
TEAMS BUILDING
127ms
P99 COLD START
$0.0008/img
STARTING AT

LAST 30 DAYS
UPDATED 5 MIN AGO

TRUSTED BY
teams shipping fast
NVIDIA · Amazon · Microsoft · Google · OpenAI · Anthropic
+2,400 TEAMS
02 / EXPLORE MODELS

100+ models. One interface.

View all models →
seedance-1.0

Video generation from text or image

~8s · VIDEO

kling-1.5

High-quality video synthesis

~6s · VIDEO

whisper-v3

Speech-to-text transcription

~0.3s · AUDIO

elevenlabs

Voice synthesis and cloning

~1s · AUDIO

Infer cut our inference costs by 60% while improving latency. The unified API means we ship features 3× faster.

Sarah Chen · CTO · Runway Labs
ALSO SUPPORTS: runway-gen3 / luma-dream / stable-audio / bark-tts / +80 more
03 / WHY INFER

Built for speed.
Designed for scale.

Production-grade inference that gets out of your way. Models always warm, capacity auto-scaling, edge network routing to the nearest replica — so you can focus on the product, not the plumbing.
50ms
Sub-second cold starts

Models are always warm. No waiting for containers to spin up.

10B+
Serverless, by default

From 1 to 10 million calls. No provisioning, no config — just hit the API.

99.9%
Enterprise reliability

SOC2 compliant. Multi-region redundancy. Real-time monitoring.

100+
Models, one interface

Flux, SDXL, Runway, Whisper, ElevenLabs. Same API shape.

$0.00 min
Pay only for what you use

No commitments. Transparent per-inference pricing.

6 regions
Global edge network

Inference runs closest to your users. US, EU, APAC.

04 / INFER RUNTIME

A custom inference engine. Built for throughput.

Always-warm replicas, co-located weights, and a streaming protocol that returns tokens as they're decoded — not after the full response lands. No cold-starts, no queueing, no babysitting.
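In SDK terms, the streaming path could look like the sketch below. This is a rough Python sketch, not the documented surface: it assumes a stream() helper that yields chunks as they are decoded, which is an assumption layered on the client.run() call shown in section 05.

import infer

client = infer.Client()

# Hypothetical streaming call: chunks are handled as they are decoded,
# instead of waiting for the final byte of the response.
for chunk in client.stream("flux-schnell", {
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024,
}):
    # Each chunk could be a progress event or a partial result;
    # the exact chunk shape isn't documented on this page.
    print(chunk)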

PERF PANEL · FLUX-SCHNELL · 1024×1024 · STEPS=4
WINDOW · ROLLING LAST 24 HOURS · ALL REGIONS
MEASURED · END-TO-END (API CALL → FINAL BYTE)

LIVE · LAST 24H
LATENCY DISTRIBUTION
P50: 1.24s · P95: 1.78s · P99: 2.31s
THROUGHPUT · REQ / SEC (24H AGO → NOW)
0ms
COLD START
Replicas kept warm across regions
118ms
TIME TO FIRST TOKEN
Streaming from the first chunk
AUTO-BATCHING
Dynamic batching, no config
6 regions
EDGE ROUTING
Nearest warm replica, always
05 / DEVELOPER EXPERIENCE

Three lines to production.
// no setup required

Type-safe SDKs. OpenAPI spec. Streaming. Webhooks. Everything you need to ship fast — and nothing you don't.

< 50ms
COLD START
4 SDKs
FIRST-PARTY
OpenAPI
SPEC
example.py · STREAMING
import infer

# Initialize with your API key
client = infer.Client()

# Generate with any of 100+ models
result = client.run("flux-schnell", {
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024
})

# That's it. No infra, no queues.
print(result.url)
import Infer from "@infer/sdk";

// Initialize with your API key
const client = new Infer();

// Generate with any of 100+ models
const result = await client.run("flux-schnell", {
    prompt: "A futuristic city at sunset",
    width:  1024,
    height: 1024,
});

// Type-safe. Streaming-ready.
console.log(result.url);
package main

import (
    "fmt"

    "github.com/infer/sdk-go"
)

func main() {
    client := infer.NewClient()

    // Run any of 100+ models
    result, _ := client.Run("flux-schnell", infer.Params{
        Prompt: "A futuristic city at sunset",
        Width:  1024,
        Height: 1024,
    })
    fmt.Println(result.URL)
}
# Works with any HTTP client
curl https://api.infer.sh/v1/run \
  -H "Authorization: Bearer $INFER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model":  "flux-schnell",
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024
  }'

# Response streams back over HTTP/2
# X-Infer-Latency: 1243ms
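Webhooks (mentioned above) cover the cases where holding an HTTP connection open isn't practical, long-running video jobs especially. The receiver below is a rough Flask sketch; the callback path and the id / status / url fields are placeholders, not the documented payload schema.

from flask import Flask, request

app = Flask(__name__)

# Hypothetical webhook receiver. Field names (id / status / url) are
# placeholders; consult the webhook docs for the real payload schema.
@app.route("/infer/webhook", methods=["POST"])
def infer_webhook():
    event = request.get_json()
    if event.get("status") == "succeeded":
        print(f"job {event.get('id')} finished: {event.get('url')}")
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)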
06 / ENTERPRISE

When self-serve
isn't enough.

Reserved capacity, private endpoints, compliance packages — available as an add-on for teams operating at serious scale. Not included in Pro; talk to us and we'll scope what you need.

TYPICAL ENTERPRISE ENGAGEMENT
  • 01 · Scope & compliance review · ~1 wk
  • 02 · Reserved capacity provisioning · ~3 days
  • 03 · Integration & private endpoints · ~1 wk
  • 04 · Go-live with named engineer · day 1
AVAILABLE ON ENTERPRISE
SOC 2 Type II
Audited annually. Report under NDA.
ISO 27001
Certified ISMS & audit packages.
HIPAA · GDPR
BAA + EU data residency on request.
Private endpoints
Dedicated IPs, VPC peering, no shared tenancy.
SSO · SAML · SCIM
Okta, Azure AD, Google — auto-provisioning.
Custom SLA
Named engineer, incident credits, audit logs.
07 / CASE STUDY
MOONVALLEY × INFER
MAREY · MOONVALLEY
SHOT 041 · TAKE 12

How Moonvalley powers the world's top film studios.

Marey is Moonvalley's foundational video model — delivering director-grade cinematic video from text, image, and pose input. It powers work at some of the biggest names in Hollywood.

To hit feature-film SLAs at production scale, Moonvalley's platform runs on Infer — tapping reserved inference capacity, private model hosting, and sub-200ms edge routing across four regions.

3
MAJOR STUDIOS
SHIPPING ON MAREY
180k+
SHOTS GENERATED
LAST QUARTER
4.2×
FASTER ITERATION
VS PREVIOUS STACK
“Marey has to run at the quality bar of a film set — and on the timelines of one. Infer lets us push Hollywood-grade video through the API without thinking about the infrastructure underneath.”
Naeem Talukdar · CEO · Moonvalley
moonvalley.com / marey
08 / FAQ

Things teams ask
before signing.

The security-review and procurement questions, handled up front.

Still have questions? Book a 30-minute call →
Or email team@infer.sh — we usually reply within an hour.
Q.01 · How is Infer different from Replicate, Fal, or Runware?
Three things: speed (always-warm replicas + streaming protocol), predictability (P99 latency SLAs, not just P50 averages), and pricing (pure per-call, no minimums, no idle charges). Same API shape as the others, so the migration path is one file.
Q.02 · Do you train on my prompts or outputs?
No. Zero data retention by default. We don't log prompts, we don't store outputs past the response, and we don't use any inputs for training. Enterprise gets a BAA and private VPC on top.
Q.03 · Can I run my own fine-tuned model?
Yes. Upload a LoRA weight file or full checkpoint through the API, and it's live behind a versioned endpoint in under 90 seconds. Pay only for the calls it serves.
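As a sketch of that flow in Python: the upload method and its arguments below are illustrative assumptions, not the documented API shape.

import infer

client = infer.Client()

# Hypothetical upload call; method name and parameters are illustrative only.
model = client.upload_model(
    name="my-flux-lora",
    weights="./lora/my-style.safetensors",  # LoRA weights or a full checkpoint
    base="flux-schnell",
)

# Once live, the fine-tune is invoked like any built-in model.
result = client.run(model.name, {"prompt": "A portrait in the trained style"})
print(result.url)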
Q.04 · What happens at 10M+ requests / month?
You get pulled into our volume tier automatically — up to 60% off list pricing — plus a named support engineer and quarterly architecture reviews. No contract gymnastics.
Q.05 · Where does inference run, and can I pin a region?
Six regions: us-east, us-west, eu-west, eu-central, ap-south, ap-northeast. Pin via a single header; we route to the nearest warm replica. EU-only residency available on enterprise.
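Over raw HTTP, pinning might look like the sketch below. The endpoint and auth shape follow the curl example in section 05, but the X-Infer-Region header name is an assumption; this page only says the pin is a single header.

import os
import requests

# Region-pinned request. "X-Infer-Region" is an assumed header name;
# the endpoint and bearer auth mirror the curl example above.
resp = requests.post(
    "https://api.infer.sh/v1/run",
    headers={
        "Authorization": f"Bearer {os.environ['INFER_KEY']}",
        "X-Infer-Region": "eu-west",
    },
    json={
        "model": "flux-schnell",
        "prompt": "A futuristic city at sunset",
        "width": 1024,
        "height": 1024,
    },
)
print(resp.json())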
Q.06 · What's your uptime story?
99.97% measured over the last 12 months across all regions. Live status at status.infer.sh, with per-region P50/P99 latency and incident history all the way back.
FINAL CALL

Ready to ship?

Start building for free. No credit card required.
Scale when you're ready.