GENERATIVE · MEDIA · INFERENCE

AI video. Image. Audio.

One API. Instantly.

The unified inference API for generative video,
image, and audio. Sub-second cold starts, 100+ models.

OPERATIONAL · 99.97% UPTIME
v2.4 · RELEASED APR 2026
[Topology map · v2.4 / unified · 8 of 100+ active nodes: video (kling-1.5 ~6s, seedance-1.0 ~8s, runway-gen3 ~12s) · image (flux-schnell ~1s, sdxl-turbo ~2s) · audio / voice (whisper-v3 ~0.3s, elevenlabs ~1s, bark-tts ~0.8s) → unified API · online · v2.4]
01 / TRY IT NOW
INTERACTIVE · LIVE INFERENCE
[Interactive demo · flux-schnell · model selector (12 available) · prompt (142 / 500) · 1024 × 1024 · 4 / 4 steps · ~1.2s avg · $0.003 / image]
Generate in the app →
[Live output preview · FLUX-SCHNELL · 1.24s]
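The same call the demo makes can be issued through the SDK. A minimal sketch, reusing the client and prompt from the examples in section 05; the "steps" parameter name is an assumption, since only the step count (4) appears in the panel:

import infer

client = infer.Client()

# Equivalent of the live demo above: flux-schnell at 1024 × 1024.
# "steps" is an assumed parameter name; only the count (4) is shown in the panel.
result = client.run("flux-schnell", {
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024,
    "steps":  4,
})
print(result.url)  # ~1.2s on average, $0.003 per image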
10B+
INFERENCES SERVED
2,400+
TEAMS BUILDING
127ms
P99 COLD START
$0.0008/img
STARTING AT

LAST 30 DAYS
UPDATED 5 MIN AGO

TRUSTED BY
teams shipping fast
NVIDIA · Amazon · Microsoft · Google · OpenAI · Anthropic
+2,400 TEAMS
02 / EXPLORE MODELS

100+ models. One interface.

View all models →
seedance-1.0

Video generation from text or image

~8s · VIDEO

kling-1.5

High-quality video synthesis

~6s · VIDEO

whisper-v3

Speech-to-text transcription

~0.3s · AUDIO

elevenlabs

Voice synthesis and cloning

~1s · AUDIO

Infer cut our inference costs by 60% while improving latency. The unified API means we ship features 3× faster.

Sarah Chen · CTO · Runway Labs
ALSO SUPPORTS: runway-gen3 / luma-dream / stable-audio / bark-tts / +80 more
03 / WHY INFER

Built for speed.
Designed for scale.

Production-grade inference that gets out of your way. Models always warm, capacity auto-scaling, edge network routing to the nearest replica — so you can focus on the product, not the plumbing.
50ms
Sub-second cold starts

Models are always warm. No waiting for containers to spin up.

10B+
Serverless, by default

From 1 to 10 million calls. No provisioning, no config — just hit the API.

99.9%
Enterprise reliability

SOC2 compliant. Multi-region redundancy. Real-time monitoring.

100+
Models, one interface

Flux, SDXL, Runway, Whisper, ElevenLabs. Same API shape.

$0.00 min
Pay only for what you use

No commitments. Transparent per-inference pricing.

6 regions
Global edge network

Inference runs closest to your users. US, EU, APAC.

04 / INFER RUNTIME

A custom inference engine. Built for throughput.

Always-warm replicas, co-located weights, and a streaming protocol that returns tokens as they're decoded — not after the full response lands. No cold-starts, no queueing, no babysitting.
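In SDK terms, the streaming path could look like the sketch below. This is a rough Python sketch, not the documented surface: it assumes a stream() helper that yields chunks as they are decoded, which is an assumption layered on the client.run() call shown in section 05.

import infer

client = infer.Client()

# Hypothetical streaming call: chunks are handled as they are decoded,
# instead of waiting for the final byte of the response.
for chunk in client.stream("flux-schnell", {
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024,
}):
    # Each chunk could be a progress event or a partial result;
    # the exact chunk shape isn't documented on this page.
    print(chunk)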

PERF PANEL · FLUX-SCHNELL · 1024×1024 · STEPS=4
WINDOW · ROLLING LAST 24 HOURS · ALL REGIONS
MEASURED · END-TO-END (API CALL → FINAL BYTE)

LIVE · LAST 24H
LATENCY DISTRIBUTION
P50: 1.24s · P95: 1.78s · P99: 2.31s
THROUGHPUT · REQ / SEC (24H AGO → NOW)
0ms
COLD START
Replicas kept warm across regions
118ms
TIME TO FIRST TOKEN
Streaming from the first chunk
AUTO-BATCHING
Dynamic batching, no config
6 regions
EDGE ROUTING
Nearest warm replica, always
05 / DEVELOPER EXPERIENCE

Three lines to production.
// no setup required

Type-safe SDKs. OpenAPI spec. Streaming. Webhooks. Everything you need to ship fast — and nothing you don't.

< 50ms
COLD START
4 SDKs
FIRST-PARTY
OpenAPI
SPEC
example.py · STREAMING
import infer

# Initialize with your API key
client = infer.Client()

# Generate with any of 100+ models
result = client.run("flux-schnell", {
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024
})

# That's it. No infra, no queues.
print(result.url)
import Infer from "@infer/sdk";

// Initialize with your API key
const client = new Infer();

// Generate with any of 100+ models
const result = await client.run("flux-schnell", {
    prompt: "A futuristic city at sunset",
    width:  1024,
    height: 1024,
});

// Type-safe. Streaming-ready.
console.log(result.url);
package main

import (
    "fmt"

    "github.com/infer/sdk-go"
)

func main() {
    client := infer.NewClient()

    // Run any of 100+ models
    result, _ := client.Run("flux-schnell", infer.Params{
        Prompt: "A futuristic city at sunset",
        Width:  1024,
        Height: 1024,
    })
    fmt.Println(result.URL)
}
# Works with any HTTP client
curl https://api.infer.sh/v1/run \
  -H "Authorization: Bearer $INFER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model":  "flux-schnell",
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024
  }'

# Response streams back over HTTP/2
# X-Infer-Latency: 1243ms
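Webhooks (mentioned above) cover the cases where holding an HTTP connection open isn't practical, long-running video jobs especially. The receiver below is a rough Flask sketch; the callback path and the id / status / url fields are placeholders, not the documented payload schema.

from flask import Flask, request

app = Flask(__name__)

# Hypothetical webhook receiver. Field names (id / status / url) are
# placeholders; consult the webhook docs for the real payload schema.
@app.route("/infer/webhook", methods=["POST"])
def infer_webhook():
    event = request.get_json()
    if event.get("status") == "succeeded":
        print(f"job {event.get('id')} finished: {event.get('url')}")
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)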
06 / ENTERPRISE

When self-serve
isn't enough.

Reserved capacity, private endpoints, compliance packages — available as an add-on for teams operating at serious scale. Not included in Pro; talk to us and we'll scope what you need.

TYPICAL ENTERPRISE ENGAGEMENT
  • 01 · Scope & compliance review · ~1 wk
  • 02 · Reserved capacity provisioning · ~3 days
  • 03 · Integration & private endpoints · ~1 wk
  • 04 · Go-live with named engineer · day 1
AVAILABLE ON ENTERPRISE
SOC 2 Type II
Audited annually. Report under NDA.
ISO 27001
Certified ISMS & audit packages.
HIPAA · GDPR
BAA + EU data residency on request.
Private endpoints
Dedicated IPs, VPC peering, no shared tenancy.
SSO · SAML · SCIM
Okta, Azure AD, Google — auto-provisioning.
Custom SLA
Named engineer, incident credits, audit logs.
07 / CASE STUDY
MOONVALLEY × INFER
MAREY · MOONVALLEY
SHOT 041 · TAKE 12

How Moonvalley powers the world's top film studios.

Marey is Moonvalley's foundational video model — delivering director-grade cinematic video from text, image, and pose input. It powers work at some of the biggest names in Hollywood.

To hit feature-film SLAs at production scale, Moonvalley's platform runs on Infer — tapping reserved inference capacity, private model hosting, and sub-200ms edge routing across four regions.

3
MAJOR STUDIOS
SHIPPING ON MAREY
180k+
SHOTS GENERATED
LAST QUARTER
4.2×
FASTER ITERATION
VS PREVIOUS STACK
“Marey has to run at the quality bar of a film set — and on the timelines of one. Infer lets us push Hollywood-grade video through the API without thinking about the infrastructure underneath.”
Naeem Talukdar · CEO · Moonvalley
moonvalley.com / marey
08 / FAQ

Things teams ask
before signing.

The security-review and procurement questions, handled up front.

Still have questions? Book a 30-minute call →
Or email team@infer.sh — we usually reply within an hour.
Q.01 · How is Infer different from Replicate, Fal, or Runware?
Three things: speed (always-warm replicas + streaming protocol), predictability (P99 latency SLAs, not just P50 averages), and pricing (pure per-call, no minimums, no idle charges). Same API shape as the others, so the migration path is one file.
Q.02 · Do you train on my prompts or outputs?
No. Zero data retention by default. We don't log prompts, we don't store outputs past the response, and we don't use any inputs for training. Enterprise gets a BAA and private VPC on top.
Q.03 · Can I run my own fine-tuned model?
Yes. Upload a LoRA weight file or full checkpoint through the API, and it's live behind a versioned endpoint in under 90 seconds. Pay only for the calls it serves.
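As a sketch of that flow in Python: the upload method and its arguments below are illustrative assumptions, not the documented API shape.

import infer

client = infer.Client()

# Hypothetical upload call; method name and parameters are illustrative only.
model = client.upload_model(
    name="my-flux-lora",
    weights="./lora/my-style.safetensors",  # LoRA weights or a full checkpoint
    base="flux-schnell",
)

# Once live, the fine-tune is invoked like any built-in model.
result = client.run(model.name, {"prompt": "A portrait in the trained style"})
print(result.url)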
Q.04 · What happens at 10M+ requests / month?
You get pulled into our volume tier automatically — up to 60% off list pricing — plus a named support engineer and quarterly architecture reviews. No contract gymnastics.
Q.05 · Where does inference run, and can I pin a region?
Six regions: us-east, us-west, eu-west, eu-central, ap-south, ap-northeast. Pin via a single header; we route to the nearest warm replica. EU-only residency available on enterprise.
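Over raw HTTP, pinning might look like the sketch below. The endpoint and auth shape follow the curl example in section 05, but the X-Infer-Region header name is an assumption; this page only says the pin is a single header.

import os
import requests

# Region-pinned request. "X-Infer-Region" is an assumed header name;
# the endpoint and bearer auth mirror the curl example above.
resp = requests.post(
    "https://api.infer.sh/v1/run",
    headers={
        "Authorization": f"Bearer {os.environ['INFER_KEY']}",
        "X-Infer-Region": "eu-west",
    },
    json={
        "model": "flux-schnell",
        "prompt": "A futuristic city at sunset",
        "width": 1024,
        "height": 1024,
    },
)
print(resp.json())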
Q.06 · What's your uptime story?
99.97% measured over the last 12 months across all regions. Live status at status.infer.sh, with per-region P50/P99 latency and incident history all the way back.
FINAL CALL

Ready to ship?

Start building for free. No credit card required.
Scale when you're ready.