OpenAI Compatible

Meet AI Gateway

Route requests to OpenAI, Anthropic, Google, Ollama, and more through a single, OpenAI-compatible endpoint with intelligent failover, load balancing, and semantic caching.

Get Started

# Use any OpenAI-compatible client
curl https://ai-gateway.koder.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

# Switch providers instantly
curl https://ai-gateway.koder.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-20250514",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Built for Production

Everything you need to run AI workloads reliably at scale.

🔀
Intelligent Routing

Automatically detects the right provider from model names. Use explicit prefixes like openai/gpt-4o or let the gateway figure it out.
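As a rough sketch of the routing rule described above, the function below resolves a provider from a model string. The prefix table and the name `resolve_provider` are assumptions made for illustration; the gateway's actual detection rules may differ.

```python
# Hypothetical mapping from bare model-name prefixes to providers.
# The gateway's real detection table is not documented here.
INFERRED_PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
}


def resolve_provider(model: str) -> tuple[str, str]:
    """Return (provider, model_name) for a requested model string."""
    # An explicit "provider/model" prefix always wins.
    if "/" in model:
        provider, name = model.split("/", 1)
        return provider, name
    # Otherwise guess the provider from the start of the model name.
    for prefix, provider in INFERRED_PREFIXES.items():
        if model.startswith(prefix):
            return provider, model
    raise ValueError(f"cannot infer provider for model {model!r}")


print(resolve_provider("openai/gpt-4o"))
print(resolve_provider("claude-sonnet-4-20250514"))
```

Explicit prefixes are the safer choice when two providers serve models with similar names.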

🛡
Automatic Failover

If a provider goes down, requests are rerouted to a healthy alternative. Unhealthy providers are taken out of rotation and rejoin automatically after 30 seconds.

⚖
Load Balancing

Distribute requests across providers with configurable weights. Optimize for cost, latency, or availability.
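One common way to honor per-provider weights is smooth weighted round-robin. The sketch below is illustrative only: the function name is hypothetical and the gateway's own scheduler may work differently.

```python
from collections import Counter
from itertools import islice
from typing import Iterator


def weighted_round_robin(weights: dict[str, int]) -> Iterator[str]:
    """Yield provider names forever, in proportion to their weights."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    while True:
        # Every provider earns its weight each round...
        for name, weight in weights.items():
            current[name] += weight
        # ...and the one with the highest running score is picked,
        # then pays back the total so the others can catch up.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen


weights = {"openai": 50, "anthropic": 30, "google": 20}
first_ten = list(islice(weighted_round_robin(weights), 10))
print(first_ten)
print(Counter(first_ten))
```

Unlike a naive scheme that sends 50 requests in a row to the heaviest provider, this interleaves picks so traffic stays evenly spread over time.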

⚡
Semantic Caching

Cache identical requests to reduce latency and API costs. Supports in-memory and Redis backends with configurable TTL.

🔒
Auth & Rate Limiting

Bearer token authentication with per-key rate limiting using the token bucket algorithm. Protect your API budget.
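For readers unfamiliar with the token bucket algorithm, here is a minimal sketch with one bucket per API key. The class name, the `check` helper, and the rate and capacity values are assumptions for illustration, not the gateway's configuration.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so short bursts are allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return False to reject."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per API key gives per-key limits (values are illustrative).
buckets: dict[str, TokenBucket] = {}


def check(api_key: str) -> bool:
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(rate=5, capacity=10)
    return buckets[api_key].allow()
```

A rejected request would typically be answered with HTTP 429 so the client can back off and retry.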

📊
Observability

Structured logging with latency, token counts, provider used, and cache hit/miss metrics for every request.

🌐
Multi-Provider

Connect OpenAI, Anthropic, Google Gemini, Ollama, Hugging Face, vLLM, and more through a single unified endpoint.

🔄
Streaming Support

Full SSE streaming passthrough for real-time token generation. Works with all supported providers transparently.

🔧
Zero Config

Works out of the box with environment variables. No YAML, no dashboards. Just set your provider keys and go.

Drop-in Compatible

Works with any OpenAI client library. Just change the base URL and you're done.

Python, Node.js, Go, Rust — any OpenAI SDK works
Same /v1/chat/completions endpoint
Same request and response format
Switch providers by changing the model prefix

# Python — just change base_url
from openai import OpenAI

client = OpenAI(
    base_url="https://ai-gateway.koder.dev/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing"
    }]
)
print(response.choices[0].message.content)

Automatic Failover

When a provider fails, the gateway seamlessly retries with the next healthy provider. No code changes needed.

Configurable retry policies per provider
Health checks with automatic recovery
Weighted round-robin load balancing
Circuit breaker pattern for fast failure detection

# Configure failover chain
GATEWAY_PROVIDERS="openai,anthropic,google"
GATEWAY_WEIGHTS="50,30,20"
GATEWAY_RETRY_MAX=3
GATEWAY_HEALTH_CHECK_INTERVAL=30s

# Request flow:
# 1. Try OpenAI (weight: 50%)
# 2. OpenAI down? Try Anthropic
# 3. Anthropic down? Try Google
# 4. All down? Return 503

# Health auto-recovers in 30s
INFO  provider=openai status=unhealthy
INFO  failover to=anthropic latency=12ms
INFO  provider=openai status=recovered
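The request flow above can be sketched in a few lines. This is an illustration under stated assumptions: `call_with_failover`, `AllProvidersDown`, and `fake_send` are hypothetical names, and a real gateway would only fail over on transport errors and upstream 5xx responses rather than on every exception.

```python
from typing import Callable


class AllProvidersDown(Exception):
    """Raised when every provider in the chain fails (maps to HTTP 503)."""


def call_with_failover(
    chain: list[str],
    send: Callable[[str], str],
    healthy: set[str],
) -> tuple[str, str]:
    """Try each healthy provider in order; return (provider, response)."""
    for provider in chain:
        if provider not in healthy:
            continue  # skip providers already marked unhealthy
        try:
            return provider, send(provider)
        except Exception:
            # Mark the provider unhealthy and move on to the next one.
            healthy.discard(provider)
    raise AllProvidersDown("no healthy provider available")


def fake_send(provider: str) -> str:
    # Stand-in for the upstream HTTP call; OpenAI is simulated as down.
    if provider == "openai":
        raise TimeoutError("upstream timed out")
    return f"response from {provider}"


healthy = {"openai", "anthropic", "google"}
print(call_with_failover(["openai", "anthropic", "google"], fake_send, healthy))
```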

Semantic Caching

Identical requests are served from cache instantly. Save money and reduce latency with zero configuration.

SHA-256 hash of model + messages
In-memory or Redis backends
Configurable TTL (default: 5 minutes)
Cache hit/miss reported in response headers

# First request — cache miss (320ms)
INFO  model=gpt-4o cache=miss latency=320ms tokens=847

# Same request — cache hit (2ms)
INFO  model=gpt-4o cache=hit  latency=2ms   tokens=0

# Response headers
X-Cache: HIT
X-Cache-TTL: 298
X-Provider: openai
X-Request-Id: req_abc123

# Configure cache
GATEWAY_CACHE_BACKEND=redis
GATEWAY_CACHE_TTL=300
GATEWAY_CACHE_MAX_SIZE=10000
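To make the cache key concrete, here is a minimal sketch of hashing the model and messages with SHA-256 and storing responses with a TTL. The JSON serialization, the `cache_key` helper, and the `TTLCache` class are assumptions for illustration; the gateway's exact key format is not specified on this page.

```python
import hashlib
import json
import time
from typing import Optional


def cache_key(model: str, messages: list[dict]) -> str:
    """SHA-256 over the model name and the messages array."""
    # Canonical JSON (sorted keys, no extra whitespace) so that equal
    # requests always produce the same hash.
    payload = json.dumps(
        {"model": model, "messages": messages},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = (self.clock() + self.ttl, value)
```

Because the key is an exact hash, any change to the model or to a message, even a single character, produces a different key and therefore a cache miss.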

How It Compares

See how AI Gateway stacks up against the alternatives.

Feature	AI Gateway	LiteLLM	OpenRouter	Portkey	Direct API
OpenAI-compatible API	✓	✓	✓	✓	—
Self-hosted option	✓	✓	—	—	✓
Automatic failover	✓	Partial	✓	✓	—
Semantic caching	✓	—	—	✓	—
Load balancing	✓	Partial	—	✓	—
Per-key rate limiting	✓	—	✓	✓	—
Open source (MIT)	✓	✓	—	—	N/A
No vendor lock-in	✓	✓	—	—	—
Zero-config setup	✓	—	✓	✓	✓

Frequently Asked Questions

Is AI Gateway compatible with the OpenAI API?

Yes. The gateway exposes an OpenAI-compatible API at /v1/chat/completions. Any client that works with OpenAI works with AI Gateway; just change the base URL.

Does it support streaming?

Yes. Full SSE (Server-Sent Events) streaming is supported. Set "stream": true in your request and the gateway passes tokens through from the upstream provider in real time.
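For clients that read the stream by hand instead of through an SDK, the sketch below extracts text from OpenAI-style SSE lines. It assumes the gateway relays the standard chat-completions chunk format (`data: {...}` lines ending with `data: [DONE]`); `iter_content` and the sample lines are illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a trailing usage-only chunk
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample lines standing in for a real response body.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo!"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_content(sample)))
```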

How does failover work?

When a provider returns an error or times out, the gateway automatically retries with the next available provider based on configured weights and health status. Unhealthy providers are temporarily removed from rotation and auto-recover after 30 seconds.
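The 30-second recovery can be pictured as a cooldown timer per provider. The `HealthTracker` class below is a hypothetical sketch of that idea, not the gateway's health-check code, which may probe providers actively instead.

```python
import time


class HealthTracker:
    """Removes a provider from rotation for `cooldown` seconds after a failure."""

    def __init__(self, cooldown: float = 30.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock
        self._down_until: dict[str, float] = {}

    def mark_unhealthy(self, provider: str) -> None:
        # Record when the provider may rejoin the rotation.
        self._down_until[provider] = self.clock() + self.cooldown

    def is_healthy(self, provider: str) -> bool:
        until = self._down_until.get(provider)
        if until is None:
            return True  # never failed, or already recovered
        if self.clock() >= until:
            del self._down_until[provider]  # cooldown elapsed: recover
            return True
        return False
```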

Which providers are supported?

OpenAI (GPT-4o, o1, o3), Anthropic (Claude Opus, Sonnet, Haiku), Google (Gemini 2.0 Flash/Pro), Ollama (Llama 3, Mistral, Phi-3), Hugging Face, and vLLM. Adding a new provider is as simple as setting an environment variable.

How does caching work?

The cache computes a SHA-256 hash of the model name and messages array. Identical requests return cached responses instantly. Cache entries expire based on the configured TTL (default: 5 minutes). Both in-memory and Redis backends are supported.

Is AI Gateway open source?

Yes, under the MIT license. You can self-host, modify, and redistribute freely. No vendor lock-in, no usage fees, no strings attached.
