Route requests to OpenAI, Anthropic, Google, Ollama, and more through a single, OpenAI-compatible endpoint with intelligent failover, load balancing, and semantic caching.
```bash
# Use any OpenAI-compatible client
curl https://ai-gateway.koder.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

# Switch providers instantly
curl https://ai-gateway.koder.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-20250514",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
Everything you need to run AI workloads reliably at scale.
Automatically detects the right provider from model names. Use explicit prefixes like openai/gpt-4o or let the gateway figure it out.
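A minimal sketch of how model-name routing like this could work. The prefix table and the `resolve_provider` helper are hypothetical illustrations, not the gateway's actual lookup rules.

```python
# Hypothetical lookup tables: the gateway's real mapping is not documented here.
KNOWN_PROVIDERS = {"openai", "anthropic", "google", "ollama"}
NAME_PREFIXES = {
    "gpt-": "openai",
    "o1": "openai",
    "o3": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
}


def resolve_provider(model: str) -> tuple[str, str]:
    """Return (provider, model_name) for a model string."""
    # Explicit form: "provider/model", e.g. "openai/gpt-4o".
    if "/" in model:
        provider, name = model.split("/", 1)
        if provider in KNOWN_PROVIDERS:
            return provider, name
    # Implicit form: infer the provider from the model name's prefix.
    for prefix, provider in NAME_PREFIXES.items():
        if model.startswith(prefix):
            return provider, model
    raise ValueError(f"cannot infer provider for model {model!r}")
```

With these assumed tables, `openai/gpt-4o` and a bare `gpt-4o` both resolve to the `openai` provider, and an unrecognized name is rejected rather than guessed.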
If a provider goes down, requests are seamlessly rerouted to healthy alternatives. Unhealthy providers auto-recover after 30 seconds.
Distribute requests across providers with configurable weights. Optimize for cost, latency, or availability.
Cache identical requests to reduce latency and API costs. Supports in-memory and Redis backends with configurable TTL.
Bearer token authentication with per-key rate limiting using the token bucket algorithm. Protect your API budget.
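For readers unfamiliar with the token bucket algorithm, here is a small sketch of per-key limiting. The class name, capacity, and refill rate are assumptions chosen for illustration; the gateway's own limits and internals may differ.

```python
import time


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_sec: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._now = now              # injectable clock, handy for testing
        self.tokens = capacity       # start with a full bucket
        self.updated = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False when rate limited."""
        current = self._now()
        elapsed = current - self.updated
        self.updated = current
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per API key (hypothetical defaults: burst of 60, 1 request/second).
buckets: dict[str, TokenBucket] = {}


def allow_request(api_key: str) -> bool:
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(capacity=60, refill_per_sec=1.0)
    return buckets[api_key].allow()
```

The capacity sets how large a burst a key may send, and the refill rate sets its sustained request rate.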
Structured logging with latency, token counts, provider used, and cache hit/miss metrics for every request.
Connect OpenAI, Anthropic, Google Gemini, Ollama, Hugging Face, vLLM, and more through a single unified endpoint.
Full SSE streaming passthrough for real-time token generation. Works with all supported providers transparently.
Works out of the box with environment variables. No YAML, no dashboards. Just set your provider keys and go.
Works with any OpenAI client library. Just change the base URL and you're done.
```python
# Python: just change base_url
from openai import OpenAI

client = OpenAI(
    base_url="https://ai-gateway.koder.dev/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing"
    }]
)

print(response.choices[0].message.content)
```
When a provider fails, the gateway seamlessly retries with the next healthy provider. No code changes needed.
```
# Configure failover chain
GATEWAY_PROVIDERS="openai,anthropic,google"
GATEWAY_WEIGHTS="50,30,20"
GATEWAY_RETRY_MAX=3
GATEWAY_HEALTH_CHECK_INTERVAL=30s

# Request flow:
# 1. Try OpenAI (weight: 50%)
# 2. OpenAI down? Try Anthropic
# 3. Anthropic down? Try Google
# 4. All down? Return 503

# Health auto-recovers in 30s
INFO provider=openai status=unhealthy
INFO failover to=anthropic latency=12ms
INFO provider=openai status=recovered
```
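One plausible reading of `GATEWAY_PROVIDERS` and `GATEWAY_WEIGHTS` is weighted random selection among the providers that are currently healthy. The sketch below illustrates that reading only; `parse_weights` and `pick_provider` are hypothetical helpers, and the gateway may order or select providers differently.

```python
import random


def parse_weights(providers: str, weights: str) -> dict[str, float]:
    """Pair comma-separated provider names with comma-separated weights."""
    names = [p.strip() for p in providers.split(",")]
    values = [float(w) for w in weights.split(",")]
    if len(names) != len(values):
        raise ValueError("providers and weights must have the same length")
    return dict(zip(names, values))


def pick_provider(weights: dict[str, float], healthy: set[str], rng=random):
    """Pick one healthy provider at random, in proportion to its weight."""
    candidates = [p for p in weights if p in healthy and weights[p] > 0]
    if not candidates:
        return None  # caller would answer 503 in this case
    return rng.choices(candidates, weights=[weights[p] for p in candidates], k=1)[0]


weights = parse_weights("openai,anthropic,google", "50,30,20")
print(pick_provider(weights, healthy={"openai", "anthropic", "google"}))
```

Dropping an unhealthy provider from `healthy` redistributes its share across the rest in proportion to their weights.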
Identical requests are served from cache instantly. Save money and reduce latency with zero configuration.
```
# First request: cache miss (320ms)
INFO model=gpt-4o cache=miss latency=320ms tokens=847

# Same request: cache hit (2ms)
INFO model=gpt-4o cache=hit latency=2ms tokens=0

# Response headers
X-Cache: HIT
X-Cache-TTL: 298
X-Provider: openai
X-Request-Id: req_abc123

# Configure cache
GATEWAY_CACHE_BACKEND=redis
GATEWAY_CACHE_TTL=300
GATEWAY_CACHE_MAX_SIZE=10000
```
See how AI Gateway stacks up against the alternatives.
| Feature | AI Gateway | LiteLLM | OpenRouter | Portkey | Direct API |
|---|---|---|---|---|---|
| OpenAI-compatible API | ✓ | ✓ | ✓ | ✓ | — |
| Self-hosted option | ✓ | ✓ | — | — | ✓ |
| Automatic failover | ✓ | Partial | ✓ | ✓ | — |
| Semantic caching | ✓ | — | — | ✓ | — |
| Load balancing | ✓ | Partial | — | ✓ | — |
| Per-key rate limiting | ✓ | — | ✓ | ✓ | — |
| Open source (MIT) | ✓ | ✓ | — | — | N/A |
| No vendor lock-in | ✓ | ✓ | — | — | — |
| Zero-config setup | ✓ | — | ✓ | ✓ | ✓ |
Yes. The gateway exposes an OpenAI-compatible API at /v1/chat/completions. Any client that works with OpenAI will work with AI Gateway — just change the base URL.
Yes. Full SSE (Server-Sent Events) streaming is supported. Set "stream": true in your request and the gateway passes tokens through from the upstream provider in real time.
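As an illustration, streaming through the gateway with the OpenAI Python client might look like the sketch below. `stream_reply` is a hypothetical helper; it assumes the gateway forwards chunks in the standard OpenAI streaming shape.

```python
from typing import Iterator


def stream_reply(client, model: str, prompt: str) -> Iterator[str]:
    """Yield text fragments as they arrive from a streaming chat completion."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask for Server-Sent Events instead of one final response
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (e.g. role or finish markers).
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage, assuming the `openai` package is installed:
#   from openai import OpenAI
#   client = OpenAI(base_url="https://ai-gateway.koder.dev/v1", api_key="YOUR_API_KEY")
#   for text in stream_reply(client, "openai/gpt-4o", "Hello!"):
#       print(text, end="", flush=True)
```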
When a provider returns an error or times out, the gateway automatically retries with the next available provider based on configured weights and health status. Unhealthy providers are temporarily removed from rotation and auto-recover after 30 seconds.
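The behavior described above can be sketched roughly as follows. `ProviderHealth` and `call_with_failover` are hypothetical names, and the 30-second cooldown mirrors the figure quoted above; this is an illustration under those assumptions, not the gateway's source.

```python
import time


class ProviderHealth:
    """Tracks which providers are temporarily out of rotation."""

    def __init__(self, providers, cooldown: float = 30.0, now=time.monotonic):
        self.providers = list(providers)
        self.cooldown = cooldown
        self._now = now              # injectable clock, handy for testing
        self._down_since = {}        # provider -> time it was marked unhealthy

    def healthy(self):
        """Providers never marked down, or whose cooldown has elapsed."""
        current = self._now()
        return [
            p for p in self.providers
            if p not in self._down_since
            or current - self._down_since[p] >= self.cooldown
        ]

    def mark_unhealthy(self, provider):
        self._down_since[provider] = self._now()

    def mark_healthy(self, provider):
        self._down_since.pop(provider, None)


def call_with_failover(health, send, max_attempts: int = 3):
    """Try healthy providers in order; `send(provider)` raises on failure."""
    last_error = None
    for provider in health.healthy()[:max_attempts]:
        try:
            result = send(provider)
        except Exception as exc:  # error or timeout from the upstream provider
            health.mark_unhealthy(provider)
            last_error = exc
            continue
        health.mark_healthy(provider)
        return provider, result
    raise RuntimeError("all providers unavailable") from last_error
```

A caller that catches the final `RuntimeError` would translate it into the 503 response.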
OpenAI (GPT-4o, o1, o3), Anthropic (Claude Opus, Sonnet, Haiku), Google (Gemini 2.0 Flash/Pro), Ollama (Llama 3, Mistral, Phi-3), Hugging Face, and vLLM. Adding a new provider is as simple as setting an environment variable.
The cache key is a SHA-256 hash of the model name and messages array, so only requests with the same model and messages hit the cache. Identical requests return cached responses instantly. Cache entries expire based on the configured TTL (default: 5 minutes). Supports both in-memory and Redis backends.
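A rough sketch of this kind of cache, using an in-memory store. The canonical JSON serialization and the `TTLCache` class are assumptions made for illustration; the source states only that the key is a SHA-256 hash of the model name and messages array.

```python
import hashlib
import json
import time


def cache_key(model: str, messages: list) -> str:
    """SHA-256 over a canonical JSON encoding of model + messages (assumed format)."""
    payload = json.dumps(
        {"model": model, "messages": messages},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory cache whose entries expire `ttl` seconds after being set."""

    def __init__(self, ttl: float = 300.0, now=time.monotonic):
        self.ttl = ttl
        self._now = now            # injectable clock, handy for testing
        self._store = {}           # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._now() >= expires_at:
            del self._store[key]   # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._now() + self.ttl, value)
```

Because the key covers only the model and messages in this sketch, two requests that differ in any message content produce different keys and are cached separately.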
Yes, under the MIT license. You can self-host, modify, and redistribute freely. No vendor lock-in, no usage fees, no strings attached.
One API. Every provider. Intelligent failover. Drop-in compatible.