
AI API Gateway: Unified LLM Traffic Routing Infrastructure

> Private repository. Available for code review on request.

▍ Problem Space

Organizations using multiple Large Language Model providers (OpenAI, Anthropic, Google, etc.) face systemic infrastructure challenges:

  • Protocol Fragmentation: Each provider has a proprietary request/response format, streaming semantics (SSE), error handling, and authentication model.
  • Rate Limits & Quotas: Strict per-account, per-minute, and per-token limits that require intelligent load balancing across account pools.
  • Unpredictable Latency: "Thinking" phases for frontier models can last up to 2-3 minutes, causing idle timeout disconnects at the load balancer level.
  • Lack of Unified Observability: Token usage, latency distribution, and error rates are fragmented without centralized control.

Businesses need a single Gateway that provides a unified OpenAI-compatible API, transparent routing between providers, resilience to network anomalies, and consistent distributed state across nodes.

▍ Architecture

The system is a high-load reverse proxy and API gateway written entirely in Rust. It's structured as a Cargo workspace with 15+ crates, enforcing a strict separation between domain, infrastructure, and API layers.

┌─────────────────────────────────────────────────────────┐
│                     CLIENTS                             │
│         (OpenAI SDK, curl, any HTTP client)             │
└───────────────────────┬─────────────────────────────────┘
                        │ OpenAI-compatible API
                        ▼
┌─────────────────────────────────────────────────────────┐
│                   GATEWAY LAYER                         │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │ Protocol │  │   Session    │  │  Load Balancer    │  │
│  │ Adapter  │  │   Affinity   │  │  (least-loaded)   │  │
│  │ (transl.)│  │   Manager    │  │                   │  │
│  └────┬─────┘  └──────┬───────┘  └────────┬──────────┘  │
│       │               │                   │             │
│  ┌────▼───────────────▼───────────────────▼──────────┐  │
│  │              STATE LAYER                          │  │
│  │  ArcSwap (lock-free config)  +  CRDT/LWW sync     │  │
│  │  PostgreSQL (Event Sourcing + streaming replica)  │  │
│  └───────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────┘
                        │ Managed connection pool
                        ▼
┌─────────────────────────────────────────────────────────┐
│               UPSTREAM PROVIDERS                        │
│     OpenAI    │    Anthropic    │    Google    │  ...   │
└─────────────────────────────────────────────────────────┘

Key Components:

  • Protocol Adapter: Bidirectional format translation (OpenAI ↔ proprietary provider APIs). Clients interact through a single OpenAI-compatible interface regardless of which provider handles the request.
  • Session Affinity Manager: Persistent binding of "session → provider account", surviving service restarts. Maximizes context cache utilization (up to 128K tokens) and ensures predictable behavior for stateful dialogs.
  • Load Balancer: Least-loaded distribution with anti-thundering-herd algorithms during initial session assignment. Distributes load across provider account pools.
  • State Layer: `ArcSwap` for lock-free hot reloading of configuration (zero contention on the hot path). CRDT/LWW with tombstone records for state synchronization across nodes. PostgreSQL with Event Sourcing and streaming replication acts as the single source of truth.
  • Managed Connection Pool: RAII-controlled connection pool with aggressive HTTP/2 keepalive to prevent idle timeout disconnects during extended generation phases.
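To make the Protocol Adapter's job concrete, here is a minimal sketch of one direction of translation: an OpenAI-style chat request mapped onto an Anthropic-style one. All type and field names are illustrative, not the gateway's actual code; real translation also covers streaming chunks, errors, and auth.

```rust
// Illustrative protocol translation: OpenAI-style request -> Anthropic-style.
// Types and field names are hypothetical, not the gateway's real API.

#[derive(Debug, Clone, PartialEq)]
struct OpenAiRequest {
    model: String,
    messages: Vec<(String, String)>, // (role, content)
    max_tokens: Option<u32>,
}

#[derive(Debug, Clone, PartialEq)]
struct AnthropicRequest {
    model: String,
    system: Option<String>,          // Anthropic carries the system prompt separately
    messages: Vec<(String, String)>,
    max_tokens: u32,                 // required upstream, so the adapter supplies a default
}

fn translate(req: OpenAiRequest) -> AnthropicRequest {
    // Split the system message out of the conversation; pass the rest through.
    let (system, messages): (Vec<_>, Vec<_>) = req
        .messages
        .into_iter()
        .partition(|(role, _)| role.as_str() == "system");
    AnthropicRequest {
        model: req.model,
        system: system.into_iter().next().map(|(_, content)| content),
        messages,
        max_tokens: req.max_tokens.unwrap_or(1024), // illustrative default
    }
}
```

The inverse mapping (provider response → OpenAI-compatible response) follows the same pattern, so clients never see which provider served the request.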

Infrastructure:

  • Frontend: Administrative dashboard in Rust (Leptos + WASM) — real-time monitoring, account management, cost analytics.
  • DevOps: Nix Flakes (reproducible builds) + systemd socket activation (zero-downtime deploy).

▍ Metrics (Production Data)

The system operates under real production load:

  • 73,000+ API requests / month
  • 2.7B tokens processed / week
  • 124 hrs GPU inference time / week
  • 28.3 req/min peak load
  • ~94% uptime target
  • 45–81% prompt cache hit rate
  • <30 s average latency (incl. thinking)
  • ~6% error rate (resolved)
  • 900+ tests in the codebase

▍ Key Engineering Decisions

Problem: Account, quota, and session state must stay consistent across nodes without a central coordinator.
Solution: An LWW (Last-Write-Wins) CRDT with tombstone records. Each node is autonomous; conflicts are resolved by timestamp, and tombstones prevent the "resurrection" of deleted records during merges.
Rejected alternative: Raft/Paxos — excessive complexity for an eventually-consistent workload, and a CRDT requires no leader election.
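The merge semantics can be sketched in a few lines. This is a toy LWW map, not the gateway's implementation: each key stores a timestamp plus either a value or a tombstone (`None`), and merge keeps the newer entry per key, which makes it commutative and idempotent.

```rust
use std::collections::HashMap;

// Toy LWW-map with tombstones (illustrative, not the gateway's code).
// Each key maps to (timestamp, value); `None` is a tombstone marking deletion.
#[derive(Debug, Clone, Default, PartialEq)]
struct LwwMap {
    entries: HashMap<String, (u64, Option<String>)>,
}

impl LwwMap {
    fn set(&mut self, key: &str, ts: u64, value: &str) {
        self.apply(key, ts, Some(value.to_string()));
    }
    fn delete(&mut self, key: &str, ts: u64) {
        self.apply(key, ts, None); // write a tombstone, never remove the entry
    }
    fn apply(&mut self, key: &str, ts: u64, value: Option<String>) {
        let e = self.entries.entry(key.to_string()).or_insert((0, None));
        // Last write wins; equal timestamps keep the existing entry
        // (production systems break ties deterministically, e.g. by node id).
        if ts > e.0 {
            *e = (ts, value);
        }
    }
    fn get(&self, key: &str) -> Option<&str> {
        self.entries.get(key).and_then(|(_, v)| v.as_deref())
    }
    // Merge is commutative and idempotent: take the newer entry per key,
    // so any two nodes converge regardless of gossip order.
    fn merge(&mut self, other: &LwwMap) {
        for (k, (ts, v)) in &other.entries {
            self.apply(k, *ts, v.clone());
        }
    }
}
```

The tombstone is what stops resurrection: a delete at t=2 merged into a replica that still holds the value from t=1 wins on timestamp, instead of the stale value reappearing.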
Problem: Configuration (provider list, quotas, routing rules) changes at runtime, and a classic RwLock creates contention under thousands of concurrent requests.
Solution: ArcSwap — atomic replacement of `Arc<Config>` without locks. Readers take a snapshot in O(1); the writer publishes a new version atomically. Zero contention on the hot path.
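The core idea behind `arc-swap` can be sketched with a bare `AtomicPtr` from the standard library. Note the deliberate simplification: this toy cell leaks every replaced snapshot, because reclaiming old versions safely while readers may still hold them is exactly the hard problem the real crate solves.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

#[derive(Debug)]
struct Config {
    provider: &'static str,
    max_tokens: u32,
}

// Toy lock-free config cell (illustrative). Readers take an O(1) snapshot;
// the writer publishes a whole new Config atomically. Replaced snapshots are
// intentionally leaked here; the real arc-swap crate reclaims them safely.
struct ConfigCell {
    ptr: AtomicPtr<Config>,
}

impl ConfigCell {
    fn new(c: Config) -> Self {
        Self { ptr: AtomicPtr::new(Box::into_raw(Box::new(c))) }
    }
    fn load(&self) -> &Config {
        // Sound only because old snapshots are never freed in this sketch.
        unsafe { &*self.ptr.load(Ordering::Acquire) }
    }
    fn store(&self, c: Config) {
        let _old = self.ptr.swap(Box::into_raw(Box::new(c)), Ordering::AcqRel);
        // `_old` is leaked on purpose: freeing it could invalidate a snapshot
        // a concurrent reader is still using.
    }
}
```

Readers never block and never touch a lock; a config reload is a single pointer swap, which is what keeps the hot path contention-free.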
Problem: When an upstream connection drops mid-SSE-stream, the client receives an incomplete stream, breaking SDK parsing.
Solution: The Gateway intercepts the network error and emits a synthetic `[DONE]` chunk with `finish_reason: "error"`, converting a transport failure into a graceful stream termination. Client code sees a normal completion, not an exception.
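A sketch of what that synthetic termination could look like on the wire. The chunk shape follows the OpenAI streaming format; the helper itself and the `"error"` finish reason are illustrative of the document's approach, since finish-reason values beyond the standard set are gateway-defined.

```rust
// On an upstream transport error, emit one final synthetic SSE chunk carrying
// finish_reason "error", then the [DONE] sentinel, so OpenAI-compatible SDKs
// observe a normally terminated stream instead of a broken one.
fn synthetic_termination(model: &str) -> String {
    let chunk = format!(
        concat!(
            r#"{{"object":"chat.completion.chunk","model":"{}","#,
            r#""choices":[{{"index":0,"delta":{{}},"finish_reason":"error"}}]}}"#
        ),
        model
    );
    format!("data: {chunk}\n\ndata: [DONE]\n\n")
}
```

Because the stream ends with a well-formed chunk plus `[DONE]`, SDK stream parsers run their normal completion path and application code can inspect `finish_reason` to detect the failure.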
Problem: Frontier models "think" for 60–180 seconds, while upstream load balancers drop connections as idle after 30–60 s even though the request is still processing.
Solution: Aggressive HTTP/2 PING keepalive at the multiplexer level keeps the connection visibly active for intermediate load balancers without disrupting model execution.
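As a configuration sketch (assuming an HTTP client like `reqwest`; the intervals are illustrative, not the gateway's actual values), the keepalive cadence is tuned well below typical load-balancer idle timeouts:

```rust
use std::time::Duration;

// Illustrative upstream client config: HTTP/2 PING every 15 s, far below the
// 30–60 s idle cutoffs of intermediate load balancers, and kept alive even
// while the stream carries no frames during a model's thinking phase.
fn upstream_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .http2_prior_knowledge()
        .http2_keep_alive_interval(Duration::from_secs(15)) // PING cadence
        .http2_keep_alive_timeout(Duration::from_secs(10))  // drop if no ACK
        .http2_keep_alive_while_idle(true) // ping even with no active streams
        .build()
}
```

PING frames live at the connection level, so they keep middleboxes convinced the connection is alive without injecting anything into the in-flight request or response streams.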
Problem: LLM providers cache dialog context (up to 128K tokens) per account; switching accounts discards the cache and means paying for prompt tokens again.
Solution: A persistent "client session → provider account" binding stored in PostgreSQL, surviving restarts. New sessions are assigned via a least-loaded algorithm with anti-thundering-herd protection.

▍ Tech Stack

  • Backend: Rust, Axum, Tokio, SQLx, PostgreSQL, ArcSwap, DashMap
  • Frontend: Rust, Leptos, WebAssembly (WASM)
  • DevOps: Nix (Flakes), systemd (socket activation), Podman

▍ Demonstrated Competencies

  • Systems Architecture: Designing a distributed stateful service resilient to network anomalies, partial failures, and prolonged upstream latency.
  • Distributed Systems: Practical production use of CRDTs, Event Sourcing, and PostgreSQL streaming replication.
  • Performance Engineering: Lock-free hot paths, zero-copy streaming, and leak-free connection pool management under constant load.
  • Production Operations: Zero-downtime deployment, deep instrumentation (metrics, tracing), and graceful degradation during upstream provider outages.
  • Rust Ecosystem Mastery: A workspace of 15+ crates, 900+ tests, and type-safe domain models with exhaustive pattern matching.

Ready to build something like this?

Start a Project