
AI API Gateway: Unified LLM Traffic Routing Infrastructure

> Private repository. Available for code review on request.

▍ Problem Space

Organizations using multiple Large Language Model providers (OpenAI, Anthropic, Google, etc.) face systemic infrastructure challenges:

  • Protocol Fragmentation: Each provider has a proprietary request/response format, streaming semantics (SSE), error handling, and authentication model.
  • Rate Limits & Quotas: Strict per-account, per-minute, and per-token limits that require intelligent load balancing across account pools.
  • Unpredictable Latency: "Thinking" phases for frontier models can last up to 2-3 minutes, causing idle timeout disconnects at the load balancer level.
  • Lack of Unified Observability: Token usage, latency distribution, and error rates are fragmented without centralized control.

Businesses need a single Gateway that provides a unified OpenAI-compatible API, transparent routing between providers, resilience to network anomalies, and consistent distributed state across nodes.

▍ Architecture

The system is a high-load reverse proxy and API gateway written entirely in Rust. It's structured as a Cargo workspace with 15+ crates, enforcing a strict separation between domain, infrastructure, and API layers.

┌─────────────────────────────────────────────────────────┐
│                     CLIENTS                             │
│         (OpenAI SDK, curl, any HTTP client)             │
└───────────────────────┬─────────────────────────────────┘
                        │ OpenAI-compatible API
                        ▼
┌─────────────────────────────────────────────────────────┐
│                   GATEWAY LAYER                         │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │ Protocol │  │   Session    │  │  Load Balancer    │  │
│  │ Adapter  │  │   Affinity   │  │  (least-loaded)   │  │
│  │ (transl.)│  │   Manager    │  │                   │  │
│  └────┬─────┘  └──────┬───────┘  └────────┬──────────┘  │
│       │               │                   │             │
│  ┌────▼───────────────▼───────────────────▼──────────┐  │
│  │              STATE LAYER                          │  │
│  │  ArcSwap (lock-free config)  +  CRDT/LWW sync     │  │
│  │  PostgreSQL (Event Sourcing + streaming replica)  │  │
│  └───────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────┘
                        │ Managed connection pool
                        ▼
┌─────────────────────────────────────────────────────────┐
│               UPSTREAM PROVIDERS                        │
│     OpenAI    │    Anthropic    │    Google    │  ...   │
└─────────────────────────────────────────────────────────┘

Key Components:

  • Protocol Adapter: Bidirectional format translation (OpenAI ↔ proprietary provider APIs). Clients interact through a single OpenAI-compatible interface regardless of which provider handles the request.
  • Session Affinity Manager: Persistent binding of "session → provider account", surviving service restarts. Maximizes context cache utilization (up to 128K tokens) and ensures predictable behavior for stateful dialogs.
  • Load Balancer: Least-loaded distribution with anti-thundering-herd algorithms during initial session assignment. Distributes load across provider account pools.
  • State Layer: `ArcSwap` for lock-free hot reloading of configuration (zero contention on the hot path). CRDT/LWW with tombstone records for state synchronization across nodes. PostgreSQL with Event Sourcing and streaming replication acts as the single source of truth.
  • Managed Connection Pool: RAII-controlled connection pool with aggressive HTTP/2 keepalive to prevent idle timeout disconnects during extended generation phases.
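To make the Protocol Adapter's job concrete, here is a minimal sketch of one direction of translation: an OpenAI-style chat request mapped onto an Anthropic-style one. All type and field names are illustrative, not the gateway's actual code; real translation also covers streaming chunks, errors, and auth.

```rust
// Illustrative protocol translation: OpenAI-style request -> Anthropic-style.
// Types and field names are hypothetical, not the gateway's real API.

#[derive(Debug, Clone, PartialEq)]
struct OpenAiRequest {
    model: String,
    messages: Vec<(String, String)>, // (role, content)
    max_tokens: Option<u32>,
}

#[derive(Debug, Clone, PartialEq)]
struct AnthropicRequest {
    model: String,
    system: Option<String>,          // Anthropic carries the system prompt separately
    messages: Vec<(String, String)>,
    max_tokens: u32,                 // required upstream, so the adapter supplies a default
}

fn translate(req: OpenAiRequest) -> AnthropicRequest {
    // Split the system message out of the conversation; pass the rest through.
    let (system, messages): (Vec<_>, Vec<_>) = req
        .messages
        .into_iter()
        .partition(|(role, _)| role.as_str() == "system");
    AnthropicRequest {
        model: req.model,
        system: system.into_iter().next().map(|(_, content)| content),
        messages,
        max_tokens: req.max_tokens.unwrap_or(1024), // illustrative default
    }
}
```

The inverse mapping (provider response → OpenAI-compatible response) follows the same pattern, so clients never see which provider served the request.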

Infrastructure:

  • Frontend: Administrative dashboard in Rust (Leptos + WASM) — real-time monitoring, account management, cost analytics.
  • DevOps: Nix Flakes (reproducible builds) + systemd socket activation (zero-downtime deploy).

▍ Metrics (Production Data)

The system operates under real production load:

  • 73,000+ API requests / month
  • 2.7B tokens processed / week
  • 124 hrs GPU inference time / week
  • 28.3 req/min peak load
  • ~94% uptime target
  • 45–81% prompt cache hit rate
  • <30 s average latency (incl. thinking)
  • ~6% error rate (resolved)
  • 900+ tests in the codebase

▍ Key Engineering Decisions

Problem: Account, quota, and session state must stay consistent across nodes without a central coordinator.
Solution: An LWW (Last-Write-Wins) CRDT with tombstone records. Each node is autonomous; conflicts are resolved by timestamp, and tombstones prevent the "resurrection" of deleted records during merges.
Rejected alternative: Raft/Paxos — excessive complexity for an eventually-consistent workload, and a CRDT requires no leader election.
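The merge semantics can be sketched in a few lines. This is a toy LWW map, not the gateway's implementation: each key stores a timestamp plus either a value or a tombstone (`None`), and merge keeps the newer entry per key, which makes it commutative and idempotent.

```rust
use std::collections::HashMap;

// Toy LWW-map with tombstones (illustrative, not the gateway's code).
// Each key maps to (timestamp, value); `None` is a tombstone marking deletion.
#[derive(Debug, Clone, Default, PartialEq)]
struct LwwMap {
    entries: HashMap<String, (u64, Option<String>)>,
}

impl LwwMap {
    fn set(&mut self, key: &str, ts: u64, value: &str) {
        self.apply(key, ts, Some(value.to_string()));
    }
    fn delete(&mut self, key: &str, ts: u64) {
        self.apply(key, ts, None); // write a tombstone, never remove the entry
    }
    fn apply(&mut self, key: &str, ts: u64, value: Option<String>) {
        let e = self.entries.entry(key.to_string()).or_insert((0, None));
        // Last write wins; equal timestamps keep the existing entry
        // (production systems break ties deterministically, e.g. by node id).
        if ts > e.0 {
            *e = (ts, value);
        }
    }
    fn get(&self, key: &str) -> Option<&str> {
        self.entries.get(key).and_then(|(_, v)| v.as_deref())
    }
    // Merge is commutative and idempotent: take the newer entry per key,
    // so any two nodes converge regardless of gossip order.
    fn merge(&mut self, other: &LwwMap) {
        for (k, (ts, v)) in &other.entries {
            self.apply(k, *ts, v.clone());
        }
    }
}
```

The tombstone is what stops resurrection: a delete at t=2 merged into a replica that still holds the value from t=1 wins on timestamp, instead of the stale value reappearing.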
Problem: Configuration (provider list, quotas, routing rules) changes at runtime, and a classic RwLock creates contention under thousands of concurrent requests.
Solution: ArcSwap — atomic replacement of `Arc<Config>` without locks. Readers take a snapshot in O(1); the writer publishes a new version atomically. Zero contention on the hot path.
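The core idea behind `arc-swap` can be sketched with a bare `AtomicPtr` from the standard library. Note the deliberate simplification: this toy cell leaks every replaced snapshot, because reclaiming old versions safely while readers may still hold them is exactly the hard problem the real crate solves.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

#[derive(Debug)]
struct Config {
    provider: &'static str,
    max_tokens: u32,
}

// Toy lock-free config cell (illustrative). Readers take an O(1) snapshot;
// the writer publishes a whole new Config atomically. Replaced snapshots are
// intentionally leaked here; the real arc-swap crate reclaims them safely.
struct ConfigCell {
    ptr: AtomicPtr<Config>,
}

impl ConfigCell {
    fn new(c: Config) -> Self {
        Self { ptr: AtomicPtr::new(Box::into_raw(Box::new(c))) }
    }
    fn load(&self) -> &Config {
        // Sound only because old snapshots are never freed in this sketch.
        unsafe { &*self.ptr.load(Ordering::Acquire) }
    }
    fn store(&self, c: Config) {
        let _old = self.ptr.swap(Box::into_raw(Box::new(c)), Ordering::AcqRel);
        // `_old` is leaked on purpose: freeing it could invalidate a snapshot
        // a concurrent reader is still using.
    }
}
```

Readers never block and never touch a lock; a config reload is a single pointer swap, which is what keeps the hot path contention-free.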
Problem: When an upstream connection drops mid-SSE-stream, the client receives an incomplete stream, breaking SDK parsing.
Solution: The Gateway intercepts the network error and emits a synthetic `[DONE]` chunk with `finish_reason: "error"`, converting a transport failure into a graceful stream termination. Client code sees a normal completion, not an exception.
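A sketch of what that synthetic termination could look like on the wire. The chunk shape follows the OpenAI streaming format; the helper itself and the `"error"` finish reason are illustrative of the document's approach, since finish-reason values beyond the standard set are gateway-defined.

```rust
// On an upstream transport error, emit one final synthetic SSE chunk carrying
// finish_reason "error", then the [DONE] sentinel, so OpenAI-compatible SDKs
// observe a normally terminated stream instead of a broken one.
fn synthetic_termination(model: &str) -> String {
    let chunk = format!(
        concat!(
            r#"{{"object":"chat.completion.chunk","model":"{}","#,
            r#""choices":[{{"index":0,"delta":{{}},"finish_reason":"error"}}]}}"#
        ),
        model
    );
    format!("data: {chunk}\n\ndata: [DONE]\n\n")
}
```

Because the stream ends with a well-formed chunk plus `[DONE]`, SDK stream parsers run their normal completion path and application code can inspect `finish_reason` to detect the failure.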
Problem: Frontier models "think" for 60–180 seconds, while upstream load balancers drop connections as idle after 30–60 s even though the request is still processing.
Solution: Aggressive HTTP/2 PING keepalive at the multiplexer level keeps the connection visibly active for intermediate load balancers without disrupting model execution.
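As a configuration sketch (assuming an HTTP client like `reqwest`; the intervals are illustrative, not the gateway's actual values), the keepalive cadence is tuned well below typical load-balancer idle timeouts:

```rust
use std::time::Duration;

// Illustrative upstream client config: HTTP/2 PING every 15 s, far below the
// 30–60 s idle cutoffs of intermediate load balancers, and kept alive even
// while the stream carries no frames during a model's thinking phase.
fn upstream_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .http2_prior_knowledge()
        .http2_keep_alive_interval(Duration::from_secs(15)) // PING cadence
        .http2_keep_alive_timeout(Duration::from_secs(10))  // drop if no ACK
        .http2_keep_alive_while_idle(true) // ping even with no active streams
        .build()
}
```

PING frames live at the connection level, so they keep middleboxes convinced the connection is alive without injecting anything into the in-flight request or response streams.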
Problem: LLM providers cache dialog context (up to 128K tokens) per account; switching accounts discards the cache and means paying for prompt tokens again.
Solution: A persistent "client session → provider account" binding stored in PostgreSQL, surviving restarts. New sessions are assigned via a least-loaded algorithm with anti-thundering-herd protection.

▍ Tech Stack

  • Backend: Rust, Axum, Tokio, SQLx, PostgreSQL, ArcSwap, DashMap
  • Frontend: Rust, Leptos, WebAssembly (WASM)
  • DevOps: Nix (Flakes), systemd (socket activation), Podman

▍ Demonstrated Competencies

  • Systems Architecture: Designing a distributed stateful service resilient to network anomalies, partial failures, and prolonged upstream latency.
  • Distributed Systems: Practical production use of CRDTs, Event Sourcing, and PostgreSQL streaming replication.
  • Performance Engineering: Lock-free hot paths, zero-copy streaming, and leak-free connection pool management under constant load.
  • Production Operations: Zero-downtime deployment, deep instrumentation (metrics, tracing), and graceful degradation during upstream provider outages.
  • Rust Ecosystem Mastery: A workspace of 15+ crates, 900+ tests, and type-safe domain models with exhaustive pattern matching.

Ready to build something like this?

Start a Project