NEXUS: AI-Powered Video Processing Pipeline

> Private repository. Available for code review on request.

▍ Problem Space

Content creators producing video for social media face a production bottleneck that scales linearly with volume:

Manual Editing Cost: Professional short-form video editing costs $50-200 per video.
Format Fragmentation: Each platform requires different aspect ratios (9:16, 1:1, 16:9).
Quality Inconsistency: Human editors produce variable quality depending on fatigue and skill.
Turnaround Time: Traditional editing takes hours per video.

▍ Architecture

INPUT STAGE
  Video Decode (FFmpeg) → Scene Detection → Audio Extraction (Whisper)
          ↓
  ML INFERENCE
  Face Detection + Tracking (ONNX, CUDA) | 12,000+ keyframes/video
          ↓
  OUTPUT PIPELINE
  Smart Crop (face-centered) → Format Adaptation (9:16/1:1/16:9) → GPU Encode (NVENC)

Key Components:

Video Decoder: FFmpeg-based frame extraction with hardware-accelerated decoding (CUVID).
Face Detection & Tracking: ONNX neural network with CUDA. Processes 12,000+ keyframes per video.
Smart Crop Engine: Face-centered framing with dynamic crop region adjustment.
Format Adaptation: Automated aspect ratio conversion (16:9 → 9:16/1:1) with intelligent reframing.
GPU Encoder: NVENC H.264 encoding, 4-10x faster than CPU.

▍ Key Engineering Decisions

Problem

CPU-based ONNX inference for face tracking takes 15+ minutes per video.

Solution

CUDA Execution Provider for ONNX Runtime. Face detection on GPU reduces inference time 4-10x. Combined with NVENC encoding, entire pipeline runs on GPU.

Alternative Rejected

Cloud GPU inference — Network latency, egress costs, vendor dependency. Local GPU amortizes to near-zero cost.

Problem

Raw bounding box detections per-frame produce jittery camera movement.

Solution

Bounding box interpolation with exponential moving average. Detections sampled at keyframes, interpolated for intermediate frames, producing cinematic smooth panning.

▍ Metrics

~3 min / 10-min video

Processing Speed

12,000+ / video

Face Keyframes

9:16, 1:1, 16:9

Output Formats

4-10x vs CPU

GPU Acceleration

Constant (streaming)

Memory Usage

~$0 (local GPU)

Cost per Video

▍ Tech Stack

Core

Rust, FFmpeg, ONNX Runtime

GPU

CUDA (NVENC encode, CUVID decode, ONNX inference)

Face detection neural network (ONNX format)

Audio

Whisper (speech-to-text for subtitles)

▍ Demonstrated Competencies

GPU Pipeline Engineering

End-to-end GPU-accelerated video: decode → inference → encode.

ML Inference in Production

ONNX Runtime with CUDA Execution Provider at scale.

Streaming Architecture

Constant-memory processing of large videos via chunked pipeline stages.

Computer Vision

Face detection, tracking, bounding box interpolation, intelligent cropping.

FFmpeg Integration

Programmatic FFmpeg for hardware-accelerated decode/encode.

Ready to build something like this?

Start a Project