← Back to Projects

NEXUS: AI-Powered Video Processing Pipeline

> Private repository. Available for code review on request.

▍ Problem Space

Content creators producing video for social media face a production bottleneck that scales linearly with volume:

  • Manual Editing Cost: Professional short-form video editing costs $50-200 per video.
  • Format Fragmentation: Each platform requires different aspect ratios (9:16, 1:1, 16:9).
  • Quality Inconsistency: Human editors produce variable quality depending on fatigue and skill.
  • Turnaround Time: Traditional editing takes hours per video.

▍ Architecture

INPUT STAGE
  Video Decode (FFmpeg) → Scene Detection → Audio Extraction (Whisper)
          ↓
  ML INFERENCE
  Face Detection + Tracking (ONNX, CUDA) | 12,000+ keyframes/video
          ↓
  OUTPUT PIPELINE
  Smart Crop (face-centered) → Format Adaptation (9:16/1:1/16:9) → GPU Encode (NVENC)

Key Components:

  • Video Decoder: FFmpeg-based frame extraction with hardware-accelerated decoding (CUVID).
  • Face Detection & Tracking: ONNX neural network with CUDA. Processes 12,000+ keyframes per video.
  • Smart Crop Engine: Face-centered framing with dynamic crop region adjustment.
  • Format Adaptation: Automated aspect ratio conversion (16:9 → 9:16/1:1) with intelligent reframing.
  • GPU Encoder: NVENC H.264 encoding, 4-10x faster than CPU.

▍ Key Engineering Decisions

Problem
CPU-based ONNX inference for face tracking takes 15+ minutes per video.
Solution
CUDA Execution Provider for ONNX Runtime. Face detection on GPU reduces inference time 4-10x. Combined with NVENC encoding, entire pipeline runs on GPU.
Alternative Rejected
Cloud GPU inference — Network latency, egress costs, vendor dependency. Local GPU amortizes to near-zero cost.
Problem
Raw bounding box detections per-frame produce jittery camera movement.
Solution
Bounding box interpolation with exponential moving average. Detections sampled at keyframes, interpolated for intermediate frames, producing cinematic smooth panning.

▍ Metrics

~3 min / 10-min video
Processing Speed
12,000+ / video
Face Keyframes
9:16, 1:1, 16:9
Output Formats
4-10x vs CPU
GPU Acceleration
Constant (streaming)
Memory Usage
~$0 (local GPU)
Cost per Video

▍ Tech Stack

Core
Rust, FFmpeg, ONNX Runtime
GPU
CUDA (NVENC encode, CUVID decode, ONNX inference)
ML
Face detection neural network (ONNX format)
Audio
Whisper (speech-to-text for subtitles)

▍ Demonstrated Competencies

GPU Pipeline Engineering
End-to-end GPU-accelerated video: decode → inference → encode.
ML Inference in Production
ONNX Runtime with CUDA Execution Provider at scale.
Streaming Architecture
Constant-memory processing of large videos via chunked pipeline stages.
Computer Vision
Face detection, tracking, bounding box interpolation, intelligent cropping.
FFmpeg Integration
Programmatic FFmpeg for hardware-accelerated decode/encode.

Ready to build something like this?

Start a Project