GPU-Accelerated RTSP Stream Processing System
Project Overview
A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through shared CUDA context and keeps all processing on the GPU until final JPEG compression.
Key Achievements
- Shared CUDA Context: 70% VRAM reduction (from ~200MB to ~60MB per stream)
- Linear VRAM Scaling: Perfect scaling at 60 MB per additional stream
- Zero-Copy Pipeline: All processing stays on GPU until JPEG bytes
- Measured Performance: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM
Architecture
Pipeline Flow
RTSP Stream → PyAV (CPU)
↓
NVDEC Decode (GPU) → NV12 Format
↓
NV12 to RGB (GPU) → PyTorch Ops
↓
nvJPEG Encode (GPU) → JPEG Bytes
↓
CPU (JPEG only)
Core Components
StreamDecoderFactory
Singleton factory managing shared CUDA context across all decoder instances.
Key Methods:
- get_factory(gpu_id): Returns the singleton instance
- create_decoder(rtsp_url, buffer_size): Creates a new decoder with the shared context
CUDA Context Initialization:
from cuda import cuda as cuda_driver  # cuda-python driver bindings

err, = cuda_driver.cuInit(0)  # initialize the CUDA driver API once
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)  # retain the device's primary context
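A minimal sketch of the singleton wiring, assuming a StreamDecoder constructor that accepts the retained context (the constructor signature is illustrative, not confirmed from the module):
import threading
from cuda import cuda as cuda_driver
from services.stream_decoder import StreamDecoder  # per the project layout below

class StreamDecoderFactory:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_factory(cls, gpu_id: int = 0) -> "StreamDecoderFactory":
        # Double-checked locking so concurrent callers get exactly one instance
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(gpu_id)
        return cls._instance

    def __init__(self, gpu_id: int = 0):
        err, = cuda_driver.cuInit(0)
        err, self.cuda_device = cuda_driver.cuDeviceGet(gpu_id)
        # Retain the device's primary context once; every decoder reuses it
        err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)

    def create_decoder(self, rtsp_url: str, buffer_size: int = 30) -> StreamDecoder:
        # Decoders share self.cuda_context instead of each creating their own
        return StreamDecoder(rtsp_url, self.cuda_context, buffer_size)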
StreamDecoder
Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.
Key Features:
- Thread-safe frame buffer (deque; sketched below)
- Connection status tracking
- Automatic reconnection handling
- Background thread for continuous decoding
Key Methods:
- start(): Start the decoding thread
- stop(): Stop and clean up
- get_latest_frame(): Get the most recent RGB frame (GPU tensor)
- is_connected(): Check connection status
- get_buffer_size(): Current buffer size
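A minimal sketch of the ring-buffer pattern behind these methods (field names are illustrative, not the module's actual internals):
import threading
from collections import deque
from typing import Optional
import torch

class FrameBuffer:
    """Thread-safe ring buffer: the decode thread pushes, callers pull the newest frame."""

    def __init__(self, maxlen: int = 30):
        self._frames: deque = deque(maxlen=maxlen)  # old frames fall off automatically
        self._lock = threading.Lock()

    def push(self, frame: torch.Tensor) -> None:
        with self._lock:
            self._frames.append(frame)

    def latest(self) -> Optional[torch.Tensor]:
        with self._lock:
            return self._frames[-1] if self._frames else None

    def __len__(self) -> int:
        with self._lock:
            return len(self._frames)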
JPEGEncoderFactory
Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.
Key Function:
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes a GPU RGB tensor to JPEG bytes without a CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.
    Performance: 1-2 ms per 720p frame
    """
Technical Implementation
Shared CUDA Context Pattern
# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)
# All decoders share same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)
Benefits:
- 70% VRAM reduction per stream
- Single decoder initialization overhead
- Efficient resource sharing
NV12 to RGB Conversion (GPU)
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV 4:2:0) to RGB entirely on the GPU using PyTorch ops.
    Uses BT.601 color space conversion.
    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """
Steps:
- Split Y and UV planes
- Deinterleave UV components
- Upsample chroma (bilinear interpolation)
- Apply BT.601 color matrix
- Clamp to [0, 255]
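Put together, these steps map onto a handful of tensor ops. A sketch assuming full-range BT.601 coefficients (the project may use the limited-range variant):
import torch
import torch.nn.functional as F

def nv12_to_rgb_gpu(nv12: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # 1-2. Split Y from the interleaved UV plane and deinterleave U/V
    y = nv12[:height, :].float()
    uv = nv12[height:, :].float().reshape(height // 2, width // 2, 2)
    u = uv[..., 0] - 128.0
    v = uv[..., 1] - 128.0
    # 3. Upsample chroma from quarter to full resolution (bilinear)
    u = F.interpolate(u[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    v = F.interpolate(v[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    # 4. Full-range BT.601 matrix
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    # 5. Clamp and pack as (3, H, W) uint8
    return torch.stack([r, g, b]).clamp_(0, 255).to(torch.uint8)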
Zero-Copy Operations
PyTorch → nvImageCodec hand-off (via __cuda_array_interface__):
# GPU tensor stays on GPU
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc) # Uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
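To verify the hand-off really is zero-copy (illustrative check; assumes the nvimgcodec image exposes __cuda_array_interface__ just as the tensor does):
# Both objects should report the same device pointer: same memory, no copy
src_ptr = rgb_hwc.__cuda_array_interface__["data"][0]
dst_ptr = nv_image.__cuda_array_interface__["data"][0]
assert src_ptr == dst_ptr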
Performance Metrics
VRAM Usage (Python Process)
| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|---|---|---|---|---|
| 0 | 216 MB | 0 MB | - | - |
| 1 | 278 MB | 62 MB | 62.0 MB | 62 MB |
| 2 | 338 MB | 122 MB | 61.0 MB | 60 MB |
| 3 | 398 MB | 182 MB | 60.7 MB | 60 MB |
| 4 | 458 MB | 242 MB | 60.5 MB | 60 MB |
Result: Perfect linear scaling at ~60 MB per stream
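The marginal-cost column can be reproduced directly from the measurements:
vram_mb = [216, 278, 338, 398, 458]  # total VRAM at 0-4 streams (table above)
marginal = [b - a for a, b in zip(vram_mb, vram_mb[1:])]
print(marginal)  # [62, 60, 60, 60] -> ~60 MB per additional stream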
Capacity Estimates
With 60 MB per stream + 216 MB baseline:
- 16GB GPU: ~269 cameras (conservative: ~250)
- 24GB GPU: ~406 cameras (conservative: ~380)
- 48GB GPU: ~815 cameras (conservative: ~780)
- For 1000 streams: ~60GB VRAM required
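Each estimate is floor((VRAM − baseline) / per-stream cost); as a quick sanity check:
def max_streams(vram_gib: float, baseline_mb: float = 216.0, per_stream_mb: float = 60.0) -> int:
    return int((vram_gib * 1024 - baseline_mb) // per_stream_mb)

for gib in (16, 24, 48):
    print(f"{gib} GiB -> {max_streams(gib)} streams")  # 269, 406, 815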
Throughput
- Frame Rate: 7-7.5 FPS per stream @ 720p
- JPEG Encoding: 1-2ms per frame
- Connection Time: ~15s for stream stabilization
Project Structure
python-rtsp-worker/
├── app.py # FastAPI application
├── services/
│ ├── __init__.py # Package exports
│ ├── stream_decoder.py # StreamDecoder & Factory
│ └── jpeg_encoder.py # JPEG encoding utilities
├── test_stream.py # Single stream test
├── test_multi_stream.py # 4-stream test with monitoring
├── test_vram_scaling.py # System VRAM measurement
├── test_vram_process.py # Process VRAM measurement
├── test_jpeg_encode.py # JPEG encoding test
├── requirements.txt # Python dependencies
├── .env # Camera URLs (gitignored)
├── .env.example # Template for camera URLs
└── .gitignore
Dependencies
fastapi # Web framework
uvicorn[standard] # ASGI server
torch # GPU tensor operations
PyNvVideoCodec # NVDEC hardware decoding
av # FFmpeg/RTSP client
cuda-python # CUDA driver bindings
nvidia-nvimgcodec-cu12 # nvJPEG encoding
python-dotenv # Environment variables
Configuration
Environment Variables (.env)
# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...
Loading URLs in Code
from dotenv import load_dotenv
import os
load_dotenv()
camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break
Usage Examples
Basic Usage
from services import StreamDecoderFactory, encode_frame_to_jpeg
import time

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for connection
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on GPU)
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)
    # Save or transmit jpeg_bytes
    with open("frame.jpg", "wb") as f:
        f.write(jpeg_bytes)

# Cleanup
decoder.stop()
Multi-Stream Usage
from services import StreamDecoderFactory
import time

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share context)
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()
Testing
Run Single Stream Test
python test_stream.py
Run 4-Stream Test with VRAM Monitoring
python test_multi_stream.py
Measure VRAM Scaling
python test_vram_process.py
Test JPEG Encoding
python test_jpeg_encode.py
Known Issues
Segmentation Faults on Cleanup
- Status: Non-critical
- Impact: Occurs during cleanup; does not affect core functionality
- Cause: Likely CUDA context teardown ordering
- Workaround: None needed; cleanup errors can be safely ignored
Technical Decisions
Why PyNvVideoCodec?
- Direct access to NVDEC hardware decoder
- Minimal overhead compared to FFmpeg/torchaudio
- Returns GPU tensors via DLPack
- Better control over decode sessions
Why Shared CUDA Context?
- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
- Enables the 1000-stream target within a ~60 GB total VRAM budget
- Minimal complexity overhead with singleton pattern
Why nvImageCodec?
- GPU-native JPEG encoding (nvJPEG)
- Zero-copy with PyTorch via __cuda_array_interface__
- 1-2 ms encoding time per 720p frame
- Keeps data on GPU until final compression
Why Thread-Safe Ring Buffer?
- Decouples decoding from inference pipeline
- Prevents frame drops during processing spikes
- Allows async frame access
- Configurable buffer size per stream
Future Considerations
Hardware Decode Session Limits
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
- May need multiple GPUs for 1000 streams
- Test with actual hardware to verify limits
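If decode sessions rather than VRAM become the binding constraint, the GPU count follows directly:
import math

# GPUs needed for 1000 streams at various per-GPU decode-session limits
for sessions_per_gpu in (5, 15, 30):
    print(sessions_per_gpu, "sessions/GPU ->", math.ceil(1000 / sessions_per_gpu), "GPUs")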
Scaling Beyond 1000 Streams
- Multi-GPU support with context per GPU
- Load balancing across GPUs
- Network bandwidth considerations
TensorRT Integration
- Next step: Integrate with TensorRT inference pipeline
- GPU frames → TensorRT → Results
- Keep entire pipeline on GPU
License
This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.