# GPU-Accelerated RTSP Stream Processing System

## Project Overview

A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through a shared CUDA context and keeps all processing on the GPU until final JPEG compression.

## Key Achievements

- **Shared CUDA Context**: 70% VRAM reduction (from ~200MB to ~60MB per stream)
- **Linear VRAM Scaling**: ~60 MB per additional stream, measured from 1 to 4 streams
- **Zero-Copy Pipeline**: All processing stays on GPU until JPEG bytes
- **Measured Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM

## Architecture

### Pipeline Flow

```
RTSP Stream → PyAV (CPU)
        ↓
NVDEC Decode (GPU) → NV12 Format
        ↓
NV12 to RGB (GPU) → PyTorch Ops
        ↓
nvJPEG Encode (GPU) → JPEG Bytes
        ↓
CPU (JPEG only)
```

### Core Components

#### StreamDecoderFactory

Singleton factory managing a shared CUDA context across all decoder instances.

**Key Methods:**

- `get_factory(gpu_id)`: Returns singleton instance
- `create_decoder(rtsp_url, buffer_size)`: Creates new decoder with shared context

**CUDA Context Initialization:**

```python
# cuda-python driver calls return a tuple whose first element is the error code
err, = cuda_driver.cuInit(0)
# Retain the device's primary context so every decoder can share it
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
```

#### StreamDecoder

Individual stream decoder with NVDEC hardware acceleration and a thread-safe ring buffer.

**Key Features:**

- Thread-safe frame buffer (deque)
- Connection status tracking
- Automatic reconnection handling
- Background thread for continuous decoding

**Key Methods:**

- `start()`: Start decoding thread
- `stop()`: Stop and cleanup
- `get_latest_frame()`: Get most recent RGB frame (GPU tensor)
- `is_connected()`: Check connection status
- `get_buffer_size()`: Current buffer size

#### JPEGEncoderFactory

Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.
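Both factories above are described as shared instances. A minimal sketch of that per-GPU singleton pattern follows. The class name `SharedResourceFactory` and the `shared_state` attribute are hypothetical stand-ins, not the project's actual code; the real factories would hold the CUDA context or the nvImageCodec encoder instead of a placeholder object.

```python
import threading
from typing import Dict


class SharedResourceFactory:
    """Hypothetical per-GPU singleton (illustration only, not the project's class)."""

    _instances: Dict[int, "SharedResourceFactory"] = {}
    _lock = threading.Lock()

    def __init__(self, gpu_id: int) -> None:
        self.gpu_id = gpu_id
        # Placeholder for the expensive shared state
        # (for example a CUDA context or a GPU JPEG encoder).
        self.shared_state = object()

    @classmethod
    def get_factory(cls, gpu_id: int = 0) -> "SharedResourceFactory":
        # The lock keeps two threads from building the shared state twice.
        with cls._lock:
            if gpu_id not in cls._instances:
                cls._instances[gpu_id] = cls(gpu_id)
            return cls._instances[gpu_id]


first = SharedResourceFactory.get_factory(0)
second = SharedResourceFactory.get_factory(0)
other_gpu = SharedResourceFactory.get_factory(1)
```

Here `first` and `second` refer to the same object, so anything created through either one reuses the same shared state, while `other_gpu` gets its own instance.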

**Key Function:**

```python
from typing import Optional

import torch


def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.

    Performance: 1-2ms per 720p frame
    """
```

## Technical Implementation

### Shared CUDA Context Pattern

```python
# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)

# All decoders share same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)
```

**Benefits:**

- 70% VRAM reduction per stream
- Single decoder initialization overhead
- Efficient resource sharing

### NV12 to RGB Conversion (GPU)

```python
import torch


def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV420) to RGB entirely on GPU using PyTorch ops.
    Uses BT.601 color space conversion.

    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """
```

**Steps:**

1. Split Y and UV planes
2. Deinterleave UV components
3. Upsample chroma (bilinear interpolation)
4. Apply BT.601 color matrix
5. Clamp to [0, 255]

### Zero-Copy Operations

**PyTorch ↔ nvImageCodec via `__cuda_array_interface__`:**

```python
# GPU tensor stays on GPU
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc)  # Uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
```

## Performance Metrics

### VRAM Usage (Python Process)

| Streams | Total VRAM | Overhead vs. Baseline | Avg. Per Stream | Marginal Cost |
|---------|------------|-----------------------|-----------------|---------------|
| 0       | 216 MB     | 0 MB                  | -               | -             |
| 1       | 278 MB     | 62 MB                 | 62.0 MB         | 62 MB         |
| 2       | 338 MB     | 122 MB                | 61.0 MB         | 60 MB         |
| 3       | 398 MB     | 182 MB                | 60.7 MB         | 60 MB         |
| 4       | 458 MB     | 242 MB                | 60.5 MB         | 60 MB         |

**Result:** Linear scaling at ~60 MB per additional stream (62 MB for the first)

### Capacity Estimates

With 60 MB per stream + 216 MB baseline, extrapolated from the 4-stream measurement and counting VRAM only:

- **16GB GPU**: ~269 cameras (conservative: ~250)
- **24GB GPU**: ~406 cameras (conservative: ~380)
- **48GB GPU**: ~815 cameras (conservative: ~780)
- **For 1000 streams**: ~60GB VRAM required

### Throughput

- **Frame Rate**: 7-7.5 FPS per stream @ 720p
- **JPEG Encoding**: 1-2ms per frame
- **Connection Time**: ~15s for stream stabilization

## Project Structure

```
python-rtsp-worker/
├── app.py                    # FastAPI application
├── services/
│   ├── __init__.py           # Package exports
│   ├── stream_decoder.py     # StreamDecoder & Factory
│   └── jpeg_encoder.py       # JPEG encoding utilities
├── test_stream.py            # Single stream test
├── test_multi_stream.py      # 4-stream test with monitoring
├── test_vram_scaling.py      # System VRAM measurement
├── test_vram_process.py      # Process VRAM measurement
├── test_jpeg_encode.py       # JPEG encoding test
├── requirements.txt          # Python dependencies
├── .env                      # Camera URLs (gitignored)
├── .env.example              # Template for camera URLs
└── .gitignore
```

## Dependencies

```
fastapi                  # Web framework
uvicorn[standard]        # ASGI server
torch                    # GPU tensor operations
PyNvVideoCodec           # NVDEC hardware decoding
av                       # FFmpeg/RTSP client
cuda-python              # CUDA driver bindings
nvidia-nvimgcodec-cu12   # nvJPEG encoding
python-dotenv            # Environment variables
```

## Configuration

### Environment Variables (.env)

```bash
# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...
```

### Loading URLs in Code

```python
import os

from dotenv import load_dotenv

load_dotenv()

# Collect CAMERA_URL_1, CAMERA_URL_2, ... until the first missing index
camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break
```

## Usage Examples

### Basic Usage

```python
import time

from services import StreamDecoderFactory, encode_frame_to_jpeg

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for connection
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()

if rgb_frame is not None:
    # Encode to JPEG (on GPU)
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)

    # encode_frame_to_jpeg returns Optional[bytes]; skip the write if encoding failed
    if jpeg_bytes is not None:
        # Save or transmit jpeg_bytes
        with open("frame.jpg", "wb") as f:
            f.write(jpeg_bytes)

# Cleanup
decoder.stop()
```

### Multi-Stream Usage

```python
import time

from services import StreamDecoderFactory

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share context)
# camera_urls is the list built in "Loading URLs in Code"
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()
```

## Testing

### Run Single Stream Test

```bash
python test_stream.py
```

### Run 4-Stream Test with VRAM Monitoring

```bash
python test_multi_stream.py
```

### Measure VRAM Scaling

```bash
python test_vram_process.py
```

### Test JPEG Encoding

```bash
python test_jpeg_encode.py
```

## Known Issues

### Segmentation Faults on Cleanup

- **Status**: Non-critical
- **Impact**: Occurs during cleanup; doesn't affect core functionality
- **Cause**: Likely CUDA context cleanup order issues
- **Workaround**: Decoding and encoding work correctly; the errors at cleanup can be ignored

## Technical Decisions

### Why PyNvVideoCodec?

- Direct access to NVDEC hardware decoder
- Minimal overhead compared to FFmpeg/torchaudio
- Returns GPU tensors via DLPack
- Better control over decode sessions

### Why Shared CUDA Context?

- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
- Brings the VRAM needed for the 1000-stream target down to ~60GB
- Minimal complexity overhead with singleton pattern

### Why nvImageCodec?

- GPU-native JPEG encoding (nvJPEG)
- Zero-copy with PyTorch via `__cuda_array_interface__`
- 1-2ms encoding time per 720p frame
- Keeps data on GPU until final compression

### Why Thread-Safe Ring Buffer?

- Decouples decoding from inference pipeline
- Absorbs short processing spikes without stalling the decoder
- Allows async frame access
- Configurable buffer size per stream

## Future Considerations

### Hardware Decode Session Limits

- A single NVIDIA GPU typically sustains only 5-30 concurrent decode sessions, depending on model, resolution, and frame rate
- May need multiple GPUs for 1000 streams
- Test with actual hardware to verify limits

### Scaling Beyond 1000 Streams

- Multi-GPU support with one context per GPU
- Load balancing across GPUs
- Network bandwidth considerations

### TensorRT Integration

- Next step: Integrate with TensorRT inference pipeline
- GPU frames → TensorRT → Results
- Keep entire pipeline on GPU

## References

- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)

## License

This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec), which require NVIDIA GPU hardware and may have specific licensing terms.