# GPU-Accelerated RTSP Stream Processing System

## Project Overview

A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through a shared CUDA context and keeps decoded frames on the GPU through JPEG encoding, so only compressed JPEG bytes are copied back to the CPU.

## Key Achievements

- **Shared CUDA Context**: 70% VRAM reduction (from ~200 MB to ~60 MB per stream)
- **Linear VRAM Scaling**: ~60 MB per additional stream in tests with 1 to 4 streams
- **Zero-Copy Pipeline**: Decoded frames stay on the GPU until they are encoded to JPEG bytes
- **Measured Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM

## Architecture

### Pipeline Flow

```
RTSP Stream → PyAV (CPU)
      ↓
NVDEC Decode (GPU) → NV12 Format
      ↓
NV12 to RGB (GPU) → PyTorch Ops
      ↓
nvJPEG Encode (GPU) → JPEG Bytes
      ↓
CPU (JPEG only)
```

### Core Components

#### StreamDecoderFactory

Singleton factory that manages one shared CUDA context for all decoder instances.

**Key Methods:**

- `get_factory(gpu_id)`: Returns the singleton instance
- `create_decoder(rtsp_url, buffer_size)`: Creates a new decoder that uses the shared context

**CUDA Context Initialization:**

```python
# Driver API calls return a tuple whose first element is the status code
err, = cuda_driver.cuInit(0)
# Retain the device's primary context so every decoder can share it
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
```

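The singleton behaviour described above can be sketched without a GPU. This is only an illustration under assumptions: `DecoderFactorySketch` is a hypothetical stand-in, `get_factory` is modelled as a classmethod, and a plain `object()` takes the place of the retained CUDA context. The project's actual factory may be organised differently.

```python
import threading


class DecoderFactorySketch:
    """Hypothetical stand-in for StreamDecoderFactory: one instance per GPU id."""

    _instances = {}
    _lock = threading.Lock()

    def __init__(self, gpu_id: int):
        self.gpu_id = gpu_id
        # The real factory would retain the CUDA primary context here;
        # a plain object stands in for it so the sketch runs without a GPU.
        self.shared_context = object()

    @classmethod
    def get_factory(cls, gpu_id: int = 0) -> "DecoderFactorySketch":
        # Lock so two threads cannot create two factories for the same GPU.
        with cls._lock:
            if gpu_id not in cls._instances:
                cls._instances[gpu_id] = cls(gpu_id)
            return cls._instances[gpu_id]

    def create_decoder(self, rtsp_url: str, buffer_size: int = 30) -> dict:
        # Every decoder description carries a reference to the same context.
        return {
            "url": rtsp_url,
            "buffer_size": buffer_size,
            "context": self.shared_context,
        }
```

Calling `get_factory(0)` twice returns the same object, so every decoder created through it sees the same `shared_context`.
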
#### StreamDecoder

Individual stream decoder with NVDEC hardware acceleration and a thread-safe ring buffer.

**Key Features:**

- Thread-safe frame buffer (`deque`)
- Connection status tracking
- Automatic reconnection handling
- Background thread for continuous decoding

**Key Methods:**

- `start()`: Start decoding thread
- `stop()`: Stop and cleanup
- `get_latest_frame()`: Get most recent RGB frame (GPU tensor)
- `is_connected()`: Check connection status
- `get_buffer_size()`: Current buffer size

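A thread-safe ring buffer of the kind listed above can be built from `collections.deque` with `maxlen` plus a lock. The sketch below is an assumption about the general approach, not the project's code: `FrameRingBuffer` and `push` are hypothetical names, and frames are treated as opaque objects.

```python
import threading
from collections import deque


class FrameRingBuffer:
    """Hypothetical bounded frame buffer; the oldest frame is dropped when full."""

    def __init__(self, buffer_size: int = 30):
        self._frames = deque(maxlen=buffer_size)
        self._lock = threading.Lock()

    def push(self, frame) -> None:
        # Called by the decode thread for every decoded frame.
        with self._lock:
            self._frames.append(frame)

    def get_latest_frame(self):
        # Called by consumers; returns None until the first frame arrives.
        with self._lock:
            return self._frames[-1] if self._frames else None

    def get_buffer_size(self) -> int:
        with self._lock:
            return len(self._frames)
```

Because `deque(maxlen=...)` discards from the opposite end on append, the decode thread never blocks on a slow consumer; the consumer simply sees the newest frame.
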
#### JPEGEncoderFactory

Shared JPEG encoder that uses nvImageCodec for GPU-accelerated encoding.

**Key Function:**

```python
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.

    Performance: 1-2ms per 720p frame
    """
```

## Technical Implementation

### Shared CUDA Context Pattern

```python
# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)

# All decoders share same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)
```

**Benefits:**

- 70% VRAM reduction per stream
- One-time initialization cost shared by all decoders
- Efficient resource sharing

### NV12 to RGB Conversion (GPU)

```python
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV 4:2:0) to RGB entirely on GPU using PyTorch ops.
    Uses BT.601 color space conversion.

    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """
```

**Steps:**

1. Split Y and UV planes
2. Deinterleave UV components
3. Upsample chroma (bilinear interpolation)
4. Apply BT.601 color matrix
5. Clamp to [0, 255]

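To make the steps above concrete, here is a small CPU reference in pure Python. It is a sketch under stated assumptions, not the GPU implementation: `nv12_to_rgb_reference` is a hypothetical name, the coefficients are the common limited-range (16-235) BT.601 ones, which the source does not specify, and chroma is upsampled by nearest neighbour rather than the bilinear interpolation named in step 3.

```python
def nv12_to_rgb_reference(nv12: bytes, height: int, width: int):
    """Return rows of (R, G, B) tuples for a small NV12 frame (CPU reference)."""
    if height % 2 or width % 2:
        raise ValueError("NV12 requires even dimensions")
    if len(nv12) != height * width * 3 // 2:
        raise ValueError("buffer size does not match height * width * 1.5")

    # Step 1: the Y plane comes first, followed by the interleaved UV plane.
    y_plane = nv12[: height * width]
    uv_plane = nv12[height * width:]

    def clamp(value: float) -> int:
        # Step 5: round and clamp to [0, 255].
        return max(0, min(255, int(round(value))))

    rows = []
    for r in range(height):
        row = []
        for c in range(width):
            luma = 1.164 * (y_plane[r * width + c] - 16)
            # Steps 2-3: each 2x2 pixel block shares one (U, V) pair
            # (nearest-neighbour upsampling, an assumption of this sketch).
            uv_index = (r // 2) * width + (c // 2) * 2
            u = uv_plane[uv_index] - 128
            v = uv_plane[uv_index + 1] - 128
            # Step 4: limited-range BT.601 matrix (assumed).
            row.append((
                clamp(luma + 1.596 * v),
                clamp(luma - 0.392 * u - 0.813 * v),
                clamp(luma + 2.017 * u),
            ))
        rows.append(row)
    return rows
```

A reference like this is mainly useful for checking a GPU version on tiny frames, where per-pixel loops are cheap.
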
### Zero-Copy Operations

**`__cuda_array_interface__` for PyTorch ↔ nvImageCodec:**

```python
# GPU tensor stays on GPU
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()  # (3, H, W) -> (H, W, 3)
nv_image = nvimgcodec.as_image(rgb_hwc)  # Uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
```

## Performance Metrics

### VRAM Usage (at 720p)

| Streams | Total VRAM | Above Baseline | Avg per Stream | Marginal Cost |
|---------|------------|----------------|----------------|---------------|
| 0       | 216 MB     | 0 MB           | -              | -             |
| 1       | 278 MB     | 62 MB          | 62.0 MB        | 62 MB         |
| 2       | 338 MB     | 122 MB         | 61.0 MB        | 60 MB         |
| 3       | 398 MB     | 182 MB         | 60.7 MB        | 60 MB         |
| 4       | 458 MB     | 242 MB         | 60.5 MB        | 60 MB         |

**Result:** Near-linear scaling at ~60 MB per additional stream on top of a 216 MB baseline

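The table suggests a simple linear model for capacity planning. The sketch below takes the 216 MB baseline and the 60 MB marginal cost from the table; the function names are hypothetical, and applying the model beyond the 4 streams actually measured is an extrapolation, not a tested result.

```python
BASELINE_MB = 216    # total VRAM with 0 streams (from the table)
PER_STREAM_MB = 60   # marginal cost per additional stream (from the table)


def estimate_vram_mb(num_streams: int) -> int:
    """Linear estimate: baseline plus a fixed cost per stream."""
    if num_streams < 0:
        raise ValueError("num_streams must be non-negative")
    return BASELINE_MB + PER_STREAM_MB * num_streams


def max_streams(vram_budget_mb: int) -> int:
    """Largest stream count whose estimate fits in the given VRAM budget."""
    return max(0, (vram_budget_mb - BASELINE_MB) // PER_STREAM_MB)
```

For 4 streams the model gives 456 MB against the 458 MB measured, because the first stream cost 62 MB rather than 60 MB. Other limits, such as NVDEC throughput, are not captured by this model.
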
## Project Structure

```
python-rtsp-worker/
├── app.py                    # FastAPI application
├── services/
│   ├── __init__.py           # Package exports
│   ├── stream_decoder.py     # StreamDecoder & Factory
│   └── jpeg_encoder.py       # JPEG encoding utilities
├── test_stream.py            # Single stream test
├── test_multi_stream.py      # 4-stream test with monitoring
├── test_vram_scaling.py      # System VRAM measurement
├── test_vram_process.py      # Process VRAM measurement
├── test_jpeg_encode.py       # JPEG encoding test
├── requirements.txt          # Python dependencies
├── .env                      # Camera URLs (gitignored)
├── .env.example              # Template for camera URLs
└── .gitignore
```

## Configuration

### Environment Variables (.env)

```bash
# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...
```

### Loading URLs in Code

```python
import os

from dotenv import load_dotenv

load_dotenv()

# Collect CAMERA_URL_1, CAMERA_URL_2, ... and stop at the first missing index
camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break
```

## Usage Examples

### Basic Usage

```python
import time

from services import StreamDecoderFactory, encode_frame_to_jpeg

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for connection
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on GPU)
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)

    # encode_frame_to_jpeg returns Optional[bytes], so guard against None
    if jpeg_bytes is not None:
        # Save or transmit jpeg_bytes
        with open("frame.jpg", "wb") as f:
            f.write(jpeg_bytes)

# Cleanup
decoder.stop()
```

### Multi-Stream Usage

```python
from services import StreamDecoderFactory
import time

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share context)
# camera_urls is the list built in "Loading URLs in Code"
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()
```

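The fixed `time.sleep(15)` above could be replaced by polling `is_connected()` with a timeout. The helper below is a hypothetical sketch: `wait_until_connected` is not part of the project, and it relies only on the `is_connected()` method documented earlier.

```python
import time


def wait_until_connected(decoders, timeout_s: float = 15.0, poll_s: float = 0.1):
    """Poll is_connected() until all decoders connect or the timeout expires.

    Returns the list of decoders that are still disconnected.
    """
    deadline = time.monotonic() + timeout_s
    pending = list(decoders)
    while pending:
        # Keep only the decoders that have not connected yet.
        pending = [d for d in pending if not d.is_connected()]
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    return pending
```

An empty return value means every stream connected; otherwise the caller can log or drop the decoders that are returned.
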
## Testing

### Run Single Stream Test

```bash
python test_stream.py
```

### Run 4-Stream Test with VRAM Monitoring

```bash
python test_multi_stream.py
```

### Measure VRAM Scaling

```bash
python test_vram_process.py
```

### Test JPEG Encoding

```bash
python test_jpeg_encode.py
```

## Known Issues

### Segmentation Faults on Cleanup

- **Status**: Non-critical
- **Impact**: Occurs during cleanup, doesn't affect core functionality
- **Cause**: Likely CUDA context cleanup order issues
- **Workaround**: Functionality works correctly; cleanup errors can be ignored

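If the cause is indeed teardown order, one thing that could be tried is stopping every decoder explicitly, in reverse creation order, before the interpreter shuts down. The sketch below is speculative and untested against this segfault: `DecoderRegistry` is a hypothetical helper, and it assumes only that decoders expose `stop()`.

```python
import atexit


class DecoderRegistry:
    """Hypothetical helper that stops decoders in reverse creation order."""

    def __init__(self):
        self._decoders = []
        # atexit handlers run before module teardown at interpreter exit.
        atexit.register(self.stop_all)

    def track(self, decoder):
        self._decoders.append(decoder)
        return decoder

    def stop_all(self) -> None:
        while self._decoders:
            decoder = self._decoders.pop()  # last created, first stopped
            try:
                decoder.stop()
            except Exception as exc:
                # Keep going so one failing decoder does not block the rest.
                print(f"stop() failed: {exc!r}")
```

Usage would look like `decoder = registry.track(factory.create_decoder(url, buffer_size=30))`; calling `stop_all()` more than once is harmless because the list is emptied as it goes.
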
## References

- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)

## License

This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec), which require NVIDIA GPU hardware and may have specific licensing terms.