feat: inference subsystem and optimization to decoder

commit 3c83a57e44
19 changed files with 3897 additions and 0 deletions

claude.md (new file, 373 lines)

# GPU-Accelerated RTSP Stream Processing System

## Project Overview

A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through a shared CUDA context and keeps all processing on the GPU until final JPEG compression.

## Key Achievements

- **Shared CUDA Context**: 70% VRAM reduction (from ~200 MB to ~60 MB per stream)
- **Linear VRAM Scaling**: near-perfect scaling at ~60 MB per additional stream
- **Zero-Copy Pipeline**: all processing stays on the GPU until the final JPEG bytes
- **Proven Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM

## Architecture

### Pipeline Flow

```
RTSP Stream → PyAV (CPU)
        ↓
NVDEC Decode (GPU) → NV12 Format
        ↓
NV12 to RGB (GPU) → PyTorch Ops
        ↓
nvJPEG Encode (GPU) → JPEG Bytes
        ↓
CPU (JPEG only)
```

### Core Components

#### StreamDecoderFactory

Singleton factory managing a shared CUDA context across all decoder instances.

**Key Methods:**
- `get_factory(gpu_id)`: Returns the singleton instance
- `create_decoder(rtsp_url, buffer_size)`: Creates a new decoder on the shared context

**CUDA Context Initialization:**
```python
# Initialize the CUDA driver API, then retain the device's primary
# context so that every decoder created by this factory shares it.
err, = cuda_driver.cuInit(0)
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
```
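
A minimal sketch of the singleton access pattern behind `get_factory` (illustrative only; the module-level lock and state are assumptions, not the actual implementation):

```python
import threading

from services import StreamDecoderFactory

_factory_lock = threading.Lock()
_factory_instance = None

def get_factory(gpu_id: int = 0) -> "StreamDecoderFactory":
    """Return the process-wide factory, creating it on first use."""
    global _factory_instance
    with _factory_lock:
        if _factory_instance is None:
            _factory_instance = StreamDecoderFactory(gpu_id=gpu_id)
        return _factory_instance
```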

#### StreamDecoder

Individual stream decoder with NVDEC hardware acceleration and a thread-safe ring buffer.

**Key Features:**
- Thread-safe frame buffer (deque; see the sketch below)
- Connection status tracking
- Automatic reconnection handling
- Background thread for continuous decoding

**Key Methods:**
- `start()`: Start the decoding thread
- `stop()`: Stop and clean up
- `get_latest_frame()`: Get the most recent RGB frame (GPU tensor)
- `is_connected()`: Check connection status
- `get_buffer_size()`: Current buffer size
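
A minimal sketch of that ring-buffer pattern; `FrameRingBuffer`, `push`, and `latest` are illustrative names, not the decoder's real internals. The decode thread appends, consumers read the newest frame, and `deque(maxlen=...)` drops the oldest frames automatically:

```python
import threading
from collections import deque

class FrameRingBuffer:
    """Illustrative thread-safe ring buffer (not the actual class)."""

    def __init__(self, buffer_size: int = 30):
        self._frames = deque(maxlen=buffer_size)  # oldest frames drop automatically
        self._lock = threading.Lock()

    def push(self, frame) -> None:
        # Called from the background decode thread for each decoded frame.
        with self._lock:
            self._frames.append(frame)

    def latest(self):
        # Called by consumers; returns the newest frame, or None if empty.
        with self._lock:
            return self._frames[-1] if self._frames else None
```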

#### JPEGEncoderFactory

Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.

**Key Function:**
```python
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes a GPU RGB tensor (3, H, W) to JPEG bytes without a CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.

    Performance: 1-2 ms per 720p frame
    """
    # Body sketched from the Zero-Copy Operations section below; `encoder`
    # is the shared nvimgcodec encoder held by JPEGEncoderFactory.
    rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()  # CHW -> HWC, stays on GPU
    nv_image = nvimgcodec.as_image(rgb_hwc)
    encode_params = nvimgcodec.EncodeParams(quality=quality)
    return encoder.encode(nv_image, "jpeg", encode_params)
```

## Technical Implementation

### Shared CUDA Context Pattern

```python
# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)

# All decoders share the same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)
```

**Benefits:**
- 70% VRAM reduction per stream
- Decoder initialization overhead paid once, not per stream
- Efficient resource sharing

### NV12 to RGB Conversion (GPU)

```python
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV420) to RGB entirely on the GPU using PyTorch ops.
    Uses BT.601 color space conversion.

    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """
```

**Steps (see the sketch below):**
1. Split Y and UV planes
2. Deinterleave UV components
3. Upsample chroma (bilinear interpolation)
4. Apply BT.601 color matrix
5. Clamp to [0, 255]
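
A minimal runnable sketch of those five steps; this is an illustration under stated assumptions (video-range BT.601 coefficients, uint8 NV12 input already resident on the GPU), not the project's exact code:

```python
import torch
import torch.nn.functional as F

def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Illustrative NV12 -> RGB; assumes uint8 input of shape (height * 3 // 2, width)."""
    # 1. Split the full-resolution Y plane from the half-height interleaved UV plane
    y = nv12_tensor[:height, :].float()
    uv = nv12_tensor[height:, :].float()
    # 2. Deinterleave UV: each row holds U0 V0 U1 V1 ...
    uv = uv.reshape(height // 2, width // 2, 2)
    u, v = uv[..., 0], uv[..., 1]
    # 3. Upsample chroma to full resolution (bilinear)
    u = F.interpolate(u[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    v = F.interpolate(v[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    # 4. Apply the BT.601 (video range) color matrix
    y, u, v = y - 16.0, u - 128.0, v - 128.0
    r = 1.164 * y + 1.596 * v
    g = 1.164 * y - 0.392 * u - 0.813 * v
    b = 1.164 * y + 2.017 * u
    # 5. Clamp to [0, 255] and stack as (3, H, W)
    return torch.stack((r, g, b)).clamp_(0, 255).to(torch.uint8)
```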

### Zero-Copy Operations

**PyTorch → nvImageCodec hand-off:**
```python
# GPU tensor stays on the GPU end to end
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc)  # uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
```

## Performance Metrics

### VRAM Usage (Python Process)

| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|---------|-----------|----------|------------|---------------|
| 0       | 216 MB    | 0 MB     | -          | -             |
| 1       | 278 MB    | 62 MB    | 62.0 MB    | 62 MB         |
| 2       | 338 MB    | 122 MB   | 61.0 MB    | 60 MB         |
| 3       | 398 MB    | 182 MB   | 60.7 MB    | 60 MB         |
| 4       | 458 MB    | 242 MB   | 60.5 MB    | 60 MB         |

**Result:** Near-perfect linear scaling at ~60 MB per stream

### Capacity Estimates

With 60 MB per stream plus the 216 MB baseline (see the helper below):

- **16 GB GPU**: ~269 cameras (conservative: ~250)
- **24 GB GPU**: ~407 cameras (conservative: ~380)
- **48 GB GPU**: ~815 cameras (conservative: ~780)
- **For 1000 streams**: ~60 GB total VRAM required
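
A back-of-envelope helper reproducing these estimates from the measured numbers (results land within a camera or two of the bullets, depending on rounding):

```python
def max_streams(vram_gb: float, baseline_mb: float = 216.0, per_stream_mb: float = 60.0) -> int:
    """Streams that fit in a GPU's VRAM, given the baseline plus marginal cost."""
    return int((vram_gb * 1024 - baseline_mb) // per_stream_mb)

for gb in (16, 24, 48):
    print(f"{gb} GB GPU -> ~{max_streams(gb)} cameras")
```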

### Throughput

- **Frame Rate**: 7-7.5 FPS per stream @ 720p
- **JPEG Encoding**: 1-2 ms per frame
- **Connection Time**: ~15 s for stream stabilization

## Project Structure

```
python-rtsp-worker/
├── app.py                  # FastAPI application
├── services/
│   ├── __init__.py         # Package exports
│   ├── stream_decoder.py   # StreamDecoder & Factory
│   └── jpeg_encoder.py     # JPEG encoding utilities
├── test_stream.py          # Single stream test
├── test_multi_stream.py    # 4-stream test with monitoring
├── test_vram_scaling.py    # System VRAM measurement
├── test_vram_process.py    # Process VRAM measurement
├── test_jpeg_encode.py     # JPEG encoding test
├── requirements.txt        # Python dependencies
├── .env                    # Camera URLs (gitignored)
├── .env.example            # Template for camera URLs
└── .gitignore
```

## Dependencies

```
fastapi                  # Web framework
uvicorn[standard]        # ASGI server
torch                    # GPU tensor operations
PyNvVideoCodec           # NVDEC hardware decoding
av                       # FFmpeg/RTSP client
cuda-python              # CUDA driver bindings
nvidia-nvimgcodec-cu12   # nvJPEG encoding
python-dotenv            # Environment variables
```
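
Assuming a CUDA 12 environment (to match the `-cu12` wheel), everything installs from the requirements file:

```bash
pip install -r requirements.txt
```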

## Configuration

### Environment Variables (.env)

```bash
# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...
```

### Loading URLs in Code

```python
import os

from dotenv import load_dotenv

load_dotenv()

# Collect CAMERA_URL_1, CAMERA_URL_2, ... until the first missing index
camera_urls = []
i = 1
while True:
    url = os.getenv(f"CAMERA_URL_{i}")
    if url is None:
        break
    camera_urls.append(url)
    i += 1
```

## Usage Examples

### Basic Usage

```python
import time

from services import StreamDecoderFactory, encode_frame_to_jpeg

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for the stream to connect
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on the GPU); returns None on failure
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)

    # Save or transmit jpeg_bytes
    if jpeg_bytes:
        with open("frame.jpg", "wb") as f:
            f.write(jpeg_bytes)

# Cleanup
decoder.stop()
```

### Multi-Stream Usage

```python
import time

from services import StreamDecoderFactory

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share one context);
# camera_urls comes from the Configuration section above
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections to stabilize
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()
```

## Testing

### Run Single Stream Test
```bash
python test_stream.py
```

### Run 4-Stream Test with VRAM Monitoring
```bash
python test_multi_stream.py
```

### Measure VRAM Scaling
```bash
python test_vram_process.py
```

### Test JPEG Encoding
```bash
python test_jpeg_encode.py
```

## Known Issues

### Segmentation Faults on Cleanup

- **Status**: Non-critical
- **Impact**: Occurs during cleanup only; does not affect core functionality
- **Cause**: Likely CUDA context teardown-order issues
- **Workaround**: Functionality works correctly; cleanup errors can be ignored (one possible ordering fix is sketched below)
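
A hedged sketch of that ordering fix, assuming the factory retains the primary context as shown earlier; `shutdown` is a hypothetical helper, not part of the codebase:

```python
def shutdown(factory, decoders):
    # Stop every decode thread while the shared context is still alive...
    for d in decoders:
        d.stop()
    # ...then release the primary context exactly once, last of all.
    err, = cuda_driver.cuDevicePrimaryCtxRelease(factory.cuda_device)
```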

## Technical Decisions

### Why PyNvVideoCodec?
- Direct access to the NVDEC hardware decoder
- Minimal overhead compared to FFmpeg/torchaudio
- Returns GPU tensors via DLPack
- Better control over decode sessions

### Why a Shared CUDA Context?
- Reduces VRAM from ~200 MB to ~60 MB per stream (70% savings)
- Enables the 1000-stream target within ~60 GB of VRAM
- Minimal complexity overhead with the singleton pattern

### Why nvImageCodec?
- GPU-native JPEG encoding (nvJPEG)
- Zero-copy with PyTorch via `__cuda_array_interface__`
- 1-2 ms encoding time per 720p frame
- Keeps data on the GPU until final compression

### Why a Thread-Safe Ring Buffer?
- Decouples decoding from the inference pipeline
- Prevents frame drops during processing spikes
- Allows async frame access
- Configurable buffer size per stream

## Future Considerations

### Hardware Decode Session Limits
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
- Multiple GPUs may be needed for 1000 streams
- Test on the actual hardware to verify limits

### Scaling Beyond 1000 Streams
- Multi-GPU support with one context per GPU (see the sketch below)
- Load balancing across GPUs
- Network bandwidth considerations
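
A minimal sketch of that multi-GPU fan-out, assuming one factory (and hence one shared context) per device; `gpu_count` is a placeholder and `camera_urls` comes from the Configuration section:

```python
from services import StreamDecoderFactory

gpu_count = 2  # placeholder; set to the number of installed GPUs
factories = [StreamDecoderFactory(gpu_id=i) for i in range(gpu_count)]

# Round-robin load balancing: stream i lands on GPU i % gpu_count
decoders = []
for i, url in enumerate(camera_urls):
    decoder = factories[i % gpu_count].create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)
```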

### TensorRT Integration
- Next step: integrate with a TensorRT inference pipeline
- GPU frames → TensorRT → results
- Keep the entire pipeline on the GPU

## References

- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)

## License

This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.