feat: inference subsystem and optimization to decoder

commit 3c83a57e44
19 changed files with 3897 additions and 0 deletions

claude.md (new file, 373 lines)

# GPU-Accelerated RTSP Stream Processing System

## Project Overview

A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through a shared CUDA context and keeps all processing on the GPU until final JPEG compression.

## Key Achievements

- **Shared CUDA Context**: 70% VRAM reduction (from ~200 MB to ~60 MB per stream)
- **Linear VRAM Scaling**: near-perfect scaling at ~60 MB per additional stream
- **Zero-Copy Pipeline**: all processing stays on the GPU until the final JPEG bytes
- **Proven Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM

## Architecture

### Pipeline Flow

```
RTSP Stream → PyAV (CPU)
        ↓
NVDEC Decode (GPU) → NV12 Format
        ↓
NV12 to RGB (GPU) → PyTorch Ops
        ↓
nvJPEG Encode (GPU) → JPEG Bytes
        ↓
CPU (JPEG only)
```

### Core Components

#### StreamDecoderFactory

Singleton factory managing a shared CUDA context across all decoder instances.

**Key Methods:**
- `get_factory(gpu_id)`: Returns the singleton instance
- `create_decoder(rtsp_url, buffer_size)`: Creates a new decoder on the shared context

**CUDA Context Initialization:**
```python
# Initialize the CUDA driver API, then retain the device's primary
# context so that every decoder created by this factory shares it.
err, = cuda_driver.cuInit(0)
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
```
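
A minimal sketch of the singleton access pattern behind `get_factory` (illustrative only; the module-level lock and state are assumptions, not the actual implementation):

```python
import threading

from services import StreamDecoderFactory

_factory_lock = threading.Lock()
_factory_instance = None

def get_factory(gpu_id: int = 0) -> "StreamDecoderFactory":
    """Return the process-wide factory, creating it on first use."""
    global _factory_instance
    with _factory_lock:
        if _factory_instance is None:
            _factory_instance = StreamDecoderFactory(gpu_id=gpu_id)
        return _factory_instance
```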

#### StreamDecoder

Individual stream decoder with NVDEC hardware acceleration and a thread-safe ring buffer.

**Key Features:**
- Thread-safe frame buffer (deque; see the sketch below)
- Connection status tracking
- Automatic reconnection handling
- Background thread for continuous decoding

**Key Methods:**
- `start()`: Start the decoding thread
- `stop()`: Stop and clean up
- `get_latest_frame()`: Get the most recent RGB frame (GPU tensor)
- `is_connected()`: Check connection status
- `get_buffer_size()`: Current buffer size
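
A minimal sketch of that ring-buffer pattern; `FrameRingBuffer`, `push`, and `latest` are illustrative names, not the decoder's real internals. The decode thread appends, consumers read the newest frame, and `deque(maxlen=...)` drops the oldest frames automatically:

```python
import threading
from collections import deque

class FrameRingBuffer:
    """Illustrative thread-safe ring buffer (not the actual class)."""

    def __init__(self, buffer_size: int = 30):
        self._frames = deque(maxlen=buffer_size)  # oldest frames drop automatically
        self._lock = threading.Lock()

    def push(self, frame) -> None:
        # Called from the background decode thread for each decoded frame.
        with self._lock:
            self._frames.append(frame)

    def latest(self):
        # Called by consumers; returns the newest frame, or None if empty.
        with self._lock:
            return self._frames[-1] if self._frames else None
```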

#### JPEGEncoderFactory

Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.

**Key Function:**
```python
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes a GPU RGB tensor (3, H, W) to JPEG bytes without a CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.

    Performance: 1-2 ms per 720p frame
    """
    # Body sketched from the Zero-Copy Operations section below; `encoder`
    # is the shared nvimgcodec encoder held by JPEGEncoderFactory.
    rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()  # CHW -> HWC, stays on GPU
    nv_image = nvimgcodec.as_image(rgb_hwc)
    encode_params = nvimgcodec.EncodeParams(quality=quality)
    return encoder.encode(nv_image, "jpeg", encode_params)
```

## Technical Implementation

### Shared CUDA Context Pattern

```python
# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)

# All decoders share the same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)
```

**Benefits:**
- 70% VRAM reduction per stream
- Decoder initialization overhead paid once, not per stream
- Efficient resource sharing

### NV12 to RGB Conversion (GPU)

```python
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV420) to RGB entirely on the GPU using PyTorch ops.
    Uses BT.601 color space conversion.

    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """
```

**Steps (see the sketch below):**
1. Split Y and UV planes
2. Deinterleave UV components
3. Upsample chroma (bilinear interpolation)
4. Apply BT.601 color matrix
5. Clamp to [0, 255]
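
A minimal runnable sketch of those five steps; this is an illustration under stated assumptions (video-range BT.601 coefficients, uint8 NV12 input already resident on the GPU), not the project's exact code:

```python
import torch
import torch.nn.functional as F

def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Illustrative NV12 -> RGB; assumes uint8 input of shape (height * 3 // 2, width)."""
    # 1. Split the full-resolution Y plane from the half-height interleaved UV plane
    y = nv12_tensor[:height, :].float()
    uv = nv12_tensor[height:, :].float()
    # 2. Deinterleave UV: each row holds U0 V0 U1 V1 ...
    uv = uv.reshape(height // 2, width // 2, 2)
    u, v = uv[..., 0], uv[..., 1]
    # 3. Upsample chroma to full resolution (bilinear)
    u = F.interpolate(u[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    v = F.interpolate(v[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    # 4. Apply the BT.601 (video range) color matrix
    y, u, v = y - 16.0, u - 128.0, v - 128.0
    r = 1.164 * y + 1.596 * v
    g = 1.164 * y - 0.392 * u - 0.813 * v
    b = 1.164 * y + 2.017 * u
    # 5. Clamp to [0, 255] and stack as (3, H, W)
    return torch.stack((r, g, b)).clamp_(0, 255).to(torch.uint8)
```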

### Zero-Copy Operations

**PyTorch → nvImageCodec hand-off:**
```python
# GPU tensor stays on the GPU end to end
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc)  # uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
```

## Performance Metrics

### VRAM Usage (Python Process)

| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|---------|-----------|----------|------------|---------------|
| 0       | 216 MB    | 0 MB     | -          | -             |
| 1       | 278 MB    | 62 MB    | 62.0 MB    | 62 MB         |
| 2       | 338 MB    | 122 MB   | 61.0 MB    | 60 MB         |
| 3       | 398 MB    | 182 MB   | 60.7 MB    | 60 MB         |
| 4       | 458 MB    | 242 MB   | 60.5 MB    | 60 MB         |

**Result:** Near-perfect linear scaling at ~60 MB per stream

### Capacity Estimates

With 60 MB per stream plus the 216 MB baseline (see the helper below):

- **16 GB GPU**: ~269 cameras (conservative: ~250)
- **24 GB GPU**: ~407 cameras (conservative: ~380)
- **48 GB GPU**: ~815 cameras (conservative: ~780)
- **For 1000 streams**: ~60 GB total VRAM required
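
A back-of-envelope helper reproducing these estimates from the measured numbers (results land within a camera or two of the bullets, depending on rounding):

```python
def max_streams(vram_gb: float, baseline_mb: float = 216.0, per_stream_mb: float = 60.0) -> int:
    """Streams that fit in a GPU's VRAM, given the baseline plus marginal cost."""
    return int((vram_gb * 1024 - baseline_mb) // per_stream_mb)

for gb in (16, 24, 48):
    print(f"{gb} GB GPU -> ~{max_streams(gb)} cameras")
```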

### Throughput

- **Frame Rate**: 7-7.5 FPS per stream @ 720p
- **JPEG Encoding**: 1-2 ms per frame
- **Connection Time**: ~15 s for stream stabilization

## Project Structure

```
python-rtsp-worker/
├── app.py                  # FastAPI application
├── services/
│   ├── __init__.py         # Package exports
│   ├── stream_decoder.py   # StreamDecoder & Factory
│   └── jpeg_encoder.py     # JPEG encoding utilities
├── test_stream.py          # Single stream test
├── test_multi_stream.py    # 4-stream test with monitoring
├── test_vram_scaling.py    # System VRAM measurement
├── test_vram_process.py    # Process VRAM measurement
├── test_jpeg_encode.py     # JPEG encoding test
├── requirements.txt        # Python dependencies
├── .env                    # Camera URLs (gitignored)
├── .env.example            # Template for camera URLs
└── .gitignore
```

## Dependencies

```
fastapi                  # Web framework
uvicorn[standard]        # ASGI server
torch                    # GPU tensor operations
PyNvVideoCodec           # NVDEC hardware decoding
av                       # FFmpeg/RTSP client
cuda-python              # CUDA driver bindings
nvidia-nvimgcodec-cu12   # nvJPEG encoding
python-dotenv            # Environment variables
```
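
Assuming a CUDA 12 environment (to match the `-cu12` wheel), everything installs from the requirements file:

```bash
pip install -r requirements.txt
```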

## Configuration

### Environment Variables (.env)

```bash
# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...
```

### Loading URLs in Code

```python
import os

from dotenv import load_dotenv

load_dotenv()

# Collect CAMERA_URL_1, CAMERA_URL_2, ... until the first missing index
camera_urls = []
i = 1
while True:
    url = os.getenv(f"CAMERA_URL_{i}")
    if url is None:
        break
    camera_urls.append(url)
    i += 1
```

## Usage Examples

### Basic Usage

```python
import time

from services import StreamDecoderFactory, encode_frame_to_jpeg

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for the stream to connect
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on the GPU); returns None on failure
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)

    # Save or transmit jpeg_bytes
    if jpeg_bytes:
        with open("frame.jpg", "wb") as f:
            f.write(jpeg_bytes)

# Cleanup
decoder.stop()
```

### Multi-Stream Usage

```python
import time

from services import StreamDecoderFactory

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share one context);
# camera_urls comes from the Configuration section above
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections to stabilize
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()
```

## Testing

### Run Single Stream Test
```bash
python test_stream.py
```

### Run 4-Stream Test with VRAM Monitoring
```bash
python test_multi_stream.py
```

### Measure VRAM Scaling
```bash
python test_vram_process.py
```

### Test JPEG Encoding
```bash
python test_jpeg_encode.py
```

## Known Issues

### Segmentation Faults on Cleanup

- **Status**: Non-critical
- **Impact**: Occurs during cleanup only; does not affect core functionality
- **Cause**: Likely CUDA context teardown-order issues
- **Workaround**: Functionality works correctly; cleanup errors can be ignored (one possible ordering fix is sketched below)
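
A hedged sketch of that ordering fix, assuming the factory retains the primary context as shown earlier; `shutdown` is a hypothetical helper, not part of the codebase:

```python
def shutdown(factory, decoders):
    # Stop every decode thread while the shared context is still alive...
    for d in decoders:
        d.stop()
    # ...then release the primary context exactly once, last of all.
    err, = cuda_driver.cuDevicePrimaryCtxRelease(factory.cuda_device)
```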

## Technical Decisions

### Why PyNvVideoCodec?
- Direct access to the NVDEC hardware decoder
- Minimal overhead compared to FFmpeg/torchaudio
- Returns GPU tensors via DLPack
- Better control over decode sessions

### Why a Shared CUDA Context?
- Reduces VRAM from ~200 MB to ~60 MB per stream (70% savings)
- Enables the 1000-stream target within ~60 GB of VRAM
- Minimal complexity overhead with the singleton pattern

### Why nvImageCodec?
- GPU-native JPEG encoding (nvJPEG)
- Zero-copy with PyTorch via `__cuda_array_interface__`
- 1-2 ms encoding time per 720p frame
- Keeps data on the GPU until final compression

### Why a Thread-Safe Ring Buffer?
- Decouples decoding from the inference pipeline
- Prevents frame drops during processing spikes
- Allows async frame access
- Configurable buffer size per stream

## Future Considerations

### Hardware Decode Session Limits
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
- Multiple GPUs may be needed for 1000 streams
- Test on the actual hardware to verify limits

### Scaling Beyond 1000 Streams
- Multi-GPU support with one context per GPU (see the sketch below)
- Load balancing across GPUs
- Network bandwidth considerations
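
A minimal sketch of that multi-GPU fan-out, assuming one factory (and hence one shared context) per device; `gpu_count` is a placeholder and `camera_urls` comes from the Configuration section:

```python
from services import StreamDecoderFactory

gpu_count = 2  # placeholder; set to the number of installed GPUs
factories = [StreamDecoderFactory(gpu_id=i) for i in range(gpu_count)]

# Round-robin load balancing: stream i lands on GPU i % gpu_count
decoders = []
for i, url in enumerate(camera_urls):
    decoder = factories[i % gpu_count].create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)
```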

### TensorRT Integration
- Next step: integrate with a TensorRT inference pipeline
- GPU frames → TensorRT → results
- Keep the entire pipeline on the GPU

## References

- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)

## License

This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.