# GPU-Accelerated RTSP Stream Processing System

## Project Overview

A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through shared CUDA context and keeps all processing on the GPU until final JPEG compression.

## Key Achievements

- **Shared CUDA Context**: 70% VRAM reduction (from ~200MB to ~60MB per stream)
- **Linear VRAM Scaling**: Perfect scaling at 60 MB per additional stream
- **Zero-Copy Pipeline**: All processing stays on GPU until JPEG bytes
- **Proven Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM

## Architecture

### Pipeline Flow
```
RTSP Stream → PyAV (CPU)
           ↓
    NVDEC Decode (GPU) → NV12 Format
           ↓
    NV12 to RGB (GPU) → PyTorch Ops
           ↓
    nvJPEG Encode (GPU) → JPEG Bytes
           ↓
    CPU (JPEG only)
```

### Core Components

#### StreamDecoderFactory
Singleton factory managing shared CUDA context across all decoder instances.

**Key Methods:**
- `get_factory(gpu_id)`: Returns singleton instance
- `create_decoder(rtsp_url, buffer_size)`: Creates new decoder with shared context

**CUDA Context Initialization:**
```python
err, = cuda_driver.cuInit(0)
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
```

#### StreamDecoder
Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.

**Key Features:**
- Thread-safe frame buffer (deque)
- Connection status tracking
- Automatic reconnection handling
- Background thread for continuous decoding

**Key Methods:**
- `start()`: Start decoding thread
- `stop()`: Stop and cleanup
- `get_latest_frame()`: Get most recent RGB frame (GPU tensor)
- `is_connected()`: Check connection status
- `get_buffer_size()`: Current buffer size

#### JPEGEncoderFactory
Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.

**Key Function:**
```python
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.

    Performance: 1-2ms per 720p frame
    """
```

## Technical Implementation

### Shared CUDA Context Pattern

```python
# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)

# All decoders share same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)
```

**Benefits:**
- 70% VRAM reduction per stream
- Single decoder initialization overhead
- Efficient resource sharing

### NV12 to RGB Conversion (GPU)

```python
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV420) to RGB entirely on GPU using PyTorch ops.
    Uses BT.601 color space conversion.

    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """
```

**Steps:**
1. Split Y and UV planes
2. Deinterleave UV components
3. Upsample chroma (bilinear interpolation)
4. Apply BT.601 color matrix
5. Clamp to [0, 255]

### Zero-Copy Operations

**DLPack for PyTorch ↔ nvImageCodec:**
```python
# GPU tensor stays on GPU
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc)  # Uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
```

## Performance Metrics

### VRAM Usage (Python Process)

| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|---------|-----------|----------|------------|---------------|
| 0       | 216 MB    | 0 MB     | -          | -             |
| 1       | 278 MB    | 62 MB    | 62.0 MB    | 62 MB         |
| 2       | 338 MB    | 122 MB   | 61.0 MB    | 60 MB         |
| 3       | 398 MB    | 182 MB   | 60.7 MB    | 60 MB         |
| 4       | 458 MB    | 242 MB   | 60.5 MB    | 60 MB         |

**Result:** Perfect linear scaling at ~60 MB per stream

### Capacity Estimates

With 60 MB per stream + 216 MB baseline:

- **16GB GPU**: ~269 cameras (conservative: ~250)
- **24GB GPU**: ~407 cameras (conservative: ~380)
- **48GB GPU**: ~815 cameras (conservative: ~780)
- **For 1000 streams**: ~60GB VRAM required

### Throughput

- **Frame Rate**: 7-7.5 FPS per stream @ 720p
- **JPEG Encoding**: 1-2ms per frame
- **Connection Time**: ~15s for stream stabilization

## Project Structure

```
python-rtsp-worker/
├── app.py                      # FastAPI application
├── services/
│   ├── __init__.py            # Package exports
│   ├── stream_decoder.py      # StreamDecoder & Factory
│   └── jpeg_encoder.py        # JPEG encoding utilities
├── test_stream.py             # Single stream test
├── test_multi_stream.py       # 4-stream test with monitoring
├── test_vram_scaling.py       # System VRAM measurement
├── test_vram_process.py       # Process VRAM measurement
├── test_jpeg_encode.py        # JPEG encoding test
├── requirements.txt           # Python dependencies
├── .env                       # Camera URLs (gitignored)
├── .env.example              # Template for camera URLs
└── .gitignore

```

## Dependencies

```
fastapi                    # Web framework
uvicorn[standard]         # ASGI server
torch                     # GPU tensor operations
PyNvVideoCodec            # NVDEC hardware decoding
av                        # FFmpeg/RTSP client
cuda-python               # CUDA driver bindings
nvidia-nvimgcodec-cu12    # nvJPEG encoding
python-dotenv             # Environment variables
```

## Configuration

### Environment Variables (.env)

```bash
# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...
```

### Loading URLs in Code

```python
from dotenv import load_dotenv
import os

load_dotenv()

camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break
```

## Usage Examples

### Basic Usage

```python
from services import StreamDecoderFactory, encode_frame_to_jpeg

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for connection
import time
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on GPU)
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)

    # Save or transmit jpeg_bytes
    with open("frame.jpg", "wb") as f:
        f.write(jpeg_bytes)

# Cleanup
decoder.stop()
```

### Multi-Stream Usage

```python
from services import StreamDecoderFactory
import time

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share context)
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()
```

## Testing

### Run Single Stream Test
```bash
python test_stream.py
```

### Run 4-Stream Test with VRAM Monitoring
```bash
python test_multi_stream.py
```

### Measure VRAM Scaling
```bash
python test_vram_process.py
```

### Test JPEG Encoding
```bash
python test_jpeg_encode.py
```

## Known Issues

### Segmentation Faults on Cleanup
**Status**: Non-critical
**Impact**: Occurs during cleanup, doesn't affect core functionality
**Cause**: Likely CUDA context cleanup order issues
**Workaround**: Functionality works correctly; cleanup errors can be ignored

## Technical Decisions

### Why PyNvVideoCodec?
- Direct access to NVDEC hardware decoder
- Minimal overhead compared to FFmpeg/torchaudio
- Returns GPU tensors via DLPack
- Better control over decode sessions

### Why Shared CUDA Context?
- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
- Enables 1000-stream target on 60GB GPU
- Minimal complexity overhead with singleton pattern

### Why nvImageCodec?
- GPU-native JPEG encoding (nvJPEG)
- Zero-copy with PyTorch via `__cuda_array_interface__`
- 1-2ms encoding time per 720p frame
- Keeps data on GPU until final compression

### Why Thread-Safe Ring Buffer?
- Decouples decoding from inference pipeline
- Prevents frame drops during processing spikes
- Allows async frame access
- Configurable buffer size per stream

## Future Considerations

### Hardware Decode Session Limits
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
- May need multiple GPUs for 1000 streams
- Test with actual hardware to verify limits

### Scaling Beyond 1000 Streams
- Multi-GPU support with context per GPU
- Load balancing across GPUs
- Network bandwidth considerations

### TensorRT Integration
- Next step: Integrate with TensorRT inference pipeline
- GPU frames → TensorRT → Results
- Keep entire pipeline on GPU

## References

- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)

## License

This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.