GPU-Accelerated RTSP Stream Processing System

Project Overview

A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through a shared CUDA context and keeps all processing on the GPU until final JPEG compression.

Key Achievements

  • Shared CUDA Context: ~70% VRAM reduction (from ~200 MB to ~60 MB per stream)
  • Linear VRAM Scaling: ~60 MB marginal cost per additional stream
  • Zero-Copy Pipeline: all processing stays on the GPU until the final JPEG bytes
  • Measured Performance: 4 streams @ 720p at 7-7.5 FPS each, 458 MB total VRAM

Architecture

Pipeline Flow

RTSP Stream → PyAV (CPU)
           ↓
    NVDEC Decode (GPU) → NV12 Format
           ↓
    NV12 to RGB (GPU) → PyTorch Ops
           ↓
    nvJPEG Encode (GPU) → JPEG Bytes
           ↓
    CPU (JPEG only)

Core Components

StreamDecoderFactory

Singleton factory managing shared CUDA context across all decoder instances.

Key Methods:

  • get_factory(gpu_id): Returns singleton instance
  • create_decoder(rtsp_url, buffer_size): Creates new decoder with shared context
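
A minimal sketch of the singleton lookup, assuming a module-level cache keyed by GPU id (the cache and lock names here are hypothetical, not the project's actual internals):

import threading

_factories = {}                      # one factory (and one CUDA context) per GPU id
_factories_lock = threading.Lock()

def get_factory(gpu_id: int = 0) -> "StreamDecoderFactory":
    # Create the factory on first use; all later callers get the same instance
    with _factories_lock:
        if gpu_id not in _factories:
            _factories[gpu_id] = StreamDecoderFactory(gpu_id)
        return _factories[gpu_id]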

CUDA Context Initialization:

from cuda import cuda as cuda_driver  # assumed import: NVIDIA cuda-python bindings

err, = cuda_driver.cuInit(0)
err, self.cuda_device = cuda_driver.cuDeviceGet(self.gpu_id)
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)

StreamDecoder

Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.

Key Features:

  • Thread-safe frame buffer (deque)
  • Connection status tracking
  • Automatic reconnection handling
  • Background thread for continuous decoding

Key Methods:

  • start(): Start decoding thread
  • stop(): Stop and cleanup
  • get_latest_frame(): Get most recent RGB frame (GPU tensor)
  • is_connected(): Check connection status
  • get_buffer_size(): Current buffer size
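
A simplified sketch of the background decode loop and ring buffer; _connect() and _decode_frames() are hypothetical stand-ins for the PyAV/NVDEC plumbing:

import threading
import time
from collections import deque

class StreamDecoder:  # simplified sketch, not the full implementation
    def start(self):
        self._buffer = deque(maxlen=self.buffer_size)  # ring buffer: full buffer drops oldest
        self._lock = threading.Lock()
        self._running = True
        self._thread = threading.Thread(target=self._decode_loop, daemon=True)
        self._thread.start()

    def _decode_loop(self):
        while self._running:
            try:
                self._connect()                       # hypothetical: open RTSP, init NVDEC
                for frame in self._decode_frames():   # hypothetical: yields GPU RGB tensors
                    if not self._running:
                        break
                    with self._lock:
                        self._buffer.append(frame)
            except Exception:
                time.sleep(2.0)                       # back off, then reconnect

    def get_latest_frame(self):
        with self._lock:
            return self._buffer[-1] if self._buffer else None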

JPEGEncoderFactory

Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.

Key Function:

def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.

    Performance: 1-2ms per 720p frame
    """

Technical Implementation

Shared CUDA Context Pattern

# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)

# All decoders share same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)

Benefits:

  • 70% VRAM reduction per stream
  • Single decoder initialization overhead
  • Efficient resource sharing
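
The per-stream cost can be spot-checked by sampling device memory before and after adding a decoder. A sketch using pynvml (assumed to be installed; give the stream time to connect before the second sample):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mb():
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)

baseline = used_mb()
decoder = factory.create_decoder(url1, buffer_size=30)
decoder.start()
time.sleep(10)                       # allow connection and first decoded frames
print(f"Marginal VRAM for one stream: {used_mb() - baseline:.1f} MB")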

NV12 to RGB Conversion (GPU)

def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV420) to RGB entirely on GPU using PyTorch ops.
    Uses BT.601 color space conversion.

    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """

Steps (sketched in code below):

  1. Split Y and UV planes
  2. Deinterleave UV components
  3. Upsample chroma (bilinear interpolation)
  4. Apply BT.601 color matrix
  5. Clamp to [0, 255]
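
A runnable sketch of these steps, assuming BT.601 limited-range input (the project's exact coefficients may differ):

import torch
import torch.nn.functional as F

def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # 1. Split planes: Y is (H, W); UV is (H/2, W) with interleaved U,V pairs
    y = nv12_tensor[:height, :].float()
    uv = nv12_tensor[height:, :].float()

    # 2. Deinterleave UV into U (H/2, W/2) and V (H/2, W/2)
    u, v = uv[:, 0::2], uv[:, 1::2]

    # 3. Upsample chroma to full resolution (bilinear)
    u = F.interpolate(u[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    v = F.interpolate(v[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]

    # 4. BT.601 limited-range color matrix
    y, u, v = y - 16.0, u - 128.0, v - 128.0
    r = 1.164 * y + 1.596 * v
    g = 1.164 * y - 0.392 * u - 0.813 * v
    b = 1.164 * y + 2.017 * u

    # 5. Clamp to [0, 255] and pack as (3, H, W) uint8
    return torch.stack([r, g, b]).clamp(0, 255).to(torch.uint8)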

Zero-Copy Operations

Zero-copy hand-off between PyTorch and nvImageCodec via __cuda_array_interface__:

# GPU tensor stays on GPU
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc)  # Uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)

Performance Metrics

VRAM Usage (at 720p)

Streams   Total VRAM   Overhead   Per Stream   Marginal Cost
0         216 MB       0 MB       -            -
1         278 MB       62 MB      62.0 MB      62 MB
2         338 MB       122 MB     61.0 MB      60 MB
3         398 MB       182 MB     60.7 MB      60 MB
4         458 MB       242 MB     60.5 MB      60 MB

Result: near-perfect linear scaling at ~60 MB per additional stream (overhead = total minus the 216 MB baseline)
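
If the linear trend holds, total usage is approximately VRAM(N) ≈ 216 MB + 60 MB × N. For the 1000-stream design target this extrapolates to roughly 60 GB, implying multiple GPUs or a smaller per-stream footprint at that scale; note this is a projection from the 4-stream measurements, not a measured result.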

Project Structure


python-rtsp-worker/
├── app.py                      # FastAPI application
├── services/
│   ├── __init__.py            # Package exports
│   ├── stream_decoder.py      # StreamDecoder & Factory
│   └── jpeg_encoder.py        # JPEG encoding utilities
├── test_stream.py             # Single stream test
├── test_multi_stream.py       # 4-stream test with monitoring
├── test_vram_scaling.py       # System VRAM measurement
├── test_vram_process.py       # Process VRAM measurement
├── test_jpeg_encode.py        # JPEG encoding test
├── requirements.txt           # Python dependencies
├── .env                       # Camera URLs (gitignored)
├── .env.example               # Template for camera URLs
└── .gitignore

Configuration

Environment Variables (.env)

# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...

Loading URLs in Code

from dotenv import load_dotenv
import os

load_dotenv()

# Build the camera list from CAMERA_URL_1, CAMERA_URL_2, ... stopping at the first gap
camera_urls = []
i = 1
while url := os.getenv(f"CAMERA_URL_{i}"):
    camera_urls.append(url)
    i += 1

Usage Examples

Basic Usage

from services import StreamDecoderFactory, encode_frame_to_jpeg

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for connection
import time
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on GPU); returns None if encoding fails
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)

    # Save or transmit jpeg_bytes
    if jpeg_bytes is not None:
        with open("frame.jpg", "wb") as f:
            f.write(jpeg_bytes)

# Cleanup
decoder.stop()

Multi-Stream Usage

from services import StreamDecoderFactory
import time

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share context)
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()

Testing

Run Single Stream Test

python test_stream.py

Run 4-Stream Test with VRAM Monitoring

python test_multi_stream.py

Measure VRAM Scaling

python test_vram_process.py    # per-process VRAM
python test_vram_scaling.py    # system-wide VRAM

Test JPEG Encoding

python test_jpeg_encode.py

Known Issues

Segmentation Faults on Cleanup

Status: Non-critical
Impact: Occurs during cleanup; does not affect core functionality
Cause: Likely CUDA context cleanup order issues
Workaround: Functionality works correctly; cleanup errors can be ignored
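
If the cause is confirmed, one possible mitigation (an untested sketch, assuming the factory exposes the cuda_device it retained) is to stop every decoder before releasing the primary context, e.g. from an atexit hook:

import atexit
from cuda import cuda as cuda_driver  # assumed cuda-python import, as above

def shutdown(factory, decoders):
    for d in decoders:
        d.stop()    # ensure no decode thread touches the context during teardown
    cuda_driver.cuDevicePrimaryCtxRelease(factory.cuda_device)

atexit.register(shutdown, factory, decoders)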

License

This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.