GPU-Accelerated RTSP Stream Processing System

Project Overview

A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through a shared CUDA context and keeps all processing on the GPU until final JPEG compression.

Key Achievements

  • Shared CUDA Context: ~70% VRAM reduction (from ~200 MB to ~60 MB per stream)
  • Linear VRAM Scaling: ~60 MB marginal cost per additional stream
  • Zero-Copy Pipeline: all processing stays on the GPU until the final JPEG bytes
  • Measured Performance: 4 streams @ 720p at 7-7.5 FPS each, 458 MB total VRAM

Architecture

Pipeline Flow

RTSP Stream → PyAV (CPU)
           ↓
    NVDEC Decode (GPU) → NV12 Format
           ↓
    NV12 to RGB (GPU) → PyTorch Ops
           ↓
    nvJPEG Encode (GPU) → JPEG Bytes
           ↓
    CPU (JPEG only)

Core Components

StreamDecoderFactory

Singleton factory managing shared CUDA context across all decoder instances.

Key Methods:

  • get_factory(gpu_id): Returns singleton instance
  • create_decoder(rtsp_url, buffer_size): Creates new decoder with shared context
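
A minimal sketch of the singleton lookup, assuming a module-level cache keyed by GPU id (the cache and lock names here are hypothetical, not the project's actual internals):

import threading

_factories = {}                      # one factory (and one CUDA context) per GPU id
_factories_lock = threading.Lock()

def get_factory(gpu_id: int = 0) -> "StreamDecoderFactory":
    # Create the factory on first use; all later callers get the same instance
    with _factories_lock:
        if gpu_id not in _factories:
            _factories[gpu_id] = StreamDecoderFactory(gpu_id)
        return _factories[gpu_id]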

CUDA Context Initialization:

from cuda import cuda as cuda_driver  # assumed import: NVIDIA cuda-python bindings

err, = cuda_driver.cuInit(0)
err, self.cuda_device = cuda_driver.cuDeviceGet(self.gpu_id)
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)

StreamDecoder

Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.

Key Features:

  • Thread-safe frame buffer (deque)
  • Connection status tracking
  • Automatic reconnection handling
  • Background thread for continuous decoding

Key Methods:

  • start(): Start decoding thread
  • stop(): Stop and cleanup
  • get_latest_frame(): Get most recent RGB frame (GPU tensor)
  • is_connected(): Check connection status
  • get_buffer_size(): Current buffer size
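
A simplified sketch of the background decode loop and ring buffer; _connect() and _decode_frames() are hypothetical stand-ins for the PyAV/NVDEC plumbing:

import threading
import time
from collections import deque

class StreamDecoder:  # simplified sketch, not the full implementation
    def start(self):
        self._buffer = deque(maxlen=self.buffer_size)  # ring buffer: full buffer drops oldest
        self._lock = threading.Lock()
        self._running = True
        self._thread = threading.Thread(target=self._decode_loop, daemon=True)
        self._thread.start()

    def _decode_loop(self):
        while self._running:
            try:
                self._connect()                       # hypothetical: open RTSP, init NVDEC
                for frame in self._decode_frames():   # hypothetical: yields GPU RGB tensors
                    if not self._running:
                        break
                    with self._lock:
                        self._buffer.append(frame)
            except Exception:
                time.sleep(2.0)                       # back off, then reconnect

    def get_latest_frame(self):
        with self._lock:
            return self._buffer[-1] if self._buffer else None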

JPEGEncoderFactory

Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.

Key Function:

def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.

    Performance: 1-2ms per 720p frame
    """

Technical Implementation

Shared CUDA Context Pattern

# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)

# All decoders share same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)

Benefits:

  • 70% VRAM reduction per stream
  • Single decoder initialization overhead
  • Efficient resource sharing
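
The per-stream cost can be spot-checked by sampling device memory before and after adding a decoder. A sketch using pynvml (assumed to be installed; give the stream time to connect before the second sample):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mb():
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)

baseline = used_mb()
decoder = factory.create_decoder(url1, buffer_size=30)
decoder.start()
time.sleep(10)                       # allow connection and first decoded frames
print(f"Marginal VRAM for one stream: {used_mb() - baseline:.1f} MB")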

NV12 to RGB Conversion (GPU)

def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV420) to RGB entirely on GPU using PyTorch ops.
    Uses BT.601 color space conversion.

    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """

Steps (sketched in code below):

  1. Split Y and UV planes
  2. Deinterleave UV components
  3. Upsample chroma (bilinear interpolation)
  4. Apply BT.601 color matrix
  5. Clamp to [0, 255]
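
A runnable sketch of these steps, assuming BT.601 limited-range input (the project's exact coefficients may differ):

import torch
import torch.nn.functional as F

def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # 1. Split planes: Y is (H, W); UV is (H/2, W) with interleaved U,V pairs
    y = nv12_tensor[:height, :].float()
    uv = nv12_tensor[height:, :].float()

    # 2. Deinterleave UV into U (H/2, W/2) and V (H/2, W/2)
    u, v = uv[:, 0::2], uv[:, 1::2]

    # 3. Upsample chroma to full resolution (bilinear)
    u = F.interpolate(u[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    v = F.interpolate(v[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]

    # 4. BT.601 limited-range color matrix
    y, u, v = y - 16.0, u - 128.0, v - 128.0
    r = 1.164 * y + 1.596 * v
    g = 1.164 * y - 0.392 * u - 0.813 * v
    b = 1.164 * y + 2.017 * u

    # 5. Clamp to [0, 255] and pack as (3, H, W) uint8
    return torch.stack([r, g, b]).clamp(0, 255).to(torch.uint8)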

Zero-Copy Operations

Zero-copy hand-off between PyTorch and nvImageCodec via __cuda_array_interface__:

# GPU tensor stays on GPU
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc)  # Uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)

Performance Metrics

VRAM Usage (at 720p)

Streams   Total VRAM   Overhead   Per Stream   Marginal Cost
0         216 MB       0 MB       -            -
1         278 MB       62 MB      62.0 MB      62 MB
2         338 MB       122 MB     61.0 MB      60 MB
3         398 MB       182 MB     60.7 MB      60 MB
4         458 MB       242 MB     60.5 MB      60 MB

Result: near-perfect linear scaling at ~60 MB per additional stream (overhead = total minus the 216 MB baseline)
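
If the linear trend holds, total usage is approximately VRAM(N) ≈ 216 MB + 60 MB × N. For the 1000-stream design target this extrapolates to roughly 60 GB, implying multiple GPUs or a smaller per-stream footprint at that scale; note this is a projection from the 4-stream measurements, not a measured result.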

Project Structure


python-rtsp-worker/
├── app.py                      # FastAPI application
├── services/
│   ├── __init__.py            # Package exports
│   ├── stream_decoder.py      # StreamDecoder & Factory
│   └── jpeg_encoder.py        # JPEG encoding utilities
├── test_stream.py             # Single stream test
├── test_multi_stream.py       # 4-stream test with monitoring
├── test_vram_scaling.py       # System VRAM measurement
├── test_vram_process.py       # Process VRAM measurement
├── test_jpeg_encode.py        # JPEG encoding test
├── requirements.txt           # Python dependencies
├── .env                       # Camera URLs (gitignored)
├── .env.example               # Template for camera URLs
└── .gitignore

Configuration

Environment Variables (.env)

# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...

Loading URLs in Code

from dotenv import load_dotenv
import os

load_dotenv()

# Build the camera list from CAMERA_URL_1, CAMERA_URL_2, ... stopping at the first gap
camera_urls = []
i = 1
while url := os.getenv(f"CAMERA_URL_{i}"):
    camera_urls.append(url)
    i += 1

Usage Examples

Basic Usage

from services import StreamDecoderFactory, encode_frame_to_jpeg

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for connection
import time
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on GPU); returns None if encoding fails
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)

    # Save or transmit jpeg_bytes
    if jpeg_bytes is not None:
        with open("frame.jpg", "wb") as f:
            f.write(jpeg_bytes)

# Cleanup
decoder.stop()

Multi-Stream Usage

from services import StreamDecoderFactory
import time

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share context)
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()

Testing

Run Single Stream Test

python test_stream.py

Run 4-Stream Test with VRAM Monitoring

python test_multi_stream.py

Measure VRAM Scaling

python test_vram_process.py    # per-process VRAM
python test_vram_scaling.py    # system-wide VRAM

Test JPEG Encoding

python test_jpeg_encode.py

Known Issues

Segmentation Faults on Cleanup

Status: Non-critical
Impact: Occurs during cleanup; does not affect core functionality
Cause: Likely CUDA context cleanup order issues
Workaround: Functionality works correctly; cleanup errors can be ignored
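
If the cause is confirmed, one possible mitigation (an untested sketch, assuming the factory exposes the cuda_device it retained) is to stop every decoder before releasing the primary context, e.g. from an atexit hook:

import atexit
from cuda import cuda as cuda_driver  # assumed cuda-python import, as above

def shutdown(factory, decoders):
    for d in decoders:
        d.stop()    # ensure no decode thread touches the context during teardown
    cuda_driver.cuDevicePrimaryCtxRelease(factory.cuda_device)

atexit.register(shutdown, factory, decoders)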

License

This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.