GPU-Accelerated RTSP Stream Processing System
Project Overview
A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through shared CUDA context and keeps all processing on the GPU until final JPEG compression.
Key Achievements
- Shared CUDA Context: 70% VRAM reduction (from ~200MB to ~60MB per stream)
- Linear VRAM Scaling: Perfect scaling at 60 MB per additional stream
- Zero-Copy Pipeline: All processing stays on GPU until JPEG bytes
- Measured Performance: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM
Architecture
Pipeline Flow
RTSP Stream → PyAV (CPU)
↓
NVDEC Decode (GPU) → NV12 Format
↓
NV12 to RGB (GPU) → PyTorch Ops
↓
nvJPEG Encode (GPU) → JPEG Bytes
↓
CPU (JPEG only)
Core Components
StreamDecoderFactory
Singleton factory managing shared CUDA context across all decoder instances.
Key Methods:
- get_factory(gpu_id): Returns the singleton instance
- create_decoder(rtsp_url, buffer_size): Creates a new decoder with the shared context
CUDA Context Initialization:
from cuda import cuda as cuda_driver  # cuda-python driver bindings

err, = cuda_driver.cuInit(0)  # initialize the CUDA driver API once
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)  # retain the device's primary context
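A minimal sketch of the singleton wiring, assuming a StreamDecoder constructor that accepts the retained context (the constructor signature is illustrative, not confirmed from the module):
import threading
from cuda import cuda as cuda_driver
from services.stream_decoder import StreamDecoder  # per the project layout below

class StreamDecoderFactory:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_factory(cls, gpu_id: int = 0) -> "StreamDecoderFactory":
        # Double-checked locking so concurrent callers get exactly one instance
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(gpu_id)
        return cls._instance

    def __init__(self, gpu_id: int = 0):
        err, = cuda_driver.cuInit(0)
        err, self.cuda_device = cuda_driver.cuDeviceGet(gpu_id)
        # Retain the device's primary context once; every decoder reuses it
        err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)

    def create_decoder(self, rtsp_url: str, buffer_size: int = 30) -> StreamDecoder:
        # Decoders share self.cuda_context instead of each creating their own
        return StreamDecoder(rtsp_url, self.cuda_context, buffer_size)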
StreamDecoder
Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.
Key Features:
- Thread-safe frame buffer (deque; sketched below)
- Connection status tracking
- Automatic reconnection handling
- Background thread for continuous decoding
Key Methods:
- start(): Start the decoding thread
- stop(): Stop and clean up
- get_latest_frame(): Get the most recent RGB frame (GPU tensor)
- is_connected(): Check connection status
- get_buffer_size(): Current buffer size
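A minimal sketch of the ring-buffer pattern behind these methods (field names are illustrative, not the module's actual internals):
import threading
from collections import deque
from typing import Optional
import torch

class FrameBuffer:
    """Thread-safe ring buffer: the decode thread pushes, callers pull the newest frame."""

    def __init__(self, maxlen: int = 30):
        self._frames: deque = deque(maxlen=maxlen)  # old frames fall off automatically
        self._lock = threading.Lock()

    def push(self, frame: torch.Tensor) -> None:
        with self._lock:
            self._frames.append(frame)

    def latest(self) -> Optional[torch.Tensor]:
        with self._lock:
            return self._frames[-1] if self._frames else None

    def __len__(self) -> int:
        with self._lock:
            return len(self._frames)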
JPEGEncoderFactory
Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.
Key Function:
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
    """
    Encodes a GPU RGB tensor to JPEG bytes without a CPU transfer.
    Uses __cuda_array_interface__ for zero-copy operation.
    Performance: 1-2 ms per 720p frame
    """
Technical Implementation
Shared CUDA Context Pattern
# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)
# All decoders share same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)
Benefits:
- 70% VRAM reduction per stream
- Single decoder initialization overhead
- Efficient resource sharing
NV12 to RGB Conversion (GPU)
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    Converts NV12 (YUV 4:2:0) to RGB entirely on the GPU using PyTorch ops.
    Uses BT.601 color space conversion.
    Input: (height * 1.5, width) NV12 tensor
    Output: (3, height, width) RGB tensor
    """
Steps:
- Split Y and UV planes
- Deinterleave UV components
- Upsample chroma (bilinear interpolation)
- Apply BT.601 color matrix
- Clamp to [0, 255]
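Put together, these steps map onto a handful of tensor ops. A sketch assuming full-range BT.601 coefficients (the project may use the limited-range variant):
import torch
import torch.nn.functional as F

def nv12_to_rgb_gpu(nv12: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # 1-2. Split Y from the interleaved UV plane and deinterleave U/V
    y = nv12[:height, :].float()
    uv = nv12[height:, :].float().reshape(height // 2, width // 2, 2)
    u = uv[..., 0] - 128.0
    v = uv[..., 1] - 128.0
    # 3. Upsample chroma from quarter to full resolution (bilinear)
    u = F.interpolate(u[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    v = F.interpolate(v[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    # 4. Full-range BT.601 matrix
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    # 5. Clamp and pack as (3, H, W) uint8
    return torch.stack([r, g, b]).clamp_(0, 255).to(torch.uint8)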
Zero-Copy Operations
PyTorch → nvImageCodec hand-off (via __cuda_array_interface__):
# GPU tensor stays on GPU
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc) # Uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
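To verify the hand-off really is zero-copy (illustrative check; assumes the nvimgcodec image exposes __cuda_array_interface__ just as the tensor does):
# Both objects should report the same device pointer: same memory, no copy
src_ptr = rgb_hwc.__cuda_array_interface__["data"][0]
dst_ptr = nv_image.__cuda_array_interface__["data"][0]
assert src_ptr == dst_ptr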
Performance Metrics
VRAM Usage (Python Process)
| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|---|---|---|---|---|
| 0 | 216 MB | 0 MB | - | - |
| 1 | 278 MB | 62 MB | 62.0 MB | 62 MB |
| 2 | 338 MB | 122 MB | 61.0 MB | 60 MB |
| 3 | 398 MB | 182 MB | 60.7 MB | 60 MB |
| 4 | 458 MB | 242 MB | 60.5 MB | 60 MB |
Result: Perfect linear scaling at ~60 MB per stream
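The marginal-cost column can be reproduced directly from the measurements:
vram_mb = [216, 278, 338, 398, 458]  # total VRAM at 0-4 streams (table above)
marginal = [b - a for a, b in zip(vram_mb, vram_mb[1:])]
print(marginal)  # [62, 60, 60, 60] -> ~60 MB per additional stream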
Capacity Estimates
With 60 MB per stream + 216 MB baseline:
- 16GB GPU: ~269 cameras (conservative: ~250)
- 24GB GPU: ~406 cameras (conservative: ~380)
- 48GB GPU: ~815 cameras (conservative: ~780)
- For 1000 streams: ~60GB VRAM required
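Each estimate is floor((VRAM − baseline) / per-stream cost); as a quick sanity check:
def max_streams(vram_gib: float, baseline_mb: float = 216.0, per_stream_mb: float = 60.0) -> int:
    return int((vram_gib * 1024 - baseline_mb) // per_stream_mb)

for gib in (16, 24, 48):
    print(f"{gib} GiB -> {max_streams(gib)} streams")  # 269, 406, 815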
Throughput
- Frame Rate: 7-7.5 FPS per stream @ 720p
- JPEG Encoding: 1-2ms per frame
- Connection Time: ~15s for stream stabilization
Project Structure
python-rtsp-worker/
├── app.py # FastAPI application
├── services/
│ ├── __init__.py # Package exports
│ ├── stream_decoder.py # StreamDecoder & Factory
│ └── jpeg_encoder.py # JPEG encoding utilities
├── test_stream.py # Single stream test
├── test_multi_stream.py # 4-stream test with monitoring
├── test_vram_scaling.py # System VRAM measurement
├── test_vram_process.py # Process VRAM measurement
├── test_jpeg_encode.py # JPEG encoding test
├── requirements.txt # Python dependencies
├── .env # Camera URLs (gitignored)
├── .env.example # Template for camera URLs
└── .gitignore
Dependencies
fastapi # Web framework
uvicorn[standard] # ASGI server
torch # GPU tensor operations
PyNvVideoCodec # NVDEC hardware decoding
av # FFmpeg/RTSP client
cuda-python # CUDA driver bindings
nvidia-nvimgcodec-cu12 # nvJPEG encoding
python-dotenv # Environment variables
Configuration
Environment Variables (.env)
# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...
Loading URLs in Code
from dotenv import load_dotenv
import os
load_dotenv()
camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break
Usage Examples
Basic Usage
from services import StreamDecoderFactory, encode_frame_to_jpeg
import time

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for connection
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on GPU)
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)
    # Save or transmit jpeg_bytes
    with open("frame.jpg", "wb") as f:
        f.write(jpeg_bytes)

# Cleanup
decoder.stop()
Multi-Stream Usage
from services import StreamDecoderFactory
import time

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share context)
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()
Testing
Run Single Stream Test
python test_stream.py
Run 4-Stream Test with VRAM Monitoring
python test_multi_stream.py
Measure VRAM Scaling
python test_vram_process.py
Test JPEG Encoding
python test_jpeg_encode.py
Known Issues
Segmentation Faults on Cleanup
- Status: Non-critical
- Impact: Occurs during cleanup; does not affect core functionality
- Cause: Likely CUDA context teardown ordering
- Workaround: None needed; cleanup errors can be safely ignored
Technical Decisions
Why PyNvVideoCodec?
- Direct access to NVDEC hardware decoder
- Minimal overhead compared to FFmpeg/torchaudio
- Returns GPU tensors via DLPack
- Better control over decode sessions
Why Shared CUDA Context?
- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
- Enables the 1000-stream target within a ~60 GB total VRAM budget
- Minimal complexity overhead with singleton pattern
Why nvImageCodec?
- GPU-native JPEG encoding (nvJPEG)
- Zero-copy with PyTorch via __cuda_array_interface__
- 1-2 ms encoding time per 720p frame
- Keeps data on GPU until final compression
Why Thread-Safe Ring Buffer?
- Decouples decoding from inference pipeline
- Prevents frame drops during processing spikes
- Allows async frame access
- Configurable buffer size per stream
Future Considerations
Hardware Decode Session Limits
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
- May need multiple GPUs for 1000 streams
- Test with actual hardware to verify limits
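If decode sessions rather than VRAM become the binding constraint, the GPU count follows directly:
import math

# GPUs needed for 1000 streams at various per-GPU decode-session limits
for sessions_per_gpu in (5, 15, 30):
    print(sessions_per_gpu, "sessions/GPU ->", math.ceil(1000 / sessions_per_gpu), "GPUs")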
Scaling Beyond 1000 Streams
- Multi-GPU support with context per GPU
- Load balancing across GPUs
- Network bandwidth considerations
TensorRT Integration
- Next step: Integrate with TensorRT inference pipeline
- GPU frames → TensorRT → Results
- Keep entire pipeline on GPU
License
This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.