feat: inference subsystem and optimization to decoder

Siwat Sirichai 2025-11-09 00:57:08 +07:00
commit 3c83a57e44
19 changed files with 3897 additions and 0 deletions

11
.env.example Normal file

@ -0,0 +1,11 @@
# RTSP Camera URLs
# Add your camera URLs here, one per line with CAMERA_URL_N format
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more cameras as needed...
# CAMERA_URL_5=rtsp://user:pass@host/path
# CAMERA_URL_6=rtsp://user:pass@host/path

6
.gitignore vendored Normal file

@ -0,0 +1,6 @@
fastapi
__pycache__/
*.pyc
.env
.claude
models/

13
app.py Normal file

@ -0,0 +1,13 @@
from fastapi import FastAPI
app = FastAPI()
@app.get("/")
async def root():
return {"message": "Hello World"}
@app.get("/health")
async def health_check():
return {"status": "healthy"}

373
claude.md Normal file

@ -0,0 +1,373 @@
# GPU-Accelerated RTSP Stream Processing System
## Project Overview
A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through shared CUDA context and keeps all processing on the GPU until final JPEG compression.
## Key Achievements
- **Shared CUDA Context**: 70% VRAM reduction (from ~200MB to ~60MB per stream)
- **Linear VRAM Scaling**: Perfect scaling at 60 MB per additional stream
- **Zero-Copy Pipeline**: All processing stays on GPU until JPEG bytes
- **Proven Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM
## Architecture
### Pipeline Flow
```
RTSP Stream → PyAV (CPU)
      ↓
NVDEC Decode (GPU) → NV12 Format
      ↓
NV12 to RGB (GPU) → PyTorch Ops
      ↓
nvJPEG Encode (GPU) → JPEG Bytes
      ↓
CPU (JPEG only)
```
### Core Components
#### StreamDecoderFactory
Singleton factory managing shared CUDA context across all decoder instances.
**Key Methods:**
- `get_factory(gpu_id)`: Returns singleton instance
- `create_decoder(rtsp_url, buffer_size)`: Creates new decoder with shared context
**CUDA Context Initialization:**
```python
err, = cuda_driver.cuInit(0)
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
```
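For orientation, a slightly fuller sketch of the same pattern using the cuda-python driver bindings (error handling omitted; the import alias and variable names are illustrative, not taken from the source):
```python
from cuda import cuda as cuda_driver

err, = cuda_driver.cuInit(0)                                    # initialize the driver API once
err, cuda_device = cuda_driver.cuDeviceGet(0)                   # GPU 0
err, cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(cuda_device)
# Every decoder created by the factory is handed this same primary context,
# so the per-stream context overhead is paid only once per GPU.
```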
#### StreamDecoder
Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.
**Key Features:**
- Thread-safe frame buffer (deque)
- Connection status tracking
- Automatic reconnection handling
- Background thread for continuous decoding
**Key Methods:**
- `start()`: Start decoding thread
- `stop()`: Stop and cleanup
- `get_latest_frame()`: Get most recent RGB frame (GPU tensor)
- `is_connected()`: Check connection status
- `get_buffer_size()`: Current buffer size
#### JPEGEncoderFactory
Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.
**Key Function:**
```python
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
"""
Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
Uses __cuda_array_interface__ for zero-copy operation.
Performance: 1-2ms per 720p frame
"""
```
## Technical Implementation
### Shared CUDA Context Pattern
```python
# Single shared context for all decoders
factory = StreamDecoderFactory(gpu_id=0)
# All decoders share same context
decoder1 = factory.create_decoder(url1, buffer_size=30)
decoder2 = factory.create_decoder(url2, buffer_size=30)
decoder3 = factory.create_decoder(url3, buffer_size=30)
```
**Benefits:**
- 70% VRAM reduction per stream
- Single decoder initialization overhead
- Efficient resource sharing
### NV12 to RGB Conversion (GPU)
```python
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
"""
Converts NV12 (YUV420) to RGB entirely on GPU using PyTorch ops.
Uses BT.601 color space conversion.
Input: (height * 1.5, width) NV12 tensor
Output: (3, height, width) RGB tensor
"""
```
**Steps:**
1. Split Y and UV planes
2. Deinterleave UV components
3. Upsample chroma (bilinear interpolation)
4. Apply BT.601 color matrix
5. Clamp to [0, 255] (a sketch of these steps follows)
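A minimal sketch of these steps, assuming video-range (16-235) BT.601 input in the (height * 1.5, width) uint8 layout described above; the project's actual implementation may differ in details:
```python
import torch
import torch.nn.functional as F

def nv12_to_rgb_gpu(nv12: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # Split Y plane (H, W) and interleaved UV plane (H/2, W)
    y = nv12[:height, :].float()
    uv = nv12[height:, :].float()
    u, v = uv[:, 0::2], uv[:, 1::2]          # deinterleave to (H/2, W/2) each
    # Upsample chroma to full resolution (bilinear)
    u = F.interpolate(u[None, None], size=(height, width), mode='bilinear', align_corners=False)[0, 0]
    v = F.interpolate(v[None, None], size=(height, width), mode='bilinear', align_corners=False)[0, 0]
    # BT.601 (video-range) color matrix
    y = 1.164 * (y - 16.0)
    u, v = u - 128.0, v - 128.0
    r = y + 1.596 * v
    g = y - 0.392 * u - 0.813 * v
    b = y + 2.017 * u
    return torch.stack([r, g, b]).clamp(0, 255).to(torch.uint8)   # (3, H, W)
```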
### Zero-Copy Operations
**DLPack for PyTorch ↔ nvImageCodec:**
```python
# GPU tensor stays on GPU
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
nv_image = nvimgcodec.as_image(rgb_hwc) # Uses __cuda_array_interface__
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
```
## Performance Metrics
### VRAM Usage (Python Process)
| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|---------|-----------|----------|------------|---------------|
| 0 | 216 MB | 0 MB | - | - |
| 1 | 278 MB | 62 MB | 62.0 MB | 62 MB |
| 2 | 338 MB | 122 MB | 61.0 MB | 60 MB |
| 3 | 398 MB | 182 MB | 60.7 MB | 60 MB |
| 4 | 458 MB | 242 MB | 60.5 MB | 60 MB |
**Result:** Perfect linear scaling at ~60 MB per stream
### Capacity Estimates
With 60 MB per stream + 216 MB baseline:
- **16GB GPU**: ~269 cameras (conservative: ~250)
- **24GB GPU**: ~407 cameras (conservative: ~380)
- **48GB GPU**: ~815 cameras (conservative: ~780)
- **For 1000 streams**: ~60GB VRAM required (a small helper reproducing these estimates follows)
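A tiny helper, using the baseline and per-stream figures from the table above, to reproduce these estimates for any VRAM budget:
```python
def estimate_max_streams(vram_mb: int, baseline_mb: int = 216, per_stream_mb: int = 60) -> int:
    """Rough capacity estimate from measured baseline + marginal cost per stream."""
    return (vram_mb - baseline_mb) // per_stream_mb

print(estimate_max_streams(16 * 1024))   # ~269 cameras on a 16GB GPU
print(estimate_max_streams(48 * 1024))   # ~815 cameras on a 48GB GPU
```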
### Throughput
- **Frame Rate**: 7-7.5 FPS per stream @ 720p
- **JPEG Encoding**: 1-2ms per frame
- **Connection Time**: ~15s for stream stabilization
## Project Structure
```
python-rtsp-worker/
├── app.py # FastAPI application
├── services/
│ ├── __init__.py # Package exports
│ ├── stream_decoder.py # StreamDecoder & Factory
│ └── jpeg_encoder.py # JPEG encoding utilities
├── test_stream.py # Single stream test
├── test_multi_stream.py # 4-stream test with monitoring
├── test_vram_scaling.py # System VRAM measurement
├── test_vram_process.py # Process VRAM measurement
├── test_jpeg_encode.py # JPEG encoding test
├── requirements.txt # Python dependencies
├── .env # Camera URLs (gitignored)
├── .env.example # Template for camera URLs
└── .gitignore
```
## Dependencies
```
fastapi # Web framework
uvicorn[standard] # ASGI server
torch # GPU tensor operations
PyNvVideoCodec # NVDEC hardware decoding
av # FFmpeg/RTSP client
cuda-python # CUDA driver bindings
nvidia-nvimgcodec-cu12 # nvJPEG encoding
python-dotenv # Environment variables
```
## Configuration
### Environment Variables (.env)
```bash
# RTSP Camera URLs
CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path
# Add more as needed...
```
### Loading URLs in Code
```python
from dotenv import load_dotenv
import os
load_dotenv()
camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break
```
## Usage Examples
### Basic Usage
```python
from services import StreamDecoderFactory, encode_frame_to_jpeg
import time

# Create factory (shared CUDA context)
factory = StreamDecoderFactory(gpu_id=0)

# Create decoder
decoder = factory.create_decoder(
    rtsp_url="rtsp://user:pass@host/path",
    buffer_size=30
)

# Start decoding
decoder.start()

# Wait for connection
time.sleep(5)

# Get latest frame (GPU tensor)
rgb_frame = decoder.get_latest_frame()
if rgb_frame is not None:
    # Encode to JPEG (on GPU)
    jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)

    # Save or transmit jpeg_bytes
    with open("frame.jpg", "wb") as f:
        f.write(jpeg_bytes)

# Cleanup
decoder.stop()
```
### Multi-Stream Usage
```python
from services import StreamDecoderFactory
import time

factory = StreamDecoderFactory(gpu_id=0)

# Create multiple decoders (all share context)
decoders = []
for url in camera_urls:
    decoder = factory.create_decoder(url, buffer_size=30)
    decoder.start()
    decoders.append(decoder)

# Wait for connections
time.sleep(15)

# Check status
for i, decoder in enumerate(decoders):
    status = decoder.get_status()
    buffer_size = decoder.get_buffer_size()
    connected = decoder.is_connected()
    print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")

# Process frames
for decoder in decoders:
    frame = decoder.get_latest_frame()
    if frame is not None:
        # Process frame...
        pass

# Cleanup
for decoder in decoders:
    decoder.stop()
```
## Testing
### Run Single Stream Test
```bash
python test_stream.py
```
### Run 4-Stream Test with VRAM Monitoring
```bash
python test_multi_stream.py
```
### Measure VRAM Scaling
```bash
python test_vram_process.py
```
### Test JPEG Encoding
```bash
python test_jpeg_encode.py
```
## Known Issues
### Segmentation Faults on Cleanup
**Status**: Non-critical
**Impact**: Occurs during cleanup, doesn't affect core functionality
**Cause**: Likely CUDA context cleanup order issues
**Workaround**: Functionality works correctly; cleanup errors can be ignored
## Technical Decisions
### Why PyNvVideoCodec?
- Direct access to NVDEC hardware decoder
- Minimal overhead compared to FFmpeg/torchaudio
- Returns GPU tensors via DLPack
- Better control over decode sessions
### Why Shared CUDA Context?
- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
- Enables 1000-stream target on 60GB GPU
- Minimal complexity overhead with singleton pattern
### Why nvImageCodec?
- GPU-native JPEG encoding (nvJPEG)
- Zero-copy with PyTorch via `__cuda_array_interface__`
- 1-2ms encoding time per 720p frame
- Keeps data on GPU until final compression
### Why Thread-Safe Ring Buffer?
- Decouples decoding from inference pipeline
- Prevents frame drops during processing spikes
- Allows async frame access
- Configurable buffer size per stream (a minimal sketch of the pattern follows this list)
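A minimal sketch of the ring-buffer pattern described above, assuming a fixed-size `deque` guarded by a lock; the actual `StreamDecoder` buffer may differ in details:
```python
import threading
from collections import deque

class FrameRingBuffer:
    def __init__(self, maxlen: int = 30):
        self._frames = deque(maxlen=maxlen)   # oldest frames are dropped automatically
        self._lock = threading.Lock()

    def push(self, frame) -> None:            # called from the decode thread
        with self._lock:
            self._frames.append(frame)

    def latest(self):                         # called from the inference side
        with self._lock:
            return self._frames[-1] if self._frames else None
```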
## Future Considerations
### Hardware Decode Session Limits
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
- May need multiple GPUs for 1000 streams
- Test with actual hardware to verify limits
### Scaling Beyond 1000 Streams
- Multi-GPU support with context per GPU
- Load balancing across GPUs
- Network bandwidth considerations
### TensorRT Integration
- Next step: Integrate with TensorRT inference pipeline
- GPU frames → TensorRT → Results
- Keep entire pipeline on GPU
## References
- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)
## License
This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.

11
requirements.dev.txt Normal file

@ -0,0 +1,11 @@
# Development Dependencies
# Install with: pip install -r requirements.dev.txt
# Model conversion tools
tensorrt
onnx
ultralytics # For YOLO models download and export
# Optional: Additional tools for model optimization
onnxruntime-gpu # ONNX runtime for testing
onnx-simplifier # Simplify ONNX models

8
requirements.txt Normal file

@ -0,0 +1,8 @@
fastapi
uvicorn[standard]
torch
PyNvVideoCodec
av
cuda-python
nvidia-nvimgcodec-cu12 # GPU-accelerated JPEG encoding/decoding with nvJPEG
python-dotenv # Load environment variables from .env file

197
scripts/README.md Normal file

@ -0,0 +1,197 @@
# Scripts Directory
This directory contains utility scripts for the python-rtsp-worker project.
## convert_pt_to_tensorrt.py
Converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized GPU inference.
### Features
- **Multiple Precision Modes**: FP32, FP16, INT8
- **Dynamic Batch Size**: Support for variable batch sizes
- **Automatic Optimization**: Creates optimization profiles for best performance
- **ONNX Intermediate**: Uses ONNX as intermediate format for compatibility
- **Easy to Use**: Simple command-line interface
### Requirements
Make sure you have the following dependencies installed:
```bash
pip install torch tensorrt onnx
```
### Quick Start
**Basic conversion (FP32)**:
```bash
python scripts/convert_pt_to_tensorrt.py \
--model path/to/model.pt \
--output models/model.trt
```
**FP16 precision** (recommended for most cases - 2x faster, minimal accuracy loss):
```bash
python scripts/convert_pt_to_tensorrt.py \
--model yolov8n.pt \
--output models/yolov8n.trt \
--fp16
```
**Custom input shape**:
```bash
python scripts/convert_pt_to_tensorrt.py \
--model model.pt \
--output model.trt \
--input-shape 1,3,416,416
```
**Dynamic batch size** (for variable batch inference):
```bash
python scripts/convert_pt_to_tensorrt.py \
--model model.pt \
--output model.trt \
--dynamic-batch \
--max-batch 16
```
**Maximum optimization** (FP16 + INT8):
```bash
python scripts/convert_pt_to_tensorrt.py \
--model model.pt \
--output model.trt \
--fp16 \
--int8
```
### Command-Line Arguments
| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--model`, `-m` | Yes | - | Path to PyTorch model file (.pt or .pth) |
| `--output`, `-o` | Yes | - | Output path for TensorRT engine (.trt) |
| `--input-shape`, `-s` | No | 1,3,640,640 | Input tensor shape as B,C,H,W |
| `--fp16` | No | False | Enable FP16 precision (faster, ~same accuracy) |
| `--int8` | No | False | Enable INT8 precision (fastest, needs calibration) |
| `--dynamic-batch` | No | False | Enable dynamic batch size support |
| `--max-batch` | No | 16 | Maximum batch size for dynamic batching |
| `--workspace-size` | No | 4 | TensorRT workspace size in GB |
| `--gpu` | No | 0 | GPU device ID to use |
| `--input-names` | No | ["input"] | Custom input tensor names |
| `--output-names` | No | ["output"] | Custom output tensor names |
| `--keep-onnx` | No | False | Keep intermediate ONNX file for debugging |
| `--verbose`, `-v` | No | False | Enable verbose logging |
### Performance Tips
1. **Always use FP16** unless you need FP32 precision:
- 2x faster inference
- 50% less VRAM usage
- Minimal accuracy loss for most models
2. **Use dynamic batching** for variable workloads:
- Process 1-16 images with same engine
- Automatic optimization for common batch sizes
3. **Increase workspace size** for complex models:
- Default 4GB works for most models
- Increase to 8GB for very large models
4. **INT8 quantization** for maximum speed:
- Requires calibration data (not included in basic conversion; a calibrator sketch follows this list)
- 4x faster than FP32
- Best for deployment scenarios
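Calibration is not implemented in the script. As a starting point, here is a hedged sketch of an entropy calibrator that feeds preprocessed GPU batches from PyTorch; the class name, batch list, and cache path are illustrative, not part of the script:
```python
import tensorrt as trt
import torch

class TorchEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds calibration batches (CUDA float32 tensors shaped like the model input) to TensorRT."""
    def __init__(self, batches, cache_file="calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)
        self.cache_file = cache_file
        self.current = None                    # keep a reference so the memory stays alive

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            self.current = next(self.batches).contiguous()
            return [int(self.current.data_ptr())]   # device pointer per input
        except StopIteration:
            return None                             # signals end of calibration data

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Wire it up on the builder config before building the engine:
# config.int8_calibrator = TorchEntropyCalibrator(calibration_batches)
```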
### Integration with Model Repository
Once converted, use the TensorRT engine with the model repository:
```python
from services.model_repository import TensorRTModelRepository
# Initialize repository
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
# Load the converted model
repo.load_model(
model_id="my_model",
file_path="models/model.trt",
num_contexts=4
)
# Run inference
import torch
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(
model_id="my_model",
inputs={"input": input_tensor}
)
```
### Troubleshooting
**Issue**: `Failed to parse ONNX model`
- Solution: Check if your PyTorch model is compatible with ONNX export
- Try updating PyTorch and ONNX versions
**Issue**: `FP16 not supported on this platform`
- Solution: Your GPU doesn't support FP16. Remove `--fp16` flag
**Issue**: `Out of memory during conversion`
- Solution: Reduce `--workspace-size` or free up GPU memory
**Issue**: `Model contains only state_dict`
- Solution: Your checkpoint only has weights. You need the full model architecture.
- Modify the script's `load_pytorch_model()` method to instantiate your model class (see the sketch below)
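A minimal sketch of that modification, assuming a hypothetical `MyModel` architecture class; adapt the import and constructor to your project:
```python
import torch
from my_project.models import MyModel   # hypothetical: your model definition lives here

def load_from_state_dict(checkpoint_path: str, device: str = "cuda:0") -> torch.nn.Module:
    model = MyModel()                                        # instantiate the architecture
    state = torch.load(checkpoint_path, map_location=device)
    state = state.get("state_dict", state)                   # unwrap nested checkpoints
    model.load_state_dict(state)
    return model.eval().to(device)
```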
### Examples for Common Models
**YOLOv8**:
```bash
# Download model first
# yolo export model=yolov8n.pt format=engine device=0
# Or use this script
python scripts/convert_pt_to_tensorrt.py \
--model yolov8n.pt \
--output models/yolov8n.trt \
--input-shape 1,3,640,640 \
--fp16
```
**ResNet**:
```bash
python scripts/convert_pt_to_tensorrt.py \
--model resnet50.pt \
--output models/resnet50.trt \
--input-shape 1,3,224,224 \
--fp16 \
--dynamic-batch \
--max-batch 32
```
**Custom Model**:
```bash
python scripts/convert_pt_to_tensorrt.py \
--model custom_model.pt \
--output models/custom.trt \
--input-shape 1,3,512,512 \
--input-names image \
--output-names predictions \
--fp16 \
--verbose
```
### Notes
- The script uses ONNX as an intermediate format, which is the recommended approach
- TensorRT engines are hardware-specific; rebuild for different GPUs
- Conversion time varies (30 seconds to 5 minutes depending on model size)
- The first inference after loading is slower (warmup); a short warm-up sketch follows
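A short warm-up sketch, assuming the repository and model ID from the integration example above:
```python
import torch

# Run a few dummy inferences so the first real frame doesn't pay the warm-up cost
dummy = torch.rand(1, 3, 640, 640, device="cuda:0")
for _ in range(3):
    repo.infer(model_id="my_model", inputs={"input": dummy})
```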
### Support
For issues or questions, please check:
- TensorRT documentation: https://docs.nvidia.com/deeplearning/tensorrt/
- PyTorch ONNX export guide: https://pytorch.org/docs/stable/onnx.html

562
scripts/convert_pt_to_tensorrt.py Executable file

@ -0,0 +1,562 @@
#!/usr/bin/env python3
"""
PyTorch to TensorRT Model Conversion Script
This script converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized inference.
Features:
- Automatic FP32/FP16/INT8 precision modes
- Dynamic batch size support
- Input shape validation
- Optimization profiles for dynamic shapes
- ONNX intermediate format
- GPU-accelerated conversion
Usage:
python convert_pt_to_tensorrt.py --model path/to/model.pt --output models/model.trt
python convert_pt_to_tensorrt.py --model yolov8n.pt --input-shape 1,3,640,640 --fp16
python convert_pt_to_tensorrt.py --model model.pt --dynamic-batch --max-batch 16
"""
import argparse
import sys
from pathlib import Path
from typing import Tuple, List, Optional
import torch
import tensorrt as trt
import numpy as np
class TensorRTConverter:
"""Converts PyTorch models to TensorRT engines"""
def __init__(self, gpu_id: int = 0, verbose: bool = True):
"""
Initialize the converter.
Args:
gpu_id: GPU device ID to use for conversion
verbose: Enable verbose logging
"""
self.gpu_id = gpu_id
self.device = torch.device(f'cuda:{gpu_id}')
# TensorRT logger
log_level = trt.Logger.VERBOSE if verbose else trt.Logger.WARNING
self.logger = trt.Logger(log_level)
# Set CUDA device
torch.cuda.set_device(gpu_id)
print(f"Initialized TensorRT Converter on GPU {gpu_id}")
print(f"PyTorch version: {torch.__version__}")
print(f"TensorRT version: {trt.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"CUDA device: {torch.cuda.get_device_name(gpu_id)}")
def load_pytorch_model(self, model_path: str) -> torch.nn.Module:
"""
Load PyTorch model from file.
Args:
model_path: Path to .pt or .pth file
Returns:
Loaded PyTorch model in eval mode
"""
print(f"\nLoading PyTorch model from {model_path}...")
if not Path(model_path).exists():
raise FileNotFoundError(f"Model file not found: {model_path}")
# Load model (weights_only=False for models with custom classes)
checkpoint = torch.load(model_path, map_location=self.device, weights_only=False)
# Handle different checkpoint formats
if isinstance(checkpoint, dict):
if 'model' in checkpoint:
model = checkpoint['model']
elif 'state_dict' in checkpoint:
# Need model architecture - this is a limitation
raise ValueError(
"Checkpoint contains only state_dict. "
"Please provide the complete model or modify this script to load your architecture."
)
else:
raise ValueError("Unknown checkpoint format")
else:
model = checkpoint
# Set to eval mode
model.eval()
model.to(self.device)
print(f"✓ Model loaded successfully")
return model
def export_to_onnx(self, model: torch.nn.Module, input_shape: Tuple[int, ...],
onnx_path: str, dynamic_batch: bool = False,
input_names: List[str] = None, output_names: List[str] = None) -> str:
"""
Export PyTorch model to ONNX format (intermediate step).
Args:
model: PyTorch model
input_shape: Input tensor shape (B, C, H, W)
onnx_path: Output path for ONNX file
dynamic_batch: Enable dynamic batch dimension
input_names: List of input tensor names
output_names: List of output tensor names
Returns:
Path to exported ONNX file
"""
print(f"\nExporting to ONNX format...")
print(f"Input shape: {input_shape}")
print(f"Dynamic batch: {dynamic_batch}")
# Default names
if input_names is None:
input_names = ['input']
if output_names is None:
output_names = ['output']
# Create dummy input
dummy_input = torch.randn(*input_shape, device=self.device)
# Dynamic axes configuration
dynamic_axes = None
if dynamic_batch:
dynamic_axes = {
input_names[0]: {0: 'batch'},
output_names[0]: {0: 'batch'}
}
# Export to ONNX
torch.onnx.export(
model,
dummy_input,
onnx_path,
input_names=input_names,
output_names=output_names,
dynamic_axes=dynamic_axes,
opset_version=17, # Use recent ONNX opset
do_constant_folding=True,
verbose=False
)
print(f"✓ ONNX model exported to {onnx_path}")
return onnx_path
def build_tensorrt_engine_from_onnx(self, onnx_path: str, engine_path: str,
fp16: bool = False, int8: bool = False,
max_workspace_size: int = 4,
min_batch: int = 1, opt_batch: int = 1, max_batch: int = 1) -> str:
"""
Build TensorRT engine from ONNX model.
Args:
onnx_path: Path to ONNX model
engine_path: Output path for TensorRT engine
fp16: Enable FP16 precision
int8: Enable INT8 precision (requires calibration)
max_workspace_size: Maximum workspace size in GB
min_batch: Minimum batch size for optimization
opt_batch: Optimal batch size for optimization
max_batch: Maximum batch size for optimization
Returns:
Path to built TensorRT engine
"""
print(f"\nBuilding TensorRT engine from ONNX...")
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
print(f"Workspace size: {max_workspace_size} GB")
# Create builder and network
builder = trt.Builder(self.logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, self.logger)
# Parse ONNX model
print(f"Loading ONNX file from {onnx_path}...")
with open(onnx_path, 'rb') as f:
if not parser.parse(f.read()):
print("ERROR: Failed to parse the ONNX file:")
for error in range(parser.num_errors):
print(f" {parser.get_error(error)}")
raise RuntimeError("Failed to parse ONNX model")
print(f"✓ ONNX model parsed successfully")
# Print network info
print(f"\nNetwork Information:")
print(f" Inputs: {network.num_inputs}")
for i in range(network.num_inputs):
inp = network.get_input(i)
print(f" [{i}] {inp.name}: {inp.shape} ({inp.dtype})")
print(f" Outputs: {network.num_outputs}")
for i in range(network.num_outputs):
out = network.get_output(i)
print(f" [{i}] {out.name}: {out.shape} ({out.dtype})")
# Create builder config
config = builder.create_builder_config()
# Set workspace size
config.set_memory_pool_limit(
trt.MemoryPoolType.WORKSPACE,
max_workspace_size * (1 << 30) # GB to bytes
)
# Enable precision modes
if fp16:
if not builder.platform_has_fast_fp16:
print("Warning: FP16 not supported on this platform, using FP32")
else:
config.set_flag(trt.BuilderFlag.FP16)
print("✓ FP16 mode enabled")
if int8:
if not builder.platform_has_fast_int8:
print("Warning: INT8 not supported on this platform, using FP32/FP16")
else:
config.set_flag(trt.BuilderFlag.INT8)
print("✓ INT8 mode enabled")
print("Note: INT8 calibration not implemented. Results may be suboptimal.")
# Set optimization profile for dynamic shapes
if max_batch > 1 or min_batch != max_batch:
profile = builder.create_optimization_profile()
for i in range(network.num_inputs):
inp = network.get_input(i)
shape = list(inp.shape)
# Handle dynamic batch dimension
if shape[0] == -1:
# Min, opt, max shapes
min_shape = [min_batch] + shape[1:]
opt_shape = [opt_batch] + shape[1:]
max_shape = [max_batch] + shape[1:]
profile.set_shape(inp.name, min_shape, opt_shape, max_shape)
print(f" Dynamic shape for {inp.name}:")
print(f" Min: {min_shape}")
print(f" Opt: {opt_shape}")
print(f" Max: {max_shape}")
config.add_optimization_profile(profile)
# Build engine
print(f"\nBuilding TensorRT engine (this may take a few minutes)...")
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
raise RuntimeError("Failed to build TensorRT engine")
# Save engine to file
print(f"Saving engine to {engine_path}...")
with open(engine_path, 'wb') as f:
f.write(serialized_engine)
# Get file size
file_size_mb = Path(engine_path).stat().st_size / (1024 * 1024)
print(f"✓ TensorRT engine built successfully")
print(f" Engine size: {file_size_mb:.2f} MB")
return engine_path
def convert(self, model_path: str, output_path: str,
input_shape: Tuple[int, ...] = (1, 3, 640, 640),
fp16: bool = False, int8: bool = False,
dynamic_batch: bool = False,
max_batch: int = 16,
workspace_size: int = 4,
input_names: List[str] = None,
output_names: List[str] = None,
keep_onnx: bool = False) -> str:
"""
Convert PyTorch or ONNX model to TensorRT engine.
Args:
model_path: Path to PyTorch model (.pt, .pth) or ONNX model (.onnx)
output_path: Path for output TensorRT engine (.trt)
input_shape: Input tensor shape (B, C, H, W) - required for PyTorch models
fp16: Enable FP16 precision
int8: Enable INT8 precision
dynamic_batch: Enable dynamic batch size
max_batch: Maximum batch size (for dynamic batching)
workspace_size: TensorRT workspace size in GB
input_names: Custom input names (for PyTorch export)
output_names: Custom output names (for PyTorch export)
keep_onnx: Keep intermediate ONNX file
Returns:
Path to created TensorRT engine
"""
# Create output directory
output_dir = Path(output_path).parent
output_dir.mkdir(parents=True, exist_ok=True)
# Check if input is already ONNX
model_path_obj = Path(model_path)
is_onnx = model_path_obj.suffix.lower() == '.onnx'
if is_onnx:
# Direct ONNX to TensorRT conversion
print(f"Input is ONNX model, converting directly to TensorRT...")
min_batch = 1
opt_batch = input_shape[0] if not dynamic_batch else max(1, max_batch // 2)
max_batch_size = max_batch if dynamic_batch else input_shape[0]
engine_path = self.build_tensorrt_engine_from_onnx(
onnx_path=model_path,
engine_path=output_path,
fp16=fp16,
int8=int8,
max_workspace_size=workspace_size,
min_batch=min_batch,
opt_batch=opt_batch,
max_batch=max_batch_size
)
print(f"\n{'=' * 80}")
print(f"CONVERSION COMPLETED SUCCESSFULLY")
print(f"{'=' * 80}")
print(f"Input: {model_path}")
print(f"Output: {engine_path}")
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
print(f"{'=' * 80}")
return engine_path
# PyTorch to TensorRT conversion (via ONNX)
# Temporary ONNX path
onnx_path = str(output_dir / "temp_model.onnx")
try:
# Step 1: Load PyTorch model
model = self.load_pytorch_model(model_path)
# Step 2: Export to ONNX
self.export_to_onnx(
model=model,
input_shape=input_shape,
onnx_path=onnx_path,
dynamic_batch=dynamic_batch,
input_names=input_names,
output_names=output_names
)
# Step 3: Build TensorRT engine
min_batch = 1
opt_batch = input_shape[0] if not dynamic_batch else max(1, max_batch // 2)
max_batch_size = max_batch if dynamic_batch else input_shape[0]
engine_path = self.build_tensorrt_engine_from_onnx(
onnx_path=onnx_path,
engine_path=output_path,
fp16=fp16,
int8=int8,
max_workspace_size=workspace_size,
min_batch=min_batch,
opt_batch=opt_batch,
max_batch=max_batch_size
)
print(f"\n{'=' * 80}")
print(f"CONVERSION COMPLETED SUCCESSFULLY")
print(f"{'=' * 80}")
print(f"Input: {model_path}")
print(f"Output: {engine_path}")
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
print(f"Dynamic batch: {dynamic_batch}")
if dynamic_batch:
print(f"Batch range: [1, {max_batch}]")
print(f"{'=' * 80}")
return engine_path
finally:
# Cleanup temporary ONNX file
if not keep_onnx and Path(onnx_path).exists():
Path(onnx_path).unlink()
print(f"Cleaned up temporary ONNX file")
def parse_shape(shape_str: str) -> Tuple[int, ...]:
"""Parse shape string like '1,3,640,640' to tuple"""
try:
return tuple(int(x) for x in shape_str.split(','))
except ValueError:
raise argparse.ArgumentTypeError(
f"Invalid shape format: {shape_str}. Expected format: 1,3,640,640"
)
def main():
parser = argparse.ArgumentParser(
description="Convert PyTorch models to TensorRT engines",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic conversion (FP32)
python convert_pt_to_tensorrt.py --model yolov8n.pt --output models/yolov8n.trt
# FP16 precision for faster inference
python convert_pt_to_tensorrt.py --model model.pt --output model.trt --fp16
# Custom input shape
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
--input-shape 1,3,416,416
# Dynamic batch size (1 to 16)
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
--dynamic-batch --max-batch 16
# INT8 quantization for maximum speed (requires calibration)
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
--fp16 --int8
# Keep intermediate ONNX file for debugging
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
--keep-onnx
"""
)
# Required arguments
parser.add_argument(
'--model', '-m',
type=str,
required=True,
help='Path to PyTorch model file (.pt or .pth)'
)
parser.add_argument(
'--output', '-o',
type=str,
required=True,
help='Output path for TensorRT engine (.trt or .engine)'
)
# Optional arguments
parser.add_argument(
'--input-shape', '-s',
type=parse_shape,
default=(1, 3, 640, 640),
help='Input tensor shape as B,C,H,W (default: 1,3,640,640)'
)
parser.add_argument(
'--fp16',
action='store_true',
help='Enable FP16 precision (faster inference, slightly lower accuracy)'
)
parser.add_argument(
'--int8',
action='store_true',
help='Enable INT8 precision (fastest, requires calibration)'
)
parser.add_argument(
'--dynamic-batch',
action='store_true',
help='Enable dynamic batch size support'
)
parser.add_argument(
'--max-batch',
type=int,
default=16,
help='Maximum batch size for dynamic batching (default: 16)'
)
parser.add_argument(
'--workspace-size',
type=int,
default=4,
help='TensorRT workspace size in GB (default: 4)'
)
parser.add_argument(
'--gpu',
type=int,
default=0,
help='GPU device ID (default: 0)'
)
parser.add_argument(
'--input-names',
type=str,
nargs='+',
default=None,
help='Custom input tensor names (default: ["input"])'
)
parser.add_argument(
'--output-names',
type=str,
nargs='+',
default=None,
help='Custom output tensor names (default: ["output"])'
)
parser.add_argument(
'--keep-onnx',
action='store_true',
help='Keep intermediate ONNX file'
)
parser.add_argument(
'--verbose', '-v',
action='store_true',
help='Enable verbose logging'
)
args = parser.parse_args()
# Validate arguments
if not Path(args.model).exists():
print(f"Error: Model file not found: {args.model}")
sys.exit(1)
if args.int8 and not args.fp16:
print("Warning: INT8 mode works best with FP16 enabled. Adding --fp16 flag.")
args.fp16 = True
# Run conversion
try:
converter = TensorRTConverter(gpu_id=args.gpu, verbose=args.verbose)
converter.convert(
model_path=args.model,
output_path=args.output,
input_shape=args.input_shape,
fp16=args.fp16,
int8=args.int8,
dynamic_batch=args.dynamic_batch,
max_batch=args.max_batch,
workspace_size=args.workspace_size,
input_names=args.input_names,
output_names=args.output_names,
keep_onnx=args.keep_onnx
)
print("\n✓ Conversion successful!")
except Exception as e:
print(f"\n✗ Conversion failed: {e}")
if args.verbose:
import traceback
traceback.print_exc()
sys.exit(1)
if __name__ == "__main__":
main()


@ -0,0 +1,380 @@
# TensorRT Model Repository
Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.
## Architecture
### Key Features
1. **Model Deduplication by File Hash**
- Multiple model IDs can point to the same model file
- Only one engine loaded in VRAM per unique file
- Example: 100 cameras with same model = 1 engine (not 100!)
2. **Context Pooling for Load Balancing**
- Each unique engine has N execution contexts (configurable)
- Contexts borrowed/returned via mutex-based queue
- Enables concurrent inference without context-per-model overhead
- Example: 100 cameras sharing 4 contexts efficiently
3. **GPU-to-GPU Inference**
- All inputs/outputs stay in VRAM (zero CPU transfers)
- Integrates seamlessly with StreamDecoder (frames already on GPU)
- Maximum performance for video inference pipelines
4. **Thread-Safe Concurrent Inference**
- Mutex-based context acquisition (TensorRT best practice)
- No shared IExecutionContext across threads (safe)
- Multiple threads can infer concurrently (limited by pool size)
## Design Rationale
### Why Context Pooling?
**Without pooling** (naive approach):
```
100 cameras → 100 model IDs → 100 execution contexts
```
- Problem: Each context consumes VRAM (layers, workspace, etc.)
- Problem: Context creation overhead per camera
- Problem: Doesn't scale to hundreds of cameras
**With pooling** (our approach):
```
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
```
- Solution: Contexts shared across all cameras using same model
- Solution: Borrow/return mechanism with mutex queue
- Solution: Scales to any number of cameras with fixed context count (a minimal borrow/return sketch follows)
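A minimal sketch of the borrow/return mechanism, assuming a standard-library `Queue` of contexts; the full implementation lives in `services/model_repository.py`:
```python
from queue import Queue, Empty

class ContextPool:
    """N execution contexts shared by any number of model IDs."""
    def __init__(self, contexts):
        self._pool = Queue()
        for ctx in contexts:
            self._pool.put(ctx)

    def run(self, fn, timeout=5.0):
        try:
            ctx = self._pool.get(timeout=timeout)   # borrow (blocks while all contexts are busy)
        except Empty:
            raise RuntimeError("all contexts busy")
        try:
            return fn(ctx)                          # caller executes inference with the borrowed context
        finally:
            self._pool.put(ctx)                     # always return the context, even on error
```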
### Memory Savings Example
YOLOv8n model (~6MB engine file):
| Approach | Model IDs | Engines | Contexts | Approx VRAM |
|----------|-----------|---------|----------|-------------|
| Naive | 100 | 100 | 100 | ~1.5 GB |
| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |
**50x memory savings!**
## Usage
### Basic Usage
```python
from services.model_repository import TensorRTModelRepository
# Initialize repository
repo = TensorRTModelRepository(
gpu_id=0,
default_num_contexts=4 # 4 contexts per unique engine
)
# Load model for camera 1
repo.load_model(
model_id="camera_1",
file_path="models/yolov8n.trt"
)
# Load same model for camera 2 (deduplication happens automatically)
repo.load_model(
model_id="camera_2",
file_path="models/yolov8n.trt" # Same file → shares engine and contexts!
)
# Run inference (GPU-to-GPU)
import torch
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(
model_id="camera_1",
inputs={"images": input_tensor},
synchronize=True,
timeout=5.0 # Wait up to 5s for available context
)
# Outputs stay on GPU
for name, tensor in outputs.items():
print(f"{name}: {tensor.shape} on {tensor.device}")
```
### Multi-Camera Scenario
```python
# Setup multiple cameras
cameras = [f"camera_{i}" for i in range(100)]
# Load same model for all cameras
for camera_id in cameras:
    repo.load_model(
        model_id=camera_id,
        file_path="models/yolov8n.trt"  # Same file for all
    )
# Check efficiency
stats = repo.get_stats()
print(f"Model IDs: {stats['total_model_ids']}") # 100
print(f"Unique engines: {stats['unique_engines']}") # 1
print(f"Total contexts: {stats['total_contexts']}") # 4
```
### Integration with RTSP Decoder
```python
from services.stream_decoder import StreamDecoderFactory
from services.model_repository import TensorRTModelRepository
# Setup
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0)
# Create decoder for camera
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
decoder.start()
# Load inference model
model_repo.load_model("camera_main", "models/yolov8n.trt")
# Process frames (everything on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True) # torch.Tensor on CUDA
# Preprocess (stays on GPU)
frame_gpu = frame_gpu.float() / 255.0
frame_gpu = frame_gpu.unsqueeze(0) # Add batch dim
# Inference (GPU-to-GPU, zero copy)
outputs = model_repo.infer(
model_id="camera_main",
inputs={"images": frame_gpu}
)
# Post-process outputs (can stay on GPU)
# ... NMS, bounding boxes, etc.
```
### Concurrent Inference
```python
import threading

def process_camera(camera_id: str, model_id: str):
    # Get frame from decoder (on GPU)
    frame = decoder.get_latest_frame(rgb=True)

    # Inference automatically borrows/returns context from pool
    outputs = repo.infer(
        model_id=model_id,
        inputs={"images": frame},
        timeout=10.0  # Wait for available context
    )
    # Process outputs...

# Multiple threads can infer concurrently
threads = []
for i in range(10):  # 10 threads
    t = threading.Thread(
        target=process_camera,
        args=(f"camera_{i}", f"camera_{i}")
    )
    threads.append(t)
    t.start()

for t in threads:
    t.join()

# With 4 contexts: up to 4 inferences run in parallel
# Others wait in queue, contexts auto-balanced
```
## API Reference
### TensorRTModelRepository
#### `__init__(gpu_id=0, default_num_contexts=4)`
Initialize the repository.
**Args:**
- `gpu_id`: GPU device ID
- `default_num_contexts`: Default context pool size per engine
#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
Load a TensorRT model.
**Args:**
- `model_id`: Unique identifier (e.g., "camera_1")
- `file_path`: Path to .trt/.engine file
- `num_contexts`: Context pool size (None = use default)
- `force_reload`: Reload if model_id exists
**Returns:** `ModelMetadata`
**Deduplication:** If file hash matches existing model, reuses engine + contexts.
#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
Run inference.
**Args:**
- `model_id`: Model identifier
- `inputs`: Dict mapping input names to CUDA tensors
- `synchronize`: Wait for completion
- `timeout`: Max wait time for context (seconds)
**Returns:** Dict mapping output names to CUDA tensors
**Thread-safe:** Borrows context from pool, returns after inference.
#### `unload_model(model_id)`
Unload a model.
If last reference to engine, fully unloads from VRAM.
#### `get_metadata(model_id)`
Get model metadata.
**Returns:** `ModelMetadata` or `None`
#### `get_model_info(model_id)`
Get detailed model information.
**Returns:** Dict with engine references, context pool size, shared model IDs, etc.
#### `get_stats()`
Get repository statistics.
**Returns:** Dict with total models, unique engines, contexts, memory efficiency.
## Best Practices
### 1. Set Appropriate Context Pool Size
```python
# For 10 cameras with same model, 4 contexts is usually enough
repo = TensorRTModelRepository(default_num_contexts=4)
# For high concurrency, increase pool size
repo = TensorRTModelRepository(default_num_contexts=8)
```
**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.
### 2. Always Use GPU Tensors
```python
# ✅ Good: Input on GPU
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(model_id, {"images": input_gpu})
# ❌ Bad: Input on CPU (will cause error)
input_cpu = torch.rand(1, 3, 640, 640)
outputs = repo.infer(model_id, {"images": input_cpu}) # ValueError!
```
### 3. Handle Timeout Gracefully
```python
try:
    outputs = repo.infer(
        model_id="camera_1",
        inputs=inputs,
        timeout=5.0
    )
except RuntimeError as e:
    # All contexts busy, increase pool size or add backpressure
    print(f"Inference timeout: {e}")
```
### 4. Use Same File for Deduplication
```python
# ✅ Good: Same file path → deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo.trt") # Shares engine!
# ❌ Bad: Different paths (even if same content) → no deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo_copy.trt") # Separate engine
```
## TensorRT Best Practices Implemented
Based on NVIDIA documentation and web search findings:
1. **Separate IExecutionContext per concurrent stream**
- Each context has its own CUDA stream
- Contexts never shared across threads simultaneously
2. **Mutex-based context management**
- Queue-based borrowing with locks
- Thread-safe acquire/release pattern
3. **GPU memory reuse**
- Engines shared by file hash
- Contexts pooled and reused
4. **Zero-copy operations**
- All data stays in VRAM
- DLPack integration with PyTorch
## Troubleshooting
### "No execution context available within timeout"
**Cause:** All contexts busy with concurrent inferences.
**Solutions:**
1. Increase context pool size:
```python
repo.load_model(model_id, file_path, num_contexts=8)
```
2. Increase timeout:
```python
outputs = repo.infer(model_id, inputs, timeout=30.0)
```
3. Add backpressure/throttling to limit concurrent requests
### Out of Memory (OOM)
**Cause:** Too many unique engines or large context pools.
**Solutions:**
1. Ensure deduplication working (same file paths)
2. Reduce context pool sizes
3. Use smaller models or quantization (INT8/FP16)
### Import Error: "tensorrt could not be resolved"
**Solution:** Install TensorRT:
```bash
pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version
```
## Performance Tips
1. **Batch Processing:** Process multiple frames before synchronizing
```python
outputs = repo.infer(model_id, inputs, synchronize=False)
# ... more inferences ...
torch.cuda.synchronize() # Sync once at end
```
2. **Async Inference:** Don't synchronize if not needed immediately
```python
outputs = repo.infer(model_id, inputs, synchronize=False)
# GPU continues working, CPU continues
# Synchronize later when you need results
```
3. **Monitor Context Utilization:**
```python
stats = repo.get_stats()
print(f"Contexts: {stats['total_contexts']}")
# If timeouts occur frequently, increase pool size
```
## License
Part of python-rtsp-worker project.

14
services/__init__.py Normal file

@ -0,0 +1,14 @@
"""
Services package for RTSP stream processing with GPU acceleration.
"""
from .stream_decoder import StreamDecoderFactory, StreamDecoder, ConnectionStatus
from .jpeg_encoder import JPEGEncoderFactory, encode_frame_to_jpeg
__all__ = [
'StreamDecoderFactory',
'StreamDecoder',
'ConnectionStatus',
'JPEGEncoderFactory',
'encode_frame_to_jpeg',
]

91
services/jpeg_encoder.py Normal file

@ -0,0 +1,91 @@
"""
JPEG Encoder wrapper for GPU-accelerated JPEG encoding using nvImageCodec/nvJPEG.
Provides a shared encoder instance that can be used across multiple streams.
"""
from typing import Optional
import torch
import nvidia.nvimgcodec as nvimgcodec
class JPEGEncoderFactory:
"""
Factory for creating and managing a shared JPEG encoder instance.
Thread-safe singleton pattern for efficient resource sharing.
"""
_instance = None
_encoder = None
def __new__(cls):
if cls._instance is None:
cls._instance = super(JPEGEncoderFactory, cls).__new__(cls)
cls._encoder = nvimgcodec.Encoder()
print("JPEGEncoderFactory initialized with shared nvJPEG encoder")
return cls._instance
@classmethod
def get_encoder(cls):
"""Get the shared JPEG encoder instance"""
if cls._encoder is None:
cls() # Initialize if not already done
return cls._encoder
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
"""
Encode an RGB frame to JPEG on GPU and return JPEG bytes.
This function:
1. Takes RGB frame from GPU (stays on GPU during encoding)
2. Converts PyTorch tensor to nvImageCodec image via as_image()
3. Encodes to JPEG using nvJPEG (GPU operation)
4. Transfers only JPEG bytes to CPU
5. Returns bytes for saving to disk
Args:
rgb_frame: RGB tensor on GPU, shape (3, H, W) or (H, W, 3), dtype uint8
quality: JPEG quality (0-100, default 95)
Returns:
JPEG encoded bytes or None if encoding fails
"""
if rgb_frame is None:
return None
try:
# Ensure we have (H, W, C) format and contiguous memory
if rgb_frame.dim() == 3:
if rgb_frame.shape[0] == 3:
# Convert from (C, H, W) to (H, W, C)
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
else:
# Already (H, W, C)
rgb_hwc = rgb_frame.contiguous()
else:
raise ValueError(f"Expected 3D tensor, got shape {rgb_frame.shape}")
# Get shared encoder
encoder = JPEGEncoderFactory.get_encoder()
# Create encode parameters with quality
# Quality is set via quality_value (0-100 scale)
jpeg_params = nvimgcodec.JpegEncodeParams(optimized_huffman=True)
encode_params = nvimgcodec.EncodeParams(
quality_value=float(quality),
jpeg_encode_params=jpeg_params
)
# Convert PyTorch GPU tensor to nvImageCodec image using __cuda_array_interface__
# This is zero-copy - nvimgcodec reads directly from GPU memory
nv_image = nvimgcodec.as_image(rgb_hwc)
# Encode to JPEG on GPU
# The encoding happens on GPU, only compressed JPEG bytes are transferred to CPU
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
return bytes(jpeg_data)
except Exception as e:
print(f"Error encoding frame to JPEG: {e}")
return None

631
services/model_repository.py Normal file

@ -0,0 +1,631 @@
import threading
import hashlib
from typing import Optional, Dict, Any, List, Tuple
from pathlib import Path
from queue import Queue, Empty
import torch
import tensorrt as trt
from dataclasses import dataclass
@dataclass
class ModelMetadata:
"""Metadata for a loaded TensorRT model"""
file_path: str
file_hash: str
input_shapes: Dict[str, Tuple[int, ...]]
output_shapes: Dict[str, Tuple[int, ...]]
input_names: List[str]
output_names: List[str]
input_dtypes: Dict[str, torch.dtype]
output_dtypes: Dict[str, torch.dtype]
class ExecutionContext:
"""
Wrapper for TensorRT execution context with CUDA stream.
Used in context pool for load balancing.
"""
def __init__(self, context: trt.IExecutionContext, stream: torch.cuda.Stream,
context_id: int, device: torch.device):
self.context = context
self.stream = stream
self.context_id = context_id
self.device = device
self.in_use = False
self.lock = threading.Lock()
def __repr__(self):
return f"ExecutionContext(id={self.context_id}, in_use={self.in_use})"
class SharedEngine:
"""
Shared TensorRT engine with context pool for load balancing.
Architecture:
- One engine shared across all model_ids with same file hash
- Pool of N execution contexts for concurrent inference
- Contexts are borrowed/returned using mutex locks
- Load balancing: contexts distributed across requests
"""
def __init__(self, engine: trt.ICudaEngine, file_hash: str, file_path: str,
num_contexts: int, device: torch.device, metadata: ModelMetadata):
self.engine = engine
self.file_hash = file_hash
self.file_path = file_path
self.metadata = metadata
self.device = device
self.num_contexts = num_contexts
# Create context pool
self.context_pool: List[ExecutionContext] = []
self.available_contexts: Queue[ExecutionContext] = Queue()
for i in range(num_contexts):
ctx = engine.create_execution_context()
if ctx is None:
raise RuntimeError(f"Failed to create execution context {i}")
stream = torch.cuda.Stream(device=device)
exec_ctx = ExecutionContext(ctx, stream, i, device)
self.context_pool.append(exec_ctx)
self.available_contexts.put(exec_ctx)
# Model IDs referencing this engine
self.model_ids: set = set()
self.lock = threading.Lock()
print(f"Created context pool with {num_contexts} contexts for engine {file_hash[:8]}...")
def acquire_context(self, timeout: Optional[float] = None) -> Optional[ExecutionContext]:
"""
Acquire an available execution context from the pool.
Blocks if all contexts are in use.
Args:
timeout: Max time to wait for context (None = wait forever)
Returns:
ExecutionContext or None if timeout
"""
try:
exec_ctx = self.available_contexts.get(timeout=timeout)
with exec_ctx.lock:
exec_ctx.in_use = True
return exec_ctx
except Empty:
return None
def release_context(self, exec_ctx: ExecutionContext):
"""
Return a context to the pool.
Args:
exec_ctx: Context to release
"""
with exec_ctx.lock:
exec_ctx.in_use = False
self.available_contexts.put(exec_ctx)
def add_model_id(self, model_id: str):
"""Add a model_id reference to this engine"""
with self.lock:
self.model_ids.add(model_id)
def remove_model_id(self, model_id: str) -> int:
"""
Remove a model_id reference from this engine.
Returns the number of remaining references.
"""
with self.lock:
self.model_ids.discard(model_id)
return len(self.model_ids)
def get_reference_count(self) -> int:
"""Get number of model_ids using this engine"""
with self.lock:
return len(self.model_ids)
def cleanup(self):
"""Cleanup all contexts"""
for exec_ctx in self.context_pool:
del exec_ctx.context
self.context_pool.clear()
del self.engine
class TensorRTModelRepository:
"""
Thread-safe repository for TensorRT models with context pooling and deduplication.
Architecture:
- Deduplication: Multiple model_ids with same file share one engine
- Context Pool: Each unique engine has N execution contexts (configurable)
- Load Balancing: Contexts are borrowed/returned via mutex queue
- Scalability: Adding 100 cameras with same model = 1 engine + N contexts (not 100 contexts!)
Best Practices:
- GPU-to-GPU: All inputs/outputs stay in VRAM (zero CPU transfers)
- Thread Safety: Mutex-based context borrowing (TensorRT best practice)
- Memory Efficient: Deduplicate by file hash, share engine across model_ids
- Concurrent: N contexts allow N parallel inferences per unique model
Example:
# 100 cameras, same model file
for i in range(100):
repo.load_model(f"camera_{i}", "yolov8.trt")
# Result: 1 engine in VRAM, N contexts (e.g., 4), not 100 contexts!
"""
def __init__(self, gpu_id: int = 0, default_num_contexts: int = 4):
"""
Initialize the model repository.
Args:
gpu_id: GPU device ID to use
default_num_contexts: Default number of execution contexts per unique engine
"""
self.gpu_id = gpu_id
self.device = torch.device(f'cuda:{gpu_id}')
self.default_num_contexts = default_num_contexts
# Model ID to engine mapping: model_id -> file_hash
self._model_to_hash: Dict[str, str] = {}
# Shared engines with context pools: file_hash -> SharedEngine
self._shared_engines: Dict[str, SharedEngine] = {}
# Locks for thread safety
self._repo_lock = threading.RLock()
# TensorRT logger
self.trt_logger = trt.Logger(trt.Logger.WARNING)
print(f"TensorRT Model Repository initialized on GPU {gpu_id}")
print(f"Default context pool size: {default_num_contexts} contexts per unique model")
@staticmethod
def compute_file_hash(file_path: str) -> str:
"""
Compute SHA256 hash of a file for deduplication.
Args:
file_path: Path to the file
Returns:
Hexadecimal hash string
"""
sha256_hash = hashlib.sha256()
with open(file_path, "rb") as f:
# Read in chunks to handle large files efficiently
for byte_block in iter(lambda: f.read(65536), b""):
sha256_hash.update(byte_block)
return sha256_hash.hexdigest()
def _load_engine(self, file_path: str) -> trt.ICudaEngine:
"""
Load TensorRT engine from file.
Args:
file_path: Path to .trt or .engine file
Returns:
TensorRT engine
"""
runtime = trt.Runtime(self.trt_logger)
with open(file_path, 'rb') as f:
engine_data = f.read()
engine = runtime.deserialize_cuda_engine(engine_data)
if engine is None:
raise RuntimeError(f"Failed to load TensorRT engine from {file_path}")
return engine
def _extract_metadata(self, engine: trt.ICudaEngine,
file_path: str, file_hash: str) -> ModelMetadata:
"""
Extract metadata from TensorRT engine.
Args:
engine: TensorRT engine
file_path: Path to model file
file_hash: SHA256 hash of model file
Returns:
ModelMetadata object
"""
input_shapes = {}
output_shapes = {}
input_names = []
output_names = []
input_dtypes = {}
output_dtypes = {}
# TensorRT dtype to PyTorch dtype mapping
trt_to_torch_dtype = {
trt.DataType.FLOAT: torch.float32,
trt.DataType.HALF: torch.float16,
trt.DataType.INT8: torch.int8,
trt.DataType.INT32: torch.int32,
trt.DataType.BOOL: torch.bool,
}
# Iterate through all tensors (inputs and outputs) - TensorRT 10.x API
for i in range(engine.num_io_tensors):
name = engine.get_tensor_name(i)
shape = tuple(engine.get_tensor_shape(name))
dtype = trt_to_torch_dtype.get(engine.get_tensor_dtype(name), torch.float32)
mode = engine.get_tensor_mode(name)
if mode == trt.TensorIOMode.INPUT:
input_names.append(name)
input_shapes[name] = shape
input_dtypes[name] = dtype
else:
output_names.append(name)
output_shapes[name] = shape
output_dtypes[name] = dtype
return ModelMetadata(
file_path=file_path,
file_hash=file_hash,
input_shapes=input_shapes,
output_shapes=output_shapes,
input_names=input_names,
output_names=output_names,
input_dtypes=input_dtypes,
output_dtypes=output_dtypes
)
def load_model(self, model_id: str, file_path: str,
num_contexts: Optional[int] = None,
force_reload: bool = False) -> ModelMetadata:
"""
Load a TensorRT model with the given ID.
Deduplication: If a model with the same file hash is already loaded, the model_id
is simply mapped to the existing SharedEngine (no new engine or contexts created).
Args:
model_id: User-defined identifier for this model (e.g., "camera_1")
file_path: Path to TensorRT engine file (.trt or .engine)
num_contexts: Number of execution contexts in pool (None = use default)
force_reload: If True, reload even if model_id exists
Returns:
ModelMetadata for the loaded model
Raises:
FileNotFoundError: If model file doesn't exist
RuntimeError: If engine loading fails
ValueError: If model_id already exists and force_reload is False
"""
file_path = str(Path(file_path).resolve())
if not Path(file_path).exists():
raise FileNotFoundError(f"Model file not found: {file_path}")
if num_contexts is None:
num_contexts = self.default_num_contexts
with self._repo_lock:
# Check if model_id already exists
if model_id in self._model_to_hash and not force_reload:
raise ValueError(
f"Model ID '{model_id}' already exists. "
f"Use force_reload=True to reload or choose a different ID."
)
# Unload existing model if force_reload
if model_id in self._model_to_hash and force_reload:
self.unload_model(model_id)
# Compute file hash for deduplication
print(f"Computing hash for {file_path}...")
file_hash = self.compute_file_hash(file_path)
print(f"File hash: {file_hash[:16]}...")
# Check if this file is already loaded (deduplication)
if file_hash in self._shared_engines:
shared_engine = self._shared_engines[file_hash]
print(f"Engine already loaded (hash match), reusing engine and context pool...")
print(f" Existing model_ids using this engine: {shared_engine.model_ids}")
else:
# Load new engine
print(f"Loading TensorRT engine from {file_path}...")
engine = self._load_engine(file_path)
# Extract metadata
metadata = self._extract_metadata(engine, file_path, file_hash)
# Create shared engine with context pool
shared_engine = SharedEngine(
engine=engine,
file_hash=file_hash,
file_path=file_path,
num_contexts=num_contexts,
device=self.device,
metadata=metadata
)
self._shared_engines[file_hash] = shared_engine
# Add this model_id to the shared engine
shared_engine.add_model_id(model_id)
# Map model_id to file_hash
self._model_to_hash[model_id] = file_hash
print(f"Model '{model_id}' loaded successfully")
print(f" Inputs: {shared_engine.metadata.input_names}")
for name in shared_engine.metadata.input_names:
print(f" {name}: {shared_engine.metadata.input_shapes[name]} ({shared_engine.metadata.input_dtypes[name]})")
print(f" Outputs: {shared_engine.metadata.output_names}")
for name in shared_engine.metadata.output_names:
print(f" {name}: {shared_engine.metadata.output_shapes[name]} ({shared_engine.metadata.output_dtypes[name]})")
print(f" Context pool size: {num_contexts}")
print(f" Model IDs sharing this engine: {shared_engine.get_reference_count()}")
print(f" Unique engines in VRAM: {len(self._shared_engines)}")
return shared_engine.metadata
def infer(self, model_id: str, inputs: Dict[str, torch.Tensor],
synchronize: bool = True, timeout: Optional[float] = 5.0) -> Dict[str, torch.Tensor]:
"""
Run GPU-to-GPU inference with the specified model using context pooling.
All inputs must be CUDA tensors and outputs will be CUDA tensors (stays in VRAM).
Thread-safe: Borrows an execution context from the pool with mutex locking.
Args:
model_id: Model identifier
inputs: Dictionary mapping input names to CUDA tensors
synchronize: If True, wait for inference to complete. If False, async execution.
timeout: Max time to wait for available context (seconds)
Returns:
Dictionary mapping output names to CUDA tensors (in VRAM)
Raises:
KeyError: If model_id not found
ValueError: If inputs don't match expected shapes or are not on GPU
RuntimeError: If no context available within timeout
"""
# Get shared engine
if model_id not in self._model_to_hash:
raise KeyError(f"Model '{model_id}' not found. Available: {list(self._model_to_hash.keys())}")
file_hash = self._model_to_hash[model_id]
shared_engine = self._shared_engines[file_hash]
metadata = shared_engine.metadata
# Validate inputs
for name in metadata.input_names:
if name not in inputs:
raise ValueError(f"Missing required input: {name}")
tensor = inputs[name]
if not tensor.is_cuda:
raise ValueError(f"Input '{name}' must be a CUDA tensor (on GPU)")
# Check device
if tensor.device != self.device:
print(f"Warning: Input '{name}' on {tensor.device}, moving to {self.device}")
inputs[name] = tensor.to(self.device)
# Acquire context from pool (mutex-based)
exec_ctx = shared_engine.acquire_context(timeout=timeout)
if exec_ctx is None:
raise RuntimeError(
f"No execution context available for model '{model_id}' within {timeout}s. "
f"All {shared_engine.num_contexts} contexts are busy."
)
try:
# Prepare output tensors
outputs = {}
# Set input tensors - TensorRT 10.x API
for name in metadata.input_names:
input_tensor = inputs[name].contiguous()
exec_ctx.context.set_tensor_address(name, input_tensor.data_ptr())
# Allocate and set output tensors
for name in metadata.output_names:
output_shape = metadata.output_shapes[name]
output_dtype = metadata.output_dtypes[name]
output_tensor = torch.empty(
output_shape,
dtype=output_dtype,
device=self.device
)
outputs[name] = output_tensor
exec_ctx.context.set_tensor_address(name, output_tensor.data_ptr())
# Execute inference on context's stream - TensorRT 10.x API
with torch.cuda.stream(exec_ctx.stream):
success = exec_ctx.context.execute_async_v3(
stream_handle=exec_ctx.stream.cuda_stream
)
if not success:
raise RuntimeError(f"Inference failed for model '{model_id}'")
# Synchronize if requested
if synchronize:
exec_ctx.stream.synchronize()
return outputs
finally:
# Always release context back to pool
shared_engine.release_context(exec_ctx)
def infer_batch(self, model_id: str, batch_inputs: List[Dict[str, torch.Tensor]],
synchronize: bool = True) -> List[Dict[str, torch.Tensor]]:
"""
Run inference sequentially on multiple inputs.
A context is borrowed and returned for each input, so other threads can run inference on the same engine concurrently while this call proceeds.
Args:
model_id: Model identifier
batch_inputs: List of input dictionaries
synchronize: If True, wait for all inferences to complete
Returns:
List of output dictionaries
"""
results = []
for inputs in batch_inputs:
outputs = self.infer(model_id, inputs, synchronize=synchronize)
results.append(outputs)
return results
def unload_model(self, model_id: str):
"""
Unload a model from the repository.
Removes the model_id reference from the shared engine. If this was the last
reference, the engine and all its contexts will be fully unloaded from VRAM.
Args:
model_id: Model identifier to unload
"""
with self._repo_lock:
if model_id not in self._model_to_hash:
print(f"Warning: Model '{model_id}' not found")
return
file_hash = self._model_to_hash[model_id]
# Remove model_id from shared engine
if file_hash in self._shared_engines:
shared_engine = self._shared_engines[file_hash]
remaining_refs = shared_engine.remove_model_id(model_id)
# If no more references, cleanup engine and contexts
if remaining_refs == 0:
shared_engine.cleanup()
del self._shared_engines[file_hash]
print(f"Model '{model_id}' unloaded, engine removed from VRAM (0 references)")
else:
print(f"Model '{model_id}' unloaded, engine kept in VRAM ({remaining_refs} references)")
# Remove from model_id mapping
del self._model_to_hash[model_id]
def get_metadata(self, model_id: str) -> Optional[ModelMetadata]:
"""
Get metadata for a loaded model.
Args:
model_id: Model identifier
Returns:
ModelMetadata or None if not found
"""
if model_id not in self._model_to_hash:
return None
file_hash = self._model_to_hash[model_id]
if file_hash not in self._shared_engines:
return None
return self._shared_engines[file_hash].metadata
def list_models(self) -> Dict[str, ModelMetadata]:
"""
List all loaded models.
Returns:
Dictionary mapping model_id to ModelMetadata
"""
with self._repo_lock:
result = {}
for model_id, file_hash in self._model_to_hash.items():
if file_hash in self._shared_engines:
result[model_id] = self._shared_engines[file_hash].metadata
return result
def get_model_info(self, model_id: str) -> Optional[Dict[str, Any]]:
"""
Get detailed information about a loaded model.
Args:
model_id: Model identifier
Returns:
Dictionary with model information or None if not found
"""
if model_id not in self._model_to_hash:
return None
file_hash = self._model_to_hash[model_id]
if file_hash not in self._shared_engines:
return None
shared_engine = self._shared_engines[file_hash]
metadata = shared_engine.metadata
return {
'model_id': model_id,
'file_path': metadata.file_path,
'file_hash': metadata.file_hash[:16] + '...',
'engine_references': shared_engine.get_reference_count(),
'context_pool_size': shared_engine.num_contexts,
'shared_with_model_ids': list(shared_engine.model_ids),
'inputs': {
name: {
'shape': metadata.input_shapes[name],
'dtype': str(metadata.input_dtypes[name])
}
for name in metadata.input_names
},
'outputs': {
name: {
'shape': metadata.output_shapes[name],
'dtype': str(metadata.output_dtypes[name])
}
for name in metadata.output_names
}
}
def get_stats(self) -> Dict[str, Any]:
"""
Get repository statistics.
Returns:
Dictionary with stats about loaded models and memory usage
"""
with self._repo_lock:
total_contexts = sum(
engine.num_contexts
for engine in self._shared_engines.values()
)
return {
'total_model_ids': len(self._model_to_hash),
'unique_engines': len(self._shared_engines),
'total_contexts': total_contexts,
'memory_efficiency': f"{len(self._model_to_hash)} model IDs using only {len(self._shared_engines)} engines",
'gpu_id': self.gpu_id,
'models': list(self._model_to_hash.keys())
}
def __repr__(self):
with self._repo_lock:
return (f"TensorRTModelRepository(gpu={self.gpu_id}, "
f"model_ids={len(self._model_to_hash)}, "
f"unique_engines={len(self._shared_engines)})")
def __del__(self):
"""Cleanup all models on deletion"""
# unload_model() acquires self._repo_lock itself, so do not hold it here.
for model_id in list(self._model_to_hash.keys()):
self.unload_model(model_id)
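For reference, a minimal usage sketch of the repository above, showing the asynchronous path (`synchronize=False`) and how pool exhaustion surfaces. It assumes an engine file at `models/yolov8n.trt` with a single input named `images`; both are assumptions carried over from the test scripts, not requirements of the class.

```python
# Hedged usage sketch for TensorRTModelRepository (not part of the commit).
# Assumes models/yolov8n.trt exists and exposes one input named "images".
import torch
from services.model_repository import TensorRTModelRepository

repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
repo.load_model(model_id="cam_front", file_path="models/yolov8n.trt")

frame = torch.rand(1, 3, 640, 640, dtype=torch.float32, device="cuda:0")

try:
    # synchronize=False returns immediately; the work is queued on the
    # borrowed context's CUDA stream, so synchronize before reading outputs.
    outputs = repo.infer("cam_front", {"images": frame},
                         synchronize=False, timeout=2.0)
    torch.cuda.synchronize()          # wait for the queued inference
    print({name: t.shape for name, t in outputs.items()})
except RuntimeError as e:
    # Raised when all pooled contexts stay busy past the timeout.
    print(f"Context pool exhausted, retry or increase the timeout: {e}")
finally:
    repo.unload_model("cam_front")
```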

481
services/stream_decoder.py Normal file
View file

@ -0,0 +1,481 @@
import threading
from typing import Optional
from collections import deque
from enum import Enum
import torch
import PyNvVideoCodec as nvc
import av
import numpy as np
from cuda.bindings import driver as cuda_driver
from .jpeg_encoder import encode_frame_to_jpeg
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
"""
Convert NV12 format to RGB on GPU using PyTorch operations.
NV12 format:
- Y plane: height x width (luminance)
- UV plane: (height/2) x width (interleaved U and V, subsampled by 2)
Total tensor size: (height * 3/2) x width
Args:
nv12_tensor: Input tensor in NV12 format, shape (H*3/2, W)
height: Original frame height
width: Original frame width
Returns:
RGB tensor, shape (3, H, W) in range [0, 255]
"""
device = nv12_tensor.device
# Split Y and UV planes
y_plane = nv12_tensor[:height, :].float() # (H, W)
uv_plane = nv12_tensor[height:, :].float() # (H/2, W)
# Reshape UV plane to separate U and V channels
# UV is interleaved: U0V0U1V1... we need to deinterleave
uv_plane = uv_plane.reshape(height // 2, width // 2, 2) # (H/2, W/2, 2)
u_plane = uv_plane[:, :, 0] # (H/2, W/2)
v_plane = uv_plane[:, :, 1] # (H/2, W/2)
# Upsample U and V to full resolution using bilinear interpolation
u_upsampled = torch.nn.functional.interpolate(
u_plane.unsqueeze(0).unsqueeze(0), # (1, 1, H/2, W/2)
size=(height, width),
mode='bilinear',
align_corners=False
).squeeze(0).squeeze(0) # (H, W)
v_upsampled = torch.nn.functional.interpolate(
v_plane.unsqueeze(0).unsqueeze(0), # (1, 1, H/2, W/2)
size=(height, width),
mode='bilinear',
align_corners=False
).squeeze(0).squeeze(0) # (H, W)
# YUV to RGB conversion using BT.601 standard
# R = Y + 1.402 * (V - 128)
# G = Y - 0.344136 * (U - 128) - 0.714136 * (V - 128)
# B = Y + 1.772 * (U - 128)
y = y_plane
u = u_upsampled - 128.0
v = v_upsampled - 128.0
r = y + 1.402 * v
g = y - 0.344136 * u - 0.714136 * v
b = y + 1.772 * u
# Clamp to [0, 255] and convert to uint8
r = torch.clamp(r, 0, 255).to(torch.uint8)
g = torch.clamp(g, 0, 255).to(torch.uint8)
b = torch.clamp(b, 0, 255).to(torch.uint8)
# Stack to (3, H, W)
rgb = torch.stack([r, g, b], dim=0)
return rgb
class ConnectionStatus(Enum):
DISCONNECTED = "disconnected"
CONNECTING = "connecting"
CONNECTED = "connected"
ERROR = "error"
RECONNECTING = "reconnecting"
class StreamDecoderFactory:
"""
Factory for creating StreamDecoder instances with shared CUDA context.
This minimizes VRAM overhead by sharing the CUDA context across all decoders.
"""
_instance = None
_lock = threading.Lock()
def __new__(cls, gpu_id: int = 0):
if cls._instance is None:
with cls._lock:
if cls._instance is None:
cls._instance = super(StreamDecoderFactory, cls).__new__(cls)
cls._instance._initialized = False
return cls._instance
def __init__(self, gpu_id: int = 0):
if self._initialized:
return
self.gpu_id = gpu_id
# Initialize CUDA and get device
err, = cuda_driver.cuInit(0)
if err != cuda_driver.CUresult.CUDA_SUCCESS:
raise RuntimeError(f"Failed to initialize CUDA: {err}")
# Get CUDA device
err, self.cuda_device = cuda_driver.cuDeviceGet(gpu_id)
if err != cuda_driver.CUresult.CUDA_SUCCESS:
raise RuntimeError(f"Failed to get CUDA device {gpu_id}: {err}")
# Retain primary context (shared across all decoders)
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
if err != cuda_driver.CUresult.CUDA_SUCCESS:
raise RuntimeError(f"Failed to retain CUDA primary context: {err}")
self._initialized = True
print(f"StreamDecoderFactory initialized with shared CUDA context on GPU {gpu_id}")
def create_decoder(self, rtsp_url: str, buffer_size: int = 30,
codec: str = "h264") -> 'StreamDecoder':
"""
Create a new StreamDecoder instance with shared CUDA context.
Args:
rtsp_url: RTSP stream URL
buffer_size: Number of frames to buffer in VRAM
codec: Video codec (h264, hevc, etc.)
Returns:
StreamDecoder instance
"""
return StreamDecoder(
rtsp_url=rtsp_url,
cuda_context=self.cuda_context,
gpu_id=self.gpu_id,
buffer_size=buffer_size,
codec=codec
)
def __del__(self):
"""Cleanup shared CUDA context on factory destruction"""
if hasattr(self, 'cuda_device') and hasattr(self, 'gpu_id'):
cuda_driver.cuDevicePrimaryCtxRelease(self.cuda_device)
class StreamDecoder:
"""
Decodes RTSP stream using NVDEC and maintains a ring buffer of frames in GPU VRAM.
Thread-safe for concurrent read/write operations.
"""
def __init__(self, rtsp_url: str, cuda_context, gpu_id: int,
buffer_size: int = 30, codec: str = "h264"):
"""
Initialize StreamDecoder.
Args:
rtsp_url: RTSP stream URL
cuda_context: Shared CUDA context handle
gpu_id: GPU device ID
buffer_size: Number of frames to keep in ring buffer
codec: Video codec type
"""
self.rtsp_url = rtsp_url
self.cuda_context = cuda_context
self.gpu_id = gpu_id
self.buffer_size = buffer_size
self.codec = codec
# Connection status
self.status = ConnectionStatus.DISCONNECTED
self._status_lock = threading.Lock()
# Frame buffer (ring buffer) - stores CUDA device pointers
self.frame_buffer = deque(maxlen=buffer_size)
self._buffer_lock = threading.RLock()
# Decoder and container instances
self.decoder = None
self.container = None
# Decode thread
self._decode_thread: Optional[threading.Thread] = None
self._stop_flag = threading.Event()
# Frame metadata
self.frame_width: Optional[int] = None
self.frame_height: Optional[int] = None
self.frame_count: int = 0
def start(self):
"""Start the RTSP stream decoding in background thread"""
if self._decode_thread is not None and self._decode_thread.is_alive():
print(f"Decoder already running for {self.rtsp_url}")
return
self._stop_flag.clear()
self._decode_thread = threading.Thread(target=self._decode_loop, daemon=True)
self._decode_thread.start()
print(f"Started decoder thread for {self.rtsp_url}")
def stop(self):
"""Stop the decoding thread and cleanup resources"""
self._stop_flag.set()
if self._decode_thread is not None:
self._decode_thread.join(timeout=5.0)
self._cleanup()
print(f"Stopped decoder for {self.rtsp_url}")
def _set_status(self, status: ConnectionStatus):
"""Thread-safe status update"""
with self._status_lock:
self.status = status
def get_status(self) -> ConnectionStatus:
"""Get current connection status"""
with self._status_lock:
return self.status
def _init_rtsp_connection(self) -> bool:
"""Initialize RTSP connection using PyAV + PyNvVideoCodec"""
try:
self._set_status(ConnectionStatus.CONNECTING)
# Open RTSP stream with PyAV
options = {
'rtsp_transport': 'tcp',
'max_delay': '500000', # 500ms
'rtsp_flags': 'prefer_tcp',
'timeout': '5000000', # 5 seconds
}
self.container = av.open(self.rtsp_url, options=options)
# Get video stream
video_stream = self.container.streams.video[0]
self.frame_width = video_stream.width
self.frame_height = video_stream.height
print(f"RTSP connected: {self.frame_width}x{self.frame_height}")
# Map codec name to PyNvVideoCodec codec enum
codec_map = {
'h264': nvc.cudaVideoCodec.H264,
'hevc': nvc.cudaVideoCodec.HEVC,
'h265': nvc.cudaVideoCodec.HEVC,
}
codec_id = codec_map.get(self.codec.lower(), nvc.cudaVideoCodec.H264)
# Initialize NVDEC decoder with shared CUDA context
self.decoder = nvc.CreateDecoder(
gpuid=self.gpu_id,
codec=codec_id,
cudacontext=self.cuda_context,
usedevicememory=True
)
self._set_status(ConnectionStatus.CONNECTED)
return True
except Exception as e:
print(f"Failed to connect to RTSP stream {self.rtsp_url}: {e}")
self._set_status(ConnectionStatus.ERROR)
return False
def _decode_loop(self):
"""Main decode loop running in background thread"""
retry_count = 0
max_retries = 5
while not self._stop_flag.is_set():
# Initialize connection
if not self._init_rtsp_connection():
retry_count += 1
if retry_count >= max_retries:
print(f"Max retries reached for {self.rtsp_url}")
self._set_status(ConnectionStatus.ERROR)
break
self._set_status(ConnectionStatus.RECONNECTING)
self._stop_flag.wait(timeout=2.0)
continue
retry_count = 0 # Reset on successful connection
try:
# Decode loop - iterate through packets from PyAV
for packet in self.container.demux(video=0):
if self._stop_flag.is_set():
break
if packet.dts is None:
continue
# Convert packet to numpy array
packet_data = np.frombuffer(bytes(packet), dtype=np.uint8)
# Create PacketData and pass numpy array pointer
pkt = nvc.PacketData()
pkt.bsl_data = packet_data.ctypes.data
pkt.bsl = len(packet_data)
# Decode using NVDEC
decoded_frames = self.decoder.Decode(pkt)
if not decoded_frames:
continue
# Add frames to ring buffer (thread-safe)
with self._buffer_lock:
for frame in decoded_frames:
self.frame_buffer.append(frame)
self.frame_count += 1
except Exception as e:
print(f"Error in decode loop for {self.rtsp_url}: {e}")
self._set_status(ConnectionStatus.RECONNECTING)
self._cleanup()
self._stop_flag.wait(timeout=2.0)
def _cleanup(self):
"""Cleanup resources"""
if self.container:
try:
self.container.close()
except Exception:
pass
self.container = None
self.decoder = None
with self._buffer_lock:
self.frame_buffer.clear()
def get_frame(self, index: int = -1, rgb: bool = True) -> Optional[torch.Tensor]:
"""
Get a frame from the buffer as a CUDA tensor (in VRAM).
Args:
index: Frame index in buffer (-1 for latest, -2 for second latest, etc.)
rgb: If True, convert NV12 to RGB. If False, return raw NV12 format.
Returns:
torch.Tensor in CUDA memory (device tensor) or None if buffer empty
- If rgb=True: Shape (3, H, W) in RGB format, dtype uint8
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
"""
with self._buffer_lock:
if len(self.frame_buffer) == 0:
return None
try:
decoded_frame = self.frame_buffer[index]
# Convert DecodedFrame to PyTorch tensor using DLPack (zero-copy)
# This keeps the data in GPU memory
nv12_tensor = torch.from_dlpack(decoded_frame)
if not rgb:
# Return raw NV12 format
return nv12_tensor
# Convert NV12 to RGB on GPU
if self.frame_height is None or self.frame_width is None:
print("Frame dimensions not available")
return None
rgb_tensor = nv12_to_rgb_gpu(nv12_tensor, self.frame_height, self.frame_width)
return rgb_tensor
except Exception as e:
print(f"Error getting frame: {e}")
return None
def get_latest_frame(self, rgb: bool = True) -> Optional[torch.Tensor]:
"""
Get the most recent decoded frame as CUDA tensor.
Args:
rgb: If True, convert to RGB. If False, return raw NV12.
Returns:
torch.Tensor on GPU in RGB (3, H, W) or NV12 (H*3/2, W) format
"""
return self.get_frame(-1, rgb=rgb)
def get_frame_cpu(self, index: int = -1, rgb: bool = True) -> Optional[np.ndarray]:
"""
Get a frame from the buffer and copy it to CPU memory as numpy array.
Args:
index: Frame index in buffer (-1 for latest, -2 for second latest, etc.)
rgb: If True, convert NV12 to RGB. If False, return raw NV12 format.
Returns:
numpy.ndarray in CPU memory or None if buffer empty
- If rgb=True: Shape (H, W, 3) in RGB format, dtype uint8 (HWC format for easy display)
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
"""
# Get frame on GPU
gpu_frame = self.get_frame(index=index, rgb=rgb)
if gpu_frame is None:
return None
# Transfer from GPU to CPU
cpu_tensor = gpu_frame.cpu()
# Convert to numpy array
if rgb:
# Convert from (3, H, W) to (H, W, 3) for standard image format
cpu_array = cpu_tensor.permute(1, 2, 0).numpy()
else:
# Keep NV12 format as-is
cpu_array = cpu_tensor.numpy()
return cpu_array
def get_latest_frame_cpu(self, rgb: bool = True) -> Optional[np.ndarray]:
"""
Get the most recent decoded frame as CPU numpy array.
Args:
rgb: If True, convert to RGB. If False, return raw NV12.
Returns:
numpy.ndarray in CPU memory
- If rgb=True: Shape (H, W, 3) in RGB format, dtype uint8
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
"""
return self.get_frame_cpu(-1, rgb=rgb)
def get_buffer_size(self) -> int:
"""Get current number of frames in buffer"""
with self._buffer_lock:
return len(self.frame_buffer)
def is_connected(self) -> bool:
"""Check if stream is actively connected"""
return self.get_status() == ConnectionStatus.CONNECTED
def get_frame_as_jpeg(self, index: int = -1, quality: int = 95) -> Optional[bytes]:
"""
Get a frame from the buffer and encode to JPEG.
This method:
1. Gets RGB frame from buffer (stays on GPU)
2. Encodes to JPEG using nvJPEG (GPU operation via shared encoder)
3. Transfers JPEG bytes to CPU
4. Returns bytes for saving to disk
Args:
index: Frame index in buffer (-1 for latest)
quality: JPEG quality (0-100, default 95)
Returns:
JPEG encoded bytes or None if frame unavailable
"""
# Get RGB frame (on GPU)
rgb_frame = self.get_frame(index=index, rgb=True)
if rgb_frame is None:
    return None
# Use the shared JPEG encoder from jpeg_encoder module
return encode_frame_to_jpeg(rgb_frame, quality=quality)
def __repr__(self):
return (f"StreamDecoder(url={self.rtsp_url}, status={self.status.value}, "
f"buffer={self.get_buffer_size()}/{self.buffer_size}, "
f"frames_decoded={self.frame_count})")
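To make the NV12 layout above concrete, here is a small, self-contained sketch (assuming only PyTorch and a CUDA device, no RTSP stream) that builds a synthetic mid-grey NV12 surface and runs it through `nv12_to_rgb_gpu`. With Y = 128 and neutral chroma (U = V = 128), the BT.601 equations reduce to R = G = B = 128, which is easy to eyeball.

```python
# Sketch only: exercises nv12_to_rgb_gpu with a synthetic frame.
import torch
from services.stream_decoder import nv12_to_rgb_gpu

H, W = 720, 1280
# NV12 layout: full-resolution Y plane stacked on top of an interleaved,
# half-resolution UV plane -> total shape (H * 3 // 2, W).
nv12 = torch.full((H * 3 // 2, W), 128, dtype=torch.uint8, device="cuda:0")

rgb = nv12_to_rgb_gpu(nv12, H, W)
assert rgb.shape == (3, H, W) and rgb.dtype == torch.uint8
# Y=128 with U=V=128 maps to mid-grey under BT.601.
print(rgb.float().mean().item())   # ~128.0
```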

174
test_jpeg_encode.py Executable file
View file

@ -0,0 +1,174 @@
#!/usr/bin/env python3
"""
Test script for JPEG encoding with nvImageCodec
Tests GPU-accelerated JPEG encoding from RTSP stream frames
"""
import argparse
import sys
import time
import os
from pathlib import Path
from dotenv import load_dotenv
from services import StreamDecoderFactory
# Load environment variables from .env file
load_dotenv()
def main():
parser = argparse.ArgumentParser(description='Test JPEG encoding from RTSP stream')
parser.add_argument(
'--rtsp-url',
type=str,
default=None,
help='RTSP stream URL (defaults to CAMERA_URL_1 from .env)'
)
parser.add_argument(
'--output-dir',
type=str,
default='./snapshots',
help='Output directory for JPEG files'
)
parser.add_argument(
'--num-frames',
type=int,
default=10,
help='Number of frames to capture'
)
parser.add_argument(
'--interval',
type=float,
default=1.0,
help='Interval between captures in seconds'
)
parser.add_argument(
'--quality',
type=int,
default=95,
help='JPEG quality (0-100)'
)
parser.add_argument(
'--gpu-id',
type=int,
default=0,
help='GPU device ID'
)
args = parser.parse_args()
# Get RTSP URL from command line or environment
rtsp_url = args.rtsp_url
if not rtsp_url:
rtsp_url = os.getenv('CAMERA_URL_1')
if not rtsp_url:
print("Error: No RTSP URL provided")
print("Please either:")
print(" 1. Use --rtsp-url argument, or")
print(" 2. Add CAMERA_URL_1 to your .env file")
sys.exit(1)
# Create output directory
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
print("=" * 80)
print("RTSP Stream JPEG Encoding Test")
print("=" * 80)
print(f"RTSP URL: {rtsp_url}")
print(f"Output Directory: {output_dir}")
print(f"Number of Frames: {args.num_frames}")
print(f"Capture Interval: {args.interval}s")
print(f"JPEG Quality: {args.quality}")
print(f"GPU ID: {args.gpu_id}")
print("=" * 80)
print()
try:
# Initialize factory and decoder
print("[1/3] Initializing StreamDecoderFactory...")
factory = StreamDecoderFactory(gpu_id=args.gpu_id)
print("✓ Factory initialized\n")
print("[2/3] Creating and starting decoder...")
decoder = factory.create_decoder(
rtsp_url=rtsp_url,
buffer_size=30
)
decoder.start()
print("✓ Decoder started\n")
# Wait for connection
print("[3/3] Waiting for stream to connect...")
max_wait = 10
for i in range(max_wait):
if decoder.is_connected():
print("✓ Stream connected\n")
break
time.sleep(1)
print(f" Waiting... {i+1}/{max_wait}s")
else:
print("✗ Failed to connect to stream")
sys.exit(1)
# Capture frames
print(f"Capturing {args.num_frames} frames...")
print("-" * 80)
captured = 0
for i in range(args.num_frames):
# Get frame as JPEG
start_time = time.time()
jpeg_bytes = decoder.get_frame_as_jpeg(quality=args.quality)
encode_time = (time.time() - start_time) * 1000 # ms
if jpeg_bytes:
# Save to file
filename = output_dir / f"frame_{i:04d}.jpg"
with open(filename, 'wb') as f:
f.write(jpeg_bytes)
size_kb = len(jpeg_bytes) / 1024
print(f"[{i+1}/{args.num_frames}] Saved {filename.name} "
f"({size_kb:.1f} KB, encoded in {encode_time:.2f}ms)")
captured += 1
else:
print(f"[{i+1}/{args.num_frames}] Failed to get frame")
# Wait before next capture (except for last frame)
if i < args.num_frames - 1:
time.sleep(args.interval)
print("-" * 80)
# Summary
print("\n" + "=" * 80)
print("Capture Complete")
print("=" * 80)
print(f"Successfully captured: {captured}/{args.num_frames} frames")
print(f"Output directory: {output_dir.absolute()}")
print("=" * 80)
except KeyboardInterrupt:
print("\n\n✗ Interrupted by user")
sys.exit(1)
except Exception as e:
print(f"\n\n✗ Error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
finally:
# Cleanup
if 'decoder' in locals():
print("\nCleaning up...")
decoder.stop()
print("✓ Decoder stopped")
print("\n✓ Test completed successfully")
sys.exit(0)
if __name__ == '__main__':
main()
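As a quick sanity check on the saved output, the JPEG bytes returned by `get_frame_as_jpeg()` can be validated on the CPU before writing. The sketch below assumes Pillow is installed (it is not a dependency declared anywhere in this commit), and the commented usage line is a hypothetical variant of the capture loop above.

```python
# Hedged sketch: validate JPEG bytes before saving them (assumes Pillow).
import io
from PIL import Image

def looks_like_valid_jpeg(jpeg_bytes: bytes) -> bool:
    """Check the JPEG SOI marker and that Pillow can parse the image header."""
    if not jpeg_bytes or not jpeg_bytes.startswith(b"\xff\xd8"):
        return False
    try:
        with Image.open(io.BytesIO(jpeg_bytes)) as img:
            width, height = img.size
            return width > 0 and height > 0
    except Exception:
        return False

# Hypothetical usage inside the capture loop:
#   if jpeg_bytes and looks_like_valid_jpeg(jpeg_bytes):
#       filename.write_bytes(jpeg_bytes)
```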

310
test_model_inference.py Normal file
View file

@ -0,0 +1,310 @@
"""
Test script for TensorRT Model Repository with multi-camera inference.
This demonstrates:
1. Loading the same model for multiple cameras (deduplication)
2. Context pool load balancing
3. GPU-to-GPU inference from RTSP streams
4. Memory efficiency with shared engines
"""
import time
import torch
from services.model_repository import TensorRTModelRepository
from services.stream_decoder import StreamDecoderFactory
def test_multi_camera_inference():
"""
Simulate multi-camera inference scenario.
Example: 100 cameras, all using the same YOLOv8 model
- Without pooling: 100 engines + 100 contexts in VRAM
- With pooling: 1 engine + 4 contexts in VRAM (huge savings!)
"""
# Initialize model repository with context pooling
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
# Camera configurations (simulated)
camera_configs = [
{"id": "camera_1", "rtsp_url": "rtsp://camera1.local/stream"},
{"id": "camera_2", "rtsp_url": "rtsp://camera2.local/stream"},
{"id": "camera_3", "rtsp_url": "rtsp://camera3.local/stream"},
# ... imagine 100 cameras here
]
# Load the same model for all cameras
model_file = "models/yolov8n.trt" # Same file for all cameras
print("=" * 80)
print("LOADING MODELS FOR MULTIPLE CAMERAS")
print("=" * 80)
for config in camera_configs:
try:
# Each camera gets its own model_id, but shares the same engine!
metadata = repo.load_model(
model_id=config["id"],
file_path=model_file,
num_contexts=4 # 4 contexts shared across all cameras
)
print(f"\n✓ Loaded model for {config['id']}")
except Exception as e:
print(f"\n✗ Failed to load model for {config['id']}: {e}")
# Show repository stats
print("\n" + "=" * 80)
print("REPOSITORY STATISTICS")
print("=" * 80)
stats = repo.get_stats()
print(f"Total model IDs: {stats['total_model_ids']}")
print(f"Unique engines in VRAM: {stats['unique_engines']}")
print(f"Total contexts: {stats['total_contexts']}")
print(f"Memory efficiency: {stats['memory_efficiency']}")
# Get detailed info for one camera
print("\n" + "=" * 80)
print("DETAILED MODEL INFO (camera_1)")
print("=" * 80)
info = repo.get_model_info("camera_1")
if info:
print(f"Model ID: {info['model_id']}")
print(f"File: {info['file_path']}")
print(f"File hash: {info['file_hash']}")
print(f"Engine references: {info['engine_references']}")
print(f"Context pool size: {info['context_pool_size']}")
print(f"Shared with: {info['shared_with_model_ids']}")
print(f"\nInputs:")
for name, spec in info['inputs'].items():
print(f" {name}: {spec['shape']} ({spec['dtype']})")
print(f"\nOutputs:")
for name, spec in info['outputs'].items():
print(f" {name}: {spec['shape']} ({spec['dtype']})")
# Simulate inference from multiple cameras
print("\n" + "=" * 80)
print("RUNNING INFERENCE (GPU-to-GPU)")
print("=" * 80)
# Create dummy input tensors (simulating frames from cameras)
# In real scenario, these come from StreamDecoder.get_frame()
batch_size = 1
channels = 3
height = 640
width = 640
for config in camera_configs:
try:
# Simulate getting frame from camera (already on GPU)
input_tensor = torch.rand(
batch_size, channels, height, width,
dtype=torch.float32,
device='cuda:0'
)
# Run inference (stays in GPU)
start = time.time()
outputs = repo.infer(
model_id=config["id"],
inputs={"images": input_tensor}, # Adjust input name based on your model
synchronize=True,
timeout=5.0
)
elapsed = (time.time() - start) * 1000 # Convert to ms
print(f"\n{config['id']}: Inference completed in {elapsed:.2f}ms")
for name, tensor in outputs.items():
print(f" Output '{name}': {tensor.shape} on {tensor.device}")
except Exception as e:
print(f"\n{config['id']}: Inference failed: {e}")
# Cleanup
print("\n" + "=" * 80)
print("CLEANUP")
print("=" * 80)
for config in camera_configs:
repo.unload_model(config["id"])
print("\nAll models unloaded.")
def test_rtsp_stream_with_inference():
"""
Real-world example: Decode RTSP stream and run inference.
Everything stays in GPU memory (zero CPU transfers).
"""
print("=" * 80)
print("RTSP STREAM + TENSORRT INFERENCE (GPU-to-GPU)")
print("=" * 80)
# Initialize components
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
# Setup camera stream
rtsp_url = "rtsp://your-camera-ip/stream"
decoder = decoder_factory.create_decoder(rtsp_url, buffer_size=30)
decoder.start()
# Load inference model
try:
model_repo.load_model(
model_id="camera_main",
file_path="models/yolov8n.trt"
)
except FileNotFoundError:
print("\n⚠ Model file not found. Please export your model to TensorRT:")
print(" Example: yolo export model=yolov8n.pt format=engine device=0")
return
print("\nWaiting for stream to buffer frames...")
time.sleep(3)
# Process frames
for i in range(10):
# Get frame from decoder (already on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True) # Returns torch.Tensor on CUDA
if frame_gpu is None:
print(f"Frame {i}: No frame available")
continue
# Preprocess if needed (stays on GPU)
# For YOLOv8: normalize, resize, etc.
# Example preprocessing (adjust for your model):
frame_gpu = frame_gpu.float() / 255.0 # Normalize to [0, 1]
frame_gpu = frame_gpu.unsqueeze(0) # Add batch dimension: (1, 3, H, W)
# Run inference (GPU-to-GPU, zero copy)
try:
outputs = model_repo.infer(
model_id="camera_main",
inputs={"images": frame_gpu},
synchronize=True
)
print(f"\nFrame {i}: Inference successful")
for name, tensor in outputs.items():
print(f" {name}: {tensor.shape} on {tensor.device}")
# Post-process results (can stay on GPU or move to CPU as needed)
# Example: NMS, bounding box extraction, etc.
except Exception as e:
print(f"\nFrame {i}: Inference failed: {e}")
time.sleep(0.1) # Simulate processing interval
# Cleanup
decoder.stop()
model_repo.unload_model("camera_main")
print("\n✓ Test completed successfully")
def test_concurrent_inference():
"""
Test concurrent inference from multiple threads.
Demonstrates context pool load balancing.
"""
import threading
print("=" * 80)
print("CONCURRENT INFERENCE TEST (Context Pool Load Balancing)")
print("=" * 80)
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
# Load model
try:
repo.load_model("shared_model", "models/yolov8n.trt", num_contexts=4)
except Exception as e:
print(f"Failed to load model: {e}")
return
def worker(worker_id: int, num_inferences: int):
"""Worker thread performing inference"""
for i in range(num_inferences):
try:
# Create dummy input
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0', dtype=torch.float32)
# Acquire context from pool, run inference, release context
outputs = repo.infer(
model_id="shared_model",
inputs={"images": input_tensor},
timeout=10.0
)
print(f"Worker {worker_id}, Inference {i}: SUCCESS")
except Exception as e:
print(f"Worker {worker_id}, Inference {i}: FAILED - {e}")
time.sleep(0.01) # Small delay
# Launch multiple worker threads (more workers than contexts!)
threads = []
num_workers = 10 # 10 workers sharing 4 contexts
inferences_per_worker = 5
print(f"\nLaunching {num_workers} workers (only 4 contexts available)")
print("Contexts will be borrowed/returned automatically\n")
start_time = time.time()
for worker_id in range(num_workers):
t = threading.Thread(target=worker, args=(worker_id, inferences_per_worker))
threads.append(t)
t.start()
# Wait for all workers
for t in threads:
t.join()
elapsed = time.time() - start_time
total_inferences = num_workers * inferences_per_worker
print(f"\n✓ Completed {total_inferences} inferences in {elapsed:.2f}s")
print(f" Throughput: {total_inferences / elapsed:.2f} inferences/sec")
print(f" With only 4 contexts for {num_workers} workers!")
repo.unload_model("shared_model")
if __name__ == "__main__":
print("\n" + "=" * 80)
print("TENSORRT MODEL REPOSITORY - TEST SUITE")
print("=" * 80)
# Test 1: Multi-camera model loading
print("\n\nTEST 1: Multi-Camera Model Loading with Deduplication")
print("-" * 80)
try:
test_multi_camera_inference()
except Exception as e:
print(f"Test 1 failed: {e}")
# Test 2: RTSP stream + inference (commented out by default)
# Uncomment if you have a real RTSP stream
# print("\n\nTEST 2: RTSP Stream + Inference")
# print("-" * 80)
# try:
# test_rtsp_stream_with_inference()
# except Exception as e:
# print(f"Test 2 failed: {e}")
# Test 3: Concurrent inference
print("\n\nTEST 3: Concurrent Inference with Context Pooling")
print("-" * 80)
try:
test_concurrent_inference()
except Exception as e:
print(f"Test 3 failed: {e}")
print("\n" + "=" * 80)
print("ALL TESTS COMPLETED")
print("=" * 80)
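The preprocessing step in `test_rtsp_stream_with_inference` is only hinted at ("normalize, resize, etc."). Below is one possible GPU-side version for a 640x640, FP32 detector input; the target size, the [0, 1] scaling, the plain resize without letterboxing, and the input name `images` are assumptions carried over from the dummy-tensor tests, not requirements of the repository.

```python
# Hedged sketch: GPU-only preprocessing from decoder output to model input.
import torch
import torch.nn.functional as F

def preprocess_for_detector(frame_chw_uint8: torch.Tensor,
                            size: int = 640) -> torch.Tensor:
    """(3, H, W) uint8 CUDA tensor -> (1, 3, size, size) float32 CUDA tensor."""
    x = frame_chw_uint8.unsqueeze(0).float() / 255.0        # (1, 3, H, W) in [0, 1]
    x = F.interpolate(x, size=(size, size), mode="bilinear",
                      align_corners=False)                   # simple resize, no letterbox
    return x.contiguous()

# frame_gpu = decoder.get_latest_frame(rgb=True)             # (3, H, W) uint8 on CUDA
# outputs = model_repo.infer("camera_main",
#                            {"images": preprocess_for_detector(frame_gpu)})
```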

255
test_multi_stream.py Executable file
View file

@ -0,0 +1,255 @@
#!/usr/bin/env python3
"""
Multi-stream test script to verify CUDA context sharing efficiency.
Tests multiple RTSP streams simultaneously and monitors VRAM usage.
"""
import argparse
import time
import sys
import subprocess
import os
from pathlib import Path
from dotenv import load_dotenv
from services import StreamDecoderFactory, ConnectionStatus
# Load environment variables from .env file
load_dotenv()
def get_gpu_memory_usage(gpu_id: int = 0) -> int:
"""Get current GPU memory usage in MB using nvidia-smi"""
try:
result = subprocess.run(
['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits', f'--id={gpu_id}'],
capture_output=True,
text=True,
check=True
)
return int(result.stdout.strip())
except Exception as e:
print(f"Warning: Could not get GPU memory usage: {e}")
return 0
def main():
parser = argparse.ArgumentParser(description='Test multi-stream decoding with context sharing')
parser.add_argument(
'--gpu-id',
type=int,
default=0,
help='GPU device ID'
)
parser.add_argument(
'--duration',
type=int,
default=20,
help='Test duration in seconds'
)
parser.add_argument(
'--capture-snapshots',
action='store_true',
help='Capture JPEG snapshots during test'
)
parser.add_argument(
'--output-dir',
type=str,
default='./multi_stream_snapshots',
help='Output directory for snapshots'
)
args = parser.parse_args()
# Load camera URLs from environment
camera_urls = []
i = 1
while True:
url = os.getenv(f'CAMERA_URL_{i}')
if url:
camera_urls.append(url)
i += 1
else:
break
if not camera_urls:
print("Error: No camera URLs found in .env file")
print("Please add CAMERA_URL_1, CAMERA_URL_2, etc. to your .env file")
sys.exit(1)
# Create output directory if capturing snapshots
if args.capture_snapshots:
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
print("=" * 80)
print("Multi-Stream RTSP Decoder Test - Context Sharing Verification")
print("=" * 80)
print(f"Number of Streams: {len(camera_urls)}")
print(f"GPU ID: {args.gpu_id}")
print(f"Test Duration: {args.duration} seconds")
print(f"Capture Snapshots: {args.capture_snapshots}")
print("=" * 80)
print()
try:
# Get baseline GPU memory
print("[Baseline] Measuring initial GPU memory usage...")
baseline_memory = get_gpu_memory_usage(args.gpu_id)
print(f"✓ Baseline VRAM: {baseline_memory} MB\n")
# Initialize factory (shared CUDA context)
print("[1/4] Initializing StreamDecoderFactory with shared CUDA context...")
factory = StreamDecoderFactory(gpu_id=args.gpu_id)
factory_memory = get_gpu_memory_usage(args.gpu_id)
factory_overhead = factory_memory - baseline_memory
print(f"✓ Factory initialized")
print(f" VRAM after factory: {factory_memory} MB (+{factory_overhead} MB)\n")
# Create all decoders
print(f"[2/4] Creating {len(camera_urls)} StreamDecoder instances...")
decoders = []
for i, url in enumerate(camera_urls):
decoder = factory.create_decoder(
rtsp_url=url,
buffer_size=30,
codec='h264'
)
decoders.append(decoder)
print(f"  ✓ Decoder {i+1} created for camera {url.split('@')[-1].split('/')[0]}")
decoders_memory = get_gpu_memory_usage(args.gpu_id)
decoders_overhead = decoders_memory - factory_memory
print(f"\n VRAM after creating {len(decoders)} decoders: {decoders_memory} MB (+{decoders_overhead} MB)")
print(f" Average per decoder: {decoders_overhead / len(decoders):.1f} MB\n")
# Start all decoders
print(f"[3/4] Starting all {len(decoders)} decoders...")
for i, decoder in enumerate(decoders):
decoder.start()
print(f" ✓ Decoder {i+1} started")
started_memory = get_gpu_memory_usage(args.gpu_id)
started_overhead = started_memory - decoders_memory
print(f"\n VRAM after starting decoders: {started_memory} MB (+{started_overhead} MB)")
print(f" Average per running decoder: {started_overhead / len(decoders):.1f} MB\n")
# Wait for all streams to connect
print("[4/4] Waiting for all streams to connect...")
max_wait = 15
for wait_time in range(max_wait):
connected = sum(1 for d in decoders if d.is_connected())
print(f" Connected: {connected}/{len(decoders)} streams", end='\r')
if connected == len(decoders):
print(f"\n✓ All {len(decoders)} streams connected!\n")
break
time.sleep(1)
else:
connected = sum(1 for d in decoders if d.is_connected())
print(f"\n⚠ Only {connected}/{len(decoders)} streams connected after {max_wait}s\n")
connected_memory = get_gpu_memory_usage(args.gpu_id)
connected_overhead = connected_memory - started_memory
print(f" VRAM after connection: {connected_memory} MB (+{connected_overhead} MB)\n")
# Monitor streams
print(f"Monitoring streams for {args.duration} seconds...")
print("=" * 80)
stream_headers = ' '.join(f"{'Stream ' + str(i + 1):<12}" for i in range(len(decoders)))
print(f"{'Time':<8} {'VRAM':<10} {stream_headers}")
print("-" * 80)
start_time = time.time()
snapshot_interval = args.duration // 3 if args.capture_snapshots else 0
last_snapshot = 0
while time.time() - start_time < args.duration:
elapsed = time.time() - start_time
current_memory = get_gpu_memory_usage(args.gpu_id)
# Get stats for each decoder
stats = []
for decoder in decoders:
status = decoder.get_status().value[:8]
buffer = decoder.get_buffer_size()
frames = decoder.frame_count
stats.append(f"{status:8s} {buffer:2d}/{decoder.buffer_size} {frames:4d}")
stream_stats = ' '.join(f"{s:<12}" for s in stats)
print(f"{elapsed:6.1f}s {current_memory:6d}MB {stream_stats}")
# Capture snapshots
if args.capture_snapshots and snapshot_interval > 0:
if elapsed - last_snapshot >= snapshot_interval:
print("\n → Capturing snapshots from all streams...")
for i, decoder in enumerate(decoders):
jpeg_bytes = decoder.get_frame_as_jpeg(quality=85)
if jpeg_bytes:
filename = output_dir / f"camera_{i+1}_t{int(elapsed)}s.jpg"
with open(filename, 'wb') as f:
f.write(jpeg_bytes)
print(f" Saved {filename.name} ({len(jpeg_bytes)/1024:.1f} KB)")
print()
last_snapshot = elapsed
time.sleep(1)
print("=" * 80)
# Final memory analysis
final_memory = get_gpu_memory_usage(args.gpu_id)
total_overhead = final_memory - baseline_memory
print("\n" + "=" * 80)
print("Memory Usage Analysis")
print("=" * 80)
print(f"Baseline VRAM: {baseline_memory:6d} MB")
print(f"After Factory Init: {factory_memory:6d} MB (+{factory_overhead:4d} MB)")
print(f"After Creating {len(decoders)} Decoders: {decoders_memory:6d} MB (+{decoders_overhead:4d} MB)")
print(f"After Starting Decoders: {started_memory:6d} MB (+{started_overhead:4d} MB)")
print(f"After Connection: {connected_memory:6d} MB (+{connected_overhead:4d} MB)")
print(f"Final (after {args.duration}s): {final_memory:6d} MB (+{total_overhead:4d} MB total)")
print("-" * 80)
print(f"Average VRAM per stream: {total_overhead / len(decoders):6.1f} MB")
print(f"Context sharing efficiency: {'EXCELLENT' if total_overhead < 500 else 'GOOD' if total_overhead < 800 else 'POOR'}")
print("=" * 80)
# Final stats
print("\nFinal Stream Statistics:")
print("-" * 80)
for i, decoder in enumerate(decoders):
status = decoder.get_status().value
buffer = decoder.get_buffer_size()
frames = decoder.frame_count
fps = frames / args.duration if args.duration > 0 else 0
print(f"Stream {i+1}: {status:12s} | Buffer: {buffer:2d}/{decoder.buffer_size} | "
f"Frames: {frames:5d} | Avg FPS: {fps:5.2f}")
print("=" * 80)
except KeyboardInterrupt:
print("\n\n✗ Interrupted by user")
sys.exit(1)
except Exception as e:
print(f"\n\n✗ Error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
finally:
# Cleanup
if 'decoders' in locals():
print("\nCleaning up...")
for i, decoder in enumerate(decoders):
decoder.stop()
print(f" ✓ Decoder {i+1} stopped")
cleanup_memory = get_gpu_memory_usage(args.gpu_id)
print(f"\nVRAM after cleanup: {cleanup_memory} MB")
print("\n✓ Multi-stream test completed successfully")
sys.exit(0)
if __name__ == '__main__':
main()

152
test_stream.py Executable file
View file

@ -0,0 +1,152 @@
#!/usr/bin/env python3
"""
CLI test script for StreamDecoder
Tests RTSP stream decoding with NVDEC hardware acceleration
"""
import argparse
import time
import sys
from services.stream_decoder import StreamDecoderFactory, ConnectionStatus
def main():
parser = argparse.ArgumentParser(description='Test RTSP stream decoder with NVDEC')
parser.add_argument(
'--rtsp-url',
type=str,
required=True,
help='RTSP stream URL (e.g., rtsp://user:pass@host/path)'
)
parser.add_argument(
'--gpu-id',
type=int,
default=0,
help='GPU device ID'
)
parser.add_argument(
'--buffer-size',
type=int,
default=30,
help='Frame buffer size'
)
parser.add_argument(
'--duration',
type=int,
default=30,
help='Test duration in seconds'
)
parser.add_argument(
'--check-interval',
type=float,
default=1.0,
help='Status check interval in seconds'
)
args = parser.parse_args()
print("=" * 80)
print("RTSP Stream Decoder Test")
print("=" * 80)
print(f"RTSP URL: {args.rtsp_url}")
print(f"GPU ID: {args.gpu_id}")
print(f"Buffer Size: {args.buffer_size} frames")
print(f"Test Duration: {args.duration} seconds")
print("=" * 80)
print()
try:
# Create factory with shared CUDA context
print("[1/4] Initializing StreamDecoderFactory...")
factory = StreamDecoderFactory(gpu_id=args.gpu_id)
print("✓ Factory initialized with shared CUDA context\n")
# Create decoder
print("[2/4] Creating StreamDecoder...")
decoder = factory.create_decoder(
rtsp_url=args.rtsp_url,
buffer_size=args.buffer_size,
codec='h264'
)
print(f"✓ Decoder created: {decoder}\n")
# Start decoding
print("[3/4] Starting decoder thread...")
decoder.start()
print("✓ Decoder thread started\n")
# Monitor for specified duration
print(f"[4/4] Monitoring stream for {args.duration} seconds...")
print("-" * 80)
start_time = time.time()
last_frame_count = 0
while time.time() - start_time < args.duration:
time.sleep(args.check_interval)
# Get status
status = decoder.get_status()
buffer_size = decoder.get_buffer_size()
frame_count = decoder.frame_count
fps = (frame_count - last_frame_count) / args.check_interval
last_frame_count = frame_count
# Print status
elapsed = time.time() - start_time
print(f"[{elapsed:6.1f}s] Status: {status.value:12s} | "
f"Buffer: {buffer_size:2d}/{args.buffer_size:2d} | "
f"Frames: {frame_count:5d} | "
f"FPS: {fps:5.1f}")
# Try to get latest frame
if status == ConnectionStatus.CONNECTED:
frame = decoder.get_latest_frame()
if frame is not None:
print(f" Frame shape: {frame.shape}, dtype: {frame.dtype}, "
f"device: {frame.device}")
# Check for errors
if status == ConnectionStatus.ERROR:
print("\n✗ ERROR: Stream connection failed!")
break
print("-" * 80)
# Final statistics
print("\n" + "=" * 80)
print("Test Complete - Final Statistics")
print("=" * 80)
print(f"Total Frames Decoded: {decoder.frame_count}")
print(f"Average FPS: {decoder.frame_count / args.duration:.2f}")
print(f"Final Status: {decoder.get_status().value}")
print(f"Buffer Utilization: {decoder.get_buffer_size()}/{args.buffer_size}")
if decoder.frame_width and decoder.frame_height:
print(f"Frame Resolution: {decoder.frame_width}x{decoder.frame_height}")
print("=" * 80)
except KeyboardInterrupt:
print("\n\n✗ Interrupted by user")
sys.exit(1)
except Exception as e:
print(f"\n\n✗ Error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
finally:
# Cleanup
if 'decoder' in locals():
print("\nCleaning up...")
decoder.stop()
print("✓ Decoder stopped")
print("\n✓ Test completed successfully")
sys.exit(0)
if __name__ == '__main__':
main()

143
test_vram_process.py Normal file
View file

@ -0,0 +1,143 @@
#!/usr/bin/env python3
"""
VRAM scaling test - measures Python process memory usage for 1, 2, 3, and 4 streams.
"""
import os
import time
import subprocess
from dotenv import load_dotenv
from services import StreamDecoderFactory
# Load environment variables from .env file
load_dotenv()
# Load camera URLs from environment
camera_urls = []
i = 1
while True:
url = os.getenv(f'CAMERA_URL_{i}')
if url:
camera_urls.append(url)
i += 1
else:
break
if not camera_urls:
print("Error: No camera URLs found in .env file")
print("Please add CAMERA_URL_1, CAMERA_URL_2, etc. to your .env file")
exit(1)
def get_python_gpu_memory():
"""Get Python process GPU memory usage in MB"""
try:
pid = os.getpid()
result = subprocess.run(
['nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv,noheader,nounits'],
capture_output=True, text=True, check=True
)
for line in result.stdout.strip().split('\n'):
if line:
parts = line.split(',')
if len(parts) >= 2 and int(parts[0].strip()) == pid:
return int(parts[1].strip())
return 0
except Exception:
return 0
def test_n_streams(n, wait_time=15):
"""Test with n streams"""
print(f"\n{'='*80}")
print(f"Testing with {n} stream(s)")
print('='*80)
mem_before = get_python_gpu_memory()
print(f"Python process VRAM before: {mem_before} MB")
# Create factory
factory = StreamDecoderFactory(gpu_id=0)
time.sleep(1)
mem_after_factory = get_python_gpu_memory()
print(f"After factory: {mem_after_factory} MB (+{mem_after_factory - mem_before} MB)")
# Create decoders
decoders = []
for i in range(n):
decoder = factory.create_decoder(camera_urls[i], buffer_size=30)
decoders.append(decoder)
time.sleep(1)
mem_after_create = get_python_gpu_memory()
print(f"After creating {n} decoder(s): {mem_after_create} MB (+{mem_after_create - mem_after_factory} MB)")
# Start decoders
for decoder in decoders:
decoder.start()
time.sleep(2)
mem_after_start = get_python_gpu_memory()
print(f"After starting {n} decoder(s): {mem_after_start} MB (+{mem_after_start - mem_after_create} MB)")
# Wait for connection
print(f"Waiting {wait_time}s for streams to connect and stabilize...")
time.sleep(wait_time)
# Check connection status
connected = sum(1 for d in decoders if d.is_connected())
mem_stable = get_python_gpu_memory()
print(f"Connected: {connected}/{n} streams")
print(f"Python process VRAM (stable): {mem_stable} MB")
# Get frame stats
for i, decoder in enumerate(decoders):
print(f" Stream {i+1}: {decoder.get_status().value:10s} "
f"Buffer: {decoder.get_buffer_size()}/30 "
f"Frames: {decoder.frame_count}")
# Cleanup
for decoder in decoders:
decoder.stop()
time.sleep(2)
mem_after_cleanup = get_python_gpu_memory()
print(f"After cleanup: {mem_after_cleanup} MB")
return mem_stable
if __name__ == '__main__':
print("Python VRAM Scaling Test")
print(f"PID: {os.getpid()}")
baseline = get_python_gpu_memory()
print(f"Baseline Python process VRAM: {baseline} MB\n")
results = {}
for n in [1, 2, 3, 4]:
mem = test_n_streams(n, wait_time=15)
results[n] = mem
print(f"\n{n} stream(s): {mem} MB (process total)")
# Give time between tests
if n < 4:
print("\nWaiting 5s before next test...")
time.sleep(5)
# Summary
print("\n" + "="*80)
print("Python Process VRAM Scaling Summary")
print("="*80)
print(f"Baseline: {baseline:4d} MB")
for n in [1, 2, 3, 4]:
total = results[n]
overhead = total - baseline
per_stream = overhead / n if n > 0 else 0
print(f"{n} stream(s): {total:4d} MB (+{overhead:3d} MB total, {per_stream:5.1f} MB per stream)")
# Calculate marginal cost
print("\nMarginal cost per additional stream:")
for n in [2, 3, 4]:
marginal = results[n] - results[n-1]
print(f" Stream {n}: +{marginal} MB")
print("="*80)

85
verify_tensorrt_model.py Normal file
View file

@ -0,0 +1,85 @@
#!/usr/bin/env python3
"""
Quick verification script for TensorRT model
"""
import torch
from services.model_repository import TensorRTModelRepository
def verify_model():
print("=" * 80)
print("TensorRT Model Verification")
print("=" * 80)
# Initialize repository
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=2)
# Load the model
print("\nLoading YOLOv8n TensorRT engine...")
try:
metadata = repo.load_model(
model_id="yolov8n_test",
file_path="models/yolov8n.trt",
num_contexts=2
)
print("✓ Model loaded successfully!")
except Exception as e:
print(f"✗ Failed to load model: {e}")
return
# Get model info
print("\n" + "=" * 80)
print("Model Information")
print("=" * 80)
info = repo.get_model_info("yolov8n_test")
if info:
print(f"Model ID: {info['model_id']}")
print(f"File: {info['file_path']}")
print(f"File hash: {info['file_hash']}")
print(f"\nInputs:")
for name, spec in info['inputs'].items():
print(f" {name}: {spec['shape']} ({spec['dtype']})")
print(f"\nOutputs:")
for name, spec in info['outputs'].items():
print(f" {name}: {spec['shape']} ({spec['dtype']})")
# Run test inference
print("\n" + "=" * 80)
print("Running Test Inference")
print("=" * 80)
try:
# Create dummy input (simulating a 640x640 image)
input_tensor = torch.rand(1, 3, 640, 640, dtype=torch.float32, device='cuda:0')
print(f"Input tensor: {input_tensor.shape} on {input_tensor.device}")
# Run inference
outputs = repo.infer(
model_id="yolov8n_test",
inputs={"images": input_tensor},
synchronize=True
)
print("\n✓ Inference successful!")
print("\nOutputs:")
for name, tensor in outputs.items():
print(f" {name}: {tensor.shape} on {tensor.device} ({tensor.dtype})")
except Exception as e:
print(f"\n✗ Inference failed: {e}")
import traceback
traceback.print_exc()
# Cleanup
print("\n" + "=" * 80)
print("Cleanup")
print("=" * 80)
repo.unload_model("yolov8n_test")
print("✓ Model unloaded")
print("\n" + "=" * 80)
print("Verification Complete!")
print("=" * 80)
if __name__ == "__main__":
verify_model()
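The verification script presumes that `models/yolov8n.trt` already exists. One way to produce it (an assumption, not part of this commit) is Ultralytics' export API, which wraps TensorRT engine building; the sketch below mirrors the `yolo export model=yolov8n.pt format=engine device=0` command mentioned earlier and then moves the result to the path the scripts expect.

```python
# Hedged sketch: one way to produce models/yolov8n.trt (assumes the
# `ultralytics` package and a CUDA-capable GPU; not part of this commit).
from pathlib import Path
from ultralytics import YOLO

engine_path = YOLO("yolov8n.pt").export(format="engine", device=0)  # builds a TensorRT engine
Path("models").mkdir(exist_ok=True)
Path(engine_path).rename("models/yolov8n.trt")
```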