feat: inference subsystem and decoder optimizations
Commit 3c83a57e44
19 changed files with 3897 additions and 0 deletions
11
.env.example
Normal file
@@ -0,0 +1,11 @@
# RTSP Camera URLs
# Add your camera URLs here, one per line with CAMERA_URL_N format

CAMERA_URL_1=rtsp://user:pass@host/path
CAMERA_URL_2=rtsp://user:pass@host/path
CAMERA_URL_3=rtsp://user:pass@host/path
CAMERA_URL_4=rtsp://user:pass@host/path

# Add more cameras as needed...
# CAMERA_URL_5=rtsp://user:pass@host/path
# CAMERA_URL_6=rtsp://user:pass@host/path
6
.gitignore
vendored
Normal file
@@ -0,0 +1,6 @@
fastapi
__pycache__/
*.pyc
.env
.claude
models/
13
app.py
Normal file
@@ -0,0 +1,13 @@
from fastapi import FastAPI

app = FastAPI()


@app.get("/")
async def root():
    return {"message": "Hello World"}


@app.get("/health")
async def health_check():
    return {"status": "healthy"}
373
claude.md
Normal file
@@ -0,0 +1,373 @@
# GPU-Accelerated RTSP Stream Processing System
|
||||
|
||||
## Project Overview
|
||||
|
||||
A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through shared CUDA context and keeps all processing on the GPU until final JPEG compression.
|
||||
|
||||
## Key Achievements
|
||||
|
||||
- **Shared CUDA Context**: 70% VRAM reduction (from ~200MB to ~60MB per stream)
|
||||
- **Linear VRAM Scaling**: Perfect scaling at 60 MB per additional stream
|
||||
- **Zero-Copy Pipeline**: All processing stays on GPU until JPEG bytes
|
||||
- **Proven Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM
|
||||
|
||||
## Architecture
|
||||
|
||||
### Pipeline Flow
|
||||
```
|
||||
RTSP Stream → PyAV (CPU)
|
||||
↓
|
||||
NVDEC Decode (GPU) → NV12 Format
|
||||
↓
|
||||
NV12 to RGB (GPU) → PyTorch Ops
|
||||
↓
|
||||
nvJPEG Encode (GPU) → JPEG Bytes
|
||||
↓
|
||||
CPU (JPEG only)
|
||||
```
|
||||
|
||||
### Core Components
|
||||
|
||||
#### StreamDecoderFactory
|
||||
Singleton factory managing shared CUDA context across all decoder instances.
|
||||
|
||||
**Key Methods:**
|
||||
- `get_factory(gpu_id)`: Returns singleton instance
|
||||
- `create_decoder(rtsp_url, buffer_size)`: Creates new decoder with shared context
|
||||
|
||||
**CUDA Context Initialization:**
|
||||
```python
|
||||
err, = cuda_driver.cuInit(0)
|
||||
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
|
||||
```
|
||||
|
||||
#### StreamDecoder
|
||||
Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.
|
||||
|
||||
**Key Features:**
|
||||
- Thread-safe frame buffer (deque)
|
||||
- Connection status tracking
|
||||
- Automatic reconnection handling
|
||||
- Background thread for continuous decoding
|
||||
|
||||
**Key Methods:**
|
||||
- `start()`: Start decoding thread
|
||||
- `stop()`: Stop and cleanup
|
||||
- `get_latest_frame()`: Get most recent RGB frame (GPU tensor)
|
||||
- `is_connected()`: Check connection status
|
||||
- `get_buffer_size()`: Current buffer size
|
||||
|
||||
#### JPEGEncoderFactory
|
||||
Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.
|
||||
|
||||
**Key Function:**
|
||||
```python
|
||||
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
|
||||
"""
|
||||
Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
|
||||
Uses __cuda_array_interface__ for zero-copy operation.
|
||||
|
||||
Performance: 1-2ms per 720p frame
|
||||
"""
|
||||
```
|
||||
|
||||
## Technical Implementation
|
||||
|
||||
### Shared CUDA Context Pattern
|
||||
|
||||
```python
|
||||
# Single shared context for all decoders
|
||||
factory = StreamDecoderFactory(gpu_id=0)
|
||||
|
||||
# All decoders share same context
|
||||
decoder1 = factory.create_decoder(url1, buffer_size=30)
|
||||
decoder2 = factory.create_decoder(url2, buffer_size=30)
|
||||
decoder3 = factory.create_decoder(url3, buffer_size=30)
|
||||
```
|
||||
|
||||
**Benefits:**
|
||||
- 70% VRAM reduction per stream
|
||||
- Single decoder initialization overhead
|
||||
- Efficient resource sharing
|
||||
|
||||
### NV12 to RGB Conversion (GPU)
|
||||
|
||||
```python
|
||||
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
|
||||
"""
|
||||
Converts NV12 (YUV420) to RGB entirely on GPU using PyTorch ops.
|
||||
Uses BT.601 color space conversion.
|
||||
|
||||
Input: (height * 1.5, width) NV12 tensor
|
||||
Output: (3, height, width) RGB tensor
|
||||
"""
|
||||
```
|
||||
|
||||
**Steps** (sketched in code after this list):
|
||||
1. Split Y and UV planes
|
||||
2. Deinterleave UV components
|
||||
3. Upsample chroma (bilinear interpolation)
|
||||
4. Apply BT.601 color matrix
|
||||
5. Clamp to [0, 255]
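A minimal, illustrative version of these steps in plain PyTorch is shown below. It assumes the (height * 1.5, width) uint8 NV12 layout described above and full-range BT.601 coefficients; the function name and exact coefficients are not taken from the actual implementation.
```python
import torch
import torch.nn.functional as F

def nv12_to_rgb_sketch(nv12: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Illustrative NV12 -> RGB conversion on GPU (BT.601, full-range assumption)."""
    # 1. Split Y and UV planes
    y = nv12[:height, :].float()
    uv = nv12[height:, :].float().reshape(height // 2, width // 2, 2)

    # 2. Deinterleave UV components and center around zero
    u = uv[..., 0] - 128.0
    v = uv[..., 1] - 128.0

    # 3. Upsample chroma to full resolution (bilinear interpolation)
    u = F.interpolate(u[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]
    v = F.interpolate(v[None, None], size=(height, width), mode="bilinear", align_corners=False)[0, 0]

    # 4. Apply BT.601 color matrix
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u

    # 5. Clamp to [0, 255] and stack into (3, H, W)
    return torch.stack([r, g, b], dim=0).clamp(0, 255).to(torch.uint8)
```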
|
||||
|
||||
### Zero-Copy Operations
|
||||
|
||||
**DLPack for PyTorch ↔ nvImageCodec:**
|
||||
```python
|
||||
# GPU tensor stays on GPU
|
||||
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
|
||||
nv_image = nvimgcodec.as_image(rgb_hwc) # Uses __cuda_array_interface__
|
||||
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
|
||||
```
|
||||
|
||||
## Performance Metrics
|
||||
|
||||
### VRAM Usage (Python Process)
|
||||
|
||||
| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|
||||
|---------|-----------|----------|------------|---------------|
|
||||
| 0 | 216 MB | 0 MB | - | - |
|
||||
| 1 | 278 MB | 62 MB | 62.0 MB | 62 MB |
|
||||
| 2 | 338 MB | 122 MB | 61.0 MB | 60 MB |
|
||||
| 3 | 398 MB | 182 MB | 60.7 MB | 60 MB |
|
||||
| 4 | 458 MB | 242 MB | 60.5 MB | 60 MB |
|
||||
|
||||
**Result:** Perfect linear scaling at ~60 MB per stream
|
||||
|
||||
### Capacity Estimates
|
||||
|
||||
With 60 MB per stream plus a 216 MB baseline (back-of-envelope calculation after this list):
|
||||
|
||||
- **16GB GPU**: ~269 cameras (conservative: ~250)
|
||||
- **24GB GPU**: ~407 cameras (conservative: ~380)
|
||||
- **48GB GPU**: ~815 cameras (conservative: ~780)
|
||||
- **For 1000 streams**: ~60GB VRAM required
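The estimates follow directly from the linear scaling measurement; a quick calculation (rounding may differ by a stream or two from the figures listed above):
```python
BASELINE_MB = 216   # measured process overhead with 0 streams
PER_STREAM_MB = 60  # marginal cost per additional stream

def estimated_capacity(vram_gb: float) -> int:
    return int((vram_gb * 1024 - BASELINE_MB) // PER_STREAM_MB)

# estimated_capacity(16) -> 269, estimated_capacity(24) -> 406, estimated_capacity(48) -> 815
# 1000 streams: 216 + 1000 * 60 = 60,216 MB, i.e. roughly 60 GB
```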
|
||||
|
||||
### Throughput
|
||||
|
||||
- **Frame Rate**: 7-7.5 FPS per stream @ 720p
|
||||
- **JPEG Encoding**: 1-2ms per frame
|
||||
- **Connection Time**: ~15s for stream stabilization
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
python-rtsp-worker/
|
||||
├── app.py # FastAPI application
|
||||
├── services/
|
||||
│ ├── __init__.py # Package exports
|
||||
│ ├── stream_decoder.py # StreamDecoder & Factory
|
||||
│ └── jpeg_encoder.py # JPEG encoding utilities
|
||||
├── test_stream.py # Single stream test
|
||||
├── test_multi_stream.py # 4-stream test with monitoring
|
||||
├── test_vram_scaling.py # System VRAM measurement
|
||||
├── test_vram_process.py # Process VRAM measurement
|
||||
├── test_jpeg_encode.py # JPEG encoding test
|
||||
├── requirements.txt # Python dependencies
|
||||
├── .env # Camera URLs (gitignored)
|
||||
├── .env.example # Template for camera URLs
|
||||
└── .gitignore
|
||||
|
||||
```
|
||||
|
||||
## Dependencies
|
||||
|
||||
```
|
||||
fastapi # Web framework
|
||||
uvicorn[standard] # ASGI server
|
||||
torch # GPU tensor operations
|
||||
PyNvVideoCodec # NVDEC hardware decoding
|
||||
av # FFmpeg/RTSP client
|
||||
cuda-python # CUDA driver bindings
|
||||
nvidia-nvimgcodec-cu12 # nvJPEG encoding
|
||||
python-dotenv # Environment variables
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
### Environment Variables (.env)
|
||||
|
||||
```bash
|
||||
# RTSP Camera URLs
|
||||
CAMERA_URL_1=rtsp://user:pass@host/path
|
||||
CAMERA_URL_2=rtsp://user:pass@host/path
|
||||
CAMERA_URL_3=rtsp://user:pass@host/path
|
||||
CAMERA_URL_4=rtsp://user:pass@host/path
|
||||
# Add more as needed...
|
||||
```
|
||||
|
||||
### Loading URLs in Code
|
||||
|
||||
```python
|
||||
from dotenv import load_dotenv
|
||||
import os
|
||||
|
||||
load_dotenv()
|
||||
|
||||
camera_urls = []
|
||||
i = 1
|
||||
while True:
|
||||
url = os.getenv(f'CAMERA_URL_{i}')
|
||||
if url:
|
||||
camera_urls.append(url)
|
||||
i += 1
|
||||
else:
|
||||
break
|
||||
```
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from services import StreamDecoderFactory, encode_frame_to_jpeg
|
||||
|
||||
# Create factory (shared CUDA context)
|
||||
factory = StreamDecoderFactory(gpu_id=0)
|
||||
|
||||
# Create decoder
|
||||
decoder = factory.create_decoder(
|
||||
rtsp_url="rtsp://user:pass@host/path",
|
||||
buffer_size=30
|
||||
)
|
||||
|
||||
# Start decoding
|
||||
decoder.start()
|
||||
|
||||
# Wait for connection
|
||||
import time
|
||||
time.sleep(5)
|
||||
|
||||
# Get latest frame (GPU tensor)
|
||||
rgb_frame = decoder.get_latest_frame()
|
||||
if rgb_frame is not None:
|
||||
# Encode to JPEG (on GPU)
|
||||
jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)
|
||||
|
||||
# Save or transmit jpeg_bytes
|
||||
with open("frame.jpg", "wb") as f:
|
||||
f.write(jpeg_bytes)
|
||||
|
||||
# Cleanup
|
||||
decoder.stop()
|
||||
```
|
||||
|
||||
### Multi-Stream Usage
|
||||
|
||||
```python
|
||||
from services import StreamDecoderFactory
|
||||
import time
|
||||
|
||||
factory = StreamDecoderFactory(gpu_id=0)
|
||||
|
||||
# Create multiple decoders (all share context)
|
||||
decoders = []
|
||||
for url in camera_urls:
|
||||
decoder = factory.create_decoder(url, buffer_size=30)
|
||||
decoder.start()
|
||||
decoders.append(decoder)
|
||||
|
||||
# Wait for connections
|
||||
time.sleep(15)
|
||||
|
||||
# Check status
|
||||
for i, decoder in enumerate(decoders):
|
||||
status = decoder.get_status()
|
||||
buffer_size = decoder.get_buffer_size()
|
||||
connected = decoder.is_connected()
|
||||
print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")
|
||||
|
||||
# Process frames
|
||||
for decoder in decoders:
|
||||
frame = decoder.get_latest_frame()
|
||||
if frame is not None:
|
||||
# Process frame...
|
||||
pass
|
||||
|
||||
# Cleanup
|
||||
for decoder in decoders:
|
||||
decoder.stop()
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
### Run Single Stream Test
|
||||
```bash
|
||||
python test_stream.py
|
||||
```
|
||||
|
||||
### Run 4-Stream Test with VRAM Monitoring
|
||||
```bash
|
||||
python test_multi_stream.py
|
||||
```
|
||||
|
||||
### Measure VRAM Scaling
|
||||
```bash
|
||||
python test_vram_process.py
|
||||
```
|
||||
|
||||
### Test JPEG Encoding
|
||||
```bash
|
||||
python test_jpeg_encode.py
|
||||
```
|
||||
|
||||
## Known Issues
|
||||
|
||||
### Segmentation Faults on Cleanup
|
||||
**Status**: Non-critical
|
||||
**Impact**: Occurs during cleanup, doesn't affect core functionality
|
||||
**Cause**: Likely CUDA context cleanup order issues
|
||||
**Workaround**: Functionality works correctly; cleanup errors can be ignored
|
||||
|
||||
## Technical Decisions
|
||||
|
||||
### Why PyNvVideoCodec?
|
||||
- Direct access to NVDEC hardware decoder
|
||||
- Minimal overhead compared to FFmpeg/torchaudio
|
||||
- Returns GPU tensors via DLPack
|
||||
- Better control over decode sessions
|
||||
|
||||
### Why Shared CUDA Context?
|
||||
- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
|
||||
- Enables 1000-stream target on 60GB GPU
|
||||
- Minimal complexity overhead with singleton pattern
|
||||
|
||||
### Why nvImageCodec?
|
||||
- GPU-native JPEG encoding (nvJPEG)
|
||||
- Zero-copy with PyTorch via `__cuda_array_interface__`
|
||||
- 1-2ms encoding time per 720p frame
|
||||
- Keeps data on GPU until final compression
|
||||
|
||||
### Why Thread-Safe Ring Buffer?
|
||||
- Decouples decoding from inference pipeline
|
||||
- Prevents frame drops during processing spikes
|
||||
- Allows async frame access
|
||||
- Configurable buffer size per stream (see the sketch below)
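A minimal sketch of the buffering idea, not the actual StreamDecoder internals: a lock-guarded `deque(maxlen=N)` gives a fixed-size ring buffer where the decode thread appends and consumers read the newest frame. Class and method names here are assumptions.
```python
import threading
from collections import deque

class FrameRingBuffer:
    """Illustrative thread-safe ring buffer; names are placeholders."""

    def __init__(self, maxlen: int = 30):
        self._frames = deque(maxlen=maxlen)   # oldest frames are evicted automatically
        self._lock = threading.Lock()

    def push(self, frame) -> None:
        """Called by the decode thread for every decoded frame."""
        with self._lock:
            self._frames.append(frame)

    def latest(self):
        """Called by the inference side; returns the newest frame or None."""
        with self._lock:
            return self._frames[-1] if self._frames else None

    def __len__(self) -> int:
        with self._lock:
            return len(self._frames)
```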
|
||||
|
||||
## Future Considerations
|
||||
|
||||
### Hardware Decode Session Limits
|
||||
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
|
||||
- May need multiple GPUs for 1000 streams
|
||||
- Test with actual hardware to verify limits
|
||||
|
||||
### Scaling Beyond 1000 Streams
|
||||
- Multi-GPU support with context per GPU
|
||||
- Load balancing across GPUs
|
||||
- Network bandwidth considerations
|
||||
|
||||
### TensorRT Integration
|
||||
- Next step: Integrate with TensorRT inference pipeline
|
||||
- GPU frames → TensorRT → Results
|
||||
- Keep entire pipeline on GPU
|
||||
|
||||
## References
|
||||
|
||||
- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
|
||||
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
|
||||
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
|
||||
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)
|
||||
|
||||
## License
|
||||
|
||||
This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.
|
||||
11
requirements.dev.txt
Normal file
@@ -0,0 +1,11 @@
# Development Dependencies
# Install with: pip install -r requirements.dev.txt

# Model conversion tools
tensorrt
onnx
ultralytics  # For YOLO model download and export

# Optional: Additional tools for model optimization
onnxruntime-gpu  # ONNX runtime for testing
onnx-simplifier  # Simplify ONNX models
8
requirements.txt
Normal file
@@ -0,0 +1,8 @@
fastapi
uvicorn[standard]
torch
PyNvVideoCodec
av
cuda-python
nvidia-nvimgcodec-cu12  # GPU-accelerated JPEG encoding/decoding with nvJPEG
python-dotenv  # Load environment variables from .env file
197
scripts/README.md
Normal file
@@ -0,0 +1,197 @@
# Scripts Directory
|
||||
|
||||
This directory contains utility scripts for the python-rtsp-worker project.
|
||||
|
||||
## convert_pt_to_tensorrt.py
|
||||
|
||||
Converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized GPU inference.
|
||||
|
||||
### Features
|
||||
|
||||
- **Multiple Precision Modes**: FP32, FP16, INT8
|
||||
- **Dynamic Batch Size**: Support for variable batch sizes
|
||||
- **Automatic Optimization**: Creates optimization profiles for best performance
|
||||
- **ONNX Intermediate**: Uses ONNX as intermediate format for compatibility
|
||||
- **Easy to Use**: Simple command-line interface
|
||||
|
||||
### Requirements
|
||||
|
||||
Make sure you have the following dependencies installed:
|
||||
|
||||
```bash
|
||||
pip install torch tensorrt onnx
|
||||
```
|
||||
|
||||
### Quick Start
|
||||
|
||||
**Basic conversion (FP32)**:
|
||||
```bash
|
||||
python scripts/convert_pt_to_tensorrt.py \
|
||||
--model path/to/model.pt \
|
||||
--output models/model.trt
|
||||
```
|
||||
|
||||
**FP16 precision** (recommended for most cases - 2x faster, minimal accuracy loss):
|
||||
```bash
|
||||
python scripts/convert_pt_to_tensorrt.py \
|
||||
--model yolov8n.pt \
|
||||
--output models/yolov8n.trt \
|
||||
--fp16
|
||||
```
|
||||
|
||||
**Custom input shape**:
|
||||
```bash
|
||||
python scripts/convert_pt_to_tensorrt.py \
|
||||
--model model.pt \
|
||||
--output model.trt \
|
||||
--input-shape 1,3,416,416
|
||||
```
|
||||
|
||||
**Dynamic batch size** (for variable batch inference):
|
||||
```bash
|
||||
python scripts/convert_pt_to_tensorrt.py \
|
||||
--model model.pt \
|
||||
--output model.trt \
|
||||
--dynamic-batch \
|
||||
--max-batch 16
|
||||
```
|
||||
|
||||
**Maximum optimization** (FP16 + INT8):
|
||||
```bash
|
||||
python scripts/convert_pt_to_tensorrt.py \
|
||||
--model model.pt \
|
||||
--output model.trt \
|
||||
--fp16 \
|
||||
--int8
|
||||
```
|
||||
|
||||
### Command-Line Arguments
|
||||
|
||||
| Argument | Required | Default | Description |
|
||||
|----------|----------|---------|-------------|
|
||||
| `--model`, `-m` | Yes | - | Path to PyTorch (.pt, .pth) or ONNX (.onnx) model file |
|
||||
| `--output`, `-o` | Yes | - | Output path for TensorRT engine (.trt) |
|
||||
| `--input-shape`, `-s` | No | 1,3,640,640 | Input tensor shape as B,C,H,W |
|
||||
| `--fp16` | No | False | Enable FP16 precision (faster, ~same accuracy) |
|
||||
| `--int8` | No | False | Enable INT8 precision (fastest, needs calibration) |
|
||||
| `--dynamic-batch` | No | False | Enable dynamic batch size support |
|
||||
| `--max-batch` | No | 16 | Maximum batch size for dynamic batching |
|
||||
| `--workspace-size` | No | 4 | TensorRT workspace size in GB |
|
||||
| `--gpu` | No | 0 | GPU device ID to use |
|
||||
| `--input-names` | No | ["input"] | Custom input tensor names |
|
||||
| `--output-names` | No | ["output"] | Custom output tensor names |
|
||||
| `--keep-onnx` | No | False | Keep intermediate ONNX file for debugging |
|
||||
| `--verbose`, `-v` | No | False | Enable verbose logging |
|
||||
|
||||
### Performance Tips
|
||||
|
||||
1. **Always use FP16** unless you need FP32 precision:
|
||||
- 2x faster inference
|
||||
- 50% less VRAM usage
|
||||
- Minimal accuracy loss for most models
|
||||
|
||||
2. **Use dynamic batching** for variable workloads:
|
||||
- Process 1-16 images with same engine
|
||||
- Automatic optimization for common batch sizes
|
||||
|
||||
3. **Increase workspace size** for complex models:
|
||||
- Default 4GB works for most models
|
||||
- Increase to 8GB for very large models
|
||||
|
||||
4. **INT8 quantization** for maximum speed:
|
||||
- Requires calibration data (not included in basic conversion; see the calibrator sketch below)
|
||||
- 4x faster than FP32
|
||||
- Best for deployment scenarios
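For reference, INT8 calibration is not part of this script. The sketch below shows the general shape of a TensorRT entropy calibrator, assuming calibration data is supplied as an iterable of (1, C, H, W) float32 NumPy batches; class name and cache-file path are illustrative.
```python
import torch
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Minimal INT8 calibrator sketch (not included in convert_pt_to_tensorrt.py)."""

    def __init__(self, calib_batches, cache_file="calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self._batches = iter(calib_batches)   # iterable of (1, C, H, W) float32 numpy arrays
        self._cache_file = cache_file
        self._device_input = None             # keep the GPU tensor alive during calibration

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        batch = next(self._batches, None)
        if batch is None:
            return None                       # no more data -> calibration finished
        self._device_input = torch.from_numpy(batch).float().cuda()
        return [int(self._device_input.data_ptr())]

    def read_calibration_cache(self):
        try:
            with open(self._cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self._cache_file, "wb") as f:
            f.write(cache)

# Hook it into the builder config before building:
# config.int8_calibrator = EntropyCalibrator(my_batches)
```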
|
||||
|
||||
### Integration with Model Repository
|
||||
|
||||
Once converted, use the TensorRT engine with the model repository:
|
||||
|
||||
```python
|
||||
from services.model_repository import TensorRTModelRepository
|
||||
|
||||
# Initialize repository
|
||||
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||
|
||||
# Load the converted model
|
||||
repo.load_model(
|
||||
model_id="my_model",
|
||||
file_path="models/model.trt",
|
||||
num_contexts=4
|
||||
)
|
||||
|
||||
# Run inference
|
||||
import torch
|
||||
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||
outputs = repo.infer(
|
||||
model_id="my_model",
|
||||
inputs={"input": input_tensor}
|
||||
)
|
||||
```
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
**Issue**: `Failed to parse ONNX model`
|
||||
- Solution: Check if your PyTorch model is compatible with ONNX export
|
||||
- Try updating PyTorch and ONNX versions
|
||||
|
||||
**Issue**: `FP16 not supported on this platform`
|
||||
- Solution: Your GPU doesn't support FP16. Remove `--fp16` flag
|
||||
|
||||
**Issue**: `Out of memory during conversion`
|
||||
- Solution: Reduce `--workspace-size` or free up GPU memory
|
||||
|
||||
**Issue**: `Model contains only state_dict`
|
||||
- Solution: Your checkpoint only has weights. You need the full model architecture.
|
||||
- Modify the script's `load_pytorch_model()` method to instantiate your model class (see the sketch below)
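For example, a hypothetical modification along these lines (`MyNet` is a placeholder for your own architecture):
```python
import torch
from my_project.models import MyNet  # placeholder: your own model class

def load_from_state_dict(checkpoint_path: str, device: str = "cuda:0") -> torch.nn.Module:
    model = MyNet()                                         # build the architecture explicitly
    checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)
    state_dict = checkpoint.get("state_dict", checkpoint)   # handle wrapped or bare state_dicts
    model.load_state_dict(state_dict)
    return model.eval().to(device)
```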
|
||||
|
||||
### Examples for Common Models
|
||||
|
||||
**YOLOv8**:
|
||||
```bash
|
||||
# Download model first
|
||||
# yolo export model=yolov8n.pt format=engine device=0
|
||||
|
||||
# Or use this script
|
||||
python scripts/convert_pt_to_tensorrt.py \
|
||||
--model yolov8n.pt \
|
||||
--output models/yolov8n.trt \
|
||||
--input-shape 1,3,640,640 \
|
||||
--fp16
|
||||
```
|
||||
|
||||
**ResNet**:
|
||||
```bash
|
||||
python scripts/convert_pt_to_tensorrt.py \
|
||||
--model resnet50.pt \
|
||||
--output models/resnet50.trt \
|
||||
--input-shape 1,3,224,224 \
|
||||
--fp16 \
|
||||
--dynamic-batch \
|
||||
--max-batch 32
|
||||
```
|
||||
|
||||
**Custom Model**:
|
||||
```bash
|
||||
python scripts/convert_pt_to_tensorrt.py \
|
||||
--model custom_model.pt \
|
||||
--output models/custom.trt \
|
||||
--input-shape 1,3,512,512 \
|
||||
--input-names image \
|
||||
--output-names predictions \
|
||||
--fp16 \
|
||||
--verbose
|
||||
```
|
||||
|
||||
### Notes
|
||||
|
||||
- The script uses ONNX as an intermediate format, which is the recommended approach
|
||||
- TensorRT engines are hardware-specific; rebuild for different GPUs
|
||||
- Conversion time varies (30 seconds to 5 minutes depending on model size)
|
||||
- The first inference after loading is slower (warmup)
|
||||
|
||||
### Support
|
||||
|
||||
For issues or questions, please check:
|
||||
- TensorRT documentation: https://docs.nvidia.com/deeplearning/tensorrt/
|
||||
- PyTorch ONNX export guide: https://pytorch.org/docs/stable/onnx.html
|
||||
562
scripts/convert_pt_to_tensorrt.py
Executable file
@@ -0,0 +1,562 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
PyTorch to TensorRT Model Conversion Script
|
||||
|
||||
This script converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized inference.
|
||||
|
||||
Features:
|
||||
- Automatic FP32/FP16/INT8 precision modes
|
||||
- Dynamic batch size support
|
||||
- Input shape validation
|
||||
- Optimization profiles for dynamic shapes
|
||||
- ONNX intermediate format
|
||||
- GPU-accelerated conversion
|
||||
|
||||
Usage:
|
||||
python convert_pt_to_tensorrt.py --model path/to/model.pt --output models/model.trt
|
||||
python convert_pt_to_tensorrt.py --model yolov8n.pt --input-shape 1,3,640,640 --fp16
|
||||
python convert_pt_to_tensorrt.py --model model.pt --dynamic-batch --max-batch 16
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Tuple, List, Optional
|
||||
import torch
|
||||
import tensorrt as trt
|
||||
import numpy as np
|
||||
|
||||
|
||||
class TensorRTConverter:
|
||||
"""Converts PyTorch models to TensorRT engines"""
|
||||
|
||||
def __init__(self, gpu_id: int = 0, verbose: bool = True):
|
||||
"""
|
||||
Initialize the converter.
|
||||
|
||||
Args:
|
||||
gpu_id: GPU device ID to use for conversion
|
||||
verbose: Enable verbose logging
|
||||
"""
|
||||
self.gpu_id = gpu_id
|
||||
self.device = torch.device(f'cuda:{gpu_id}')
|
||||
|
||||
# TensorRT logger
|
||||
log_level = trt.Logger.VERBOSE if verbose else trt.Logger.WARNING
|
||||
self.logger = trt.Logger(log_level)
|
||||
|
||||
# Set CUDA device
|
||||
torch.cuda.set_device(gpu_id)
|
||||
|
||||
print(f"Initialized TensorRT Converter on GPU {gpu_id}")
|
||||
print(f"PyTorch version: {torch.__version__}")
|
||||
print(f"TensorRT version: {trt.__version__}")
|
||||
print(f"CUDA available: {torch.cuda.is_available()}")
|
||||
if torch.cuda.is_available():
|
||||
print(f"CUDA device: {torch.cuda.get_device_name(gpu_id)}")
|
||||
|
||||
def load_pytorch_model(self, model_path: str) -> torch.nn.Module:
|
||||
"""
|
||||
Load PyTorch model from file.
|
||||
|
||||
Args:
|
||||
model_path: Path to .pt or .pth file
|
||||
|
||||
Returns:
|
||||
Loaded PyTorch model in eval mode
|
||||
"""
|
||||
print(f"\nLoading PyTorch model from {model_path}...")
|
||||
|
||||
if not Path(model_path).exists():
|
||||
raise FileNotFoundError(f"Model file not found: {model_path}")
|
||||
|
||||
# Load model (weights_only=False for models with custom classes)
|
||||
checkpoint = torch.load(model_path, map_location=self.device, weights_only=False)
|
||||
|
||||
# Handle different checkpoint formats
|
||||
if isinstance(checkpoint, dict):
|
||||
if 'model' in checkpoint:
|
||||
model = checkpoint['model']
|
||||
elif 'state_dict' in checkpoint:
|
||||
# Need model architecture - this is a limitation
|
||||
raise ValueError(
|
||||
"Checkpoint contains only state_dict. "
|
||||
"Please provide the complete model or modify this script to load your architecture."
|
||||
)
|
||||
else:
|
||||
raise ValueError("Unknown checkpoint format")
|
||||
else:
|
||||
model = checkpoint
|
||||
|
||||
# Set to eval mode
|
||||
model.eval()
|
||||
model.to(self.device)
|
||||
|
||||
print(f"✓ Model loaded successfully")
|
||||
return model
|
||||
|
||||
def export_to_onnx(self, model: torch.nn.Module, input_shape: Tuple[int, ...],
|
||||
onnx_path: str, dynamic_batch: bool = False,
|
||||
input_names: List[str] = None, output_names: List[str] = None) -> str:
|
||||
"""
|
||||
Export PyTorch model to ONNX format (intermediate step).
|
||||
|
||||
Args:
|
||||
model: PyTorch model
|
||||
input_shape: Input tensor shape (B, C, H, W)
|
||||
onnx_path: Output path for ONNX file
|
||||
dynamic_batch: Enable dynamic batch dimension
|
||||
input_names: List of input tensor names
|
||||
output_names: List of output tensor names
|
||||
|
||||
Returns:
|
||||
Path to exported ONNX file
|
||||
"""
|
||||
print(f"\nExporting to ONNX format...")
|
||||
print(f"Input shape: {input_shape}")
|
||||
print(f"Dynamic batch: {dynamic_batch}")
|
||||
|
||||
# Default names
|
||||
if input_names is None:
|
||||
input_names = ['input']
|
||||
if output_names is None:
|
||||
output_names = ['output']
|
||||
|
||||
# Create dummy input
|
||||
dummy_input = torch.randn(*input_shape, device=self.device)
|
||||
|
||||
# Dynamic axes configuration
|
||||
dynamic_axes = None
|
||||
if dynamic_batch:
|
||||
dynamic_axes = {
|
||||
input_names[0]: {0: 'batch'},
|
||||
output_names[0]: {0: 'batch'}
|
||||
}
|
||||
|
||||
# Export to ONNX
|
||||
torch.onnx.export(
|
||||
model,
|
||||
dummy_input,
|
||||
onnx_path,
|
||||
input_names=input_names,
|
||||
output_names=output_names,
|
||||
dynamic_axes=dynamic_axes,
|
||||
opset_version=17, # Use recent ONNX opset
|
||||
do_constant_folding=True,
|
||||
verbose=False
|
||||
)
|
||||
|
||||
print(f"✓ ONNX model exported to {onnx_path}")
|
||||
return onnx_path
|
||||
|
||||
def build_tensorrt_engine_from_onnx(self, onnx_path: str, engine_path: str,
|
||||
fp16: bool = False, int8: bool = False,
|
||||
max_workspace_size: int = 4,
|
||||
min_batch: int = 1, opt_batch: int = 1, max_batch: int = 1) -> str:
|
||||
"""
|
||||
Build TensorRT engine from ONNX model.
|
||||
|
||||
Args:
|
||||
onnx_path: Path to ONNX model
|
||||
engine_path: Output path for TensorRT engine
|
||||
fp16: Enable FP16 precision
|
||||
int8: Enable INT8 precision (requires calibration)
|
||||
max_workspace_size: Maximum workspace size in GB
|
||||
min_batch: Minimum batch size for optimization
|
||||
opt_batch: Optimal batch size for optimization
|
||||
max_batch: Maximum batch size for optimization
|
||||
|
||||
Returns:
|
||||
Path to built TensorRT engine
|
||||
"""
|
||||
print(f"\nBuilding TensorRT engine from ONNX...")
|
||||
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||
print(f"Workspace size: {max_workspace_size} GB")
|
||||
|
||||
# Create builder and network
|
||||
builder = trt.Builder(self.logger)
|
||||
network = builder.create_network(
|
||||
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
|
||||
)
|
||||
parser = trt.OnnxParser(network, self.logger)
|
||||
|
||||
# Parse ONNX model
|
||||
print(f"Loading ONNX file from {onnx_path}...")
|
||||
with open(onnx_path, 'rb') as f:
|
||||
if not parser.parse(f.read()):
|
||||
print("ERROR: Failed to parse the ONNX file:")
|
||||
for error in range(parser.num_errors):
|
||||
print(f" {parser.get_error(error)}")
|
||||
raise RuntimeError("Failed to parse ONNX model")
|
||||
|
||||
print(f"✓ ONNX model parsed successfully")
|
||||
|
||||
# Print network info
|
||||
print(f"\nNetwork Information:")
|
||||
print(f" Inputs: {network.num_inputs}")
|
||||
for i in range(network.num_inputs):
|
||||
inp = network.get_input(i)
|
||||
print(f" [{i}] {inp.name}: {inp.shape} ({inp.dtype})")
|
||||
|
||||
print(f" Outputs: {network.num_outputs}")
|
||||
for i in range(network.num_outputs):
|
||||
out = network.get_output(i)
|
||||
print(f" [{i}] {out.name}: {out.shape} ({out.dtype})")
|
||||
|
||||
# Create builder config
|
||||
config = builder.create_builder_config()
|
||||
|
||||
# Set workspace size
|
||||
config.set_memory_pool_limit(
|
||||
trt.MemoryPoolType.WORKSPACE,
|
||||
max_workspace_size * (1 << 30) # GB to bytes
|
||||
)
|
||||
|
||||
# Enable precision modes
|
||||
if fp16:
|
||||
if not builder.platform_has_fast_fp16:
|
||||
print("Warning: FP16 not supported on this platform, using FP32")
|
||||
else:
|
||||
config.set_flag(trt.BuilderFlag.FP16)
|
||||
print("✓ FP16 mode enabled")
|
||||
|
||||
if int8:
|
||||
if not builder.platform_has_fast_int8:
|
||||
print("Warning: INT8 not supported on this platform, using FP32/FP16")
|
||||
else:
|
||||
config.set_flag(trt.BuilderFlag.INT8)
|
||||
print("✓ INT8 mode enabled")
|
||||
print("Note: INT8 calibration not implemented. Results may be suboptimal.")
|
||||
|
||||
# Set optimization profile for dynamic shapes
|
||||
if max_batch > 1 or min_batch != max_batch:
|
||||
profile = builder.create_optimization_profile()
|
||||
|
||||
for i in range(network.num_inputs):
|
||||
inp = network.get_input(i)
|
||||
shape = list(inp.shape)
|
||||
|
||||
# Handle dynamic batch dimension
|
||||
if shape[0] == -1:
|
||||
# Min, opt, max shapes
|
||||
min_shape = [min_batch] + shape[1:]
|
||||
opt_shape = [opt_batch] + shape[1:]
|
||||
max_shape = [max_batch] + shape[1:]
|
||||
|
||||
profile.set_shape(inp.name, min_shape, opt_shape, max_shape)
|
||||
print(f" Dynamic shape for {inp.name}:")
|
||||
print(f" Min: {min_shape}")
|
||||
print(f" Opt: {opt_shape}")
|
||||
print(f" Max: {max_shape}")
|
||||
|
||||
config.add_optimization_profile(profile)
|
||||
|
||||
# Build engine
|
||||
print(f"\nBuilding TensorRT engine (this may take a few minutes)...")
|
||||
serialized_engine = builder.build_serialized_network(network, config)
|
||||
|
||||
if serialized_engine is None:
|
||||
raise RuntimeError("Failed to build TensorRT engine")
|
||||
|
||||
# Save engine to file
|
||||
print(f"Saving engine to {engine_path}...")
|
||||
with open(engine_path, 'wb') as f:
|
||||
f.write(serialized_engine)
|
||||
|
||||
# Get file size
|
||||
file_size_mb = Path(engine_path).stat().st_size / (1024 * 1024)
|
||||
print(f"✓ TensorRT engine built successfully")
|
||||
print(f" Engine size: {file_size_mb:.2f} MB")
|
||||
|
||||
return engine_path
|
||||
|
||||
def convert(self, model_path: str, output_path: str,
|
||||
input_shape: Tuple[int, ...] = (1, 3, 640, 640),
|
||||
fp16: bool = False, int8: bool = False,
|
||||
dynamic_batch: bool = False,
|
||||
max_batch: int = 16,
|
||||
workspace_size: int = 4,
|
||||
input_names: List[str] = None,
|
||||
output_names: List[str] = None,
|
||||
keep_onnx: bool = False) -> str:
|
||||
"""
|
||||
Convert PyTorch or ONNX model to TensorRT engine.
|
||||
|
||||
Args:
|
||||
model_path: Path to PyTorch model (.pt, .pth) or ONNX model (.onnx)
|
||||
output_path: Path for output TensorRT engine (.trt)
|
||||
input_shape: Input tensor shape (B, C, H, W) - required for PyTorch models
|
||||
fp16: Enable FP16 precision
|
||||
int8: Enable INT8 precision
|
||||
dynamic_batch: Enable dynamic batch size
|
||||
max_batch: Maximum batch size (for dynamic batching)
|
||||
workspace_size: TensorRT workspace size in GB
|
||||
input_names: Custom input names (for PyTorch export)
|
||||
output_names: Custom output names (for PyTorch export)
|
||||
keep_onnx: Keep intermediate ONNX file
|
||||
|
||||
Returns:
|
||||
Path to created TensorRT engine
|
||||
"""
|
||||
# Create output directory
|
||||
output_dir = Path(output_path).parent
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Check if input is already ONNX
|
||||
model_path_obj = Path(model_path)
|
||||
is_onnx = model_path_obj.suffix.lower() == '.onnx'
|
||||
|
||||
if is_onnx:
|
||||
# Direct ONNX to TensorRT conversion
|
||||
print(f"Input is ONNX model, converting directly to TensorRT...")
|
||||
|
||||
min_batch = 1
|
||||
opt_batch = input_shape[0] if not dynamic_batch else max(1, max_batch // 2)
|
||||
max_batch_size = max_batch if dynamic_batch else input_shape[0]
|
||||
|
||||
engine_path = self.build_tensorrt_engine_from_onnx(
|
||||
onnx_path=model_path,
|
||||
engine_path=output_path,
|
||||
fp16=fp16,
|
||||
int8=int8,
|
||||
max_workspace_size=workspace_size,
|
||||
min_batch=min_batch,
|
||||
opt_batch=opt_batch,
|
||||
max_batch=max_batch_size
|
||||
)
|
||||
|
||||
print(f"\n{'=' * 80}")
|
||||
print(f"CONVERSION COMPLETED SUCCESSFULLY")
|
||||
print(f"{'=' * 80}")
|
||||
print(f"Input: {model_path}")
|
||||
print(f"Output: {engine_path}")
|
||||
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||
print(f"{'=' * 80}")
|
||||
|
||||
return engine_path
|
||||
|
||||
# PyTorch to TensorRT conversion (via ONNX)
|
||||
# Temporary ONNX path
|
||||
onnx_path = str(output_dir / "temp_model.onnx")
|
||||
|
||||
try:
|
||||
# Step 1: Load PyTorch model
|
||||
model = self.load_pytorch_model(model_path)
|
||||
|
||||
# Step 2: Export to ONNX
|
||||
self.export_to_onnx(
|
||||
model=model,
|
||||
input_shape=input_shape,
|
||||
onnx_path=onnx_path,
|
||||
dynamic_batch=dynamic_batch,
|
||||
input_names=input_names,
|
||||
output_names=output_names
|
||||
)
|
||||
|
||||
# Step 3: Build TensorRT engine
|
||||
min_batch = 1
|
||||
opt_batch = input_shape[0] if not dynamic_batch else max(1, max_batch // 2)
|
||||
max_batch_size = max_batch if dynamic_batch else input_shape[0]
|
||||
|
||||
engine_path = self.build_tensorrt_engine_from_onnx(
|
||||
onnx_path=onnx_path,
|
||||
engine_path=output_path,
|
||||
fp16=fp16,
|
||||
int8=int8,
|
||||
max_workspace_size=workspace_size,
|
||||
min_batch=min_batch,
|
||||
opt_batch=opt_batch,
|
||||
max_batch=max_batch_size
|
||||
)
|
||||
|
||||
print(f"\n{'=' * 80}")
|
||||
print(f"CONVERSION COMPLETED SUCCESSFULLY")
|
||||
print(f"{'=' * 80}")
|
||||
print(f"Input: {model_path}")
|
||||
print(f"Output: {engine_path}")
|
||||
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||
print(f"Dynamic batch: {dynamic_batch}")
|
||||
if dynamic_batch:
|
||||
print(f"Batch range: [1, {max_batch}]")
|
||||
print(f"{'=' * 80}")
|
||||
|
||||
return engine_path
|
||||
|
||||
finally:
|
||||
# Cleanup temporary ONNX file
|
||||
if not keep_onnx and Path(onnx_path).exists():
|
||||
Path(onnx_path).unlink()
|
||||
print(f"Cleaned up temporary ONNX file")
|
||||
|
||||
|
||||
def parse_shape(shape_str: str) -> Tuple[int, ...]:
|
||||
"""Parse shape string like '1,3,640,640' to tuple"""
|
||||
try:
|
||||
return tuple(int(x) for x in shape_str.split(','))
|
||||
except ValueError:
|
||||
raise argparse.ArgumentTypeError(
|
||||
f"Invalid shape format: {shape_str}. Expected format: 1,3,640,640"
|
||||
)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Convert PyTorch models to TensorRT engines",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Basic conversion (FP32)
|
||||
python convert_pt_to_tensorrt.py --model yolov8n.pt --output models/yolov8n.trt
|
||||
|
||||
# FP16 precision for faster inference
|
||||
python convert_pt_to_tensorrt.py --model model.pt --output model.trt --fp16
|
||||
|
||||
# Custom input shape
|
||||
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||
--input-shape 1,3,416,416
|
||||
|
||||
# Dynamic batch size (1 to 16)
|
||||
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||
--dynamic-batch --max-batch 16
|
||||
|
||||
# INT8 quantization for maximum speed (requires calibration)
|
||||
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||
--fp16 --int8
|
||||
|
||||
# Keep intermediate ONNX file for debugging
|
||||
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||
--keep-onnx
|
||||
"""
|
||||
)
|
||||
|
||||
# Required arguments
|
||||
parser.add_argument(
|
||||
'--model', '-m',
|
||||
type=str,
|
||||
required=True,
|
||||
help='Path to PyTorch model file (.pt or .pth)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--output', '-o',
|
||||
type=str,
|
||||
required=True,
|
||||
help='Output path for TensorRT engine (.trt or .engine)'
|
||||
)
|
||||
|
||||
# Optional arguments
|
||||
parser.add_argument(
|
||||
'--input-shape', '-s',
|
||||
type=parse_shape,
|
||||
default=(1, 3, 640, 640),
|
||||
help='Input tensor shape as B,C,H,W (default: 1,3,640,640)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--fp16',
|
||||
action='store_true',
|
||||
help='Enable FP16 precision (faster inference, slightly lower accuracy)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--int8',
|
||||
action='store_true',
|
||||
help='Enable INT8 precision (fastest, requires calibration)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--dynamic-batch',
|
||||
action='store_true',
|
||||
help='Enable dynamic batch size support'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--max-batch',
|
||||
type=int,
|
||||
default=16,
|
||||
help='Maximum batch size for dynamic batching (default: 16)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--workspace-size',
|
||||
type=int,
|
||||
default=4,
|
||||
help='TensorRT workspace size in GB (default: 4)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--gpu',
|
||||
type=int,
|
||||
default=0,
|
||||
help='GPU device ID (default: 0)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--input-names',
|
||||
type=str,
|
||||
nargs='+',
|
||||
default=None,
|
||||
help='Custom input tensor names (default: ["input"])'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--output-names',
|
||||
type=str,
|
||||
nargs='+',
|
||||
default=None,
|
||||
help='Custom output tensor names (default: ["output"])'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--keep-onnx',
|
||||
action='store_true',
|
||||
help='Keep intermediate ONNX file'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--verbose', '-v',
|
||||
action='store_true',
|
||||
help='Enable verbose logging'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate arguments
|
||||
if not Path(args.model).exists():
|
||||
print(f"Error: Model file not found: {args.model}")
|
||||
sys.exit(1)
|
||||
|
||||
if args.int8 and not args.fp16:
|
||||
print("Warning: INT8 mode works best with FP16 enabled. Adding --fp16 flag.")
|
||||
args.fp16 = True
|
||||
|
||||
# Run conversion
|
||||
try:
|
||||
converter = TensorRTConverter(gpu_id=args.gpu, verbose=args.verbose)
|
||||
|
||||
converter.convert(
|
||||
model_path=args.model,
|
||||
output_path=args.output,
|
||||
input_shape=args.input_shape,
|
||||
fp16=args.fp16,
|
||||
int8=args.int8,
|
||||
dynamic_batch=args.dynamic_batch,
|
||||
max_batch=args.max_batch,
|
||||
workspace_size=args.workspace_size,
|
||||
input_names=args.input_names,
|
||||
output_names=args.output_names,
|
||||
keep_onnx=args.keep_onnx
|
||||
)
|
||||
|
||||
print("\n✓ Conversion successful!")
|
||||
|
||||
except Exception as e:
|
||||
print(f"\n✗ Conversion failed: {e}")
|
||||
if args.verbose:
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
380
services/README_MODEL_REPOSITORY.md
Normal file
@@ -0,0 +1,380 @@
# TensorRT Model Repository
|
||||
|
||||
Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Key Features
|
||||
|
||||
1. **Model Deduplication by File Hash**
|
||||
- Multiple model IDs can point to the same model file
|
||||
- Only one engine loaded in VRAM per unique file
|
||||
   - Example: 100 cameras with the same model = 1 engine, not 100 (see the hashing sketch after this list)
|
||||
|
||||
2. **Context Pooling for Load Balancing**
|
||||
- Each unique engine has N execution contexts (configurable)
|
||||
- Contexts borrowed/returned via mutex-based queue
|
||||
- Enables concurrent inference without context-per-model overhead
|
||||
- Example: 100 cameras sharing 4 contexts efficiently
|
||||
|
||||
3. **GPU-to-GPU Inference**
|
||||
- All inputs/outputs stay in VRAM (zero CPU transfers)
|
||||
- Integrates seamlessly with StreamDecoder (frames already on GPU)
|
||||
- Maximum performance for video inference pipelines
|
||||
|
||||
4. **Thread-Safe Concurrent Inference**
|
||||
- Mutex-based context acquisition (TensorRT best practice)
|
||||
- No shared IExecutionContext across threads (safe)
|
||||
- Multiple threads can infer concurrently (limited by pool size)
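As referenced in feature 1 above, the deduplication idea boils down to keying loaded engines by a hash of the file contents. A simplified sketch follows; the real repository's bookkeeping is richer, and `deserialize_engine` is a stand-in for the actual TensorRT loading code.
```python
import hashlib

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

engines = {}        # file hash -> shared engine (one per unique file)
model_to_hash = {}  # model_id -> file hash

def load_model(model_id: str, file_path: str):
    key = file_sha256(file_path)
    if key not in engines:
        engines[key] = deserialize_engine(file_path)  # stand-in for engine deserialization
    model_to_hash[model_id] = key                     # many model_ids can share one engine
```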
|
||||
|
||||
## Design Rationale
|
||||
|
||||
### Why Context Pooling?
|
||||
|
||||
**Without pooling** (naive approach):
|
||||
```
|
||||
100 cameras → 100 model IDs → 100 execution contexts
|
||||
```
|
||||
- Problem: Each context consumes VRAM (layers, workspace, etc.)
|
||||
- Problem: Context creation overhead per camera
|
||||
- Problem: Doesn't scale to hundreds of cameras
|
||||
|
||||
**With pooling** (our approach):
|
||||
```
|
||||
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
|
||||
```
|
||||
- Solution: Contexts shared across all cameras using same model
|
||||
- Solution: Borrow/return mechanism with mutex queue
|
||||
- Solution: Scales to any number of cameras with a fixed context count (sketched below)
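The borrow/return mechanism can be pictured as a queue of pre-created contexts guarded by a timeout. The sketch below mirrors the idea only; the repository's actual classes and signatures differ.
```python
from contextlib import contextmanager
from queue import Empty, Queue

class ContextPool:
    """Illustrative pool: N pre-created execution contexts shared by all requests."""

    def __init__(self, contexts):
        self._available = Queue()
        for ctx in contexts:
            self._available.put(ctx)

    @contextmanager
    def borrow(self, timeout: float = 5.0):
        try:
            ctx = self._available.get(timeout=timeout)   # block until a context frees up
        except Empty:
            raise RuntimeError("No execution context available within timeout")
        try:
            yield ctx                                     # caller runs inference on ctx
        finally:
            self._available.put(ctx)                      # always return it to the pool
```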
|
||||
|
||||
### Memory Savings Example
|
||||
|
||||
YOLOv8n model (~6MB engine file):
|
||||
|
||||
| Approach | Model IDs | Engines | Contexts | Approx VRAM |
|
||||
|----------|-----------|---------|----------|-------------|
|
||||
| Naive | 100 | 100 | 100 | ~1.5 GB |
|
||||
| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |
|
||||
|
||||
**50x memory savings!**
|
||||
|
||||
## Usage
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from services.model_repository import TensorRTModelRepository
|
||||
|
||||
# Initialize repository
|
||||
repo = TensorRTModelRepository(
|
||||
gpu_id=0,
|
||||
default_num_contexts=4 # 4 contexts per unique engine
|
||||
)
|
||||
|
||||
# Load model for camera 1
|
||||
repo.load_model(
|
||||
model_id="camera_1",
|
||||
file_path="models/yolov8n.trt"
|
||||
)
|
||||
|
||||
# Load same model for camera 2 (deduplication happens automatically)
|
||||
repo.load_model(
|
||||
model_id="camera_2",
|
||||
file_path="models/yolov8n.trt" # Same file → shares engine and contexts!
|
||||
)
|
||||
|
||||
# Run inference (GPU-to-GPU)
|
||||
import torch
|
||||
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||
|
||||
outputs = repo.infer(
|
||||
model_id="camera_1",
|
||||
inputs={"images": input_tensor},
|
||||
synchronize=True,
|
||||
timeout=5.0 # Wait up to 5s for available context
|
||||
)
|
||||
|
||||
# Outputs stay on GPU
|
||||
for name, tensor in outputs.items():
|
||||
print(f"{name}: {tensor.shape} on {tensor.device}")
|
||||
```
|
||||
|
||||
### Multi-Camera Scenario
|
||||
|
||||
```python
|
||||
# Setup multiple cameras
|
||||
cameras = [f"camera_{i}" for i in range(100)]
|
||||
|
||||
# Load same model for all cameras
|
||||
for camera_id in cameras:
|
||||
repo.load_model(
|
||||
model_id=camera_id,
|
||||
file_path="models/yolov8n.trt" # Same file for all
|
||||
)
|
||||
|
||||
# Check efficiency
|
||||
stats = repo.get_stats()
|
||||
print(f"Model IDs: {stats['total_model_ids']}") # 100
|
||||
print(f"Unique engines: {stats['unique_engines']}") # 1
|
||||
print(f"Total contexts: {stats['total_contexts']}") # 4
|
||||
```
|
||||
|
||||
### Integration with RTSP Decoder
|
||||
|
||||
```python
|
||||
from services.stream_decoder import StreamDecoderFactory
|
||||
from services.model_repository import TensorRTModelRepository
|
||||
|
||||
# Setup
|
||||
decoder_factory = StreamDecoderFactory(gpu_id=0)
|
||||
model_repo = TensorRTModelRepository(gpu_id=0)
|
||||
|
||||
# Create decoder for camera
|
||||
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
|
||||
decoder.start()
|
||||
|
||||
# Load inference model
|
||||
model_repo.load_model("camera_main", "models/yolov8n.trt")
|
||||
|
||||
# Process frames (everything on GPU)
|
||||
frame_gpu = decoder.get_latest_frame(rgb=True) # torch.Tensor on CUDA
|
||||
|
||||
# Preprocess (stays on GPU)
|
||||
frame_gpu = frame_gpu.float() / 255.0
|
||||
frame_gpu = frame_gpu.unsqueeze(0) # Add batch dim
|
||||
|
||||
# Inference (GPU-to-GPU, zero copy)
|
||||
outputs = model_repo.infer(
|
||||
model_id="camera_main",
|
||||
inputs={"images": frame_gpu}
|
||||
)
|
||||
|
||||
# Post-process outputs (can stay on GPU)
|
||||
# ... NMS, bounding boxes, etc.
|
||||
```
|
||||
|
||||
### Concurrent Inference
|
||||
|
||||
```python
|
||||
import threading
|
||||
|
||||
def process_camera(camera_id: str, model_id: str):
|
||||
# Get frame from decoder (on GPU)
|
||||
frame = decoder.get_latest_frame(rgb=True)
|
||||
|
||||
# Inference automatically borrows/returns context from pool
|
||||
outputs = repo.infer(
|
||||
model_id=model_id,
|
||||
inputs={"images": frame},
|
||||
timeout=10.0 # Wait for available context
|
||||
)
|
||||
|
||||
# Process outputs...
|
||||
|
||||
# Multiple threads can infer concurrently
|
||||
threads = []
|
||||
for i in range(10): # 10 threads
|
||||
t = threading.Thread(
|
||||
target=process_camera,
|
||||
args=(f"camera_{i}", f"camera_{i}")
|
||||
)
|
||||
threads.append(t)
|
||||
t.start()
|
||||
|
||||
for t in threads:
|
||||
t.join()
|
||||
|
||||
# With 4 contexts: up to 4 inferences run in parallel
|
||||
# Others wait in queue, contexts auto-balanced
|
||||
```
|
||||
|
||||
## API Reference
|
||||
|
||||
### TensorRTModelRepository
|
||||
|
||||
#### `__init__(gpu_id=0, default_num_contexts=4)`
|
||||
Initialize the repository.
|
||||
|
||||
**Args:**
|
||||
- `gpu_id`: GPU device ID
|
||||
- `default_num_contexts`: Default context pool size per engine
|
||||
|
||||
#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
|
||||
Load a TensorRT model.
|
||||
|
||||
**Args:**
|
||||
- `model_id`: Unique identifier (e.g., "camera_1")
|
||||
- `file_path`: Path to .trt/.engine file
|
||||
- `num_contexts`: Context pool size (None = use default)
|
||||
- `force_reload`: Reload if model_id exists
|
||||
|
||||
**Returns:** `ModelMetadata`
|
||||
|
||||
**Deduplication:** If file hash matches existing model, reuses engine + contexts.
|
||||
|
||||
#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
|
||||
Run inference.
|
||||
|
||||
**Args:**
|
||||
- `model_id`: Model identifier
|
||||
- `inputs`: Dict mapping input names to CUDA tensors
|
||||
- `synchronize`: Wait for completion
|
||||
- `timeout`: Max wait time for context (seconds)
|
||||
|
||||
**Returns:** Dict mapping output names to CUDA tensors
|
||||
|
||||
**Thread-safe:** Borrows context from pool, returns after inference.
|
||||
|
||||
#### `unload_model(model_id)`
|
||||
Unload a model.
|
||||
|
||||
If last reference to engine, fully unloads from VRAM.
|
||||
|
||||
#### `get_metadata(model_id)`
|
||||
Get model metadata.
|
||||
|
||||
**Returns:** `ModelMetadata` or `None`
|
||||
|
||||
#### `get_model_info(model_id)`
|
||||
Get detailed model information.
|
||||
|
||||
**Returns:** Dict with engine references, context pool size, shared model IDs, etc.
|
||||
|
||||
#### `get_stats()`
|
||||
Get repository statistics.
|
||||
|
||||
**Returns:** Dict with total models, unique engines, contexts, memory efficiency.
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Set Appropriate Context Pool Size
|
||||
|
||||
```python
|
||||
# For 10 cameras with same model, 4 contexts is usually enough
|
||||
repo = TensorRTModelRepository(default_num_contexts=4)
|
||||
|
||||
# For high concurrency, increase pool size
|
||||
repo = TensorRTModelRepository(default_num_contexts=8)
|
||||
```
|
||||
|
||||
**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.
|
||||
|
||||
### 2. Always Use GPU Tensors
|
||||
|
||||
```python
|
||||
# ✅ Good: Input on GPU
|
||||
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||
outputs = repo.infer(model_id, {"images": input_gpu})
|
||||
|
||||
# ❌ Bad: Input on CPU (will cause error)
|
||||
input_cpu = torch.rand(1, 3, 640, 640)
|
||||
outputs = repo.infer(model_id, {"images": input_cpu}) # ValueError!
|
||||
```
|
||||
|
||||
### 3. Handle Timeout Gracefully
|
||||
|
||||
```python
|
||||
try:
|
||||
outputs = repo.infer(
|
||||
model_id="camera_1",
|
||||
inputs=inputs,
|
||||
timeout=5.0
|
||||
)
|
||||
except RuntimeError as e:
|
||||
# All contexts busy, increase pool size or add backpressure
|
||||
print(f"Inference timeout: {e}")
|
||||
```
|
||||
|
||||
### 4. Use Same File for Deduplication
|
||||
|
||||
```python
|
||||
# ✅ Good: Same file path → deduplication
|
||||
repo.load_model("cam1", "/models/yolo.trt")
|
||||
repo.load_model("cam2", "/models/yolo.trt") # Shares engine!
|
||||
|
||||
# ❌ Bad: Different paths (even if same content) → no deduplication
|
||||
repo.load_model("cam1", "/models/yolo.trt")
|
||||
repo.load_model("cam2", "/models/yolo_copy.trt") # Separate engine
|
||||
```
|
||||
|
||||
## TensorRT Best Practices Implemented
|
||||
|
||||
Based on NVIDIA documentation and web search findings:
|
||||
|
||||
1. **Separate IExecutionContext per concurrent stream** ✅
|
||||
- Each context has its own CUDA stream
|
||||
- Contexts never shared across threads simultaneously
|
||||
|
||||
2. **Mutex-based context management** ✅
|
||||
- Queue-based borrowing with locks
|
||||
- Thread-safe acquire/release pattern
|
||||
|
||||
3. **GPU memory reuse** ✅
|
||||
- Engines shared by file hash
|
||||
- Contexts pooled and reused
|
||||
|
||||
4. **Zero-copy operations** ✅
|
||||
- All data stays in VRAM
|
||||
- DLPack integration with PyTorch
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "No execution context available within timeout"
|
||||
|
||||
**Cause:** All contexts busy with concurrent inferences.
|
||||
|
||||
**Solutions:**
|
||||
1. Increase context pool size:
|
||||
```python
|
||||
repo.load_model(model_id, file_path, num_contexts=8)
|
||||
```
|
||||
2. Increase timeout:
|
||||
```python
|
||||
outputs = repo.infer(model_id, inputs, timeout=30.0)
|
||||
```
|
||||
3. Add backpressure/throttling to limit concurrent requests
|
||||
|
||||
### Out of Memory (OOM)
|
||||
|
||||
**Cause:** Too many unique engines or large context pools.
|
||||
|
||||
**Solutions:**
|
||||
1. Ensure deduplication working (same file paths)
|
||||
2. Reduce context pool sizes
|
||||
3. Use smaller models or quantization (INT8/FP16)
|
||||
|
||||
### Import Error: "tensorrt could not be resolved"
|
||||
|
||||
**Solution:** Install TensorRT:
|
||||
```bash
|
||||
pip install tensorrt
|
||||
# Or use NVIDIA's wheel for your CUDA version
|
||||
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Batch Processing:** Process multiple frames before synchronizing
|
||||
```python
|
||||
outputs = repo.infer(model_id, inputs, synchronize=False)
|
||||
# ... more inferences ...
|
||||
torch.cuda.synchronize() # Sync once at end
|
||||
```
|
||||
|
||||
2. **Async Inference:** Don't synchronize if not needed immediately
|
||||
```python
|
||||
outputs = repo.infer(model_id, inputs, synchronize=False)
|
||||
# GPU continues working, CPU continues
|
||||
# Synchronize later when you need results
|
||||
```
|
||||
|
||||
3. **Monitor Context Utilization:**
|
||||
```python
|
||||
stats = repo.get_stats()
|
||||
print(f"Contexts: {stats['total_contexts']}")
|
||||
|
||||
# If timeouts occur frequently, increase pool size
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
Part of python-rtsp-worker project.
|
||||
14
services/__init__.py
Normal file
@@ -0,0 +1,14 @@
"""
Services package for RTSP stream processing with GPU acceleration.
"""

from .stream_decoder import StreamDecoderFactory, StreamDecoder, ConnectionStatus
from .jpeg_encoder import JPEGEncoderFactory, encode_frame_to_jpeg

__all__ = [
    'StreamDecoderFactory',
    'StreamDecoder',
    'ConnectionStatus',
    'JPEGEncoderFactory',
    'encode_frame_to_jpeg',
]
91
services/jpeg_encoder.py
Normal file
@@ -0,0 +1,91 @@
"""
|
||||
JPEG Encoder wrapper for GPU-accelerated JPEG encoding using nvImageCodec/nvJPEG.
|
||||
Provides a shared encoder instance that can be used across multiple streams.
|
||||
"""
|
||||
|
||||
from typing import Optional
|
||||
import torch
|
||||
import nvidia.nvimgcodec as nvimgcodec
|
||||
|
||||
|
||||
class JPEGEncoderFactory:
|
||||
"""
|
||||
Factory for creating and managing a shared JPEG encoder instance.
|
||||
Thread-safe singleton pattern for efficient resource sharing.
|
||||
"""
|
||||
|
||||
_instance = None
|
||||
_encoder = None
|
||||
|
||||
def __new__(cls):
|
||||
if cls._instance is None:
|
||||
cls._instance = super(JPEGEncoderFactory, cls).__new__(cls)
|
||||
cls._encoder = nvimgcodec.Encoder()
|
||||
print("JPEGEncoderFactory initialized with shared nvJPEG encoder")
|
||||
return cls._instance
|
||||
|
||||
@classmethod
|
||||
def get_encoder(cls):
|
||||
"""Get the shared JPEG encoder instance"""
|
||||
if cls._encoder is None:
|
||||
cls() # Initialize if not already done
|
||||
return cls._encoder
|
||||
|
||||
|
||||
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
|
||||
"""
|
||||
Encode an RGB frame to JPEG on GPU and return JPEG bytes.
|
||||
|
||||
This function:
|
||||
1. Takes RGB frame from GPU (stays on GPU during encoding)
|
||||
2. Converts PyTorch tensor to nvImageCodec image via as_image()
|
||||
3. Encodes to JPEG using nvJPEG (GPU operation)
|
||||
4. Transfers only JPEG bytes to CPU
|
||||
5. Returns bytes for saving to disk
|
||||
|
||||
Args:
|
||||
rgb_frame: RGB tensor on GPU, shape (3, H, W) or (H, W, 3), dtype uint8
|
||||
quality: JPEG quality (0-100, default 95)
|
||||
|
||||
Returns:
|
||||
JPEG encoded bytes or None if encoding fails
|
||||
"""
|
||||
if rgb_frame is None:
|
||||
return None
|
||||
|
||||
try:
|
||||
# Ensure we have (H, W, C) format and contiguous memory
|
||||
if rgb_frame.dim() == 3:
|
||||
if rgb_frame.shape[0] == 3:
|
||||
# Convert from (C, H, W) to (H, W, C)
|
||||
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
|
||||
else:
|
||||
# Already (H, W, C)
|
||||
rgb_hwc = rgb_frame.contiguous()
|
||||
else:
|
||||
raise ValueError(f"Expected 3D tensor, got shape {rgb_frame.shape}")
|
||||
|
||||
# Get shared encoder
|
||||
encoder = JPEGEncoderFactory.get_encoder()
|
||||
|
||||
# Create encode parameters with quality
|
||||
# Quality is set via quality_value (0-100 scale)
|
||||
jpeg_params = nvimgcodec.JpegEncodeParams(optimized_huffman=True)
|
||||
encode_params = nvimgcodec.EncodeParams(
|
||||
quality_value=float(quality),
|
||||
jpeg_encode_params=jpeg_params
|
||||
)
|
||||
|
||||
# Convert PyTorch GPU tensor to nvImageCodec image using __cuda_array_interface__
|
||||
# This is zero-copy - nvimgcodec reads directly from GPU memory
|
||||
nv_image = nvimgcodec.as_image(rgb_hwc)
|
||||
|
||||
# Encode to JPEG on GPU
|
||||
# The encoding happens on GPU, only compressed JPEG bytes are transferred to CPU
|
||||
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
|
||||
|
||||
return bytes(jpeg_data)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error encoding frame to JPEG: {e}")
|
||||
return None
|
||||
631
services/model_repository.py
Normal file
@@ -0,0 +1,631 @@
import threading
|
||||
import hashlib
|
||||
from typing import Optional, Dict, Any, List, Tuple
|
||||
from pathlib import Path
|
||||
from queue import Empty, Queue
|
||||
import torch
|
||||
import tensorrt as trt
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class ModelMetadata:
|
||||
"""Metadata for a loaded TensorRT model"""
|
||||
file_path: str
|
||||
file_hash: str
|
||||
input_shapes: Dict[str, Tuple[int, ...]]
|
||||
output_shapes: Dict[str, Tuple[int, ...]]
|
||||
input_names: List[str]
|
||||
output_names: List[str]
|
||||
input_dtypes: Dict[str, torch.dtype]
|
||||
output_dtypes: Dict[str, torch.dtype]
|
||||
|
||||
|
||||
class ExecutionContext:
|
||||
"""
|
||||
Wrapper for TensorRT execution context with CUDA stream.
|
||||
Used in context pool for load balancing.
|
||||
"""
|
||||
def __init__(self, context: trt.IExecutionContext, stream: torch.cuda.Stream,
|
||||
context_id: int, device: torch.device):
|
||||
self.context = context
|
||||
self.stream = stream
|
||||
self.context_id = context_id
|
||||
self.device = device
|
||||
self.in_use = False
|
||||
self.lock = threading.Lock()
|
||||
|
||||
def __repr__(self):
|
||||
return f"ExecutionContext(id={self.context_id}, in_use={self.in_use})"
|
||||
|
||||
|
||||
class SharedEngine:
|
||||
"""
|
||||
Shared TensorRT engine with context pool for load balancing.
|
||||
|
||||
Architecture:
|
||||
- One engine shared across all model_ids with same file hash
|
||||
- Pool of N execution contexts for concurrent inference
|
||||
- Contexts are borrowed/returned using mutex locks
|
||||
- Load balancing: contexts distributed across requests
|
||||
"""
|
||||
def __init__(self, engine: trt.ICudaEngine, file_hash: str, file_path: str,
|
||||
num_contexts: int, device: torch.device, metadata: ModelMetadata):
|
||||
self.engine = engine
|
||||
self.file_hash = file_hash
|
||||
self.file_path = file_path
|
||||
self.metadata = metadata
|
||||
self.device = device
|
||||
self.num_contexts = num_contexts
|
||||
|
||||
# Create context pool
|
||||
self.context_pool: List[ExecutionContext] = []
|
||||
self.available_contexts: Queue[ExecutionContext] = Queue()
|
||||
|
||||
for i in range(num_contexts):
|
||||
ctx = engine.create_execution_context()
|
||||
if ctx is None:
|
||||
raise RuntimeError(f"Failed to create execution context {i}")
|
||||
|
||||
stream = torch.cuda.Stream(device=device)
|
||||
exec_ctx = ExecutionContext(ctx, stream, i, device)
|
||||
self.context_pool.append(exec_ctx)
|
||||
self.available_contexts.put(exec_ctx)
|
||||
|
||||
# Model IDs referencing this engine
|
||||
self.model_ids: set = set()
|
||||
self.lock = threading.Lock()
|
||||
|
||||
print(f"Created context pool with {num_contexts} contexts for engine {file_hash[:8]}...")
|
||||
|
||||
def acquire_context(self, timeout: Optional[float] = None) -> Optional[ExecutionContext]:
|
||||
"""
|
||||
Acquire an available execution context from the pool.
|
||||
Blocks if all contexts are in use.
|
||||
|
||||
Args:
|
||||
timeout: Max time to wait for context (None = wait forever)
|
||||
|
||||
Returns:
|
||||
ExecutionContext or None if timeout
|
||||
"""
|
||||
try:
|
||||
exec_ctx = self.available_contexts.get(timeout=timeout)
|
||||
with exec_ctx.lock:
|
||||
exec_ctx.in_use = True
|
||||
return exec_ctx
|
||||
except Empty:
|
||||
return None
|
||||
|
||||
def release_context(self, exec_ctx: ExecutionContext):
|
||||
"""
|
||||
Return a context to the pool.
|
||||
|
||||
Args:
|
||||
exec_ctx: Context to release
|
||||
"""
|
||||
with exec_ctx.lock:
|
||||
exec_ctx.in_use = False
|
||||
self.available_contexts.put(exec_ctx)
|
||||
|
||||
def add_model_id(self, model_id: str):
|
||||
"""Add a model_id reference to this engine"""
|
||||
with self.lock:
|
||||
self.model_ids.add(model_id)
|
||||
|
||||
def remove_model_id(self, model_id: str) -> int:
|
||||
"""
|
||||
Remove a model_id reference from this engine.
|
||||
Returns the number of remaining references.
|
||||
"""
|
||||
with self.lock:
|
||||
self.model_ids.discard(model_id)
|
||||
return len(self.model_ids)
|
||||
|
||||
def get_reference_count(self) -> int:
|
||||
"""Get number of model_ids using this engine"""
|
||||
with self.lock:
|
||||
return len(self.model_ids)
|
||||
|
||||
def cleanup(self):
|
||||
"""Cleanup all contexts"""
|
||||
for exec_ctx in self.context_pool:
|
||||
del exec_ctx.context
|
||||
self.context_pool.clear()
|
||||
del self.engine
|
||||
|
||||
|
||||
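# Usage sketch (illustration; `shared_engine` stands for an already-constructed
# SharedEngine instance): acquire_context() and release_context() should always
# be paired, e.g. via try/finally, so a context is returned to the pool even if
# inference raises. This is the pattern TensorRTModelRepository.infer() follows
# below.
#
#   exec_ctx = shared_engine.acquire_context(timeout=5.0)
#   if exec_ctx is None:
#       raise RuntimeError("No free execution context within timeout")
#   try:
#       ...  # set tensor addresses and launch exec_ctx.context on exec_ctx.stream
#   finally:
#       shared_engine.release_context(exec_ctx)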
class TensorRTModelRepository:
|
||||
"""
|
||||
Thread-safe repository for TensorRT models with context pooling and deduplication.
|
||||
|
||||
Architecture:
|
||||
- Deduplication: Multiple model_ids with same file → share one engine
|
||||
- Context Pool: Each unique engine has N execution contexts (configurable)
|
||||
- Load Balancing: Contexts are borrowed/returned via mutex queue
|
||||
- Scalability: Adding 100 cameras with same model = 1 engine + N contexts (not 100 contexts!)
|
||||
|
||||
Best Practices:
|
||||
- GPU-to-GPU: All inputs/outputs stay in VRAM (zero CPU transfers)
|
||||
- Thread Safety: Mutex-based context borrowing (TensorRT best practice)
|
||||
- Memory Efficient: Deduplicate by file hash, share engine across model_ids
|
||||
- Concurrent: N contexts allow N parallel inferences per unique model
|
||||
|
||||
Example:
|
||||
# 100 cameras, same model file
|
||||
for i in range(100):
|
||||
repo.load_model(f"camera_{i}", "yolov8.trt")
|
||||
# Result: 1 engine in VRAM, N contexts (e.g., 4), not 100 contexts!
|
||||
"""
|
||||
|
||||
def __init__(self, gpu_id: int = 0, default_num_contexts: int = 4):
|
||||
"""
|
||||
Initialize the model repository.
|
||||
|
||||
Args:
|
||||
gpu_id: GPU device ID to use
|
||||
default_num_contexts: Default number of execution contexts per unique engine
|
||||
"""
|
||||
self.gpu_id = gpu_id
|
||||
self.device = torch.device(f'cuda:{gpu_id}')
|
||||
self.default_num_contexts = default_num_contexts
|
||||
|
||||
# Model ID to engine mapping: model_id -> file_hash
|
||||
self._model_to_hash: Dict[str, str] = {}
|
||||
|
||||
# Shared engines with context pools: file_hash -> SharedEngine
|
||||
self._shared_engines: Dict[str, SharedEngine] = {}
|
||||
|
||||
# Locks for thread safety
|
||||
self._repo_lock = threading.RLock()
|
||||
|
||||
# TensorRT logger
|
||||
self.trt_logger = trt.Logger(trt.Logger.WARNING)
|
||||
|
||||
print(f"TensorRT Model Repository initialized on GPU {gpu_id}")
|
||||
print(f"Default context pool size: {default_num_contexts} contexts per unique model")
|
||||
|
||||
@staticmethod
|
||||
def compute_file_hash(file_path: str) -> str:
|
||||
"""
|
||||
Compute SHA256 hash of a file for deduplication.
|
||||
|
||||
Args:
|
||||
file_path: Path to the file
|
||||
|
||||
Returns:
|
||||
Hexadecimal hash string
|
||||
"""
|
||||
sha256_hash = hashlib.sha256()
|
||||
with open(file_path, "rb") as f:
|
||||
# Read in chunks to handle large files efficiently
|
||||
for byte_block in iter(lambda: f.read(65536), b""):
|
||||
sha256_hash.update(byte_block)
|
||||
return sha256_hash.hexdigest()
|
||||
|
||||
def _load_engine(self, file_path: str) -> trt.ICudaEngine:
|
||||
"""
|
||||
Load TensorRT engine from file.
|
||||
|
||||
Args:
|
||||
file_path: Path to .trt or .engine file
|
||||
|
||||
Returns:
|
||||
TensorRT engine
|
||||
"""
|
||||
runtime = trt.Runtime(self.trt_logger)
|
||||
|
||||
with open(file_path, 'rb') as f:
|
||||
engine_data = f.read()
|
||||
|
||||
engine = runtime.deserialize_cuda_engine(engine_data)
|
||||
if engine is None:
|
||||
raise RuntimeError(f"Failed to load TensorRT engine from {file_path}")
|
||||
|
||||
return engine
|
||||
|
||||
def _extract_metadata(self, engine: trt.ICudaEngine,
|
||||
file_path: str, file_hash: str) -> ModelMetadata:
|
||||
"""
|
||||
Extract metadata from TensorRT engine.
|
||||
|
||||
Args:
|
||||
engine: TensorRT engine
|
||||
file_path: Path to model file
|
||||
file_hash: SHA256 hash of model file
|
||||
|
||||
Returns:
|
||||
ModelMetadata object
|
||||
"""
|
||||
input_shapes = {}
|
||||
output_shapes = {}
|
||||
input_names = []
|
||||
output_names = []
|
||||
input_dtypes = {}
|
||||
output_dtypes = {}
|
||||
|
||||
# TensorRT dtype to PyTorch dtype mapping
|
||||
trt_to_torch_dtype = {
|
||||
trt.DataType.FLOAT: torch.float32,
|
||||
trt.DataType.HALF: torch.float16,
|
||||
trt.DataType.INT8: torch.int8,
|
||||
trt.DataType.INT32: torch.int32,
|
||||
trt.DataType.BOOL: torch.bool,
|
||||
}
|
||||
|
||||
# Iterate through all tensors (inputs and outputs) - TensorRT 10.x API
|
||||
for i in range(engine.num_io_tensors):
|
||||
name = engine.get_tensor_name(i)
|
||||
shape = tuple(engine.get_tensor_shape(name))
|
||||
dtype = trt_to_torch_dtype.get(engine.get_tensor_dtype(name), torch.float32)
|
||||
mode = engine.get_tensor_mode(name)
|
||||
|
||||
if mode == trt.TensorIOMode.INPUT:
|
||||
input_names.append(name)
|
||||
input_shapes[name] = shape
|
||||
input_dtypes[name] = dtype
|
||||
else:
|
||||
output_names.append(name)
|
||||
output_shapes[name] = shape
|
||||
output_dtypes[name] = dtype
|
||||
|
||||
return ModelMetadata(
|
||||
file_path=file_path,
|
||||
file_hash=file_hash,
|
||||
input_shapes=input_shapes,
|
||||
output_shapes=output_shapes,
|
||||
input_names=input_names,
|
||||
output_names=output_names,
|
||||
input_dtypes=input_dtypes,
|
||||
output_dtypes=output_dtypes
|
||||
)
|
||||
|
||||
def load_model(self, model_id: str, file_path: str,
|
||||
num_contexts: Optional[int] = None,
|
||||
force_reload: bool = False) -> ModelMetadata:
|
||||
"""
|
||||
Load a TensorRT model with the given ID.
|
||||
|
||||
Deduplication: If a model with the same file hash is already loaded, the model_id
|
||||
is simply mapped to the existing SharedEngine (no new engine or contexts created).
|
||||
|
||||
Args:
|
||||
model_id: User-defined identifier for this model (e.g., "camera_1")
|
||||
file_path: Path to TensorRT engine file (.trt or .engine)
|
||||
num_contexts: Number of execution contexts in pool (None = use default)
|
||||
force_reload: If True, reload even if model_id exists
|
||||
|
||||
Returns:
|
||||
ModelMetadata for the loaded model
|
||||
|
||||
Raises:
|
||||
FileNotFoundError: If model file doesn't exist
|
||||
RuntimeError: If engine loading fails
|
||||
ValueError: If model_id already exists and force_reload is False
|
||||
"""
|
||||
file_path = str(Path(file_path).resolve())
|
||||
|
||||
if not Path(file_path).exists():
|
||||
raise FileNotFoundError(f"Model file not found: {file_path}")
|
||||
|
||||
if num_contexts is None:
|
||||
num_contexts = self.default_num_contexts
|
||||
|
||||
with self._repo_lock:
|
||||
# Check if model_id already exists
|
||||
if model_id in self._model_to_hash and not force_reload:
|
||||
raise ValueError(
|
||||
f"Model ID '{model_id}' already exists. "
|
||||
f"Use force_reload=True to reload or choose a different ID."
|
||||
)
|
||||
|
||||
# Unload existing model if force_reload
|
||||
if model_id in self._model_to_hash and force_reload:
|
||||
self.unload_model(model_id)
|
||||
|
||||
# Compute file hash for deduplication
|
||||
print(f"Computing hash for {file_path}...")
|
||||
file_hash = self.compute_file_hash(file_path)
|
||||
print(f"File hash: {file_hash[:16]}...")
|
||||
|
||||
# Check if this file is already loaded (deduplication)
|
||||
if file_hash in self._shared_engines:
|
||||
shared_engine = self._shared_engines[file_hash]
|
||||
print(f"Engine already loaded (hash match), reusing engine and context pool...")
|
||||
print(f" Existing model_ids using this engine: {shared_engine.model_ids}")
|
||||
else:
|
||||
# Load new engine
|
||||
print(f"Loading TensorRT engine from {file_path}...")
|
||||
engine = self._load_engine(file_path)
|
||||
|
||||
# Extract metadata
|
||||
metadata = self._extract_metadata(engine, file_path, file_hash)
|
||||
|
||||
# Create shared engine with context pool
|
||||
shared_engine = SharedEngine(
|
||||
engine=engine,
|
||||
file_hash=file_hash,
|
||||
file_path=file_path,
|
||||
num_contexts=num_contexts,
|
||||
device=self.device,
|
||||
metadata=metadata
|
||||
)
|
||||
self._shared_engines[file_hash] = shared_engine
|
||||
|
||||
# Add this model_id to the shared engine
|
||||
shared_engine.add_model_id(model_id)
|
||||
|
||||
# Map model_id to file_hash
|
||||
self._model_to_hash[model_id] = file_hash
|
||||
|
||||
print(f"Model '{model_id}' loaded successfully")
|
||||
print(f" Inputs: {shared_engine.metadata.input_names}")
|
||||
for name in shared_engine.metadata.input_names:
|
||||
print(f" {name}: {shared_engine.metadata.input_shapes[name]} ({shared_engine.metadata.input_dtypes[name]})")
|
||||
print(f" Outputs: {shared_engine.metadata.output_names}")
|
||||
for name in shared_engine.metadata.output_names:
|
||||
print(f" {name}: {shared_engine.metadata.output_shapes[name]} ({shared_engine.metadata.output_dtypes[name]})")
|
||||
print(f" Context pool size: {num_contexts}")
|
||||
print(f" Model IDs sharing this engine: {shared_engine.get_reference_count()}")
|
||||
print(f" Unique engines in VRAM: {len(self._shared_engines)}")
|
||||
|
||||
return shared_engine.metadata
|
||||
|
||||
def infer(self, model_id: str, inputs: Dict[str, torch.Tensor],
|
||||
synchronize: bool = True, timeout: Optional[float] = 5.0) -> Dict[str, torch.Tensor]:
|
||||
"""
|
||||
Run GPU-to-GPU inference with the specified model using context pooling.
|
||||
|
||||
All inputs must be CUDA tensors and outputs will be CUDA tensors (stays in VRAM).
|
||||
Thread-safe: Borrows an execution context from the pool with mutex locking.
|
||||
|
||||
Args:
|
||||
model_id: Model identifier
|
||||
inputs: Dictionary mapping input names to CUDA tensors
|
||||
synchronize: If True, wait for inference to complete. If False, async execution.
|
||||
timeout: Max time to wait for available context (seconds)
|
||||
|
||||
Returns:
|
||||
Dictionary mapping output names to CUDA tensors (in VRAM)
|
||||
|
||||
Raises:
|
||||
KeyError: If model_id not found
|
||||
ValueError: If inputs don't match expected shapes or are not on GPU
|
||||
RuntimeError: If no context available within timeout
|
||||
"""
|
||||
# Get shared engine
|
||||
if model_id not in self._model_to_hash:
|
||||
raise KeyError(f"Model '{model_id}' not found. Available: {list(self._model_to_hash.keys())}")
|
||||
|
||||
file_hash = self._model_to_hash[model_id]
|
||||
shared_engine = self._shared_engines[file_hash]
|
||||
metadata = shared_engine.metadata
|
||||
|
||||
# Validate inputs
|
||||
for name in metadata.input_names:
|
||||
if name not in inputs:
|
||||
raise ValueError(f"Missing required input: {name}")
|
||||
|
||||
tensor = inputs[name]
|
||||
if not tensor.is_cuda:
|
||||
raise ValueError(f"Input '{name}' must be a CUDA tensor (on GPU)")
|
||||
|
||||
# Check device
|
||||
if tensor.device != self.device:
|
||||
print(f"Warning: Input '{name}' on {tensor.device}, moving to {self.device}")
|
||||
inputs[name] = tensor.to(self.device)
|
||||
|
||||
# Acquire context from pool (mutex-based)
|
||||
exec_ctx = shared_engine.acquire_context(timeout=timeout)
|
||||
if exec_ctx is None:
|
||||
raise RuntimeError(
|
||||
f"No execution context available for model '{model_id}' within {timeout}s. "
|
||||
f"All {shared_engine.num_contexts} contexts are busy."
|
||||
)
|
||||
|
||||
try:
|
||||
# Prepare output tensors
|
||||
outputs = {}
|
||||
|
||||
# Set input tensors - TensorRT 10.x API
|
||||
for name in metadata.input_names:
|
||||
input_tensor = inputs[name].contiguous()
|
||||
exec_ctx.context.set_tensor_address(name, input_tensor.data_ptr())
|
||||
|
||||
# Allocate and set output tensors
|
||||
for name in metadata.output_names:
|
||||
output_shape = metadata.output_shapes[name]
|
||||
output_dtype = metadata.output_dtypes[name]
|
||||
|
||||
output_tensor = torch.empty(
|
||||
output_shape,
|
||||
dtype=output_dtype,
|
||||
device=self.device
|
||||
)
|
||||
|
||||
outputs[name] = output_tensor
|
||||
exec_ctx.context.set_tensor_address(name, output_tensor.data_ptr())
|
||||
|
||||
# Execute inference on context's stream - TensorRT 10.x API
|
||||
with torch.cuda.stream(exec_ctx.stream):
|
||||
success = exec_ctx.context.execute_async_v3(
|
||||
stream_handle=exec_ctx.stream.cuda_stream
|
||||
)
|
||||
|
||||
if not success:
|
||||
raise RuntimeError(f"Inference failed for model '{model_id}'")
|
||||
|
||||
# Synchronize if requested
|
||||
if synchronize:
|
||||
exec_ctx.stream.synchronize()
|
||||
|
||||
return outputs
|
||||
|
||||
finally:
|
||||
# Always release context back to pool
|
||||
shared_engine.release_context(exec_ctx)
|
||||
|
||||
def infer_batch(self, model_id: str, batch_inputs: List[Dict[str, torch.Tensor]],
|
||||
synchronize: bool = True) -> List[Dict[str, torch.Tensor]]:
|
||||
"""
|
||||
Run inference on multiple inputs.
|
||||
Contexts are borrowed/returned for each input, enabling parallel processing.
|
||||
|
||||
Args:
|
||||
model_id: Model identifier
|
||||
batch_inputs: List of input dictionaries
|
||||
synchronize: If True, wait for all inferences to complete
|
||||
|
||||
Returns:
|
||||
List of output dictionaries
|
||||
"""
|
||||
results = []
|
||||
for inputs in batch_inputs:
|
||||
outputs = self.infer(model_id, inputs, synchronize=synchronize)
|
||||
results.append(outputs)
|
||||
|
||||
return results
|
||||
|
||||
def unload_model(self, model_id: str):
|
||||
"""
|
||||
Unload a model from the repository.
|
||||
|
||||
Removes the model_id reference from the shared engine. If this was the last
|
||||
reference, the engine and all its contexts will be fully unloaded from VRAM.
|
||||
|
||||
Args:
|
||||
model_id: Model identifier to unload
|
||||
"""
|
||||
with self._repo_lock:
|
||||
if model_id not in self._model_to_hash:
|
||||
print(f"Warning: Model '{model_id}' not found")
|
||||
return
|
||||
|
||||
file_hash = self._model_to_hash[model_id]
|
||||
|
||||
# Remove model_id from shared engine
|
||||
if file_hash in self._shared_engines:
|
||||
shared_engine = self._shared_engines[file_hash]
|
||||
remaining_refs = shared_engine.remove_model_id(model_id)
|
||||
|
||||
# If no more references, cleanup engine and contexts
|
||||
if remaining_refs == 0:
|
||||
shared_engine.cleanup()
|
||||
del self._shared_engines[file_hash]
|
||||
print(f"Model '{model_id}' unloaded, engine removed from VRAM (0 references)")
|
||||
else:
|
||||
print(f"Model '{model_id}' unloaded, engine kept in VRAM ({remaining_refs} references)")
|
||||
|
||||
# Remove from model_id mapping
|
||||
del self._model_to_hash[model_id]
|
||||
|
||||
def get_metadata(self, model_id: str) -> Optional[ModelMetadata]:
|
||||
"""
|
||||
Get metadata for a loaded model.
|
||||
|
||||
Args:
|
||||
model_id: Model identifier
|
||||
|
||||
Returns:
|
||||
ModelMetadata or None if not found
|
||||
"""
|
||||
if model_id not in self._model_to_hash:
|
||||
return None
|
||||
|
||||
file_hash = self._model_to_hash[model_id]
|
||||
if file_hash not in self._shared_engines:
|
||||
return None
|
||||
|
||||
return self._shared_engines[file_hash].metadata
|
||||
|
||||
def list_models(self) -> Dict[str, ModelMetadata]:
|
||||
"""
|
||||
List all loaded models.
|
||||
|
||||
Returns:
|
||||
Dictionary mapping model_id to ModelMetadata
|
||||
"""
|
||||
with self._repo_lock:
|
||||
result = {}
|
||||
for model_id, file_hash in self._model_to_hash.items():
|
||||
if file_hash in self._shared_engines:
|
||||
result[model_id] = self._shared_engines[file_hash].metadata
|
||||
return result
|
||||
|
||||
def get_model_info(self, model_id: str) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Get detailed information about a loaded model.
|
||||
|
||||
Args:
|
||||
model_id: Model identifier
|
||||
|
||||
Returns:
|
||||
Dictionary with model information or None if not found
|
||||
"""
|
||||
if model_id not in self._model_to_hash:
|
||||
return None
|
||||
|
||||
file_hash = self._model_to_hash[model_id]
|
||||
if file_hash not in self._shared_engines:
|
||||
return None
|
||||
|
||||
shared_engine = self._shared_engines[file_hash]
|
||||
metadata = shared_engine.metadata
|
||||
|
||||
return {
|
||||
'model_id': model_id,
|
||||
'file_path': metadata.file_path,
|
||||
'file_hash': metadata.file_hash[:16] + '...',
|
||||
'engine_references': shared_engine.get_reference_count(),
|
||||
'context_pool_size': shared_engine.num_contexts,
|
||||
'shared_with_model_ids': list(shared_engine.model_ids),
|
||||
'inputs': {
|
||||
name: {
|
||||
'shape': metadata.input_shapes[name],
|
||||
'dtype': str(metadata.input_dtypes[name])
|
||||
}
|
||||
for name in metadata.input_names
|
||||
},
|
||||
'outputs': {
|
||||
name: {
|
||||
'shape': metadata.output_shapes[name],
|
||||
'dtype': str(metadata.output_dtypes[name])
|
||||
}
|
||||
for name in metadata.output_names
|
||||
}
|
||||
}
|
||||
|
||||
def get_stats(self) -> Dict[str, Any]:
|
||||
"""
|
||||
Get repository statistics.
|
||||
|
||||
Returns:
|
||||
Dictionary with stats about loaded models and memory usage
|
||||
"""
|
||||
with self._repo_lock:
|
||||
total_contexts = sum(
|
||||
engine.num_contexts
|
||||
for engine in self._shared_engines.values()
|
||||
)
|
||||
|
||||
return {
|
||||
'total_model_ids': len(self._model_to_hash),
|
||||
'unique_engines': len(self._shared_engines),
|
||||
'total_contexts': total_contexts,
|
||||
'memory_efficiency': f"{len(self._model_to_hash)} model IDs using only {len(self._shared_engines)} engines",
|
||||
'gpu_id': self.gpu_id,
|
||||
'models': list(self._model_to_hash.keys())
|
||||
}
|
||||
|
||||
def __repr__(self):
|
||||
with self._repo_lock:
|
||||
return (f"TensorRTModelRepository(gpu={self.gpu_id}, "
|
||||
f"model_ids={len(self._model_to_hash)}, "
|
||||
f"unique_engines={len(self._shared_engines)})")
|
||||
|
||||
def __del__(self):
|
||||
"""Cleanup all models on deletion"""
|
||||
with self._repo_lock:
|
||||
model_ids = list(self._model_to_hash.keys())
|
||||
for model_id in model_ids:
|
||||
self.unload_model(model_id)
|
||||
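# Usage sketch (illustration; the model path and GPU availability are
# assumptions): the metadata returned by load_model() can drive input
# allocation, so callers do not need to hard-code shapes or dtypes.
#
#   repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
#   meta = repo.load_model("demo", "models/example.trt")   # hypothetical path
#   name = meta.input_names[0]
#   dummy = torch.zeros(meta.input_shapes[name],
#                       dtype=meta.input_dtypes[name], device="cuda:0")
#   outputs = repo.infer("demo", {name: dummy})
#   repo.unload_model("demo")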
481
services/stream_decoder.py
Normal file
|
|
@ -0,0 +1,481 @@
|
|||
import threading
|
||||
from typing import Optional
|
||||
from collections import deque
|
||||
from enum import Enum
|
||||
import torch
|
||||
import PyNvVideoCodec as nvc
|
||||
import av
|
||||
import numpy as np
|
||||
from cuda.bindings import driver as cuda_driver
|
||||
from .jpeg_encoder import encode_frame_to_jpeg
|
||||
|
||||
|
||||
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
|
||||
"""
|
||||
Convert NV12 format to RGB on GPU using PyTorch operations.
|
||||
|
||||
NV12 format:
|
||||
- Y plane: height x width (luminance)
|
||||
- UV plane: (height/2) x width (interleaved U and V, subsampled by 2)
|
||||
|
||||
Total tensor size: (height * 3/2) x width
|
||||
|
||||
Args:
|
||||
nv12_tensor: Input tensor in NV12 format, shape (H*3/2, W)
|
||||
height: Original frame height
|
||||
width: Original frame width
|
||||
|
||||
Returns:
|
||||
RGB tensor, shape (3, H, W) in range [0, 255]
|
||||
"""
|
||||
device = nv12_tensor.device
|
||||
|
||||
# Split Y and UV planes
|
||||
y_plane = nv12_tensor[:height, :].float() # (H, W)
|
||||
uv_plane = nv12_tensor[height:, :].float() # (H/2, W)
|
||||
|
||||
# Reshape UV plane to separate U and V channels
|
||||
# UV is interleaved: U0V0U1V1... we need to deinterleave
|
||||
uv_plane = uv_plane.reshape(height // 2, width // 2, 2) # (H/2, W/2, 2)
|
||||
u_plane = uv_plane[:, :, 0] # (H/2, W/2)
|
||||
v_plane = uv_plane[:, :, 1] # (H/2, W/2)
|
||||
|
||||
# Upsample U and V to full resolution using bilinear interpolation
|
||||
u_upsampled = torch.nn.functional.interpolate(
|
||||
u_plane.unsqueeze(0).unsqueeze(0), # (1, 1, H/2, W/2)
|
||||
size=(height, width),
|
||||
mode='bilinear',
|
||||
align_corners=False
|
||||
).squeeze(0).squeeze(0) # (H, W)
|
||||
|
||||
v_upsampled = torch.nn.functional.interpolate(
|
||||
v_plane.unsqueeze(0).unsqueeze(0), # (1, 1, H/2, W/2)
|
||||
size=(height, width),
|
||||
mode='bilinear',
|
||||
align_corners=False
|
||||
).squeeze(0).squeeze(0) # (H, W)
|
||||
|
||||
# YUV to RGB conversion using BT.601 standard
|
||||
# R = Y + 1.402 * (V - 128)
|
||||
# G = Y - 0.344136 * (U - 128) - 0.714136 * (V - 128)
|
||||
# B = Y + 1.772 * (U - 128)
|
||||
|
||||
y = y_plane
|
||||
u = u_upsampled - 128.0
|
||||
v = v_upsampled - 128.0
|
||||
|
||||
r = y + 1.402 * v
|
||||
g = y - 0.344136 * u - 0.714136 * v
|
||||
b = y + 1.772 * u
|
||||
|
||||
# Clamp to [0, 255] and convert to uint8
|
||||
r = torch.clamp(r, 0, 255).to(torch.uint8)
|
||||
g = torch.clamp(g, 0, 255).to(torch.uint8)
|
||||
b = torch.clamp(b, 0, 255).to(torch.uint8)
|
||||
|
||||
# Stack to (3, H, W)
|
||||
rgb = torch.stack([r, g, b], dim=0)
|
||||
|
||||
return rgb
|
||||
|
||||
|
||||
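# Sanity check sketch (illustration; assumes a CUDA device is available): a
# uniform NV12 frame with Y = 128 and U = V = 128 is neutral gray, so the
# BT.601 conversion above should return 128 for every RGB sample.
#
#   h, w = 720, 1280
#   gray_nv12 = torch.full((h * 3 // 2, w), 128, dtype=torch.uint8, device="cuda:0")
#   rgb = nv12_to_rgb_gpu(gray_nv12, h, w)
#   assert rgb.shape == (3, h, w) and bool((rgb == 128).all())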
class ConnectionStatus(Enum):
|
||||
DISCONNECTED = "disconnected"
|
||||
CONNECTING = "connecting"
|
||||
CONNECTED = "connected"
|
||||
ERROR = "error"
|
||||
RECONNECTING = "reconnecting"
|
||||
|
||||
|
||||
class StreamDecoderFactory:
|
||||
"""
|
||||
Factory for creating StreamDecoder instances with shared CUDA context.
|
||||
This minimizes VRAM overhead by sharing the CUDA context across all decoders.
|
||||
"""
|
||||
|
||||
_instance = None
|
||||
_lock = threading.Lock()
|
||||
|
||||
def __new__(cls, gpu_id: int = 0):
|
||||
if cls._instance is None:
|
||||
with cls._lock:
|
||||
if cls._instance is None:
|
||||
cls._instance = super(StreamDecoderFactory, cls).__new__(cls)
|
||||
cls._instance._initialized = False
|
||||
return cls._instance
|
||||
|
||||
def __init__(self, gpu_id: int = 0):
|
||||
if self._initialized:
|
||||
return
|
||||
|
||||
self.gpu_id = gpu_id
|
||||
|
||||
# Initialize CUDA and get device
|
||||
err, = cuda_driver.cuInit(0)
|
||||
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||
raise RuntimeError(f"Failed to initialize CUDA: {err}")
|
||||
|
||||
# Get CUDA device
|
||||
err, self.cuda_device = cuda_driver.cuDeviceGet(gpu_id)
|
||||
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||
raise RuntimeError(f"Failed to get CUDA device {gpu_id}: {err}")
|
||||
|
||||
# Retain primary context (shared across all decoders)
|
||||
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
|
||||
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||
raise RuntimeError(f"Failed to retain CUDA primary context: {err}")
|
||||
|
||||
self._initialized = True
|
||||
print(f"StreamDecoderFactory initialized with shared CUDA context on GPU {gpu_id}")
|
||||
|
||||
def create_decoder(self, rtsp_url: str, buffer_size: int = 30,
|
||||
codec: str = "h264") -> 'StreamDecoder':
|
||||
"""
|
||||
Create a new StreamDecoder instance with shared CUDA context.
|
||||
|
||||
Args:
|
||||
rtsp_url: RTSP stream URL
|
||||
buffer_size: Number of frames to buffer in VRAM
|
||||
codec: Video codec (h264, hevc, etc.)
|
||||
|
||||
Returns:
|
||||
StreamDecoder instance
|
||||
"""
|
||||
return StreamDecoder(
|
||||
rtsp_url=rtsp_url,
|
||||
cuda_context=self.cuda_context,
|
||||
gpu_id=self.gpu_id,
|
||||
buffer_size=buffer_size,
|
||||
codec=codec
|
||||
)
|
||||
|
||||
def __del__(self):
|
||||
"""Cleanup shared CUDA context on factory destruction"""
|
||||
if hasattr(self, 'cuda_device') and hasattr(self, 'gpu_id'):
|
||||
cuda_driver.cuDevicePrimaryCtxRelease(self.cuda_device)
|
||||
|
||||
|
||||
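# Note (illustration): because the factory is a process-wide singleton,
# constructing it again returns the same object and therefore the same shared
# primary CUDA context, which is what keeps per-stream VRAM overhead low.
#
#   f1 = StreamDecoderFactory(gpu_id=0)
#   f2 = StreamDecoderFactory(gpu_id=0)
#   assert f1 is f2  # same instance, same shared CUDA context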
class StreamDecoder:
|
||||
"""
|
||||
Decodes RTSP stream using NVDEC and maintains a ring buffer of frames in GPU VRAM.
|
||||
Thread-safe for concurrent read/write operations.
|
||||
"""
|
||||
|
||||
def __init__(self, rtsp_url: str, cuda_context, gpu_id: int,
|
||||
buffer_size: int = 30, codec: str = "h264"):
|
||||
"""
|
||||
Initialize StreamDecoder.
|
||||
|
||||
Args:
|
||||
rtsp_url: RTSP stream URL
|
||||
cuda_context: Shared CUDA context handle
|
||||
gpu_id: GPU device ID
|
||||
buffer_size: Number of frames to keep in ring buffer
|
||||
codec: Video codec type
|
||||
"""
|
||||
self.rtsp_url = rtsp_url
|
||||
self.cuda_context = cuda_context
|
||||
self.gpu_id = gpu_id
|
||||
self.buffer_size = buffer_size
|
||||
self.codec = codec
|
||||
|
||||
# Connection status
|
||||
self.status = ConnectionStatus.DISCONNECTED
|
||||
self._status_lock = threading.Lock()
|
||||
|
||||
# Frame buffer (ring buffer) - stores DecodedFrame objects backed by GPU memory
|
||||
self.frame_buffer = deque(maxlen=buffer_size)
|
||||
self._buffer_lock = threading.RLock()
|
||||
|
||||
# Decoder and container instances
|
||||
self.decoder = None
|
||||
self.container = None
|
||||
|
||||
# Decode thread
|
||||
self._decode_thread: Optional[threading.Thread] = None
|
||||
self._stop_flag = threading.Event()
|
||||
|
||||
# Frame metadata
|
||||
self.frame_width: Optional[int] = None
|
||||
self.frame_height: Optional[int] = None
|
||||
self.frame_count: int = 0
|
||||
|
||||
def start(self):
|
||||
"""Start the RTSP stream decoding in background thread"""
|
||||
if self._decode_thread is not None and self._decode_thread.is_alive():
|
||||
print(f"Decoder already running for {self.rtsp_url}")
|
||||
return
|
||||
|
||||
self._stop_flag.clear()
|
||||
self._decode_thread = threading.Thread(target=self._decode_loop, daemon=True)
|
||||
self._decode_thread.start()
|
||||
print(f"Started decoder thread for {self.rtsp_url}")
|
||||
|
||||
def stop(self):
|
||||
"""Stop the decoding thread and cleanup resources"""
|
||||
self._stop_flag.set()
|
||||
if self._decode_thread is not None:
|
||||
self._decode_thread.join(timeout=5.0)
|
||||
self._cleanup()
|
||||
print(f"Stopped decoder for {self.rtsp_url}")
|
||||
|
||||
def _set_status(self, status: ConnectionStatus):
|
||||
"""Thread-safe status update"""
|
||||
with self._status_lock:
|
||||
self.status = status
|
||||
|
||||
def get_status(self) -> ConnectionStatus:
|
||||
"""Get current connection status"""
|
||||
with self._status_lock:
|
||||
return self.status
|
||||
|
||||
def _init_rtsp_connection(self) -> bool:
|
||||
"""Initialize RTSP connection using PyAV + PyNvVideoCodec"""
|
||||
try:
|
||||
self._set_status(ConnectionStatus.CONNECTING)
|
||||
|
||||
# Open RTSP stream with PyAV
|
||||
options = {
|
||||
'rtsp_transport': 'tcp',
|
||||
'max_delay': '500000', # 500ms
|
||||
'rtsp_flags': 'prefer_tcp',
|
||||
'timeout': '5000000', # 5 seconds
|
||||
}
|
||||
|
||||
self.container = av.open(self.rtsp_url, options=options)
|
||||
|
||||
# Get video stream
|
||||
video_stream = self.container.streams.video[0]
|
||||
self.frame_width = video_stream.width
|
||||
self.frame_height = video_stream.height
|
||||
|
||||
print(f"RTSP connected: {self.frame_width}x{self.frame_height}")
|
||||
|
||||
# Map codec name to PyNvVideoCodec codec enum
|
||||
codec_map = {
|
||||
'h264': nvc.cudaVideoCodec.H264,
|
||||
'hevc': nvc.cudaVideoCodec.HEVC,
|
||||
'h265': nvc.cudaVideoCodec.HEVC,
|
||||
}
|
||||
|
||||
codec_id = codec_map.get(self.codec.lower(), nvc.cudaVideoCodec.H264)
|
||||
|
||||
# Initialize NVDEC decoder with shared CUDA context
|
||||
self.decoder = nvc.CreateDecoder(
|
||||
gpuid=self.gpu_id,
|
||||
codec=codec_id,
|
||||
cudacontext=self.cuda_context,
|
||||
usedevicememory=True
|
||||
)
|
||||
|
||||
self._set_status(ConnectionStatus.CONNECTED)
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
print(f"Failed to connect to RTSP stream {self.rtsp_url}: {e}")
|
||||
self._set_status(ConnectionStatus.ERROR)
|
||||
return False
|
||||
|
||||
def _decode_loop(self):
|
||||
"""Main decode loop running in background thread"""
|
||||
retry_count = 0
|
||||
max_retries = 5
|
||||
|
||||
while not self._stop_flag.is_set():
|
||||
# Initialize connection
|
||||
if not self._init_rtsp_connection():
|
||||
retry_count += 1
|
||||
if retry_count >= max_retries:
|
||||
print(f"Max retries reached for {self.rtsp_url}")
|
||||
self._set_status(ConnectionStatus.ERROR)
|
||||
break
|
||||
|
||||
self._set_status(ConnectionStatus.RECONNECTING)
|
||||
self._stop_flag.wait(timeout=2.0)
|
||||
continue
|
||||
|
||||
retry_count = 0 # Reset on successful connection
|
||||
|
||||
try:
|
||||
# Decode loop - iterate through packets from PyAV
|
||||
for packet in self.container.demux(video=0):
|
||||
if self._stop_flag.is_set():
|
||||
break
|
||||
|
||||
if packet.dts is None:
|
||||
continue
|
||||
|
||||
# Convert packet to numpy array
|
||||
packet_data = np.frombuffer(bytes(packet), dtype=np.uint8)
|
||||
|
||||
# Create PacketData and pass numpy array pointer
|
||||
pkt = nvc.PacketData()
|
||||
pkt.bsl_data = packet_data.ctypes.data
|
||||
pkt.bsl = len(packet_data)
|
||||
|
||||
# Decode using NVDEC
|
||||
decoded_frames = self.decoder.Decode(pkt)
|
||||
|
||||
if not decoded_frames:
|
||||
continue
|
||||
|
||||
# Add frames to ring buffer (thread-safe)
|
||||
with self._buffer_lock:
|
||||
for frame in decoded_frames:
|
||||
self.frame_buffer.append(frame)
|
||||
self.frame_count += 1
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error in decode loop for {self.rtsp_url}: {e}")
|
||||
self._set_status(ConnectionStatus.RECONNECTING)
|
||||
self._cleanup()
|
||||
self._stop_flag.wait(timeout=2.0)
|
||||
|
||||
def _cleanup(self):
|
||||
"""Cleanup resources"""
|
||||
if self.container:
|
||||
try:
|
||||
self.container.close()
|
||||
except Exception:
|
||||
pass
|
||||
self.container = None
|
||||
|
||||
self.decoder = None
|
||||
|
||||
with self._buffer_lock:
|
||||
self.frame_buffer.clear()
|
||||
|
||||
def get_frame(self, index: int = -1, rgb: bool = True) -> Optional[torch.Tensor]:
|
||||
"""
|
||||
Get a frame from the buffer as a CUDA tensor (in VRAM).
|
||||
|
||||
Args:
|
||||
index: Frame index in buffer (-1 for latest, -2 for second latest, etc.)
|
||||
rgb: If True, convert NV12 to RGB. If False, return raw NV12 format.
|
||||
|
||||
Returns:
|
||||
torch.Tensor in CUDA memory (device tensor) or None if buffer empty
|
||||
- If rgb=True: Shape (3, H, W) in RGB format, dtype uint8
|
||||
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||
"""
|
||||
with self._buffer_lock:
|
||||
if len(self.frame_buffer) == 0:
|
||||
return None
|
||||
|
||||
try:
|
||||
decoded_frame = self.frame_buffer[index]
|
||||
|
||||
# Convert DecodedFrame to PyTorch tensor using DLPack (zero-copy)
|
||||
# This keeps the data in GPU memory
|
||||
nv12_tensor = torch.from_dlpack(decoded_frame)
|
||||
|
||||
if not rgb:
|
||||
# Return raw NV12 format
|
||||
return nv12_tensor
|
||||
|
||||
# Convert NV12 to RGB on GPU
|
||||
if self.frame_height is None or self.frame_width is None:
|
||||
print("Frame dimensions not available")
|
||||
return None
|
||||
|
||||
rgb_tensor = nv12_to_rgb_gpu(nv12_tensor, self.frame_height, self.frame_width)
|
||||
return rgb_tensor
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error getting frame: {e}")
|
||||
return None
|
||||
|
||||
def get_latest_frame(self, rgb: bool = True) -> Optional[torch.Tensor]:
|
||||
"""
|
||||
Get the most recent decoded frame as CUDA tensor.
|
||||
|
||||
Args:
|
||||
rgb: If True, convert to RGB. If False, return raw NV12.
|
||||
|
||||
Returns:
|
||||
torch.Tensor on GPU in RGB (3, H, W) or NV12 (H*3/2, W) format
|
||||
"""
|
||||
return self.get_frame(-1, rgb=rgb)
|
||||
|
||||
def get_frame_cpu(self, index: int = -1, rgb: bool = True) -> Optional[np.ndarray]:
|
||||
"""
|
||||
Get a frame from the buffer and copy it to CPU memory as numpy array.
|
||||
|
||||
Args:
|
||||
index: Frame index in buffer (-1 for latest, -2 for second latest, etc.)
|
||||
rgb: If True, convert NV12 to RGB. If False, return raw NV12 format.
|
||||
|
||||
Returns:
|
||||
numpy.ndarray in CPU memory or None if buffer empty
|
||||
- If rgb=True: Shape (H, W, 3) in RGB format, dtype uint8 (HWC format for easy display)
|
||||
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||
"""
|
||||
# Get frame on GPU
|
||||
gpu_frame = self.get_frame(index=index, rgb=rgb)
|
||||
|
||||
if gpu_frame is None:
|
||||
return None
|
||||
|
||||
# Transfer from GPU to CPU
|
||||
cpu_tensor = gpu_frame.cpu()
|
||||
|
||||
# Convert to numpy array
|
||||
if rgb:
|
||||
# Convert from (3, H, W) to (H, W, 3) for standard image format
|
||||
cpu_array = cpu_tensor.permute(1, 2, 0).numpy()
|
||||
else:
|
||||
# Keep NV12 format as-is
|
||||
cpu_array = cpu_tensor.numpy()
|
||||
|
||||
return cpu_array
|
||||
|
||||
def get_latest_frame_cpu(self, rgb: bool = True) -> Optional[np.ndarray]:
|
||||
"""
|
||||
Get the most recent decoded frame as CPU numpy array.
|
||||
|
||||
Args:
|
||||
rgb: If True, convert to RGB. If False, return raw NV12.
|
||||
|
||||
Returns:
|
||||
numpy.ndarray in CPU memory
|
||||
- If rgb=True: Shape (H, W, 3) in RGB format, dtype uint8
|
||||
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||
"""
|
||||
return self.get_frame_cpu(-1, rgb=rgb)
|
||||
|
||||
def get_buffer_size(self) -> int:
|
||||
"""Get current number of frames in buffer"""
|
||||
with self._buffer_lock:
|
||||
return len(self.frame_buffer)
|
||||
|
||||
def is_connected(self) -> bool:
|
||||
"""Check if stream is actively connected"""
|
||||
return self.get_status() == ConnectionStatus.CONNECTED
|
||||
|
||||
def get_frame_as_jpeg(self, index: int = -1, quality: int = 95) -> Optional[bytes]:
|
||||
"""
|
||||
Get a frame from the buffer and encode to JPEG.
|
||||
|
||||
This method:
|
||||
1. Gets RGB frame from buffer (stays on GPU)
|
||||
2. Encodes to JPEG using nvJPEG (GPU operation via shared encoder)
|
||||
3. Transfers JPEG bytes to CPU
|
||||
4. Returns bytes for saving to disk
|
||||
|
||||
Args:
|
||||
index: Frame index in buffer (-1 for latest)
|
||||
quality: JPEG quality (0-100, default 95)
|
||||
|
||||
Returns:
|
||||
JPEG encoded bytes or None if frame unavailable
|
||||
"""
|
||||
# Get RGB frame (on GPU)
|
||||
rgb_frame = self.get_frame(index=index, rgb=True)
|
||||
|
||||
# Use the shared JPEG encoder from jpeg_encoder module
|
||||
return encode_frame_to_jpeg(rgb_frame, quality=quality)
|
||||
|
||||
def __repr__(self):
|
||||
return (f"StreamDecoder(url={self.rtsp_url}, status={self.status.value}, "
|
||||
f"buffer={self.get_buffer_size()}/{self.buffer_size}, "
|
||||
f"frames_decoded={self.frame_count})")
|
||||
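# Usage sketch (illustration; the RTSP URL is a placeholder and Pillow is an
# assumed extra dependency): get_latest_frame_cpu() returns an (H, W, 3) uint8
# RGB array, which maps directly onto a PIL image for quick visual inspection.
#
#   from PIL import Image
#   factory = StreamDecoderFactory(gpu_id=0)
#   decoder = factory.create_decoder("rtsp://user:pass@host/path", buffer_size=30)
#   decoder.start()
#   ...  # wait until decoder.is_connected()
#   arr = decoder.get_latest_frame_cpu(rgb=True)
#   if arr is not None:
#       Image.fromarray(arr, mode="RGB").save("preview.png")
#   decoder.stop()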
174
test_jpeg_encode.py
Executable file
|
|
@ -0,0 +1,174 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test script for JPEG encoding with nvImageCodec
|
||||
Tests GPU-accelerated JPEG encoding from RTSP stream frames
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import time
|
||||
import os
|
||||
from pathlib import Path
|
||||
from dotenv import load_dotenv
|
||||
from services import StreamDecoderFactory
|
||||
|
||||
# Load environment variables from .env file
|
||||
load_dotenv()
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Test JPEG encoding from RTSP stream')
|
||||
parser.add_argument(
|
||||
'--rtsp-url',
|
||||
type=str,
|
||||
default=None,
|
||||
help='RTSP stream URL (defaults to CAMERA_URL_1 from .env)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output-dir',
|
||||
type=str,
|
||||
default='./snapshots',
|
||||
help='Output directory for JPEG files'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--num-frames',
|
||||
type=int,
|
||||
default=10,
|
||||
help='Number of frames to capture'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--interval',
|
||||
type=float,
|
||||
default=1.0,
|
||||
help='Interval between captures in seconds'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--quality',
|
||||
type=int,
|
||||
default=95,
|
||||
help='JPEG quality (0-100)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--gpu-id',
|
||||
type=int,
|
||||
default=0,
|
||||
help='GPU device ID'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Get RTSP URL from command line or environment
|
||||
rtsp_url = args.rtsp_url
|
||||
if not rtsp_url:
|
||||
rtsp_url = os.getenv('CAMERA_URL_1')
|
||||
if not rtsp_url:
|
||||
print("Error: No RTSP URL provided")
|
||||
print("Please either:")
|
||||
print(" 1. Use --rtsp-url argument, or")
|
||||
print(" 2. Add CAMERA_URL_1 to your .env file")
|
||||
sys.exit(1)
|
||||
|
||||
# Create output directory
|
||||
output_dir = Path(args.output_dir)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
print("=" * 80)
|
||||
print("RTSP Stream JPEG Encoding Test")
|
||||
print("=" * 80)
|
||||
print(f"RTSP URL: {rtsp_url}")
|
||||
print(f"Output Directory: {output_dir}")
|
||||
print(f"Number of Frames: {args.num_frames}")
|
||||
print(f"Capture Interval: {args.interval}s")
|
||||
print(f"JPEG Quality: {args.quality}")
|
||||
print(f"GPU ID: {args.gpu_id}")
|
||||
print("=" * 80)
|
||||
print()
|
||||
|
||||
try:
|
||||
# Initialize factory and decoder
|
||||
print("[1/3] Initializing StreamDecoderFactory...")
|
||||
factory = StreamDecoderFactory(gpu_id=args.gpu_id)
|
||||
print("✓ Factory initialized\n")
|
||||
|
||||
print("[2/3] Creating and starting decoder...")
|
||||
decoder = factory.create_decoder(
|
||||
rtsp_url=rtsp_url,
|
||||
buffer_size=30
|
||||
)
|
||||
decoder.start()
|
||||
print("✓ Decoder started\n")
|
||||
|
||||
# Wait for connection
|
||||
print("[3/3] Waiting for stream to connect...")
|
||||
max_wait = 10
|
||||
for i in range(max_wait):
|
||||
if decoder.is_connected():
|
||||
print("✓ Stream connected\n")
|
||||
break
|
||||
time.sleep(1)
|
||||
print(f" Waiting... {i+1}/{max_wait}s")
|
||||
else:
|
||||
print("✗ Failed to connect to stream")
|
||||
sys.exit(1)
|
||||
|
||||
# Capture frames
|
||||
print(f"Capturing {args.num_frames} frames...")
|
||||
print("-" * 80)
|
||||
|
||||
captured = 0
|
||||
for i in range(args.num_frames):
|
||||
# Get frame as JPEG
|
||||
start_time = time.time()
|
||||
jpeg_bytes = decoder.get_frame_as_jpeg(quality=args.quality)
|
||||
encode_time = (time.time() - start_time) * 1000 # ms
|
||||
|
||||
if jpeg_bytes:
|
||||
# Save to file
|
||||
filename = output_dir / f"frame_{i:04d}.jpg"
|
||||
with open(filename, 'wb') as f:
|
||||
f.write(jpeg_bytes)
|
||||
|
||||
size_kb = len(jpeg_bytes) / 1024
|
||||
print(f"[{i+1}/{args.num_frames}] Saved {filename.name} "
|
||||
f"({size_kb:.1f} KB, encoded in {encode_time:.2f}ms)")
|
||||
captured += 1
|
||||
else:
|
||||
print(f"[{i+1}/{args.num_frames}] Failed to get frame")
|
||||
|
||||
# Wait before next capture (except for last frame)
|
||||
if i < args.num_frames - 1:
|
||||
time.sleep(args.interval)
|
||||
|
||||
print("-" * 80)
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 80)
|
||||
print("Capture Complete")
|
||||
print("=" * 80)
|
||||
print(f"Successfully captured: {captured}/{args.num_frames} frames")
|
||||
print(f"Output directory: {output_dir.absolute()}")
|
||||
print("=" * 80)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\n\n✗ Interrupted by user")
|
||||
sys.exit(1)
|
||||
|
||||
except Exception as e:
|
||||
print(f"\n\n✗ Error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
||||
|
||||
finally:
|
||||
# Cleanup
|
||||
if 'decoder' in locals():
|
||||
print("\nCleaning up...")
|
||||
decoder.stop()
|
||||
print("✓ Decoder stopped")
|
||||
|
||||
print("\n✓ Test completed successfully")
|
||||
sys.exit(0)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
310
test_model_inference.py
Normal file
|
|
@ -0,0 +1,310 @@
|
|||
"""
|
||||
Test script for TensorRT Model Repository with multi-camera inference.
|
||||
|
||||
This demonstrates:
|
||||
1. Loading the same model for multiple cameras (deduplication)
|
||||
2. Context pool load balancing
|
||||
3. GPU-to-GPU inference from RTSP streams
|
||||
4. Memory efficiency with shared engines
|
||||
"""
|
||||
|
||||
import time
|
||||
import torch
|
||||
from services.model_repository import TensorRTModelRepository
|
||||
from services.stream_decoder import StreamDecoderFactory
|
||||
|
||||
|
||||
def test_multi_camera_inference():
|
||||
"""
|
||||
Simulate multi-camera inference scenario.
|
||||
|
||||
Example: 100 cameras, all using the same YOLOv8 model
|
||||
- Without pooling: 100 engines + 100 contexts in VRAM
|
||||
- With pooling: 1 engine + 4 contexts in VRAM (huge savings!)
|
||||
"""
|
||||
|
||||
# Initialize model repository with context pooling
|
||||
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||
|
||||
# Camera configurations (simulated)
|
||||
camera_configs = [
|
||||
{"id": "camera_1", "rtsp_url": "rtsp://camera1.local/stream"},
|
||||
{"id": "camera_2", "rtsp_url": "rtsp://camera2.local/stream"},
|
||||
{"id": "camera_3", "rtsp_url": "rtsp://camera3.local/stream"},
|
||||
# ... imagine 100 cameras here
|
||||
]
|
||||
|
||||
# Load the same model for all cameras
|
||||
model_file = "models/yolov8n.trt" # Same file for all cameras
|
||||
|
||||
print("=" * 80)
|
||||
print("LOADING MODELS FOR MULTIPLE CAMERAS")
|
||||
print("=" * 80)
|
||||
|
||||
for config in camera_configs:
|
||||
try:
|
||||
# Each camera gets its own model_id, but shares the same engine!
|
||||
metadata = repo.load_model(
|
||||
model_id=config["id"],
|
||||
file_path=model_file,
|
||||
num_contexts=4 # 4 contexts shared across all cameras
|
||||
)
|
||||
print(f"\n✓ Loaded model for {config['id']}")
|
||||
except Exception as e:
|
||||
print(f"\n✗ Failed to load model for {config['id']}: {e}")
|
||||
|
||||
# Show repository stats
|
||||
print("\n" + "=" * 80)
|
||||
print("REPOSITORY STATISTICS")
|
||||
print("=" * 80)
|
||||
stats = repo.get_stats()
|
||||
print(f"Total model IDs: {stats['total_model_ids']}")
|
||||
print(f"Unique engines in VRAM: {stats['unique_engines']}")
|
||||
print(f"Total contexts: {stats['total_contexts']}")
|
||||
print(f"Memory efficiency: {stats['memory_efficiency']}")
|
||||
|
||||
# Get detailed info for one camera
|
||||
print("\n" + "=" * 80)
|
||||
print("DETAILED MODEL INFO (camera_1)")
|
||||
print("=" * 80)
|
||||
info = repo.get_model_info("camera_1")
|
||||
if info:
|
||||
print(f"Model ID: {info['model_id']}")
|
||||
print(f"File: {info['file_path']}")
|
||||
print(f"File hash: {info['file_hash']}")
|
||||
print(f"Engine references: {info['engine_references']}")
|
||||
print(f"Context pool size: {info['context_pool_size']}")
|
||||
print(f"Shared with: {info['shared_with_model_ids']}")
|
||||
print(f"\nInputs:")
|
||||
for name, spec in info['inputs'].items():
|
||||
print(f" {name}: {spec['shape']} ({spec['dtype']})")
|
||||
print(f"\nOutputs:")
|
||||
for name, spec in info['outputs'].items():
|
||||
print(f" {name}: {spec['shape']} ({spec['dtype']})")
|
||||
|
||||
# Simulate inference from multiple cameras
|
||||
print("\n" + "=" * 80)
|
||||
print("RUNNING INFERENCE (GPU-to-GPU)")
|
||||
print("=" * 80)
|
||||
|
||||
# Create dummy input tensors (simulating frames from cameras)
|
||||
# In real scenario, these come from StreamDecoder.get_frame()
|
||||
batch_size = 1
|
||||
channels = 3
|
||||
height = 640
|
||||
width = 640
|
||||
|
||||
for config in camera_configs:
|
||||
try:
|
||||
# Simulate getting frame from camera (already on GPU)
|
||||
input_tensor = torch.rand(
|
||||
batch_size, channels, height, width,
|
||||
dtype=torch.float32,
|
||||
device='cuda:0'
|
||||
)
|
||||
|
||||
# Run inference (stays in GPU)
|
||||
start = time.time()
|
||||
outputs = repo.infer(
|
||||
model_id=config["id"],
|
||||
inputs={"images": input_tensor}, # Adjust input name based on your model
|
||||
synchronize=True,
|
||||
timeout=5.0
|
||||
)
|
||||
elapsed = (time.time() - start) * 1000 # Convert to ms
|
||||
|
||||
print(f"\n{config['id']}: Inference completed in {elapsed:.2f}ms")
|
||||
for name, tensor in outputs.items():
|
||||
print(f" Output '{name}': {tensor.shape} on {tensor.device}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"\n{config['id']}: Inference failed: {e}")
|
||||
|
||||
# Cleanup
|
||||
print("\n" + "=" * 80)
|
||||
print("CLEANUP")
|
||||
print("=" * 80)
|
||||
|
||||
for config in camera_configs:
|
||||
repo.unload_model(config["id"])
|
||||
|
||||
print("\nAll models unloaded.")
|
||||
|
||||
|
||||
def test_rtsp_stream_with_inference():
|
||||
"""
|
||||
Real-world example: Decode RTSP stream and run inference.
|
||||
Everything stays in GPU memory (zero CPU transfers).
|
||||
"""
|
||||
|
||||
print("=" * 80)
|
||||
print("RTSP STREAM + TENSORRT INFERENCE (GPU-to-GPU)")
|
||||
print("=" * 80)
|
||||
|
||||
# Initialize components
|
||||
decoder_factory = StreamDecoderFactory(gpu_id=0)
|
||||
model_repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||
|
||||
# Setup camera stream
|
||||
rtsp_url = "rtsp://your-camera-ip/stream"
|
||||
decoder = decoder_factory.create_decoder(rtsp_url, buffer_size=30)
|
||||
decoder.start()
|
||||
|
||||
# Load inference model
|
||||
try:
|
||||
model_repo.load_model(
|
||||
model_id="camera_main",
|
||||
file_path="models/yolov8n.trt"
|
||||
)
|
||||
except FileNotFoundError:
|
||||
print("\n⚠ Model file not found. Please export your model to TensorRT:")
|
||||
print(" Example: yolo export model=yolov8n.pt format=engine device=0")
|
||||
return
|
||||
|
||||
print("\nWaiting for stream to buffer frames...")
|
||||
time.sleep(3)
|
||||
|
||||
# Process frames
|
||||
for i in range(10):
|
||||
# Get frame from decoder (already on GPU)
|
||||
frame_gpu = decoder.get_latest_frame(rgb=True) # Returns torch.Tensor on CUDA
|
||||
|
||||
if frame_gpu is None:
|
||||
print(f"Frame {i}: No frame available")
|
||||
continue
|
||||
|
||||
# Preprocess if needed (stays on GPU)
|
||||
# For YOLOv8: normalize, resize, etc.
|
||||
# Example preprocessing (adjust for your model):
|
||||
frame_gpu = frame_gpu.float() / 255.0 # Normalize to [0, 1]
|
||||
frame_gpu = frame_gpu.unsqueeze(0) # Add batch dimension: (1, 3, H, W)
|
||||
|
||||
# Run inference (GPU-to-GPU, zero copy)
|
||||
try:
|
||||
outputs = model_repo.infer(
|
||||
model_id="camera_main",
|
||||
inputs={"images": frame_gpu},
|
||||
synchronize=True
|
||||
)
|
||||
|
||||
print(f"\nFrame {i}: Inference successful")
|
||||
for name, tensor in outputs.items():
|
||||
print(f" {name}: {tensor.shape} on {tensor.device}")
|
||||
|
||||
# Post-process results (can stay on GPU or move to CPU as needed)
|
||||
# Example: NMS, bounding box extraction, etc.
|
||||
|
||||
except Exception as e:
|
||||
print(f"\nFrame {i}: Inference failed: {e}")
|
||||
|
||||
time.sleep(0.1) # Simulate processing interval
|
||||
|
||||
# Cleanup
|
||||
decoder.stop()
|
||||
model_repo.unload_model("camera_main")
|
||||
print("\n✓ Test completed successfully")
|
||||
|
||||
|
||||
def test_concurrent_inference():
|
||||
"""
|
||||
Test concurrent inference from multiple threads.
|
||||
Demonstrates context pool load balancing.
|
||||
"""
|
||||
import threading
|
||||
|
||||
print("=" * 80)
|
||||
print("CONCURRENT INFERENCE TEST (Context Pool Load Balancing)")
|
||||
print("=" * 80)
|
||||
|
||||
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||
|
||||
# Load model
|
||||
try:
|
||||
repo.load_model("shared_model", "models/yolov8n.trt", num_contexts=4)
|
||||
except Exception as e:
|
||||
print(f"Failed to load model: {e}")
|
||||
return
|
||||
|
||||
def worker(worker_id: int, num_inferences: int):
|
||||
"""Worker thread performing inference"""
|
||||
for i in range(num_inferences):
|
||||
try:
|
||||
# Create dummy input
|
||||
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0', dtype=torch.float32)
|
||||
|
||||
# Acquire context from pool, run inference, release context
|
||||
outputs = repo.infer(
|
||||
model_id="shared_model",
|
||||
inputs={"images": input_tensor},
|
||||
timeout=10.0
|
||||
)
|
||||
|
||||
print(f"Worker {worker_id}, Inference {i}: SUCCESS")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Worker {worker_id}, Inference {i}: FAILED - {e}")
|
||||
|
||||
time.sleep(0.01) # Small delay
|
||||
|
||||
# Launch multiple worker threads (more workers than contexts!)
|
||||
threads = []
|
||||
num_workers = 10 # 10 workers sharing 4 contexts
|
||||
inferences_per_worker = 5
|
||||
|
||||
print(f"\nLaunching {num_workers} workers (only 4 contexts available)")
|
||||
print("Contexts will be borrowed/returned automatically\n")
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
for worker_id in range(num_workers):
|
||||
t = threading.Thread(target=worker, args=(worker_id, inferences_per_worker))
|
||||
threads.append(t)
|
||||
t.start()
|
||||
|
||||
# Wait for all workers
|
||||
for t in threads:
|
||||
t.join()
|
||||
|
||||
elapsed = time.time() - start_time
|
||||
total_inferences = num_workers * inferences_per_worker
|
||||
|
||||
print(f"\n✓ Completed {total_inferences} inferences in {elapsed:.2f}s")
|
||||
print(f" Throughput: {total_inferences / elapsed:.2f} inferences/sec")
|
||||
print(f" With only 4 contexts for {num_workers} workers!")
|
||||
|
||||
repo.unload_model("shared_model")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("\n" + "=" * 80)
|
||||
print("TENSORRT MODEL REPOSITORY - TEST SUITE")
|
||||
print("=" * 80)
|
||||
|
||||
# Test 1: Multi-camera model loading
|
||||
print("\n\nTEST 1: Multi-Camera Model Loading with Deduplication")
|
||||
print("-" * 80)
|
||||
try:
|
||||
test_multi_camera_inference()
|
||||
except Exception as e:
|
||||
print(f"Test 1 failed: {e}")
|
||||
|
||||
# Test 2: RTSP stream + inference (commented out by default)
|
||||
# Uncomment if you have a real RTSP stream
|
||||
# print("\n\nTEST 2: RTSP Stream + Inference")
|
||||
# print("-" * 80)
|
||||
# try:
|
||||
# test_rtsp_stream_with_inference()
|
||||
# except Exception as e:
|
||||
# print(f"Test 2 failed: {e}")
|
||||
|
||||
# Test 3: Concurrent inference
|
||||
print("\n\nTEST 3: Concurrent Inference with Context Pooling")
|
||||
print("-" * 80)
|
||||
try:
|
||||
test_concurrent_inference()
|
||||
except Exception as e:
|
||||
print(f"Test 3 failed: {e}")
|
||||
|
||||
print("\n" + "=" * 80)
|
||||
print("ALL TESTS COMPLETED")
|
||||
print("=" * 80)
|
||||
255
test_multi_stream.py
Executable file
|
|
@ -0,0 +1,255 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Multi-stream test script to verify CUDA context sharing efficiency.
|
||||
Tests multiple RTSP streams simultaneously and monitors VRAM usage.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import time
|
||||
import sys
|
||||
import subprocess
|
||||
import os
|
||||
from pathlib import Path
|
||||
from dotenv import load_dotenv
|
||||
from services import StreamDecoderFactory, ConnectionStatus
|
||||
|
||||
# Load environment variables from .env file
|
||||
load_dotenv()
|
||||
|
||||
|
||||
def get_gpu_memory_usage(gpu_id: int = 0) -> int:
|
||||
"""Get current GPU memory usage in MB using nvidia-smi"""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits', f'--id={gpu_id}'],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
check=True
|
||||
)
|
||||
return int(result.stdout.strip())
|
||||
except Exception as e:
|
||||
print(f"Warning: Could not get GPU memory usage: {e}")
|
||||
return 0
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Test multi-stream decoding with context sharing')
|
||||
parser.add_argument(
|
||||
'--gpu-id',
|
||||
type=int,
|
||||
default=0,
|
||||
help='GPU device ID'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--duration',
|
||||
type=int,
|
||||
default=20,
|
||||
help='Test duration in seconds'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--capture-snapshots',
|
||||
action='store_true',
|
||||
help='Capture JPEG snapshots during test'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output-dir',
|
||||
type=str,
|
||||
default='./multi_stream_snapshots',
|
||||
help='Output directory for snapshots'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Load camera URLs from environment
|
||||
camera_urls = []
|
||||
i = 1
|
||||
while True:
|
||||
url = os.getenv(f'CAMERA_URL_{i}')
|
||||
if url:
|
||||
camera_urls.append(url)
|
||||
i += 1
|
||||
else:
|
||||
break
|
||||
|
||||
if not camera_urls:
|
||||
print("Error: No camera URLs found in .env file")
|
||||
print("Please add CAMERA_URL_1, CAMERA_URL_2, etc. to your .env file")
|
||||
sys.exit(1)
|
||||
|
||||
# Create output directory if capturing snapshots
|
||||
if args.capture_snapshots:
|
||||
output_dir = Path(args.output_dir)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
print("=" * 80)
|
||||
print("Multi-Stream RTSP Decoder Test - Context Sharing Verification")
|
||||
print("=" * 80)
|
||||
print(f"Number of Streams: {len(camera_urls)}")
|
||||
print(f"GPU ID: {args.gpu_id}")
|
||||
print(f"Test Duration: {args.duration} seconds")
|
||||
print(f"Capture Snapshots: {args.capture_snapshots}")
|
||||
print("=" * 80)
|
||||
print()
|
||||
|
||||
try:
|
||||
# Get baseline GPU memory
|
||||
print("[Baseline] Measuring initial GPU memory usage...")
|
||||
baseline_memory = get_gpu_memory_usage(args.gpu_id)
|
||||
print(f"✓ Baseline VRAM: {baseline_memory} MB\n")
|
||||
|
||||
# Initialize factory (shared CUDA context)
|
||||
print("[1/4] Initializing StreamDecoderFactory with shared CUDA context...")
|
||||
factory = StreamDecoderFactory(gpu_id=args.gpu_id)
|
||||
|
||||
factory_memory = get_gpu_memory_usage(args.gpu_id)
|
||||
factory_overhead = factory_memory - baseline_memory
|
||||
print(f"✓ Factory initialized")
|
||||
print(f" VRAM after factory: {factory_memory} MB (+{factory_overhead} MB)\n")
|
||||
|
||||
# Create all decoders
|
||||
print(f"[2/4] Creating {len(camera_urls)} StreamDecoder instances...")
|
||||
decoders = []
|
||||
for i, url in enumerate(camera_urls):
|
||||
decoder = factory.create_decoder(
|
||||
rtsp_url=url,
|
||||
buffer_size=30,
|
||||
codec='h264'
|
||||
)
|
||||
decoders.append(decoder)
|
||||
print(f" ✓ Decoder {i+1} created for camera {url.split('@')[1].split('/')[0]}")
|
||||
|
||||
decoders_memory = get_gpu_memory_usage(args.gpu_id)
|
||||
decoders_overhead = decoders_memory - factory_memory
|
||||
print(f"\n VRAM after creating {len(decoders)} decoders: {decoders_memory} MB (+{decoders_overhead} MB)")
|
||||
print(f" Average per decoder: {decoders_overhead / len(decoders):.1f} MB\n")
|
||||
|
||||
# Start all decoders
|
||||
print(f"[3/4] Starting all {len(decoders)} decoders...")
|
||||
for i, decoder in enumerate(decoders):
|
||||
decoder.start()
|
||||
print(f" ✓ Decoder {i+1} started")
|
||||
|
||||
started_memory = get_gpu_memory_usage(args.gpu_id)
|
||||
started_overhead = started_memory - decoders_memory
|
||||
print(f"\n VRAM after starting decoders: {started_memory} MB (+{started_overhead} MB)")
|
||||
print(f" Average per running decoder: {started_overhead / len(decoders):.1f} MB\n")
|
||||
|
||||
# Wait for all streams to connect
|
||||
print("[4/4] Waiting for all streams to connect...")
|
||||
max_wait = 15
|
||||
for wait_time in range(max_wait):
|
||||
connected = sum(1 for d in decoders if d.is_connected())
|
||||
print(f" Connected: {connected}/{len(decoders)} streams", end='\r')
|
||||
|
||||
if connected == len(decoders):
|
||||
print(f"\n✓ All {len(decoders)} streams connected!\n")
|
||||
break
|
||||
|
||||
time.sleep(1)
|
||||
else:
|
||||
connected = sum(1 for d in decoders if d.is_connected())
|
||||
print(f"\n⚠ Only {connected}/{len(decoders)} streams connected after {max_wait}s\n")
|
||||
|
||||
connected_memory = get_gpu_memory_usage(args.gpu_id)
|
||||
connected_overhead = connected_memory - started_memory
|
||||
print(f" VRAM after connection: {connected_memory} MB (+{connected_overhead} MB)\n")
|
||||
|
||||
# Monitor streams
|
||||
print(f"Monitoring streams for {args.duration} seconds...")
|
||||
print("=" * 80)
|
||||
print(f"{'Time':<8} {'VRAM':<10} {'Stream 1':<12} {'Stream 2':<12} {'Stream 3':<12} {'Stream 4':<12}")
|
||||
print("-" * 80)
|
||||
|
||||
start_time = time.time()
|
||||
snapshot_interval = args.duration // 3 if args.capture_snapshots else 0
|
||||
last_snapshot = 0
|
||||
|
||||
while time.time() - start_time < args.duration:
|
||||
elapsed = time.time() - start_time
|
||||
current_memory = get_gpu_memory_usage(args.gpu_id)
|
||||
|
||||
# Get stats for each decoder
|
||||
stats = []
|
||||
for decoder in decoders:
|
||||
status = decoder.get_status().value[:8]
|
||||
buffer = decoder.get_buffer_size()
|
||||
frames = decoder.frame_count
|
||||
stats.append(f"{status:8s} {buffer:2d}/30 {frames:4d}")
|
||||
|
||||
print(f"{elapsed:6.1f}s {current_memory:6d}MB {stats[0]:<12} {stats[1]:<12} {stats[2]:<12} {stats[3]:<12}")
|
||||
|
||||
# Capture snapshots
|
||||
if args.capture_snapshots and snapshot_interval > 0:
|
||||
if elapsed - last_snapshot >= snapshot_interval:
|
||||
print("\n → Capturing snapshots from all streams...")
|
||||
for i, decoder in enumerate(decoders):
|
||||
jpeg_bytes = decoder.get_frame_as_jpeg(quality=85)
|
||||
if jpeg_bytes:
|
||||
filename = output_dir / f"camera_{i+1}_t{int(elapsed)}s.jpg"
|
||||
with open(filename, 'wb') as f:
|
||||
f.write(jpeg_bytes)
|
||||
print(f" Saved {filename.name} ({len(jpeg_bytes)/1024:.1f} KB)")
|
||||
print()
|
||||
last_snapshot = elapsed
|
||||
|
||||
time.sleep(1)
|
||||
|
||||
print("=" * 80)
|
||||
|
||||
# Final memory analysis
|
||||
final_memory = get_gpu_memory_usage(args.gpu_id)
|
||||
total_overhead = final_memory - baseline_memory
|
||||
|
||||
print("\n" + "=" * 80)
|
||||
print("Memory Usage Analysis")
|
||||
print("=" * 80)
|
||||
print(f"Baseline VRAM: {baseline_memory:6d} MB")
|
||||
print(f"After Factory Init: {factory_memory:6d} MB (+{factory_overhead:4d} MB)")
|
||||
print(f"After Creating {len(decoders)} Decoders: {decoders_memory:6d} MB (+{decoders_overhead:4d} MB)")
|
||||
print(f"After Starting Decoders: {started_memory:6d} MB (+{started_overhead:4d} MB)")
|
||||
print(f"After Connection: {connected_memory:6d} MB (+{connected_overhead:4d} MB)")
|
||||
print(f"Final (after {args.duration}s): {final_memory:6d} MB (+{total_overhead:4d} MB total)")
|
||||
print("-" * 80)
|
||||
print(f"Average VRAM per stream: {total_overhead / len(decoders):6.1f} MB")
|
||||
print(f"Context sharing efficiency: {'EXCELLENT' if total_overhead < 500 else 'GOOD' if total_overhead < 800 else 'POOR'}")
|
||||
print("=" * 80)
|
||||
|
||||
# Final stats
|
||||
print("\nFinal Stream Statistics:")
|
||||
print("-" * 80)
|
||||
for i, decoder in enumerate(decoders):
|
||||
status = decoder.get_status().value
|
||||
buffer = decoder.get_buffer_size()
|
||||
frames = decoder.frame_count
|
||||
fps = frames / args.duration if args.duration > 0 else 0
|
||||
print(f"Stream {i+1}: {status:12s} | Buffer: {buffer:2d}/{decoder.buffer_size} | "
|
||||
f"Frames: {frames:5d} | Avg FPS: {fps:5.2f}")
|
||||
print("=" * 80)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\n\n✗ Interrupted by user")
|
||||
sys.exit(1)
|
||||
|
||||
except Exception as e:
|
||||
print(f"\n\n✗ Error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
||||
|
||||
finally:
|
||||
# Cleanup
|
||||
if 'decoders' in locals():
|
||||
print("\nCleaning up...")
|
||||
for i, decoder in enumerate(decoders):
|
||||
decoder.stop()
|
||||
print(f" ✓ Decoder {i+1} stopped")
|
||||
|
||||
cleanup_memory = get_gpu_memory_usage(args.gpu_id)
|
||||
print(f"\nVRAM after cleanup: {cleanup_memory} MB")
|
||||
|
||||
print("\n✓ Multi-stream test completed successfully")
|
||||
sys.exit(0)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
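Not part of the commit, but for quick reference: the factory API exercised by the test above can be driven in a few lines. The sketch below is a hypothetical single-camera snapshot script; it assumes the same `StreamDecoderFactory.create_decoder()`, `start()`, `is_connected()`, `get_frame_as_jpeg()` and `stop()` calls used in the test, plus a `CAMERA_URL_1` entry in `.env`.

```python
# Minimal snapshot sketch (assumes the services API used by the tests in this commit).
import os
import time

from dotenv import load_dotenv
from services import StreamDecoderFactory

load_dotenv()
url = os.getenv('CAMERA_URL_1')
if not url:
    raise SystemExit("CAMERA_URL_1 must be set in .env")

factory = StreamDecoderFactory(gpu_id=0)                # shared CUDA context
decoder = factory.create_decoder(rtsp_url=url, buffer_size=30)
decoder.start()

try:
    # Give the stream up to ~15 s to connect before grabbing a frame.
    for _ in range(15):
        if decoder.is_connected():
            break
        time.sleep(1)

    jpeg_bytes = decoder.get_frame_as_jpeg(quality=85)  # GPU-encoded JPEG bytes
    if jpeg_bytes:
        with open('snapshot.jpg', 'wb') as f:
            f.write(jpeg_bytes)
finally:
    decoder.stop()
```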
152
test_stream.py
Executable file
@ -0,0 +1,152 @@
#!/usr/bin/env python3
"""
CLI test script for StreamDecoder
Tests RTSP stream decoding with NVDEC hardware acceleration
"""

import argparse
import time
import sys
from services.stream_decoder import StreamDecoderFactory, ConnectionStatus


def main():
    parser = argparse.ArgumentParser(description='Test RTSP stream decoder with NVDEC')
    parser.add_argument(
        '--rtsp-url',
        type=str,
        required=True,
        help='RTSP stream URL (e.g., rtsp://user:pass@host/path)'
    )
    parser.add_argument(
        '--gpu-id',
        type=int,
        default=0,
        help='GPU device ID'
    )
    parser.add_argument(
        '--buffer-size',
        type=int,
        default=30,
        help='Frame buffer size'
    )
    parser.add_argument(
        '--duration',
        type=int,
        default=30,
        help='Test duration in seconds'
    )
    parser.add_argument(
        '--check-interval',
        type=float,
        default=1.0,
        help='Status check interval in seconds'
    )

    args = parser.parse_args()

    print("=" * 80)
    print("RTSP Stream Decoder Test")
    print("=" * 80)
    print(f"RTSP URL: {args.rtsp_url}")
    print(f"GPU ID: {args.gpu_id}")
    print(f"Buffer Size: {args.buffer_size} frames")
    print(f"Test Duration: {args.duration} seconds")
    print("=" * 80)
    print()

    try:
        # Create factory with shared CUDA context
        print("[1/4] Initializing StreamDecoderFactory...")
        factory = StreamDecoderFactory(gpu_id=args.gpu_id)
        print("✓ Factory initialized with shared CUDA context\n")

        # Create decoder
        print("[2/4] Creating StreamDecoder...")
        decoder = factory.create_decoder(
            rtsp_url=args.rtsp_url,
            buffer_size=args.buffer_size,
            codec='h264'
        )
        print(f"✓ Decoder created: {decoder}\n")

        # Start decoding
        print("[3/4] Starting decoder thread...")
        decoder.start()
        print("✓ Decoder thread started\n")

        # Monitor for specified duration
        print(f"[4/4] Monitoring stream for {args.duration} seconds...")
        print("-" * 80)

        start_time = time.time()
        last_frame_count = 0

        while time.time() - start_time < args.duration:
            time.sleep(args.check_interval)

            # Get status
            status = decoder.get_status()
            buffer_size = decoder.get_buffer_size()
            frame_count = decoder.frame_count
            fps = (frame_count - last_frame_count) / args.check_interval
            last_frame_count = frame_count

            # Print status
            elapsed = time.time() - start_time
            print(f"[{elapsed:6.1f}s] Status: {status.value:12s} | "
                  f"Buffer: {buffer_size:2d}/{args.buffer_size:2d} | "
                  f"Frames: {frame_count:5d} | "
                  f"FPS: {fps:5.1f}")

            # Try to get latest frame
            if status == ConnectionStatus.CONNECTED:
                frame = decoder.get_latest_frame()
                if frame is not None:
                    print(f" Frame shape: {frame.shape}, dtype: {frame.dtype}, "
                          f"device: {frame.device}")

            # Check for errors
            if status == ConnectionStatus.ERROR:
                print("\n✗ ERROR: Stream connection failed!")
                break

        print("-" * 80)

        # Final statistics
        print("\n" + "=" * 80)
        print("Test Complete - Final Statistics")
        print("=" * 80)
        print(f"Total Frames Decoded: {decoder.frame_count}")
        print(f"Average FPS: {decoder.frame_count / args.duration:.2f}")
        print(f"Final Status: {decoder.get_status().value}")
        print(f"Buffer Utilization: {decoder.get_buffer_size()}/{args.buffer_size}")

        if decoder.frame_width and decoder.frame_height:
            print(f"Frame Resolution: {decoder.frame_width}x{decoder.frame_height}")

        print("=" * 80)

    except KeyboardInterrupt:
        print("\n\n✗ Interrupted by user")
        sys.exit(1)

    except Exception as e:
        print(f"\n\n✗ Error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

    finally:
        # Cleanup
        if 'decoder' in locals():
            print("\nCleaning up...")
            decoder.stop()
            print("✓ Decoder stopped")

    # Reached only on the success path; the error paths above exit with code 1
    print("\n✓ Test completed successfully")
    sys.exit(0)


if __name__ == '__main__':
    main()
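The monitoring loop above prints the shape, dtype, and device of the tensor returned by `get_latest_frame()`. As a hedged sketch of consuming that tensor without leaving the GPU, the snippet below assumes an HWC uint8 RGB layout (drop the `permute` if the decoder actually returns CHW) and resizes the frame to the 640x640 input size used by the TensorRT verification script later in this commit.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: consume the decoder's GPU frame without copying it to the CPU.
frame = decoder.get_latest_frame()              # GPU tensor, per the test output above
if frame is not None:
    # Assumption: HWC uint8 RGB. If the decoder returns CHW, remove the permute.
    chw = frame.permute(2, 0, 1).float().div_(255.0)   # HWC -> CHW, scale to [0, 1]
    batch = chw.unsqueeze(0)                           # add batch dimension
    resized = F.interpolate(batch, size=(640, 640),
                            mode='bilinear', align_corners=False)
    print(resized.shape, resized.device)               # still on the GPU
```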
143
test_vram_process.py
Normal file
@ -0,0 +1,143 @@
#!/usr/bin/env python3
"""
VRAM scaling test - measures Python process memory usage for 1, 2, 3, and 4 streams.
"""

import os
import time
import subprocess
from dotenv import load_dotenv
from services import StreamDecoderFactory

# Load environment variables from .env file
load_dotenv()

# Load camera URLs from environment
camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break

if not camera_urls:
    print("Error: No camera URLs found in .env file")
    print("Please add CAMERA_URL_1, CAMERA_URL_2, etc. to your .env file")
    exit(1)


def get_python_gpu_memory():
    """Get Python process GPU memory usage in MB"""
    try:
        pid = os.getpid()
        result = subprocess.run(
            ['nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv,noheader,nounits'],
            capture_output=True, text=True, check=True
        )
        for line in result.stdout.strip().split('\n'):
            if line:
                parts = line.split(',')
                if len(parts) >= 2 and int(parts[0].strip()) == pid:
                    return int(parts[1].strip())
        return 0
    except Exception:
        return 0


def test_n_streams(n, wait_time=15):
    """Test with n streams"""
    print(f"\n{'='*80}")
    print(f"Testing with {n} stream(s)")
    print('='*80)

    mem_before = get_python_gpu_memory()
    print(f"Python process VRAM before: {mem_before} MB")

    # Create factory
    factory = StreamDecoderFactory(gpu_id=0)
    time.sleep(1)
    mem_after_factory = get_python_gpu_memory()
    print(f"After factory: {mem_after_factory} MB (+{mem_after_factory - mem_before} MB)")

    # Create decoders
    decoders = []
    for i in range(n):
        decoder = factory.create_decoder(camera_urls[i], buffer_size=30)
        decoders.append(decoder)

    time.sleep(1)
    mem_after_create = get_python_gpu_memory()
    print(f"After creating {n} decoder(s): {mem_after_create} MB (+{mem_after_create - mem_after_factory} MB)")

    # Start decoders
    for decoder in decoders:
        decoder.start()

    time.sleep(2)
    mem_after_start = get_python_gpu_memory()
    print(f"After starting {n} decoder(s): {mem_after_start} MB (+{mem_after_start - mem_after_create} MB)")

    # Wait for connection
    print(f"Waiting {wait_time}s for streams to connect and stabilize...")
    time.sleep(wait_time)

    # Check connection status
    connected = sum(1 for d in decoders if d.is_connected())
    mem_stable = get_python_gpu_memory()

    print(f"Connected: {connected}/{n} streams")
    print(f"Python process VRAM (stable): {mem_stable} MB")

    # Get frame stats
    for i, decoder in enumerate(decoders):
        print(f" Stream {i+1}: {decoder.get_status().value:10s} "
              f"Buffer: {decoder.get_buffer_size()}/30 "
              f"Frames: {decoder.frame_count}")

    # Cleanup
    for decoder in decoders:
        decoder.stop()

    time.sleep(2)
    mem_after_cleanup = get_python_gpu_memory()
    print(f"After cleanup: {mem_after_cleanup} MB")

    return mem_stable


if __name__ == '__main__':
    print("Python VRAM Scaling Test")
    print(f"PID: {os.getpid()}")

    baseline = get_python_gpu_memory()
    print(f"Baseline Python process VRAM: {baseline} MB\n")

    results = {}
    for n in [1, 2, 3, 4]:
        mem = test_n_streams(n, wait_time=15)
        results[n] = mem
        print(f"\n→ {n} stream(s): {mem} MB (process total)")

        # Give time between tests
        if n < 4:
            print("\nWaiting 5s before next test...")
            time.sleep(5)

    # Summary
    print("\n" + "="*80)
    print("Python Process VRAM Scaling Summary")
    print("="*80)
    print(f"Baseline: {baseline:4d} MB")
    for n in [1, 2, 3, 4]:
        total = results[n]
        overhead = total - baseline
        per_stream = overhead / n if n > 0 else 0
        print(f"{n} stream(s): {total:4d} MB (+{overhead:3d} MB total, {per_stream:5.1f} MB per stream)")

    # Calculate marginal cost
    print("\nMarginal cost per additional stream:")
    for n in [2, 3, 4]:
        marginal = results[n] - results[n-1]
        print(f" Stream {n}: +{marginal} MB")

    print("="*80)
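The summary above reports total and marginal VRAM for the measured stream counts. As an illustrative extrapolation only, the hypothetical helper below projects process VRAM to a larger stream count from the measured `results` dict and `baseline`, assuming the marginal cost stays roughly linear beyond the measured range.

```python
def estimate_vram_mb(target_streams: int, results: dict, baseline: int) -> float:
    """Linear projection of process VRAM (MB) for target_streams.

    Assumes the per-stream marginal cost stays constant beyond the measured
    points; 'results' maps stream count -> measured process VRAM in MB.
    """
    counts = sorted(results)
    if len(counts) < 2:
        return float(results[counts[0]]) if counts else float(baseline)
    # Average marginal cost across the measured range.
    marginal = (results[counts[-1]] - results[counts[0]]) / (counts[-1] - counts[0])
    fixed = results[counts[0]] - marginal * counts[0]   # baseline + shared-context overhead
    return fixed + marginal * target_streams


# Example: projected process VRAM for 16 streams from the 1-4 stream measurements.
# print(f"~{estimate_vram_mb(16, results, baseline):.0f} MB")
```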
85
verify_tensorrt_model.py
Normal file
@ -0,0 +1,85 @@
#!/usr/bin/env python3
"""
Quick verification script for TensorRT model
"""

import torch
from services.model_repository import TensorRTModelRepository


def verify_model():
    print("=" * 80)
    print("TensorRT Model Verification")
    print("=" * 80)

    # Initialize repository
    repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=2)

    # Load the model
    print("\nLoading YOLOv8n TensorRT engine...")
    try:
        metadata = repo.load_model(
            model_id="yolov8n_test",
            file_path="models/yolov8n.trt",
            num_contexts=2
        )
        print("✓ Model loaded successfully!")
    except Exception as e:
        print(f"✗ Failed to load model: {e}")
        return

    # Get model info
    print("\n" + "=" * 80)
    print("Model Information")
    print("=" * 80)
    info = repo.get_model_info("yolov8n_test")
    if info:
        print(f"Model ID: {info['model_id']}")
        print(f"File: {info['file_path']}")
        print(f"File hash: {info['file_hash']}")
        print("\nInputs:")
        for name, spec in info['inputs'].items():
            print(f" {name}: {spec['shape']} ({spec['dtype']})")
        print("\nOutputs:")
        for name, spec in info['outputs'].items():
            print(f" {name}: {spec['shape']} ({spec['dtype']})")

    # Run test inference
    print("\n" + "=" * 80)
    print("Running Test Inference")
    print("=" * 80)

    try:
        # Create dummy input (simulating a 640x640 image)
        input_tensor = torch.rand(1, 3, 640, 640, dtype=torch.float32, device='cuda:0')
        print(f"Input tensor: {input_tensor.shape} on {input_tensor.device}")

        # Run inference
        outputs = repo.infer(
            model_id="yolov8n_test",
            inputs={"images": input_tensor},
            synchronize=True
        )

        print("\n✓ Inference successful!")
        print("\nOutputs:")
        for name, tensor in outputs.items():
            print(f" {name}: {tensor.shape} on {tensor.device} ({tensor.dtype})")

    except Exception as e:
        print(f"\n✗ Inference failed: {e}")
        import traceback
        traceback.print_exc()

    # Cleanup
    print("\n" + "=" * 80)
    print("Cleanup")
    print("=" * 80)
    repo.unload_model("yolov8n_test")
    print("✓ Model unloaded")

    print("\n" + "=" * 80)
    print("Verification Complete!")
    print("=" * 80)


if __name__ == "__main__":
    verify_model()
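Beyond the single dummy inference above, a rough latency check can reuse the same `repo.infer()` call. The sketch below is a hypothetical micro-benchmark that only uses the `infer()` interface shown in this script and must run before `unload_model()` is called; absolute numbers will vary with the GPU and the built engine.

```python
import time
import torch

# Hypothetical latency micro-benchmark around repo.infer (same API as above).
x = torch.rand(1, 3, 640, 640, dtype=torch.float32, device='cuda:0')

# Warm-up pass so one-time initialization is not counted.
repo.infer(model_id="yolov8n_test", inputs={"images": x}, synchronize=True)

iters = 50
t0 = time.time()
for _ in range(iters):
    repo.infer(model_id="yolov8n_test", inputs={"images": x}, synchronize=True)
elapsed = time.time() - t0
print(f"Average latency over {iters} runs: {elapsed / iters * 1000:.2f} ms")
```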