feat: inference subsystem and optimization to decoder
This commit is contained in:
commit
3c83a57e44
19 changed files with 3897 additions and 0 deletions
11
.env.example
Normal file
11
.env.example
Normal file
|
|
@ -0,0 +1,11 @@
|
||||||
|
# RTSP Camera URLs
|
||||||
|
# Add your camera URLs here, one per line with CAMERA_URL_N format
|
||||||
|
|
||||||
|
CAMERA_URL_1=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_2=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_3=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_4=rtsp://user:pass@host/path
|
||||||
|
|
||||||
|
# Add more cameras as needed...
|
||||||
|
# CAMERA_URL_5=rtsp://user:pass@host/path
|
||||||
|
# CAMERA_URL_6=rtsp://user:pass@host/path
|
||||||
6
.gitignore
vendored
Normal file
6
.gitignore
vendored
Normal file
|
|
@ -0,0 +1,6 @@
|
||||||
|
fastapi
|
||||||
|
__pycache__/
|
||||||
|
*.pyc
|
||||||
|
.env
|
||||||
|
.claude
|
||||||
|
models/
|
||||||
13
app.py
Normal file
13
app.py
Normal file
|
|
@ -0,0 +1,13 @@
|
||||||
|
from fastapi import FastAPI
|
||||||
|
|
||||||
|
app = FastAPI()
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/")
|
||||||
|
async def root():
|
||||||
|
return {"message": "Hello World"}
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/health")
|
||||||
|
async def health_check():
|
||||||
|
return {"status": "healthy"}
|
||||||
373
claude.md
Normal file
373
claude.md
Normal file
|
|
@ -0,0 +1,373 @@
|
||||||
|
# GPU-Accelerated RTSP Stream Processing System
|
||||||
|
|
||||||
|
## Project Overview
|
||||||
|
|
||||||
|
A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through shared CUDA context and keeps all processing on the GPU until final JPEG compression.
|
||||||
|
|
||||||
|
## Key Achievements
|
||||||
|
|
||||||
|
- **Shared CUDA Context**: 70% VRAM reduction (from ~200MB to ~60MB per stream)
|
||||||
|
- **Linear VRAM Scaling**: Perfect scaling at 60 MB per additional stream
|
||||||
|
- **Zero-Copy Pipeline**: All processing stays on GPU until JPEG bytes
|
||||||
|
- **Proven Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
### Pipeline Flow
|
||||||
|
```
|
||||||
|
RTSP Stream → PyAV (CPU)
|
||||||
|
↓
|
||||||
|
NVDEC Decode (GPU) → NV12 Format
|
||||||
|
↓
|
||||||
|
NV12 to RGB (GPU) → PyTorch Ops
|
||||||
|
↓
|
||||||
|
nvJPEG Encode (GPU) → JPEG Bytes
|
||||||
|
↓
|
||||||
|
CPU (JPEG only)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Core Components
|
||||||
|
|
||||||
|
#### StreamDecoderFactory
|
||||||
|
Singleton factory managing shared CUDA context across all decoder instances.
|
||||||
|
|
||||||
|
**Key Methods:**
|
||||||
|
- `get_factory(gpu_id)`: Returns singleton instance
|
||||||
|
- `create_decoder(rtsp_url, buffer_size)`: Creates new decoder with shared context
|
||||||
|
|
||||||
|
**CUDA Context Initialization:**
|
||||||
|
```python
|
||||||
|
err, = cuda_driver.cuInit(0)
|
||||||
|
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
|
||||||
|
```
|
||||||
|
|
||||||
|
#### StreamDecoder
|
||||||
|
Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.
|
||||||
|
|
||||||
|
**Key Features:**
|
||||||
|
- Thread-safe frame buffer (deque)
|
||||||
|
- Connection status tracking
|
||||||
|
- Automatic reconnection handling
|
||||||
|
- Background thread for continuous decoding
|
||||||
|
|
||||||
|
**Key Methods:**
|
||||||
|
- `start()`: Start decoding thread
|
||||||
|
- `stop()`: Stop and cleanup
|
||||||
|
- `get_latest_frame()`: Get most recent RGB frame (GPU tensor)
|
||||||
|
- `is_connected()`: Check connection status
|
||||||
|
- `get_buffer_size()`: Current buffer size
|
||||||
|
|
||||||
|
#### JPEGEncoderFactory
|
||||||
|
Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.
|
||||||
|
|
||||||
|
**Key Function:**
|
||||||
|
```python
|
||||||
|
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
|
||||||
|
"""
|
||||||
|
Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
|
||||||
|
Uses __cuda_array_interface__ for zero-copy operation.
|
||||||
|
|
||||||
|
Performance: 1-2ms per 720p frame
|
||||||
|
"""
|
||||||
|
```
|
||||||
|
|
||||||
|
## Technical Implementation
|
||||||
|
|
||||||
|
### Shared CUDA Context Pattern
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Single shared context for all decoders
|
||||||
|
factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
|
||||||
|
# All decoders share same context
|
||||||
|
decoder1 = factory.create_decoder(url1, buffer_size=30)
|
||||||
|
decoder2 = factory.create_decoder(url2, buffer_size=30)
|
||||||
|
decoder3 = factory.create_decoder(url3, buffer_size=30)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Benefits:**
|
||||||
|
- 70% VRAM reduction per stream
|
||||||
|
- Single decoder initialization overhead
|
||||||
|
- Efficient resource sharing
|
||||||
|
|
||||||
|
### NV12 to RGB Conversion (GPU)
|
||||||
|
|
||||||
|
```python
|
||||||
|
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
|
||||||
|
"""
|
||||||
|
Converts NV12 (YUV420) to RGB entirely on GPU using PyTorch ops.
|
||||||
|
Uses BT.601 color space conversion.
|
||||||
|
|
||||||
|
Input: (height * 1.5, width) NV12 tensor
|
||||||
|
Output: (3, height, width) RGB tensor
|
||||||
|
"""
|
||||||
|
```
|
||||||
|
|
||||||
|
**Steps:**
|
||||||
|
1. Split Y and UV planes
|
||||||
|
2. Deinterleave UV components
|
||||||
|
3. Upsample chroma (bilinear interpolation)
|
||||||
|
4. Apply BT.601 color matrix
|
||||||
|
5. Clamp to [0, 255]
|
||||||
|
|
||||||
|
### Zero-Copy Operations
|
||||||
|
|
||||||
|
**DLPack for PyTorch ↔ nvImageCodec:**
|
||||||
|
```python
|
||||||
|
# GPU tensor stays on GPU
|
||||||
|
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
|
||||||
|
nv_image = nvimgcodec.as_image(rgb_hwc) # Uses __cuda_array_interface__
|
||||||
|
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Performance Metrics
|
||||||
|
|
||||||
|
### VRAM Usage (Python Process)
|
||||||
|
|
||||||
|
| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|
||||||
|
|---------|-----------|----------|------------|---------------|
|
||||||
|
| 0 | 216 MB | 0 MB | - | - |
|
||||||
|
| 1 | 278 MB | 62 MB | 62.0 MB | 62 MB |
|
||||||
|
| 2 | 338 MB | 122 MB | 61.0 MB | 60 MB |
|
||||||
|
| 3 | 398 MB | 182 MB | 60.7 MB | 60 MB |
|
||||||
|
| 4 | 458 MB | 242 MB | 60.5 MB | 60 MB |
|
||||||
|
|
||||||
|
**Result:** Perfect linear scaling at ~60 MB per stream
|
||||||
|
|
||||||
|
### Capacity Estimates
|
||||||
|
|
||||||
|
With 60 MB per stream + 216 MB baseline:
|
||||||
|
|
||||||
|
- **16GB GPU**: ~269 cameras (conservative: ~250)
|
||||||
|
- **24GB GPU**: ~407 cameras (conservative: ~380)
|
||||||
|
- **48GB GPU**: ~815 cameras (conservative: ~780)
|
||||||
|
- **For 1000 streams**: ~60GB VRAM required
|
||||||
|
|
||||||
|
### Throughput
|
||||||
|
|
||||||
|
- **Frame Rate**: 7-7.5 FPS per stream @ 720p
|
||||||
|
- **JPEG Encoding**: 1-2ms per frame
|
||||||
|
- **Connection Time**: ~15s for stream stabilization
|
||||||
|
|
||||||
|
## Project Structure
|
||||||
|
|
||||||
|
```
|
||||||
|
python-rtsp-worker/
|
||||||
|
├── app.py # FastAPI application
|
||||||
|
├── services/
|
||||||
|
│ ├── __init__.py # Package exports
|
||||||
|
│ ├── stream_decoder.py # StreamDecoder & Factory
|
||||||
|
│ └── jpeg_encoder.py # JPEG encoding utilities
|
||||||
|
├── test_stream.py # Single stream test
|
||||||
|
├── test_multi_stream.py # 4-stream test with monitoring
|
||||||
|
├── test_vram_scaling.py # System VRAM measurement
|
||||||
|
├── test_vram_process.py # Process VRAM measurement
|
||||||
|
├── test_jpeg_encode.py # JPEG encoding test
|
||||||
|
├── requirements.txt # Python dependencies
|
||||||
|
├── .env # Camera URLs (gitignored)
|
||||||
|
├── .env.example # Template for camera URLs
|
||||||
|
└── .gitignore
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
```
|
||||||
|
fastapi # Web framework
|
||||||
|
uvicorn[standard] # ASGI server
|
||||||
|
torch # GPU tensor operations
|
||||||
|
PyNvVideoCodec # NVDEC hardware decoding
|
||||||
|
av # FFmpeg/RTSP client
|
||||||
|
cuda-python # CUDA driver bindings
|
||||||
|
nvidia-nvimgcodec-cu12 # nvJPEG encoding
|
||||||
|
python-dotenv # Environment variables
|
||||||
|
```
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### Environment Variables (.env)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# RTSP Camera URLs
|
||||||
|
CAMERA_URL_1=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_2=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_3=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_4=rtsp://user:pass@host/path
|
||||||
|
# Add more as needed...
|
||||||
|
```
|
||||||
|
|
||||||
|
### Loading URLs in Code
|
||||||
|
|
||||||
|
```python
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
import os
|
||||||
|
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
camera_urls = []
|
||||||
|
i = 1
|
||||||
|
while True:
|
||||||
|
url = os.getenv(f'CAMERA_URL_{i}')
|
||||||
|
if url:
|
||||||
|
camera_urls.append(url)
|
||||||
|
i += 1
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
```
|
||||||
|
|
||||||
|
## Usage Examples
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services import StreamDecoderFactory, encode_frame_to_jpeg
|
||||||
|
|
||||||
|
# Create factory (shared CUDA context)
|
||||||
|
factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
|
||||||
|
# Create decoder
|
||||||
|
decoder = factory.create_decoder(
|
||||||
|
rtsp_url="rtsp://user:pass@host/path",
|
||||||
|
buffer_size=30
|
||||||
|
)
|
||||||
|
|
||||||
|
# Start decoding
|
||||||
|
decoder.start()
|
||||||
|
|
||||||
|
# Wait for connection
|
||||||
|
import time
|
||||||
|
time.sleep(5)
|
||||||
|
|
||||||
|
# Get latest frame (GPU tensor)
|
||||||
|
rgb_frame = decoder.get_latest_frame()
|
||||||
|
if rgb_frame is not None:
|
||||||
|
# Encode to JPEG (on GPU)
|
||||||
|
jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)
|
||||||
|
|
||||||
|
# Save or transmit jpeg_bytes
|
||||||
|
with open("frame.jpg", "wb") as f:
|
||||||
|
f.write(jpeg_bytes)
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
decoder.stop()
|
||||||
|
```
|
||||||
|
|
||||||
|
### Multi-Stream Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services import StreamDecoderFactory
|
||||||
|
import time
|
||||||
|
|
||||||
|
factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
|
||||||
|
# Create multiple decoders (all share context)
|
||||||
|
decoders = []
|
||||||
|
for url in camera_urls:
|
||||||
|
decoder = factory.create_decoder(url, buffer_size=30)
|
||||||
|
decoder.start()
|
||||||
|
decoders.append(decoder)
|
||||||
|
|
||||||
|
# Wait for connections
|
||||||
|
time.sleep(15)
|
||||||
|
|
||||||
|
# Check status
|
||||||
|
for i, decoder in enumerate(decoders):
|
||||||
|
status = decoder.get_status()
|
||||||
|
buffer_size = decoder.get_buffer_size()
|
||||||
|
connected = decoder.is_connected()
|
||||||
|
print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")
|
||||||
|
|
||||||
|
# Process frames
|
||||||
|
for decoder in decoders:
|
||||||
|
frame = decoder.get_latest_frame()
|
||||||
|
if frame is not None:
|
||||||
|
# Process frame...
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
for decoder in decoders:
|
||||||
|
decoder.stop()
|
||||||
|
```
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
### Run Single Stream Test
|
||||||
|
```bash
|
||||||
|
python test_stream.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run 4-Stream Test with VRAM Monitoring
|
||||||
|
```bash
|
||||||
|
python test_multi_stream.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Measure VRAM Scaling
|
||||||
|
```bash
|
||||||
|
python test_vram_process.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Test JPEG Encoding
|
||||||
|
```bash
|
||||||
|
python test_jpeg_encode.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Known Issues
|
||||||
|
|
||||||
|
### Segmentation Faults on Cleanup
|
||||||
|
**Status**: Non-critical
|
||||||
|
**Impact**: Occurs during cleanup, doesn't affect core functionality
|
||||||
|
**Cause**: Likely CUDA context cleanup order issues
|
||||||
|
**Workaround**: Functionality works correctly; cleanup errors can be ignored
|
||||||
|
|
||||||
|
## Technical Decisions
|
||||||
|
|
||||||
|
### Why PyNvVideoCodec?
|
||||||
|
- Direct access to NVDEC hardware decoder
|
||||||
|
- Minimal overhead compared to FFmpeg/torchaudio
|
||||||
|
- Returns GPU tensors via DLPack
|
||||||
|
- Better control over decode sessions
|
||||||
|
|
||||||
|
### Why Shared CUDA Context?
|
||||||
|
- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
|
||||||
|
- Enables 1000-stream target on 60GB GPU
|
||||||
|
- Minimal complexity overhead with singleton pattern
|
||||||
|
|
||||||
|
### Why nvImageCodec?
|
||||||
|
- GPU-native JPEG encoding (nvJPEG)
|
||||||
|
- Zero-copy with PyTorch via `__cuda_array_interface__`
|
||||||
|
- 1-2ms encoding time per 720p frame
|
||||||
|
- Keeps data on GPU until final compression
|
||||||
|
|
||||||
|
### Why Thread-Safe Ring Buffer?
|
||||||
|
- Decouples decoding from inference pipeline
|
||||||
|
- Prevents frame drops during processing spikes
|
||||||
|
- Allows async frame access
|
||||||
|
- Configurable buffer size per stream
|
||||||
|
|
||||||
|
## Future Considerations
|
||||||
|
|
||||||
|
### Hardware Decode Session Limits
|
||||||
|
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
|
||||||
|
- May need multiple GPUs for 1000 streams
|
||||||
|
- Test with actual hardware to verify limits
|
||||||
|
|
||||||
|
### Scaling Beyond 1000 Streams
|
||||||
|
- Multi-GPU support with context per GPU
|
||||||
|
- Load balancing across GPUs
|
||||||
|
- Network bandwidth considerations
|
||||||
|
|
||||||
|
### TensorRT Integration
|
||||||
|
- Next step: Integrate with TensorRT inference pipeline
|
||||||
|
- GPU frames → TensorRT → Results
|
||||||
|
- Keep entire pipeline on GPU
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
|
||||||
|
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
|
||||||
|
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
|
||||||
|
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.
|
||||||
11
requirements.dev.txt
Normal file
11
requirements.dev.txt
Normal file
|
|
@ -0,0 +1,11 @@
|
||||||
|
# Development Dependencies
|
||||||
|
# Install with: pip install -r requirements.dev.txt
|
||||||
|
|
||||||
|
# Model conversion tools
|
||||||
|
tensorrt
|
||||||
|
onnx
|
||||||
|
ultralytics # For YOLO models download and export
|
||||||
|
|
||||||
|
# Optional: Additional tools for model optimization
|
||||||
|
onnxruntime-gpu # ONNX runtime for testing
|
||||||
|
onnx-simplifier # Simplify ONNX models
|
||||||
8
requirements.txt
Normal file
8
requirements.txt
Normal file
|
|
@ -0,0 +1,8 @@
|
||||||
|
fastapi
|
||||||
|
uvicorn[standard]
|
||||||
|
torch
|
||||||
|
PyNvVideoCodec
|
||||||
|
av
|
||||||
|
cuda-python
|
||||||
|
nvidia-nvimgcodec-cu12 # GPU-accelerated JPEG encoding/decoding with nvJPEG
|
||||||
|
python-dotenv # Load environment variables from .env file
|
||||||
197
scripts/README.md
Normal file
197
scripts/README.md
Normal file
|
|
@ -0,0 +1,197 @@
|
||||||
|
# Scripts Directory
|
||||||
|
|
||||||
|
This directory contains utility scripts for the python-rtsp-worker project.
|
||||||
|
|
||||||
|
## convert_pt_to_tensorrt.py
|
||||||
|
|
||||||
|
Converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized GPU inference.
|
||||||
|
|
||||||
|
### Features
|
||||||
|
|
||||||
|
- **Multiple Precision Modes**: FP32, FP16, INT8
|
||||||
|
- **Dynamic Batch Size**: Support for variable batch sizes
|
||||||
|
- **Automatic Optimization**: Creates optimization profiles for best performance
|
||||||
|
- **ONNX Intermediate**: Uses ONNX as intermediate format for compatibility
|
||||||
|
- **Easy to Use**: Simple command-line interface
|
||||||
|
|
||||||
|
### Requirements
|
||||||
|
|
||||||
|
Make sure you have the following dependencies installed:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install torch tensorrt onnx
|
||||||
|
```
|
||||||
|
|
||||||
|
### Quick Start
|
||||||
|
|
||||||
|
**Basic conversion (FP32)**:
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model path/to/model.pt \
|
||||||
|
--output models/model.trt
|
||||||
|
```
|
||||||
|
|
||||||
|
**FP16 precision** (recommended for most cases - 2x faster, minimal accuracy loss):
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model yolov8n.pt \
|
||||||
|
--output models/yolov8n.trt \
|
||||||
|
--fp16
|
||||||
|
```
|
||||||
|
|
||||||
|
**Custom input shape**:
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model model.pt \
|
||||||
|
--output model.trt \
|
||||||
|
--input-shape 1,3,416,416
|
||||||
|
```
|
||||||
|
|
||||||
|
**Dynamic batch size** (for variable batch inference):
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model model.pt \
|
||||||
|
--output model.trt \
|
||||||
|
--dynamic-batch \
|
||||||
|
--max-batch 16
|
||||||
|
```
|
||||||
|
|
||||||
|
**Maximum optimization** (FP16 + INT8):
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model model.pt \
|
||||||
|
--output model.trt \
|
||||||
|
--fp16 \
|
||||||
|
--int8
|
||||||
|
```
|
||||||
|
|
||||||
|
### Command-Line Arguments
|
||||||
|
|
||||||
|
| Argument | Required | Default | Description |
|
||||||
|
|----------|----------|---------|-------------|
|
||||||
|
| `--model`, `-m` | Yes | - | Path to PyTorch model file (.pt or .pth) |
|
||||||
|
| `--output`, `-o` | Yes | - | Output path for TensorRT engine (.trt) |
|
||||||
|
| `--input-shape`, `-s` | No | 1,3,640,640 | Input tensor shape as B,C,H,W |
|
||||||
|
| `--fp16` | No | False | Enable FP16 precision (faster, ~same accuracy) |
|
||||||
|
| `--int8` | No | False | Enable INT8 precision (fastest, needs calibration) |
|
||||||
|
| `--dynamic-batch` | No | False | Enable dynamic batch size support |
|
||||||
|
| `--max-batch` | No | 16 | Maximum batch size for dynamic batching |
|
||||||
|
| `--workspace-size` | No | 4 | TensorRT workspace size in GB |
|
||||||
|
| `--gpu` | No | 0 | GPU device ID to use |
|
||||||
|
| `--input-names` | No | ["input"] | Custom input tensor names |
|
||||||
|
| `--output-names` | No | ["output"] | Custom output tensor names |
|
||||||
|
| `--keep-onnx` | No | False | Keep intermediate ONNX file for debugging |
|
||||||
|
| `--verbose`, `-v` | No | False | Enable verbose logging |
|
||||||
|
|
||||||
|
### Performance Tips
|
||||||
|
|
||||||
|
1. **Always use FP16** unless you need FP32 precision:
|
||||||
|
- 2x faster inference
|
||||||
|
- 50% less VRAM usage
|
||||||
|
- Minimal accuracy loss for most models
|
||||||
|
|
||||||
|
2. **Use dynamic batching** for variable workloads:
|
||||||
|
- Process 1-16 images with same engine
|
||||||
|
- Automatic optimization for common batch sizes
|
||||||
|
|
||||||
|
3. **Increase workspace size** for complex models:
|
||||||
|
- Default 4GB works for most models
|
||||||
|
- Increase to 8GB for very large models
|
||||||
|
|
||||||
|
4. **INT8 quantization** for maximum speed:
|
||||||
|
- Requires calibration data (not included in basic conversion)
|
||||||
|
- 4x faster than FP32
|
||||||
|
- Best for deployment scenarios
|
||||||
|
|
||||||
|
### Integration with Model Repository
|
||||||
|
|
||||||
|
Once converted, use the TensorRT engine with the model repository:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services.model_repository import TensorRTModelRepository
|
||||||
|
|
||||||
|
# Initialize repository
|
||||||
|
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||||
|
|
||||||
|
# Load the converted model
|
||||||
|
repo.load_model(
|
||||||
|
model_id="my_model",
|
||||||
|
file_path="models/model.trt",
|
||||||
|
num_contexts=4
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run inference
|
||||||
|
import torch
|
||||||
|
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id="my_model",
|
||||||
|
inputs={"input": input_tensor}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Troubleshooting
|
||||||
|
|
||||||
|
**Issue**: `Failed to parse ONNX model`
|
||||||
|
- Solution: Check if your PyTorch model is compatible with ONNX export
|
||||||
|
- Try updating PyTorch and ONNX versions
|
||||||
|
|
||||||
|
**Issue**: `FP16 not supported on this platform`
|
||||||
|
- Solution: Your GPU doesn't support FP16. Remove `--fp16` flag
|
||||||
|
|
||||||
|
**Issue**: `Out of memory during conversion`
|
||||||
|
- Solution: Reduce `--workspace-size` or free up GPU memory
|
||||||
|
|
||||||
|
**Issue**: `Model contains only state_dict`
|
||||||
|
- Solution: Your checkpoint only has weights. You need the full model architecture.
|
||||||
|
- Modify the script's `load_pytorch_model()` method to instantiate your model class
|
||||||
|
|
||||||
|
### Examples for Common Models
|
||||||
|
|
||||||
|
**YOLOv8**:
|
||||||
|
```bash
|
||||||
|
# Download model first
|
||||||
|
# yolo export model=yolov8n.pt format=engine device=0
|
||||||
|
|
||||||
|
# Or use this script
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model yolov8n.pt \
|
||||||
|
--output models/yolov8n.trt \
|
||||||
|
--input-shape 1,3,640,640 \
|
||||||
|
--fp16
|
||||||
|
```
|
||||||
|
|
||||||
|
**ResNet**:
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model resnet50.pt \
|
||||||
|
--output models/resnet50.trt \
|
||||||
|
--input-shape 1,3,224,224 \
|
||||||
|
--fp16 \
|
||||||
|
--dynamic-batch \
|
||||||
|
--max-batch 32
|
||||||
|
```
|
||||||
|
|
||||||
|
**Custom Model**:
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model custom_model.pt \
|
||||||
|
--output models/custom.trt \
|
||||||
|
--input-shape 1,3,512,512 \
|
||||||
|
--input-names image \
|
||||||
|
--output-names predictions \
|
||||||
|
--fp16 \
|
||||||
|
--verbose
|
||||||
|
```
|
||||||
|
|
||||||
|
### Notes
|
||||||
|
|
||||||
|
- The script uses ONNX as an intermediate format, which is the recommended approach
|
||||||
|
- TensorRT engines are hardware-specific; rebuild for different GPUs
|
||||||
|
- Conversion time varies (30 seconds to 5 minutes depending on model size)
|
||||||
|
- The first inference after loading is slower (warmup)
|
||||||
|
|
||||||
|
### Support
|
||||||
|
|
||||||
|
For issues or questions, please check:
|
||||||
|
- TensorRT documentation: https://docs.nvidia.com/deeplearning/tensorrt/
|
||||||
|
- PyTorch ONNX export guide: https://pytorch.org/docs/stable/onnx.html
|
||||||
562
scripts/convert_pt_to_tensorrt.py
Executable file
562
scripts/convert_pt_to_tensorrt.py
Executable file
|
|
@ -0,0 +1,562 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
PyTorch to TensorRT Model Conversion Script
|
||||||
|
|
||||||
|
This script converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized inference.
|
||||||
|
|
||||||
|
Features:
|
||||||
|
- Automatic FP32/FP16/INT8 precision modes
|
||||||
|
- Dynamic batch size support
|
||||||
|
- Input shape validation
|
||||||
|
- Optimization profiles for dynamic shapes
|
||||||
|
- ONNX intermediate format
|
||||||
|
- GPU-accelerated conversion
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python convert_pt_to_tensorrt.py --model path/to/model.pt --output models/model.trt
|
||||||
|
python convert_pt_to_tensorrt.py --model yolov8n.pt --input-shape 1 3 640 640 --fp16
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --dynamic-batch --max-batch 16
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Tuple, List, Optional
|
||||||
|
import torch
|
||||||
|
import tensorrt as trt
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
|
||||||
|
class TensorRTConverter:
|
||||||
|
"""Converts PyTorch models to TensorRT engines"""
|
||||||
|
|
||||||
|
def __init__(self, gpu_id: int = 0, verbose: bool = True):
|
||||||
|
"""
|
||||||
|
Initialize the converter.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
gpu_id: GPU device ID to use for conversion
|
||||||
|
verbose: Enable verbose logging
|
||||||
|
"""
|
||||||
|
self.gpu_id = gpu_id
|
||||||
|
self.device = torch.device(f'cuda:{gpu_id}')
|
||||||
|
|
||||||
|
# TensorRT logger
|
||||||
|
log_level = trt.Logger.VERBOSE if verbose else trt.Logger.WARNING
|
||||||
|
self.logger = trt.Logger(log_level)
|
||||||
|
|
||||||
|
# Set CUDA device
|
||||||
|
torch.cuda.set_device(gpu_id)
|
||||||
|
|
||||||
|
print(f"Initialized TensorRT Converter on GPU {gpu_id}")
|
||||||
|
print(f"PyTorch version: {torch.__version__}")
|
||||||
|
print(f"TensorRT version: {trt.__version__}")
|
||||||
|
print(f"CUDA available: {torch.cuda.is_available()}")
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
print(f"CUDA device: {torch.cuda.get_device_name(gpu_id)}")
|
||||||
|
|
||||||
|
def load_pytorch_model(self, model_path: str) -> torch.nn.Module:
|
||||||
|
"""
|
||||||
|
Load PyTorch model from file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_path: Path to .pt or .pth file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Loaded PyTorch model in eval mode
|
||||||
|
"""
|
||||||
|
print(f"\nLoading PyTorch model from {model_path}...")
|
||||||
|
|
||||||
|
if not Path(model_path).exists():
|
||||||
|
raise FileNotFoundError(f"Model file not found: {model_path}")
|
||||||
|
|
||||||
|
# Load model (weights_only=False for models with custom classes)
|
||||||
|
checkpoint = torch.load(model_path, map_location=self.device, weights_only=False)
|
||||||
|
|
||||||
|
# Handle different checkpoint formats
|
||||||
|
if isinstance(checkpoint, dict):
|
||||||
|
if 'model' in checkpoint:
|
||||||
|
model = checkpoint['model']
|
||||||
|
elif 'state_dict' in checkpoint:
|
||||||
|
# Need model architecture - this is a limitation
|
||||||
|
raise ValueError(
|
||||||
|
"Checkpoint contains only state_dict. "
|
||||||
|
"Please provide the complete model or modify this script to load your architecture."
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
raise ValueError("Unknown checkpoint format")
|
||||||
|
else:
|
||||||
|
model = checkpoint
|
||||||
|
|
||||||
|
# Set to eval mode
|
||||||
|
model.eval()
|
||||||
|
model.to(self.device)
|
||||||
|
|
||||||
|
print(f"✓ Model loaded successfully")
|
||||||
|
return model
|
||||||
|
|
||||||
|
def export_to_onnx(self, model: torch.nn.Module, input_shape: Tuple[int, ...],
|
||||||
|
onnx_path: str, dynamic_batch: bool = False,
|
||||||
|
input_names: List[str] = None, output_names: List[str] = None) -> str:
|
||||||
|
"""
|
||||||
|
Export PyTorch model to ONNX format (intermediate step).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model: PyTorch model
|
||||||
|
input_shape: Input tensor shape (B, C, H, W)
|
||||||
|
onnx_path: Output path for ONNX file
|
||||||
|
dynamic_batch: Enable dynamic batch dimension
|
||||||
|
input_names: List of input tensor names
|
||||||
|
output_names: List of output tensor names
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to exported ONNX file
|
||||||
|
"""
|
||||||
|
print(f"\nExporting to ONNX format...")
|
||||||
|
print(f"Input shape: {input_shape}")
|
||||||
|
print(f"Dynamic batch: {dynamic_batch}")
|
||||||
|
|
||||||
|
# Default names
|
||||||
|
if input_names is None:
|
||||||
|
input_names = ['input']
|
||||||
|
if output_names is None:
|
||||||
|
output_names = ['output']
|
||||||
|
|
||||||
|
# Create dummy input
|
||||||
|
dummy_input = torch.randn(*input_shape, device=self.device)
|
||||||
|
|
||||||
|
# Dynamic axes configuration
|
||||||
|
dynamic_axes = None
|
||||||
|
if dynamic_batch:
|
||||||
|
dynamic_axes = {
|
||||||
|
input_names[0]: {0: 'batch'},
|
||||||
|
output_names[0]: {0: 'batch'}
|
||||||
|
}
|
||||||
|
|
||||||
|
# Export to ONNX
|
||||||
|
torch.onnx.export(
|
||||||
|
model,
|
||||||
|
dummy_input,
|
||||||
|
onnx_path,
|
||||||
|
input_names=input_names,
|
||||||
|
output_names=output_names,
|
||||||
|
dynamic_axes=dynamic_axes,
|
||||||
|
opset_version=17, # Use recent ONNX opset
|
||||||
|
do_constant_folding=True,
|
||||||
|
verbose=False
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"✓ ONNX model exported to {onnx_path}")
|
||||||
|
return onnx_path
|
||||||
|
|
||||||
|
def build_tensorrt_engine_from_onnx(self, onnx_path: str, engine_path: str,
|
||||||
|
fp16: bool = False, int8: bool = False,
|
||||||
|
max_workspace_size: int = 4,
|
||||||
|
min_batch: int = 1, opt_batch: int = 1, max_batch: int = 1) -> str:
|
||||||
|
"""
|
||||||
|
Build TensorRT engine from ONNX model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
onnx_path: Path to ONNX model
|
||||||
|
engine_path: Output path for TensorRT engine
|
||||||
|
fp16: Enable FP16 precision
|
||||||
|
int8: Enable INT8 precision (requires calibration)
|
||||||
|
max_workspace_size: Maximum workspace size in GB
|
||||||
|
min_batch: Minimum batch size for optimization
|
||||||
|
opt_batch: Optimal batch size for optimization
|
||||||
|
max_batch: Maximum batch size for optimization
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to built TensorRT engine
|
||||||
|
"""
|
||||||
|
print(f"\nBuilding TensorRT engine from ONNX...")
|
||||||
|
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||||
|
print(f"Workspace size: {max_workspace_size} GB")
|
||||||
|
|
||||||
|
# Create builder and network
|
||||||
|
builder = trt.Builder(self.logger)
|
||||||
|
network = builder.create_network(
|
||||||
|
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
|
||||||
|
)
|
||||||
|
parser = trt.OnnxParser(network, self.logger)
|
||||||
|
|
||||||
|
# Parse ONNX model
|
||||||
|
print(f"Loading ONNX file from {onnx_path}...")
|
||||||
|
with open(onnx_path, 'rb') as f:
|
||||||
|
if not parser.parse(f.read()):
|
||||||
|
print("ERROR: Failed to parse the ONNX file:")
|
||||||
|
for error in range(parser.num_errors):
|
||||||
|
print(f" {parser.get_error(error)}")
|
||||||
|
raise RuntimeError("Failed to parse ONNX model")
|
||||||
|
|
||||||
|
print(f"✓ ONNX model parsed successfully")
|
||||||
|
|
||||||
|
# Print network info
|
||||||
|
print(f"\nNetwork Information:")
|
||||||
|
print(f" Inputs: {network.num_inputs}")
|
||||||
|
for i in range(network.num_inputs):
|
||||||
|
inp = network.get_input(i)
|
||||||
|
print(f" [{i}] {inp.name}: {inp.shape} ({inp.dtype})")
|
||||||
|
|
||||||
|
print(f" Outputs: {network.num_outputs}")
|
||||||
|
for i in range(network.num_outputs):
|
||||||
|
out = network.get_output(i)
|
||||||
|
print(f" [{i}] {out.name}: {out.shape} ({out.dtype})")
|
||||||
|
|
||||||
|
# Create builder config
|
||||||
|
config = builder.create_builder_config()
|
||||||
|
|
||||||
|
# Set workspace size
|
||||||
|
config.set_memory_pool_limit(
|
||||||
|
trt.MemoryPoolType.WORKSPACE,
|
||||||
|
max_workspace_size * (1 << 30) # GB to bytes
|
||||||
|
)
|
||||||
|
|
||||||
|
# Enable precision modes
|
||||||
|
if fp16:
|
||||||
|
if not builder.platform_has_fast_fp16:
|
||||||
|
print("Warning: FP16 not supported on this platform, using FP32")
|
||||||
|
else:
|
||||||
|
config.set_flag(trt.BuilderFlag.FP16)
|
||||||
|
print("✓ FP16 mode enabled")
|
||||||
|
|
||||||
|
if int8:
|
||||||
|
if not builder.platform_has_fast_int8:
|
||||||
|
print("Warning: INT8 not supported on this platform, using FP32/FP16")
|
||||||
|
else:
|
||||||
|
config.set_flag(trt.BuilderFlag.INT8)
|
||||||
|
print("✓ INT8 mode enabled")
|
||||||
|
print("Note: INT8 calibration not implemented. Results may be suboptimal.")
|
||||||
|
|
||||||
|
# Set optimization profile for dynamic shapes
|
||||||
|
if max_batch > 1 or min_batch != max_batch:
|
||||||
|
profile = builder.create_optimization_profile()
|
||||||
|
|
||||||
|
for i in range(network.num_inputs):
|
||||||
|
inp = network.get_input(i)
|
||||||
|
shape = list(inp.shape)
|
||||||
|
|
||||||
|
# Handle dynamic batch dimension
|
||||||
|
if shape[0] == -1:
|
||||||
|
# Min, opt, max shapes
|
||||||
|
min_shape = [min_batch] + shape[1:]
|
||||||
|
opt_shape = [opt_batch] + shape[1:]
|
||||||
|
max_shape = [max_batch] + shape[1:]
|
||||||
|
|
||||||
|
profile.set_shape(inp.name, min_shape, opt_shape, max_shape)
|
||||||
|
print(f" Dynamic shape for {inp.name}:")
|
||||||
|
print(f" Min: {min_shape}")
|
||||||
|
print(f" Opt: {opt_shape}")
|
||||||
|
print(f" Max: {max_shape}")
|
||||||
|
|
||||||
|
config.add_optimization_profile(profile)
|
||||||
|
|
||||||
|
# Build engine
|
||||||
|
print(f"\nBuilding TensorRT engine (this may take a few minutes)...")
|
||||||
|
serialized_engine = builder.build_serialized_network(network, config)
|
||||||
|
|
||||||
|
if serialized_engine is None:
|
||||||
|
raise RuntimeError("Failed to build TensorRT engine")
|
||||||
|
|
||||||
|
# Save engine to file
|
||||||
|
print(f"Saving engine to {engine_path}...")
|
||||||
|
with open(engine_path, 'wb') as f:
|
||||||
|
f.write(serialized_engine)
|
||||||
|
|
||||||
|
# Get file size
|
||||||
|
file_size_mb = Path(engine_path).stat().st_size / (1024 * 1024)
|
||||||
|
print(f"✓ TensorRT engine built successfully")
|
||||||
|
print(f" Engine size: {file_size_mb:.2f} MB")
|
||||||
|
|
||||||
|
return engine_path
|
||||||
|
|
||||||
|
def convert(self, model_path: str, output_path: str,
|
||||||
|
input_shape: Tuple[int, ...] = (1, 3, 640, 640),
|
||||||
|
fp16: bool = False, int8: bool = False,
|
||||||
|
dynamic_batch: bool = False,
|
||||||
|
max_batch: int = 16,
|
||||||
|
workspace_size: int = 4,
|
||||||
|
input_names: List[str] = None,
|
||||||
|
output_names: List[str] = None,
|
||||||
|
keep_onnx: bool = False) -> str:
|
||||||
|
"""
|
||||||
|
Convert PyTorch or ONNX model to TensorRT engine.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_path: Path to PyTorch model (.pt, .pth) or ONNX model (.onnx)
|
||||||
|
output_path: Path for output TensorRT engine (.trt)
|
||||||
|
input_shape: Input tensor shape (B, C, H, W) - required for PyTorch models
|
||||||
|
fp16: Enable FP16 precision
|
||||||
|
int8: Enable INT8 precision
|
||||||
|
dynamic_batch: Enable dynamic batch size
|
||||||
|
max_batch: Maximum batch size (for dynamic batching)
|
||||||
|
workspace_size: TensorRT workspace size in GB
|
||||||
|
input_names: Custom input names (for PyTorch export)
|
||||||
|
output_names: Custom output names (for PyTorch export)
|
||||||
|
keep_onnx: Keep intermediate ONNX file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to created TensorRT engine
|
||||||
|
"""
|
||||||
|
# Create output directory
|
||||||
|
output_dir = Path(output_path).parent
|
||||||
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
# Check if input is already ONNX
|
||||||
|
model_path_obj = Path(model_path)
|
||||||
|
is_onnx = model_path_obj.suffix.lower() == '.onnx'
|
||||||
|
|
||||||
|
if is_onnx:
|
||||||
|
# Direct ONNX to TensorRT conversion
|
||||||
|
print(f"Input is ONNX model, converting directly to TensorRT...")
|
||||||
|
|
||||||
|
min_batch = 1
|
||||||
|
opt_batch = input_shape[0] if not dynamic_batch else max(1, max_batch // 2)
|
||||||
|
max_batch_size = max_batch if dynamic_batch else input_shape[0]
|
||||||
|
|
||||||
|
engine_path = self.build_tensorrt_engine_from_onnx(
|
||||||
|
onnx_path=model_path,
|
||||||
|
engine_path=output_path,
|
||||||
|
fp16=fp16,
|
||||||
|
int8=int8,
|
||||||
|
max_workspace_size=workspace_size,
|
||||||
|
min_batch=min_batch,
|
||||||
|
opt_batch=opt_batch,
|
||||||
|
max_batch=max_batch_size
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"\n{'=' * 80}")
|
||||||
|
print(f"CONVERSION COMPLETED SUCCESSFULLY")
|
||||||
|
print(f"{'=' * 80}")
|
||||||
|
print(f"Input: {model_path}")
|
||||||
|
print(f"Output: {engine_path}")
|
||||||
|
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||||
|
print(f"{'=' * 80}")
|
||||||
|
|
||||||
|
return engine_path
|
||||||
|
|
||||||
|
# PyTorch to TensorRT conversion (via ONNX)
|
||||||
|
# Temporary ONNX path
|
||||||
|
onnx_path = str(output_dir / "temp_model.onnx")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Step 1: Load PyTorch model
|
||||||
|
model = self.load_pytorch_model(model_path)
|
||||||
|
|
||||||
|
# Step 2: Export to ONNX
|
||||||
|
self.export_to_onnx(
|
||||||
|
model=model,
|
||||||
|
input_shape=input_shape,
|
||||||
|
onnx_path=onnx_path,
|
||||||
|
dynamic_batch=dynamic_batch,
|
||||||
|
input_names=input_names,
|
||||||
|
output_names=output_names
|
||||||
|
)
|
||||||
|
|
||||||
|
# Step 3: Build TensorRT engine
|
||||||
|
min_batch = 1
|
||||||
|
opt_batch = input_shape[0] if not dynamic_batch else max(1, max_batch // 2)
|
||||||
|
max_batch_size = max_batch if dynamic_batch else input_shape[0]
|
||||||
|
|
||||||
|
engine_path = self.build_tensorrt_engine_from_onnx(
|
||||||
|
onnx_path=onnx_path,
|
||||||
|
engine_path=output_path,
|
||||||
|
fp16=fp16,
|
||||||
|
int8=int8,
|
||||||
|
max_workspace_size=workspace_size,
|
||||||
|
min_batch=min_batch,
|
||||||
|
opt_batch=opt_batch,
|
||||||
|
max_batch=max_batch_size
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"\n{'=' * 80}")
|
||||||
|
print(f"CONVERSION COMPLETED SUCCESSFULLY")
|
||||||
|
print(f"{'=' * 80}")
|
||||||
|
print(f"Input: {model_path}")
|
||||||
|
print(f"Output: {engine_path}")
|
||||||
|
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||||
|
print(f"Dynamic batch: {dynamic_batch}")
|
||||||
|
if dynamic_batch:
|
||||||
|
print(f"Batch range: [1, {max_batch}]")
|
||||||
|
print(f"{'=' * 80}")
|
||||||
|
|
||||||
|
return engine_path
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Cleanup temporary ONNX file
|
||||||
|
if not keep_onnx and Path(onnx_path).exists():
|
||||||
|
Path(onnx_path).unlink()
|
||||||
|
print(f"Cleaned up temporary ONNX file")
|
||||||
|
|
||||||
|
|
||||||
|
def parse_shape(shape_str: str) -> Tuple[int, ...]:
|
||||||
|
"""Parse shape string like '1,3,640,640' to tuple"""
|
||||||
|
try:
|
||||||
|
return tuple(int(x) for x in shape_str.split(','))
|
||||||
|
except ValueError:
|
||||||
|
raise argparse.ArgumentTypeError(
|
||||||
|
f"Invalid shape format: {shape_str}. Expected format: 1,3,640,640"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Convert PyTorch models to TensorRT engines",
|
||||||
|
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||||
|
epilog="""
|
||||||
|
Examples:
|
||||||
|
# Basic conversion (FP32)
|
||||||
|
python convert_pt_to_tensorrt.py --model yolov8n.pt --output models/yolov8n.trt
|
||||||
|
|
||||||
|
# FP16 precision for faster inference
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt --fp16
|
||||||
|
|
||||||
|
# Custom input shape
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||||
|
--input-shape 1,3,416,416
|
||||||
|
|
||||||
|
# Dynamic batch size (1 to 16)
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||||
|
--dynamic-batch --max-batch 16
|
||||||
|
|
||||||
|
# INT8 quantization for maximum speed (requires calibration)
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||||
|
--fp16 --int8
|
||||||
|
|
||||||
|
# Keep intermediate ONNX file for debugging
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||||
|
--keep-onnx
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
# Required arguments
|
||||||
|
parser.add_argument(
|
||||||
|
'--model', '-m',
|
||||||
|
type=str,
|
||||||
|
required=True,
|
||||||
|
help='Path to PyTorch model file (.pt or .pth)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--output', '-o',
|
||||||
|
type=str,
|
||||||
|
required=True,
|
||||||
|
help='Output path for TensorRT engine (.trt or .engine)'
|
||||||
|
)
|
||||||
|
|
||||||
|
# Optional arguments
|
||||||
|
parser.add_argument(
|
||||||
|
'--input-shape', '-s',
|
||||||
|
type=parse_shape,
|
||||||
|
default=(1, 3, 640, 640),
|
||||||
|
help='Input tensor shape as B,C,H,W (default: 1,3,640,640)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--fp16',
|
||||||
|
action='store_true',
|
||||||
|
help='Enable FP16 precision (faster inference, slightly lower accuracy)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--int8',
|
||||||
|
action='store_true',
|
||||||
|
help='Enable INT8 precision (fastest, requires calibration)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--dynamic-batch',
|
||||||
|
action='store_true',
|
||||||
|
help='Enable dynamic batch size support'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--max-batch',
|
||||||
|
type=int,
|
||||||
|
default=16,
|
||||||
|
help='Maximum batch size for dynamic batching (default: 16)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--workspace-size',
|
||||||
|
type=int,
|
||||||
|
default=4,
|
||||||
|
help='TensorRT workspace size in GB (default: 4)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--gpu',
|
||||||
|
type=int,
|
||||||
|
default=0,
|
||||||
|
help='GPU device ID (default: 0)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--input-names',
|
||||||
|
type=str,
|
||||||
|
nargs='+',
|
||||||
|
default=None,
|
||||||
|
help='Custom input tensor names (default: ["input"])'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--output-names',
|
||||||
|
type=str,
|
||||||
|
nargs='+',
|
||||||
|
default=None,
|
||||||
|
help='Custom output tensor names (default: ["output"])'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--keep-onnx',
|
||||||
|
action='store_true',
|
||||||
|
help='Keep intermediate ONNX file'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--verbose', '-v',
|
||||||
|
action='store_true',
|
||||||
|
help='Enable verbose logging'
|
||||||
|
)
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Validate arguments
|
||||||
|
if not Path(args.model).exists():
|
||||||
|
print(f"Error: Model file not found: {args.model}")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
if args.int8 and not args.fp16:
|
||||||
|
print("Warning: INT8 mode works best with FP16 enabled. Adding --fp16 flag.")
|
||||||
|
args.fp16 = True
|
||||||
|
|
||||||
|
# Run conversion
|
||||||
|
try:
|
||||||
|
converter = TensorRTConverter(gpu_id=args.gpu, verbose=args.verbose)
|
||||||
|
|
||||||
|
converter.convert(
|
||||||
|
model_path=args.model,
|
||||||
|
output_path=args.output,
|
||||||
|
input_shape=args.input_shape,
|
||||||
|
fp16=args.fp16,
|
||||||
|
int8=args.int8,
|
||||||
|
dynamic_batch=args.dynamic_batch,
|
||||||
|
max_batch=args.max_batch,
|
||||||
|
workspace_size=args.workspace_size,
|
||||||
|
input_names=args.input_names,
|
||||||
|
output_names=args.output_names,
|
||||||
|
keep_onnx=args.keep_onnx
|
||||||
|
)
|
||||||
|
|
||||||
|
print("\n✓ Conversion successful!")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n✗ Conversion failed: {e}")
|
||||||
|
if args.verbose:
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
380
services/README_MODEL_REPOSITORY.md
Normal file
380
services/README_MODEL_REPOSITORY.md
Normal file
|
|
@ -0,0 +1,380 @@
|
||||||
|
# TensorRT Model Repository
|
||||||
|
|
||||||
|
Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
### Key Features
|
||||||
|
|
||||||
|
1. **Model Deduplication by File Hash**
|
||||||
|
- Multiple model IDs can point to the same model file
|
||||||
|
- Only one engine loaded in VRAM per unique file
|
||||||
|
- Example: 100 cameras with same model = 1 engine (not 100!)
|
||||||
|
|
||||||
|
2. **Context Pooling for Load Balancing**
|
||||||
|
- Each unique engine has N execution contexts (configurable)
|
||||||
|
- Contexts borrowed/returned via mutex-based queue
|
||||||
|
- Enables concurrent inference without context-per-model overhead
|
||||||
|
- Example: 100 cameras sharing 4 contexts efficiently
|
||||||
|
|
||||||
|
3. **GPU-to-GPU Inference**
|
||||||
|
- All inputs/outputs stay in VRAM (zero CPU transfers)
|
||||||
|
- Integrates seamlessly with StreamDecoder (frames already on GPU)
|
||||||
|
- Maximum performance for video inference pipelines
|
||||||
|
|
||||||
|
4. **Thread-Safe Concurrent Inference**
|
||||||
|
- Mutex-based context acquisition (TensorRT best practice)
|
||||||
|
- No shared IExecutionContext across threads (safe)
|
||||||
|
- Multiple threads can infer concurrently (limited by pool size)
|
||||||
|
|
||||||
|
## Design Rationale
|
||||||
|
|
||||||
|
### Why Context Pooling?
|
||||||
|
|
||||||
|
**Without pooling** (naive approach):
|
||||||
|
```
|
||||||
|
100 cameras → 100 model IDs → 100 execution contexts
|
||||||
|
```
|
||||||
|
- Problem: Each context consumes VRAM (layers, workspace, etc.)
|
||||||
|
- Problem: Context creation overhead per camera
|
||||||
|
- Problem: Doesn't scale to hundreds of cameras
|
||||||
|
|
||||||
|
**With pooling** (our approach):
|
||||||
|
```
|
||||||
|
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
|
||||||
|
```
|
||||||
|
- Solution: Contexts shared across all cameras using same model
|
||||||
|
- Solution: Borrow/return mechanism with mutex queue
|
||||||
|
- Solution: Scales to any number of cameras with fixed context count
|
||||||
|
|
||||||
|
### Memory Savings Example
|
||||||
|
|
||||||
|
YOLOv8n model (~6MB engine file):
|
||||||
|
|
||||||
|
| Approach | Model IDs | Engines | Contexts | Approx VRAM |
|
||||||
|
|----------|-----------|---------|----------|-------------|
|
||||||
|
| Naive | 100 | 100 | 100 | ~1.5 GB |
|
||||||
|
| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |
|
||||||
|
|
||||||
|
**50x memory savings!**
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services.model_repository import TensorRTModelRepository
|
||||||
|
|
||||||
|
# Initialize repository
|
||||||
|
repo = TensorRTModelRepository(
|
||||||
|
gpu_id=0,
|
||||||
|
default_num_contexts=4 # 4 contexts per unique engine
|
||||||
|
)
|
||||||
|
|
||||||
|
# Load model for camera 1
|
||||||
|
repo.load_model(
|
||||||
|
model_id="camera_1",
|
||||||
|
file_path="models/yolov8n.trt"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Load same model for camera 2 (deduplication happens automatically)
|
||||||
|
repo.load_model(
|
||||||
|
model_id="camera_2",
|
||||||
|
file_path="models/yolov8n.trt" # Same file → shares engine and contexts!
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run inference (GPU-to-GPU)
|
||||||
|
import torch
|
||||||
|
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||||
|
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id="camera_1",
|
||||||
|
inputs={"images": input_tensor},
|
||||||
|
synchronize=True,
|
||||||
|
timeout=5.0 # Wait up to 5s for available context
|
||||||
|
)
|
||||||
|
|
||||||
|
# Outputs stay on GPU
|
||||||
|
for name, tensor in outputs.items():
|
||||||
|
print(f"{name}: {tensor.shape} on {tensor.device}")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Multi-Camera Scenario
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Setup multiple cameras
|
||||||
|
cameras = [f"camera_{i}" for i in range(100)]
|
||||||
|
|
||||||
|
# Load same model for all cameras
|
||||||
|
for camera_id in cameras:
|
||||||
|
repo.load_model(
|
||||||
|
model_id=camera_id,
|
||||||
|
file_path="models/yolov8n.trt" # Same file for all
|
||||||
|
)
|
||||||
|
|
||||||
|
# Check efficiency
|
||||||
|
stats = repo.get_stats()
|
||||||
|
print(f"Model IDs: {stats['total_model_ids']}") # 100
|
||||||
|
print(f"Unique engines: {stats['unique_engines']}") # 1
|
||||||
|
print(f"Total contexts: {stats['total_contexts']}") # 4
|
||||||
|
```
|
||||||
|
|
||||||
|
### Integration with RTSP Decoder
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services.stream_decoder import StreamDecoderFactory
|
||||||
|
from services.model_repository import TensorRTModelRepository
|
||||||
|
|
||||||
|
# Setup
|
||||||
|
decoder_factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
model_repo = TensorRTModelRepository(gpu_id=0)
|
||||||
|
|
||||||
|
# Create decoder for camera
|
||||||
|
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
|
||||||
|
decoder.start()
|
||||||
|
|
||||||
|
# Load inference model
|
||||||
|
model_repo.load_model("camera_main", "models/yolov8n.trt")
|
||||||
|
|
||||||
|
# Process frames (everything on GPU)
|
||||||
|
frame_gpu = decoder.get_latest_frame(rgb=True) # torch.Tensor on CUDA
|
||||||
|
|
||||||
|
# Preprocess (stays on GPU)
|
||||||
|
frame_gpu = frame_gpu.float() / 255.0
|
||||||
|
frame_gpu = frame_gpu.unsqueeze(0) # Add batch dim
|
||||||
|
|
||||||
|
# Inference (GPU-to-GPU, zero copy)
|
||||||
|
outputs = model_repo.infer(
|
||||||
|
model_id="camera_main",
|
||||||
|
inputs={"images": frame_gpu}
|
||||||
|
)
|
||||||
|
|
||||||
|
# Post-process outputs (can stay on GPU)
|
||||||
|
# ... NMS, bounding boxes, etc.
|
||||||
|
```
|
||||||
|
|
||||||
|
### Concurrent Inference
|
||||||
|
|
||||||
|
```python
|
||||||
|
import threading
|
||||||
|
|
||||||
|
def process_camera(camera_id: str, model_id: str):
|
||||||
|
# Get frame from decoder (on GPU)
|
||||||
|
frame = decoder.get_latest_frame(rgb=True)
|
||||||
|
|
||||||
|
# Inference automatically borrows/returns context from pool
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id=model_id,
|
||||||
|
inputs={"images": frame},
|
||||||
|
timeout=10.0 # Wait for available context
|
||||||
|
)
|
||||||
|
|
||||||
|
# Process outputs...
|
||||||
|
|
||||||
|
# Multiple threads can infer concurrently
|
||||||
|
threads = []
|
||||||
|
for i in range(10): # 10 threads
|
||||||
|
t = threading.Thread(
|
||||||
|
target=process_camera,
|
||||||
|
args=(f"camera_{i}", f"camera_{i}")
|
||||||
|
)
|
||||||
|
threads.append(t)
|
||||||
|
t.start()
|
||||||
|
|
||||||
|
for t in threads:
|
||||||
|
t.join()
|
||||||
|
|
||||||
|
# With 4 contexts: up to 4 inferences run in parallel
|
||||||
|
# Others wait in queue, contexts auto-balanced
|
||||||
|
```
|
||||||
|
|
||||||
|
## API Reference
|
||||||
|
|
||||||
|
### TensorRTModelRepository
|
||||||
|
|
||||||
|
#### `__init__(gpu_id=0, default_num_contexts=4)`
|
||||||
|
Initialize the repository.
|
||||||
|
|
||||||
|
**Args:**
|
||||||
|
- `gpu_id`: GPU device ID
|
||||||
|
- `default_num_contexts`: Default context pool size per engine
|
||||||
|
|
||||||
|
#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
|
||||||
|
Load a TensorRT model.
|
||||||
|
|
||||||
|
**Args:**
|
||||||
|
- `model_id`: Unique identifier (e.g., "camera_1")
|
||||||
|
- `file_path`: Path to .trt/.engine file
|
||||||
|
- `num_contexts`: Context pool size (None = use default)
|
||||||
|
- `force_reload`: Reload if model_id exists
|
||||||
|
|
||||||
|
**Returns:** `ModelMetadata`
|
||||||
|
|
||||||
|
**Deduplication:** If file hash matches existing model, reuses engine + contexts.
|
||||||
|
|
||||||
|
#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
|
||||||
|
Run inference.
|
||||||
|
|
||||||
|
**Args:**
|
||||||
|
- `model_id`: Model identifier
|
||||||
|
- `inputs`: Dict mapping input names to CUDA tensors
|
||||||
|
- `synchronize`: Wait for completion
|
||||||
|
- `timeout`: Max wait time for context (seconds)
|
||||||
|
|
||||||
|
**Returns:** Dict mapping output names to CUDA tensors
|
||||||
|
|
||||||
|
**Thread-safe:** Borrows context from pool, returns after inference.
|
||||||
|
|
||||||
|
#### `unload_model(model_id)`
|
||||||
|
Unload a model.
|
||||||
|
|
||||||
|
If last reference to engine, fully unloads from VRAM.
|
||||||
|
|
||||||
|
#### `get_metadata(model_id)`
|
||||||
|
Get model metadata.
|
||||||
|
|
||||||
|
**Returns:** `ModelMetadata` or `None`
|
||||||
|
|
||||||
|
#### `get_model_info(model_id)`
|
||||||
|
Get detailed model information.
|
||||||
|
|
||||||
|
**Returns:** Dict with engine references, context pool size, shared model IDs, etc.
|
||||||
|
|
||||||
|
#### `get_stats()`
|
||||||
|
Get repository statistics.
|
||||||
|
|
||||||
|
**Returns:** Dict with total models, unique engines, contexts, memory efficiency.
|
||||||
|
|
||||||
|
## Best Practices
|
||||||
|
|
||||||
|
### 1. Set Appropriate Context Pool Size
|
||||||
|
|
||||||
|
```python
|
||||||
|
# For 10 cameras with same model, 4 contexts is usually enough
|
||||||
|
repo = TensorRTModelRepository(default_num_contexts=4)
|
||||||
|
|
||||||
|
# For high concurrency, increase pool size
|
||||||
|
repo = TensorRTModelRepository(default_num_contexts=8)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.
|
||||||
|
|
||||||
|
### 2. Always Use GPU Tensors
|
||||||
|
|
||||||
|
```python
|
||||||
|
# ✅ Good: Input on GPU
|
||||||
|
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||||
|
outputs = repo.infer(model_id, {"images": input_gpu})
|
||||||
|
|
||||||
|
# ❌ Bad: Input on CPU (will cause error)
|
||||||
|
input_cpu = torch.rand(1, 3, 640, 640)
|
||||||
|
outputs = repo.infer(model_id, {"images": input_cpu}) # ValueError!
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Handle Timeout Gracefully
|
||||||
|
|
||||||
|
```python
|
||||||
|
try:
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id="camera_1",
|
||||||
|
inputs=inputs,
|
||||||
|
timeout=5.0
|
||||||
|
)
|
||||||
|
except RuntimeError as e:
|
||||||
|
# All contexts busy, increase pool size or add backpressure
|
||||||
|
print(f"Inference timeout: {e}")
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Use Same File for Deduplication
|
||||||
|
|
||||||
|
```python
|
||||||
|
# ✅ Good: Same file path → deduplication
|
||||||
|
repo.load_model("cam1", "/models/yolo.trt")
|
||||||
|
repo.load_model("cam2", "/models/yolo.trt") # Shares engine!
|
||||||
|
|
||||||
|
# ❌ Bad: Different paths (even if same content) → no deduplication
|
||||||
|
repo.load_model("cam1", "/models/yolo.trt")
|
||||||
|
repo.load_model("cam2", "/models/yolo_copy.trt") # Separate engine
|
||||||
|
```
|
||||||
|
|
||||||
|
## TensorRT Best Practices Implemented
|
||||||
|
|
||||||
|
Based on NVIDIA documentation and web search findings:
|
||||||
|
|
||||||
|
1. **Separate IExecutionContext per concurrent stream** ✅
|
||||||
|
- Each context has its own CUDA stream
|
||||||
|
- Contexts never shared across threads simultaneously
|
||||||
|
|
||||||
|
2. **Mutex-based context management** ✅
|
||||||
|
- Queue-based borrowing with locks
|
||||||
|
- Thread-safe acquire/release pattern
|
||||||
|
|
||||||
|
3. **GPU memory reuse** ✅
|
||||||
|
- Engines shared by file hash
|
||||||
|
- Contexts pooled and reused
|
||||||
|
|
||||||
|
4. **Zero-copy operations** ✅
|
||||||
|
- All data stays in VRAM
|
||||||
|
- DLPack integration with PyTorch
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### "No execution context available within timeout"
|
||||||
|
|
||||||
|
**Cause:** All contexts busy with concurrent inferences.
|
||||||
|
|
||||||
|
**Solutions:**
|
||||||
|
1. Increase context pool size:
|
||||||
|
```python
|
||||||
|
repo.load_model(model_id, file_path, num_contexts=8)
|
||||||
|
```
|
||||||
|
2. Increase timeout:
|
||||||
|
```python
|
||||||
|
outputs = repo.infer(model_id, inputs, timeout=30.0)
|
||||||
|
```
|
||||||
|
3. Add backpressure/throttling to limit concurrent requests
|
||||||
|
|
||||||
|
### Out of Memory (OOM)

**Cause:** Too many unique engines or large context pools.

**Solutions:**
1. Ensure deduplication is working (same file paths); see the check below
2. Reduce context pool sizes
3. Use smaller models or quantization (INT8/FP16)
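To confirm that deduplication is actually working, the stats from `get_stats()` can be checked directly; the expected count of 1 assumes every camera was loaded from the same model file:

```python
stats = repo.get_stats()
# With N cameras sharing one .trt file, unique_engines should stay at 1.
if stats['unique_engines'] > 1:
    print(f"Warning: {stats['unique_engines']} engines in VRAM; "
          f"check that every load_model() call uses the same file path")
```
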
### Import Error: "tensorrt could not be resolved"

**Solution:** Install TensorRT:
```bash
pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version
```
## Performance Tips

1. **Batch Processing:** Process multiple frames before synchronizing (a combined sketch follows this list)
```python
outputs = repo.infer(model_id, inputs, synchronize=False)
# ... more inferences ...
torch.cuda.synchronize()  # Sync once at end
```

2. **Async Inference:** Don't synchronize if not needed immediately
```python
outputs = repo.infer(model_id, inputs, synchronize=False)
# GPU continues working, CPU continues
# Synchronize later when you need results
```

3. **Monitor Context Utilization:**
```python
stats = repo.get_stats()
print(f"Contexts: {stats['total_contexts']}")

# If timeouts occur frequently, increase pool size
```
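Putting tips 1 and 2 together, a hedged sketch of one async pass over several cameras. A `decoders` dict of running `StreamDecoder` instances is assumed, and `preprocess()` is a placeholder for whatever resize/normalize step your engine expects:

```python
import torch

def preprocess(frame):
    # Placeholder: adapt to your engine's expected layout, size and scaling.
    return frame.unsqueeze(0).float() / 255.0

pending = {}
for camera_id in ["camera_1", "camera_2", "camera_3"]:
    frame = decoders[camera_id].get_frame(rgb=True)
    if frame is None:
        continue
    # Queue work on the GPU without blocking the CPU.
    pending[camera_id] = repo.infer(
        camera_id, {"images": preprocess(frame)}, synchronize=False
    )

# Single synchronization point after all cameras have been submitted.
torch.cuda.synchronize()
for camera_id, outputs in pending.items():
    print(camera_id, {name: t.shape for name, t in outputs.items()})
```
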
## License

Part of python-rtsp-worker project.
14
services/__init__.py
Normal file
@@ -0,0 +1,14 @@
"""
|
||||||
|
Services package for RTSP stream processing with GPU acceleration.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from .stream_decoder import StreamDecoderFactory, StreamDecoder, ConnectionStatus
|
||||||
|
from .jpeg_encoder import JPEGEncoderFactory, encode_frame_to_jpeg
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
'StreamDecoderFactory',
|
||||||
|
'StreamDecoder',
|
||||||
|
'ConnectionStatus',
|
||||||
|
'JPEGEncoderFactory',
|
||||||
|
'encode_frame_to_jpeg',
|
||||||
|
]
|
||||||
91
services/jpeg_encoder.py
Normal file
@@ -0,0 +1,91 @@
"""
|
||||||
|
JPEG Encoder wrapper for GPU-accelerated JPEG encoding using nvImageCodec/nvJPEG.
|
||||||
|
Provides a shared encoder instance that can be used across multiple streams.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from typing import Optional
|
||||||
|
import torch
|
||||||
|
import nvidia.nvimgcodec as nvimgcodec
|
||||||
|
|
||||||
|
|
||||||
|
class JPEGEncoderFactory:
|
||||||
|
"""
|
||||||
|
Factory for creating and managing a shared JPEG encoder instance.
|
||||||
|
Thread-safe singleton pattern for efficient resource sharing.
|
||||||
|
"""
|
||||||
|
|
||||||
|
_instance = None
|
||||||
|
_encoder = None
|
||||||
|
|
||||||
|
def __new__(cls):
|
||||||
|
if cls._instance is None:
|
||||||
|
cls._instance = super(JPEGEncoderFactory, cls).__new__(cls)
|
||||||
|
cls._encoder = nvimgcodec.Encoder()
|
||||||
|
print("JPEGEncoderFactory initialized with shared nvJPEG encoder")
|
||||||
|
return cls._instance
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_encoder(cls):
|
||||||
|
"""Get the shared JPEG encoder instance"""
|
||||||
|
if cls._encoder is None:
|
||||||
|
cls() # Initialize if not already done
|
||||||
|
return cls._encoder
|
||||||
|
|
||||||
|
|
||||||
|
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
|
||||||
|
"""
|
||||||
|
Encode an RGB frame to JPEG on GPU and return JPEG bytes.
|
||||||
|
|
||||||
|
This function:
|
||||||
|
1. Takes RGB frame from GPU (stays on GPU during encoding)
|
||||||
|
2. Converts PyTorch tensor to nvImageCodec image via as_image()
|
||||||
|
3. Encodes to JPEG using nvJPEG (GPU operation)
|
||||||
|
4. Transfers only JPEG bytes to CPU
|
||||||
|
5. Returns bytes for saving to disk
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rgb_frame: RGB tensor on GPU, shape (3, H, W) or (H, W, 3), dtype uint8
|
||||||
|
quality: JPEG quality (0-100, default 95)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JPEG encoded bytes or None if encoding fails
|
||||||
|
"""
|
||||||
|
if rgb_frame is None:
|
||||||
|
return None
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Ensure we have (H, W, C) format and contiguous memory
|
||||||
|
if rgb_frame.dim() == 3:
|
||||||
|
if rgb_frame.shape[0] == 3:
|
||||||
|
# Convert from (C, H, W) to (H, W, C)
|
||||||
|
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
|
||||||
|
else:
|
||||||
|
# Already (H, W, C)
|
||||||
|
rgb_hwc = rgb_frame.contiguous()
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Expected 3D tensor, got shape {rgb_frame.shape}")
|
||||||
|
|
||||||
|
# Get shared encoder
|
||||||
|
encoder = JPEGEncoderFactory.get_encoder()
|
||||||
|
|
||||||
|
# Create encode parameters with quality
|
||||||
|
# Quality is set via quality_value (0-100 scale)
|
||||||
|
jpeg_params = nvimgcodec.JpegEncodeParams(optimized_huffman=True)
|
||||||
|
encode_params = nvimgcodec.EncodeParams(
|
||||||
|
quality_value=float(quality),
|
||||||
|
jpeg_encode_params=jpeg_params
|
||||||
|
)
|
||||||
|
|
||||||
|
# Convert PyTorch GPU tensor to nvImageCodec image using __cuda_array_interface__
|
||||||
|
# This is zero-copy - nvimgcodec reads directly from GPU memory
|
||||||
|
nv_image = nvimgcodec.as_image(rgb_hwc)
|
||||||
|
|
||||||
|
# Encode to JPEG on GPU
|
||||||
|
# The encoding happens on GPU, only compressed JPEG bytes are transferred to CPU
|
||||||
|
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
|
||||||
|
|
||||||
|
return bytes(jpeg_data)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error encoding frame to JPEG: {e}")
|
||||||
|
return None
|
||||||
631
services/model_repository.py
Normal file
@@ -0,0 +1,631 @@
import threading
|
||||||
|
import hashlib
|
||||||
|
from typing import Optional, Dict, Any, List, Tuple
|
||||||
|
from pathlib import Path
|
||||||
|
from queue import Queue, Empty
|
||||||
|
import torch
|
||||||
|
import tensorrt as trt
|
||||||
|
from dataclasses import dataclass
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ModelMetadata:
|
||||||
|
"""Metadata for a loaded TensorRT model"""
|
||||||
|
file_path: str
|
||||||
|
file_hash: str
|
||||||
|
input_shapes: Dict[str, Tuple[int, ...]]
|
||||||
|
output_shapes: Dict[str, Tuple[int, ...]]
|
||||||
|
input_names: List[str]
|
||||||
|
output_names: List[str]
|
||||||
|
input_dtypes: Dict[str, torch.dtype]
|
||||||
|
output_dtypes: Dict[str, torch.dtype]
|
||||||
|
|
||||||
|
|
||||||
|
class ExecutionContext:
|
||||||
|
"""
|
||||||
|
Wrapper for TensorRT execution context with CUDA stream.
|
||||||
|
Used in context pool for load balancing.
|
||||||
|
"""
|
||||||
|
def __init__(self, context: trt.IExecutionContext, stream: torch.cuda.Stream,
|
||||||
|
context_id: int, device: torch.device):
|
||||||
|
self.context = context
|
||||||
|
self.stream = stream
|
||||||
|
self.context_id = context_id
|
||||||
|
self.device = device
|
||||||
|
self.in_use = False
|
||||||
|
self.lock = threading.Lock()
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return f"ExecutionContext(id={self.context_id}, in_use={self.in_use})"
|
||||||
|
|
||||||
|
|
||||||
|
class SharedEngine:
|
||||||
|
"""
|
||||||
|
Shared TensorRT engine with context pool for load balancing.
|
||||||
|
|
||||||
|
Architecture:
|
||||||
|
- One engine shared across all model_ids with same file hash
|
||||||
|
- Pool of N execution contexts for concurrent inference
|
||||||
|
- Contexts are borrowed/returned using mutex locks
|
||||||
|
- Load balancing: contexts distributed across requests
|
||||||
|
"""
|
||||||
|
def __init__(self, engine: trt.ICudaEngine, file_hash: str, file_path: str,
|
||||||
|
num_contexts: int, device: torch.device, metadata: ModelMetadata):
|
||||||
|
self.engine = engine
|
||||||
|
self.file_hash = file_hash
|
||||||
|
self.file_path = file_path
|
||||||
|
self.metadata = metadata
|
||||||
|
self.device = device
|
||||||
|
self.num_contexts = num_contexts
|
||||||
|
|
||||||
|
# Create context pool
|
||||||
|
self.context_pool: List[ExecutionContext] = []
|
||||||
|
self.available_contexts: Queue[ExecutionContext] = Queue()
|
||||||
|
|
||||||
|
for i in range(num_contexts):
|
||||||
|
ctx = engine.create_execution_context()
|
||||||
|
if ctx is None:
|
||||||
|
raise RuntimeError(f"Failed to create execution context {i}")
|
||||||
|
|
||||||
|
stream = torch.cuda.Stream(device=device)
|
||||||
|
exec_ctx = ExecutionContext(ctx, stream, i, device)
|
||||||
|
self.context_pool.append(exec_ctx)
|
||||||
|
self.available_contexts.put(exec_ctx)
|
||||||
|
|
||||||
|
# Model IDs referencing this engine
|
||||||
|
self.model_ids: set = set()
|
||||||
|
self.lock = threading.Lock()
|
||||||
|
|
||||||
|
print(f"Created context pool with {num_contexts} contexts for engine {file_hash[:8]}...")
|
||||||
|
|
||||||
|
def acquire_context(self, timeout: Optional[float] = None) -> Optional[ExecutionContext]:
|
||||||
|
"""
|
||||||
|
Acquire an available execution context from the pool.
|
||||||
|
Blocks if all contexts are in use.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
timeout: Max time to wait for context (None = wait forever)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ExecutionContext or None if timeout
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
exec_ctx = self.available_contexts.get(timeout=timeout)
|
||||||
|
with exec_ctx.lock:
|
||||||
|
exec_ctx.in_use = True
|
||||||
|
return exec_ctx
|
||||||
|
except Empty:
    # Queue.get timed out: every context is currently in use
    return None
|
||||||
|
|
||||||
|
def release_context(self, exec_ctx: ExecutionContext):
|
||||||
|
"""
|
||||||
|
Return a context to the pool.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
exec_ctx: Context to release
|
||||||
|
"""
|
||||||
|
with exec_ctx.lock:
|
||||||
|
exec_ctx.in_use = False
|
||||||
|
self.available_contexts.put(exec_ctx)
|
||||||
|
|
||||||
|
def add_model_id(self, model_id: str):
|
||||||
|
"""Add a model_id reference to this engine"""
|
||||||
|
with self.lock:
|
||||||
|
self.model_ids.add(model_id)
|
||||||
|
|
||||||
|
def remove_model_id(self, model_id: str) -> int:
|
||||||
|
"""
|
||||||
|
Remove a model_id reference from this engine.
|
||||||
|
Returns the number of remaining references.
|
||||||
|
"""
|
||||||
|
with self.lock:
|
||||||
|
self.model_ids.discard(model_id)
|
||||||
|
return len(self.model_ids)
|
||||||
|
|
||||||
|
def get_reference_count(self) -> int:
|
||||||
|
"""Get number of model_ids using this engine"""
|
||||||
|
with self.lock:
|
||||||
|
return len(self.model_ids)
|
||||||
|
|
||||||
|
def cleanup(self):
|
||||||
|
"""Cleanup all contexts"""
|
||||||
|
for exec_ctx in self.context_pool:
|
||||||
|
del exec_ctx.context
|
||||||
|
self.context_pool.clear()
|
||||||
|
del self.engine
|
||||||
|
|
||||||
|
|
||||||
|
class TensorRTModelRepository:
|
||||||
|
"""
|
||||||
|
Thread-safe repository for TensorRT models with context pooling and deduplication.
|
||||||
|
|
||||||
|
Architecture:
|
||||||
|
- Deduplication: Multiple model_ids with same file → share one engine
|
||||||
|
- Context Pool: Each unique engine has N execution contexts (configurable)
|
||||||
|
- Load Balancing: Contexts are borrowed/returned via mutex queue
|
||||||
|
- Scalability: Adding 100 cameras with same model = 1 engine + N contexts (not 100 contexts!)
|
||||||
|
|
||||||
|
Best Practices:
|
||||||
|
- GPU-to-GPU: All inputs/outputs stay in VRAM (zero CPU transfers)
|
||||||
|
- Thread Safety: Mutex-based context borrowing (TensorRT best practice)
|
||||||
|
- Memory Efficient: Deduplicate by file hash, share engine across model_ids
|
||||||
|
- Concurrent: N contexts allow N parallel inferences per unique model
|
||||||
|
|
||||||
|
Example:
|
||||||
|
# 100 cameras, same model file
|
||||||
|
for i in range(100):
|
||||||
|
repo.load_model(f"camera_{i}", "yolov8.trt")
|
||||||
|
# Result: 1 engine in VRAM, N contexts (e.g., 4), not 100 contexts!
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, gpu_id: int = 0, default_num_contexts: int = 4):
|
||||||
|
"""
|
||||||
|
Initialize the model repository.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
gpu_id: GPU device ID to use
|
||||||
|
default_num_contexts: Default number of execution contexts per unique engine
|
||||||
|
"""
|
||||||
|
self.gpu_id = gpu_id
|
||||||
|
self.device = torch.device(f'cuda:{gpu_id}')
|
||||||
|
self.default_num_contexts = default_num_contexts
|
||||||
|
|
||||||
|
# Model ID to engine mapping: model_id -> file_hash
|
||||||
|
self._model_to_hash: Dict[str, str] = {}
|
||||||
|
|
||||||
|
# Shared engines with context pools: file_hash -> SharedEngine
|
||||||
|
self._shared_engines: Dict[str, SharedEngine] = {}
|
||||||
|
|
||||||
|
# Locks for thread safety
|
||||||
|
self._repo_lock = threading.RLock()
|
||||||
|
|
||||||
|
# TensorRT logger
|
||||||
|
self.trt_logger = trt.Logger(trt.Logger.WARNING)
|
||||||
|
|
||||||
|
print(f"TensorRT Model Repository initialized on GPU {gpu_id}")
|
||||||
|
print(f"Default context pool size: {default_num_contexts} contexts per unique model")
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def compute_file_hash(file_path: str) -> str:
|
||||||
|
"""
|
||||||
|
Compute SHA256 hash of a file for deduplication.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to the file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Hexadecimal hash string
|
||||||
|
"""
|
||||||
|
sha256_hash = hashlib.sha256()
|
||||||
|
with open(file_path, "rb") as f:
|
||||||
|
# Read in chunks to handle large files efficiently
|
||||||
|
for byte_block in iter(lambda: f.read(65536), b""):
|
||||||
|
sha256_hash.update(byte_block)
|
||||||
|
return sha256_hash.hexdigest()
|
||||||
|
|
||||||
|
def _load_engine(self, file_path: str) -> trt.ICudaEngine:
|
||||||
|
"""
|
||||||
|
Load TensorRT engine from file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to .trt or .engine file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
TensorRT engine
|
||||||
|
"""
|
||||||
|
runtime = trt.Runtime(self.trt_logger)
|
||||||
|
|
||||||
|
with open(file_path, 'rb') as f:
|
||||||
|
engine_data = f.read()
|
||||||
|
|
||||||
|
engine = runtime.deserialize_cuda_engine(engine_data)
|
||||||
|
if engine is None:
|
||||||
|
raise RuntimeError(f"Failed to load TensorRT engine from {file_path}")
|
||||||
|
|
||||||
|
return engine
|
||||||
|
|
||||||
|
def _extract_metadata(self, engine: trt.ICudaEngine,
|
||||||
|
file_path: str, file_hash: str) -> ModelMetadata:
|
||||||
|
"""
|
||||||
|
Extract metadata from TensorRT engine.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
engine: TensorRT engine
|
||||||
|
file_path: Path to model file
|
||||||
|
file_hash: SHA256 hash of model file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ModelMetadata object
|
||||||
|
"""
|
||||||
|
input_shapes = {}
|
||||||
|
output_shapes = {}
|
||||||
|
input_names = []
|
||||||
|
output_names = []
|
||||||
|
input_dtypes = {}
|
||||||
|
output_dtypes = {}
|
||||||
|
|
||||||
|
# TensorRT dtype to PyTorch dtype mapping
|
||||||
|
trt_to_torch_dtype = {
|
||||||
|
trt.DataType.FLOAT: torch.float32,
|
||||||
|
trt.DataType.HALF: torch.float16,
|
||||||
|
trt.DataType.INT8: torch.int8,
|
||||||
|
trt.DataType.INT32: torch.int32,
|
||||||
|
trt.DataType.BOOL: torch.bool,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Iterate through all tensors (inputs and outputs) - TensorRT 10.x API
|
||||||
|
for i in range(engine.num_io_tensors):
|
||||||
|
name = engine.get_tensor_name(i)
|
||||||
|
shape = tuple(engine.get_tensor_shape(name))
|
||||||
|
dtype = trt_to_torch_dtype.get(engine.get_tensor_dtype(name), torch.float32)
|
||||||
|
mode = engine.get_tensor_mode(name)
|
||||||
|
|
||||||
|
if mode == trt.TensorIOMode.INPUT:
|
||||||
|
input_names.append(name)
|
||||||
|
input_shapes[name] = shape
|
||||||
|
input_dtypes[name] = dtype
|
||||||
|
else:
|
||||||
|
output_names.append(name)
|
||||||
|
output_shapes[name] = shape
|
||||||
|
output_dtypes[name] = dtype
|
||||||
|
|
||||||
|
return ModelMetadata(
|
||||||
|
file_path=file_path,
|
||||||
|
file_hash=file_hash,
|
||||||
|
input_shapes=input_shapes,
|
||||||
|
output_shapes=output_shapes,
|
||||||
|
input_names=input_names,
|
||||||
|
output_names=output_names,
|
||||||
|
input_dtypes=input_dtypes,
|
||||||
|
output_dtypes=output_dtypes
|
||||||
|
)
|
||||||
|
|
||||||
|
def load_model(self, model_id: str, file_path: str,
|
||||||
|
num_contexts: Optional[int] = None,
|
||||||
|
force_reload: bool = False) -> ModelMetadata:
|
||||||
|
"""
|
||||||
|
Load a TensorRT model with the given ID.
|
||||||
|
|
||||||
|
Deduplication: If a model with the same file hash is already loaded, the model_id
|
||||||
|
is simply mapped to the existing SharedEngine (no new engine or contexts created).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: User-defined identifier for this model (e.g., "camera_1")
|
||||||
|
file_path: Path to TensorRT engine file (.trt or .engine)
|
||||||
|
num_contexts: Number of execution contexts in pool (None = use default)
|
||||||
|
force_reload: If True, reload even if model_id exists
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ModelMetadata for the loaded model
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
FileNotFoundError: If model file doesn't exist
|
||||||
|
RuntimeError: If engine loading fails
|
||||||
|
ValueError: If model_id already exists and force_reload is False
|
||||||
|
"""
|
||||||
|
file_path = str(Path(file_path).resolve())
|
||||||
|
|
||||||
|
if not Path(file_path).exists():
|
||||||
|
raise FileNotFoundError(f"Model file not found: {file_path}")
|
||||||
|
|
||||||
|
if num_contexts is None:
|
||||||
|
num_contexts = self.default_num_contexts
|
||||||
|
|
||||||
|
with self._repo_lock:
|
||||||
|
# Check if model_id already exists
|
||||||
|
if model_id in self._model_to_hash and not force_reload:
|
||||||
|
raise ValueError(
|
||||||
|
f"Model ID '{model_id}' already exists. "
|
||||||
|
f"Use force_reload=True to reload or choose a different ID."
|
||||||
|
)
|
||||||
|
|
||||||
|
# Unload existing model if force_reload
|
||||||
|
if model_id in self._model_to_hash and force_reload:
|
||||||
|
self.unload_model(model_id)
|
||||||
|
|
||||||
|
# Compute file hash for deduplication
|
||||||
|
print(f"Computing hash for {file_path}...")
|
||||||
|
file_hash = self.compute_file_hash(file_path)
|
||||||
|
print(f"File hash: {file_hash[:16]}...")
|
||||||
|
|
||||||
|
# Check if this file is already loaded (deduplication)
|
||||||
|
if file_hash in self._shared_engines:
|
||||||
|
shared_engine = self._shared_engines[file_hash]
|
||||||
|
print(f"Engine already loaded (hash match), reusing engine and context pool...")
|
||||||
|
print(f" Existing model_ids using this engine: {shared_engine.model_ids}")
|
||||||
|
else:
|
||||||
|
# Load new engine
|
||||||
|
print(f"Loading TensorRT engine from {file_path}...")
|
||||||
|
engine = self._load_engine(file_path)
|
||||||
|
|
||||||
|
# Extract metadata
|
||||||
|
metadata = self._extract_metadata(engine, file_path, file_hash)
|
||||||
|
|
||||||
|
# Create shared engine with context pool
|
||||||
|
shared_engine = SharedEngine(
|
||||||
|
engine=engine,
|
||||||
|
file_hash=file_hash,
|
||||||
|
file_path=file_path,
|
||||||
|
num_contexts=num_contexts,
|
||||||
|
device=self.device,
|
||||||
|
metadata=metadata
|
||||||
|
)
|
||||||
|
self._shared_engines[file_hash] = shared_engine
|
||||||
|
|
||||||
|
# Add this model_id to the shared engine
|
||||||
|
shared_engine.add_model_id(model_id)
|
||||||
|
|
||||||
|
# Map model_id to file_hash
|
||||||
|
self._model_to_hash[model_id] = file_hash
|
||||||
|
|
||||||
|
print(f"Model '{model_id}' loaded successfully")
|
||||||
|
print(f" Inputs: {shared_engine.metadata.input_names}")
|
||||||
|
for name in shared_engine.metadata.input_names:
|
||||||
|
print(f" {name}: {shared_engine.metadata.input_shapes[name]} ({shared_engine.metadata.input_dtypes[name]})")
|
||||||
|
print(f" Outputs: {shared_engine.metadata.output_names}")
|
||||||
|
for name in shared_engine.metadata.output_names:
|
||||||
|
print(f" {name}: {shared_engine.metadata.output_shapes[name]} ({shared_engine.metadata.output_dtypes[name]})")
|
||||||
|
print(f" Context pool size: {num_contexts}")
|
||||||
|
print(f" Model IDs sharing this engine: {shared_engine.get_reference_count()}")
|
||||||
|
print(f" Unique engines in VRAM: {len(self._shared_engines)}")
|
||||||
|
|
||||||
|
return shared_engine.metadata
|
||||||
|
|
||||||
|
def infer(self, model_id: str, inputs: Dict[str, torch.Tensor],
|
||||||
|
synchronize: bool = True, timeout: Optional[float] = 5.0) -> Dict[str, torch.Tensor]:
|
||||||
|
"""
|
||||||
|
Run GPU-to-GPU inference with the specified model using context pooling.
|
||||||
|
|
||||||
|
All inputs must be CUDA tensors and outputs will be CUDA tensors (stays in VRAM).
|
||||||
|
Thread-safe: Borrows an execution context from the pool with mutex locking.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier
|
||||||
|
inputs: Dictionary mapping input names to CUDA tensors
|
||||||
|
synchronize: If True, wait for inference to complete. If False, async execution.
|
||||||
|
timeout: Max time to wait for available context (seconds)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary mapping output names to CUDA tensors (in VRAM)
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
KeyError: If model_id not found
|
||||||
|
ValueError: If inputs don't match expected shapes or are not on GPU
|
||||||
|
RuntimeError: If no context available within timeout
|
||||||
|
"""
|
||||||
|
# Get shared engine
|
||||||
|
if model_id not in self._model_to_hash:
|
||||||
|
raise KeyError(f"Model '{model_id}' not found. Available: {list(self._model_to_hash.keys())}")
|
||||||
|
|
||||||
|
file_hash = self._model_to_hash[model_id]
|
||||||
|
shared_engine = self._shared_engines[file_hash]
|
||||||
|
metadata = shared_engine.metadata
|
||||||
|
|
||||||
|
# Validate inputs
|
||||||
|
for name in metadata.input_names:
|
||||||
|
if name not in inputs:
|
||||||
|
raise ValueError(f"Missing required input: {name}")
|
||||||
|
|
||||||
|
tensor = inputs[name]
|
||||||
|
if not tensor.is_cuda:
|
||||||
|
raise ValueError(f"Input '{name}' must be a CUDA tensor (on GPU)")
|
||||||
|
|
||||||
|
# Check device
|
||||||
|
if tensor.device != self.device:
|
||||||
|
print(f"Warning: Input '{name}' on {tensor.device}, moving to {self.device}")
|
||||||
|
inputs[name] = tensor.to(self.device)
|
||||||
|
|
||||||
|
# Acquire context from pool (mutex-based)
|
||||||
|
exec_ctx = shared_engine.acquire_context(timeout=timeout)
|
||||||
|
if exec_ctx is None:
|
||||||
|
raise RuntimeError(
|
||||||
|
f"No execution context available for model '{model_id}' within {timeout}s. "
|
||||||
|
f"All {shared_engine.num_contexts} contexts are busy."
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Prepare output tensors
|
||||||
|
outputs = {}
|
||||||
|
|
||||||
|
# Set input tensors - TensorRT 10.x API
|
||||||
|
for name in metadata.input_names:
|
||||||
|
input_tensor = inputs[name].contiguous()
|
||||||
|
exec_ctx.context.set_tensor_address(name, input_tensor.data_ptr())
|
||||||
|
|
||||||
|
# Allocate and set output tensors
|
||||||
|
for name in metadata.output_names:
|
||||||
|
output_shape = metadata.output_shapes[name]
|
||||||
|
output_dtype = metadata.output_dtypes[name]
|
||||||
|
|
||||||
|
output_tensor = torch.empty(
|
||||||
|
output_shape,
|
||||||
|
dtype=output_dtype,
|
||||||
|
device=self.device
|
||||||
|
)
|
||||||
|
|
||||||
|
outputs[name] = output_tensor
|
||||||
|
exec_ctx.context.set_tensor_address(name, output_tensor.data_ptr())
|
||||||
|
|
||||||
|
# Execute inference on context's stream - TensorRT 10.x API
|
||||||
|
with torch.cuda.stream(exec_ctx.stream):
|
||||||
|
success = exec_ctx.context.execute_async_v3(
|
||||||
|
stream_handle=exec_ctx.stream.cuda_stream
|
||||||
|
)
|
||||||
|
|
||||||
|
if not success:
|
||||||
|
raise RuntimeError(f"Inference failed for model '{model_id}'")
|
||||||
|
|
||||||
|
# Synchronize if requested
|
||||||
|
if synchronize:
|
||||||
|
exec_ctx.stream.synchronize()
|
||||||
|
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Always release context back to pool
|
||||||
|
shared_engine.release_context(exec_ctx)
|
||||||
|
|
||||||
|
def infer_batch(self, model_id: str, batch_inputs: List[Dict[str, torch.Tensor]],
|
||||||
|
synchronize: bool = True) -> List[Dict[str, torch.Tensor]]:
|
||||||
|
"""
|
||||||
|
Run inference on multiple inputs.
|
||||||
|
Contexts are borrowed/returned for each input, enabling parallel processing.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier
|
||||||
|
batch_inputs: List of input dictionaries
|
||||||
|
synchronize: If True, wait for all inferences to complete
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of output dictionaries
|
||||||
|
"""
|
||||||
|
results = []
|
||||||
|
for inputs in batch_inputs:
|
||||||
|
outputs = self.infer(model_id, inputs, synchronize=synchronize)
|
||||||
|
results.append(outputs)
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
def unload_model(self, model_id: str):
|
||||||
|
"""
|
||||||
|
Unload a model from the repository.
|
||||||
|
|
||||||
|
Removes the model_id reference from the shared engine. If this was the last
|
||||||
|
reference, the engine and all its contexts will be fully unloaded from VRAM.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier to unload
|
||||||
|
"""
|
||||||
|
with self._repo_lock:
|
||||||
|
if model_id not in self._model_to_hash:
|
||||||
|
print(f"Warning: Model '{model_id}' not found")
|
||||||
|
return
|
||||||
|
|
||||||
|
file_hash = self._model_to_hash[model_id]
|
||||||
|
|
||||||
|
# Remove model_id from shared engine
|
||||||
|
if file_hash in self._shared_engines:
|
||||||
|
shared_engine = self._shared_engines[file_hash]
|
||||||
|
remaining_refs = shared_engine.remove_model_id(model_id)
|
||||||
|
|
||||||
|
# If no more references, cleanup engine and contexts
|
||||||
|
if remaining_refs == 0:
|
||||||
|
shared_engine.cleanup()
|
||||||
|
del self._shared_engines[file_hash]
|
||||||
|
print(f"Model '{model_id}' unloaded, engine removed from VRAM (0 references)")
|
||||||
|
else:
|
||||||
|
print(f"Model '{model_id}' unloaded, engine kept in VRAM ({remaining_refs} references)")
|
||||||
|
|
||||||
|
# Remove from model_id mapping
|
||||||
|
del self._model_to_hash[model_id]
|
||||||
|
|
||||||
|
def get_metadata(self, model_id: str) -> Optional[ModelMetadata]:
|
||||||
|
"""
|
||||||
|
Get metadata for a loaded model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ModelMetadata or None if not found
|
||||||
|
"""
|
||||||
|
if model_id not in self._model_to_hash:
|
||||||
|
return None
|
||||||
|
|
||||||
|
file_hash = self._model_to_hash[model_id]
|
||||||
|
if file_hash not in self._shared_engines:
|
||||||
|
return None
|
||||||
|
|
||||||
|
return self._shared_engines[file_hash].metadata
|
||||||
|
|
||||||
|
def list_models(self) -> Dict[str, ModelMetadata]:
|
||||||
|
"""
|
||||||
|
List all loaded models.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary mapping model_id to ModelMetadata
|
||||||
|
"""
|
||||||
|
with self._repo_lock:
|
||||||
|
result = {}
|
||||||
|
for model_id, file_hash in self._model_to_hash.items():
|
||||||
|
if file_hash in self._shared_engines:
|
||||||
|
result[model_id] = self._shared_engines[file_hash].metadata
|
||||||
|
return result
|
||||||
|
|
||||||
|
def get_model_info(self, model_id: str) -> Optional[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
Get detailed information about a loaded model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with model information or None if not found
|
||||||
|
"""
|
||||||
|
if model_id not in self._model_to_hash:
|
||||||
|
return None
|
||||||
|
|
||||||
|
file_hash = self._model_to_hash[model_id]
|
||||||
|
if file_hash not in self._shared_engines:
|
||||||
|
return None
|
||||||
|
|
||||||
|
shared_engine = self._shared_engines[file_hash]
|
||||||
|
metadata = shared_engine.metadata
|
||||||
|
|
||||||
|
return {
|
||||||
|
'model_id': model_id,
|
||||||
|
'file_path': metadata.file_path,
|
||||||
|
'file_hash': metadata.file_hash[:16] + '...',
|
||||||
|
'engine_references': shared_engine.get_reference_count(),
|
||||||
|
'context_pool_size': shared_engine.num_contexts,
|
||||||
|
'shared_with_model_ids': list(shared_engine.model_ids),
|
||||||
|
'inputs': {
|
||||||
|
name: {
|
||||||
|
'shape': metadata.input_shapes[name],
|
||||||
|
'dtype': str(metadata.input_dtypes[name])
|
||||||
|
}
|
||||||
|
for name in metadata.input_names
|
||||||
|
},
|
||||||
|
'outputs': {
|
||||||
|
name: {
|
||||||
|
'shape': metadata.output_shapes[name],
|
||||||
|
'dtype': str(metadata.output_dtypes[name])
|
||||||
|
}
|
||||||
|
for name in metadata.output_names
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
def get_stats(self) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Get repository statistics.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with stats about loaded models and memory usage
|
||||||
|
"""
|
||||||
|
with self._repo_lock:
|
||||||
|
total_contexts = sum(
|
||||||
|
engine.num_contexts
|
||||||
|
for engine in self._shared_engines.values()
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
'total_model_ids': len(self._model_to_hash),
|
||||||
|
'unique_engines': len(self._shared_engines),
|
||||||
|
'total_contexts': total_contexts,
|
||||||
|
'memory_efficiency': f"{len(self._model_to_hash)} model IDs using only {len(self._shared_engines)} engines",
|
||||||
|
'gpu_id': self.gpu_id,
|
||||||
|
'models': list(self._model_to_hash.keys())
|
||||||
|
}
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
with self._repo_lock:
|
||||||
|
return (f"TensorRTModelRepository(gpu={self.gpu_id}, "
|
||||||
|
f"model_ids={len(self._model_to_hash)}, "
|
||||||
|
f"unique_engines={len(self._shared_engines)})")
|
||||||
|
|
||||||
|
def __del__(self):
|
||||||
|
"""Cleanup all models on deletion"""
|
||||||
|
with self._repo_lock:
|
||||||
|
model_ids = list(self._model_to_hash.keys())
|
||||||
|
for model_id in model_ids:
|
||||||
|
self.unload_model(model_id)
|
||||||
481
services/stream_decoder.py
Normal file
@@ -0,0 +1,481 @@
import threading
|
||||||
|
from typing import Optional
|
||||||
|
from collections import deque
|
||||||
|
from enum import Enum
|
||||||
|
import torch
|
||||||
|
import PyNvVideoCodec as nvc
|
||||||
|
import av
|
||||||
|
import numpy as np
|
||||||
|
from cuda.bindings import driver as cuda_driver
|
||||||
|
from .jpeg_encoder import encode_frame_to_jpeg
|
||||||
|
|
||||||
|
|
||||||
|
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
|
||||||
|
"""
|
||||||
|
Convert NV12 format to RGB on GPU using PyTorch operations.
|
||||||
|
|
||||||
|
NV12 format:
|
||||||
|
- Y plane: height x width (luminance)
|
||||||
|
- UV plane: (height/2) x width (interleaved U and V, subsampled by 2)
|
||||||
|
|
||||||
|
Total tensor size: (height * 3/2) x width
|
||||||
|
|
||||||
|
Args:
|
||||||
|
nv12_tensor: Input tensor in NV12 format, shape (H*3/2, W)
|
||||||
|
height: Original frame height
|
||||||
|
width: Original frame width
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
RGB tensor, shape (3, H, W) in range [0, 255]
|
||||||
|
"""
|
||||||
|
device = nv12_tensor.device
|
||||||
|
|
||||||
|
# Split Y and UV planes
|
||||||
|
y_plane = nv12_tensor[:height, :].float() # (H, W)
|
||||||
|
uv_plane = nv12_tensor[height:, :].float() # (H/2, W)
|
||||||
|
|
||||||
|
# Reshape UV plane to separate U and V channels
|
||||||
|
# UV is interleaved: U0V0U1V1... we need to deinterleave
|
||||||
|
uv_plane = uv_plane.reshape(height // 2, width // 2, 2) # (H/2, W/2, 2)
|
||||||
|
u_plane = uv_plane[:, :, 0] # (H/2, W/2)
|
||||||
|
v_plane = uv_plane[:, :, 1] # (H/2, W/2)
|
||||||
|
|
||||||
|
# Upsample U and V to full resolution using bilinear interpolation
|
||||||
|
u_upsampled = torch.nn.functional.interpolate(
|
||||||
|
u_plane.unsqueeze(0).unsqueeze(0), # (1, 1, H/2, W/2)
|
||||||
|
size=(height, width),
|
||||||
|
mode='bilinear',
|
||||||
|
align_corners=False
|
||||||
|
).squeeze(0).squeeze(0) # (H, W)
|
||||||
|
|
||||||
|
v_upsampled = torch.nn.functional.interpolate(
|
||||||
|
v_plane.unsqueeze(0).unsqueeze(0), # (1, 1, H/2, W/2)
|
||||||
|
size=(height, width),
|
||||||
|
mode='bilinear',
|
||||||
|
align_corners=False
|
||||||
|
).squeeze(0).squeeze(0) # (H, W)
|
||||||
|
|
||||||
|
# YUV to RGB conversion using BT.601 standard
|
||||||
|
# R = Y + 1.402 * (V - 128)
|
||||||
|
# G = Y - 0.344136 * (U - 128) - 0.714136 * (V - 128)
|
||||||
|
# B = Y + 1.772 * (U - 128)
|
||||||
|
|
||||||
|
y = y_plane
|
||||||
|
u = u_upsampled - 128.0
|
||||||
|
v = v_upsampled - 128.0
|
||||||
|
|
||||||
|
r = y + 1.402 * v
|
||||||
|
g = y - 0.344136 * u - 0.714136 * v
|
||||||
|
b = y + 1.772 * u
|
||||||
|
|
||||||
|
# Clamp to [0, 255] and convert to uint8
|
||||||
|
r = torch.clamp(r, 0, 255).to(torch.uint8)
|
||||||
|
g = torch.clamp(g, 0, 255).to(torch.uint8)
|
||||||
|
b = torch.clamp(b, 0, 255).to(torch.uint8)
|
||||||
|
|
||||||
|
# Stack to (3, H, W)
|
||||||
|
rgb = torch.stack([r, g, b], dim=0)
|
||||||
|
|
||||||
|
return rgb
|
||||||
|
|
||||||
|
|
||||||
|
class ConnectionStatus(Enum):
|
||||||
|
DISCONNECTED = "disconnected"
|
||||||
|
CONNECTING = "connecting"
|
||||||
|
CONNECTED = "connected"
|
||||||
|
ERROR = "error"
|
||||||
|
RECONNECTING = "reconnecting"
|
||||||
|
|
||||||
|
|
||||||
|
class StreamDecoderFactory:
|
||||||
|
"""
|
||||||
|
Factory for creating StreamDecoder instances with shared CUDA context.
|
||||||
|
This minimizes VRAM overhead by sharing the CUDA context across all decoders.
|
||||||
|
"""
|
||||||
|
|
||||||
|
_instance = None
|
||||||
|
_lock = threading.Lock()
|
||||||
|
|
||||||
|
def __new__(cls, gpu_id: int = 0):
|
||||||
|
if cls._instance is None:
|
||||||
|
with cls._lock:
|
||||||
|
if cls._instance is None:
|
||||||
|
cls._instance = super(StreamDecoderFactory, cls).__new__(cls)
|
||||||
|
cls._instance._initialized = False
|
||||||
|
return cls._instance
|
||||||
|
|
||||||
|
def __init__(self, gpu_id: int = 0):
|
||||||
|
if self._initialized:
|
||||||
|
return
|
||||||
|
|
||||||
|
self.gpu_id = gpu_id
|
||||||
|
|
||||||
|
# Initialize CUDA and get device
|
||||||
|
err, = cuda_driver.cuInit(0)
|
||||||
|
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||||
|
raise RuntimeError(f"Failed to initialize CUDA: {err}")
|
||||||
|
|
||||||
|
# Get CUDA device
|
||||||
|
err, self.cuda_device = cuda_driver.cuDeviceGet(gpu_id)
|
||||||
|
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||||
|
raise RuntimeError(f"Failed to get CUDA device {gpu_id}: {err}")
|
||||||
|
|
||||||
|
# Retain primary context (shared across all decoders)
|
||||||
|
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
|
||||||
|
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||||
|
raise RuntimeError(f"Failed to retain CUDA primary context: {err}")
|
||||||
|
|
||||||
|
self._initialized = True
|
||||||
|
print(f"StreamDecoderFactory initialized with shared CUDA context on GPU {gpu_id}")
|
||||||
|
|
||||||
|
def create_decoder(self, rtsp_url: str, buffer_size: int = 30,
|
||||||
|
codec: str = "h264") -> 'StreamDecoder':
|
||||||
|
"""
|
||||||
|
Create a new StreamDecoder instance with shared CUDA context.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rtsp_url: RTSP stream URL
|
||||||
|
buffer_size: Number of frames to buffer in VRAM
|
||||||
|
codec: Video codec (h264, hevc, etc.)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
StreamDecoder instance
|
||||||
|
"""
|
||||||
|
return StreamDecoder(
|
||||||
|
rtsp_url=rtsp_url,
|
||||||
|
cuda_context=self.cuda_context,
|
||||||
|
gpu_id=self.gpu_id,
|
||||||
|
buffer_size=buffer_size,
|
||||||
|
codec=codec
|
||||||
|
)
|
||||||
|
|
||||||
|
def __del__(self):
|
||||||
|
"""Cleanup shared CUDA context on factory destruction"""
|
||||||
|
if hasattr(self, 'cuda_device') and hasattr(self, 'gpu_id'):
|
||||||
|
cuda_driver.cuDevicePrimaryCtxRelease(self.cuda_device)
|
||||||
|
|
||||||
|
|
||||||
|
class StreamDecoder:
|
||||||
|
"""
|
||||||
|
Decodes RTSP stream using NVDEC and maintains a ring buffer of frames in GPU VRAM.
|
||||||
|
Thread-safe for concurrent read/write operations.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, rtsp_url: str, cuda_context, gpu_id: int,
|
||||||
|
buffer_size: int = 30, codec: str = "h264"):
|
||||||
|
"""
|
||||||
|
Initialize StreamDecoder.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rtsp_url: RTSP stream URL
|
||||||
|
cuda_context: Shared CUDA context handle
|
||||||
|
gpu_id: GPU device ID
|
||||||
|
buffer_size: Number of frames to keep in ring buffer
|
||||||
|
codec: Video codec type
|
||||||
|
"""
|
||||||
|
self.rtsp_url = rtsp_url
|
||||||
|
self.cuda_context = cuda_context
|
||||||
|
self.gpu_id = gpu_id
|
||||||
|
self.buffer_size = buffer_size
|
||||||
|
self.codec = codec
|
||||||
|
|
||||||
|
# Connection status
|
||||||
|
self.status = ConnectionStatus.DISCONNECTED
|
||||||
|
self._status_lock = threading.Lock()
|
||||||
|
|
||||||
|
# Frame buffer (ring buffer) - stores CUDA device pointers
|
||||||
|
self.frame_buffer = deque(maxlen=buffer_size)
|
||||||
|
self._buffer_lock = threading.RLock()
|
||||||
|
|
||||||
|
# Decoder and container instances
|
||||||
|
self.decoder = None
|
||||||
|
self.container = None
|
||||||
|
|
||||||
|
# Decode thread
|
||||||
|
self._decode_thread: Optional[threading.Thread] = None
|
||||||
|
self._stop_flag = threading.Event()
|
||||||
|
|
||||||
|
# Frame metadata
|
||||||
|
self.frame_width: Optional[int] = None
|
||||||
|
self.frame_height: Optional[int] = None
|
||||||
|
self.frame_count: int = 0
|
||||||
|
|
||||||
|
def start(self):
|
||||||
|
"""Start the RTSP stream decoding in background thread"""
|
||||||
|
if self._decode_thread is not None and self._decode_thread.is_alive():
|
||||||
|
print(f"Decoder already running for {self.rtsp_url}")
|
||||||
|
return
|
||||||
|
|
||||||
|
self._stop_flag.clear()
|
||||||
|
self._decode_thread = threading.Thread(target=self._decode_loop, daemon=True)
|
||||||
|
self._decode_thread.start()
|
||||||
|
print(f"Started decoder thread for {self.rtsp_url}")
|
||||||
|
|
||||||
|
def stop(self):
|
||||||
|
"""Stop the decoding thread and cleanup resources"""
|
||||||
|
self._stop_flag.set()
|
||||||
|
if self._decode_thread is not None:
|
||||||
|
self._decode_thread.join(timeout=5.0)
|
||||||
|
self._cleanup()
|
||||||
|
print(f"Stopped decoder for {self.rtsp_url}")
|
||||||
|
|
||||||
|
def _set_status(self, status: ConnectionStatus):
|
||||||
|
"""Thread-safe status update"""
|
||||||
|
with self._status_lock:
|
||||||
|
self.status = status
|
||||||
|
|
||||||
|
def get_status(self) -> ConnectionStatus:
|
||||||
|
"""Get current connection status"""
|
||||||
|
with self._status_lock:
|
||||||
|
return self.status
|
||||||
|
|
||||||
|
def _init_rtsp_connection(self) -> bool:
|
||||||
|
"""Initialize RTSP connection using PyAV + PyNvVideoCodec"""
|
||||||
|
try:
|
||||||
|
self._set_status(ConnectionStatus.CONNECTING)
|
||||||
|
|
||||||
|
# Open RTSP stream with PyAV
|
||||||
|
options = {
|
||||||
|
'rtsp_transport': 'tcp',
|
||||||
|
'max_delay': '500000', # 500ms
|
||||||
|
'rtsp_flags': 'prefer_tcp',
|
||||||
|
'timeout': '5000000', # 5 seconds
|
||||||
|
}
|
||||||
|
|
||||||
|
self.container = av.open(self.rtsp_url, options=options)
|
||||||
|
|
||||||
|
# Get video stream
|
||||||
|
video_stream = self.container.streams.video[0]
|
||||||
|
self.frame_width = video_stream.width
|
||||||
|
self.frame_height = video_stream.height
|
||||||
|
|
||||||
|
print(f"RTSP connected: {self.frame_width}x{self.frame_height}")
|
||||||
|
|
||||||
|
# Map codec name to PyNvVideoCodec codec enum
|
||||||
|
codec_map = {
|
||||||
|
'h264': nvc.cudaVideoCodec.H264,
|
||||||
|
'hevc': nvc.cudaVideoCodec.HEVC,
|
||||||
|
'h265': nvc.cudaVideoCodec.HEVC,
|
||||||
|
}
|
||||||
|
|
||||||
|
codec_id = codec_map.get(self.codec.lower(), nvc.cudaVideoCodec.H264)
|
||||||
|
|
||||||
|
# Initialize NVDEC decoder with shared CUDA context
|
||||||
|
self.decoder = nvc.CreateDecoder(
|
||||||
|
gpuid=self.gpu_id,
|
||||||
|
codec=codec_id,
|
||||||
|
cudacontext=self.cuda_context,
|
||||||
|
usedevicememory=True
|
||||||
|
)
|
||||||
|
|
||||||
|
self._set_status(ConnectionStatus.CONNECTED)
|
||||||
|
return True
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Failed to connect to RTSP stream {self.rtsp_url}: {e}")
|
||||||
|
self._set_status(ConnectionStatus.ERROR)
|
||||||
|
return False
|
||||||
|
|
||||||
|
def _decode_loop(self):
|
||||||
|
"""Main decode loop running in background thread"""
|
||||||
|
retry_count = 0
|
||||||
|
max_retries = 5
|
||||||
|
|
||||||
|
while not self._stop_flag.is_set():
|
||||||
|
# Initialize connection
|
||||||
|
if not self._init_rtsp_connection():
|
||||||
|
retry_count += 1
|
||||||
|
if retry_count >= max_retries:
|
||||||
|
print(f"Max retries reached for {self.rtsp_url}")
|
||||||
|
self._set_status(ConnectionStatus.ERROR)
|
||||||
|
break
|
||||||
|
|
||||||
|
self._set_status(ConnectionStatus.RECONNECTING)
|
||||||
|
self._stop_flag.wait(timeout=2.0)
|
||||||
|
continue
|
||||||
|
|
||||||
|
retry_count = 0 # Reset on successful connection
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Decode loop - iterate through packets from PyAV
|
||||||
|
for packet in self.container.demux(video=0):
|
||||||
|
if self._stop_flag.is_set():
|
||||||
|
break
|
||||||
|
|
||||||
|
if packet.dts is None:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Convert packet to numpy array
|
||||||
|
packet_data = np.frombuffer(bytes(packet), dtype=np.uint8)
|
||||||
|
|
||||||
|
# Create PacketData and pass numpy array pointer
|
||||||
|
pkt = nvc.PacketData()
|
||||||
|
pkt.bsl_data = packet_data.ctypes.data
|
||||||
|
pkt.bsl = len(packet_data)
|
||||||
|
|
||||||
|
# Decode using NVDEC
|
||||||
|
decoded_frames = self.decoder.Decode(pkt)
|
||||||
|
|
||||||
|
if not decoded_frames:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Add frames to ring buffer (thread-safe)
|
||||||
|
with self._buffer_lock:
|
||||||
|
for frame in decoded_frames:
|
||||||
|
self.frame_buffer.append(frame)
|
||||||
|
self.frame_count += 1
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error in decode loop for {self.rtsp_url}: {e}")
|
||||||
|
self._set_status(ConnectionStatus.RECONNECTING)
|
||||||
|
self._cleanup()
|
||||||
|
self._stop_flag.wait(timeout=2.0)
|
||||||
|
|
||||||
|
def _cleanup(self):
|
||||||
|
"""Cleanup resources"""
|
||||||
|
if self.container:
|
||||||
|
try:
|
||||||
|
self.container.close()
|
||||||
|
except Exception:
    # Container may already be closed; ignore errors during cleanup
    pass
|
||||||
|
self.container = None
|
||||||
|
|
||||||
|
self.decoder = None
|
||||||
|
|
||||||
|
with self._buffer_lock:
|
||||||
|
self.frame_buffer.clear()
|
||||||
|
|
||||||
|
def get_frame(self, index: int = -1, rgb: bool = True) -> Optional[torch.Tensor]:
|
||||||
|
"""
|
||||||
|
Get a frame from the buffer as a CUDA tensor (in VRAM).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
index: Frame index in buffer (-1 for latest, -2 for second latest, etc.)
|
||||||
|
rgb: If True, convert NV12 to RGB. If False, return raw NV12 format.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
torch.Tensor in CUDA memory (device tensor) or None if buffer empty
|
||||||
|
- If rgb=True: Shape (3, H, W) in RGB format, dtype uint8
|
||||||
|
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||||
|
"""
|
||||||
|
with self._buffer_lock:
|
||||||
|
if len(self.frame_buffer) == 0:
|
||||||
|
return None
|
||||||
|
|
||||||
|
try:
|
||||||
|
decoded_frame = self.frame_buffer[index]
|
||||||
|
|
||||||
|
# Convert DecodedFrame to PyTorch tensor using DLPack (zero-copy)
|
||||||
|
# This keeps the data in GPU memory
|
||||||
|
nv12_tensor = torch.from_dlpack(decoded_frame)
|
||||||
|
|
||||||
|
if not rgb:
|
||||||
|
# Return raw NV12 format
|
||||||
|
return nv12_tensor
|
||||||
|
|
||||||
|
# Convert NV12 to RGB on GPU
|
||||||
|
if self.frame_height is None or self.frame_width is None:
|
||||||
|
print("Frame dimensions not available")
|
||||||
|
return None
|
||||||
|
|
||||||
|
rgb_tensor = nv12_to_rgb_gpu(nv12_tensor, self.frame_height, self.frame_width)
|
||||||
|
return rgb_tensor
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error getting frame: {e}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
def get_latest_frame(self, rgb: bool = True) -> Optional[torch.Tensor]:
|
||||||
|
"""
|
||||||
|
Get the most recent decoded frame as CUDA tensor.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rgb: If True, convert to RGB. If False, return raw NV12.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
torch.Tensor on GPU in RGB (3, H, W) or NV12 (H*3/2, W) format
|
||||||
|
"""
|
||||||
|
return self.get_frame(-1, rgb=rgb)
|
||||||
|
|
||||||
|
def get_frame_cpu(self, index: int = -1, rgb: bool = True) -> Optional[np.ndarray]:
|
||||||
|
"""
|
||||||
|
Get a frame from the buffer and copy it to CPU memory as numpy array.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
index: Frame index in buffer (-1 for latest, -2 for second latest, etc.)
|
||||||
|
rgb: If True, convert NV12 to RGB. If False, return raw NV12 format.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
numpy.ndarray in CPU memory or None if buffer empty
|
||||||
|
- If rgb=True: Shape (H, W, 3) in RGB format, dtype uint8 (HWC format for easy display)
|
||||||
|
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||||
|
"""
|
||||||
|
# Get frame on GPU
|
||||||
|
gpu_frame = self.get_frame(index=index, rgb=rgb)
|
||||||
|
|
||||||
|
if gpu_frame is None:
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Transfer from GPU to CPU
|
||||||
|
cpu_tensor = gpu_frame.cpu()
|
||||||
|
|
||||||
|
# Convert to numpy array
|
||||||
|
if rgb:
|
||||||
|
# Convert from (3, H, W) to (H, W, 3) for standard image format
|
||||||
|
cpu_array = cpu_tensor.permute(1, 2, 0).numpy()
|
||||||
|
else:
|
||||||
|
# Keep NV12 format as-is
|
||||||
|
cpu_array = cpu_tensor.numpy()
|
||||||
|
|
||||||
|
return cpu_array
|
||||||
|
|
||||||
|
def get_latest_frame_cpu(self, rgb: bool = True) -> Optional[np.ndarray]:
|
||||||
|
"""
|
||||||
|
Get the most recent decoded frame as CPU numpy array.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rgb: If True, convert to RGB. If False, return raw NV12.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
numpy.ndarray in CPU memory
|
||||||
|
- If rgb=True: Shape (H, W, 3) in RGB format, dtype uint8
|
||||||
|
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||||
|
"""
|
||||||
|
return self.get_frame_cpu(-1, rgb=rgb)
|
||||||
|
|
||||||
|
def get_buffer_size(self) -> int:
|
||||||
|
"""Get current number of frames in buffer"""
|
||||||
|
with self._buffer_lock:
|
||||||
|
return len(self.frame_buffer)
|
||||||
|
|
||||||
|
def is_connected(self) -> bool:
|
||||||
|
"""Check if stream is actively connected"""
|
||||||
|
return self.get_status() == ConnectionStatus.CONNECTED
|
||||||
|
|
||||||
|
def get_frame_as_jpeg(self, index: int = -1, quality: int = 95) -> Optional[bytes]:
|
||||||
|
"""
|
||||||
|
Get a frame from the buffer and encode to JPEG.
|
||||||
|
|
||||||
|
This method:
|
||||||
|
1. Gets RGB frame from buffer (stays on GPU)
|
||||||
|
2. Encodes to JPEG using nvJPEG (GPU operation via shared encoder)
|
||||||
|
3. Transfers JPEG bytes to CPU
|
||||||
|
4. Returns bytes for saving to disk
|
||||||
|
|
||||||
|
Args:
|
||||||
|
index: Frame index in buffer (-1 for latest)
|
||||||
|
quality: JPEG quality (0-100, default 95)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JPEG encoded bytes or None if frame unavailable
|
||||||
|
"""
|
||||||
|
# Get RGB frame (on GPU)
|
||||||
|
rgb_frame = self.get_frame(index=index, rgb=True)
|
||||||
|
|
||||||
|
# Use the shared JPEG encoder from jpeg_encoder module
|
||||||
|
return encode_frame_to_jpeg(rgb_frame, quality=quality)
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return (f"StreamDecoder(url={self.rtsp_url}, status={self.status.value}, "
|
||||||
|
f"buffer={self.get_buffer_size()}/{self.buffer_size}, "
|
||||||
|
f"frames_decoded={self.frame_count})")
|
||||||
174
test_jpeg_encode.py
Executable file
@@ -0,0 +1,174 @@
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Test script for JPEG encoding with nvImageCodec
|
||||||
|
Tests GPU-accelerated JPEG encoding from RTSP stream frames
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
from services import StreamDecoderFactory
|
||||||
|
|
||||||
|
# Load environment variables from .env file
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser(description='Test JPEG encoding from RTSP stream')
|
||||||
|
parser.add_argument(
|
||||||
|
'--rtsp-url',
|
||||||
|
type=str,
|
||||||
|
default=None,
|
||||||
|
help='RTSP stream URL (defaults to CAMERA_URL_1 from .env)'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--output-dir',
|
||||||
|
type=str,
|
||||||
|
default='./snapshots',
|
||||||
|
help='Output directory for JPEG files'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--num-frames',
|
||||||
|
type=int,
|
||||||
|
default=10,
|
||||||
|
help='Number of frames to capture'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--interval',
|
||||||
|
type=float,
|
||||||
|
default=1.0,
|
||||||
|
help='Interval between captures in seconds'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--quality',
|
||||||
|
type=int,
|
||||||
|
default=95,
|
||||||
|
help='JPEG quality (0-100)'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--gpu-id',
|
||||||
|
type=int,
|
||||||
|
default=0,
|
||||||
|
help='GPU device ID'
|
||||||
|
)
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Get RTSP URL from command line or environment
|
||||||
|
rtsp_url = args.rtsp_url
|
||||||
|
if not rtsp_url:
|
||||||
|
rtsp_url = os.getenv('CAMERA_URL_1')
|
||||||
|
if not rtsp_url:
|
||||||
|
print("Error: No RTSP URL provided")
|
||||||
|
print("Please either:")
|
||||||
|
print(" 1. Use --rtsp-url argument, or")
|
||||||
|
print(" 2. Add CAMERA_URL_1 to your .env file")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
# Create output directory
|
||||||
|
output_dir = Path(args.output_dir)
|
||||||
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
print("=" * 80)
|
||||||
|
print("RTSP Stream JPEG Encoding Test")
|
||||||
|
print("=" * 80)
|
||||||
|
print(f"RTSP URL: {rtsp_url}")
|
||||||
|
print(f"Output Directory: {output_dir}")
|
||||||
|
print(f"Number of Frames: {args.num_frames}")
|
||||||
|
print(f"Capture Interval: {args.interval}s")
|
||||||
|
print(f"JPEG Quality: {args.quality}")
|
||||||
|
print(f"GPU ID: {args.gpu_id}")
|
||||||
|
print("=" * 80)
|
||||||
|
print()
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Initialize factory and decoder
|
||||||
|
print("[1/3] Initializing StreamDecoderFactory...")
|
||||||
|
factory = StreamDecoderFactory(gpu_id=args.gpu_id)
|
||||||
|
print("✓ Factory initialized\n")
|
||||||
|
|
||||||
|
print("[2/3] Creating and starting decoder...")
|
||||||
|
decoder = factory.create_decoder(
|
||||||
|
rtsp_url=rtsp_url,
|
||||||
|
buffer_size=30
|
||||||
|
)
|
||||||
|
decoder.start()
|
||||||
|
print("✓ Decoder started\n")
|
||||||
|
|
||||||
|
# Wait for connection
|
||||||
|
print("[3/3] Waiting for stream to connect...")
|
||||||
|
max_wait = 10
|
||||||
|
for i in range(max_wait):
|
||||||
|
if decoder.is_connected():
|
||||||
|
print("✓ Stream connected\n")
|
||||||
|
break
|
||||||
|
time.sleep(1)
|
||||||
|
print(f" Waiting... {i+1}/{max_wait}s")
|
||||||
|
else:
|
||||||
|
print("✗ Failed to connect to stream")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
# Capture frames
|
||||||
|
print(f"Capturing {args.num_frames} frames...")
|
||||||
|
print("-" * 80)
|
||||||
|
|
||||||
|
captured = 0
|
||||||
|
for i in range(args.num_frames):
|
||||||
|
# Get frame as JPEG
|
||||||
|
start_time = time.time()
|
||||||
|
jpeg_bytes = decoder.get_frame_as_jpeg(quality=args.quality)
|
||||||
|
encode_time = (time.time() - start_time) * 1000 # ms
|
||||||
|
|
||||||
|
if jpeg_bytes:
|
||||||
|
# Save to file
|
||||||
|
filename = output_dir / f"frame_{i:04d}.jpg"
|
||||||
|
with open(filename, 'wb') as f:
|
||||||
|
f.write(jpeg_bytes)
|
||||||
|
|
||||||
|
size_kb = len(jpeg_bytes) / 1024
|
||||||
|
print(f"[{i+1}/{args.num_frames}] Saved {filename.name} "
|
||||||
|
f"({size_kb:.1f} KB, encoded in {encode_time:.2f}ms)")
|
||||||
|
captured += 1
|
||||||
|
else:
|
||||||
|
print(f"[{i+1}/{args.num_frames}] Failed to get frame")
|
||||||
|
|
||||||
|
# Wait before next capture (except for last frame)
|
||||||
|
if i < args.num_frames - 1:
|
||||||
|
time.sleep(args.interval)
|
||||||
|
|
||||||
|
print("-" * 80)
|
||||||
|
|
||||||
|
# Summary
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("Capture Complete")
|
||||||
|
print("=" * 80)
|
||||||
|
print(f"Successfully captured: {captured}/{args.num_frames} frames")
|
||||||
|
print(f"Output directory: {output_dir.absolute()}")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
print("\n\n✗ Interrupted by user")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n\n✗ Error: {e}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Cleanup
|
||||||
|
if 'decoder' in locals():
|
||||||
|
print("\nCleaning up...")
|
||||||
|
decoder.stop()
|
||||||
|
print("✓ Decoder stopped")
|
||||||
|
|
||||||
|
print("\n✓ Test completed successfully")
|
||||||
|
sys.exit(0)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
310
test_model_inference.py
Normal file
@@ -0,0 +1,310 @@
"""
|
||||||
|
Test script for TensorRT Model Repository with multi-camera inference.
|
||||||
|
|
||||||
|
This demonstrates:
|
||||||
|
1. Loading the same model for multiple cameras (deduplication)
|
||||||
|
2. Context pool load balancing
|
||||||
|
3. GPU-to-GPU inference from RTSP streams
|
||||||
|
4. Memory efficiency with shared engines
|
||||||
|
"""
|
||||||
|
|
||||||
|
import time
|
||||||
|
import torch
|
||||||
|
from services.model_repository import TensorRTModelRepository
|
||||||
|
from services.stream_decoder import StreamDecoderFactory
|
||||||
|
|
||||||
|
|
||||||
|
def test_multi_camera_inference():
|
||||||
|
"""
|
||||||
|
Simulate multi-camera inference scenario.
|
||||||
|
|
||||||
|
Example: 100 cameras, all using the same YOLOv8 model
|
||||||
|
- Without pooling: 100 engines + 100 contexts in VRAM
|
||||||
|
- With pooling: 1 engine + 4 contexts in VRAM (huge savings!)
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Initialize model repository with context pooling
|
||||||
|
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||||
|
|
||||||
|
# Camera configurations (simulated)
|
||||||
|
camera_configs = [
|
||||||
|
{"id": "camera_1", "rtsp_url": "rtsp://camera1.local/stream"},
|
||||||
|
{"id": "camera_2", "rtsp_url": "rtsp://camera2.local/stream"},
|
||||||
|
{"id": "camera_3", "rtsp_url": "rtsp://camera3.local/stream"},
|
||||||
|
# ... imagine 100 cameras here
|
||||||
|
]
|
||||||
|
|
||||||
|
# Load the same model for all cameras
|
||||||
|
model_file = "models/yolov8n.trt" # Same file for all cameras
|
||||||
|
|
||||||
|
print("=" * 80)
|
||||||
|
print("LOADING MODELS FOR MULTIPLE CAMERAS")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
for config in camera_configs:
|
||||||
|
try:
|
||||||
|
# Each camera gets its own model_id, but shares the same engine!
|
||||||
|
metadata = repo.load_model(
|
||||||
|
model_id=config["id"],
|
||||||
|
file_path=model_file,
|
||||||
|
num_contexts=4 # 4 contexts shared across all cameras
|
||||||
|
)
|
||||||
|
print(f"\n✓ Loaded model for {config['id']}")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n✗ Failed to load model for {config['id']}: {e}")
|
||||||
|
|
||||||
|
# Show repository stats
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("REPOSITORY STATISTICS")
|
||||||
|
print("=" * 80)
|
||||||
|
stats = repo.get_stats()
|
||||||
|
print(f"Total model IDs: {stats['total_model_ids']}")
|
||||||
|
print(f"Unique engines in VRAM: {stats['unique_engines']}")
|
||||||
|
print(f"Total contexts: {stats['total_contexts']}")
|
||||||
|
print(f"Memory efficiency: {stats['memory_efficiency']}")
|
||||||
|
|
||||||
|
# Get detailed info for one camera
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("DETAILED MODEL INFO (camera_1)")
|
||||||
|
print("=" * 80)
|
||||||
|
info = repo.get_model_info("camera_1")
|
||||||
|
if info:
|
||||||
|
print(f"Model ID: {info['model_id']}")
|
||||||
|
print(f"File: {info['file_path']}")
|
||||||
|
print(f"File hash: {info['file_hash']}")
|
||||||
|
print(f"Engine references: {info['engine_references']}")
|
||||||
|
print(f"Context pool size: {info['context_pool_size']}")
|
||||||
|
print(f"Shared with: {info['shared_with_model_ids']}")
|
||||||
|
print(f"\nInputs:")
|
||||||
|
for name, spec in info['inputs'].items():
|
||||||
|
print(f" {name}: {spec['shape']} ({spec['dtype']})")
|
||||||
|
print(f"\nOutputs:")
|
||||||
|
for name, spec in info['outputs'].items():
|
||||||
|
print(f" {name}: {spec['shape']} ({spec['dtype']})")
|
||||||
|
|
||||||
|
# Simulate inference from multiple cameras
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("RUNNING INFERENCE (GPU-to-GPU)")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
# Create dummy input tensors (simulating frames from cameras)
|
||||||
|
# In real scenario, these come from StreamDecoder.get_frame()
|
||||||
|
batch_size = 1
|
||||||
|
channels = 3
|
||||||
|
height = 640
|
||||||
|
width = 640
|
||||||
|
|
||||||
|
for config in camera_configs:
|
||||||
|
try:
|
||||||
|
# Simulate getting frame from camera (already on GPU)
|
||||||
|
input_tensor = torch.rand(
|
||||||
|
batch_size, channels, height, width,
|
||||||
|
dtype=torch.float32,
|
||||||
|
device='cuda:0'
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run inference (stays in GPU)
|
||||||
|
start = time.time()
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id=config["id"],
|
||||||
|
inputs={"images": input_tensor}, # Adjust input name based on your model
|
||||||
|
synchronize=True,
|
||||||
|
timeout=5.0
|
||||||
|
)
|
||||||
|
elapsed = (time.time() - start) * 1000 # Convert to ms
|
||||||
|
|
||||||
|
print(f"\n{config['id']}: Inference completed in {elapsed:.2f}ms")
|
||||||
|
for name, tensor in outputs.items():
|
||||||
|
print(f" Output '{name}': {tensor.shape} on {tensor.device}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n{config['id']}: Inference failed: {e}")
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("CLEANUP")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
for config in camera_configs:
|
||||||
|
repo.unload_model(config["id"])
|
||||||
|
|
||||||
|
print("\nAll models unloaded.")
|
||||||
|
|
||||||
|
|
||||||
|
def test_rtsp_stream_with_inference():
|
||||||
|
"""
|
||||||
|
Real-world example: Decode RTSP stream and run inference.
|
||||||
|
Everything stays in GPU memory (zero CPU transfers).
|
||||||
|
"""
|
||||||
|
|
||||||
|
print("=" * 80)
|
||||||
|
print("RTSP STREAM + TENSORRT INFERENCE (GPU-to-GPU)")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
# Initialize components
|
||||||
|
decoder_factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
model_repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||||
|
|
||||||
|
# Setup camera stream
|
||||||
|
rtsp_url = "rtsp://your-camera-ip/stream"
|
||||||
|
decoder = decoder_factory.create_decoder(rtsp_url, buffer_size=30)
|
||||||
|
decoder.start()
|
||||||
|
|
||||||
|
# Load inference model
|
||||||
|
try:
|
||||||
|
model_repo.load_model(
|
||||||
|
model_id="camera_main",
|
||||||
|
file_path="models/yolov8n.trt"
|
||||||
|
)
|
||||||
|
except FileNotFoundError:
|
||||||
|
print("\n⚠ Model file not found. Please export your model to TensorRT:")
|
||||||
|
print(" Example: yolo export model=yolov8n.pt format=engine device=0")
|
||||||
|
return
|
||||||
|
|
||||||
|
print("\nWaiting for stream to buffer frames...")
|
||||||
|
time.sleep(3)
|
||||||
|
|
||||||
|
# Process frames
|
||||||
|
for i in range(10):
|
||||||
|
# Get frame from decoder (already on GPU)
|
||||||
|
frame_gpu = decoder.get_latest_frame(rgb=True) # Returns torch.Tensor on CUDA
|
||||||
|
|
||||||
|
if frame_gpu is None:
|
||||||
|
print(f"Frame {i}: No frame available")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Preprocess if needed (stays on GPU)
|
||||||
|
# For YOLOv8: normalize, resize, etc.
|
||||||
|
# Example preprocessing (adjust for your model):
|
||||||
|
frame_gpu = frame_gpu.float() / 255.0 # Normalize to [0, 1]
|
||||||
|
frame_gpu = frame_gpu.unsqueeze(0) # Add batch dimension: (1, 3, H, W)
|
||||||
|
|
||||||
|
# Run inference (GPU-to-GPU, zero copy)
|
||||||
|
try:
|
||||||
|
outputs = model_repo.infer(
|
||||||
|
model_id="camera_main",
|
||||||
|
inputs={"images": frame_gpu},
|
||||||
|
synchronize=True
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"\nFrame {i}: Inference successful")
|
||||||
|
for name, tensor in outputs.items():
|
||||||
|
print(f" {name}: {tensor.shape} on {tensor.device}")
|
||||||
|
|
||||||
|
# Post-process results (can stay on GPU or move to CPU as needed)
|
||||||
|
# Example: NMS, bounding box extraction, etc.
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\nFrame {i}: Inference failed: {e}")
|
||||||
|
|
||||||
|
time.sleep(0.1) # Simulate processing interval
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
decoder.stop()
|
||||||
|
model_repo.unload_model("camera_main")
|
||||||
|
print("\n✓ Test completed successfully")
|
||||||
|
|
||||||
|
|
||||||
|
def test_concurrent_inference():
|
||||||
|
"""
|
||||||
|
Test concurrent inference from multiple threads.
|
||||||
|
Demonstrates context pool load balancing.
|
||||||
|
"""
|
||||||
|
import threading
|
||||||
|
|
||||||
|
print("=" * 80)
|
||||||
|
print("CONCURRENT INFERENCE TEST (Context Pool Load Balancing)")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||||
|
|
||||||
|
# Load model
|
||||||
|
try:
|
||||||
|
repo.load_model("shared_model", "models/yolov8n.trt", num_contexts=4)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Failed to load model: {e}")
|
||||||
|
return
|
||||||
|
|
||||||
|
def worker(worker_id: int, num_inferences: int):
|
||||||
|
"""Worker thread performing inference"""
|
||||||
|
for i in range(num_inferences):
|
||||||
|
try:
|
||||||
|
# Create dummy input
|
||||||
|
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0', dtype=torch.float32)
|
||||||
|
|
||||||
|
# Acquire context from pool, run inference, release context
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id="shared_model",
|
||||||
|
inputs={"images": input_tensor},
|
||||||
|
timeout=10.0
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"Worker {worker_id}, Inference {i}: SUCCESS")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Worker {worker_id}, Inference {i}: FAILED - {e}")
|
||||||
|
|
||||||
|
time.sleep(0.01) # Small delay
|
||||||
|
|
||||||
|
# Launch multiple worker threads (more workers than contexts!)
|
||||||
|
threads = []
|
||||||
|
num_workers = 10 # 10 workers sharing 4 contexts
|
||||||
|
inferences_per_worker = 5
|
||||||
|
|
||||||
|
print(f"\nLaunching {num_workers} workers (only 4 contexts available)")
|
||||||
|
print("Contexts will be borrowed/returned automatically\n")
|
||||||
|
|
||||||
|
start_time = time.time()
|
||||||
|
|
||||||
|
for worker_id in range(num_workers):
|
||||||
|
t = threading.Thread(target=worker, args=(worker_id, inferences_per_worker))
|
||||||
|
threads.append(t)
|
||||||
|
t.start()
|
||||||
|
|
||||||
|
# Wait for all workers
|
||||||
|
for t in threads:
|
||||||
|
t.join()
|
||||||
|
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
total_inferences = num_workers * inferences_per_worker
|
||||||
|
|
||||||
|
print(f"\n✓ Completed {total_inferences} inferences in {elapsed:.2f}s")
|
||||||
|
print(f" Throughput: {total_inferences / elapsed:.2f} inferences/sec")
|
||||||
|
print(f" With only 4 contexts for {num_workers} workers!")
|
||||||
|
|
||||||
|
repo.unload_model("shared_model")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("TENSORRT MODEL REPOSITORY - TEST SUITE")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
# Test 1: Multi-camera model loading
|
||||||
|
print("\n\nTEST 1: Multi-Camera Model Loading with Deduplication")
|
||||||
|
print("-" * 80)
|
||||||
|
try:
|
||||||
|
test_multi_camera_inference()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Test 1 failed: {e}")
|
||||||
|
|
||||||
|
# Test 2: RTSP stream + inference (commented out by default)
|
||||||
|
# Uncomment if you have a real RTSP stream
|
||||||
|
# print("\n\nTEST 2: RTSP Stream + Inference")
|
||||||
|
# print("-" * 80)
|
||||||
|
# try:
|
||||||
|
# test_rtsp_stream_with_inference()
|
||||||
|
# except Exception as e:
|
||||||
|
# print(f"Test 2 failed: {e}")
|
||||||
|
|
||||||
|
# Test 3: Concurrent inference
|
||||||
|
print("\n\nTEST 3: Concurrent Inference with Context Pooling")
|
||||||
|
print("-" * 80)
|
||||||
|
try:
|
||||||
|
test_concurrent_inference()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Test 3 failed: {e}")
|
||||||
|
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("ALL TESTS COMPLETED")
|
||||||
|
print("=" * 80)
|
||||||
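The inference calls above assume a 640×640 input named `images` and feed synthetic `torch.rand` tensors. As a hedged sketch only (the real preprocessing depends on how the engine was exported), a frame from `StreamDecoder.get_latest_frame(rgb=True)` could be shaped into that input roughly as follows; `preprocess_for_yolov8` is a hypothetical helper, not part of the repository API:

```python
import torch
import torch.nn.functional as F


def preprocess_for_yolov8(frame_chw_uint8: torch.Tensor, size: int = 640) -> torch.Tensor:
    """Hypothetical helper: turn a decoded RGB frame (3, H, W, uint8, on CUDA)
    into the (1, 3, size, size) float32 tensor the tests feed as 'images'.
    Plain resize only; add letterboxing if your exported engine expects it."""
    x = frame_chw_uint8.unsqueeze(0).float() / 255.0           # (1, 3, H, W), values in [0, 1]
    x = F.interpolate(x, size=(size, size), mode='bilinear',   # resize entirely on the GPU
                      align_corners=False)
    return x.contiguous()
```

A call such as `repo.infer(model_id="camera_1", inputs={"images": preprocess_for_yolov8(frame_gpu)})` would then keep the whole path GPU-resident, matching the zero-copy design described earlier.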
255
test_multi_stream.py
Executable file
@ -0,0 +1,255 @@
#!/usr/bin/env python3
"""
Multi-stream test script to verify CUDA context sharing efficiency.
Tests multiple RTSP streams simultaneously and monitors VRAM usage.
"""

import argparse
import time
import sys
import subprocess
import os
from pathlib import Path
from dotenv import load_dotenv
from services import StreamDecoderFactory, ConnectionStatus

# Load environment variables from .env file
load_dotenv()


def get_gpu_memory_usage(gpu_id: int = 0) -> int:
    """Get current GPU memory usage in MB using nvidia-smi"""
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits', f'--id={gpu_id}'],
            capture_output=True,
            text=True,
            check=True
        )
        return int(result.stdout.strip())
    except Exception as e:
        print(f"Warning: Could not get GPU memory usage: {e}")
        return 0


def main():
    parser = argparse.ArgumentParser(description='Test multi-stream decoding with context sharing')
    parser.add_argument(
        '--gpu-id',
        type=int,
        default=0,
        help='GPU device ID'
    )
    parser.add_argument(
        '--duration',
        type=int,
        default=20,
        help='Test duration in seconds'
    )
    parser.add_argument(
        '--capture-snapshots',
        action='store_true',
        help='Capture JPEG snapshots during test'
    )
    parser.add_argument(
        '--output-dir',
        type=str,
        default='./multi_stream_snapshots',
        help='Output directory for snapshots'
    )

    args = parser.parse_args()

    # Load camera URLs from environment
    camera_urls = []
    i = 1
    while True:
        url = os.getenv(f'CAMERA_URL_{i}')
        if url:
            camera_urls.append(url)
            i += 1
        else:
            break

    if not camera_urls:
        print("Error: No camera URLs found in .env file")
        print("Please add CAMERA_URL_1, CAMERA_URL_2, etc. to your .env file")
        sys.exit(1)

    # Create output directory if capturing snapshots
    if args.capture_snapshots:
        output_dir = Path(args.output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 80)
    print("Multi-Stream RTSP Decoder Test - Context Sharing Verification")
    print("=" * 80)
    print(f"Number of Streams: {len(camera_urls)}")
    print(f"GPU ID: {args.gpu_id}")
    print(f"Test Duration: {args.duration} seconds")
    print(f"Capture Snapshots: {args.capture_snapshots}")
    print("=" * 80)
    print()

    try:
        # Get baseline GPU memory
        print("[Baseline] Measuring initial GPU memory usage...")
        baseline_memory = get_gpu_memory_usage(args.gpu_id)
        print(f"✓ Baseline VRAM: {baseline_memory} MB\n")

        # Initialize factory (shared CUDA context)
        print("[1/4] Initializing StreamDecoderFactory with shared CUDA context...")
        factory = StreamDecoderFactory(gpu_id=args.gpu_id)

        factory_memory = get_gpu_memory_usage(args.gpu_id)
        factory_overhead = factory_memory - baseline_memory
        print(f"✓ Factory initialized")
        print(f" VRAM after factory: {factory_memory} MB (+{factory_overhead} MB)\n")

        # Create all decoders
        print(f"[2/4] Creating {len(camera_urls)} StreamDecoder instances...")
        decoders = []
        for i, url in enumerate(camera_urls):
            decoder = factory.create_decoder(
                rtsp_url=url,
                buffer_size=30,
                codec='h264'
            )
            decoders.append(decoder)
            print(f" ✓ Decoder {i+1} created for camera {url.split('@')[1].split('/')[0]}")

        decoders_memory = get_gpu_memory_usage(args.gpu_id)
        decoders_overhead = decoders_memory - factory_memory
        print(f"\n VRAM after creating {len(decoders)} decoders: {decoders_memory} MB (+{decoders_overhead} MB)")
        print(f" Average per decoder: {decoders_overhead / len(decoders):.1f} MB\n")

        # Start all decoders
        print(f"[3/4] Starting all {len(decoders)} decoders...")
        for i, decoder in enumerate(decoders):
            decoder.start()
            print(f" ✓ Decoder {i+1} started")

        started_memory = get_gpu_memory_usage(args.gpu_id)
        started_overhead = started_memory - decoders_memory
        print(f"\n VRAM after starting decoders: {started_memory} MB (+{started_overhead} MB)")
        print(f" Average per running decoder: {started_overhead / len(decoders):.1f} MB\n")

        # Wait for all streams to connect
        print("[4/4] Waiting for all streams to connect...")
        max_wait = 15
        for wait_time in range(max_wait):
            connected = sum(1 for d in decoders if d.is_connected())
            print(f" Connected: {connected}/{len(decoders)} streams", end='\r')

            if connected == len(decoders):
                print(f"\n✓ All {len(decoders)} streams connected!\n")
                break

            time.sleep(1)
        else:
            connected = sum(1 for d in decoders if d.is_connected())
            print(f"\n⚠ Only {connected}/{len(decoders)} streams connected after {max_wait}s\n")

        connected_memory = get_gpu_memory_usage(args.gpu_id)
        connected_overhead = connected_memory - started_memory
        print(f" VRAM after connection: {connected_memory} MB (+{connected_overhead} MB)\n")

        # Monitor streams
        print(f"Monitoring streams for {args.duration} seconds...")
        print("=" * 80)
        print(f"{'Time':<8} {'VRAM':<10} {'Stream 1':<12} {'Stream 2':<12} {'Stream 3':<12} {'Stream 4':<12}")
        print("-" * 80)

        start_time = time.time()
        snapshot_interval = args.duration // 3 if args.capture_snapshots else 0
        last_snapshot = 0

        while time.time() - start_time < args.duration:
            elapsed = time.time() - start_time
            current_memory = get_gpu_memory_usage(args.gpu_id)

            # Get stats for each decoder
            stats = []
            for decoder in decoders:
                status = decoder.get_status().value[:8]
                buffer = decoder.get_buffer_size()
                frames = decoder.frame_count
                stats.append(f"{status:8s} {buffer:2d}/30 {frames:4d}")

            print(f"{elapsed:6.1f}s {current_memory:6d}MB {stats[0]:<12} {stats[1]:<12} {stats[2]:<12} {stats[3]:<12}")

            # Capture snapshots
            if args.capture_snapshots and snapshot_interval > 0:
                if elapsed - last_snapshot >= snapshot_interval:
                    print("\n → Capturing snapshots from all streams...")
                    for i, decoder in enumerate(decoders):
                        jpeg_bytes = decoder.get_frame_as_jpeg(quality=85)
                        if jpeg_bytes:
                            filename = output_dir / f"camera_{i+1}_t{int(elapsed)}s.jpg"
                            with open(filename, 'wb') as f:
                                f.write(jpeg_bytes)
                            print(f" Saved {filename.name} ({len(jpeg_bytes)/1024:.1f} KB)")
                    print()
                    last_snapshot = elapsed

            time.sleep(1)

        print("=" * 80)

        # Final memory analysis
        final_memory = get_gpu_memory_usage(args.gpu_id)
        total_overhead = final_memory - baseline_memory

        print("\n" + "=" * 80)
        print("Memory Usage Analysis")
        print("=" * 80)
        print(f"Baseline VRAM: {baseline_memory:6d} MB")
        print(f"After Factory Init: {factory_memory:6d} MB (+{factory_overhead:4d} MB)")
        print(f"After Creating {len(decoders)} Decoders: {decoders_memory:6d} MB (+{decoders_overhead:4d} MB)")
        print(f"After Starting Decoders: {started_memory:6d} MB (+{started_overhead:4d} MB)")
        print(f"After Connection: {connected_memory:6d} MB (+{connected_overhead:4d} MB)")
        print(f"Final (after {args.duration}s): {final_memory:6d} MB (+{total_overhead:4d} MB total)")
        print("-" * 80)
        print(f"Average VRAM per stream: {total_overhead / len(decoders):6.1f} MB")
        print(f"Context sharing efficiency: {'EXCELLENT' if total_overhead < 500 else 'GOOD' if total_overhead < 800 else 'POOR'}")
        print("=" * 80)

        # Final stats
        print("\nFinal Stream Statistics:")
        print("-" * 80)
        for i, decoder in enumerate(decoders):
            status = decoder.get_status().value
            buffer = decoder.get_buffer_size()
            frames = decoder.frame_count
            fps = frames / args.duration if args.duration > 0 else 0
            print(f"Stream {i+1}: {status:12s} | Buffer: {buffer:2d}/{decoder.buffer_size} | "
                  f"Frames: {frames:5d} | Avg FPS: {fps:5.2f}")
        print("=" * 80)

    except KeyboardInterrupt:
        print("\n\n✗ Interrupted by user")
        sys.exit(1)

    except Exception as e:
        print(f"\n\n✗ Error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

    finally:
        # Cleanup
        if 'decoders' in locals():
            print("\nCleaning up...")
            for i, decoder in enumerate(decoders):
                decoder.stop()
                print(f" ✓ Decoder {i+1} stopped")

            cleanup_memory = get_gpu_memory_usage(args.gpu_id)
            print(f"\nVRAM after cleanup: {cleanup_memory} MB")

        print("\n✓ Multi-stream test completed successfully")
        sys.exit(0)


if __name__ == '__main__':
    main()
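Given the per-stream figures this test prints, a rough capacity estimate follows from the linear model total ≈ baseline + per-stream × n. The helper below is a sketch under that assumption; the 60 MB default slope and the `headroom_mb` reservation are illustrative values to be replaced with numbers from your own run, and the function is not part of the test script:

```python
def estimate_stream_capacity(total_vram_mb: int,
                             baseline_mb: int,
                             per_stream_mb: float = 60.0,
                             headroom_mb: int = 1024) -> int:
    """Rough capacity estimate assuming the linear scaling this test measures:
    total ≈ baseline + per_stream_mb * n. per_stream_mb defaults to the ~60 MB
    per-stream figure claimed for this pipeline; headroom_mb reserves VRAM for
    inference engines and transient allocations."""
    budget = total_vram_mb - baseline_mb - headroom_mb
    return max(0, int(budget // per_stream_mb))


# Example: a 24 GB card with a ~500 MB baseline leaves room for roughly
# estimate_stream_capacity(24576, 500) == 384 decode-only streams.
```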
152
test_stream.py
Executable file
@ -0,0 +1,152 @@
#!/usr/bin/env python3
"""
CLI test script for StreamDecoder
Tests RTSP stream decoding with NVDEC hardware acceleration
"""

import argparse
import time
import sys
from services.stream_decoder import StreamDecoderFactory, ConnectionStatus


def main():
    parser = argparse.ArgumentParser(description='Test RTSP stream decoder with NVDEC')
    parser.add_argument(
        '--rtsp-url',
        type=str,
        required=True,
        help='RTSP stream URL (e.g., rtsp://user:pass@host/path)'
    )
    parser.add_argument(
        '--gpu-id',
        type=int,
        default=0,
        help='GPU device ID'
    )
    parser.add_argument(
        '--buffer-size',
        type=int,
        default=30,
        help='Frame buffer size'
    )
    parser.add_argument(
        '--duration',
        type=int,
        default=30,
        help='Test duration in seconds'
    )
    parser.add_argument(
        '--check-interval',
        type=float,
        default=1.0,
        help='Status check interval in seconds'
    )

    args = parser.parse_args()

    print("=" * 80)
    print("RTSP Stream Decoder Test")
    print("=" * 80)
    print(f"RTSP URL: {args.rtsp_url}")
    print(f"GPU ID: {args.gpu_id}")
    print(f"Buffer Size: {args.buffer_size} frames")
    print(f"Test Duration: {args.duration} seconds")
    print("=" * 80)
    print()

    try:
        # Create factory with shared CUDA context
        print("[1/4] Initializing StreamDecoderFactory...")
        factory = StreamDecoderFactory(gpu_id=args.gpu_id)
        print("✓ Factory initialized with shared CUDA context\n")

        # Create decoder
        print("[2/4] Creating StreamDecoder...")
        decoder = factory.create_decoder(
            rtsp_url=args.rtsp_url,
            buffer_size=args.buffer_size,
            codec='h264'
        )
        print(f"✓ Decoder created: {decoder}\n")

        # Start decoding
        print("[3/4] Starting decoder thread...")
        decoder.start()
        print("✓ Decoder thread started\n")

        # Monitor for specified duration
        print(f"[4/4] Monitoring stream for {args.duration} seconds...")
        print("-" * 80)

        start_time = time.time()
        last_frame_count = 0

        while time.time() - start_time < args.duration:
            time.sleep(args.check_interval)

            # Get status
            status = decoder.get_status()
            buffer_size = decoder.get_buffer_size()
            frame_count = decoder.frame_count
            fps = (frame_count - last_frame_count) / args.check_interval
            last_frame_count = frame_count

            # Print status
            elapsed = time.time() - start_time
            print(f"[{elapsed:6.1f}s] Status: {status.value:12s} | "
                  f"Buffer: {buffer_size:2d}/{args.buffer_size:2d} | "
                  f"Frames: {frame_count:5d} | "
                  f"FPS: {fps:5.1f}")

            # Try to get latest frame
            if status == ConnectionStatus.CONNECTED:
                frame = decoder.get_latest_frame()
                if frame is not None:
                    print(f" Frame shape: {frame.shape}, dtype: {frame.dtype}, "
                          f"device: {frame.device}")

            # Check for errors
            if status == ConnectionStatus.ERROR:
                print("\n✗ ERROR: Stream connection failed!")
                break

        print("-" * 80)

        # Final statistics
        print("\n" + "=" * 80)
        print("Test Complete - Final Statistics")
        print("=" * 80)
        print(f"Total Frames Decoded: {decoder.frame_count}")
        print(f"Average FPS: {decoder.frame_count / args.duration:.2f}")
        print(f"Final Status: {decoder.get_status().value}")
        print(f"Buffer Utilization: {decoder.get_buffer_size()}/{args.buffer_size}")

        if decoder.frame_width and decoder.frame_height:
            print(f"Frame Resolution: {decoder.frame_width}x{decoder.frame_height}")

        print("=" * 80)

    except KeyboardInterrupt:
        print("\n\n✗ Interrupted by user")
        sys.exit(1)

    except Exception as e:
        print(f"\n\n✗ Error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

    finally:
        # Cleanup
        if 'decoder' in locals():
            print("\nCleaning up...")
            decoder.stop()
            print("✓ Decoder stopped")

        print("\n✓ Test completed successfully")
        sys.exit(0)


if __name__ == '__main__':
    main()
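Both CLI tests open-code a connect-and-poll loop around `get_status()`. A reusable variant might look like the sketch below; it only assumes the `ConnectionStatus` values and `get_status()` method already used above, and is not part of the committed scripts:

```python
import time

from services.stream_decoder import ConnectionStatus


def wait_until_connected(decoder, timeout: float = 15.0, poll: float = 0.5) -> bool:
    """Sketch of a reusable version of the polling loops above: block until the
    decoder reports CONNECTED, bail out early on ERROR, give up after timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = decoder.get_status()
        if status == ConnectionStatus.CONNECTED:
            return True
        if status == ConnectionStatus.ERROR:
            return False
        time.sleep(poll)
    return False
```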
143
test_vram_process.py
Normal file
@ -0,0 +1,143 @@
#!/usr/bin/env python3
"""
VRAM scaling test - measures Python process memory usage for 1, 2, 3, and 4 streams.
"""

import os
import time
import subprocess
from dotenv import load_dotenv
from services import StreamDecoderFactory

# Load environment variables from .env file
load_dotenv()

# Load camera URLs from environment
camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break

if not camera_urls:
    print("Error: No camera URLs found in .env file")
    print("Please add CAMERA_URL_1, CAMERA_URL_2, etc. to your .env file")
    exit(1)


def get_python_gpu_memory():
    """Get Python process GPU memory usage in MB"""
    try:
        pid = os.getpid()
        result = subprocess.run(
            ['nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv,noheader,nounits'],
            capture_output=True, text=True, check=True
        )
        for line in result.stdout.strip().split('\n'):
            if line:
                parts = line.split(',')
                if len(parts) >= 2 and int(parts[0].strip()) == pid:
                    return int(parts[1].strip())
        return 0
    except Exception:
        return 0


def test_n_streams(n, wait_time=15):
    """Test with n streams"""
    print(f"\n{'='*80}")
    print(f"Testing with {n} stream(s)")
    print('='*80)

    mem_before = get_python_gpu_memory()
    print(f"Python process VRAM before: {mem_before} MB")

    # Create factory
    factory = StreamDecoderFactory(gpu_id=0)
    time.sleep(1)
    mem_after_factory = get_python_gpu_memory()
    print(f"After factory: {mem_after_factory} MB (+{mem_after_factory - mem_before} MB)")

    # Create decoders
    decoders = []
    for i in range(n):
        decoder = factory.create_decoder(camera_urls[i], buffer_size=30)
        decoders.append(decoder)

    time.sleep(1)
    mem_after_create = get_python_gpu_memory()
    print(f"After creating {n} decoder(s): {mem_after_create} MB (+{mem_after_create - mem_after_factory} MB)")

    # Start decoders
    for decoder in decoders:
        decoder.start()

    time.sleep(2)
    mem_after_start = get_python_gpu_memory()
    print(f"After starting {n} decoder(s): {mem_after_start} MB (+{mem_after_start - mem_after_create} MB)")

    # Wait for connection
    print(f"Waiting {wait_time}s for streams to connect and stabilize...")
    time.sleep(wait_time)

    # Check connection status
    connected = sum(1 for d in decoders if d.is_connected())
    mem_stable = get_python_gpu_memory()

    print(f"Connected: {connected}/{n} streams")
    print(f"Python process VRAM (stable): {mem_stable} MB")

    # Get frame stats
    for i, decoder in enumerate(decoders):
        print(f" Stream {i+1}: {decoder.get_status().value:10s} "
              f"Buffer: {decoder.get_buffer_size()}/30 "
              f"Frames: {decoder.frame_count}")

    # Cleanup
    for decoder in decoders:
        decoder.stop()

    time.sleep(2)
    mem_after_cleanup = get_python_gpu_memory()
    print(f"After cleanup: {mem_after_cleanup} MB")

    return mem_stable


if __name__ == '__main__':
    print("Python VRAM Scaling Test")
    print(f"PID: {os.getpid()}")

    baseline = get_python_gpu_memory()
    print(f"Baseline Python process VRAM: {baseline} MB\n")

    results = {}
    for n in [1, 2, 3, 4]:
        mem = test_n_streams(n, wait_time=15)
        results[n] = mem
        print(f"\n→ {n} stream(s): {mem} MB (process total)")

        # Give time between tests
        if n < 4:
            print("\nWaiting 5s before next test...")
            time.sleep(5)

    # Summary
    print("\n" + "="*80)
    print("Python Process VRAM Scaling Summary")
    print("="*80)
    print(f"Baseline: {baseline:4d} MB")
    for n in [1, 2, 3, 4]:
        total = results[n]
        overhead = total - baseline
        per_stream = overhead / n if n > 0 else 0
        print(f"{n} stream(s): {total:4d} MB (+{overhead:3d} MB total, {per_stream:5.1f} MB per stream)")

    # Calculate marginal cost
    print("\nMarginal cost per additional stream:")
    for n in [2, 3, 4]:
        marginal = results[n] - results[n-1]
        print(f" Stream {n}: +{marginal} MB")

    print("="*80)
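The summary above reports per-stream and marginal VRAM costs directly from the four measurements. If a single intercept/slope is wanted from the same `results` dict (fixed overhead plus marginal cost per stream), an ordinary least-squares fit is enough. This is a sketch, not part of the test script:

```python
def fit_linear_vram_model(results: dict[int, int]) -> tuple[float, float]:
    """Least-squares fit of total_mb ≈ fixed_mb + per_stream_mb * n over the
    measurements gathered above (results maps stream count -> process VRAM in MB).
    Quantifies the linear-scaling claim from this test; needs >= 2 data points."""
    ns = list(results.keys())
    ys = [results[n] for n in ns]
    mean_n = sum(ns) / len(ns)
    mean_y = sum(ys) / len(ys)
    slope = sum((n - mean_n) * (y - mean_y) for n, y in zip(ns, ys)) / \
            sum((n - mean_n) ** 2 for n in ns)
    intercept = mean_y - slope * mean_n
    return intercept, slope  # (fixed overhead in MB, marginal MB per stream)
```

On a run matching the figures quoted for this pipeline, the returned slope should land near the ~60 MB-per-stream marginal cost.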
85
verify_tensorrt_model.py
Normal file
@ -0,0 +1,85 @@
#!/usr/bin/env python3
"""
Quick verification script for TensorRT model
"""

import torch
from services.model_repository import TensorRTModelRepository


def verify_model():
    print("=" * 80)
    print("TensorRT Model Verification")
    print("=" * 80)

    # Initialize repository
    repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=2)

    # Load the model
    print("\nLoading YOLOv8n TensorRT engine...")
    try:
        metadata = repo.load_model(
            model_id="yolov8n_test",
            file_path="models/yolov8n.trt",
            num_contexts=2
        )
        print("✓ Model loaded successfully!")
    except Exception as e:
        print(f"✗ Failed to load model: {e}")
        return

    # Get model info
    print("\n" + "=" * 80)
    print("Model Information")
    print("=" * 80)
    info = repo.get_model_info("yolov8n_test")
    if info:
        print(f"Model ID: {info['model_id']}")
        print(f"File: {info['file_path']}")
        print(f"File hash: {info['file_hash']}")
        print(f"\nInputs:")
        for name, spec in info['inputs'].items():
            print(f" {name}: {spec['shape']} ({spec['dtype']})")
        print(f"\nOutputs:")
        for name, spec in info['outputs'].items():
            print(f" {name}: {spec['shape']} ({spec['dtype']})")

    # Run test inference
    print("\n" + "=" * 80)
    print("Running Test Inference")
    print("=" * 80)

    try:
        # Create dummy input (simulating a 640x640 image)
        input_tensor = torch.rand(1, 3, 640, 640, dtype=torch.float32, device='cuda:0')
        print(f"Input tensor: {input_tensor.shape} on {input_tensor.device}")

        # Run inference
        outputs = repo.infer(
            model_id="yolov8n_test",
            inputs={"images": input_tensor},
            synchronize=True
        )

        print("\n✓ Inference successful!")
        print("\nOutputs:")
        for name, tensor in outputs.items():
            print(f" {name}: {tensor.shape} on {tensor.device} ({tensor.dtype})")

    except Exception as e:
        print(f"\n✗ Inference failed: {e}")
        import traceback
        traceback.print_exc()

    # Cleanup
    print("\n" + "=" * 80)
    print("Cleanup")
    print("=" * 80)
    repo.unload_model("yolov8n_test")
    print("✓ Model unloaded")

    print("\n" + "=" * 80)
    print("Verification Complete!")
    print("=" * 80)


if __name__ == "__main__":
    verify_model()
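Both inference tests expect a prebuilt engine at `models/yolov8n.trt` and print the `yolo export model=yolov8n.pt format=engine device=0` hint when it is missing. A Python-side export might look like the sketch below; it relies on the Ultralytics `YOLO(...).export()` API (argument names may vary by version), and the rename step is an assumption about where this project expects the file:

```python
# Minimal export sketch mirroring the CLI hint printed by the tests.
# The exported file is typically written next to the .pt as yolov8n.engine,
# so it is moved/renamed to the path the repository loads from.
from pathlib import Path
from shutil import move

from ultralytics import YOLO

engine_path = YOLO("yolov8n.pt").export(format="engine", device=0)  # build TensorRT engine
Path("models").mkdir(exist_ok=True)
move(str(engine_path), "models/yolov8n.trt")
```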