# TensorRT Model Repository

Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.

## Architecture

### Key Features

- **Model Deduplication by File Hash**
  - Multiple model IDs can point to the same model file
  - Only one engine loaded in VRAM per unique file
  - Example: 100 cameras with the same model = 1 engine (not 100!)
- **Context Pooling for Load Balancing**
  - Each unique engine has N execution contexts (configurable)
  - Contexts borrowed/returned via a mutex-based queue
  - Enables concurrent inference without per-model context overhead
  - Example: 100 cameras efficiently sharing 4 contexts
- **GPU-to-GPU Inference**
  - All inputs/outputs stay in VRAM (zero CPU transfers)
  - Integrates seamlessly with StreamDecoder (frames are already on the GPU)
  - Maximum performance for video inference pipelines
- **Thread-Safe Concurrent Inference**
  - Mutex-based context acquisition (TensorRT best practice)
  - No IExecutionContext shared across threads (safe)
  - Multiple threads can infer concurrently (limited by pool size)

## Design Rationale

### Why Context Pooling?

**Without pooling (naive approach):**

```
100 cameras → 100 model IDs → 100 execution contexts
```

- Problem: Each context consumes VRAM (layers, workspace, etc.)
- Problem: Context creation overhead per camera
- Problem: Doesn't scale to hundreds of cameras

**With pooling (our approach):**

```
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
```

- Solution: Contexts are shared across all cameras using the same model
- Solution: Borrow/return mechanism with a mutex-guarded queue (sketched below)
- Solution: Scales to any number of cameras with a fixed context count
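
The borrow/return mechanism can be pictured as a thin wrapper around a thread-safe queue. The following is a minimal sketch of the idea only; the class and method names are illustrative, not the repository's actual API:

```python
import queue
from contextlib import contextmanager


class ContextPool:
    """Illustrative pool: execution contexts are borrowed/returned via a thread-safe queue."""

    def __init__(self, contexts):
        self._queue = queue.Queue()
        for ctx in contexts:
            self._queue.put(ctx)

    @contextmanager
    def acquire(self, timeout: float = 5.0):
        try:
            # Blocks until a context is free or the timeout expires
            ctx = self._queue.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError("No execution context available within timeout")
        try:
            yield ctx
        finally:
            # Always return the context, even if inference raised
            self._queue.put(ctx)
```

Each inference call borrows one context for the duration of the call, so at most N inferences per engine run at once, no matter how many model IDs share that engine.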

### Memory Savings Example

YOLOv8n model (~6 MB engine file):

| Approach | Model IDs | Engines | Contexts | Approx VRAM |
|---|---|---|---|---|
| Naive | 100 | 100 | 100 | ~1.5 GB |
| Ours (pooled) | 100 | 1 | 4 | ~30 MB |

**50x memory savings!**

## Usage

### Basic Usage

```python
from services.model_repository import TensorRTModelRepository
import torch

# Initialize repository
repo = TensorRTModelRepository(
    gpu_id=0,
    default_num_contexts=4  # 4 contexts per unique engine
)

# Load model for camera 1
repo.load_model(
    model_id="camera_1",
    file_path="models/yolov8n.trt"
)

# Load same model for camera 2 (deduplication happens automatically)
repo.load_model(
    model_id="camera_2",
    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
)

# Run inference (GPU-to-GPU)
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')

outputs = repo.infer(
    model_id="camera_1",
    inputs={"images": input_tensor},
    synchronize=True,
    timeout=5.0  # Wait up to 5s for an available context
)

# Outputs stay on GPU
for name, tensor in outputs.items():
    print(f"{name}: {tensor.shape} on {tensor.device}")
```

### Multi-Camera Scenario

```python
# Setup multiple cameras
cameras = [f"camera_{i}" for i in range(100)]

# Load the same model for all cameras
for camera_id in cameras:
    repo.load_model(
        model_id=camera_id,
        file_path="models/yolov8n.trt"  # Same file for all
    )

# Check efficiency
stats = repo.get_stats()
print(f"Model IDs: {stats['total_model_ids']}")      # 100
print(f"Unique engines: {stats['unique_engines']}")  # 1
print(f"Total contexts: {stats['total_contexts']}")  # 4
```

### Integration with RTSP Decoder

```python
from services.stream_decoder import StreamDecoderFactory
from services.model_repository import TensorRTModelRepository

# Setup
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0)

# Create decoder for camera
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
decoder.start()

# Load inference model
model_repo.load_model("camera_main", "models/yolov8n.trt")

# Process frames (everything on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA

# Preprocess (stays on GPU)
frame_gpu = frame_gpu.float() / 255.0
frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim

# Inference (GPU-to-GPU, zero copy)
outputs = model_repo.infer(
    model_id="camera_main",
    inputs={"images": frame_gpu}
)

# Post-process outputs (can stay on GPU)
# ... NMS, bounding boxes, etc.
```
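
Post-processing can stay on the GPU as well. Continuing from the snippet above, here is a hedged sketch of confidence filtering that assumes a YOLO-style output laid out as `[batch, num_detections, 4 + num_classes]`; the output name `output0` and the layout are assumptions that depend on how the engine was exported:

```python
preds = outputs["output0"]            # hypothetical output name

boxes = preds[..., :4]                # box coordinates
scores, classes = preds[..., 4:].max(dim=-1)

# Keep detections above a confidence threshold, still on the GPU
keep = scores > 0.25
boxes, scores, classes = boxes[keep], scores[keep], classes[keep]

# NMS can also run on CUDA tensors, e.g. torchvision.ops.nms (expects xyxy boxes)
```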

### Concurrent Inference

```python
import threading

def process_camera(camera_id: str, model_id: str):
    # Get frame from decoder (on GPU)
    frame = decoder.get_latest_frame(rgb=True)

    # Inference automatically borrows/returns a context from the pool
    outputs = repo.infer(
        model_id=model_id,
        inputs={"images": frame},
        timeout=10.0  # Wait for an available context
    )
    # Process outputs...

# Multiple threads can infer concurrently
threads = []
for i in range(10):  # 10 threads
    t = threading.Thread(
        target=process_camera,
        args=(f"camera_{i}", f"camera_{i}")
    )
    threads.append(t)
    t.start()

for t in threads:
    t.join()

# With 4 contexts: up to 4 inferences run in parallel
# Others wait in the queue; contexts are auto-balanced
```
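
For long-running services, a thread pool is usually easier to manage than raw threads. Here is a sketch using `concurrent.futures`, reusing the `process_camera` function above (the worker count is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

camera_ids = [f"camera_{i}" for i in range(10)]

# A few more workers than contexts keeps the context pool saturated
# without creating one thread per camera.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(process_camera, cam, cam) for cam in camera_ids]
    for future in futures:
        future.result()  # Re-raises any exception from the worker
```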

## API Reference

### TensorRTModelRepository

#### `__init__(gpu_id=0, default_num_contexts=4)`

Initialize the repository.

**Args:**
- `gpu_id`: GPU device ID
- `default_num_contexts`: Default context pool size per engine

#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`

Load a TensorRT model.

**Args:**
- `model_id`: Unique identifier (e.g., "camera_1")
- `file_path`: Path to the .trt/.engine file
- `num_contexts`: Context pool size (None = use default)
- `force_reload`: Reload if the model_id already exists

**Returns:** `ModelMetadata`

**Deduplication:** If the file hash matches an existing model, the engine and contexts are reused.

#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`

Run inference.

**Args:**
- `model_id`: Model identifier
- `inputs`: Dict mapping input names to CUDA tensors
- `synchronize`: Wait for completion
- `timeout`: Max wait time for a context (seconds)

**Returns:** Dict mapping output names to CUDA tensors

**Thread-safe:** Borrows a context from the pool and returns it after inference.

#### `unload_model(model_id)`

Unload a model. If this is the last reference to the underlying engine, the engine is fully unloaded from VRAM.
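
For example, with two model IDs sharing one engine (and assuming the stats keys shown in the multi-camera example):

```python
repo.load_model("cam1", "models/yolov8n.trt")
repo.load_model("cam2", "models/yolov8n.trt")   # shares the engine

repo.unload_model("cam1")
print(repo.get_stats()["unique_engines"])       # still 1: cam2 holds a reference

repo.unload_model("cam2")
print(repo.get_stats()["unique_engines"])       # 0: engine freed from VRAM
```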

#### `get_metadata(model_id)`

Get model metadata.

**Returns:** `ModelMetadata` or `None`

#### `get_model_info(model_id)`

Get detailed model information.

**Returns:** Dict with engine references, context pool size, shared model IDs, etc.

#### `get_stats()`

Get repository statistics.

**Returns:** Dict with total model IDs, unique engines, total contexts, and memory efficiency.

## Best Practices

### 1. Set Appropriate Context Pool Size

```python
# For 10 cameras with the same model, 4 contexts is usually enough
repo = TensorRTModelRepository(default_num_contexts=4)

# For high concurrency, increase the pool size
repo = TensorRTModelRepository(default_num_contexts=8)
```

**Rule of thumb:** Start with 4 contexts; increase the pool size if you see timeout errors.

### 2. Always Use GPU Tensors

```python
# ✅ Good: Input on GPU
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(model_id, {"images": input_gpu})

# ❌ Bad: Input on CPU (will cause an error)
input_cpu = torch.rand(1, 3, 640, 640)
outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!
```

### 3. Handle Timeout Gracefully

```python
try:
    outputs = repo.infer(
        model_id="camera_1",
        inputs=inputs,
        timeout=5.0
    )
except RuntimeError as e:
    # All contexts busy: increase the pool size or add backpressure
    print(f"Inference timeout: {e}")
```

### 4. Use Same File for Deduplication

```python
# ✅ Good: Same file path → deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo.trt")  # Shares the engine!

# ❌ Bad: Different paths (even if same content) → no deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine
```

## TensorRT Best Practices Implemented

Based on NVIDIA documentation and recommended practices:

- **Separate IExecutionContext per concurrent stream** ✅
  - Each context has its own CUDA stream
  - Contexts are never shared across threads simultaneously
- **Mutex-based context management** ✅
  - Queue-based borrowing with locks
  - Thread-safe acquire/release pattern
- **GPU memory reuse** ✅
  - Engines shared by file hash
  - Contexts pooled and reused
- **Zero-copy operations** ✅
  - All data stays in VRAM
  - DLPack integration with PyTorch (see the sketch below)
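
On the PyTorch side, that integration relies on `torch.utils.dlpack`, which exposes a CUDA tensor to another framework without copying. A small standalone example of the zero-copy round trip (the TensorRT binding details are omitted):

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

gpu_tensor = torch.rand(1, 3, 640, 640, device="cuda:0")

# Export: hand the underlying CUDA buffer to another framework without a copy
capsule = to_dlpack(gpu_tensor)

# Import: wrap the external buffer as a torch.Tensor, again without a copy
same_memory = from_dlpack(capsule)
assert same_memory.data_ptr() == gpu_tensor.data_ptr()
```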

## Troubleshooting

### "No execution context available within timeout"

**Cause:** All contexts are busy with concurrent inferences.

**Solutions:**

- Increase the context pool size: `repo.load_model(model_id, file_path, num_contexts=8)`
- Increase the timeout: `outputs = repo.infer(model_id, inputs, timeout=30.0)`
- Add backpressure/throttling to limit concurrent requests (see the sketch below)
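
One simple form of backpressure is a semaphore sized to the context pool, so callers queue up before ever hitting the inference timeout. A sketch follows; the limit of 4 matches the default pool size, and `throttled_infer` is an illustrative helper, not part of the API:

```python
import threading

# Allow at most as many in-flight requests as there are contexts in the pool
inference_slots = threading.BoundedSemaphore(4)

def throttled_infer(repo, model_id, inputs):
    with inference_slots:
        return repo.infer(model_id=model_id, inputs=inputs, timeout=5.0)
```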

### Out of Memory (OOM)

**Cause:** Too many unique engines or overly large context pools.

**Solutions:**

- Ensure deduplication is working (use identical file paths)
- Reduce context pool sizes
- Use smaller models or quantization (INT8/FP16)

### Import Error: "tensorrt could not be resolved"

**Solution:** Install TensorRT:

```bash
pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version
```

## Performance Tips

- **Batch Processing:** Process multiple frames before synchronizing

  ```python
  outputs = repo.infer(model_id, inputs, synchronize=False)
  # ... more inferences ...
  torch.cuda.synchronize()  # Sync once at the end
  ```

- **Async Inference:** Don't synchronize if results aren't needed immediately

  ```python
  outputs = repo.infer(model_id, inputs, synchronize=False)
  # GPU keeps working while the CPU continues
  # Synchronize later, when you need the results
  ```

- **Monitor Context Utilization:**

  ```python
  stats = repo.get_stats()
  print(f"Contexts: {stats['total_contexts']}")
  # If timeouts occur frequently, increase the pool size
  ```

## License

Part of the python-rtsp-worker project.