# TensorRT Model Repository

Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.

## Architecture

### Key Features

1. **Model Deduplication by File Hash**
   - Multiple model IDs can point to the same model file
   - Only one engine loaded in VRAM per unique file
   - Example: 100 cameras with same model = 1 engine (not 100!)

2. **Context Pooling for Load Balancing**
   - Each unique engine has N execution contexts (configurable)
   - Contexts borrowed/returned via mutex-based queue
   - Enables concurrent inference without context-per-model overhead
   - Example: 100 cameras sharing 4 contexts efficiently

3. **GPU-to-GPU Inference**
   - All inputs/outputs stay in VRAM (zero CPU transfers)
   - Integrates seamlessly with StreamDecoder (frames already on GPU)
   - Maximum performance for video inference pipelines

4. **Thread-Safe Concurrent Inference**
   - Mutex-based context acquisition (TensorRT best practice)
   - No shared IExecutionContext across threads (safe)
   - Multiple threads can infer concurrently (limited by pool size)

## Design Rationale

### Why Context Pooling?

**Without pooling** (naive approach):

```
100 cameras → 100 model IDs → 100 execution contexts
```

- Problem: Each context consumes VRAM (layers, workspace, etc.)
- Problem: Context creation overhead per camera
- Problem: Doesn't scale to hundreds of cameras

**With pooling** (our approach):

```
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
```

- Solution: Contexts shared across all cameras using same model
- Solution: Borrow/return mechanism with mutex queue (sketched below)
- Solution: Scales to any number of cameras with fixed context count

### Memory Savings Example

YOLOv8n model (~6 MB engine file):

| Approach | Model IDs | Engines | Contexts | Approx VRAM |
|----------|-----------|---------|----------|-------------|
| Naive | 100 | 100 | 100 | ~1.5 GB |
| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |

**50x memory savings!**
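Conceptually, each unique engine's pool is a thread-safe queue of pre-created execution contexts that callers borrow and return. The sketch below illustrates that borrow/return pattern with a plain `queue.Queue`; the `ContextPool` class and its names are illustrative only, not the repository's actual internals.

```python
import queue
from contextlib import contextmanager

class ContextPool:
    """Illustrative borrow/return pool for TensorRT execution contexts."""

    def __init__(self, contexts):
        self._pool = queue.Queue()  # thread-safe FIFO of idle contexts
        for ctx in contexts:
            self._pool.put(ctx)

    @contextmanager
    def acquire(self, timeout: float = 5.0):
        try:
            ctx = self._pool.get(timeout=timeout)  # block until one is free
        except queue.Empty:
            raise RuntimeError("No execution context available within timeout")
        try:
            yield ctx  # caller runs inference with this context
        finally:
            self._pool.put(ctx)  # always hand the context back
```

Every model ID that maps to the same engine shares one such pool, which is how 100 cameras end up needing only the 4 contexts shown in the table above.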
## Usage

### Basic Usage

```python
from services.model_repository import TensorRTModelRepository

# Initialize repository
repo = TensorRTModelRepository(
    gpu_id=0,
    default_num_contexts=4  # 4 contexts per unique engine
)

# Load model for camera 1
repo.load_model(
    model_id="camera_1",
    file_path="models/yolov8n.trt"
)

# Load same model for camera 2 (deduplication happens automatically)
repo.load_model(
    model_id="camera_2",
    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
)

# Run inference (GPU-to-GPU)
import torch
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')

outputs = repo.infer(
    model_id="camera_1",
    inputs={"images": input_tensor},
    synchronize=True,
    timeout=5.0  # Wait up to 5s for available context
)

# Outputs stay on GPU
for name, tensor in outputs.items():
    print(f"{name}: {tensor.shape} on {tensor.device}")
```

### Multi-Camera Scenario

```python
# Setup multiple cameras
cameras = [f"camera_{i}" for i in range(100)]

# Load same model for all cameras
for camera_id in cameras:
    repo.load_model(
        model_id=camera_id,
        file_path="models/yolov8n.trt"  # Same file for all
    )

# Check efficiency
stats = repo.get_stats()
print(f"Model IDs: {stats['total_model_ids']}")      # 100
print(f"Unique engines: {stats['unique_engines']}")  # 1
print(f"Total contexts: {stats['total_contexts']}")  # 4
```
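The efficiency shown above comes from keying engines on a hash of the engine file, as described under `load_model` in the API reference below. A rough sketch of that bookkeeping follows; `file_hash`, the dictionaries, and `build_engine_and_pool` are hypothetical names used only for illustration.

```python
import hashlib

def file_hash(path: str) -> str:
    """Hash the engine file's bytes so identical files map to one key."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Conceptual registry: one engine (and context pool) per unique hash
engines = {}    # hash -> (engine, context_pool)
model_ids = {}  # model_id -> hash

def load_model_sketch(model_id: str, path: str):
    key = file_hash(path)
    if key not in engines:
        engines[key] = build_engine_and_pool(path)  # hypothetical loader
    model_ids[model_id] = key  # 100 model IDs can share one `engines` entry
```

Because the key is derived from the file rather than the model ID, loading the same file under many IDs only ever builds one engine.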
### Integration with RTSP Decoder

```python
from services.stream_decoder import StreamDecoderFactory
from services.model_repository import TensorRTModelRepository

# Setup
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0)

# Create decoder for camera
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
decoder.start()

# Load inference model
model_repo.load_model("camera_main", "models/yolov8n.trt")

# Process frames (everything on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA

# Preprocess (stays on GPU)
frame_gpu = frame_gpu.float() / 255.0
frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim

# Inference (GPU-to-GPU, zero copy)
outputs = model_repo.infer(
    model_id="camera_main",
    inputs={"images": frame_gpu}
)

# Post-process outputs (can stay on GPU)
# ... NMS, bounding boxes, etc.
```

### Concurrent Inference

```python
import threading

def process_camera(camera_id: str, model_id: str):
    # Get frame from decoder (on GPU)
    frame = decoder.get_latest_frame(rgb=True)

    # Inference automatically borrows/returns context from pool
    outputs = repo.infer(
        model_id=model_id,
        inputs={"images": frame},
        timeout=10.0  # Wait for available context
    )

    # Process outputs...

# Multiple threads can infer concurrently
threads = []
for i in range(10):  # 10 threads
    t = threading.Thread(
        target=process_camera,
        args=(f"camera_{i}", f"camera_{i}")
    )
    threads.append(t)
    t.start()

for t in threads:
    t.join()

# With 4 contexts: up to 4 inferences run in parallel
# Others wait in queue, contexts auto-balanced
```

## API Reference

### TensorRTModelRepository

#### `__init__(gpu_id=0, default_num_contexts=4)`

Initialize the repository.

**Args:**
- `gpu_id`: GPU device ID
- `default_num_contexts`: Default context pool size per engine

#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`

Load a TensorRT model.

**Args:**
- `model_id`: Unique identifier (e.g., "camera_1")
- `file_path`: Path to .trt/.engine file
- `num_contexts`: Context pool size (None = use default)
- `force_reload`: Reload if model_id exists

**Returns:** `ModelMetadata`

**Deduplication:** If file hash matches existing model, reuses engine + contexts.

#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`

Run inference.

**Args:**
- `model_id`: Model identifier
- `inputs`: Dict mapping input names to CUDA tensors
- `synchronize`: Wait for completion
- `timeout`: Max wait time for context (seconds)

**Returns:** Dict mapping output names to CUDA tensors

**Thread-safe:** Borrows context from pool, returns after inference.

#### `unload_model(model_id)`

Unload a model. If this is the last reference to the underlying engine, the engine is fully unloaded from VRAM.

#### `get_metadata(model_id)`

Get model metadata.

**Returns:** `ModelMetadata` or `None`

#### `get_model_info(model_id)`

Get detailed model information.

**Returns:** Dict with engine references, context pool size, shared model IDs, etc.

#### `get_stats()`

Get repository statistics.

**Returns:** Dict with total models, unique engines, contexts, memory efficiency.

## Best Practices

### 1. Set Appropriate Context Pool Size

```python
# For 10 cameras with same model, 4 contexts is usually enough
repo = TensorRTModelRepository(default_num_contexts=4)

# For high concurrency, increase pool size
repo = TensorRTModelRepository(default_num_contexts=8)
```

**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.

### 2. Always Use GPU Tensors

```python
# ✅ Good: Input on GPU
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(model_id, {"images": input_gpu})

# ❌ Bad: Input on CPU (will cause error)
input_cpu = torch.rand(1, 3, 640, 640)
outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!
```

### 3. Handle Timeout Gracefully

```python
try:
    outputs = repo.infer(
        model_id="camera_1",
        inputs=inputs,
        timeout=5.0
    )
except RuntimeError as e:
    # All contexts busy, increase pool size or add backpressure
    print(f"Inference timeout: {e}")
```
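One simple form of backpressure is to cap in-flight requests with a semaphore sized to the context pool, so callers wait in Python instead of hitting the repository's timeout. A minimal sketch, assuming a pool of 4 contexts and the `repo` object from the earlier examples (the helper name is not part of the repository API):

```python
import threading

# Allow at most as many in-flight inferences as there are contexts
inflight = threading.BoundedSemaphore(value=4)

def infer_with_backpressure(model_id, inputs):
    with inflight:  # blocks here instead of timing out inside repo.infer()
        return repo.infer(model_id, inputs, timeout=5.0)
```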
### 4. Use Same File for Deduplication

```python
# ✅ Good: Same file path → deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!

# ❌ Bad: Different paths (even if same content) → no deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine
```

## TensorRT Best Practices Implemented

Based on NVIDIA's TensorRT documentation and recommended deployment patterns:

1. **Separate IExecutionContext per concurrent stream** ✅
   - Each context has its own CUDA stream
   - Contexts never shared across threads simultaneously

2. **Mutex-based context management** ✅
   - Queue-based borrowing with locks
   - Thread-safe acquire/release pattern

3. **GPU memory reuse** ✅
   - Engines shared by file hash
   - Contexts pooled and reused

4. **Zero-copy operations** ✅
   - All data stays in VRAM
   - DLPack integration with PyTorch

## Troubleshooting

### "No execution context available within timeout"

**Cause:** All contexts busy with concurrent inferences.

**Solutions:**
1. Increase context pool size:
   ```python
   repo.load_model(model_id, file_path, num_contexts=8)
   ```
2. Increase timeout:
   ```python
   outputs = repo.infer(model_id, inputs, timeout=30.0)
   ```
3. Add backpressure/throttling to limit concurrent requests

### Out of Memory (OOM)

**Cause:** Too many unique engines or large context pools.

**Solutions:**
1. Ensure deduplication is working (same file paths)
2. Reduce context pool sizes
3. Use smaller models or quantization (INT8/FP16)

### Import Error: "tensorrt could not be resolved"

**Solution:** Install TensorRT:
```bash
pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version
```

## Performance Tips

1. **Batch Processing:** Process multiple frames before synchronizing
   ```python
   outputs = repo.infer(model_id, inputs, synchronize=False)
   # ... more inferences ...
   torch.cuda.synchronize()  # Sync once at end
   ```

2. **Async Inference:** Don't synchronize if results aren't needed immediately
   ```python
   outputs = repo.infer(model_id, inputs, synchronize=False)
   # GPU continues working, CPU continues
   # Synchronize later when you need results
   ```

3. **Monitor Context Utilization:**
   ```python
   stats = repo.get_stats()
   print(f"Contexts: {stats['total_contexts']}")
   # If timeouts occur frequently, increase pool size
   ```

## License

Part of the python-rtsp-worker project.