
TensorRT Model Repository

Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.

Architecture

Key Features

  1. Model Deduplication by File Hash

    • Multiple model IDs can point to the same model file
    • Only one engine loaded in VRAM per unique file
    • Example: 100 cameras with same model = 1 engine, not 100 (see the hash sketch after this list)
  2. Context Pooling for Load Balancing

    • Each unique engine has N execution contexts (configurable)
    • Contexts borrowed/returned via mutex-based queue
    • Enables concurrent inference without context-per-model overhead
    • Example: 100 cameras sharing 4 contexts efficiently
  3. GPU-to-GPU Inference

    • All inputs/outputs stay in VRAM (zero CPU transfers)
    • Integrates seamlessly with StreamDecoder (frames already on GPU)
    • Maximum performance for video inference pipelines
  4. Thread-Safe Concurrent Inference

    • Mutex-based context acquisition (TensorRT best practice)
    • No shared IExecutionContext across threads (safe)
    • Multiple threads can infer concurrently (limited by pool size)
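
The deduplication in feature 1 comes down to keying engines by a content hash of the file rather than by its path. A minimal sketch of that idea (illustrative only, not the repository's actual internals; the engines dict and file_sha256 helper are hypothetical):

import hashlib

def file_sha256(path: str) -> str:
    """Hash an engine file so identical files map to the same key."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Two model IDs, one engine entry, because both paths hash to the same key
engines = {}
for model_id, path in [("camera_1", "models/yolov8n.trt"),
                       ("camera_2", "models/yolov8n.trt")]:
    key = file_sha256(path)
    engines.setdefault(key, {"path": path, "model_ids": set()})["model_ids"].add(model_id)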

Design Rationale

Why Context Pooling?

Without pooling (naive approach):

100 cameras → 100 model IDs → 100 execution contexts
  • Problem: Each context consumes VRAM (activation buffers, workspace, etc.)
  • Problem: Context creation overhead per camera
  • Problem: Doesn't scale to hundreds of cameras

With pooling (our approach):

100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
  • Solution: Contexts shared across all cameras using same model
  • Solution: Borrow/return mechanism with mutex queue
  • Solution: Scales to any number of cameras with fixed context count
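
A minimal sketch of the borrow/return mechanism described above, using a blocking queue (illustrative only; ContextPool and its method names are assumptions, not the repository's actual internals):

import queue
from contextlib import contextmanager

class ContextPool:
    """Share a fixed number of execution contexts across many callers."""

    def __init__(self, contexts):
        self._pool = queue.Queue()
        for ctx in contexts:
            self._pool.put(ctx)

    @contextmanager
    def acquire(self, timeout: float = 5.0):
        try:
            # Blocks until a context is free, or raises after `timeout` seconds
            ctx = self._pool.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError("No execution context available within timeout")
        try:
            yield ctx
        finally:
            self._pool.put(ctx)  # always return the context, even on error

# 100 cameras can share: pool = ContextPool([ctx0, ctx1, ctx2, ctx3])
# with pool.acquire(timeout=5.0) as ctx:
#     ...run inference with ctx...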

Memory Savings Example

YOLOv8n model (~6MB engine file):

Approach        Model IDs   Engines   Contexts   Approx. VRAM
Naive           100         100       100        ~1.5 GB
Ours (pooled)   100         1         4          ~30 MB

50x memory savings!

Usage

Basic Usage

from services.model_repository import TensorRTModelRepository

# Initialize repository
repo = TensorRTModelRepository(
    gpu_id=0,
    default_num_contexts=4  # 4 contexts per unique engine
)

# Load model for camera 1
repo.load_model(
    model_id="camera_1",
    file_path="models/yolov8n.trt"
)

# Load same model for camera 2 (deduplication happens automatically)
repo.load_model(
    model_id="camera_2",
    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
)

# Run inference (GPU-to-GPU)
import torch
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')

outputs = repo.infer(
    model_id="camera_1",
    inputs={"images": input_tensor},
    synchronize=True,
    timeout=5.0  # Wait up to 5s for available context
)

# Outputs stay on GPU
for name, tensor in outputs.items():
    print(f"{name}: {tensor.shape} on {tensor.device}")

Multi-Camera Scenario

# Setup multiple cameras
cameras = [f"camera_{i}" for i in range(100)]

# Load same model for all cameras
for camera_id in cameras:
    repo.load_model(
        model_id=camera_id,
        file_path="models/yolov8n.trt"  # Same file for all
    )

# Check efficiency
stats = repo.get_stats()
print(f"Model IDs: {stats['total_model_ids']}")  # 100
print(f"Unique engines: {stats['unique_engines']}")  # 1
print(f"Total contexts: {stats['total_contexts']}")  # 4

Integration with RTSP Decoder

from services.stream_decoder import StreamDecoderFactory
from services.model_repository import TensorRTModelRepository

# Setup
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0)

# Create decoder for camera
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
decoder.start()

# Load inference model
model_repo.load_model("camera_main", "models/yolov8n.trt")

# Process frames (everything on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA

# Preprocess (stays on GPU)
frame_gpu = frame_gpu.float() / 255.0
frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim

# Inference (GPU-to-GPU, zero copy)
outputs = model_repo.infer(
    model_id="camera_main",
    inputs={"images": frame_gpu}
)

# Post-process outputs (can stay on GPU)
# ... NMS, bounding boxes, etc.

Concurrent Inference

import threading

def process_camera(camera_id: str, model_id: str):
    # Get frame from decoder (on GPU)
    frame = decoder.get_latest_frame(rgb=True)

    # Inference automatically borrows/returns context from pool
    outputs = repo.infer(
        model_id=model_id,
        inputs={"images": frame},
        timeout=10.0  # Wait for available context
    )

    # Process outputs...

# Multiple threads can infer concurrently
threads = []
for i in range(10):  # 10 threads
    t = threading.Thread(
        target=process_camera,
        args=(f"camera_{i}", f"camera_{i}")
    )
    threads.append(t)
    t.start()

for t in threads:
    t.join()

# With 4 contexts: up to 4 inferences run in parallel
# Others wait in queue, contexts auto-balanced

API Reference

TensorRTModelRepository

__init__(gpu_id=0, default_num_contexts=4)

Initialize the repository.

Args:

  • gpu_id: GPU device ID
  • default_num_contexts: Default context pool size per engine

load_model(model_id, file_path, num_contexts=None, force_reload=False)

Load a TensorRT model.

Args:

  • model_id: Unique identifier (e.g., "camera_1")
  • file_path: Path to .trt/.engine file
  • num_contexts: Context pool size (None = use default)
  • force_reload: Reload if model_id exists

Returns: ModelMetadata

Deduplication: If file hash matches existing model, reuses engine + contexts.

infer(model_id, inputs, synchronize=True, timeout=5.0)

Run inference.

Args:

  • model_id: Model identifier
  • inputs: Dict mapping input names to CUDA tensors
  • synchronize: Wait for completion
  • timeout: Max wait time for context (seconds)

Returns: Dict mapping output names to CUDA tensors

Thread-safe: Borrows context from pool, returns after inference.

unload_model(model_id)

Unload a model.

If last reference to engine, fully unloads from VRAM.

get_metadata(model_id)

Get model metadata.

Returns: ModelMetadata or None

get_model_info(model_id)

Get detailed model information.

Returns: Dict with engine references, context pool size, shared model IDs, etc.

get_stats()

Get repository statistics.

Returns: Dict with total models, unique engines, contexts, memory efficiency.
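
A short end-to-end example of the inspection and cleanup calls above (the stats keys shown are the ones used earlier in this README; other fields may exist):

# Inspect a loaded model, then release it
meta = repo.get_metadata("camera_1")
print(meta)  # ModelMetadata, or None if "camera_1" was never loaded

info = repo.get_model_info("camera_1")
print(info)  # engine references, context pool size, shared model IDs, ...

stats = repo.get_stats()
print(stats["total_model_ids"], stats["unique_engines"], stats["total_contexts"])

repo.unload_model("camera_1")  # engine stays loaded while other model IDs still reference it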

Best Practices

1. Set Appropriate Context Pool Size

# For 10 cameras with same model, 4 contexts is usually enough
repo = TensorRTModelRepository(default_num_contexts=4)

# For high concurrency, increase pool size
repo = TensorRTModelRepository(default_num_contexts=8)

Rule of thumb: Start with 4 contexts, increase if you see timeout errors.

2. Always Use GPU Tensors

# ✅ Good: Input on GPU
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(model_id, {"images": input_gpu})

# ❌ Bad: Input on CPU (will cause error)
input_cpu = torch.rand(1, 3, 640, 640)
outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!

3. Handle Timeout Gracefully

try:
    outputs = repo.infer(
        model_id="camera_1",
        inputs=inputs,
        timeout=5.0
    )
except RuntimeError as e:
    # All contexts busy, increase pool size or add backpressure
    print(f"Inference timeout: {e}")

4. Use Same File for Deduplication

# ✅ Good: Same file path → deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!

# ❌ Bad: Different paths (even if same content) → no deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine

TensorRT Best Practices Implemented

Based on NVIDIA's TensorRT documentation and published best-practice guidance:

  1. Separate IExecutionContext per concurrent stream

    • Each context has its own CUDA stream (see the sketch after this list)
    • Contexts never shared across threads simultaneously
  2. Mutex-based context management

    • Queue-based borrowing with locks
    • Thread-safe acquire/release pattern
  3. GPU memory reuse

    • Engines shared by file hash
    • Contexts pooled and reused
  4. Zero-copy operations

    • All data stays in VRAM
    • DLPack integration with PyTorch
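
Item 1 pairs each pooled context with its own CUDA stream. A minimal PyTorch sketch of that pattern (illustrative only; the repository may wire streams to contexts differently):

import torch

# Hypothetical pairing: one dedicated stream per pooled execution context
streams = [torch.cuda.Stream(device=0) for _ in range(4)]

def run_on_stream(stream: torch.cuda.Stream, work):
    """Enqueue GPU work on a dedicated stream, then wait for that stream only."""
    with torch.cuda.stream(stream):
        result = work()   # kernels launched here go to `stream`, not the default stream
    stream.synchronize()  # waits for this stream's work, not the whole device
    return result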

Troubleshooting

"No execution context available within timeout"

Cause: All contexts busy with concurrent inferences.

Solutions:

  1. Increase context pool size:
    repo.load_model(model_id, file_path, num_contexts=8)
    
  2. Increase timeout:
    outputs = repo.infer(model_id, inputs, timeout=30.0)
    
  3. Add backpressure/throttling to limit concurrent requests
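
For solution 3, a simple form of backpressure is a semaphore that caps the number of in-flight inferences (a sketch; the cap of 4 and the helper name are assumptions):

import threading

MAX_IN_FLIGHT = 4  # roughly match the context pool size (assumption)
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def infer_with_backpressure(repo, model_id, inputs, timeout=5.0):
    """Block callers before they hit the context pool, so timeouts become rare."""
    with _slots:
        return repo.infer(model_id=model_id, inputs=inputs, timeout=timeout)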

Out of Memory (OOM)

Cause: Too many unique engines or large context pools.

Solutions:

  1. Ensure deduplication is working (same file paths)
  2. Reduce context pool sizes
  3. Use smaller models or quantization (INT8/FP16)

Import Error: "tensorrt could not be resolved"

Solution: Install TensorRT:

pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version

Performance Tips

  1. Batch Processing: Process multiple frames before synchronizing

    outputs = repo.infer(model_id, inputs, synchronize=False)
    # ... more inferences ...
    torch.cuda.synchronize()  # Sync once at end
    
  2. Async Inference: Don't synchronize if not needed immediately

    outputs = repo.infer(model_id, inputs, synchronize=False)
    # GPU continues working, CPU continues
    # Synchronize later when you need results
    
  3. Monitor Context Utilization:

    stats = repo.get_stats()
    print(f"Contexts: {stats['total_contexts']}")
    
    # If timeouts occur frequently, increase pool size
    

License

Part of python-rtsp-worker project.