
TensorRT Model Repository

Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.

Architecture

Key Features

  1. Model Deduplication by File Hash

    • Multiple model IDs can point to the same model file
    • Only one engine loaded in VRAM per unique file
    • Example: 100 cameras with same model = 1 engine, not 100 (see the hash sketch after this list)
  2. Context Pooling for Load Balancing

    • Each unique engine has N execution contexts (configurable)
    • Contexts borrowed/returned via mutex-based queue
    • Enables concurrent inference without context-per-model overhead
    • Example: 100 cameras sharing 4 contexts efficiently
  3. GPU-to-GPU Inference

    • All inputs/outputs stay in VRAM (zero CPU transfers)
    • Integrates seamlessly with StreamDecoder (frames already on GPU)
    • Maximum performance for video inference pipelines
  4. Thread-Safe Concurrent Inference

    • Mutex-based context acquisition (TensorRT best practice)
    • No shared IExecutionContext across threads (safe)
    • Multiple threads can infer concurrently (limited by pool size)
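
The deduplication in feature 1 comes down to keying engines by a content hash of the file rather than by its path. A minimal sketch of that idea (illustrative only, not the repository's actual internals; the engines dict and file_sha256 helper are hypothetical):

import hashlib

def file_sha256(path: str) -> str:
    """Hash an engine file so identical files map to the same key."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Two model IDs, one engine entry, because both paths hash to the same key
engines = {}
for model_id, path in [("camera_1", "models/yolov8n.trt"),
                       ("camera_2", "models/yolov8n.trt")]:
    key = file_sha256(path)
    engines.setdefault(key, {"path": path, "model_ids": set()})["model_ids"].add(model_id)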

Design Rationale

Why Context Pooling?

Without pooling (naive approach):

100 cameras → 100 model IDs → 100 execution contexts
  • Problem: Each context consumes VRAM (activation buffers, workspace, etc.)
  • Problem: Context creation overhead per camera
  • Problem: Doesn't scale to hundreds of cameras

With pooling (our approach):

100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
  • Solution: Contexts shared across all cameras using same model
  • Solution: Borrow/return mechanism with mutex queue
  • Solution: Scales to any number of cameras with fixed context count
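
A minimal sketch of the borrow/return mechanism described above, using a blocking queue (illustrative only; ContextPool and its method names are assumptions, not the repository's actual internals):

import queue
from contextlib import contextmanager

class ContextPool:
    """Share a fixed number of execution contexts across many callers."""

    def __init__(self, contexts):
        self._pool = queue.Queue()
        for ctx in contexts:
            self._pool.put(ctx)

    @contextmanager
    def acquire(self, timeout: float = 5.0):
        try:
            # Blocks until a context is free, or raises after `timeout` seconds
            ctx = self._pool.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError("No execution context available within timeout")
        try:
            yield ctx
        finally:
            self._pool.put(ctx)  # always return the context, even on error

# 100 cameras can share: pool = ContextPool([ctx0, ctx1, ctx2, ctx3])
# with pool.acquire(timeout=5.0) as ctx:
#     ...run inference with ctx...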

Memory Savings Example

YOLOv8n model (~6MB engine file):

Approach        Model IDs   Engines   Contexts   Approx. VRAM
Naive           100         100       100        ~1.5 GB
Ours (pooled)   100         1         4          ~30 MB

50x memory savings!

Usage

Basic Usage

from services.model_repository import TensorRTModelRepository

# Initialize repository
repo = TensorRTModelRepository(
    gpu_id=0,
    default_num_contexts=4  # 4 contexts per unique engine
)

# Load model for camera 1
repo.load_model(
    model_id="camera_1",
    file_path="models/yolov8n.trt"
)

# Load same model for camera 2 (deduplication happens automatically)
repo.load_model(
    model_id="camera_2",
    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
)

# Run inference (GPU-to-GPU)
import torch
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')

outputs = repo.infer(
    model_id="camera_1",
    inputs={"images": input_tensor},
    synchronize=True,
    timeout=5.0  # Wait up to 5s for available context
)

# Outputs stay on GPU
for name, tensor in outputs.items():
    print(f"{name}: {tensor.shape} on {tensor.device}")

Multi-Camera Scenario

# Setup multiple cameras
cameras = [f"camera_{i}" for i in range(100)]

# Load same model for all cameras
for camera_id in cameras:
    repo.load_model(
        model_id=camera_id,
        file_path="models/yolov8n.trt"  # Same file for all
    )

# Check efficiency
stats = repo.get_stats()
print(f"Model IDs: {stats['total_model_ids']}")  # 100
print(f"Unique engines: {stats['unique_engines']}")  # 1
print(f"Total contexts: {stats['total_contexts']}")  # 4

Integration with RTSP Decoder

from services.stream_decoder import StreamDecoderFactory
from services.model_repository import TensorRTModelRepository

# Setup
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0)

# Create decoder for camera
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
decoder.start()

# Load inference model
model_repo.load_model("camera_main", "models/yolov8n.trt")

# Process frames (everything on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA

# Preprocess (stays on GPU)
frame_gpu = frame_gpu.float() / 255.0
frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim

# Inference (GPU-to-GPU, zero copy)
outputs = model_repo.infer(
    model_id="camera_main",
    inputs={"images": frame_gpu}
)

# Post-process outputs (can stay on GPU)
# ... NMS, bounding boxes, etc.

Concurrent Inference

import threading

def process_camera(camera_id: str, model_id: str):
    # Get frame from decoder (on GPU)
    frame = decoder.get_latest_frame(rgb=True)

    # Inference automatically borrows/returns context from pool
    outputs = repo.infer(
        model_id=model_id,
        inputs={"images": frame},
        timeout=10.0  # Wait for available context
    )

    # Process outputs...

# Multiple threads can infer concurrently
threads = []
for i in range(10):  # 10 threads
    t = threading.Thread(
        target=process_camera,
        args=(f"camera_{i}", f"camera_{i}")
    )
    threads.append(t)
    t.start()

for t in threads:
    t.join()

# With 4 contexts: up to 4 inferences run in parallel
# Others wait in queue, contexts auto-balanced

API Reference

TensorRTModelRepository

__init__(gpu_id=0, default_num_contexts=4)

Initialize the repository.

Args:

  • gpu_id: GPU device ID
  • default_num_contexts: Default context pool size per engine

load_model(model_id, file_path, num_contexts=None, force_reload=False)

Load a TensorRT model.

Args:

  • model_id: Unique identifier (e.g., "camera_1")
  • file_path: Path to .trt/.engine file
  • num_contexts: Context pool size (None = use default)
  • force_reload: Reload if model_id exists

Returns: ModelMetadata

Deduplication: If file hash matches existing model, reuses engine + contexts.

infer(model_id, inputs, synchronize=True, timeout=5.0)

Run inference.

Args:

  • model_id: Model identifier
  • inputs: Dict mapping input names to CUDA tensors
  • synchronize: Wait for completion
  • timeout: Max wait time for context (seconds)

Returns: Dict mapping output names to CUDA tensors

Thread-safe: Borrows context from pool, returns after inference.

unload_model(model_id)

Unload a model.

If last reference to engine, fully unloads from VRAM.

get_metadata(model_id)

Get model metadata.

Returns: ModelMetadata or None

get_model_info(model_id)

Get detailed model information.

Returns: Dict with engine references, context pool size, shared model IDs, etc.

get_stats()

Get repository statistics.

Returns: Dict with total models, unique engines, contexts, memory efficiency.
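
A short end-to-end example of the inspection and cleanup calls above (the stats keys shown are the ones used earlier in this README; other fields may exist):

# Inspect a loaded model, then release it
meta = repo.get_metadata("camera_1")
print(meta)  # ModelMetadata, or None if "camera_1" was never loaded

info = repo.get_model_info("camera_1")
print(info)  # engine references, context pool size, shared model IDs, ...

stats = repo.get_stats()
print(stats["total_model_ids"], stats["unique_engines"], stats["total_contexts"])

repo.unload_model("camera_1")  # engine stays loaded while other model IDs still reference it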

Best Practices

1. Set Appropriate Context Pool Size

# For 10 cameras with same model, 4 contexts is usually enough
repo = TensorRTModelRepository(default_num_contexts=4)

# For high concurrency, increase pool size
repo = TensorRTModelRepository(default_num_contexts=8)

Rule of thumb: Start with 4 contexts, increase if you see timeout errors.

2. Always Use GPU Tensors

# ✅ Good: Input on GPU
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(model_id, {"images": input_gpu})

# ❌ Bad: Input on CPU (will cause error)
input_cpu = torch.rand(1, 3, 640, 640)
outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!

3. Handle Timeout Gracefully

try:
    outputs = repo.infer(
        model_id="camera_1",
        inputs=inputs,
        timeout=5.0
    )
except RuntimeError as e:
    # All contexts busy, increase pool size or add backpressure
    print(f"Inference timeout: {e}")

4. Use Same File for Deduplication

# ✅ Good: Same file path → deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!

# ❌ Bad: Different paths (even if same content) → no deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine

TensorRT Best Practices Implemented

Based on NVIDIA's TensorRT documentation and published best-practice guidance:

  1. Separate IExecutionContext per concurrent stream

    • Each context has its own CUDA stream (see the sketch after this list)
    • Contexts never shared across threads simultaneously
  2. Mutex-based context management

    • Queue-based borrowing with locks
    • Thread-safe acquire/release pattern
  3. GPU memory reuse

    • Engines shared by file hash
    • Contexts pooled and reused
  4. Zero-copy operations

    • All data stays in VRAM
    • DLPack integration with PyTorch
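
Item 1 pairs each pooled context with its own CUDA stream. A minimal PyTorch sketch of that pattern (illustrative only; the repository may wire streams to contexts differently):

import torch

# Hypothetical pairing: one dedicated stream per pooled execution context
streams = [torch.cuda.Stream(device=0) for _ in range(4)]

def run_on_stream(stream: torch.cuda.Stream, work):
    """Enqueue GPU work on a dedicated stream, then wait for that stream only."""
    with torch.cuda.stream(stream):
        result = work()   # kernels launched here go to `stream`, not the default stream
    stream.synchronize()  # waits for this stream's work, not the whole device
    return result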

Troubleshooting

"No execution context available within timeout"

Cause: All contexts busy with concurrent inferences.

Solutions:

  1. Increase context pool size:
    repo.load_model(model_id, file_path, num_contexts=8)
    
  2. Increase timeout:
    outputs = repo.infer(model_id, inputs, timeout=30.0)
    
  3. Add backpressure/throttling to limit concurrent requests
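
For solution 3, a simple form of backpressure is a semaphore that caps the number of in-flight inferences (a sketch; the cap of 4 and the helper name are assumptions):

import threading

MAX_IN_FLIGHT = 4  # roughly match the context pool size (assumption)
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def infer_with_backpressure(repo, model_id, inputs, timeout=5.0):
    """Block callers before they hit the context pool, so timeouts become rare."""
    with _slots:
        return repo.infer(model_id=model_id, inputs=inputs, timeout=timeout)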

Out of Memory (OOM)

Cause: Too many unique engines or large context pools.

Solutions:

  1. Ensure deduplication is working (same file paths)
  2. Reduce context pool sizes
  3. Use smaller models or quantization (INT8/FP16)

Import Error: "tensorrt could not be resolved"

Solution: Install TensorRT:

pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version

Performance Tips

  1. Batch Processing: Process multiple frames before synchronizing

    outputs = repo.infer(model_id, inputs, synchronize=False)
    # ... more inferences ...
    torch.cuda.synchronize()  # Sync once at end
    
  2. Async Inference: Don't synchronize if not needed immediately

    outputs = repo.infer(model_id, inputs, synchronize=False)
    # GPU continues working, CPU continues
    # Synchronize later when you need results
    
  3. Monitor Context Utilization:

    stats = repo.get_stats()
    print(f"Contexts: {stats['total_contexts']}")
    
    # If timeouts occur frequently, increase pool size
    

License

Part of python-rtsp-worker project.