remove unrelated docs

This commit is contained in:
parent 8e20496fa7
commit 56a65a3377

2 changed files with 0 additions and 648 deletions

@@ -1,268 +0,0 @@
# Performance Optimization Summary

## Investigation: Multi-Camera FPS Drop

### Initial Problem

**Symptom**: Severe FPS degradation in multi-camera mode

- Single camera: 3.01 FPS
- Multi-camera (4 cams): 0.70 FPS per camera
- **76.8% FPS drop per camera**

---

## Root Cause Analysis

### Profiling Results (BEFORE Optimization)

| Component | Time | FPS | Status |
|-----------|------|-----|--------|
| Video Decoding (NVDEC) | 0.24 ms | 4165 FPS | ✓ Fast |
| Preprocessing | 0.14 ms | 7158 FPS | ✓ Fast |
| TensorRT Inference | 1.79 ms | 558 FPS | ✓ Fast |
| **Postprocessing (NMS)** | **404.87 ms** | **2.47 FPS** | ⚠️ **CRITICAL BOTTLENECK** |
| Full Pipeline | 1952 ms | 0.51 FPS | ⚠️ Slow |

**Bottleneck Identified**: Postprocessing was **226x slower than inference!**
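
Per-component numbers like these need CUDA-event timing so that asynchronous kernel time isn't hidden by the host clock. Below is a minimal sketch of such a timer; the helper name and its use are illustrative, not necessarily how `test_profiling.py` is written:

```python
import torch

def time_gpu_stage(fn, *args, warmup=5, iters=50):
    """Average per-call latency of a GPU stage, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)                      # warm up caches / lazy init
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()           # wait for all queued GPU work
    ms = start.elapsed_time(end) / iters
    return ms, 1000.0 / ms             # latency in ms, throughput in FPS

# e.g. ms, fps = time_gpu_stage(model.infer, input_tensor)
```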

### Why Postprocessing Was So Slow

```python
# BEFORE: services/yolo.py (SLOW - 404ms)
for detection in output[0]:  # Python loop over 8400 anchor points
    bbox = detection[:4]
    class_scores = detection[4:]
    max_score, class_id = torch.max(class_scores, 0)

    if max_score > conf_threshold:
        cx, cy, w, h = bbox
        x1 = cx - w / 2  # Individual operations
        # ...
        detections.append([
            x1.item(),  # GPU→CPU sync (very slow!)
            y1.item(),
            # ...
        ])
```

**Problems**:
1. **Python loop** over 8400 anchor points (not vectorized)
2. **`.item()` calls** causing GPU→CPU synchronization stalls
3. **List building** then converting back to a tensor (inefficient)

---

## Solution 1: Vectorized Postprocessing

### Implementation

```python
# AFTER: services/yolo.py (FAST - 7ms)
# Vectorized operations (no Python loops)
output = output.transpose(1, 2).squeeze(0)  # (8400, 84)

# Split bbox and scores (vectorized)
bboxes = output[:, :4]        # (8400, 4)
class_scores = output[:, 4:]  # (8400, 80)

# Get max scores for ALL anchors at once
max_scores, class_ids = torch.max(class_scores, dim=1)

# Filter by confidence (vectorized)
mask = max_scores > conf_threshold
filtered_bboxes = bboxes[mask]
filtered_scores = max_scores[mask]
filtered_class_ids = class_ids[mask]

# Convert bbox format (vectorized)
cx, cy = filtered_bboxes[:, 0], filtered_bboxes[:, 1]
w, h = filtered_bboxes[:, 2], filtered_bboxes[:, 3]
x1 = cx - w / 2  # Operates on the entire tensor
y1 = cy - h / 2
x2 = cx + w / 2
y2 = cy + h / 2

# Stack into detections (pure GPU operations, no .item())
detections_tensor = torch.stack(
    [x1, y1, x2, y2, filtered_scores, filtered_class_ids.float()], dim=1
)
```
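
The stacked detections can then go through NMS without ever leaving VRAM. A minimal sketch using `torchvision.ops.nms` (the actual NMS call in `services/yolo.py` may differ, and the 0.45 IoU threshold here is only illustrative):

```python
import torchvision

# Class-agnostic NMS on the GPU: boxes in (x1, y1, x2, y2), one score per box
keep = torchvision.ops.nms(
    detections_tensor[:, :4],   # boxes
    detections_tensor[:, 4],    # confidence scores
    iou_threshold=0.45,         # illustrative threshold
)
final_detections = detections_tensor[keep]  # still on the GPU, no .item() calls
```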

### Results (AFTER Optimization)

| Component | Time (Before) | Time (After) | Improvement |
|-----------|---------------|--------------|-------------|
| Postprocessing | 404.87 ms | **7.33 ms** | **55x faster** |
| Full Pipeline | 1952 ms | **714 ms** | **2.7x faster** |
| Multi-Camera (4 cams) | 5859 ms | **1228 ms** | **4.8x faster** |

**Key Achievement**: Eliminated 98.2% of postprocessing time!

### FPS Benchmark Comparison

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Single Camera** | 3.01 FPS | **558.03 FPS** | **185x faster** |
| **Multi-Camera (per cam)** | 0.70 FPS | **147.06 FPS** | **210x faster** |
| **Combined Throughput** | 2.79 FPS | **588.22 FPS** | **211x faster** |

---

## Solution 2: Batch Inference (Optional)

### Remaining Issue

Even after vectorization, there's still a **73.6% FPS drop** in multi-camera mode.

**Root Cause**: **Sequential Processing**

```python
# Current approach: process cameras one-by-one
for camera in cameras:
    frame = camera.get_frame()
    result = model.infer(frame)  # Wait for each inference
# Total time = inference_time × num_cameras
```

### Batch Inference Solution

**Concept**: Process all cameras in a single batched inference call

```python
# Collect frames from all cameras
frames = [cam.get_frame() for cam in cameras]

# Stack into batch: (4, 3, 640, 640)
batch_input = preprocess_batch(frames)

# Single inference for ALL cameras
outputs = model.infer(batch_input)  # Process 4 frames together!

# Split results per camera
results = postprocess_batch(outputs)
```

### Requirements

1. **Rebuild model with dynamic batching**:
   ```bash
   ./scripts/build_batch_model.sh
   ```

   This creates `models/yolov8n_batch4.trt` with support for batch sizes 1-4.

2. **Use batch preprocessing/postprocessing** (see the sketch after this list):
   - `preprocess_batch(frames)` - Stack frames into batch
   - `postprocess_batch(outputs)` - Split batched results
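
A rough sketch of what these two helpers can look like. The real signatures in `test_batch_inference.py` may differ, the `postprocess_single` callback is a placeholder, and the resize below skips letterboxing for brevity:

```python
import torch
import torch.nn.functional as F

def preprocess_batch(frames, size=(640, 640)):
    """Stack per-camera GPU frames (3, H, W) into one (N, 3, 640, 640) float batch."""
    batch = []
    for frame in frames:
        x = frame.float() / 255.0                      # normalize on the GPU
        x = F.interpolate(x.unsqueeze(0), size=size,   # resize to the model input
                          mode="bilinear", align_corners=False)
        batch.append(x)
    return torch.cat(batch, dim=0)                     # (N, 3, 640, 640)

def postprocess_batch(outputs, postprocess_single):
    """Split a batched YOLO output (N, 84, 8400) back into per-camera detections."""
    return [postprocess_single(outputs[i:i + 1]) for i in range(outputs.shape[0])]
```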

### Expected Performance

| Approach | Single Cam FPS | Multi-Cam (4) Per-Cam FPS | Efficiency |
|----------|----------------|---------------------------|------------|
| Sequential | 558 FPS | 147 FPS (73.6% drop) | Poor |
| **Batched** | 558 FPS | **300-400+ FPS** (28-46% drop) | **Excellent** |

**Why Batched is Faster**:
- GPU processes 4 frames in parallel (better utilization)
- Single kernel launch instead of 4 separate calls
- Reduced CPU-GPU synchronization overhead
- Better memory bandwidth usage

---

## Summary of Optimizations

### 1. Vectorized Postprocessing ✓ (Completed)
- **Impact**: 185x single-camera speedup, 210x multi-camera speedup
- **Effort**: Low (code refactor only)
- **Status**: ✓ Implemented in `services/yolo.py`

### 2. Batch Inference 🔄 (Optional)
- **Impact**: Additional 2-3x multi-camera speedup
- **Effort**: Medium (requires model rebuild + code changes)
- **Status**: Infrastructure ready, needs model rebuild

### 3. Alternative Optimizations (Not Needed)
- CUDA streams: complex; batch inference is simpler
- Multi-threading: limited gains due to the GIL
- Lower resolution: reduces accuracy

---

## How to Test Batch Inference

### Step 1: Rebuild Model
```bash
./scripts/build_batch_model.sh
```

### Step 2: Run Benchmark
```bash
python test_batch_inference.py
```

This will compare:
- Sequential processing (current method)
- Batched processing (optimized method)
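
Conceptually, the comparison boils down to timing the two approaches against each other. A hypothetical sketch is shown below; the `model_repo`, `frames`, and `"camera_batch"` names are placeholders, not necessarily what the script uses:

```python
import time
import torch

def measure(run, iters=100):
    """Iterations per second for a callable that launches GPU work."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        run()
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - t0)

# Sequential: one infer() call per camera
seq = measure(lambda: [model_repo.infer(f"camera_{i}", {"images": frames[i]})
                       for i in range(4)])

# Batched: a single infer() call on stacked frames (needs the batch-4 engine)
batched = measure(lambda: model_repo.infer("camera_batch",
                                           {"images": torch.cat(frames, dim=0)}))

print(f"sequential: {seq:.1f} it/s, batched: {batched:.1f} it/s")
```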

### Step 3: Integrate into Production
See `test_batch_inference.py` for an example implementation:
- `preprocess_batch()` - Stack frames
- `postprocess_batch()` - Split results
- A single `model_repo.infer()` call for all cameras

---

## Files Modified/Created

### Modified:
- `services/yolo.py` - Vectorized postprocessing (55x faster)

### Created:
- `test_profiling.py` - Component-level profiling
- `test_fps_benchmark.py` - Single vs multi-camera FPS
- `test_batch_inference.py` - Batch inference test
- `scripts/build_batch_model.sh` - Build batch-enabled model
- `OPTIMIZATION_SUMMARY.md` - This document

---

## Performance Timeline

```
Initial State (Before Investigation):
  Single Camera: 3.01 FPS
  Multi-Camera:  0.70 FPS per camera
  ⚠️ CRITICAL PERFORMANCE ISSUE

After Vectorization:
  Single Camera: 558.03 FPS (+185x)
  Multi-Camera:  147.06 FPS (+210x)
  ✓ BOTTLENECK ELIMINATED

After Batch Inference (Projected):
  Single Camera: 558.03 FPS (unchanged)
  Multi-Camera:  300-400 FPS (+2-3x additional)
  ✓ OPTIMAL PERFORMANCE
```

---

## Lessons Learned

1. **Profile First**: The initial assumption was an inference bottleneck, but it turned out to be postprocessing
2. **Python Loops Are Slow**: Vectorize everything when working with tensors
3. **Avoid CPU↔GPU Sync**: `.item()` calls were causing massive stalls
4. **Batch When Possible**: GPU parallelism is much better than sequential processing

---

## Recommendations

### For Current Setup:
- ✓ Use vectorized postprocessing (already implemented)
- ✓ Enjoy the 210x speedup for multi-camera tracking
- ✓ 147 FPS per camera is excellent for most applications

### For Maximum Performance:
- Rebuild the model with batch support
- Implement batch inference (see `test_batch_inference.py`)
- Expected: 300-400 FPS per camera with 4 cameras

### For Production:
- Monitor GPU utilization (should be >80% with batch inference; see the sketch below)
- Choose the batch size based on the number of cameras (4, 8, or 16)
- Use FP16 precision for best performance
- Keep context pool size = batch size for optimal parallelism
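
For the utilization check, one option is NVML via the `nvidia-ml-py` package (`pynvml`); this is not part of the repo and is shown only as a quick sanity check:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0
util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # sampled utilization
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU util: {util.gpu}%  "
      f"VRAM: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()
```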

@@ -1,380 +0,0 @@

# TensorRT Model Repository

Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.

## Architecture

### Key Features

1. **Model Deduplication by File Hash**
   - Multiple model IDs can point to the same model file
   - Only one engine loaded in VRAM per unique file
   - Example: 100 cameras with same model = 1 engine (not 100!)

2. **Context Pooling for Load Balancing**
   - Each unique engine has N execution contexts (configurable)
   - Contexts borrowed/returned via mutex-based queue (see the sketch after this list)
   - Enables concurrent inference without context-per-model overhead
   - Example: 100 cameras sharing 4 contexts efficiently

3. **GPU-to-GPU Inference**
   - All inputs/outputs stay in VRAM (zero CPU transfers)
   - Integrates seamlessly with StreamDecoder (frames already on GPU)
   - Maximum performance for video inference pipelines

4. **Thread-Safe Concurrent Inference**
   - Mutex-based context acquisition (TensorRT best practice)
   - No shared IExecutionContext across threads (safe)
   - Multiple threads can infer concurrently (limited by pool size)
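
To make features 1 and 2 concrete, here is a deliberately simplified sketch of the idea: engines keyed by file hash, contexts handed out through a queue. The real `services/model_repository.py` implementation differs in its details:

```python
import hashlib
import queue

def file_hash(path: str) -> str:
    """Content hash, so identical engine files map to a single in-VRAM engine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

class ContextPool:
    """One shared engine plus a fixed pool of execution contexts."""
    def __init__(self, engine, num_contexts=4):
        self.engine = engine
        self._queue = queue.Queue()
        for _ in range(num_contexts):
            self._queue.put(engine.create_execution_context())

    def acquire(self, timeout=5.0):
        return self._queue.get(timeout=timeout)  # waits up to `timeout` s for a free context

    def release(self, context):
        self._queue.put(context)

# model_id -> hash and hash -> ContextPool: 100 model IDs that share one file
# resolve to a single engine and a single pool of 4 contexts.
```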

## Design Rationale

### Why Context Pooling?

**Without pooling** (naive approach):
```
100 cameras → 100 model IDs → 100 execution contexts
```
- Problem: Each context consumes VRAM (layers, workspace, etc.)
- Problem: Context creation overhead per camera
- Problem: Doesn't scale to hundreds of cameras

**With pooling** (our approach):
```
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
```
- Solution: Contexts shared across all cameras using same model
- Solution: Borrow/return mechanism with mutex queue
- Solution: Scales to any number of cameras with fixed context count

### Memory Savings Example

YOLOv8n model (~6 MB engine file):

| Approach | Model IDs | Engines | Contexts | Approx VRAM |
|----------|-----------|---------|----------|-------------|
| Naive | 100 | 100 | 100 | ~1.5 GB |
| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |

**50x memory savings!**

## Usage

### Basic Usage

```python
from services.model_repository import TensorRTModelRepository

# Initialize repository
repo = TensorRTModelRepository(
    gpu_id=0,
    default_num_contexts=4  # 4 contexts per unique engine
)

# Load model for camera 1
repo.load_model(
    model_id="camera_1",
    file_path="models/yolov8n.trt"
)

# Load same model for camera 2 (deduplication happens automatically)
repo.load_model(
    model_id="camera_2",
    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
)

# Run inference (GPU-to-GPU)
import torch
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')

outputs = repo.infer(
    model_id="camera_1",
    inputs={"images": input_tensor},
    synchronize=True,
    timeout=5.0  # Wait up to 5s for available context
)

# Outputs stay on GPU
for name, tensor in outputs.items():
    print(f"{name}: {tensor.shape} on {tensor.device}")
```

### Multi-Camera Scenario

```python
# Setup multiple cameras
cameras = [f"camera_{i}" for i in range(100)]

# Load same model for all cameras
for camera_id in cameras:
    repo.load_model(
        model_id=camera_id,
        file_path="models/yolov8n.trt"  # Same file for all
    )

# Check efficiency
stats = repo.get_stats()
print(f"Model IDs: {stats['total_model_ids']}")      # 100
print(f"Unique engines: {stats['unique_engines']}")  # 1
print(f"Total contexts: {stats['total_contexts']}")  # 4
```

### Integration with RTSP Decoder

```python
from services.stream_decoder import StreamDecoderFactory
from services.model_repository import TensorRTModelRepository

# Setup
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0)

# Create decoder for camera
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
decoder.start()

# Load inference model
model_repo.load_model("camera_main", "models/yolov8n.trt")

# Process frames (everything on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA

# Preprocess (stays on GPU)
frame_gpu = frame_gpu.float() / 255.0
frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim

# Inference (GPU-to-GPU, zero copy)
outputs = model_repo.infer(
    model_id="camera_main",
    inputs={"images": frame_gpu}
)

# Post-process outputs (can stay on GPU)
# ... NMS, bounding boxes, etc.
```

### Concurrent Inference

```python
import threading

def process_camera(camera_id: str, model_id: str):
    # Get frame from decoder (on GPU)
    frame = decoder.get_latest_frame(rgb=True)

    # Inference automatically borrows/returns context from pool
    outputs = repo.infer(
        model_id=model_id,
        inputs={"images": frame},
        timeout=10.0  # Wait for available context
    )

    # Process outputs...

# Multiple threads can infer concurrently
threads = []
for i in range(10):  # 10 threads
    t = threading.Thread(
        target=process_camera,
        args=(f"camera_{i}", f"camera_{i}")
    )
    threads.append(t)
    t.start()

for t in threads:
    t.join()

# With 4 contexts: up to 4 inferences run in parallel.
# Others wait in the queue; contexts are auto-balanced.
```

## API Reference

### TensorRTModelRepository

#### `__init__(gpu_id=0, default_num_contexts=4)`
Initialize the repository.

**Args:**
- `gpu_id`: GPU device ID
- `default_num_contexts`: Default context pool size per engine

#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
Load a TensorRT model.

**Args:**
- `model_id`: Unique identifier (e.g., "camera_1")
- `file_path`: Path to .trt/.engine file
- `num_contexts`: Context pool size (None = use default)
- `force_reload`: Reload if model_id exists

**Returns:** `ModelMetadata`

**Deduplication:** If the file hash matches an already-loaded model, the engine and contexts are reused.

#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
Run inference.

**Args:**
- `model_id`: Model identifier
- `inputs`: Dict mapping input names to CUDA tensors
- `synchronize`: Wait for completion
- `timeout`: Max wait time for a context (seconds)

**Returns:** Dict mapping output names to CUDA tensors

**Thread-safe:** Borrows a context from the pool and returns it after inference.

#### `unload_model(model_id)`
Unload a model.

If this is the last reference to the engine, the engine is fully unloaded from VRAM.

#### `get_metadata(model_id)`
Get model metadata.

**Returns:** `ModelMetadata` or `None`

#### `get_model_info(model_id)`
Get detailed model information.

**Returns:** Dict with engine references, context pool size, shared model IDs, etc.

#### `get_stats()`
Get repository statistics.

**Returns:** Dict with total models, unique engines, contexts, and memory efficiency.

## Best Practices

### 1. Set Appropriate Context Pool Size

```python
# For 10 cameras with same model, 4 contexts is usually enough
repo = TensorRTModelRepository(default_num_contexts=4)

# For high concurrency, increase pool size
repo = TensorRTModelRepository(default_num_contexts=8)
```

**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.

### 2. Always Use GPU Tensors

```python
# ✅ Good: Input on GPU
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(model_id, {"images": input_gpu})

# ❌ Bad: Input on CPU (will cause error)
input_cpu = torch.rand(1, 3, 640, 640)
outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!
```

### 3. Handle Timeout Gracefully

```python
try:
    outputs = repo.infer(
        model_id="camera_1",
        inputs=inputs,
        timeout=5.0
    )
except RuntimeError as e:
    # All contexts busy, increase pool size or add backpressure
    print(f"Inference timeout: {e}")
```

### 4. Use Same File for Deduplication

```python
# ✅ Good: Same file path → deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!

# ❌ Bad: Different paths (even if same content) → no deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine
```

## TensorRT Best Practices Implemented

Based on NVIDIA documentation and TensorRT best-practice guidance:

1. **Separate IExecutionContext per concurrent stream** ✅
   - Each context has its own CUDA stream
   - Contexts never shared across threads simultaneously

2. **Mutex-based context management** ✅
   - Queue-based borrowing with locks
   - Thread-safe acquire/release pattern

3. **GPU memory reuse** ✅
   - Engines shared by file hash
   - Contexts pooled and reused

4. **Zero-copy operations** ✅
   - All data stays in VRAM
   - DLPack integration with PyTorch

## Troubleshooting

### "No execution context available within timeout"

**Cause:** All contexts are busy with concurrent inferences.

**Solutions:**
1. Increase the context pool size:
   ```python
   repo.load_model(model_id, file_path, num_contexts=8)
   ```
2. Increase the timeout:
   ```python
   outputs = repo.infer(model_id, inputs, timeout=30.0)
   ```
3. Add backpressure/throttling to limit concurrent requests

### Out of Memory (OOM)

**Cause:** Too many unique engines or overly large context pools.

**Solutions:**
1. Ensure deduplication is working (same file paths)
2. Reduce context pool sizes
3. Use smaller models or quantization (INT8/FP16)

### Import Error: "tensorrt could not be resolved"

**Solution:** Install TensorRT:
```bash
pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version
```

## Performance Tips

1. **Batch Processing:** Process multiple frames before synchronizing
   ```python
   outputs = repo.infer(model_id, inputs, synchronize=False)
   # ... more inferences ...
   torch.cuda.synchronize()  # Sync once at the end
   ```

2. **Async Inference:** Don't synchronize if results aren't needed immediately
   ```python
   outputs = repo.infer(model_id, inputs, synchronize=False)
   # GPU keeps working while the CPU continues
   # Synchronize later when you need the results
   ```

3. **Monitor Context Utilization:**
   ```python
   stats = repo.get_stats()
   print(f"Contexts: {stats['total_contexts']}")

   # If timeouts occur frequently, increase the pool size
   ```

## License

Part of the python-rtsp-worker project.