remove unrelated docs

parent 8e20496fa7
commit 56a65a3377
2 changed files with 0 additions and 648 deletions

@@ -1,380 +0,0 @@

# TensorRT Model Repository

Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.

## Architecture

### Key Features

1. **Model Deduplication by File Hash**
   - Multiple model IDs can point to the same model file
   - Only one engine loaded in VRAM per unique file
   - Example: 100 cameras with the same model = 1 engine, not 100 (see the sketch after this list)

2. **Context Pooling for Load Balancing**
   - Each unique engine has N execution contexts (configurable)
   - Contexts borrowed/returned via a mutex-based queue
   - Enables concurrent inference without context-per-model overhead
   - Example: 100 cameras sharing 4 contexts efficiently

3. **GPU-to-GPU Inference**
   - All inputs/outputs stay in VRAM (zero CPU transfers)
   - Integrates seamlessly with StreamDecoder (frames already on GPU)
   - Maximum performance for video inference pipelines

4. **Thread-Safe Concurrent Inference**
   - Mutex-based context acquisition (TensorRT best practice)
   - No IExecutionContext is shared across threads simultaneously
   - Multiple threads can infer concurrently (limited by pool size)
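
The deduplication in feature 1 can be pictured as a registry keyed by the engine file's content hash. The sketch below is illustrative only (the class and its internals are hypothetical, not the repository's actual code): two model IDs backed by the same file end up sharing one loaded engine.

```python
import hashlib
from pathlib import Path

class EngineRegistrySketch:
    """Illustrative: one engine per unique file hash, many model IDs per engine."""

    def __init__(self):
        self._engines_by_hash = {}   # file hash -> loaded engine (a placeholder object here)
        self._hash_by_model_id = {}  # model id  -> file hash

    @staticmethod
    def _file_hash(file_path: str) -> str:
        # Hash the engine file's bytes; byte-identical files collapse to one key
        return hashlib.sha256(Path(file_path).read_bytes()).hexdigest()

    def load(self, model_id: str, file_path: str):
        key = self._file_hash(file_path)
        if key not in self._engines_by_hash:
            # Only the first model ID for this file pays the deserialization cost
            self._engines_by_hash[key] = f"engine-{key[:8]}"  # stand-in for a TensorRT engine
        self._hash_by_model_id[model_id] = key
        return self._engines_by_hash[key]

registry = EngineRegistrySketch()
engine_a = registry.load("camera_1", "models/yolov8n.trt")
engine_b = registry.load("camera_2", "models/yolov8n.trt")
assert engine_a is engine_b  # same file -> one shared engine in VRAM
```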

## Design Rationale

### Why Context Pooling?

**Without pooling** (naive approach):

```
100 cameras → 100 model IDs → 100 execution contexts
```

- Problem: Each context consumes VRAM (layers, workspace, etc.)
- Problem: Context creation overhead per camera
- Problem: Doesn't scale to hundreds of cameras

**With pooling** (our approach):

```
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
```

- Solution: Contexts shared across all cameras using the same model
- Solution: Borrow/return mechanism with a mutex-guarded queue (see the sketch below)
- Solution: Scales to any number of cameras with a fixed context count
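
To make the borrow/return mechanism concrete, here is a minimal, self-contained sketch of a context pool. It is illustrative rather than the repository's actual implementation: placeholder contexts are handed out from a thread-safe queue, callers block up to a timeout when all contexts are busy, and every context is always returned.

```python
import queue
from contextlib import contextmanager

class ContextPoolSketch:
    """Illustrative mutex/queue-based context pool (not the real class)."""

    def __init__(self, contexts):
        # queue.Queue is internally locked, which provides the mutex-based
        # borrow/return behavior described above
        self._free = queue.Queue()
        for ctx in contexts:
            self._free.put(ctx)

    @contextmanager
    def acquire(self, timeout: float = 5.0):
        try:
            ctx = self._free.get(timeout=timeout)  # block until a context is free
        except queue.Empty:
            raise RuntimeError("No execution context available within timeout")
        try:
            yield ctx            # caller runs inference with this context
        finally:
            self._free.put(ctx)  # always hand the context back to the pool

# Usage: 4 pooled contexts shared by any number of callers
pool = ContextPoolSketch(contexts=[f"ctx_{i}" for i in range(4)])
with pool.acquire(timeout=5.0) as ctx:
    pass  # run inference with ctx here
```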

### Memory Savings Example

YOLOv8n model (~6 MB engine file):

| Approach | Model IDs | Engines | Contexts | Approx. VRAM |
|----------|-----------|---------|----------|--------------|
| Naive | 100 | 100 | 100 | ~1.5 GB |
| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |

**~50x memory savings** (roughly 1.5 GB vs. 30 MB), because VRAM now scales with the number of pooled contexts rather than the number of cameras.

## Usage

### Basic Usage

```python
import torch

from services.model_repository import TensorRTModelRepository

# Initialize repository
repo = TensorRTModelRepository(
    gpu_id=0,
    default_num_contexts=4  # 4 contexts per unique engine
)

# Load model for camera 1
repo.load_model(
    model_id="camera_1",
    file_path="models/yolov8n.trt"
)

# Load same model for camera 2 (deduplication happens automatically)
repo.load_model(
    model_id="camera_2",
    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
)

# Run inference (GPU-to-GPU)
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')

outputs = repo.infer(
    model_id="camera_1",
    inputs={"images": input_tensor},
    synchronize=True,
    timeout=5.0  # Wait up to 5s for an available context
)

# Outputs stay on GPU
for name, tensor in outputs.items():
    print(f"{name}: {tensor.shape} on {tensor.device}")
```

### Multi-Camera Scenario

```python
# Setup multiple cameras
cameras = [f"camera_{i}" for i in range(100)]

# Load same model for all cameras
for camera_id in cameras:
    repo.load_model(
        model_id=camera_id,
        file_path="models/yolov8n.trt"  # Same file for all
    )

# Check efficiency
stats = repo.get_stats()
print(f"Model IDs: {stats['total_model_ids']}")      # 100
print(f"Unique engines: {stats['unique_engines']}")  # 1
print(f"Total contexts: {stats['total_contexts']}")  # 4
```

### Integration with RTSP Decoder

```python
from services.stream_decoder import StreamDecoderFactory
from services.model_repository import TensorRTModelRepository

# Setup
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0)

# Create decoder for camera
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
decoder.start()

# Load inference model
model_repo.load_model("camera_main", "models/yolov8n.trt")

# Process frames (everything on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA

# Preprocess (stays on GPU; resizing/layout changes may also be needed
# to match the model's expected 640x640 input)
frame_gpu = frame_gpu.float() / 255.0
frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim

# Inference (GPU-to-GPU, zero copy)
outputs = model_repo.infer(
    model_id="camera_main",
    inputs={"images": frame_gpu}
)

# Post-process outputs (can stay on GPU)
# ... NMS, bounding boxes, etc.
```

### Concurrent Inference

```python
import threading

def process_camera(camera_id: str, model_id: str):
    # Get frame from decoder (on GPU)
    frame = decoder.get_latest_frame(rgb=True)

    # Inference automatically borrows/returns context from pool
    outputs = repo.infer(
        model_id=model_id,
        inputs={"images": frame},
        timeout=10.0  # Wait for available context
    )

    # Process outputs...

# Multiple threads can infer concurrently
threads = []
for i in range(10):  # 10 threads
    t = threading.Thread(
        target=process_camera,
        args=(f"camera_{i}", f"camera_{i}")
    )
    threads.append(t)
    t.start()

for t in threads:
    t.join()

# With 4 contexts: up to 4 inferences run in parallel
# Others wait in queue, contexts auto-balanced
```

## API Reference

### TensorRTModelRepository

#### `__init__(gpu_id=0, default_num_contexts=4)`

Initialize the repository.

**Args:**
- `gpu_id`: GPU device ID
- `default_num_contexts`: Default context pool size per engine

#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`

Load a TensorRT model.

**Args:**
- `model_id`: Unique identifier (e.g., "camera_1")
- `file_path`: Path to the .trt/.engine file
- `num_contexts`: Context pool size (None = use the default)
- `force_reload`: Reload if the model_id already exists

**Returns:** `ModelMetadata`

**Deduplication:** If the file hash matches an already-loaded model, the existing engine and contexts are reused.

#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`

Run inference.

**Args:**
- `model_id`: Model identifier
- `inputs`: Dict mapping input names to CUDA tensors
- `synchronize`: Wait for completion
- `timeout`: Max wait time for a context (seconds)

**Returns:** Dict mapping output names to CUDA tensors

**Thread-safe:** Borrows a context from the pool and returns it after inference.

#### `unload_model(model_id)`

Unload a model. If it holds the last reference to its engine, the engine is fully unloaded from VRAM.

#### `get_metadata(model_id)`

Get model metadata.

**Returns:** `ModelMetadata` or `None`

#### `get_model_info(model_id)`

Get detailed model information.

**Returns:** Dict with engine references, context pool size, shared model IDs, etc.

#### `get_stats()`

Get repository statistics.

**Returns:** Dict with total models, unique engines, contexts, and memory efficiency (see the usage sketch below).
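
The exact key names in these dictionaries come from the implementation; the sketch below is a hedged usage example that only assumes the keys already shown in this README (`total_model_ids`, `unique_engines`, `total_contexts`) and the documented behavior of `get_metadata`, `get_model_info`, and `unload_model`.

```python
# Inspect and clean up models (adjust key names to the actual API if they differ)
repo.load_model("camera_1", "models/yolov8n.trt")
repo.load_model("camera_2", "models/yolov8n.trt")  # deduplicated against camera_1

metadata = repo.get_metadata("camera_1")   # ModelMetadata, or None if unknown
info = repo.get_model_info("camera_1")     # engine references, pool size, shared model IDs, ...
stats = repo.get_stats()
print(metadata, info)
print(stats["total_model_ids"], stats["unique_engines"], stats["total_contexts"])

repo.unload_model("camera_2")  # engine stays loaded: camera_1 still references it
repo.unload_model("camera_1")  # last reference released: engine is freed from VRAM
```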

## Best Practices

### 1. Set an Appropriate Context Pool Size

```python
# For 10 cameras with the same model, 4 contexts is usually enough
repo = TensorRTModelRepository(default_num_contexts=4)

# For high concurrency, increase the pool size
repo = TensorRTModelRepository(default_num_contexts=8)
```

**Rule of thumb:** Start with 4 contexts and increase the pool size if you see timeout errors.

### 2. Always Use GPU Tensors

```python
# ✅ Good: Input on GPU
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(model_id, {"images": input_gpu})

# ❌ Bad: Input on CPU (will cause an error)
input_cpu = torch.rand(1, 3, 640, 640)
outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!
```

### 3. Handle Timeout Gracefully

```python
try:
    outputs = repo.infer(
        model_id="camera_1",
        inputs=inputs,
        timeout=5.0
    )
except RuntimeError as e:
    # All contexts busy, increase pool size or add backpressure
    print(f"Inference timeout: {e}")
```

### 4. Use the Same File for Deduplication

```python
# ✅ Good: One canonical file path → deduplication is guaranteed
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!

# ❌ Risky: Separate copies of the engine file
# Deduplication is keyed on the file hash, so copies only share an engine while
# they stay byte-identical; a rebuilt or modified copy silently becomes a second engine
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo_copy.trt")
```

## TensorRT Best Practices Implemented

Based on NVIDIA's TensorRT documentation and related guidance:

1. **Separate IExecutionContext per concurrent stream** ✅
   - Each context has its own CUDA stream
   - Contexts are never shared across threads simultaneously

2. **Mutex-based context management** ✅
   - Queue-based borrowing with locks
   - Thread-safe acquire/release pattern

3. **GPU memory reuse** ✅
   - Engines shared by file hash
   - Contexts pooled and reused

4. **Zero-copy operations** ✅
   - All data stays in VRAM
   - DLPack integration with PyTorch (see the sketch after this list)
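
As a rough illustration of what zero-copy, per-stream execution looks like at the TensorRT level, the sketch below binds PyTorch CUDA tensors to an execution context by device address and launches on a dedicated stream. It assumes TensorRT's tensor-address API (roughly TensorRT 8.6 or newer) and is a simplified stand-in for the repository's internals, not its actual code.

```python
import tensorrt as trt
import torch

def infer_zero_copy(engine: trt.ICudaEngine, context: trt.IExecutionContext,
                    inputs: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Sketch: run one inference with every tensor staying in VRAM."""
    stream = torch.cuda.Stream()  # dedicated stream for this context
    outputs = {}
    with torch.cuda.stream(stream):
        for i in range(engine.num_io_tensors):
            name = engine.get_tensor_name(i)
            if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                tensor = inputs[name].contiguous()  # already on the GPU
            else:
                # Output buffer allocated directly in VRAM
                # (dtype assumed float32 here; real code would map the engine's dtype)
                shape = tuple(context.get_tensor_shape(name))
                tensor = torch.empty(shape, device="cuda")
                outputs[name] = tensor
            # Bind by raw device pointer: no host copies anywhere
            context.set_tensor_address(name, tensor.data_ptr())
        context.execute_async_v3(stream.cuda_stream)
    stream.synchronize()
    return outputs
```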

## Troubleshooting

### "No execution context available within timeout"

**Cause:** All contexts are busy with concurrent inferences.

**Solutions:**

1. Increase the context pool size:
   ```python
   repo.load_model(model_id, file_path, num_contexts=8)
   ```
2. Increase the timeout:
   ```python
   outputs = repo.infer(model_id, inputs, timeout=30.0)
   ```
3. Add backpressure/throttling to limit concurrent requests (see the sketch below)
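
One simple way to add that backpressure, sketched below with only the standard library: cap the number of in-flight `infer()` calls with a semaphore sized to the context pool, so excess requests wait (or fail fast) before they ever compete for a context. The helper name and limit are illustrative.

```python
import threading

MAX_IN_FLIGHT = 4  # match the context pool size
_inference_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def infer_with_backpressure(model_id, inputs, timeout=5.0):
    # Fail fast instead of piling more requests onto a saturated pool
    if not _inference_slots.acquire(timeout=timeout):
        raise RuntimeError(f"Backpressure: {MAX_IN_FLIGHT} inferences already in flight")
    try:
        return repo.infer(model_id=model_id, inputs=inputs, timeout=timeout)
    finally:
        _inference_slots.release()
```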

### Out of Memory (OOM)

**Cause:** Too many unique engines or overly large context pools.

**Solutions:**

1. Ensure deduplication is working (same file paths)
2. Reduce context pool sizes
3. Use smaller models or quantization (INT8/FP16), as in the example below
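
For example, FP16 roughly halves engine and activation memory compared to FP32. One common way to build a reduced-precision engine is NVIDIA's `trtexec` tool; the model paths below are placeholders.

```bash
# Build an FP16 engine from an ONNX model (paths are illustrative)
trtexec --onnx=models/yolov8n.onnx --fp16 --saveEngine=models/yolov8n_fp16.trt

# INT8 additionally requires calibration (e.g., a calibration cache)
trtexec --onnx=models/yolov8n.onnx --int8 --saveEngine=models/yolov8n_int8.trt
```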

### Import Error: "tensorrt could not be resolved"

**Solution:** Install TensorRT:

```bash
pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version
```

## Performance Tips

1. **Batch Processing:** Process multiple frames before synchronizing
   ```python
   outputs = repo.infer(model_id, inputs, synchronize=False)
   # ... more inferences ...
   torch.cuda.synchronize()  # Sync once at the end
   ```

2. **Async Inference:** Don't synchronize if you don't need the results immediately
   ```python
   outputs = repo.infer(model_id, inputs, synchronize=False)
   # The GPU keeps working while the CPU moves on
   # Synchronize later, when you need the results
   ```

3. **Monitor Context Utilization:**
   ```python
   stats = repo.get_stats()
   print(f"Contexts: {stats['total_contexts']}")

   # If timeouts occur frequently, increase the pool size
   ```

## License

Part of the python-rtsp-worker project.