From 56a65a33770a2452c7bf9c17264c618eca6d7b7a Mon Sep 17 00:00:00 2001
From: Siwat Sirichai
Date: Sun, 9 Nov 2025 11:51:21 +0700
Subject: [PATCH] remove unrelated docs

---
 OPTIMIZATION_SUMMARY.md             | 268 --------------------
 services/README_MODEL_REPOSITORY.md | 380 ----------------------------
 2 files changed, 648 deletions(-)
 delete mode 100644 OPTIMIZATION_SUMMARY.md
 delete mode 100644 services/README_MODEL_REPOSITORY.md

diff --git a/OPTIMIZATION_SUMMARY.md b/OPTIMIZATION_SUMMARY.md
deleted file mode 100644
index beb7312..0000000
--- a/OPTIMIZATION_SUMMARY.md
+++ /dev/null
@@ -1,268 +0,0 @@
-# Performance Optimization Summary
-
-## Investigation: Multi-Camera FPS Drop
-
-### Initial Problem
-**Symptom**: Severe FPS degradation in multi-camera mode
-- Single camera: 3.01 FPS
-- Multi-camera (4 cams): 0.70 FPS per camera
-- **76.8% FPS drop per camera**
-
----
-
-## Root Cause Analysis
-
-### Profiling Results (BEFORE Optimization)
-
-| Component | Time | FPS | Status |
-|-----------|------|-----|--------|
-| Video Decoding (NVDEC) | 0.24 ms | 4165 FPS | ✓ Fast |
-| Preprocessing | 0.14 ms | 7158 FPS | ✓ Fast |
-| TensorRT Inference | 1.79 ms | 558 FPS | ✓ Fast |
-| **Postprocessing (NMS)** | **404.87 ms** | **2.47 FPS** | ⚠️ **CRITICAL BOTTLENECK** |
-| Full Pipeline | 1952 ms | 0.51 FPS | ⚠️ Slow |
-
-**Bottleneck Identified**: Postprocessing was **226x slower than inference!**
-
-### Why Postprocessing Was So Slow
-
-```python
-# BEFORE: services/yolo.py (SLOW - 404ms)
-for detection in output[0]:  # Python loop over 8400 anchor points
-    bbox = detection[:4]
-    class_scores = detection[4:]
-    max_score, class_id = torch.max(class_scores, 0)
-
-    if max_score > conf_threshold:
-        cx, cy, w, h = bbox
-        x1 = cx - w / 2  # Individual operations
-        # ...
-        detections.append([
-            x1.item(),  # GPU→CPU sync (very slow!)
-            y1.item(),
-            # ...
-        ])
-```
-
-**Problems**:
-1. **Python loop** over 8400 anchor points (not vectorized)
-2. **`.item()` calls** causing GPU→CPU synchronization stalls
-3. **List building** then converting back to tensor (inefficient)
-
----
-
-## Solution 1: Vectorized Postprocessing
-
-### Implementation
-
-```python
-# AFTER: services/yolo.py (FAST - 7ms)
-# Vectorized operations (no Python loops)
-output = output.transpose(1, 2).squeeze(0)  # (8400, 84)
-
-# Split bbox and scores (vectorized)
-bboxes = output[:, :4]        # (8400, 4)
-class_scores = output[:, 4:]  # (8400, 80)
-
-# Get max scores for ALL anchors at once
-max_scores, class_ids = torch.max(class_scores, dim=1)
-
-# Filter by confidence (vectorized)
-mask = max_scores > conf_threshold
-filtered_bboxes = bboxes[mask]
-filtered_scores = max_scores[mask]
-filtered_class_ids = class_ids[mask]
-
-# Convert bbox format (vectorized)
-cx, cy, w, h = filtered_bboxes[:, 0], filtered_bboxes[:, 1], ...
-x1 = cx - w / 2  # Operates on entire tensor
-x2 = cx + w / 2
-
-# Stack into detections (pure GPU operations, no .item())
-detections_tensor = torch.stack([x1, y1, x2, y2, filtered_scores, ...], dim=1)
-```
-
-### Results (AFTER Optimization)
-
-| Component | Time (Before) | Time (After) | Improvement |
-|-----------|---------------|--------------|-------------|
-| Postprocessing | 404.87 ms | **7.33 ms** | **55x faster** |
-| Full Pipeline | 1952 ms | **714 ms** | **2.7x faster** |
-| Multi-Camera (4 cams) | 5859 ms | **1228 ms** | **4.8x faster** |
-
-**Key Achievement**: Eliminated 98.2% of postprocessing time!
-
-### FPS Benchmark Comparison
-
-| Metric | Before | After | Improvement |
-|--------|--------|-------|-------------|
-| **Single Camera** | 3.01 FPS | **558.03 FPS** | **185x faster** |
-| **Multi-Camera (per cam)** | 0.70 FPS | **147.06 FPS** | **210x faster** |
-| **Combined Throughput** | 2.79 FPS | **588.22 FPS** | **211x faster** |
-
----
-
-## Solution 2: Batch Inference (Optional)
-
-### Remaining Issue
-Even after vectorization, there's still a **73.6% FPS drop** in multi-camera mode.
-
-**Root Cause**: **Sequential Processing**
-```python
-# Current approach: Process cameras one-by-one
-for camera in cameras:
-    frame = camera.get_frame()
-    result = model.infer(frame)  # Wait for each inference
-    # Total time = inference_time × num_cameras
-```
-
-### Batch Inference Solution
-
-**Concept**: Process all cameras in a single batched inference call
-
-```python
-# Collect frames from all cameras
-frames = [cam.get_frame() for cam in cameras]
-
-# Stack into batch: (4, 3, 640, 640)
-batch_input = preprocess_batch(frames)
-
-# Single inference for ALL cameras
-outputs = model.infer(batch_input)  # Process 4 frames together!
-
-# Split results per camera
-results = postprocess_batch(outputs)
-```
-
-### Requirements
-
-1. **Rebuild model with dynamic batching**:
-   ```bash
-   ./scripts/build_batch_model.sh
-   ```
-
-   This creates `models/yolov8n_batch4.trt` with support for batch sizes 1-4.
-
-2. **Use batch preprocessing/postprocessing**:
-   - `preprocess_batch(frames)` - Stack frames into batch
-   - `postprocess_batch(outputs)` - Split batched results
-
-### Expected Performance
-
-| Approach | Single Cam FPS | Multi-Cam (4) Per-Cam FPS | Efficiency |
-|----------|---------------|---------------------------|------------|
-| Sequential | 558 FPS | 147 FPS (73.6% drop) | Poor |
-| **Batched** | 558 FPS | **300-400+ FPS** (40-28% drop) | **Excellent** |
-
-**Why Batched is Faster**:
-- GPU processes 4 frames in parallel (better utilization)
-- Single kernel launch instead of 4 separate calls
-- Reduced CPU-GPU synchronization overhead
-- Better memory bandwidth usage
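-
-A minimal sketch of what `preprocess_batch()` and `postprocess_batch()` might look like; the actual implementations live in `test_batch_inference.py`, and the plain bilinear resize and per-camera output split used here are simplifying assumptions rather than the exact production code:
-
-```python
-import torch
-import torch.nn.functional as F
-
-
-def preprocess_batch(frames, size=(640, 640)):
-    """Stack per-camera GPU frames (C, H, W) into one normalized batch."""
-    batch = []
-    for frame in frames:
-        x = frame.float() / 255.0                     # uint8 -> float32 in [0, 1]
-        x = F.interpolate(x.unsqueeze(0), size=size,  # resize to the model input size
-                          mode="bilinear", align_corners=False)
-        batch.append(x)
-    return torch.cat(batch, dim=0)                    # (N, 3, 640, 640), stays on GPU
-
-
-def postprocess_batch(outputs):
-    """Split a batched output tensor back into one result per camera."""
-    # outputs: (N, 84, 8400) for YOLOv8 -> list of N single-image outputs,
-    # each of which can then go through the vectorized postprocessing above
-    return [out.unsqueeze(0) for out in outputs]
-```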
-
----
-
-## Summary of Optimizations
-
-### 1. Vectorized Postprocessing ✓ (Completed)
-- **Impact**: 185x single-camera speedup, 210x multi-camera speedup
-- **Effort**: Low (code refactor only)
-- **Status**: ✓ Implemented in `services/yolo.py`
-
-### 2. Batch Inference 🔄 (Optional)
-- **Impact**: Additional 2-3x multi-camera speedup
-- **Effort**: Medium (requires model rebuild + code changes)
-- **Status**: Infrastructure ready, needs model rebuild
-
-### 3. Alternative Optimizations (Not Needed)
-- CUDA streams: Complex, batch inference is simpler
-- Multi-threading: Limited gains due to GIL
-- Lower resolution: Reduces accuracy
-
----
-
-## How to Test Batch Inference
-
-### Step 1: Rebuild Model
-```bash
-./scripts/build_batch_model.sh
-```
-
-### Step 2: Run Benchmark
-```bash
-python test_batch_inference.py
-```
-
-This will compare:
-- Sequential processing (current method)
-- Batched processing (optimized method)
-
-### Step 3: Integrate into Production
-See `test_batch_inference.py` for example implementation:
-- `preprocess_batch()` - Stack frames
-- `postprocess_batch()` - Split results
-- Single `model_repo.infer()` call for all cameras
-
----
-
-## Files Modified/Created
-
-### Modified:
-- `services/yolo.py` - Vectorized postprocessing (55x faster)
-
-### Created:
-- `test_profiling.py` - Component-level profiling
-- `test_fps_benchmark.py` - Single vs multi-camera FPS
-- `test_batch_inference.py` - Batch inference test
-- `scripts/build_batch_model.sh` - Build batch-enabled model
-- `OPTIMIZATION_SUMMARY.md` - This document
-
----
-
-## Performance Timeline
-
-```
-Initial State (Before Investigation):
-  Single Camera:  3.01 FPS
-  Multi-Camera:   0.70 FPS per camera
-  ⚠️ CRITICAL PERFORMANCE ISSUE
-
-After Vectorization:
-  Single Camera:  558.03 FPS  (+185x)
-  Multi-Camera:   147.06 FPS  (+210x)
-  ✓ BOTTLENECK ELIMINATED
-
-After Batch Inference (Projected):
-  Single Camera:  558.03 FPS  (unchanged)
-  Multi-Camera:   300-400 FPS (+2-3x additional)
-  ✓ OPTIMAL PERFORMANCE
-```
-
----
-
-## Lessons Learned
-
-1. **Profile First**: Initial assumption was inference bottleneck, but it was postprocessing
-2. **Python Loops Are Slow**: Vectorize everything when working with tensors
-3. **Avoid CPU↔GPU Sync**: `.item()` calls were causing massive stalls
-4. **Batch When Possible**: GPU parallelism is much better than sequential processing
-
----
-
-## Recommendations
-
-### For Current Setup:
-- ✓ Use vectorized postprocessing (already implemented)
-- ✓ Enjoy 210x speedup for multi-camera tracking
-- ✓ 147 FPS per camera is excellent for most applications
-
-### For Maximum Performance:
-- Rebuild model with batch support
-- Implement batch inference (see `test_batch_inference.py`)
-- Expected: 300-400 FPS per camera with 4 cameras
-
-### For Production:
-- Monitor GPU utilization (should be >80% with batch inference)
-- Consider batch size based on # of cameras (4, 8, or 16)
-- Use FP16 precision for best performance
-- Keep context pool size = batch size for optimal parallelism
diff --git a/services/README_MODEL_REPOSITORY.md b/services/README_MODEL_REPOSITORY.md
deleted file mode 100644
index 00e1a34..0000000
--- a/services/README_MODEL_REPOSITORY.md
+++ /dev/null
@@ -1,380 +0,0 @@
-# TensorRT Model Repository
-
-Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.
-
-## Architecture
-
-### Key Features
-
-1. **Model Deduplication by File Hash** (see the hash-lookup sketch after this list)
-   - Multiple model IDs can point to the same model file
-   - Only one engine loaded in VRAM per unique file
-   - Example: 100 cameras with the same model = 1 engine (not 100!)
-
-2. **Context Pooling for Load Balancing**
-   - Each unique engine has N execution contexts (configurable)
-   - Contexts borrowed/returned via mutex-based queue
-   - Enables concurrent inference without context-per-model overhead
-   - Example: 100 cameras sharing 4 contexts efficiently
-
-3. **GPU-to-GPU Inference**
-   - All inputs/outputs stay in VRAM (zero CPU transfers)
-   - Integrates seamlessly with StreamDecoder (frames already on GPU)
-   - Maximum performance for video inference pipelines
-
-4. **Thread-Safe Concurrent Inference**
-   - Mutex-based context acquisition (TensorRT best practice)
-   - No shared IExecutionContext across threads (safe)
-   - Multiple threads can infer concurrently (limited by pool size)
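-
-The hash-based lookup behind feature 1 can be sketched as follows; this is a simplified illustration, and the names (`compute_file_hash`, `deserialize_engine`, the two dicts) are assumptions rather than the exact attributes of `TensorRTModelRepository`:
-
-```python
-import hashlib
-
-
-def compute_file_hash(file_path: str) -> str:
-    """Hash the engine file so identical files map to a single key."""
-    sha256 = hashlib.sha256()
-    with open(file_path, "rb") as f:
-        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
-            sha256.update(chunk)
-    return sha256.hexdigest()
-
-
-# model_id -> file_hash, and file_hash -> loaded engine (plus its context pool)
-model_to_hash: dict[str, str] = {}
-engines_by_hash: dict[str, object] = {}
-
-
-def load_model(model_id: str, file_path: str) -> None:
-    file_hash = compute_file_hash(file_path)
-    if file_hash not in engines_by_hash:
-        engines_by_hash[file_hash] = deserialize_engine(file_path)  # hypothetical loader
-    model_to_hash[model_id] = file_hash  # 100 model IDs can share one engine
-```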
-
-## Design Rationale
-
-### Why Context Pooling?
-
-**Without pooling** (naive approach):
-```
-100 cameras → 100 model IDs → 100 execution contexts
-```
-- Problem: Each context consumes VRAM (layers, workspace, etc.)
-- Problem: Context creation overhead per camera
-- Problem: Doesn't scale to hundreds of cameras
-
-**With pooling** (our approach):
-```
-100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
-```
-- Solution: Contexts shared across all cameras using same model
-- Solution: Borrow/return mechanism with mutex queue
-- Solution: Scales to any number of cameras with fixed context count
-
-### Memory Savings Example
-
-YOLOv8n model (~6MB engine file):
-
-| Approach | Model IDs | Engines | Contexts | Approx VRAM |
-|----------|-----------|---------|----------|-------------|
-| Naive | 100 | 100 | 100 | ~1.5 GB |
-| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |
-
-**50x memory savings!**
-
-## Usage
-
-### Basic Usage
-
-```python
-from services.model_repository import TensorRTModelRepository
-
-# Initialize repository
-repo = TensorRTModelRepository(
-    gpu_id=0,
-    default_num_contexts=4  # 4 contexts per unique engine
-)
-
-# Load model for camera 1
-repo.load_model(
-    model_id="camera_1",
-    file_path="models/yolov8n.trt"
-)
-
-# Load same model for camera 2 (deduplication happens automatically)
-repo.load_model(
-    model_id="camera_2",
-    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
-)
-
-# Run inference (GPU-to-GPU)
-import torch
-input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
-
-outputs = repo.infer(
-    model_id="camera_1",
-    inputs={"images": input_tensor},
-    synchronize=True,
-    timeout=5.0  # Wait up to 5s for available context
-)
-
-# Outputs stay on GPU
-for name, tensor in outputs.items():
-    print(f"{name}: {tensor.shape} on {tensor.device}")
-```
-
-### Multi-Camera Scenario
-
-```python
-# Setup multiple cameras
-cameras = [f"camera_{i}" for i in range(100)]
-
-# Load same model for all cameras
-for camera_id in cameras:
-    repo.load_model(
-        model_id=camera_id,
-        file_path="models/yolov8n.trt"  # Same file for all
-    )
-
-# Check efficiency
-stats = repo.get_stats()
-print(f"Model IDs: {stats['total_model_ids']}")      # 100
-print(f"Unique engines: {stats['unique_engines']}")  # 1
-print(f"Total contexts: {stats['total_contexts']}")  # 4
-```
-
-### Integration with RTSP Decoder
-
-```python
-from services.stream_decoder import StreamDecoderFactory
-from services.model_repository import TensorRTModelRepository
-
-# Setup
-decoder_factory = StreamDecoderFactory(gpu_id=0)
-model_repo = TensorRTModelRepository(gpu_id=0)
-
-# Create decoder for camera
-decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
-decoder.start()
-
-# Load inference model
-model_repo.load_model("camera_main", "models/yolov8n.trt")
-
-# Process frames (everything on GPU)
-frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA
-
-# Preprocess (stays on GPU)
-frame_gpu = frame_gpu.float() / 255.0
-frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim
-
-# Inference (GPU-to-GPU, zero copy)
-outputs = model_repo.infer(
-    model_id="camera_main",
-    inputs={"images": frame_gpu}
-)
-
-# Post-process outputs (can stay on GPU)
-# ... NMS, bounding boxes, etc.
-```
-
-### Concurrent Inference
-
-```python
-import threading
-
-def process_camera(camera_id: str, model_id: str):
-    # Get frame from decoder (on GPU)
-    frame = decoder.get_latest_frame(rgb=True)
-
-    # Inference automatically borrows/returns context from pool
-    outputs = repo.infer(
-        model_id=model_id,
-        inputs={"images": frame},
-        timeout=10.0  # Wait for available context
-    )
-
-    # Process outputs...
-
-# Multiple threads can infer concurrently
-threads = []
-for i in range(10):  # 10 threads
-    t = threading.Thread(
-        target=process_camera,
-        args=(f"camera_{i}", f"camera_{i}")
-    )
-    threads.append(t)
-    t.start()
-
-for t in threads:
-    t.join()
-
-# With 4 contexts: up to 4 inferences run in parallel
-# Others wait in queue, contexts auto-balanced
-```
-
-## API Reference
-
-### TensorRTModelRepository
-
-#### `__init__(gpu_id=0, default_num_contexts=4)`
-Initialize the repository.
-
-**Args:**
-- `gpu_id`: GPU device ID
-- `default_num_contexts`: Default context pool size per engine
-
-#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
-Load a TensorRT model.
-
-**Args:**
-- `model_id`: Unique identifier (e.g., "camera_1")
-- `file_path`: Path to .trt/.engine file
-- `num_contexts`: Context pool size (None = use default)
-- `force_reload`: Reload if model_id exists
-
-**Returns:** `ModelMetadata`
-
-**Deduplication:** If file hash matches existing model, reuses engine + contexts.
-
-#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
-Run inference.
-
-**Args:**
-- `model_id`: Model identifier
-- `inputs`: Dict mapping input names to CUDA tensors
-- `synchronize`: Wait for completion
-- `timeout`: Max wait time for context (seconds)
-
-**Returns:** Dict mapping output names to CUDA tensors
-
-**Thread-safe:** Borrows context from pool, returns after inference.
-
-#### `unload_model(model_id)`
-Unload a model.
-
-If this was the last reference to the underlying engine, the engine is fully unloaded from VRAM.
-
-#### `get_metadata(model_id)`
-Get model metadata.
-
-**Returns:** `ModelMetadata` or `None`
-
-#### `get_model_info(model_id)`
-Get detailed model information.
-
-**Returns:** Dict with engine references, context pool size, shared model IDs, etc.
-
-#### `get_stats()`
-Get repository statistics.
-
-**Returns:** Dict with total models, unique engines, contexts, memory efficiency.
-
-## Best Practices
-
-### 1. Set Appropriate Context Pool Size
-
-```python
-# For 10 cameras with same model, 4 contexts is usually enough
-repo = TensorRTModelRepository(default_num_contexts=4)
-
-# For high concurrency, increase pool size
-repo = TensorRTModelRepository(default_num_contexts=8)
-```
-
-**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.
-
-### 2. Always Use GPU Tensors
-
-```python
-# ✅ Good: Input on GPU
-input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
-outputs = repo.infer(model_id, {"images": input_gpu})
-
-# ❌ Bad: Input on CPU (will cause error)
-input_cpu = torch.rand(1, 3, 640, 640)
-outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!
-```
-
-### 3. Handle Timeout Gracefully
-
-```python
-try:
-    outputs = repo.infer(
-        model_id="camera_1",
-        inputs=inputs,
-        timeout=5.0
-    )
-except RuntimeError as e:
-    # All contexts busy, increase pool size or add backpressure
-    print(f"Inference timeout: {e}")
-```
-
-### 4. Use Same File for Deduplication
-
-```python
-# ✅ Good: Same file path → deduplication
-repo.load_model("cam1", "/models/yolo.trt")
-repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!
-
-# ❌ Bad: Different paths (even if same content) → no deduplication
-repo.load_model("cam1", "/models/yolo.trt")
-repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine
-```
-
-## TensorRT Best Practices Implemented
-
-Based on NVIDIA documentation and web search findings:
-
-1. **Separate IExecutionContext per concurrent stream** ✅
-   - Each context has its own CUDA stream
-   - Contexts never shared across threads simultaneously
-
-2. **Mutex-based context management** ✅
-   - Queue-based borrowing with locks
-   - Thread-safe acquire/release pattern
-
-3. **GPU memory reuse** ✅
-   - Engines shared by file hash
-   - Contexts pooled and reused
-
-4. **Zero-copy operations** ✅
-   - All data stays in VRAM
-   - DLPack integration with PyTorch
-
-## Troubleshooting
-
-### "No execution context available within timeout"
-
-**Cause:** All contexts busy with concurrent inferences.
-
-**Solutions:**
-1. Increase context pool size:
-   ```python
-   repo.load_model(model_id, file_path, num_contexts=8)
-   ```
-2. Increase timeout:
-   ```python
-   outputs = repo.infer(model_id, inputs, timeout=30.0)
-   ```
-3. Add backpressure/throttling to limit concurrent requests
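-
-A minimal sketch of the backpressure idea from solution 3, assuming a small wrapper around `repo.infer()` sized to match the context pool (the semaphore wrapper below is illustrative and not part of the repository API):
-
-```python
-import threading
-
-# Allow at most as many in-flight requests as there are contexts in the pool
-MAX_IN_FLIGHT = 4
-inference_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
-
-
-def infer_with_backpressure(repo, model_id, inputs):
-    """Block callers once all slots are taken instead of letting infer() time out."""
-    with inference_slots:
-        return repo.infer(model_id=model_id, inputs=inputs, timeout=5.0)
-```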
-
-### Out of Memory (OOM)
-
-**Cause:** Too many unique engines or large context pools.
-
-**Solutions:**
-1. Ensure deduplication working (same file paths)
-2. Reduce context pool sizes
-3. Use smaller models or quantization (INT8/FP16)
-
-### Import Error: "tensorrt could not be resolved"
-
-**Solution:** Install TensorRT:
-```bash
-pip install tensorrt
-# Or use NVIDIA's wheel for your CUDA version
-```
-
-## Performance Tips
-
-1. **Batch Processing:** Process multiple frames before synchronizing
-   ```python
-   outputs = repo.infer(model_id, inputs, synchronize=False)
-   # ... more inferences ...
-   torch.cuda.synchronize()  # Sync once at end
-   ```
-
-2. **Async Inference:** Don't synchronize if not needed immediately
-   ```python
-   outputs = repo.infer(model_id, inputs, synchronize=False)
-   # GPU continues working, CPU continues
-   # Synchronize later when you need results
-   ```
-
-3. **Monitor Context Utilization:**
-   ```python
-   stats = repo.get_stats()
-   print(f"Contexts: {stats['total_contexts']}")
-
-   # If timeouts occur frequently, increase pool size
-   ```
-
-## License
-
-Part of python-rtsp-worker project.