remove unrelated docs

This commit is contained in:
parent 8e20496fa7
commit 56a65a3377

2 changed files with 0 additions and 648 deletions

@@ -1,268 +0,0 @@
# Performance Optimization Summary

## Investigation: Multi-Camera FPS Drop

### Initial Problem

**Symptom**: Severe FPS degradation in multi-camera mode

- Single camera: 3.01 FPS
- Multi-camera (4 cams): 0.70 FPS per camera
- **76.8% FPS drop per camera**

---

## Root Cause Analysis

### Profiling Results (BEFORE Optimization)

| Component | Time | FPS | Status |
|-----------|------|-----|--------|
| Video Decoding (NVDEC) | 0.24 ms | 4165 FPS | ✓ Fast |
| Preprocessing | 0.14 ms | 7158 FPS | ✓ Fast |
| TensorRT Inference | 1.79 ms | 558 FPS | ✓ Fast |
| **Postprocessing (NMS)** | **404.87 ms** | **2.47 FPS** | ⚠️ **CRITICAL BOTTLENECK** |
| Full Pipeline | 1952 ms | 0.51 FPS | ⚠️ Slow |

**Bottleneck Identified**: Postprocessing was **226x slower than inference!**
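
Per-component numbers like these need CUDA-event timing so that asynchronous kernel time isn't hidden by the host clock. Below is a minimal sketch of such a timer; the helper name and its use are illustrative, not necessarily how `test_profiling.py` is written:

```python
import torch

def time_gpu_stage(fn, *args, warmup=5, iters=50):
    """Average per-call latency of a GPU stage, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)                      # warm up caches / lazy init
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()           # wait for all queued GPU work
    ms = start.elapsed_time(end) / iters
    return ms, 1000.0 / ms             # latency in ms, throughput in FPS

# e.g. ms, fps = time_gpu_stage(model.infer, input_tensor)
```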

### Why Postprocessing Was So Slow

```python
# BEFORE: services/yolo.py (SLOW - 404ms)
for detection in output[0]:  # Python loop over 8400 anchor points
    bbox = detection[:4]
    class_scores = detection[4:]
    max_score, class_id = torch.max(class_scores, 0)

    if max_score > conf_threshold:
        cx, cy, w, h = bbox
        x1 = cx - w / 2  # Individual operations
        # ...
        detections.append([
            x1.item(),  # GPU→CPU sync (very slow!)
            y1.item(),
            # ...
        ])
```

**Problems**:
1. **Python loop** over 8400 anchor points (not vectorized)
2. **`.item()` calls** causing GPU→CPU synchronization stalls
3. **List building** then converting back to a tensor (inefficient)

---

## Solution 1: Vectorized Postprocessing

### Implementation

```python
# AFTER: services/yolo.py (FAST - 7ms)
# Vectorized operations (no Python loops)
output = output.transpose(1, 2).squeeze(0)  # (8400, 84)

# Split bbox and scores (vectorized)
bboxes = output[:, :4]        # (8400, 4)
class_scores = output[:, 4:]  # (8400, 80)

# Get max scores for ALL anchors at once
max_scores, class_ids = torch.max(class_scores, dim=1)

# Filter by confidence (vectorized)
mask = max_scores > conf_threshold
filtered_bboxes = bboxes[mask]
filtered_scores = max_scores[mask]
filtered_class_ids = class_ids[mask]

# Convert bbox format (vectorized)
cx, cy = filtered_bboxes[:, 0], filtered_bboxes[:, 1]
w, h = filtered_bboxes[:, 2], filtered_bboxes[:, 3]
x1 = cx - w / 2  # Operates on the entire tensor
y1 = cy - h / 2
x2 = cx + w / 2
y2 = cy + h / 2

# Stack into detections (pure GPU operations, no .item())
detections_tensor = torch.stack(
    [x1, y1, x2, y2, filtered_scores, filtered_class_ids.float()], dim=1
)
```
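
The stacked detections can then go through NMS without ever leaving VRAM. A minimal sketch using `torchvision.ops.nms` (the actual NMS call in `services/yolo.py` may differ, and the 0.45 IoU threshold here is only illustrative):

```python
import torchvision

# Class-agnostic NMS on the GPU: boxes in (x1, y1, x2, y2), one score per box
keep = torchvision.ops.nms(
    detections_tensor[:, :4],   # boxes
    detections_tensor[:, 4],    # confidence scores
    iou_threshold=0.45,         # illustrative threshold
)
final_detections = detections_tensor[keep]  # still on the GPU, no .item() calls
```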

### Results (AFTER Optimization)

| Component | Time (Before) | Time (After) | Improvement |
|-----------|---------------|--------------|-------------|
| Postprocessing | 404.87 ms | **7.33 ms** | **55x faster** |
| Full Pipeline | 1952 ms | **714 ms** | **2.7x faster** |
| Multi-Camera (4 cams) | 5859 ms | **1228 ms** | **4.8x faster** |

**Key Achievement**: Eliminated 98.2% of postprocessing time!

### FPS Benchmark Comparison

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Single Camera** | 3.01 FPS | **558.03 FPS** | **185x faster** |
| **Multi-Camera (per cam)** | 0.70 FPS | **147.06 FPS** | **210x faster** |
| **Combined Throughput** | 2.79 FPS | **588.22 FPS** | **211x faster** |

---

## Solution 2: Batch Inference (Optional)

### Remaining Issue

Even after vectorization, there's still a **73.6% FPS drop** in multi-camera mode.

**Root Cause**: **Sequential Processing**

```python
# Current approach: process cameras one-by-one
for camera in cameras:
    frame = camera.get_frame()
    result = model.infer(frame)  # Wait for each inference
# Total time = inference_time × num_cameras
```

### Batch Inference Solution

**Concept**: Process all cameras in a single batched inference call

```python
# Collect frames from all cameras
frames = [cam.get_frame() for cam in cameras]

# Stack into batch: (4, 3, 640, 640)
batch_input = preprocess_batch(frames)

# Single inference for ALL cameras
outputs = model.infer(batch_input)  # Process 4 frames together!

# Split results per camera
results = postprocess_batch(outputs)
```

### Requirements

1. **Rebuild model with dynamic batching**:
   ```bash
   ./scripts/build_batch_model.sh
   ```

   This creates `models/yolov8n_batch4.trt` with support for batch sizes 1-4.

2. **Use batch preprocessing/postprocessing** (see the sketch after this list):
   - `preprocess_batch(frames)` - Stack frames into batch
   - `postprocess_batch(outputs)` - Split batched results
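
A rough sketch of what these two helpers can look like. The real signatures in `test_batch_inference.py` may differ, the `postprocess_single` callback is a placeholder, and the resize below skips letterboxing for brevity:

```python
import torch
import torch.nn.functional as F

def preprocess_batch(frames, size=(640, 640)):
    """Stack per-camera GPU frames (3, H, W) into one (N, 3, 640, 640) float batch."""
    batch = []
    for frame in frames:
        x = frame.float() / 255.0                      # normalize on the GPU
        x = F.interpolate(x.unsqueeze(0), size=size,   # resize to the model input
                          mode="bilinear", align_corners=False)
        batch.append(x)
    return torch.cat(batch, dim=0)                     # (N, 3, 640, 640)

def postprocess_batch(outputs, postprocess_single):
    """Split a batched YOLO output (N, 84, 8400) back into per-camera detections."""
    return [postprocess_single(outputs[i:i + 1]) for i in range(outputs.shape[0])]
```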

### Expected Performance

| Approach | Single Cam FPS | Multi-Cam (4) Per-Cam FPS | Efficiency |
|----------|----------------|---------------------------|------------|
| Sequential | 558 FPS | 147 FPS (73.6% drop) | Poor |
| **Batched** | 558 FPS | **300-400+ FPS** (28-46% drop) | **Excellent** |

**Why Batched is Faster**:
- GPU processes 4 frames in parallel (better utilization)
- Single kernel launch instead of 4 separate calls
- Reduced CPU-GPU synchronization overhead
- Better memory bandwidth usage

---

## Summary of Optimizations

### 1. Vectorized Postprocessing ✓ (Completed)
- **Impact**: 185x single-camera speedup, 210x multi-camera speedup
- **Effort**: Low (code refactor only)
- **Status**: ✓ Implemented in `services/yolo.py`

### 2. Batch Inference 🔄 (Optional)
- **Impact**: Additional 2-3x multi-camera speedup
- **Effort**: Medium (requires model rebuild + code changes)
- **Status**: Infrastructure ready, needs model rebuild

### 3. Alternative Optimizations (Not Needed)
- CUDA streams: complex; batch inference is simpler
- Multi-threading: limited gains due to the GIL
- Lower resolution: reduces accuracy

---

## How to Test Batch Inference

### Step 1: Rebuild Model
```bash
./scripts/build_batch_model.sh
```

### Step 2: Run Benchmark
```bash
python test_batch_inference.py
```

This will compare:
- Sequential processing (current method)
- Batched processing (optimized method)
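
Conceptually, the comparison boils down to timing the two approaches against each other. A hypothetical sketch is shown below; the `model_repo`, `frames`, and `"camera_batch"` names are placeholders, not necessarily what the script uses:

```python
import time
import torch

def measure(run, iters=100):
    """Iterations per second for a callable that launches GPU work."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        run()
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - t0)

# Sequential: one infer() call per camera
seq = measure(lambda: [model_repo.infer(f"camera_{i}", {"images": frames[i]})
                       for i in range(4)])

# Batched: a single infer() call on stacked frames (needs the batch-4 engine)
batched = measure(lambda: model_repo.infer("camera_batch",
                                           {"images": torch.cat(frames, dim=0)}))

print(f"sequential: {seq:.1f} it/s, batched: {batched:.1f} it/s")
```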

### Step 3: Integrate into Production
See `test_batch_inference.py` for an example implementation:
- `preprocess_batch()` - Stack frames
- `postprocess_batch()` - Split results
- A single `model_repo.infer()` call for all cameras

---

## Files Modified/Created

### Modified:
- `services/yolo.py` - Vectorized postprocessing (55x faster)

### Created:
- `test_profiling.py` - Component-level profiling
- `test_fps_benchmark.py` - Single vs multi-camera FPS
- `test_batch_inference.py` - Batch inference test
- `scripts/build_batch_model.sh` - Build batch-enabled model
- `OPTIMIZATION_SUMMARY.md` - This document

---

## Performance Timeline

```
Initial State (Before Investigation):
  Single Camera: 3.01 FPS
  Multi-Camera:  0.70 FPS per camera
  ⚠️ CRITICAL PERFORMANCE ISSUE

After Vectorization:
  Single Camera: 558.03 FPS (+185x)
  Multi-Camera:  147.06 FPS (+210x)
  ✓ BOTTLENECK ELIMINATED

After Batch Inference (Projected):
  Single Camera: 558.03 FPS (unchanged)
  Multi-Camera:  300-400 FPS (+2-3x additional)
  ✓ OPTIMAL PERFORMANCE
```

---

## Lessons Learned

1. **Profile First**: The initial assumption was an inference bottleneck, but it turned out to be postprocessing
2. **Python Loops Are Slow**: Vectorize everything when working with tensors
3. **Avoid CPU↔GPU Sync**: `.item()` calls were causing massive stalls
4. **Batch When Possible**: GPU parallelism is much better than sequential processing

---

## Recommendations

### For Current Setup:
- ✓ Use vectorized postprocessing (already implemented)
- ✓ Enjoy the 210x speedup for multi-camera tracking
- ✓ 147 FPS per camera is excellent for most applications

### For Maximum Performance:
- Rebuild the model with batch support
- Implement batch inference (see `test_batch_inference.py`)
- Expected: 300-400 FPS per camera with 4 cameras

### For Production:
- Monitor GPU utilization (should be >80% with batch inference; see the sketch below)
- Choose the batch size based on the number of cameras (4, 8, or 16)
- Use FP16 precision for best performance
- Keep context pool size = batch size for optimal parallelism
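
For the utilization check, one option is NVML via the `nvidia-ml-py` package (`pynvml`); this is not part of the repo and is shown only as a quick sanity check:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0
util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # sampled utilization
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU util: {util.gpu}%  "
      f"VRAM: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()
```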

@@ -1,380 +0,0 @@

# TensorRT Model Repository

Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.

## Architecture

### Key Features

1. **Model Deduplication by File Hash**
   - Multiple model IDs can point to the same model file
   - Only one engine loaded in VRAM per unique file
   - Example: 100 cameras with same model = 1 engine (not 100!)

2. **Context Pooling for Load Balancing**
   - Each unique engine has N execution contexts (configurable)
   - Contexts borrowed/returned via mutex-based queue (see the sketch after this list)
   - Enables concurrent inference without context-per-model overhead
   - Example: 100 cameras sharing 4 contexts efficiently

3. **GPU-to-GPU Inference**
   - All inputs/outputs stay in VRAM (zero CPU transfers)
   - Integrates seamlessly with StreamDecoder (frames already on GPU)
   - Maximum performance for video inference pipelines

4. **Thread-Safe Concurrent Inference**
   - Mutex-based context acquisition (TensorRT best practice)
   - No shared IExecutionContext across threads (safe)
   - Multiple threads can infer concurrently (limited by pool size)
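
To make features 1 and 2 concrete, here is a deliberately simplified sketch of the idea: engines keyed by file hash, contexts handed out through a queue. The real `services/model_repository.py` implementation differs in its details:

```python
import hashlib
import queue

def file_hash(path: str) -> str:
    """Content hash, so identical engine files map to a single in-VRAM engine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

class ContextPool:
    """One shared engine plus a fixed pool of execution contexts."""
    def __init__(self, engine, num_contexts=4):
        self.engine = engine
        self._queue = queue.Queue()
        for _ in range(num_contexts):
            self._queue.put(engine.create_execution_context())

    def acquire(self, timeout=5.0):
        return self._queue.get(timeout=timeout)  # waits up to `timeout` s for a free context

    def release(self, context):
        self._queue.put(context)

# model_id -> hash and hash -> ContextPool: 100 model IDs that share one file
# resolve to a single engine and a single pool of 4 contexts.
```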

## Design Rationale

### Why Context Pooling?

**Without pooling** (naive approach):
```
100 cameras → 100 model IDs → 100 execution contexts
```
- Problem: Each context consumes VRAM (layers, workspace, etc.)
- Problem: Context creation overhead per camera
- Problem: Doesn't scale to hundreds of cameras

**With pooling** (our approach):
```
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
```
- Solution: Contexts shared across all cameras using same model
- Solution: Borrow/return mechanism with mutex queue
- Solution: Scales to any number of cameras with fixed context count

### Memory Savings Example

YOLOv8n model (~6 MB engine file):

| Approach | Model IDs | Engines | Contexts | Approx VRAM |
|----------|-----------|---------|----------|-------------|
| Naive | 100 | 100 | 100 | ~1.5 GB |
| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |

**50x memory savings!**

## Usage

### Basic Usage

```python
from services.model_repository import TensorRTModelRepository

# Initialize repository
repo = TensorRTModelRepository(
    gpu_id=0,
    default_num_contexts=4  # 4 contexts per unique engine
)

# Load model for camera 1
repo.load_model(
    model_id="camera_1",
    file_path="models/yolov8n.trt"
)

# Load same model for camera 2 (deduplication happens automatically)
repo.load_model(
    model_id="camera_2",
    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
)

# Run inference (GPU-to-GPU)
import torch
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')

outputs = repo.infer(
    model_id="camera_1",
    inputs={"images": input_tensor},
    synchronize=True,
    timeout=5.0  # Wait up to 5s for available context
)

# Outputs stay on GPU
for name, tensor in outputs.items():
    print(f"{name}: {tensor.shape} on {tensor.device}")
```

### Multi-Camera Scenario

```python
# Setup multiple cameras
cameras = [f"camera_{i}" for i in range(100)]

# Load same model for all cameras
for camera_id in cameras:
    repo.load_model(
        model_id=camera_id,
        file_path="models/yolov8n.trt"  # Same file for all
    )

# Check efficiency
stats = repo.get_stats()
print(f"Model IDs: {stats['total_model_ids']}")      # 100
print(f"Unique engines: {stats['unique_engines']}")  # 1
print(f"Total contexts: {stats['total_contexts']}")  # 4
```

### Integration with RTSP Decoder

```python
from services.stream_decoder import StreamDecoderFactory
from services.model_repository import TensorRTModelRepository

# Setup
decoder_factory = StreamDecoderFactory(gpu_id=0)
model_repo = TensorRTModelRepository(gpu_id=0)

# Create decoder for camera
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
decoder.start()

# Load inference model
model_repo.load_model("camera_main", "models/yolov8n.trt")

# Process frames (everything on GPU)
frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA

# Preprocess (stays on GPU)
frame_gpu = frame_gpu.float() / 255.0
frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim

# Inference (GPU-to-GPU, zero copy)
outputs = model_repo.infer(
    model_id="camera_main",
    inputs={"images": frame_gpu}
)

# Post-process outputs (can stay on GPU)
# ... NMS, bounding boxes, etc.
```

### Concurrent Inference

```python
import threading

def process_camera(camera_id: str, model_id: str):
    # Get frame from decoder (on GPU)
    frame = decoder.get_latest_frame(rgb=True)

    # Inference automatically borrows/returns context from pool
    outputs = repo.infer(
        model_id=model_id,
        inputs={"images": frame},
        timeout=10.0  # Wait for available context
    )

    # Process outputs...

# Multiple threads can infer concurrently
threads = []
for i in range(10):  # 10 threads
    t = threading.Thread(
        target=process_camera,
        args=(f"camera_{i}", f"camera_{i}")
    )
    threads.append(t)
    t.start()

for t in threads:
    t.join()

# With 4 contexts: up to 4 inferences run in parallel.
# Others wait in the queue; contexts are auto-balanced.
```

## API Reference

### TensorRTModelRepository

#### `__init__(gpu_id=0, default_num_contexts=4)`
Initialize the repository.

**Args:**
- `gpu_id`: GPU device ID
- `default_num_contexts`: Default context pool size per engine

#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
Load a TensorRT model.

**Args:**
- `model_id`: Unique identifier (e.g., "camera_1")
- `file_path`: Path to .trt/.engine file
- `num_contexts`: Context pool size (None = use default)
- `force_reload`: Reload if model_id exists

**Returns:** `ModelMetadata`

**Deduplication:** If the file hash matches an already-loaded model, the engine and contexts are reused.

#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
Run inference.

**Args:**
- `model_id`: Model identifier
- `inputs`: Dict mapping input names to CUDA tensors
- `synchronize`: Wait for completion
- `timeout`: Max wait time for a context (seconds)

**Returns:** Dict mapping output names to CUDA tensors

**Thread-safe:** Borrows a context from the pool and returns it after inference.

#### `unload_model(model_id)`
Unload a model.

If this is the last reference to the engine, the engine is fully unloaded from VRAM.

#### `get_metadata(model_id)`
Get model metadata.

**Returns:** `ModelMetadata` or `None`

#### `get_model_info(model_id)`
Get detailed model information.

**Returns:** Dict with engine references, context pool size, shared model IDs, etc.

#### `get_stats()`
Get repository statistics.

**Returns:** Dict with total models, unique engines, contexts, and memory efficiency.

## Best Practices

### 1. Set Appropriate Context Pool Size

```python
# For 10 cameras with same model, 4 contexts is usually enough
repo = TensorRTModelRepository(default_num_contexts=4)

# For high concurrency, increase pool size
repo = TensorRTModelRepository(default_num_contexts=8)
```

**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.

### 2. Always Use GPU Tensors

```python
# ✅ Good: Input on GPU
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(model_id, {"images": input_gpu})

# ❌ Bad: Input on CPU (will cause error)
input_cpu = torch.rand(1, 3, 640, 640)
outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!
```

### 3. Handle Timeout Gracefully

```python
try:
    outputs = repo.infer(
        model_id="camera_1",
        inputs=inputs,
        timeout=5.0
    )
except RuntimeError as e:
    # All contexts busy, increase pool size or add backpressure
    print(f"Inference timeout: {e}")
```

### 4. Use Same File for Deduplication

```python
# ✅ Good: Same file path → deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!

# ❌ Bad: Different paths (even if same content) → no deduplication
repo.load_model("cam1", "/models/yolo.trt")
repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine
```

## TensorRT Best Practices Implemented

Based on NVIDIA documentation and TensorRT best-practice guidance:

1. **Separate IExecutionContext per concurrent stream** ✅
   - Each context has its own CUDA stream
   - Contexts never shared across threads simultaneously

2. **Mutex-based context management** ✅
   - Queue-based borrowing with locks
   - Thread-safe acquire/release pattern

3. **GPU memory reuse** ✅
   - Engines shared by file hash
   - Contexts pooled and reused

4. **Zero-copy operations** ✅
   - All data stays in VRAM
   - DLPack integration with PyTorch

## Troubleshooting

### "No execution context available within timeout"

**Cause:** All contexts are busy with concurrent inferences.

**Solutions:**
1. Increase the context pool size:
   ```python
   repo.load_model(model_id, file_path, num_contexts=8)
   ```
2. Increase the timeout:
   ```python
   outputs = repo.infer(model_id, inputs, timeout=30.0)
   ```
3. Add backpressure/throttling to limit concurrent requests

### Out of Memory (OOM)

**Cause:** Too many unique engines or overly large context pools.

**Solutions:**
1. Ensure deduplication is working (same file paths)
2. Reduce context pool sizes
3. Use smaller models or quantization (INT8/FP16)

### Import Error: "tensorrt could not be resolved"

**Solution:** Install TensorRT:
```bash
pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version
```

## Performance Tips

1. **Batch Processing:** Process multiple frames before synchronizing
   ```python
   outputs = repo.infer(model_id, inputs, synchronize=False)
   # ... more inferences ...
   torch.cuda.synchronize()  # Sync once at the end
   ```

2. **Async Inference:** Don't synchronize if results aren't needed immediately
   ```python
   outputs = repo.infer(model_id, inputs, synchronize=False)
   # GPU keeps working while the CPU continues
   # Synchronize later when you need the results
   ```

3. **Monitor Context Utilization:**
   ```python
   stats = repo.get_stats()
   print(f"Contexts: {stats['total_contexts']}")

   # If timeouts occur frequently, increase the pool size
   ```

## License

Part of the python-rtsp-worker project.