From 56a65a33770a2452c7bf9c17264c618eca6d7b7a Mon Sep 17 00:00:00 2001
From: Siwat Sirichai
Date: Sun, 9 Nov 2025 11:51:21 +0700
Subject: [PATCH] remove unrelated docs

---
 OPTIMIZATION_SUMMARY.md             | 268 --------------------
 services/README_MODEL_REPOSITORY.md | 380 ----------------------------
 2 files changed, 648 deletions(-)
 delete mode 100644 OPTIMIZATION_SUMMARY.md
 delete mode 100644 services/README_MODEL_REPOSITORY.md

diff --git a/OPTIMIZATION_SUMMARY.md b/OPTIMIZATION_SUMMARY.md
deleted file mode 100644
index beb7312..0000000
--- a/OPTIMIZATION_SUMMARY.md
+++ /dev/null
@@ -1,268 +0,0 @@
-# Performance Optimization Summary
-
-## Investigation: Multi-Camera FPS Drop
-
-### Initial Problem
-**Symptom**: Severe FPS degradation in multi-camera mode
-- Single camera: 3.01 FPS
-- Multi-camera (4 cams): 0.70 FPS per camera
-- **76.8% FPS drop per camera**
-
----
-
-## Root Cause Analysis
-
-### Profiling Results (BEFORE Optimization)
-
-| Component | Time | FPS | Status |
-|-----------|------|-----|--------|
-| Video Decoding (NVDEC) | 0.24 ms | 4165 FPS | ✓ Fast |
-| Preprocessing | 0.14 ms | 7158 FPS | ✓ Fast |
-| TensorRT Inference | 1.79 ms | 558 FPS | ✓ Fast |
-| **Postprocessing (NMS)** | **404.87 ms** | **2.47 FPS** | ⚠️ **CRITICAL BOTTLENECK** |
-| Full Pipeline | 1952 ms | 0.51 FPS | ⚠️ Slow |
-
-**Bottleneck Identified**: Postprocessing was **226x slower than inference!**
-
-### Why Postprocessing Was So Slow
-
-```python
-# BEFORE: services/yolo.py (SLOW - 404ms)
-for detection in output[0]:  # Python loop over 8400 anchor points
-    bbox = detection[:4]
-    class_scores = detection[4:]
-    max_score, class_id = torch.max(class_scores, 0)
-
-    if max_score > conf_threshold:
-        cx, cy, w, h = bbox
-        x1 = cx - w / 2  # Individual operations
-        # ...
-        detections.append([
-            x1.item(),  # GPU→CPU sync (very slow!)
-            y1.item(),
-            # ...
-        ])
-```
-
-**Problems**:
-1. **Python loop** over 8400 anchor points (not vectorized)
-2. **`.item()` calls** causing GPU→CPU synchronization stalls
-3. **List building** then converting back to tensor (inefficient)
-
----
-
-## Solution 1: Vectorized Postprocessing
-
-### Implementation
-
-```python
-# AFTER: services/yolo.py (FAST - 7ms)
-# Vectorized operations (no Python loops)
-output = output.transpose(1, 2).squeeze(0)  # (8400, 84)
-
-# Split bbox and scores (vectorized)
-bboxes = output[:, :4]        # (8400, 4)
-class_scores = output[:, 4:]  # (8400, 80)
-
-# Get max scores for ALL anchors at once
-max_scores, class_ids = torch.max(class_scores, dim=1)
-
-# Filter by confidence (vectorized)
-mask = max_scores > conf_threshold
-filtered_bboxes = bboxes[mask]
-filtered_scores = max_scores[mask]
-filtered_class_ids = class_ids[mask]
-
-# Convert bbox format (vectorized)
-cx, cy, w, h = filtered_bboxes[:, 0], filtered_bboxes[:, 1], ...
-x1 = cx - w / 2  # Operates on entire tensor
-x2 = cx + w / 2
-
-# Stack into detections (pure GPU operations, no .item())
-detections_tensor = torch.stack([x1, y1, x2, y2, filtered_scores, ...], dim=1)
-```
-
-### Results (AFTER Optimization)
-
-| Component | Time (Before) | Time (After) | Improvement |
-|-----------|---------------|--------------|-------------|
-| Postprocessing | 404.87 ms | **7.33 ms** | **55x faster** |
-| Full Pipeline | 1952 ms | **714 ms** | **2.7x faster** |
-| Multi-Camera (4 cams) | 5859 ms | **1228 ms** | **4.8x faster** |
-
-**Key Achievement**: Eliminated 98.2% of postprocessing time!
-
-### FPS Benchmark Comparison
-
-| Metric | Before | After | Improvement |
-|--------|--------|-------|-------------|
-| **Single Camera** | 3.01 FPS | **558.03 FPS** | **185x faster** |
-| **Multi-Camera (per cam)** | 0.70 FPS | **147.06 FPS** | **210x faster** |
-| **Combined Throughput** | 2.79 FPS | **588.22 FPS** | **211x faster** |
-
----
-
-## Solution 2: Batch Inference (Optional)
-
-### Remaining Issue
-Even after vectorization, there's still a **73.6% FPS drop** in multi-camera mode.
-
-**Root Cause**: **Sequential Processing**
-```python
-# Current approach: Process cameras one-by-one
-for camera in cameras:
-    frame = camera.get_frame()
-    result = model.infer(frame)  # Wait for each inference
-    # Total time = inference_time × num_cameras
-```
-
-### Batch Inference Solution
-
-**Concept**: Process all cameras in a single batched inference call
-
-```python
-# Collect frames from all cameras
-frames = [cam.get_frame() for cam in cameras]
-
-# Stack into batch: (4, 3, 640, 640)
-batch_input = preprocess_batch(frames)
-
-# Single inference for ALL cameras
-outputs = model.infer(batch_input)  # Process 4 frames together!
-
-# Split results per camera
-results = postprocess_batch(outputs)
-```
-
-### Requirements
-
-1. **Rebuild model with dynamic batching**:
-   ```bash
-   ./scripts/build_batch_model.sh
-   ```
-
-   This creates `models/yolov8n_batch4.trt` with support for batch sizes 1-4.
-
-2. **Use batch preprocessing/postprocessing**:
-   - `preprocess_batch(frames)` - Stack frames into batch
-   - `postprocess_batch(outputs)` - Split batched results
-
-### Expected Performance
-
-| Approach | Single Cam FPS | Multi-Cam (4) Per-Cam FPS | Efficiency |
-|----------|---------------|---------------------------|------------|
-| Sequential | 558 FPS | 147 FPS (73.6% drop) | Poor |
-| **Batched** | 558 FPS | **300-400+ FPS** (40-28% drop) | **Excellent** |
-
-**Why Batched is Faster**:
-- GPU processes 4 frames in parallel (better utilization)
-- Single kernel launch instead of 4 separate calls
-- Reduced CPU-GPU synchronization overhead
-- Better memory bandwidth usage
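-
-A minimal sketch of what `preprocess_batch()` and `postprocess_batch()` might look like; the actual implementations live in `test_batch_inference.py`, and the plain bilinear resize and per-camera output split used here are simplifying assumptions rather than the exact production code:
-
-```python
-import torch
-import torch.nn.functional as F
-
-
-def preprocess_batch(frames, size=(640, 640)):
-    """Stack per-camera GPU frames (C, H, W) into one normalized batch."""
-    batch = []
-    for frame in frames:
-        x = frame.float() / 255.0                     # uint8 -> float32 in [0, 1]
-        x = F.interpolate(x.unsqueeze(0), size=size,  # resize to the model input size
-                          mode="bilinear", align_corners=False)
-        batch.append(x)
-    return torch.cat(batch, dim=0)                    # (N, 3, 640, 640), stays on GPU
-
-
-def postprocess_batch(outputs):
-    """Split a batched output tensor back into one result per camera."""
-    # outputs: (N, 84, 8400) for YOLOv8 -> list of N single-image outputs,
-    # each of which can then go through the vectorized postprocessing above
-    return [out.unsqueeze(0) for out in outputs]
-```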
-
----
-
-## Summary of Optimizations
-
-### 1. Vectorized Postprocessing ✓ (Completed)
-- **Impact**: 185x single-camera speedup, 210x multi-camera speedup
-- **Effort**: Low (code refactor only)
-- **Status**: ✓ Implemented in `services/yolo.py`
-
-### 2. Batch Inference 🔄 (Optional)
-- **Impact**: Additional 2-3x multi-camera speedup
-- **Effort**: Medium (requires model rebuild + code changes)
-- **Status**: Infrastructure ready, needs model rebuild
-
-### 3. Alternative Optimizations (Not Needed)
-- CUDA streams: Complex, batch inference is simpler
-- Multi-threading: Limited gains due to GIL
-- Lower resolution: Reduces accuracy
-
----
-
-## How to Test Batch Inference
-
-### Step 1: Rebuild Model
-```bash
-./scripts/build_batch_model.sh
-```
-
-### Step 2: Run Benchmark
-```bash
-python test_batch_inference.py
-```
-
-This will compare:
-- Sequential processing (current method)
-- Batched processing (optimized method)
-
-### Step 3: Integrate into Production
-See `test_batch_inference.py` for example implementation:
-- `preprocess_batch()` - Stack frames
-- `postprocess_batch()` - Split results
-- Single `model_repo.infer()` call for all cameras
-
----
-
-## Files Modified/Created
-
-### Modified:
-- `services/yolo.py` - Vectorized postprocessing (55x faster)
-
-### Created:
-- `test_profiling.py` - Component-level profiling
-- `test_fps_benchmark.py` - Single vs multi-camera FPS
-- `test_batch_inference.py` - Batch inference test
-- `scripts/build_batch_model.sh` - Build batch-enabled model
-- `OPTIMIZATION_SUMMARY.md` - This document
-
----
-
-## Performance Timeline
-
-```
-Initial State (Before Investigation):
-  Single Camera:  3.01 FPS
-  Multi-Camera:   0.70 FPS per camera
-  ⚠️ CRITICAL PERFORMANCE ISSUE
-
-After Vectorization:
-  Single Camera:  558.03 FPS  (+185x)
-  Multi-Camera:   147.06 FPS  (+210x)
-  ✓ BOTTLENECK ELIMINATED
-
-After Batch Inference (Projected):
-  Single Camera:  558.03 FPS  (unchanged)
-  Multi-Camera:   300-400 FPS (+2-3x additional)
-  ✓ OPTIMAL PERFORMANCE
-```
-
----
-
-## Lessons Learned
-
-1. **Profile First**: Initial assumption was inference bottleneck, but it was postprocessing
-2. **Python Loops Are Slow**: Vectorize everything when working with tensors
-3. **Avoid CPU↔GPU Sync**: `.item()` calls were causing massive stalls
-4. **Batch When Possible**: GPU parallelism is much better than sequential processing
-
----
-
-## Recommendations
-
-### For Current Setup:
-- ✓ Use vectorized postprocessing (already implemented)
-- ✓ Enjoy 210x speedup for multi-camera tracking
-- ✓ 147 FPS per camera is excellent for most applications
-
-### For Maximum Performance:
-- Rebuild model with batch support
-- Implement batch inference (see `test_batch_inference.py`)
-- Expected: 300-400 FPS per camera with 4 cameras
-
-### For Production:
-- Monitor GPU utilization (should be >80% with batch inference)
-- Consider batch size based on # of cameras (4, 8, or 16)
-- Use FP16 precision for best performance
-- Keep context pool size = batch size for optimal parallelism
diff --git a/services/README_MODEL_REPOSITORY.md b/services/README_MODEL_REPOSITORY.md
deleted file mode 100644
index 00e1a34..0000000
--- a/services/README_MODEL_REPOSITORY.md
+++ /dev/null
@@ -1,380 +0,0 @@
-# TensorRT Model Repository
-
-Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.
-
-## Architecture
-
-### Key Features
-
-1. **Model Deduplication by File Hash** (see the hash-lookup sketch after this list)
-   - Multiple model IDs can point to the same model file
-   - Only one engine loaded in VRAM per unique file
-   - Example: 100 cameras with the same model = 1 engine (not 100!)
-
-2. **Context Pooling for Load Balancing**
-   - Each unique engine has N execution contexts (configurable)
-   - Contexts borrowed/returned via mutex-based queue
-   - Enables concurrent inference without context-per-model overhead
-   - Example: 100 cameras sharing 4 contexts efficiently
-
-3. **GPU-to-GPU Inference**
-   - All inputs/outputs stay in VRAM (zero CPU transfers)
-   - Integrates seamlessly with StreamDecoder (frames already on GPU)
-   - Maximum performance for video inference pipelines
-
-4. **Thread-Safe Concurrent Inference**
-   - Mutex-based context acquisition (TensorRT best practice)
-   - No shared IExecutionContext across threads (safe)
-   - Multiple threads can infer concurrently (limited by pool size)
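-
-The hash-based lookup behind feature 1 can be sketched as follows; this is a simplified illustration, and the names (`compute_file_hash`, `deserialize_engine`, the two dicts) are assumptions rather than the exact attributes of `TensorRTModelRepository`:
-
-```python
-import hashlib
-
-
-def compute_file_hash(file_path: str) -> str:
-    """Hash the engine file so identical files map to a single key."""
-    sha256 = hashlib.sha256()
-    with open(file_path, "rb") as f:
-        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
-            sha256.update(chunk)
-    return sha256.hexdigest()
-
-
-# model_id -> file_hash, and file_hash -> loaded engine (plus its context pool)
-model_to_hash: dict[str, str] = {}
-engines_by_hash: dict[str, object] = {}
-
-
-def load_model(model_id: str, file_path: str) -> None:
-    file_hash = compute_file_hash(file_path)
-    if file_hash not in engines_by_hash:
-        engines_by_hash[file_hash] = deserialize_engine(file_path)  # hypothetical loader
-    model_to_hash[model_id] = file_hash  # 100 model IDs can share one engine
-```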
-
-## Design Rationale
-
-### Why Context Pooling?
-
-**Without pooling** (naive approach):
-```
-100 cameras → 100 model IDs → 100 execution contexts
-```
-- Problem: Each context consumes VRAM (layers, workspace, etc.)
-- Problem: Context creation overhead per camera
-- Problem: Doesn't scale to hundreds of cameras
-
-**With pooling** (our approach):
-```
-100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
-```
-- Solution: Contexts shared across all cameras using same model
-- Solution: Borrow/return mechanism with mutex queue
-- Solution: Scales to any number of cameras with fixed context count
-
-### Memory Savings Example
-
-YOLOv8n model (~6MB engine file):
-
-| Approach | Model IDs | Engines | Contexts | Approx VRAM |
-|----------|-----------|---------|----------|-------------|
-| Naive | 100 | 100 | 100 | ~1.5 GB |
-| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |
-
-**50x memory savings!**
-
-## Usage
-
-### Basic Usage
-
-```python
-from services.model_repository import TensorRTModelRepository
-
-# Initialize repository
-repo = TensorRTModelRepository(
-    gpu_id=0,
-    default_num_contexts=4  # 4 contexts per unique engine
-)
-
-# Load model for camera 1
-repo.load_model(
-    model_id="camera_1",
-    file_path="models/yolov8n.trt"
-)
-
-# Load same model for camera 2 (deduplication happens automatically)
-repo.load_model(
-    model_id="camera_2",
-    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
-)
-
-# Run inference (GPU-to-GPU)
-import torch
-input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
-
-outputs = repo.infer(
-    model_id="camera_1",
-    inputs={"images": input_tensor},
-    synchronize=True,
-    timeout=5.0  # Wait up to 5s for available context
-)
-
-# Outputs stay on GPU
-for name, tensor in outputs.items():
-    print(f"{name}: {tensor.shape} on {tensor.device}")
-```
-
-### Multi-Camera Scenario
-
-```python
-# Setup multiple cameras
-cameras = [f"camera_{i}" for i in range(100)]
-
-# Load same model for all cameras
-for camera_id in cameras:
-    repo.load_model(
-        model_id=camera_id,
-        file_path="models/yolov8n.trt"  # Same file for all
-    )
-
-# Check efficiency
-stats = repo.get_stats()
-print(f"Model IDs: {stats['total_model_ids']}")      # 100
-print(f"Unique engines: {stats['unique_engines']}")  # 1
-print(f"Total contexts: {stats['total_contexts']}")  # 4
-```
-
-### Integration with RTSP Decoder
-
-```python
-from services.stream_decoder import StreamDecoderFactory
-from services.model_repository import TensorRTModelRepository
-
-# Setup
-decoder_factory = StreamDecoderFactory(gpu_id=0)
-model_repo = TensorRTModelRepository(gpu_id=0)
-
-# Create decoder for camera
-decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
-decoder.start()
-
-# Load inference model
-model_repo.load_model("camera_main", "models/yolov8n.trt")
-
-# Process frames (everything on GPU)
-frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA
-
-# Preprocess (stays on GPU)
-frame_gpu = frame_gpu.float() / 255.0
-frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim
-
-# Inference (GPU-to-GPU, zero copy)
-outputs = model_repo.infer(
-    model_id="camera_main",
-    inputs={"images": frame_gpu}
-)
-
-# Post-process outputs (can stay on GPU)
-# ... NMS, bounding boxes, etc.
-```
-
-### Concurrent Inference
-
-```python
-import threading
-
-def process_camera(camera_id: str, model_id: str):
-    # Get frame from decoder (on GPU)
-    frame = decoder.get_latest_frame(rgb=True)
-
-    # Inference automatically borrows/returns context from pool
-    outputs = repo.infer(
-        model_id=model_id,
-        inputs={"images": frame},
-        timeout=10.0  # Wait for available context
-    )
-
-    # Process outputs...
-
-# Multiple threads can infer concurrently
-threads = []
-for i in range(10):  # 10 threads
-    t = threading.Thread(
-        target=process_camera,
-        args=(f"camera_{i}", f"camera_{i}")
-    )
-    threads.append(t)
-    t.start()
-
-for t in threads:
-    t.join()
-
-# With 4 contexts: up to 4 inferences run in parallel
-# Others wait in queue, contexts auto-balanced
-```
-
-## API Reference
-
-### TensorRTModelRepository
-
-#### `__init__(gpu_id=0, default_num_contexts=4)`
-Initialize the repository.
-
-**Args:**
-- `gpu_id`: GPU device ID
-- `default_num_contexts`: Default context pool size per engine
-
-#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
-Load a TensorRT model.
-
-**Args:**
-- `model_id`: Unique identifier (e.g., "camera_1")
-- `file_path`: Path to .trt/.engine file
-- `num_contexts`: Context pool size (None = use default)
-- `force_reload`: Reload if model_id exists
-
-**Returns:** `ModelMetadata`
-
-**Deduplication:** If file hash matches existing model, reuses engine + contexts.
-
-#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
-Run inference.
-
-**Args:**
-- `model_id`: Model identifier
-- `inputs`: Dict mapping input names to CUDA tensors
-- `synchronize`: Wait for completion
-- `timeout`: Max wait time for context (seconds)
-
-**Returns:** Dict mapping output names to CUDA tensors
-
-**Thread-safe:** Borrows context from pool, returns after inference.
-
-#### `unload_model(model_id)`
-Unload a model.
-
-If this was the last reference to the underlying engine, the engine is fully unloaded from VRAM.
-
-#### `get_metadata(model_id)`
-Get model metadata.
-
-**Returns:** `ModelMetadata` or `None`
-
-#### `get_model_info(model_id)`
-Get detailed model information.
-
-**Returns:** Dict with engine references, context pool size, shared model IDs, etc.
-
-#### `get_stats()`
-Get repository statistics.
-
-**Returns:** Dict with total models, unique engines, contexts, memory efficiency.
-
-## Best Practices
-
-### 1. Set Appropriate Context Pool Size
-
-```python
-# For 10 cameras with same model, 4 contexts is usually enough
-repo = TensorRTModelRepository(default_num_contexts=4)
-
-# For high concurrency, increase pool size
-repo = TensorRTModelRepository(default_num_contexts=8)
-```
-
-**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.
-
-### 2. Always Use GPU Tensors
-
-```python
-# ✅ Good: Input on GPU
-input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
-outputs = repo.infer(model_id, {"images": input_gpu})
-
-# ❌ Bad: Input on CPU (will cause error)
-input_cpu = torch.rand(1, 3, 640, 640)
-outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!
-```
-
-### 3. Handle Timeout Gracefully
-
-```python
-try:
-    outputs = repo.infer(
-        model_id="camera_1",
-        inputs=inputs,
-        timeout=5.0
-    )
-except RuntimeError as e:
-    # All contexts busy, increase pool size or add backpressure
-    print(f"Inference timeout: {e}")
-```
-
-### 4. Use Same File for Deduplication
-
-```python
-# ✅ Good: Same file path → deduplication
-repo.load_model("cam1", "/models/yolo.trt")
-repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!
-
-# ❌ Bad: Different paths (even if same content) → no deduplication
-repo.load_model("cam1", "/models/yolo.trt")
-repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine
-```
-
-## TensorRT Best Practices Implemented
-
-Based on NVIDIA documentation and web search findings:
-
-1. **Separate IExecutionContext per concurrent stream** ✅
-   - Each context has its own CUDA stream
-   - Contexts never shared across threads simultaneously
-
-2. **Mutex-based context management** ✅
-   - Queue-based borrowing with locks
-   - Thread-safe acquire/release pattern
-
-3. **GPU memory reuse** ✅
-   - Engines shared by file hash
-   - Contexts pooled and reused
-
-4. **Zero-copy operations** ✅
-   - All data stays in VRAM
-   - DLPack integration with PyTorch
-
-## Troubleshooting
-
-### "No execution context available within timeout"
-
-**Cause:** All contexts busy with concurrent inferences.
-
-**Solutions:**
-1. Increase context pool size:
-   ```python
-   repo.load_model(model_id, file_path, num_contexts=8)
-   ```
-2. Increase timeout:
-   ```python
-   outputs = repo.infer(model_id, inputs, timeout=30.0)
-   ```
-3. Add backpressure/throttling to limit concurrent requests
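-
-A minimal sketch of the backpressure idea from solution 3, assuming a small wrapper around `repo.infer()` sized to match the context pool (the semaphore wrapper below is illustrative and not part of the repository API):
-
-```python
-import threading
-
-# Allow at most as many in-flight requests as there are contexts in the pool
-MAX_IN_FLIGHT = 4
-inference_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
-
-
-def infer_with_backpressure(repo, model_id, inputs):
-    """Block callers once all slots are taken instead of letting infer() time out."""
-    with inference_slots:
-        return repo.infer(model_id=model_id, inputs=inputs, timeout=5.0)
-```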
-
-### Out of Memory (OOM)
-
-**Cause:** Too many unique engines or large context pools.
-
-**Solutions:**
-1. Ensure deduplication working (same file paths)
-2. Reduce context pool sizes
-3. Use smaller models or quantization (INT8/FP16)
-
-### Import Error: "tensorrt could not be resolved"
-
-**Solution:** Install TensorRT:
-```bash
-pip install tensorrt
-# Or use NVIDIA's wheel for your CUDA version
-```
-
-## Performance Tips
-
-1. **Batch Processing:** Process multiple frames before synchronizing
-   ```python
-   outputs = repo.infer(model_id, inputs, synchronize=False)
-   # ... more inferences ...
-   torch.cuda.synchronize()  # Sync once at end
-   ```
-
-2. **Async Inference:** Don't synchronize if not needed immediately
-   ```python
-   outputs = repo.infer(model_id, inputs, synchronize=False)
-   # GPU continues working, CPU continues
-   # Synchronize later when you need results
-   ```
-
-3. **Monitor Context Utilization:**
-   ```python
-   stats = repo.get_stats()
-   print(f"Contexts: {stats['total_contexts']}")
-
-   # If timeouts occur frequently, increase pool size
-   ```
-
-## License
-
-Part of python-rtsp-worker project.