remove unrelated docs

2025-11-09 11:51:21 +07:00
commit 56a65a3377
parent 8e20496fa7
2 changed files with 0 additions and 648 deletions
--- a/services/README_MODEL_REPOSITORY.md
+++ b/services/README_MODEL_REPOSITORY.md
@@ -1,380 +0,0 @@
-# TensorRT Model Repository
-
-Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.
-
-## Architecture
-
-### Key Features
-
-1. **Model Deduplication by File Hash**
-   - Multiple model IDs can point to the same model file
-   - Only one engine loaded in VRAM per unique file
-   - Example: 100 cameras with same model = 1 engine (not 100!)
-
-2. **Context Pooling for Load Balancing**
-   - Each unique engine has N execution contexts (configurable)
-   - Contexts borrowed/returned via mutex-based queue
-   - Enables concurrent inference without context-per-model overhead
-   - Example: 100 cameras sharing 4 contexts efficiently
-
-3. **GPU-to-GPU Inference**
-   - All inputs/outputs stay in VRAM (zero CPU transfers)
-   - Integrates seamlessly with StreamDecoder (frames already on GPU)
-   - Maximum performance for video inference pipelines
-
-4. **Thread-Safe Concurrent Inference**
-   - Mutex-based context acquisition (TensorRT best practice)
-   - No shared IExecutionContext across threads (safe)
-   - Multiple threads can infer concurrently (limited by pool size)
-
-## Design Rationale
-
-### Why Context Pooling?
-
-**Without pooling** (naive approach):
-```
-100 cameras → 100 model IDs → 100 execution contexts
-```
-- Problem: Each context consumes VRAM (layers, workspace, etc.)
-- Problem: Context creation overhead per camera
-- Problem: Doesn't scale to hundreds of cameras
-
-**With pooling** (our approach):
-```
-100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
-```
-- Solution: Contexts shared across all cameras using same model
-- Solution: Borrow/return mechanism with mutex queue
-- Solution: Scales to any number of cameras with fixed context count
-
-### Memory Savings Example
-
-YOLOv8n model (~6MB engine file):
-
-| Approach | Model IDs | Engines | Contexts | Approx VRAM |
-|----------|-----------|---------|----------|-------------|
-| Naive | 100 | 100 | 100 | ~1.5 GB |
-| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |
-
-**50x memory savings!**
-
-## Usage
-
-### Basic Usage
-
-```python
-from services.model_repository import TensorRTModelRepository
-
-# Initialize repository
-repo = TensorRTModelRepository(
-    gpu_id=0,
-    default_num_contexts=4  # 4 contexts per unique engine
-)
-
-# Load model for camera 1
-repo.load_model(
-    model_id="camera_1",
-    file_path="models/yolov8n.trt"
-)
-
-# Load same model for camera 2 (deduplication happens automatically)
-repo.load_model(
-    model_id="camera_2",
-    file_path="models/yolov8n.trt"  # Same file → shares engine and contexts!
-)
-
-# Run inference (GPU-to-GPU)
-import torch
-input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
-
-outputs = repo.infer(
-    model_id="camera_1",
-    inputs={"images": input_tensor},
-    synchronize=True,
-    timeout=5.0  # Wait up to 5s for available context
-)
-
-# Outputs stay on GPU
-for name, tensor in outputs.items():
-    print(f"{name}: {tensor.shape} on {tensor.device}")
-```
-
-### Multi-Camera Scenario
-
-```python
-# Setup multiple cameras
-cameras = [f"camera_{i}" for i in range(100)]
-
-# Load same model for all cameras
-for camera_id in cameras:
-    repo.load_model(
-        model_id=camera_id,
-        file_path="models/yolov8n.trt"  # Same file for all
-    )
-
-# Check efficiency
-stats = repo.get_stats()
-print(f"Model IDs: {stats['total_model_ids']}")  # 100
-print(f"Unique engines: {stats['unique_engines']}")  # 1
-print(f"Total contexts: {stats['total_contexts']}")  # 4
-```
-
-### Integration with RTSP Decoder
-
-```python
-from services.stream_decoder import StreamDecoderFactory
-from services.model_repository import TensorRTModelRepository
-
-# Setup
-decoder_factory = StreamDecoderFactory(gpu_id=0)
-model_repo = TensorRTModelRepository(gpu_id=0)
-
-# Create decoder for camera
-decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
-decoder.start()
-
-# Load inference model
-model_repo.load_model("camera_main", "models/yolov8n.trt")
-
-# Process frames (everything on GPU)
-frame_gpu = decoder.get_latest_frame(rgb=True)  # torch.Tensor on CUDA
-
-# Preprocess (stays on GPU)
-frame_gpu = frame_gpu.float() / 255.0
-frame_gpu = frame_gpu.unsqueeze(0)  # Add batch dim
-
-# Inference (GPU-to-GPU, zero copy)
-outputs = model_repo.infer(
-    model_id="camera_main",
-    inputs={"images": frame_gpu}
-)
-
-# Post-process outputs (can stay on GPU)
-# ... NMS, bounding boxes, etc.
-```
-
-### Concurrent Inference
-
-```python
-import threading
-
-def process_camera(camera_id: str, model_id: str):
-    # Get frame from decoder (on GPU)
-    frame = decoder.get_latest_frame(rgb=True)
-
-    # Inference automatically borrows/returns context from pool
-    outputs = repo.infer(
-        model_id=model_id,
-        inputs={"images": frame},
-        timeout=10.0  # Wait for available context
-    )
-
-    # Process outputs...
-
-# Multiple threads can infer concurrently
-threads = []
-for i in range(10):  # 10 threads
-    t = threading.Thread(
-        target=process_camera,
-        args=(f"camera_{i}", f"camera_{i}")
-    )
-    threads.append(t)
-    t.start()
-
-for t in threads:
-    t.join()
-
-# With 4 contexts: up to 4 inferences run in parallel
-# Others wait in queue, contexts auto-balanced
-```
-
-## API Reference
-
-### TensorRTModelRepository
-
-#### `__init__(gpu_id=0, default_num_contexts=4)`
-Initialize the repository.
-
-**Args:**
-- `gpu_id`: GPU device ID
-- `default_num_contexts`: Default context pool size per engine
-
-#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
-Load a TensorRT model.
-
-**Args:**
-- `model_id`: Unique identifier (e.g., "camera_1")
-- `file_path`: Path to .trt/.engine file
-- `num_contexts`: Context pool size (None = use default)
-- `force_reload`: Reload if model_id exists
-
-**Returns:** `ModelMetadata`
-
-**Deduplication:** If the file hash matches an already loaded model, the existing engine and its context pool are reused.
-
-#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
-Run inference.
-
-**Args:**
-- `model_id`: Model identifier
-- `inputs`: Dict mapping input names to CUDA tensors
-- `synchronize`: Wait for completion
-- `timeout`: Max wait time for context (seconds)
-
-**Returns:** Dict mapping output names to CUDA tensors
-
-**Thread-safe:** Borrows context from pool, returns after inference.
-
-#### `unload_model(model_id)`
-Unload a model.
-
-If this was the last reference to the engine, the engine is fully unloaded from VRAM.
-
-#### `get_metadata(model_id)`
-Get model metadata.
-
-**Returns:** `ModelMetadata` or `None`
-
-#### `get_model_info(model_id)`
-Get detailed model information.
-
-**Returns:** Dict with engine references, context pool size, shared model IDs, etc.
-
-#### `get_stats()`
-Get repository statistics.
-
-**Returns:** Dict with total models, unique engines, contexts, memory efficiency.
-
-## Best Practices
-
-### 1. Set Appropriate Context Pool Size
-
-```python
-# For 10 cameras with same model, 4 contexts is usually enough
-repo = TensorRTModelRepository(default_num_contexts=4)
-
-# For high concurrency, increase pool size
-repo = TensorRTModelRepository(default_num_contexts=8)
-```
-
-**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.
-
-### 2. Always Use GPU Tensors
-
-```python
-# ✅ Good: Input on GPU
-input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
-outputs = repo.infer(model_id, {"images": input_gpu})
-
-# ❌ Bad: Input on CPU (will cause error)
-input_cpu = torch.rand(1, 3, 640, 640)
-outputs = repo.infer(model_id, {"images": input_cpu})  # ValueError!
-```
-
-### 3. Handle Timeout Gracefully
-
-```python
-try:
-    outputs = repo.infer(
-        model_id="camera_1",
-        inputs=inputs,
-        timeout=5.0
-    )
-except RuntimeError as e:
-    # All contexts busy, increase pool size or add backpressure
-    print(f"Inference timeout: {e}")
-```
-
-### 4. Use Same File for Deduplication
-
-```python
-# ✅ Good: Same file path → deduplication
-repo.load_model("cam1", "/models/yolo.trt")
-repo.load_model("cam2", "/models/yolo.trt")  # Shares engine!
-
-# ❌ Bad: Different paths (even if same content) → no deduplication
-repo.load_model("cam1", "/models/yolo.trt")
-repo.load_model("cam2", "/models/yolo_copy.trt")  # Separate engine
-```
-
-## TensorRT Best Practices Implemented
-
-Based on NVIDIA documentation and web search findings:
-
-1. **Separate IExecutionContext per concurrent stream** ✅
-   - Each context has its own CUDA stream
-   - Contexts never shared across threads simultaneously
-
-2. **Mutex-based context management** ✅
-   - Queue-based borrowing with locks
-   - Thread-safe acquire/release pattern
-
-3. **GPU memory reuse** ✅
-   - Engines shared by file hash
-   - Contexts pooled and reused
-
-4. **Zero-copy operations** ✅
-   - All data stays in VRAM
-   - DLPack integration with PyTorch
-
-## Troubleshooting
-
-### "No execution context available within timeout"
-
-**Cause:** All contexts busy with concurrent inferences.
-
-**Solutions:**
-1. Increase context pool size:
-   ```python
-   repo.load_model(model_id, file_path, num_contexts=8)
-   ```
-2. Increase timeout:
-   ```python
-   outputs = repo.infer(model_id, inputs, timeout=30.0)
-   ```
-3. Add backpressure/throttling to limit concurrent requests
-
-### Out of Memory (OOM)
-
-**Cause:** Too many unique engines or large context pools.
-
-**Solutions:**
-1. Ensure deduplication is working (same file paths)
-2. Reduce context pool sizes
-3. Use smaller models or quantization (INT8/FP16)
-
-### Import Error: "tensorrt could not be resolved"
-
-**Solution:** Install TensorRT:
-```bash
-pip install tensorrt
-# Or use NVIDIA's wheel for your CUDA version
-```
-
-## Performance Tips
-
-1. **Batch Processing:** Process multiple frames before synchronizing
-   ```python
-   outputs = repo.infer(model_id, inputs, synchronize=False)
-   # ... more inferences ...
-   torch.cuda.synchronize()  # Sync once at end
-   ```
-
-2. **Async Inference:** Don't synchronize if not needed immediately
-   ```python
-   outputs = repo.infer(model_id, inputs, synchronize=False)
-   # GPU continues working, CPU continues
-   # Synchronize later when you need results
-   ```
-
-3. **Monitor Context Utilization:**
-   ```python
-   stats = repo.get_stats()
-   print(f"Contexts: {stats['total_contexts']}")
-
-   # If timeouts occur frequently, increase pool size
-   ```
-
-## License
-
-Part of python-rtsp-worker project.
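
For reference, the two mechanisms the removed README describes, deduplication by file hash and a fixed-size context pool with borrow/return semantics, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code of `TensorRTModelRepository`: `file_sha256`, `ContextPool`, and `EngineRegistry` are hypothetical names, and plain strings stand in for TensorRT engines and execution contexts.

```python
import hashlib
import queue
import threading
from contextlib import contextmanager


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large engine files are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ContextPool:
    """Fixed-size pool (hypothetical); queue.Queue supplies the locking."""

    def __init__(self, contexts):
        self._free = queue.Queue()
        for ctx in contexts:
            self._free.put(ctx)

    @contextmanager
    def borrow(self, timeout=5.0):
        # Block until a context is free, or fail after `timeout` seconds.
        try:
            ctx = self._free.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError(
                "No execution context available within timeout"
            ) from None
        try:
            yield ctx
        finally:
            self._free.put(ctx)  # always return the context, even on error


class EngineRegistry:
    """Maps many model IDs onto one pool per unique file hash (hypothetical)."""

    def __init__(self, num_contexts=4):
        self._num_contexts = num_contexts
        self._pools = {}          # file hash -> ContextPool
        self._model_to_hash = {}  # model_id -> file hash
        self._lock = threading.Lock()

    def load_model(self, model_id, file_path):
        file_hash = file_sha256(file_path)
        with self._lock:
            if file_hash not in self._pools:
                # Stand-in objects: a real implementation would deserialize
                # the engine and create its execution contexts here.
                contexts = [
                    f"ctx-{file_hash[:8]}-{i}" for i in range(self._num_contexts)
                ]
                self._pools[file_hash] = ContextPool(contexts)
            self._model_to_hash[model_id] = file_hash
        return file_hash

    def pool_for(self, model_id):
        return self._pools[self._model_to_hash[model_id]]

    def unique_engines(self):
        return len(self._pools)
```

Because this sketch keys on file contents, two byte-identical files at different paths share one pool. The removed README's best practice 4 says different paths are not deduplicated, so the original implementation may have keyed differently.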