feat: inference subsystem and optimization to decoder
commit 3c83a57e44
19 changed files with 3897 additions and 0 deletions

scripts/README.md (new file, 197 lines)
# Scripts Directory

This directory contains utility scripts for the python-rtsp-worker project.

## convert_pt_to_tensorrt.py

Converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized GPU inference.
### Features

- **Multiple Precision Modes**: FP32, FP16, and INT8
- **Dynamic Batch Size**: Support for variable batch sizes
- **Automatic Optimization**: Creates optimization profiles for best performance
- **ONNX Intermediate**: Uses ONNX as an intermediate format for compatibility (see the conversion sketch below)
- **Easy to Use**: Simple command-line interface
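For orientation, the conversion runs in two stages: export the PyTorch model to ONNX, then parse the ONNX file and build a TensorRT engine. The sketch below illustrates that flow with TensorRT's Python API; the `convert` function name, tensor names, and defaults are illustrative, not the script's actual code.

```python
# Minimal sketch of the PyTorch -> ONNX -> TensorRT flow (illustrative only)
import torch
import tensorrt as trt

def convert(model: torch.nn.Module, onnx_path: str, engine_path: str,
            input_shape=(1, 3, 640, 640), fp16=True):
    # Stage 1: export the PyTorch model to ONNX with a dummy input
    model.eval()
    dummy = torch.randn(*input_shape)
    torch.onnx.export(model, dummy, onnx_path,
                      input_names=["input"], output_names=["output"],
                      opset_version=17)

    # Stage 2: parse the ONNX file and build a serialized TensorRT engine
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    engine = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine)
```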
### Requirements

Make sure you have the following dependencies installed:

```bash
pip install torch tensorrt onnx
```
### Quick Start

**Basic conversion (FP32)**:
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model path/to/model.pt \
    --output models/model.trt
```

**FP16 precision** (recommended for most cases: ~2x faster, minimal accuracy loss):
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model yolov8n.pt \
    --output models/yolov8n.trt \
    --fp16
```

**Custom input shape**:
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model model.pt \
    --output model.trt \
    --input-shape 1,3,416,416
```

**Dynamic batch size** (for variable batch inference):
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model model.pt \
    --output model.trt \
    --dynamic-batch \
    --max-batch 16
```

**Maximum optimization** (FP16 + INT8):
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model model.pt \
    --output model.trt \
    --fp16 \
    --int8
```
### Command-Line Arguments

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--model`, `-m` | Yes | - | Path to PyTorch model file (.pt or .pth) |
| `--output`, `-o` | Yes | - | Output path for TensorRT engine (.trt) |
| `--input-shape`, `-s` | No | 1,3,640,640 | Input tensor shape as B,C,H,W |
| `--fp16` | No | False | Enable FP16 precision (faster, ~same accuracy) |
| `--int8` | No | False | Enable INT8 precision (fastest, needs calibration) |
| `--dynamic-batch` | No | False | Enable dynamic batch size support |
| `--max-batch` | No | 16 | Maximum batch size for dynamic batching |
| `--workspace-size` | No | 4 | TensorRT workspace size in GB |
| `--gpu` | No | 0 | GPU device ID to use |
| `--input-names` | No | ["input"] | Custom input tensor names |
| `--output-names` | No | ["output"] | Custom output tensor names |
| `--keep-onnx` | No | False | Keep intermediate ONNX file for debugging |
| `--verbose`, `-v` | No | False | Enable verbose logging |
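For reference, the table above maps onto a fairly standard argparse setup. The sketch below is a plausible reconstruction of that interface, not the script's actual code; the `parse_shape` helper and `nargs` choices are assumptions.

```python
# Plausible argparse layout mirroring the documented flags (illustrative only)
import argparse

def parse_shape(value: str) -> tuple:
    # "1,3,640,640" -> (1, 3, 640, 640), i.e. B,C,H,W
    return tuple(int(dim) for dim in value.split(","))

parser = argparse.ArgumentParser(description="Convert a PyTorch model to a TensorRT engine")
parser.add_argument("--model", "-m", required=True, help="Path to .pt/.pth model")
parser.add_argument("--output", "-o", required=True, help="Output .trt engine path")
parser.add_argument("--input-shape", "-s", type=parse_shape, default=(1, 3, 640, 640))
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--int8", action="store_true")
parser.add_argument("--dynamic-batch", action="store_true")
parser.add_argument("--max-batch", type=int, default=16)
parser.add_argument("--workspace-size", type=int, default=4, help="Size in GB")
parser.add_argument("--gpu", type=int, default=0)
parser.add_argument("--input-names", nargs="+", default=["input"])
parser.add_argument("--output-names", nargs="+", default=["output"])
parser.add_argument("--keep-onnx", action="store_true")
parser.add_argument("--verbose", "-v", action="store_true")

args = parser.parse_args()
```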
### Performance Tips

1. **Always use FP16** unless you need FP32 precision:
   - 2x faster inference
   - 50% less VRAM usage
   - Minimal accuracy loss for most models

2. **Use dynamic batching** for variable workloads (see the builder-config sketch after this list):
   - Process 1-16 images with the same engine
   - Automatic optimization for common batch sizes

3. **Increase workspace size** for complex models:
   - Default 4 GB works for most models
   - Increase to 8 GB for very large models

4. **INT8 quantization** for maximum speed:
   - Requires calibration data (not included in basic conversion)
   - 4x faster than FP32
   - Best for deployment scenarios
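The tips above correspond to a handful of TensorRT builder-config calls. Below is a minimal sketch using TensorRT's Python API; the tensor name `"input"` and the batch sizes are examples, and the INT8 calibrator is omitted, as noted in tip 4.

```python
# Sketch: the TensorRT builder-config knobs behind the tips above (illustrative)
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Tip 1: enable FP16 when the GPU supports it
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# Tip 2: dynamic batching via an optimization profile on the "input" tensor
# (min / optimal / max shapes, here batch 1 / 8 / 16)
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 640, 640), (8, 3, 640, 640), (16, 3, 640, 640))
config.add_optimization_profile(profile)

# Tip 3: workspace memory limit (4 GB here; raise it for very large models)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)

# Tip 4: INT8 additionally needs a calibrator assigned to config.int8_calibrator
config.set_flag(trt.BuilderFlag.INT8)
```

Note that a dynamic-batch profile only helps if the ONNX export also marks the batch dimension as dynamic (the `dynamic_axes` argument of `torch.onnx.export`).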
### Integration with Model Repository

Once converted, use the TensorRT engine with the model repository:

```python
from services.model_repository import TensorRTModelRepository
import torch

# Initialize repository
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)

# Load the converted model
repo.load_model(
    model_id="my_model",
    file_path="models/model.trt",
    num_contexts=4
)

# Run inference
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(
    model_id="my_model",
    inputs={"input": input_tensor}
)
```
### Troubleshooting

**Issue**: `Failed to parse ONNX model`
- Solution: Check if your PyTorch model is compatible with ONNX export (see the check below)
- Try updating PyTorch and ONNX versions
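One quick way to narrow this down is to validate the intermediate ONNX file on its own (keep it with `--keep-onnx`). A minimal check using the `onnx` package from the requirements above; the file path is an example:

```python
# Validate the exported ONNX file independently of TensorRT
import onnx

model = onnx.load("models/model.onnx")            # path assumes --keep-onnx was used
onnx.checker.check_model(model)                   # raises if the graph is malformed
print(onnx.helper.printable_graph(model.graph))   # inspect inputs, outputs, and ops
```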
**Issue**: `FP16 not supported on this platform`
- Solution: Your GPU doesn't support FP16; remove the `--fp16` flag

**Issue**: `Out of memory during conversion`
- Solution: Reduce `--workspace-size` or free up GPU memory

**Issue**: `Model contains only state_dict`
- Solution: Your checkpoint only has the weights; the full model architecture is needed
- Modify the script's `load_pytorch_model()` method to instantiate your model class (see the sketch below)
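For example, if your checkpoint is a bare `state_dict`, something along these lines inside `load_pytorch_model()` would work; the `MyModel` class, its import path, and the checkpoint filename are placeholders for your own code:

```python
# Rebuild the full model when the checkpoint holds only a state_dict
import torch
from my_project.models import MyModel   # placeholder: your model class

model = MyModel()                                    # instantiate the architecture
state_dict = torch.load("custom_model.pt", map_location="cpu")
if isinstance(state_dict, dict) and "state_dict" in state_dict:
    state_dict = state_dict["state_dict"]            # unwrap common checkpoint wrappers
model.load_state_dict(state_dict)
model.eval()                                         # ready for ONNX export
```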
### Examples for Common Models

**YOLOv8**:
```bash
# Download/obtain yolov8n.pt first, then either use Ultralytics' own exporter:
# yolo export model=yolov8n.pt format=engine device=0

# or use this script:
python scripts/convert_pt_to_tensorrt.py \
    --model yolov8n.pt \
    --output models/yolov8n.trt \
    --input-shape 1,3,640,640 \
    --fp16
```

**ResNet**:
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model resnet50.pt \
    --output models/resnet50.trt \
    --input-shape 1,3,224,224 \
    --fp16 \
    --dynamic-batch \
    --max-batch 32
```

**Custom Model**:
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model custom_model.pt \
    --output models/custom.trt \
    --input-shape 1,3,512,512 \
    --input-names image \
    --output-names predictions \
    --fp16 \
    --verbose
```
### Notes

- The script uses ONNX as an intermediate format, which is the recommended approach
- TensorRT engines are hardware-specific; rebuild them for different GPUs
- Conversion time varies (30 seconds to 5 minutes depending on model size)
- The first inference after loading is slower (warmup); see the sketch below
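If you benchmark right after loading, run a few throwaway inferences first so the measurement excludes warmup. A small sketch reusing `repo` and the `"my_model"` engine from the integration example above; the batch shape and iteration counts are arbitrary:

```python
# Warm up the engine before timing it
import time
import torch

dummy = torch.rand(1, 3, 640, 640, device='cuda:0')

for _ in range(10):                       # throwaway warmup iterations
    repo.infer(model_id="my_model", inputs={"input": dummy})
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(100):                      # timed iterations
    repo.infer(model_id="my_model", inputs={"input": dummy})
torch.cuda.synchronize()
print(f"avg latency: {(time.perf_counter() - start) / 100 * 1000:.2f} ms")
```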
### Support

For issues or questions, please check:
- TensorRT documentation: https://docs.nvidia.com/deeplearning/tensorrt/
- PyTorch ONNX export guide: https://pytorch.org/docs/stable/onnx.html