# Scripts Directory
This directory contains utility scripts for the python-rtsp-worker project.
## convert_pt_to_tensorrt.py
Converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized GPU inference.
### Features
- **Multiple Precision Modes**: FP32, FP16, INT8
- **Dynamic Batch Size**: Support for variable batch sizes
- **Automatic Optimization**: Creates optimization profiles for best performance
- **ONNX Intermediate**: Uses ONNX as an intermediate format for compatibility (see the sketch below)
- **Easy to Use**: Simple command-line interface
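For orientation, the two-stage flow the script automates (export to ONNX, then build a TensorRT engine) looks roughly like the sketch below. File names, the opset version, and the workspace size are illustrative, not the script's exact code:

```python
import tensorrt as trt
import torch

# 1. Export the PyTorch model to ONNX (paths and opset are illustrative).
model = torch.load("model.pt", map_location="cpu", weights_only=False).eval()
dummy = torch.randn(1, 3, 640, 640)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"], opset_version=17)

# 2. Parse the ONNX file and build a serialized TensorRT engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GB workspace
config.set_flag(trt.BuilderFlag.FP16)                                # only if FP16 is requested

with open("model.trt", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```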
### Requirements
Make sure you have the following dependencies installed:
```bash
pip install torch tensorrt onnx
```
### Quick Start
**Basic conversion (FP32)**:
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model path/to/model.pt \
    --output models/model.trt
```
**FP16 precision** (recommended for most cases: roughly 2x faster with minimal accuracy loss):
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model yolov8n.pt \
    --output models/yolov8n.trt \
    --fp16
```
**Custom input shape**:
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model model.pt \
    --output model.trt \
    --input-shape 1,3,416,416
```
**Dynamic batch size** (for variable batch inference):
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model model.pt \
    --output model.trt \
    --dynamic-batch \
    --max-batch 16
```
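Under the hood, dynamic batch support in TensorRT is usually expressed as an optimization profile with min/opt/max input shapes. A minimal sketch, with the tensor name and shapes assumed to match the defaults above:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# One optimization profile covering batch sizes 1..16 for a 3x640x640 input.
# "input" matches the default --input-names; shapes mirror --input-shape / --max-batch.
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  min=(1, 3, 640, 640),   # smallest accepted batch
                  opt=(8, 3, 640, 640),   # batch size TensorRT tunes kernels for
                  max=(16, 3, 640, 640))  # largest accepted batch (--max-batch)
config.add_optimization_profile(profile)
```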
**Maximum optimization** (FP16 + INT8):
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model model.pt \
    --output model.trt \
    --fp16 \
    --int8
```
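For reference, these precision options typically correspond to TensorRT builder-config flags; a hedged fragment (the mapping is the standard TensorRT API, not a copy of the script's code):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Precision flags the --fp16 / --int8 options correspond to:
config.set_flag(trt.BuilderFlag.FP16)  # --fp16: half-precision kernels where supported
config.set_flag(trt.BuilderFlag.INT8)  # --int8: also requires a calibrator (see Performance Tips)
```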
### Command-Line Arguments
| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--model`, `-m` | Yes | - | Path to PyTorch model file (.pt or .pth) |
| `--output`, `-o` | Yes | - | Output path for TensorRT engine (.trt) |
| `--input-shape`, `-s` | No | 1,3,640,640 | Input tensor shape as B,C,H,W |
| `--fp16` | No | False | Enable FP16 precision (faster, ~same accuracy) |
| `--int8` | No | False | Enable INT8 precision (fastest, needs calibration) |
| `--dynamic-batch` | No | False | Enable dynamic batch size support |
| `--max-batch` | No | 16 | Maximum batch size for dynamic batching |
| `--workspace-size` | No | 4 | TensorRT workspace size in GB |
| `--gpu` | No | 0 | GPU device ID to use |
| `--input-names` | No | ["input"] | Custom input tensor names |
| `--output-names` | No | ["output"] | Custom output tensor names |
| `--keep-onnx` | No | False | Keep intermediate ONNX file for debugging |
| `--verbose`, `-v` | No | False | Enable verbose logging |
### Performance Tips
1. **Always use FP16** unless you need full FP32 precision:
   - Roughly 2x faster inference
   - About 50% less VRAM usage
   - Minimal accuracy loss for most models
2. **Use dynamic batching** for variable workloads:
   - Process 1-16 images with the same engine
   - Automatic optimization for common batch sizes
3. **Increase the workspace size** for complex models:
   - The default 4 GB works for most models
   - Increase to 8 GB for very large models
4. **INT8 quantization** for maximum speed:
   - Requires calibration data (not included in the basic conversion; see the calibrator sketch below)
   - Up to 4x faster than FP32
   - Best for deployment scenarios
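Since calibration is not part of the basic conversion, here is a rough sketch of what an INT8 calibrator can look like with the standard TensorRT Python API. The class name, batch handling, and use of a torch tensor as the device buffer are illustrative choices, not something the script ships:

```python
import numpy as np
import tensorrt as trt
import torch

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds batches of representative, preprocessed images to TensorRT during INT8 calibration."""

    def __init__(self, images: np.ndarray, batch_size: int = 8, cache_file: str = "calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.images = images              # (N, 3, H, W) float32, preprocessed like inference inputs
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.index = 0
        # Device buffer reused for every batch; its pointer is handed to TensorRT.
        self.device_batch = torch.empty((batch_size, *images.shape[1:]),
                                        dtype=torch.float32, device="cuda")

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.images):
            return None  # no more calibration data
        batch = self.images[self.index:self.index + self.batch_size]
        self.device_batch.copy_(torch.from_numpy(batch))
        self.index += self.batch_size
        return [int(self.device_batch.data_ptr())]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Attach it to the builder config alongside the INT8 flag:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = EntropyCalibrator(calibration_images)
```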
### Integration with Model Repository
Once converted, use the TensorRT engine with the model repository:
```python
import torch

from services.model_repository import TensorRTModelRepository

# Initialize the repository
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)

# Load the converted model
repo.load_model(
    model_id="my_model",
    file_path="models/model.trt",
    num_contexts=4,
)

# Run inference
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
outputs = repo.infer(
    model_id="my_model",
    inputs={"input": input_tensor},
)
```
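If you only want to confirm that the generated engine deserializes on the target GPU before wiring it into the repository, a quick check along these lines works (the path is an example; the tensor-name APIs require TensorRT 8.5+):

```python
import tensorrt as trt

# Deserialize the engine and list its I/O tensor names as a sanity check.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("models/model.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
print("I/O tensors:", names)
```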
### Troubleshooting
**Issue**: `Failed to parse ONNX model`
- Solution: Check that your PyTorch model is compatible with ONNX export
- Try updating your PyTorch and ONNX versions

**Issue**: `FP16 not supported on this platform`
- Solution: Your GPU doesn't support FP16; remove the `--fp16` flag

**Issue**: `Out of memory during conversion`
- Solution: Reduce `--workspace-size` or free up GPU memory

**Issue**: `Model contains only state_dict`
- Solution: Your checkpoint contains only weights, but the full model architecture is needed
- Modify the script's `load_pytorch_model()` method to instantiate your model class and load the weights into it (see the sketch below)
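For the state_dict case, the idea is to build the model object yourself and load the checkpoint's weights into it before export. `MyModel`, its constructor arguments, and the paths below are placeholders for your own architecture:

```python
import torch

from my_project.models import MyModel  # placeholder: your own model class

# Rebuild the architecture, then load the checkpoint's weights into it.
model = MyModel(num_classes=80)                      # constructor args are illustrative
state_dict = torch.load("custom_model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# One possible workaround: save the full model so the converter can load it directly.
torch.save(model, "custom_model_full.pt")
```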
### Examples for Common Models
**YOLOv8**:
```bash
# Download yolov8n.pt first, then either use Ultralytics' built-in exporter:
# yolo export model=yolov8n.pt format=engine device=0
# or use this script:
python scripts/convert_pt_to_tensorrt.py \
    --model yolov8n.pt \
    --output models/yolov8n.trt \
    --input-shape 1,3,640,640 \
    --fp16
```
**ResNet**:
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model resnet50.pt \
    --output models/resnet50.trt \
    --input-shape 1,3,224,224 \
    --fp16 \
    --dynamic-batch \
    --max-batch 32
```
**Custom Model**:
```bash
python scripts/convert_pt_to_tensorrt.py \
    --model custom_model.pt \
    --output models/custom.trt \
    --input-shape 1,3,512,512 \
    --input-names image \
    --output-names predictions \
    --fp16 \
    --verbose
```
### Notes
- The script uses ONNX as an intermediate format, which is the recommended approach
- TensorRT engines are hardware-specific; rebuild for different GPUs
- Conversion time varies (30 seconds to 5 minutes depending on model size)
- The first inference after loading is slower (warmup); see the sketch below
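If startup latency matters, one option is to issue a few throwaway inferences right after loading, using the repository API from the integration example above (shapes are illustrative):

```python
import torch

# Warm up the engine: the first few calls pay for CUDA context and kernel setup.
dummy = torch.rand(1, 3, 640, 640, device="cuda:0")
for _ in range(3):
    repo.infer(model_id="my_model", inputs={"input": dummy})
```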
### Support
For issues or questions, please check:
- TensorRT documentation: https://docs.nvidia.com/deeplearning/tensorrt/
- PyTorch ONNX export guide: https://pytorch.org/docs/stable/onnx.html