feat: inference subsystem and optimization to decoder
This commit is contained in:
commit
3c83a57e44
19 changed files with 3897 additions and 0 deletions
11
.env.example
Normal file
11
.env.example
Normal file
|
|
@ -0,0 +1,11 @@
|
||||||
|
# RTSP Camera URLs
|
||||||
|
# Add your camera URLs here, one per line with CAMERA_URL_N format
|
||||||
|
|
||||||
|
CAMERA_URL_1=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_2=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_3=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_4=rtsp://user:pass@host/path
|
||||||
|
|
||||||
|
# Add more cameras as needed...
|
||||||
|
# CAMERA_URL_5=rtsp://user:pass@host/path
|
||||||
|
# CAMERA_URL_6=rtsp://user:pass@host/path
|
||||||
6
.gitignore
vendored
Normal file
6
.gitignore
vendored
Normal file
|
|
@ -0,0 +1,6 @@
|
||||||
|
fastapi
|
||||||
|
__pycache__/
|
||||||
|
*.pyc
|
||||||
|
.env
|
||||||
|
.claude
|
||||||
|
models/
|
||||||
13
app.py
Normal file
13
app.py
Normal file
|
|
@ -0,0 +1,13 @@
|
||||||
|
from fastapi import FastAPI
|
||||||
|
|
||||||
|
app = FastAPI()
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/")
|
||||||
|
async def root():
|
||||||
|
return {"message": "Hello World"}
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/health")
|
||||||
|
async def health_check():
|
||||||
|
return {"status": "healthy"}
|
||||||
373
claude.md
Normal file
373
claude.md
Normal file
|
|
@ -0,0 +1,373 @@
|
||||||
|
# GPU-Accelerated RTSP Stream Processing System
|
||||||
|
|
||||||
|
## Project Overview
|
||||||
|
|
||||||
|
A high-performance RTSP stream processing system designed to handle 1000+ concurrent camera streams using NVIDIA GPU hardware acceleration. The system implements a zero-copy GPU pipeline that minimizes VRAM usage through shared CUDA context and keeps all processing on the GPU until final JPEG compression.
|
||||||
|
|
||||||
|
## Key Achievements
|
||||||
|
|
||||||
|
- **Shared CUDA Context**: 70% VRAM reduction (from ~200MB to ~60MB per stream)
|
||||||
|
- **Linear VRAM Scaling**: Perfect scaling at 60 MB per additional stream
|
||||||
|
- **Zero-Copy Pipeline**: All processing stays on GPU until JPEG bytes
|
||||||
|
- **Proven Performance**: 4 streams @ 720p, 7-7.5 FPS each, 458 MB total VRAM
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
### Pipeline Flow
|
||||||
|
```
|
||||||
|
RTSP Stream → PyAV (CPU)
|
||||||
|
↓
|
||||||
|
NVDEC Decode (GPU) → NV12 Format
|
||||||
|
↓
|
||||||
|
NV12 to RGB (GPU) → PyTorch Ops
|
||||||
|
↓
|
||||||
|
nvJPEG Encode (GPU) → JPEG Bytes
|
||||||
|
↓
|
||||||
|
CPU (JPEG only)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Core Components
|
||||||
|
|
||||||
|
#### StreamDecoderFactory
|
||||||
|
Singleton factory managing shared CUDA context across all decoder instances.
|
||||||
|
|
||||||
|
**Key Methods:**
|
||||||
|
- `get_factory(gpu_id)`: Returns singleton instance
|
||||||
|
- `create_decoder(rtsp_url, buffer_size)`: Creates new decoder with shared context
|
||||||
|
|
||||||
|
**CUDA Context Initialization:**
|
||||||
|
```python
|
||||||
|
err, = cuda_driver.cuInit(0)
|
||||||
|
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
|
||||||
|
```
|
||||||
|
|
||||||
|
#### StreamDecoder
|
||||||
|
Individual stream decoder with NVDEC hardware acceleration and thread-safe ring buffer.
|
||||||
|
|
||||||
|
**Key Features:**
|
||||||
|
- Thread-safe frame buffer (deque)
|
||||||
|
- Connection status tracking
|
||||||
|
- Automatic reconnection handling
|
||||||
|
- Background thread for continuous decoding
|
||||||
|
|
||||||
|
**Key Methods:**
|
||||||
|
- `start()`: Start decoding thread
|
||||||
|
- `stop()`: Stop and cleanup
|
||||||
|
- `get_latest_frame()`: Get most recent RGB frame (GPU tensor)
|
||||||
|
- `is_connected()`: Check connection status
|
||||||
|
- `get_buffer_size()`: Current buffer size
|
||||||
|
|
||||||
|
#### JPEGEncoderFactory
|
||||||
|
Shared JPEG encoder using nvImageCodec for GPU-accelerated encoding.
|
||||||
|
|
||||||
|
**Key Function:**
|
||||||
|
```python
|
||||||
|
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
|
||||||
|
"""
|
||||||
|
Encodes GPU RGB tensor to JPEG bytes without CPU transfer.
|
||||||
|
Uses __cuda_array_interface__ for zero-copy operation.
|
||||||
|
|
||||||
|
Performance: 1-2ms per 720p frame
|
||||||
|
"""
|
||||||
|
```
|
||||||
|
|
||||||
|
## Technical Implementation
|
||||||
|
|
||||||
|
### Shared CUDA Context Pattern
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Single shared context for all decoders
|
||||||
|
factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
|
||||||
|
# All decoders share same context
|
||||||
|
decoder1 = factory.create_decoder(url1, buffer_size=30)
|
||||||
|
decoder2 = factory.create_decoder(url2, buffer_size=30)
|
||||||
|
decoder3 = factory.create_decoder(url3, buffer_size=30)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Benefits:**
|
||||||
|
- 70% VRAM reduction per stream
|
||||||
|
- Single decoder initialization overhead
|
||||||
|
- Efficient resource sharing
|
||||||
|
|
||||||
|
### NV12 to RGB Conversion (GPU)
|
||||||
|
|
||||||
|
```python
|
||||||
|
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
|
||||||
|
"""
|
||||||
|
Converts NV12 (YUV420) to RGB entirely on GPU using PyTorch ops.
|
||||||
|
Uses BT.601 color space conversion.
|
||||||
|
|
||||||
|
Input: (height * 1.5, width) NV12 tensor
|
||||||
|
Output: (3, height, width) RGB tensor
|
||||||
|
"""
|
||||||
|
```
|
||||||
|
|
||||||
|
**Steps:**
|
||||||
|
1. Split Y and UV planes
|
||||||
|
2. Deinterleave UV components
|
||||||
|
3. Upsample chroma (bilinear interpolation)
|
||||||
|
4. Apply BT.601 color matrix
|
||||||
|
5. Clamp to [0, 255]
|
||||||
|
|
||||||
|
### Zero-Copy Operations
|
||||||
|
|
||||||
|
**DLPack for PyTorch ↔ nvImageCodec:**
|
||||||
|
```python
|
||||||
|
# GPU tensor stays on GPU
|
||||||
|
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
|
||||||
|
nv_image = nvimgcodec.as_image(rgb_hwc) # Uses __cuda_array_interface__
|
||||||
|
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Performance Metrics
|
||||||
|
|
||||||
|
### VRAM Usage (Python Process)
|
||||||
|
|
||||||
|
| Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
|
||||||
|
|---------|-----------|----------|------------|---------------|
|
||||||
|
| 0 | 216 MB | 0 MB | - | - |
|
||||||
|
| 1 | 278 MB | 62 MB | 62.0 MB | 62 MB |
|
||||||
|
| 2 | 338 MB | 122 MB | 61.0 MB | 60 MB |
|
||||||
|
| 3 | 398 MB | 182 MB | 60.7 MB | 60 MB |
|
||||||
|
| 4 | 458 MB | 242 MB | 60.5 MB | 60 MB |
|
||||||
|
|
||||||
|
**Result:** Perfect linear scaling at ~60 MB per stream
|
||||||
|
|
||||||
|
### Capacity Estimates
|
||||||
|
|
||||||
|
With 60 MB per stream + 216 MB baseline:
|
||||||
|
|
||||||
|
- **16GB GPU**: ~269 cameras (conservative: ~250)
|
||||||
|
- **24GB GPU**: ~407 cameras (conservative: ~380)
|
||||||
|
- **48GB GPU**: ~815 cameras (conservative: ~780)
|
||||||
|
- **For 1000 streams**: ~60GB VRAM required
|
||||||
|
|
||||||
|
### Throughput
|
||||||
|
|
||||||
|
- **Frame Rate**: 7-7.5 FPS per stream @ 720p
|
||||||
|
- **JPEG Encoding**: 1-2ms per frame
|
||||||
|
- **Connection Time**: ~15s for stream stabilization
|
||||||
|
|
||||||
|
## Project Structure
|
||||||
|
|
||||||
|
```
|
||||||
|
python-rtsp-worker/
|
||||||
|
├── app.py # FastAPI application
|
||||||
|
├── services/
|
||||||
|
│ ├── __init__.py # Package exports
|
||||||
|
│ ├── stream_decoder.py # StreamDecoder & Factory
|
||||||
|
│ └── jpeg_encoder.py # JPEG encoding utilities
|
||||||
|
├── test_stream.py # Single stream test
|
||||||
|
├── test_multi_stream.py # 4-stream test with monitoring
|
||||||
|
├── test_vram_scaling.py # System VRAM measurement
|
||||||
|
├── test_vram_process.py # Process VRAM measurement
|
||||||
|
├── test_jpeg_encode.py # JPEG encoding test
|
||||||
|
├── requirements.txt # Python dependencies
|
||||||
|
├── .env # Camera URLs (gitignored)
|
||||||
|
├── .env.example # Template for camera URLs
|
||||||
|
└── .gitignore
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
```
|
||||||
|
fastapi # Web framework
|
||||||
|
uvicorn[standard] # ASGI server
|
||||||
|
torch # GPU tensor operations
|
||||||
|
PyNvVideoCodec # NVDEC hardware decoding
|
||||||
|
av # FFmpeg/RTSP client
|
||||||
|
cuda-python # CUDA driver bindings
|
||||||
|
nvidia-nvimgcodec-cu12 # nvJPEG encoding
|
||||||
|
python-dotenv # Environment variables
|
||||||
|
```
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### Environment Variables (.env)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# RTSP Camera URLs
|
||||||
|
CAMERA_URL_1=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_2=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_3=rtsp://user:pass@host/path
|
||||||
|
CAMERA_URL_4=rtsp://user:pass@host/path
|
||||||
|
# Add more as needed...
|
||||||
|
```
|
||||||
|
|
||||||
|
### Loading URLs in Code
|
||||||
|
|
||||||
|
```python
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
import os
|
||||||
|
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
camera_urls = []
|
||||||
|
i = 1
|
||||||
|
while True:
|
||||||
|
url = os.getenv(f'CAMERA_URL_{i}')
|
||||||
|
if url:
|
||||||
|
camera_urls.append(url)
|
||||||
|
i += 1
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
```
|
||||||
|
|
||||||
|
## Usage Examples
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services import StreamDecoderFactory, encode_frame_to_jpeg
|
||||||
|
|
||||||
|
# Create factory (shared CUDA context)
|
||||||
|
factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
|
||||||
|
# Create decoder
|
||||||
|
decoder = factory.create_decoder(
|
||||||
|
rtsp_url="rtsp://user:pass@host/path",
|
||||||
|
buffer_size=30
|
||||||
|
)
|
||||||
|
|
||||||
|
# Start decoding
|
||||||
|
decoder.start()
|
||||||
|
|
||||||
|
# Wait for connection
|
||||||
|
import time
|
||||||
|
time.sleep(5)
|
||||||
|
|
||||||
|
# Get latest frame (GPU tensor)
|
||||||
|
rgb_frame = decoder.get_latest_frame()
|
||||||
|
if rgb_frame is not None:
|
||||||
|
# Encode to JPEG (on GPU)
|
||||||
|
jpeg_bytes = encode_frame_to_jpeg(rgb_frame, quality=95)
|
||||||
|
|
||||||
|
# Save or transmit jpeg_bytes
|
||||||
|
with open("frame.jpg", "wb") as f:
|
||||||
|
f.write(jpeg_bytes)
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
decoder.stop()
|
||||||
|
```
|
||||||
|
|
||||||
|
### Multi-Stream Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services import StreamDecoderFactory
|
||||||
|
import time
|
||||||
|
|
||||||
|
factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
|
||||||
|
# Create multiple decoders (all share context)
|
||||||
|
decoders = []
|
||||||
|
for url in camera_urls:
|
||||||
|
decoder = factory.create_decoder(url, buffer_size=30)
|
||||||
|
decoder.start()
|
||||||
|
decoders.append(decoder)
|
||||||
|
|
||||||
|
# Wait for connections
|
||||||
|
time.sleep(15)
|
||||||
|
|
||||||
|
# Check status
|
||||||
|
for i, decoder in enumerate(decoders):
|
||||||
|
status = decoder.get_status()
|
||||||
|
buffer_size = decoder.get_buffer_size()
|
||||||
|
connected = decoder.is_connected()
|
||||||
|
print(f"Stream {i+1}: {status.value}, Buffer: {buffer_size}, Connected: {connected}")
|
||||||
|
|
||||||
|
# Process frames
|
||||||
|
for decoder in decoders:
|
||||||
|
frame = decoder.get_latest_frame()
|
||||||
|
if frame is not None:
|
||||||
|
# Process frame...
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
for decoder in decoders:
|
||||||
|
decoder.stop()
|
||||||
|
```
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
### Run Single Stream Test
|
||||||
|
```bash
|
||||||
|
python test_stream.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run 4-Stream Test with VRAM Monitoring
|
||||||
|
```bash
|
||||||
|
python test_multi_stream.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Measure VRAM Scaling
|
||||||
|
```bash
|
||||||
|
python test_vram_process.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Test JPEG Encoding
|
||||||
|
```bash
|
||||||
|
python test_jpeg_encode.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Known Issues
|
||||||
|
|
||||||
|
### Segmentation Faults on Cleanup
|
||||||
|
**Status**: Non-critical
|
||||||
|
**Impact**: Occurs during cleanup, doesn't affect core functionality
|
||||||
|
**Cause**: Likely CUDA context cleanup order issues
|
||||||
|
**Workaround**: Functionality works correctly; cleanup errors can be ignored
|
||||||
|
|
||||||
|
## Technical Decisions
|
||||||
|
|
||||||
|
### Why PyNvVideoCodec?
|
||||||
|
- Direct access to NVDEC hardware decoder
|
||||||
|
- Minimal overhead compared to FFmpeg/torchaudio
|
||||||
|
- Returns GPU tensors via DLPack
|
||||||
|
- Better control over decode sessions
|
||||||
|
|
||||||
|
### Why Shared CUDA Context?
|
||||||
|
- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
|
||||||
|
- Enables 1000-stream target on 60GB GPU
|
||||||
|
- Minimal complexity overhead with singleton pattern
|
||||||
|
|
||||||
|
### Why nvImageCodec?
|
||||||
|
- GPU-native JPEG encoding (nvJPEG)
|
||||||
|
- Zero-copy with PyTorch via `__cuda_array_interface__`
|
||||||
|
- 1-2ms encoding time per 720p frame
|
||||||
|
- Keeps data on GPU until final compression
|
||||||
|
|
||||||
|
### Why Thread-Safe Ring Buffer?
|
||||||
|
- Decouples decoding from inference pipeline
|
||||||
|
- Prevents frame drops during processing spikes
|
||||||
|
- Allows async frame access
|
||||||
|
- Configurable buffer size per stream
|
||||||
|
|
||||||
|
## Future Considerations
|
||||||
|
|
||||||
|
### Hardware Decode Session Limits
|
||||||
|
- NVIDIA GPUs typically support 5-30 concurrent decode sessions
|
||||||
|
- May need multiple GPUs for 1000 streams
|
||||||
|
- Test with actual hardware to verify limits
|
||||||
|
|
||||||
|
### Scaling Beyond 1000 Streams
|
||||||
|
- Multi-GPU support with context per GPU
|
||||||
|
- Load balancing across GPUs
|
||||||
|
- Network bandwidth considerations
|
||||||
|
|
||||||
|
### TensorRT Integration
|
||||||
|
- Next step: Integrate with TensorRT inference pipeline
|
||||||
|
- GPU frames → TensorRT → Results
|
||||||
|
- Keep entire pipeline on GPU
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)
|
||||||
|
- [NVIDIA Video Codec SDK](https://developer.nvidia.com/nvidia-video-codec-sdk)
|
||||||
|
- [nvImageCodec Documentation](https://docs.nvidia.com/cuda/nvimgcodec/)
|
||||||
|
- [CUDA Python Bindings](https://nvidia.github.io/cuda-python/)
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
This project uses NVIDIA proprietary libraries (PyNvVideoCodec, nvImageCodec) which require NVIDIA GPU hardware and may have specific licensing terms.
|
||||||
11
requirements.dev.txt
Normal file
11
requirements.dev.txt
Normal file
|
|
@ -0,0 +1,11 @@
|
||||||
|
# Development Dependencies
|
||||||
|
# Install with: pip install -r requirements.dev.txt
|
||||||
|
|
||||||
|
# Model conversion tools
|
||||||
|
tensorrt
|
||||||
|
onnx
|
||||||
|
ultralytics # For YOLO models download and export
|
||||||
|
|
||||||
|
# Optional: Additional tools for model optimization
|
||||||
|
onnxruntime-gpu # ONNX runtime for testing
|
||||||
|
onnx-simplifier # Simplify ONNX models
|
||||||
8
requirements.txt
Normal file
8
requirements.txt
Normal file
|
|
@ -0,0 +1,8 @@
|
||||||
|
fastapi
|
||||||
|
uvicorn[standard]
|
||||||
|
torch
|
||||||
|
PyNvVideoCodec
|
||||||
|
av
|
||||||
|
cuda-python
|
||||||
|
nvidia-nvimgcodec-cu12 # GPU-accelerated JPEG encoding/decoding with nvJPEG
|
||||||
|
python-dotenv # Load environment variables from .env file
|
||||||
197
scripts/README.md
Normal file
197
scripts/README.md
Normal file
|
|
@ -0,0 +1,197 @@
|
||||||
|
# Scripts Directory
|
||||||
|
|
||||||
|
This directory contains utility scripts for the python-rtsp-worker project.
|
||||||
|
|
||||||
|
## convert_pt_to_tensorrt.py
|
||||||
|
|
||||||
|
Converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized GPU inference.
|
||||||
|
|
||||||
|
### Features
|
||||||
|
|
||||||
|
- **Multiple Precision Modes**: FP32, FP16, INT8
|
||||||
|
- **Dynamic Batch Size**: Support for variable batch sizes
|
||||||
|
- **Automatic Optimization**: Creates optimization profiles for best performance
|
||||||
|
- **ONNX Intermediate**: Uses ONNX as intermediate format for compatibility
|
||||||
|
- **Easy to Use**: Simple command-line interface
|
||||||
|
|
||||||
|
### Requirements
|
||||||
|
|
||||||
|
Make sure you have the following dependencies installed:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install torch tensorrt onnx
|
||||||
|
```
|
||||||
|
|
||||||
|
### Quick Start
|
||||||
|
|
||||||
|
**Basic conversion (FP32)**:
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model path/to/model.pt \
|
||||||
|
--output models/model.trt
|
||||||
|
```
|
||||||
|
|
||||||
|
**FP16 precision** (recommended for most cases - 2x faster, minimal accuracy loss):
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model yolov8n.pt \
|
||||||
|
--output models/yolov8n.trt \
|
||||||
|
--fp16
|
||||||
|
```
|
||||||
|
|
||||||
|
**Custom input shape**:
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model model.pt \
|
||||||
|
--output model.trt \
|
||||||
|
--input-shape 1,3,416,416
|
||||||
|
```
|
||||||
|
|
||||||
|
**Dynamic batch size** (for variable batch inference):
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model model.pt \
|
||||||
|
--output model.trt \
|
||||||
|
--dynamic-batch \
|
||||||
|
--max-batch 16
|
||||||
|
```
|
||||||
|
|
||||||
|
**Maximum optimization** (FP16 + INT8):
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model model.pt \
|
||||||
|
--output model.trt \
|
||||||
|
--fp16 \
|
||||||
|
--int8
|
||||||
|
```
|
||||||
|
|
||||||
|
### Command-Line Arguments
|
||||||
|
|
||||||
|
| Argument | Required | Default | Description |
|
||||||
|
|----------|----------|---------|-------------|
|
||||||
|
| `--model`, `-m` | Yes | - | Path to PyTorch model file (.pt or .pth) |
|
||||||
|
| `--output`, `-o` | Yes | - | Output path for TensorRT engine (.trt) |
|
||||||
|
| `--input-shape`, `-s` | No | 1,3,640,640 | Input tensor shape as B,C,H,W |
|
||||||
|
| `--fp16` | No | False | Enable FP16 precision (faster, ~same accuracy) |
|
||||||
|
| `--int8` | No | False | Enable INT8 precision (fastest, needs calibration) |
|
||||||
|
| `--dynamic-batch` | No | False | Enable dynamic batch size support |
|
||||||
|
| `--max-batch` | No | 16 | Maximum batch size for dynamic batching |
|
||||||
|
| `--workspace-size` | No | 4 | TensorRT workspace size in GB |
|
||||||
|
| `--gpu` | No | 0 | GPU device ID to use |
|
||||||
|
| `--input-names` | No | ["input"] | Custom input tensor names |
|
||||||
|
| `--output-names` | No | ["output"] | Custom output tensor names |
|
||||||
|
| `--keep-onnx` | No | False | Keep intermediate ONNX file for debugging |
|
||||||
|
| `--verbose`, `-v` | No | False | Enable verbose logging |
|
||||||
|
|
||||||
|
### Performance Tips
|
||||||
|
|
||||||
|
1. **Always use FP16** unless you need FP32 precision:
|
||||||
|
- 2x faster inference
|
||||||
|
- 50% less VRAM usage
|
||||||
|
- Minimal accuracy loss for most models
|
||||||
|
|
||||||
|
2. **Use dynamic batching** for variable workloads:
|
||||||
|
- Process 1-16 images with same engine
|
||||||
|
- Automatic optimization for common batch sizes
|
||||||
|
|
||||||
|
3. **Increase workspace size** for complex models:
|
||||||
|
- Default 4GB works for most models
|
||||||
|
- Increase to 8GB for very large models
|
||||||
|
|
||||||
|
4. **INT8 quantization** for maximum speed:
|
||||||
|
- Requires calibration data (not included in basic conversion)
|
||||||
|
- 4x faster than FP32
|
||||||
|
- Best for deployment scenarios
|
||||||
|
|
||||||
|
### Integration with Model Repository
|
||||||
|
|
||||||
|
Once converted, use the TensorRT engine with the model repository:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services.model_repository import TensorRTModelRepository
|
||||||
|
|
||||||
|
# Initialize repository
|
||||||
|
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||||
|
|
||||||
|
# Load the converted model
|
||||||
|
repo.load_model(
|
||||||
|
model_id="my_model",
|
||||||
|
file_path="models/model.trt",
|
||||||
|
num_contexts=4
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run inference
|
||||||
|
import torch
|
||||||
|
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id="my_model",
|
||||||
|
inputs={"input": input_tensor}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Troubleshooting
|
||||||
|
|
||||||
|
**Issue**: `Failed to parse ONNX model`
|
||||||
|
- Solution: Check if your PyTorch model is compatible with ONNX export
|
||||||
|
- Try updating PyTorch and ONNX versions
|
||||||
|
|
||||||
|
**Issue**: `FP16 not supported on this platform`
|
||||||
|
- Solution: Your GPU doesn't support FP16. Remove `--fp16` flag
|
||||||
|
|
||||||
|
**Issue**: `Out of memory during conversion`
|
||||||
|
- Solution: Reduce `--workspace-size` or free up GPU memory
|
||||||
|
|
||||||
|
**Issue**: `Model contains only state_dict`
|
||||||
|
- Solution: Your checkpoint only has weights. You need the full model architecture.
|
||||||
|
- Modify the script's `load_pytorch_model()` method to instantiate your model class
|
||||||
|
|
||||||
|
### Examples for Common Models
|
||||||
|
|
||||||
|
**YOLOv8**:
|
||||||
|
```bash
|
||||||
|
# Download model first
|
||||||
|
# yolo export model=yolov8n.pt format=engine device=0
|
||||||
|
|
||||||
|
# Or use this script
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model yolov8n.pt \
|
||||||
|
--output models/yolov8n.trt \
|
||||||
|
--input-shape 1,3,640,640 \
|
||||||
|
--fp16
|
||||||
|
```
|
||||||
|
|
||||||
|
**ResNet**:
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model resnet50.pt \
|
||||||
|
--output models/resnet50.trt \
|
||||||
|
--input-shape 1,3,224,224 \
|
||||||
|
--fp16 \
|
||||||
|
--dynamic-batch \
|
||||||
|
--max-batch 32
|
||||||
|
```
|
||||||
|
|
||||||
|
**Custom Model**:
|
||||||
|
```bash
|
||||||
|
python scripts/convert_pt_to_tensorrt.py \
|
||||||
|
--model custom_model.pt \
|
||||||
|
--output models/custom.trt \
|
||||||
|
--input-shape 1,3,512,512 \
|
||||||
|
--input-names image \
|
||||||
|
--output-names predictions \
|
||||||
|
--fp16 \
|
||||||
|
--verbose
|
||||||
|
```
|
||||||
|
|
||||||
|
### Notes
|
||||||
|
|
||||||
|
- The script uses ONNX as an intermediate format, which is the recommended approach
|
||||||
|
- TensorRT engines are hardware-specific; rebuild for different GPUs
|
||||||
|
- Conversion time varies (30 seconds to 5 minutes depending on model size)
|
||||||
|
- The first inference after loading is slower (warmup)
|
||||||
|
|
||||||
|
### Support
|
||||||
|
|
||||||
|
For issues or questions, please check:
|
||||||
|
- TensorRT documentation: https://docs.nvidia.com/deeplearning/tensorrt/
|
||||||
|
- PyTorch ONNX export guide: https://pytorch.org/docs/stable/onnx.html
|
||||||
562
scripts/convert_pt_to_tensorrt.py
Executable file
562
scripts/convert_pt_to_tensorrt.py
Executable file
|
|
@ -0,0 +1,562 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
PyTorch to TensorRT Model Conversion Script
|
||||||
|
|
||||||
|
This script converts PyTorch models (.pt, .pth) to TensorRT engines (.trt) for optimized inference.
|
||||||
|
|
||||||
|
Features:
|
||||||
|
- Automatic FP32/FP16/INT8 precision modes
|
||||||
|
- Dynamic batch size support
|
||||||
|
- Input shape validation
|
||||||
|
- Optimization profiles for dynamic shapes
|
||||||
|
- ONNX intermediate format
|
||||||
|
- GPU-accelerated conversion
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python convert_pt_to_tensorrt.py --model path/to/model.pt --output models/model.trt
|
||||||
|
python convert_pt_to_tensorrt.py --model yolov8n.pt --input-shape 1 3 640 640 --fp16
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --dynamic-batch --max-batch 16
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Tuple, List, Optional
|
||||||
|
import torch
|
||||||
|
import tensorrt as trt
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
|
||||||
|
class TensorRTConverter:
|
||||||
|
"""Converts PyTorch models to TensorRT engines"""
|
||||||
|
|
||||||
|
def __init__(self, gpu_id: int = 0, verbose: bool = True):
|
||||||
|
"""
|
||||||
|
Initialize the converter.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
gpu_id: GPU device ID to use for conversion
|
||||||
|
verbose: Enable verbose logging
|
||||||
|
"""
|
||||||
|
self.gpu_id = gpu_id
|
||||||
|
self.device = torch.device(f'cuda:{gpu_id}')
|
||||||
|
|
||||||
|
# TensorRT logger
|
||||||
|
log_level = trt.Logger.VERBOSE if verbose else trt.Logger.WARNING
|
||||||
|
self.logger = trt.Logger(log_level)
|
||||||
|
|
||||||
|
# Set CUDA device
|
||||||
|
torch.cuda.set_device(gpu_id)
|
||||||
|
|
||||||
|
print(f"Initialized TensorRT Converter on GPU {gpu_id}")
|
||||||
|
print(f"PyTorch version: {torch.__version__}")
|
||||||
|
print(f"TensorRT version: {trt.__version__}")
|
||||||
|
print(f"CUDA available: {torch.cuda.is_available()}")
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
print(f"CUDA device: {torch.cuda.get_device_name(gpu_id)}")
|
||||||
|
|
||||||
|
def load_pytorch_model(self, model_path: str) -> torch.nn.Module:
|
||||||
|
"""
|
||||||
|
Load PyTorch model from file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_path: Path to .pt or .pth file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Loaded PyTorch model in eval mode
|
||||||
|
"""
|
||||||
|
print(f"\nLoading PyTorch model from {model_path}...")
|
||||||
|
|
||||||
|
if not Path(model_path).exists():
|
||||||
|
raise FileNotFoundError(f"Model file not found: {model_path}")
|
||||||
|
|
||||||
|
# Load model (weights_only=False for models with custom classes)
|
||||||
|
checkpoint = torch.load(model_path, map_location=self.device, weights_only=False)
|
||||||
|
|
||||||
|
# Handle different checkpoint formats
|
||||||
|
if isinstance(checkpoint, dict):
|
||||||
|
if 'model' in checkpoint:
|
||||||
|
model = checkpoint['model']
|
||||||
|
elif 'state_dict' in checkpoint:
|
||||||
|
# Need model architecture - this is a limitation
|
||||||
|
raise ValueError(
|
||||||
|
"Checkpoint contains only state_dict. "
|
||||||
|
"Please provide the complete model or modify this script to load your architecture."
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
raise ValueError("Unknown checkpoint format")
|
||||||
|
else:
|
||||||
|
model = checkpoint
|
||||||
|
|
||||||
|
# Set to eval mode
|
||||||
|
model.eval()
|
||||||
|
model.to(self.device)
|
||||||
|
|
||||||
|
print(f"✓ Model loaded successfully")
|
||||||
|
return model
|
||||||
|
|
||||||
|
def export_to_onnx(self, model: torch.nn.Module, input_shape: Tuple[int, ...],
|
||||||
|
onnx_path: str, dynamic_batch: bool = False,
|
||||||
|
input_names: List[str] = None, output_names: List[str] = None) -> str:
|
||||||
|
"""
|
||||||
|
Export PyTorch model to ONNX format (intermediate step).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model: PyTorch model
|
||||||
|
input_shape: Input tensor shape (B, C, H, W)
|
||||||
|
onnx_path: Output path for ONNX file
|
||||||
|
dynamic_batch: Enable dynamic batch dimension
|
||||||
|
input_names: List of input tensor names
|
||||||
|
output_names: List of output tensor names
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to exported ONNX file
|
||||||
|
"""
|
||||||
|
print(f"\nExporting to ONNX format...")
|
||||||
|
print(f"Input shape: {input_shape}")
|
||||||
|
print(f"Dynamic batch: {dynamic_batch}")
|
||||||
|
|
||||||
|
# Default names
|
||||||
|
if input_names is None:
|
||||||
|
input_names = ['input']
|
||||||
|
if output_names is None:
|
||||||
|
output_names = ['output']
|
||||||
|
|
||||||
|
# Create dummy input
|
||||||
|
dummy_input = torch.randn(*input_shape, device=self.device)
|
||||||
|
|
||||||
|
# Dynamic axes configuration
|
||||||
|
dynamic_axes = None
|
||||||
|
if dynamic_batch:
|
||||||
|
dynamic_axes = {
|
||||||
|
input_names[0]: {0: 'batch'},
|
||||||
|
output_names[0]: {0: 'batch'}
|
||||||
|
}
|
||||||
|
|
||||||
|
# Export to ONNX
|
||||||
|
torch.onnx.export(
|
||||||
|
model,
|
||||||
|
dummy_input,
|
||||||
|
onnx_path,
|
||||||
|
input_names=input_names,
|
||||||
|
output_names=output_names,
|
||||||
|
dynamic_axes=dynamic_axes,
|
||||||
|
opset_version=17, # Use recent ONNX opset
|
||||||
|
do_constant_folding=True,
|
||||||
|
verbose=False
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"✓ ONNX model exported to {onnx_path}")
|
||||||
|
return onnx_path
|
||||||
|
|
||||||
|
def build_tensorrt_engine_from_onnx(self, onnx_path: str, engine_path: str,
|
||||||
|
fp16: bool = False, int8: bool = False,
|
||||||
|
max_workspace_size: int = 4,
|
||||||
|
min_batch: int = 1, opt_batch: int = 1, max_batch: int = 1) -> str:
|
||||||
|
"""
|
||||||
|
Build TensorRT engine from ONNX model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
onnx_path: Path to ONNX model
|
||||||
|
engine_path: Output path for TensorRT engine
|
||||||
|
fp16: Enable FP16 precision
|
||||||
|
int8: Enable INT8 precision (requires calibration)
|
||||||
|
max_workspace_size: Maximum workspace size in GB
|
||||||
|
min_batch: Minimum batch size for optimization
|
||||||
|
opt_batch: Optimal batch size for optimization
|
||||||
|
max_batch: Maximum batch size for optimization
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to built TensorRT engine
|
||||||
|
"""
|
||||||
|
print(f"\nBuilding TensorRT engine from ONNX...")
|
||||||
|
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||||
|
print(f"Workspace size: {max_workspace_size} GB")
|
||||||
|
|
||||||
|
# Create builder and network
|
||||||
|
builder = trt.Builder(self.logger)
|
||||||
|
network = builder.create_network(
|
||||||
|
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
|
||||||
|
)
|
||||||
|
parser = trt.OnnxParser(network, self.logger)
|
||||||
|
|
||||||
|
# Parse ONNX model
|
||||||
|
print(f"Loading ONNX file from {onnx_path}...")
|
||||||
|
with open(onnx_path, 'rb') as f:
|
||||||
|
if not parser.parse(f.read()):
|
||||||
|
print("ERROR: Failed to parse the ONNX file:")
|
||||||
|
for error in range(parser.num_errors):
|
||||||
|
print(f" {parser.get_error(error)}")
|
||||||
|
raise RuntimeError("Failed to parse ONNX model")
|
||||||
|
|
||||||
|
print(f"✓ ONNX model parsed successfully")
|
||||||
|
|
||||||
|
# Print network info
|
||||||
|
print(f"\nNetwork Information:")
|
||||||
|
print(f" Inputs: {network.num_inputs}")
|
||||||
|
for i in range(network.num_inputs):
|
||||||
|
inp = network.get_input(i)
|
||||||
|
print(f" [{i}] {inp.name}: {inp.shape} ({inp.dtype})")
|
||||||
|
|
||||||
|
print(f" Outputs: {network.num_outputs}")
|
||||||
|
for i in range(network.num_outputs):
|
||||||
|
out = network.get_output(i)
|
||||||
|
print(f" [{i}] {out.name}: {out.shape} ({out.dtype})")
|
||||||
|
|
||||||
|
# Create builder config
|
||||||
|
config = builder.create_builder_config()
|
||||||
|
|
||||||
|
# Set workspace size
|
||||||
|
config.set_memory_pool_limit(
|
||||||
|
trt.MemoryPoolType.WORKSPACE,
|
||||||
|
max_workspace_size * (1 << 30) # GB to bytes
|
||||||
|
)
|
||||||
|
|
||||||
|
# Enable precision modes
|
||||||
|
if fp16:
|
||||||
|
if not builder.platform_has_fast_fp16:
|
||||||
|
print("Warning: FP16 not supported on this platform, using FP32")
|
||||||
|
else:
|
||||||
|
config.set_flag(trt.BuilderFlag.FP16)
|
||||||
|
print("✓ FP16 mode enabled")
|
||||||
|
|
||||||
|
if int8:
|
||||||
|
if not builder.platform_has_fast_int8:
|
||||||
|
print("Warning: INT8 not supported on this platform, using FP32/FP16")
|
||||||
|
else:
|
||||||
|
config.set_flag(trt.BuilderFlag.INT8)
|
||||||
|
print("✓ INT8 mode enabled")
|
||||||
|
print("Note: INT8 calibration not implemented. Results may be suboptimal.")
|
||||||
|
|
||||||
|
# Set optimization profile for dynamic shapes
|
||||||
|
if max_batch > 1 or min_batch != max_batch:
|
||||||
|
profile = builder.create_optimization_profile()
|
||||||
|
|
||||||
|
for i in range(network.num_inputs):
|
||||||
|
inp = network.get_input(i)
|
||||||
|
shape = list(inp.shape)
|
||||||
|
|
||||||
|
# Handle dynamic batch dimension
|
||||||
|
if shape[0] == -1:
|
||||||
|
# Min, opt, max shapes
|
||||||
|
min_shape = [min_batch] + shape[1:]
|
||||||
|
opt_shape = [opt_batch] + shape[1:]
|
||||||
|
max_shape = [max_batch] + shape[1:]
|
||||||
|
|
||||||
|
profile.set_shape(inp.name, min_shape, opt_shape, max_shape)
|
||||||
|
print(f" Dynamic shape for {inp.name}:")
|
||||||
|
print(f" Min: {min_shape}")
|
||||||
|
print(f" Opt: {opt_shape}")
|
||||||
|
print(f" Max: {max_shape}")
|
||||||
|
|
||||||
|
config.add_optimization_profile(profile)
|
||||||
|
|
||||||
|
# Build engine
|
||||||
|
print(f"\nBuilding TensorRT engine (this may take a few minutes)...")
|
||||||
|
serialized_engine = builder.build_serialized_network(network, config)
|
||||||
|
|
||||||
|
if serialized_engine is None:
|
||||||
|
raise RuntimeError("Failed to build TensorRT engine")
|
||||||
|
|
||||||
|
# Save engine to file
|
||||||
|
print(f"Saving engine to {engine_path}...")
|
||||||
|
with open(engine_path, 'wb') as f:
|
||||||
|
f.write(serialized_engine)
|
||||||
|
|
||||||
|
# Get file size
|
||||||
|
file_size_mb = Path(engine_path).stat().st_size / (1024 * 1024)
|
||||||
|
print(f"✓ TensorRT engine built successfully")
|
||||||
|
print(f" Engine size: {file_size_mb:.2f} MB")
|
||||||
|
|
||||||
|
return engine_path
|
||||||
|
|
||||||
|
def convert(self, model_path: str, output_path: str,
|
||||||
|
input_shape: Tuple[int, ...] = (1, 3, 640, 640),
|
||||||
|
fp16: bool = False, int8: bool = False,
|
||||||
|
dynamic_batch: bool = False,
|
||||||
|
max_batch: int = 16,
|
||||||
|
workspace_size: int = 4,
|
||||||
|
input_names: List[str] = None,
|
||||||
|
output_names: List[str] = None,
|
||||||
|
keep_onnx: bool = False) -> str:
|
||||||
|
"""
|
||||||
|
Convert PyTorch or ONNX model to TensorRT engine.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_path: Path to PyTorch model (.pt, .pth) or ONNX model (.onnx)
|
||||||
|
output_path: Path for output TensorRT engine (.trt)
|
||||||
|
input_shape: Input tensor shape (B, C, H, W) - required for PyTorch models
|
||||||
|
fp16: Enable FP16 precision
|
||||||
|
int8: Enable INT8 precision
|
||||||
|
dynamic_batch: Enable dynamic batch size
|
||||||
|
max_batch: Maximum batch size (for dynamic batching)
|
||||||
|
workspace_size: TensorRT workspace size in GB
|
||||||
|
input_names: Custom input names (for PyTorch export)
|
||||||
|
output_names: Custom output names (for PyTorch export)
|
||||||
|
keep_onnx: Keep intermediate ONNX file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to created TensorRT engine
|
||||||
|
"""
|
||||||
|
# Create output directory
|
||||||
|
output_dir = Path(output_path).parent
|
||||||
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
# Check if input is already ONNX
|
||||||
|
model_path_obj = Path(model_path)
|
||||||
|
is_onnx = model_path_obj.suffix.lower() == '.onnx'
|
||||||
|
|
||||||
|
if is_onnx:
|
||||||
|
# Direct ONNX to TensorRT conversion
|
||||||
|
print(f"Input is ONNX model, converting directly to TensorRT...")
|
||||||
|
|
||||||
|
min_batch = 1
|
||||||
|
opt_batch = input_shape[0] if not dynamic_batch else max(1, max_batch // 2)
|
||||||
|
max_batch_size = max_batch if dynamic_batch else input_shape[0]
|
||||||
|
|
||||||
|
engine_path = self.build_tensorrt_engine_from_onnx(
|
||||||
|
onnx_path=model_path,
|
||||||
|
engine_path=output_path,
|
||||||
|
fp16=fp16,
|
||||||
|
int8=int8,
|
||||||
|
max_workspace_size=workspace_size,
|
||||||
|
min_batch=min_batch,
|
||||||
|
opt_batch=opt_batch,
|
||||||
|
max_batch=max_batch_size
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"\n{'=' * 80}")
|
||||||
|
print(f"CONVERSION COMPLETED SUCCESSFULLY")
|
||||||
|
print(f"{'=' * 80}")
|
||||||
|
print(f"Input: {model_path}")
|
||||||
|
print(f"Output: {engine_path}")
|
||||||
|
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||||
|
print(f"{'=' * 80}")
|
||||||
|
|
||||||
|
return engine_path
|
||||||
|
|
||||||
|
# PyTorch to TensorRT conversion (via ONNX)
|
||||||
|
# Temporary ONNX path
|
||||||
|
onnx_path = str(output_dir / "temp_model.onnx")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Step 1: Load PyTorch model
|
||||||
|
model = self.load_pytorch_model(model_path)
|
||||||
|
|
||||||
|
# Step 2: Export to ONNX
|
||||||
|
self.export_to_onnx(
|
||||||
|
model=model,
|
||||||
|
input_shape=input_shape,
|
||||||
|
onnx_path=onnx_path,
|
||||||
|
dynamic_batch=dynamic_batch,
|
||||||
|
input_names=input_names,
|
||||||
|
output_names=output_names
|
||||||
|
)
|
||||||
|
|
||||||
|
# Step 3: Build TensorRT engine
|
||||||
|
min_batch = 1
|
||||||
|
opt_batch = input_shape[0] if not dynamic_batch else max(1, max_batch // 2)
|
||||||
|
max_batch_size = max_batch if dynamic_batch else input_shape[0]
|
||||||
|
|
||||||
|
engine_path = self.build_tensorrt_engine_from_onnx(
|
||||||
|
onnx_path=onnx_path,
|
||||||
|
engine_path=output_path,
|
||||||
|
fp16=fp16,
|
||||||
|
int8=int8,
|
||||||
|
max_workspace_size=workspace_size,
|
||||||
|
min_batch=min_batch,
|
||||||
|
opt_batch=opt_batch,
|
||||||
|
max_batch=max_batch_size
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"\n{'=' * 80}")
|
||||||
|
print(f"CONVERSION COMPLETED SUCCESSFULLY")
|
||||||
|
print(f"{'=' * 80}")
|
||||||
|
print(f"Input: {model_path}")
|
||||||
|
print(f"Output: {engine_path}")
|
||||||
|
print(f"Precision: FP{'16' if fp16 else '32'}{' + INT8' if int8 else ''}")
|
||||||
|
print(f"Dynamic batch: {dynamic_batch}")
|
||||||
|
if dynamic_batch:
|
||||||
|
print(f"Batch range: [1, {max_batch}]")
|
||||||
|
print(f"{'=' * 80}")
|
||||||
|
|
||||||
|
return engine_path
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Cleanup temporary ONNX file
|
||||||
|
if not keep_onnx and Path(onnx_path).exists():
|
||||||
|
Path(onnx_path).unlink()
|
||||||
|
print(f"Cleaned up temporary ONNX file")
|
||||||
|
|
||||||
|
|
||||||
|
def parse_shape(shape_str: str) -> Tuple[int, ...]:
|
||||||
|
"""Parse shape string like '1,3,640,640' to tuple"""
|
||||||
|
try:
|
||||||
|
return tuple(int(x) for x in shape_str.split(','))
|
||||||
|
except ValueError:
|
||||||
|
raise argparse.ArgumentTypeError(
|
||||||
|
f"Invalid shape format: {shape_str}. Expected format: 1,3,640,640"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Convert PyTorch models to TensorRT engines",
|
||||||
|
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||||
|
epilog="""
|
||||||
|
Examples:
|
||||||
|
# Basic conversion (FP32)
|
||||||
|
python convert_pt_to_tensorrt.py --model yolov8n.pt --output models/yolov8n.trt
|
||||||
|
|
||||||
|
# FP16 precision for faster inference
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt --fp16
|
||||||
|
|
||||||
|
# Custom input shape
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||||
|
--input-shape 1,3,416,416
|
||||||
|
|
||||||
|
# Dynamic batch size (1 to 16)
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||||
|
--dynamic-batch --max-batch 16
|
||||||
|
|
||||||
|
# INT8 quantization for maximum speed (requires calibration)
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||||
|
--fp16 --int8
|
||||||
|
|
||||||
|
# Keep intermediate ONNX file for debugging
|
||||||
|
python convert_pt_to_tensorrt.py --model model.pt --output model.trt \\
|
||||||
|
--keep-onnx
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
# Required arguments
|
||||||
|
parser.add_argument(
|
||||||
|
'--model', '-m',
|
||||||
|
type=str,
|
||||||
|
required=True,
|
||||||
|
help='Path to PyTorch model file (.pt or .pth)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--output', '-o',
|
||||||
|
type=str,
|
||||||
|
required=True,
|
||||||
|
help='Output path for TensorRT engine (.trt or .engine)'
|
||||||
|
)
|
||||||
|
|
||||||
|
# Optional arguments
|
||||||
|
parser.add_argument(
|
||||||
|
'--input-shape', '-s',
|
||||||
|
type=parse_shape,
|
||||||
|
default=(1, 3, 640, 640),
|
||||||
|
help='Input tensor shape as B,C,H,W (default: 1,3,640,640)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--fp16',
|
||||||
|
action='store_true',
|
||||||
|
help='Enable FP16 precision (faster inference, slightly lower accuracy)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--int8',
|
||||||
|
action='store_true',
|
||||||
|
help='Enable INT8 precision (fastest, requires calibration)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--dynamic-batch',
|
||||||
|
action='store_true',
|
||||||
|
help='Enable dynamic batch size support'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--max-batch',
|
||||||
|
type=int,
|
||||||
|
default=16,
|
||||||
|
help='Maximum batch size for dynamic batching (default: 16)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--workspace-size',
|
||||||
|
type=int,
|
||||||
|
default=4,
|
||||||
|
help='TensorRT workspace size in GB (default: 4)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--gpu',
|
||||||
|
type=int,
|
||||||
|
default=0,
|
||||||
|
help='GPU device ID (default: 0)'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--input-names',
|
||||||
|
type=str,
|
||||||
|
nargs='+',
|
||||||
|
default=None,
|
||||||
|
help='Custom input tensor names (default: ["input"])'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--output-names',
|
||||||
|
type=str,
|
||||||
|
nargs='+',
|
||||||
|
default=None,
|
||||||
|
help='Custom output tensor names (default: ["output"])'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--keep-onnx',
|
||||||
|
action='store_true',
|
||||||
|
help='Keep intermediate ONNX file'
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
'--verbose', '-v',
|
||||||
|
action='store_true',
|
||||||
|
help='Enable verbose logging'
|
||||||
|
)
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Validate arguments
|
||||||
|
if not Path(args.model).exists():
|
||||||
|
print(f"Error: Model file not found: {args.model}")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
if args.int8 and not args.fp16:
|
||||||
|
print("Warning: INT8 mode works best with FP16 enabled. Adding --fp16 flag.")
|
||||||
|
args.fp16 = True
|
||||||
|
|
||||||
|
# Run conversion
|
||||||
|
try:
|
||||||
|
converter = TensorRTConverter(gpu_id=args.gpu, verbose=args.verbose)
|
||||||
|
|
||||||
|
converter.convert(
|
||||||
|
model_path=args.model,
|
||||||
|
output_path=args.output,
|
||||||
|
input_shape=args.input_shape,
|
||||||
|
fp16=args.fp16,
|
||||||
|
int8=args.int8,
|
||||||
|
dynamic_batch=args.dynamic_batch,
|
||||||
|
max_batch=args.max_batch,
|
||||||
|
workspace_size=args.workspace_size,
|
||||||
|
input_names=args.input_names,
|
||||||
|
output_names=args.output_names,
|
||||||
|
keep_onnx=args.keep_onnx
|
||||||
|
)
|
||||||
|
|
||||||
|
print("\n✓ Conversion successful!")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n✗ Conversion failed: {e}")
|
||||||
|
if args.verbose:
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
380
services/README_MODEL_REPOSITORY.md
Normal file
380
services/README_MODEL_REPOSITORY.md
Normal file
|
|
@ -0,0 +1,380 @@
|
||||||
|
# TensorRT Model Repository
|
||||||
|
|
||||||
|
Efficient TensorRT model management with context pooling, deduplication, and GPU-to-GPU inference.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
### Key Features
|
||||||
|
|
||||||
|
1. **Model Deduplication by File Hash**
|
||||||
|
- Multiple model IDs can point to the same model file
|
||||||
|
- Only one engine loaded in VRAM per unique file
|
||||||
|
- Example: 100 cameras with same model = 1 engine (not 100!)
|
||||||
|
|
||||||
|
2. **Context Pooling for Load Balancing**
|
||||||
|
- Each unique engine has N execution contexts (configurable)
|
||||||
|
- Contexts borrowed/returned via mutex-based queue
|
||||||
|
- Enables concurrent inference without context-per-model overhead
|
||||||
|
- Example: 100 cameras sharing 4 contexts efficiently
|
||||||
|
|
||||||
|
3. **GPU-to-GPU Inference**
|
||||||
|
- All inputs/outputs stay in VRAM (zero CPU transfers)
|
||||||
|
- Integrates seamlessly with StreamDecoder (frames already on GPU)
|
||||||
|
- Maximum performance for video inference pipelines
|
||||||
|
|
||||||
|
4. **Thread-Safe Concurrent Inference**
|
||||||
|
- Mutex-based context acquisition (TensorRT best practice)
|
||||||
|
- No shared IExecutionContext across threads (safe)
|
||||||
|
- Multiple threads can infer concurrently (limited by pool size)
|
||||||
|
|
||||||
|
## Design Rationale
|
||||||
|
|
||||||
|
### Why Context Pooling?
|
||||||
|
|
||||||
|
**Without pooling** (naive approach):
|
||||||
|
```
|
||||||
|
100 cameras → 100 model IDs → 100 execution contexts
|
||||||
|
```
|
||||||
|
- Problem: Each context consumes VRAM (layers, workspace, etc.)
|
||||||
|
- Problem: Context creation overhead per camera
|
||||||
|
- Problem: Doesn't scale to hundreds of cameras
|
||||||
|
|
||||||
|
**With pooling** (our approach):
|
||||||
|
```
|
||||||
|
100 cameras → 100 model IDs → 1 shared engine → 4 contexts (pool)
|
||||||
|
```
|
||||||
|
- Solution: Contexts shared across all cameras using same model
|
||||||
|
- Solution: Borrow/return mechanism with mutex queue
|
||||||
|
- Solution: Scales to any number of cameras with fixed context count
|
||||||
|
|
||||||
|
### Memory Savings Example
|
||||||
|
|
||||||
|
YOLOv8n model (~6MB engine file):
|
||||||
|
|
||||||
|
| Approach | Model IDs | Engines | Contexts | Approx VRAM |
|
||||||
|
|----------|-----------|---------|----------|-------------|
|
||||||
|
| Naive | 100 | 100 | 100 | ~1.5 GB |
|
||||||
|
| **Ours (pooled)** | **100** | **1** | **4** | **~30 MB** |
|
||||||
|
|
||||||
|
**50x memory savings!**
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services.model_repository import TensorRTModelRepository
|
||||||
|
|
||||||
|
# Initialize repository
|
||||||
|
repo = TensorRTModelRepository(
|
||||||
|
gpu_id=0,
|
||||||
|
default_num_contexts=4 # 4 contexts per unique engine
|
||||||
|
)
|
||||||
|
|
||||||
|
# Load model for camera 1
|
||||||
|
repo.load_model(
|
||||||
|
model_id="camera_1",
|
||||||
|
file_path="models/yolov8n.trt"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Load same model for camera 2 (deduplication happens automatically)
|
||||||
|
repo.load_model(
|
||||||
|
model_id="camera_2",
|
||||||
|
file_path="models/yolov8n.trt" # Same file → shares engine and contexts!
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run inference (GPU-to-GPU)
|
||||||
|
import torch
|
||||||
|
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||||
|
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id="camera_1",
|
||||||
|
inputs={"images": input_tensor},
|
||||||
|
synchronize=True,
|
||||||
|
timeout=5.0 # Wait up to 5s for available context
|
||||||
|
)
|
||||||
|
|
||||||
|
# Outputs stay on GPU
|
||||||
|
for name, tensor in outputs.items():
|
||||||
|
print(f"{name}: {tensor.shape} on {tensor.device}")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Multi-Camera Scenario
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Setup multiple cameras
|
||||||
|
cameras = [f"camera_{i}" for i in range(100)]
|
||||||
|
|
||||||
|
# Load same model for all cameras
|
||||||
|
for camera_id in cameras:
|
||||||
|
repo.load_model(
|
||||||
|
model_id=camera_id,
|
||||||
|
file_path="models/yolov8n.trt" # Same file for all
|
||||||
|
)
|
||||||
|
|
||||||
|
# Check efficiency
|
||||||
|
stats = repo.get_stats()
|
||||||
|
print(f"Model IDs: {stats['total_model_ids']}") # 100
|
||||||
|
print(f"Unique engines: {stats['unique_engines']}") # 1
|
||||||
|
print(f"Total contexts: {stats['total_contexts']}") # 4
|
||||||
|
```
|
||||||
|
|
||||||
|
### Integration with RTSP Decoder
|
||||||
|
|
||||||
|
```python
|
||||||
|
from services.stream_decoder import StreamDecoderFactory
|
||||||
|
from services.model_repository import TensorRTModelRepository
|
||||||
|
|
||||||
|
# Setup
|
||||||
|
decoder_factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
model_repo = TensorRTModelRepository(gpu_id=0)
|
||||||
|
|
||||||
|
# Create decoder for camera
|
||||||
|
decoder = decoder_factory.create_decoder("rtsp://camera.ip/stream")
|
||||||
|
decoder.start()
|
||||||
|
|
||||||
|
# Load inference model
|
||||||
|
model_repo.load_model("camera_main", "models/yolov8n.trt")
|
||||||
|
|
||||||
|
# Process frames (everything on GPU)
|
||||||
|
frame_gpu = decoder.get_latest_frame(rgb=True) # torch.Tensor on CUDA
|
||||||
|
|
||||||
|
# Preprocess (stays on GPU)
|
||||||
|
frame_gpu = frame_gpu.float() / 255.0
|
||||||
|
frame_gpu = frame_gpu.unsqueeze(0) # Add batch dim
|
||||||
|
|
||||||
|
# Inference (GPU-to-GPU, zero copy)
|
||||||
|
outputs = model_repo.infer(
|
||||||
|
model_id="camera_main",
|
||||||
|
inputs={"images": frame_gpu}
|
||||||
|
)
|
||||||
|
|
||||||
|
# Post-process outputs (can stay on GPU)
|
||||||
|
# ... NMS, bounding boxes, etc.
|
||||||
|
```
|
||||||
|
|
||||||
|
### Concurrent Inference
|
||||||
|
|
||||||
|
```python
|
||||||
|
import threading
|
||||||
|
|
||||||
|
def process_camera(camera_id: str, model_id: str):
|
||||||
|
# Get frame from decoder (on GPU)
|
||||||
|
frame = decoder.get_latest_frame(rgb=True)
|
||||||
|
|
||||||
|
# Inference automatically borrows/returns context from pool
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id=model_id,
|
||||||
|
inputs={"images": frame},
|
||||||
|
timeout=10.0 # Wait for available context
|
||||||
|
)
|
||||||
|
|
||||||
|
# Process outputs...
|
||||||
|
|
||||||
|
# Multiple threads can infer concurrently
|
||||||
|
threads = []
|
||||||
|
for i in range(10): # 10 threads
|
||||||
|
t = threading.Thread(
|
||||||
|
target=process_camera,
|
||||||
|
args=(f"camera_{i}", f"camera_{i}")
|
||||||
|
)
|
||||||
|
threads.append(t)
|
||||||
|
t.start()
|
||||||
|
|
||||||
|
for t in threads:
|
||||||
|
t.join()
|
||||||
|
|
||||||
|
# With 4 contexts: up to 4 inferences run in parallel
|
||||||
|
# Others wait in queue, contexts auto-balanced
|
||||||
|
```
|
||||||
|
|
||||||
|
## API Reference
|
||||||
|
|
||||||
|
### TensorRTModelRepository
|
||||||
|
|
||||||
|
#### `__init__(gpu_id=0, default_num_contexts=4)`
|
||||||
|
Initialize the repository.
|
||||||
|
|
||||||
|
**Args:**
|
||||||
|
- `gpu_id`: GPU device ID
|
||||||
|
- `default_num_contexts`: Default context pool size per engine
|
||||||
|
|
||||||
|
#### `load_model(model_id, file_path, num_contexts=None, force_reload=False)`
|
||||||
|
Load a TensorRT model.
|
||||||
|
|
||||||
|
**Args:**
|
||||||
|
- `model_id`: Unique identifier (e.g., "camera_1")
|
||||||
|
- `file_path`: Path to .trt/.engine file
|
||||||
|
- `num_contexts`: Context pool size (None = use default)
|
||||||
|
- `force_reload`: Reload if model_id exists
|
||||||
|
|
||||||
|
**Returns:** `ModelMetadata`
|
||||||
|
|
||||||
|
**Deduplication:** If file hash matches existing model, reuses engine + contexts.
|
||||||
|
|
||||||
|
#### `infer(model_id, inputs, synchronize=True, timeout=5.0)`
|
||||||
|
Run inference.
|
||||||
|
|
||||||
|
**Args:**
|
||||||
|
- `model_id`: Model identifier
|
||||||
|
- `inputs`: Dict mapping input names to CUDA tensors
|
||||||
|
- `synchronize`: Wait for completion
|
||||||
|
- `timeout`: Max wait time for context (seconds)
|
||||||
|
|
||||||
|
**Returns:** Dict mapping output names to CUDA tensors
|
||||||
|
|
||||||
|
**Thread-safe:** Borrows context from pool, returns after inference.
|
||||||
|
|
||||||
|
#### `unload_model(model_id)`
|
||||||
|
Unload a model.
|
||||||
|
|
||||||
|
If last reference to engine, fully unloads from VRAM.
|
||||||
|
|
||||||
|
#### `get_metadata(model_id)`
|
||||||
|
Get model metadata.
|
||||||
|
|
||||||
|
**Returns:** `ModelMetadata` or `None`
|
||||||
|
|
||||||
|
#### `get_model_info(model_id)`
|
||||||
|
Get detailed model information.
|
||||||
|
|
||||||
|
**Returns:** Dict with engine references, context pool size, shared model IDs, etc.
|
||||||
|
|
||||||
|
#### `get_stats()`
|
||||||
|
Get repository statistics.
|
||||||
|
|
||||||
|
**Returns:** Dict with total models, unique engines, contexts, memory efficiency.
|
||||||
|
|
||||||
|
## Best Practices
|
||||||
|
|
||||||
|
### 1. Set Appropriate Context Pool Size
|
||||||
|
|
||||||
|
```python
|
||||||
|
# For 10 cameras with same model, 4 contexts is usually enough
|
||||||
|
repo = TensorRTModelRepository(default_num_contexts=4)
|
||||||
|
|
||||||
|
# For high concurrency, increase pool size
|
||||||
|
repo = TensorRTModelRepository(default_num_contexts=8)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Rule of thumb:** Start with 4 contexts, increase if you see timeout errors.
|
||||||
|
|
||||||
|
### 2. Always Use GPU Tensors
|
||||||
|
|
||||||
|
```python
|
||||||
|
# ✅ Good: Input on GPU
|
||||||
|
input_gpu = torch.rand(1, 3, 640, 640, device='cuda:0')
|
||||||
|
outputs = repo.infer(model_id, {"images": input_gpu})
|
||||||
|
|
||||||
|
# ❌ Bad: Input on CPU (will cause error)
|
||||||
|
input_cpu = torch.rand(1, 3, 640, 640)
|
||||||
|
outputs = repo.infer(model_id, {"images": input_cpu}) # ValueError!
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Handle Timeout Gracefully
|
||||||
|
|
||||||
|
```python
|
||||||
|
try:
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id="camera_1",
|
||||||
|
inputs=inputs,
|
||||||
|
timeout=5.0
|
||||||
|
)
|
||||||
|
except RuntimeError as e:
|
||||||
|
# All contexts busy, increase pool size or add backpressure
|
||||||
|
print(f"Inference timeout: {e}")
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Use Same File for Deduplication
|
||||||
|
|
||||||
|
```python
|
||||||
|
# ✅ Good: Same file path → deduplication
|
||||||
|
repo.load_model("cam1", "/models/yolo.trt")
|
||||||
|
repo.load_model("cam2", "/models/yolo.trt") # Shares engine!
|
||||||
|
|
||||||
|
# ❌ Bad: Different paths (even if same content) → no deduplication
|
||||||
|
repo.load_model("cam1", "/models/yolo.trt")
|
||||||
|
repo.load_model("cam2", "/models/yolo_copy.trt") # Separate engine
|
||||||
|
```
|
||||||
|
|
||||||
|
## TensorRT Best Practices Implemented
|
||||||
|
|
||||||
|
Based on NVIDIA documentation and web search findings:
|
||||||
|
|
||||||
|
1. **Separate IExecutionContext per concurrent stream** ✅
|
||||||
|
- Each context has its own CUDA stream
|
||||||
|
- Contexts never shared across threads simultaneously
|
||||||
|
|
||||||
|
2. **Mutex-based context management** ✅
|
||||||
|
- Queue-based borrowing with locks
|
||||||
|
- Thread-safe acquire/release pattern
|
||||||
|
|
||||||
|
3. **GPU memory reuse** ✅
|
||||||
|
- Engines shared by file hash
|
||||||
|
- Contexts pooled and reused
|
||||||
|
|
||||||
|
4. **Zero-copy operations** ✅
|
||||||
|
- All data stays in VRAM
|
||||||
|
- DLPack integration with PyTorch
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### "No execution context available within timeout"
|
||||||
|
|
||||||
|
**Cause:** All contexts busy with concurrent inferences.
|
||||||
|
|
||||||
|
**Solutions:**
|
||||||
|
1. Increase context pool size:
|
||||||
|
```python
|
||||||
|
repo.load_model(model_id, file_path, num_contexts=8)
|
||||||
|
```
|
||||||
|
2. Increase timeout:
|
||||||
|
```python
|
||||||
|
outputs = repo.infer(model_id, inputs, timeout=30.0)
|
||||||
|
```
|
||||||
|
3. Add backpressure/throttling to limit concurrent requests
|
||||||
|
|
||||||
|
### Out of Memory (OOM)

**Cause:** Too many unique engines or large context pools.

**Solutions:**
1. Ensure deduplication is working (same file paths); see the check below
2. Reduce context pool sizes
3. Use smaller models or quantization (INT8/FP16)
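To confirm that deduplication is actually working, the stats from `get_stats()` can be checked directly; the expected count of 1 assumes every camera was loaded from the same model file:

```python
stats = repo.get_stats()
# With N cameras sharing one .trt file, unique_engines should stay at 1.
if stats['unique_engines'] > 1:
    print(f"Warning: {stats['unique_engines']} engines in VRAM; "
          f"check that every load_model() call uses the same file path")
```
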
### Import Error: "tensorrt could not be resolved"

**Solution:** Install TensorRT:
```bash
pip install tensorrt
# Or use NVIDIA's wheel for your CUDA version
```
## Performance Tips

1. **Batch Processing:** Process multiple frames before synchronizing (a combined sketch follows this list)
```python
outputs = repo.infer(model_id, inputs, synchronize=False)
# ... more inferences ...
torch.cuda.synchronize()  # Sync once at end
```

2. **Async Inference:** Don't synchronize if not needed immediately
```python
outputs = repo.infer(model_id, inputs, synchronize=False)
# GPU continues working, CPU continues
# Synchronize later when you need results
```

3. **Monitor Context Utilization:**
```python
stats = repo.get_stats()
print(f"Contexts: {stats['total_contexts']}")

# If timeouts occur frequently, increase pool size
```
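Putting tips 1 and 2 together, a hedged sketch of one async pass over several cameras. A `decoders` dict of running `StreamDecoder` instances is assumed, and `preprocess()` is a placeholder for whatever resize/normalize step your engine expects:

```python
import torch

def preprocess(frame):
    # Placeholder: adapt to your engine's expected layout, size and scaling.
    return frame.unsqueeze(0).float() / 255.0

pending = {}
for camera_id in ["camera_1", "camera_2", "camera_3"]:
    frame = decoders[camera_id].get_frame(rgb=True)
    if frame is None:
        continue
    # Queue work on the GPU without blocking the CPU.
    pending[camera_id] = repo.infer(
        camera_id, {"images": preprocess(frame)}, synchronize=False
    )

# Single synchronization point after all cameras have been submitted.
torch.cuda.synchronize()
for camera_id, outputs in pending.items():
    print(camera_id, {name: t.shape for name, t in outputs.items()})
```
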
## License

Part of python-rtsp-worker project.
14
services/__init__.py
Normal file
@@ -0,0 +1,14 @@
"""
|
||||||
|
Services package for RTSP stream processing with GPU acceleration.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from .stream_decoder import StreamDecoderFactory, StreamDecoder, ConnectionStatus
|
||||||
|
from .jpeg_encoder import JPEGEncoderFactory, encode_frame_to_jpeg
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
'StreamDecoderFactory',
|
||||||
|
'StreamDecoder',
|
||||||
|
'ConnectionStatus',
|
||||||
|
'JPEGEncoderFactory',
|
||||||
|
'encode_frame_to_jpeg',
|
||||||
|
]
|
||||||
91
services/jpeg_encoder.py
Normal file
@@ -0,0 +1,91 @@
"""
|
||||||
|
JPEG Encoder wrapper for GPU-accelerated JPEG encoding using nvImageCodec/nvJPEG.
|
||||||
|
Provides a shared encoder instance that can be used across multiple streams.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from typing import Optional
|
||||||
|
import torch
|
||||||
|
import nvidia.nvimgcodec as nvimgcodec
|
||||||
|
|
||||||
|
|
||||||
|
class JPEGEncoderFactory:
|
||||||
|
"""
|
||||||
|
Factory for creating and managing a shared JPEG encoder instance.
|
||||||
|
Thread-safe singleton pattern for efficient resource sharing.
|
||||||
|
"""
|
||||||
|
|
||||||
|
_instance = None
|
||||||
|
_encoder = None
|
||||||
|
|
||||||
|
def __new__(cls):
|
||||||
|
if cls._instance is None:
|
||||||
|
cls._instance = super(JPEGEncoderFactory, cls).__new__(cls)
|
||||||
|
cls._encoder = nvimgcodec.Encoder()
|
||||||
|
print("JPEGEncoderFactory initialized with shared nvJPEG encoder")
|
||||||
|
return cls._instance
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_encoder(cls):
|
||||||
|
"""Get the shared JPEG encoder instance"""
|
||||||
|
if cls._encoder is None:
|
||||||
|
cls() # Initialize if not already done
|
||||||
|
return cls._encoder
|
||||||
|
|
||||||
|
|
||||||
|
def encode_frame_to_jpeg(rgb_frame: torch.Tensor, quality: int = 95) -> Optional[bytes]:
|
||||||
|
"""
|
||||||
|
Encode an RGB frame to JPEG on GPU and return JPEG bytes.
|
||||||
|
|
||||||
|
This function:
|
||||||
|
1. Takes RGB frame from GPU (stays on GPU during encoding)
|
||||||
|
2. Converts PyTorch tensor to nvImageCodec image via as_image()
|
||||||
|
3. Encodes to JPEG using nvJPEG (GPU operation)
|
||||||
|
4. Transfers only JPEG bytes to CPU
|
||||||
|
5. Returns bytes for saving to disk
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rgb_frame: RGB tensor on GPU, shape (3, H, W) or (H, W, 3), dtype uint8
|
||||||
|
quality: JPEG quality (0-100, default 95)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JPEG encoded bytes or None if encoding fails
|
||||||
|
"""
|
||||||
|
if rgb_frame is None:
|
||||||
|
return None
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Ensure we have (H, W, C) format and contiguous memory
|
||||||
|
if rgb_frame.dim() == 3:
|
||||||
|
if rgb_frame.shape[0] == 3:
|
||||||
|
# Convert from (C, H, W) to (H, W, C)
|
||||||
|
rgb_hwc = rgb_frame.permute(1, 2, 0).contiguous()
|
||||||
|
else:
|
||||||
|
# Already (H, W, C)
|
||||||
|
rgb_hwc = rgb_frame.contiguous()
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Expected 3D tensor, got shape {rgb_frame.shape}")
|
||||||
|
|
||||||
|
# Get shared encoder
|
||||||
|
encoder = JPEGEncoderFactory.get_encoder()
|
||||||
|
|
||||||
|
# Create encode parameters with quality
|
||||||
|
# Quality is set via quality_value (0-100 scale)
|
||||||
|
jpeg_params = nvimgcodec.JpegEncodeParams(optimized_huffman=True)
|
||||||
|
encode_params = nvimgcodec.EncodeParams(
|
||||||
|
quality_value=float(quality),
|
||||||
|
jpeg_encode_params=jpeg_params
|
||||||
|
)
|
||||||
|
|
||||||
|
# Convert PyTorch GPU tensor to nvImageCodec image using __cuda_array_interface__
|
||||||
|
# This is zero-copy - nvimgcodec reads directly from GPU memory
|
||||||
|
nv_image = nvimgcodec.as_image(rgb_hwc)
|
||||||
|
|
||||||
|
# Encode to JPEG on GPU
|
||||||
|
# The encoding happens on GPU, only compressed JPEG bytes are transferred to CPU
|
||||||
|
jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
|
||||||
|
|
||||||
|
return bytes(jpeg_data)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error encoding frame to JPEG: {e}")
|
||||||
|
return None
|
||||||
631
services/model_repository.py
Normal file
@@ -0,0 +1,631 @@
import threading
|
||||||
|
import hashlib
|
||||||
|
from typing import Optional, Dict, Any, List, Tuple
|
||||||
|
from pathlib import Path
|
||||||
|
from queue import Queue, Empty
|
||||||
|
import torch
|
||||||
|
import tensorrt as trt
|
||||||
|
from dataclasses import dataclass
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ModelMetadata:
|
||||||
|
"""Metadata for a loaded TensorRT model"""
|
||||||
|
file_path: str
|
||||||
|
file_hash: str
|
||||||
|
input_shapes: Dict[str, Tuple[int, ...]]
|
||||||
|
output_shapes: Dict[str, Tuple[int, ...]]
|
||||||
|
input_names: List[str]
|
||||||
|
output_names: List[str]
|
||||||
|
input_dtypes: Dict[str, torch.dtype]
|
||||||
|
output_dtypes: Dict[str, torch.dtype]
|
||||||
|
|
||||||
|
|
||||||
|
class ExecutionContext:
|
||||||
|
"""
|
||||||
|
Wrapper for TensorRT execution context with CUDA stream.
|
||||||
|
Used in context pool for load balancing.
|
||||||
|
"""
|
||||||
|
def __init__(self, context: trt.IExecutionContext, stream: torch.cuda.Stream,
|
||||||
|
context_id: int, device: torch.device):
|
||||||
|
self.context = context
|
||||||
|
self.stream = stream
|
||||||
|
self.context_id = context_id
|
||||||
|
self.device = device
|
||||||
|
self.in_use = False
|
||||||
|
self.lock = threading.Lock()
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return f"ExecutionContext(id={self.context_id}, in_use={self.in_use})"
|
||||||
|
|
||||||
|
|
||||||
|
class SharedEngine:
|
||||||
|
"""
|
||||||
|
Shared TensorRT engine with context pool for load balancing.
|
||||||
|
|
||||||
|
Architecture:
|
||||||
|
- One engine shared across all model_ids with same file hash
|
||||||
|
- Pool of N execution contexts for concurrent inference
|
||||||
|
- Contexts are borrowed/returned using mutex locks
|
||||||
|
- Load balancing: contexts distributed across requests
|
||||||
|
"""
|
||||||
|
def __init__(self, engine: trt.ICudaEngine, file_hash: str, file_path: str,
|
||||||
|
num_contexts: int, device: torch.device, metadata: ModelMetadata):
|
||||||
|
self.engine = engine
|
||||||
|
self.file_hash = file_hash
|
||||||
|
self.file_path = file_path
|
||||||
|
self.metadata = metadata
|
||||||
|
self.device = device
|
||||||
|
self.num_contexts = num_contexts
|
||||||
|
|
||||||
|
# Create context pool
|
||||||
|
self.context_pool: List[ExecutionContext] = []
|
||||||
|
self.available_contexts: Queue[ExecutionContext] = Queue()
|
||||||
|
|
||||||
|
for i in range(num_contexts):
|
||||||
|
ctx = engine.create_execution_context()
|
||||||
|
if ctx is None:
|
||||||
|
raise RuntimeError(f"Failed to create execution context {i}")
|
||||||
|
|
||||||
|
stream = torch.cuda.Stream(device=device)
|
||||||
|
exec_ctx = ExecutionContext(ctx, stream, i, device)
|
||||||
|
self.context_pool.append(exec_ctx)
|
||||||
|
self.available_contexts.put(exec_ctx)
|
||||||
|
|
||||||
|
# Model IDs referencing this engine
|
||||||
|
self.model_ids: set = set()
|
||||||
|
self.lock = threading.Lock()
|
||||||
|
|
||||||
|
print(f"Created context pool with {num_contexts} contexts for engine {file_hash[:8]}...")
|
||||||
|
|
||||||
|
def acquire_context(self, timeout: Optional[float] = None) -> Optional[ExecutionContext]:
|
||||||
|
"""
|
||||||
|
Acquire an available execution context from the pool.
|
||||||
|
Blocks if all contexts are in use.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
timeout: Max time to wait for context (None = wait forever)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ExecutionContext or None if timeout
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
exec_ctx = self.available_contexts.get(timeout=timeout)
|
||||||
|
with exec_ctx.lock:
|
||||||
|
exec_ctx.in_use = True
|
||||||
|
return exec_ctx
|
||||||
|
except Empty:
    # Queue.get timed out: every context is currently in use
    return None
|
||||||
|
|
||||||
|
def release_context(self, exec_ctx: ExecutionContext):
|
||||||
|
"""
|
||||||
|
Return a context to the pool.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
exec_ctx: Context to release
|
||||||
|
"""
|
||||||
|
with exec_ctx.lock:
|
||||||
|
exec_ctx.in_use = False
|
||||||
|
self.available_contexts.put(exec_ctx)
|
||||||
|
|
||||||
|
def add_model_id(self, model_id: str):
|
||||||
|
"""Add a model_id reference to this engine"""
|
||||||
|
with self.lock:
|
||||||
|
self.model_ids.add(model_id)
|
||||||
|
|
||||||
|
def remove_model_id(self, model_id: str) -> int:
|
||||||
|
"""
|
||||||
|
Remove a model_id reference from this engine.
|
||||||
|
Returns the number of remaining references.
|
||||||
|
"""
|
||||||
|
with self.lock:
|
||||||
|
self.model_ids.discard(model_id)
|
||||||
|
return len(self.model_ids)
|
||||||
|
|
||||||
|
def get_reference_count(self) -> int:
|
||||||
|
"""Get number of model_ids using this engine"""
|
||||||
|
with self.lock:
|
||||||
|
return len(self.model_ids)
|
||||||
|
|
||||||
|
def cleanup(self):
|
||||||
|
"""Cleanup all contexts"""
|
||||||
|
for exec_ctx in self.context_pool:
|
||||||
|
del exec_ctx.context
|
||||||
|
self.context_pool.clear()
|
||||||
|
del self.engine
|
||||||
|
|
||||||
|
|
||||||
|
class TensorRTModelRepository:
|
||||||
|
"""
|
||||||
|
Thread-safe repository for TensorRT models with context pooling and deduplication.
|
||||||
|
|
||||||
|
Architecture:
|
||||||
|
- Deduplication: Multiple model_ids with same file → share one engine
|
||||||
|
- Context Pool: Each unique engine has N execution contexts (configurable)
|
||||||
|
- Load Balancing: Contexts are borrowed/returned via mutex queue
|
||||||
|
- Scalability: Adding 100 cameras with same model = 1 engine + N contexts (not 100 contexts!)
|
||||||
|
|
||||||
|
Best Practices:
|
||||||
|
- GPU-to-GPU: All inputs/outputs stay in VRAM (zero CPU transfers)
|
||||||
|
- Thread Safety: Mutex-based context borrowing (TensorRT best practice)
|
||||||
|
- Memory Efficient: Deduplicate by file hash, share engine across model_ids
|
||||||
|
- Concurrent: N contexts allow N parallel inferences per unique model
|
||||||
|
|
||||||
|
Example:
|
||||||
|
# 100 cameras, same model file
|
||||||
|
for i in range(100):
|
||||||
|
repo.load_model(f"camera_{i}", "yolov8.trt")
|
||||||
|
# Result: 1 engine in VRAM, N contexts (e.g., 4), not 100 contexts!
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, gpu_id: int = 0, default_num_contexts: int = 4):
|
||||||
|
"""
|
||||||
|
Initialize the model repository.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
gpu_id: GPU device ID to use
|
||||||
|
default_num_contexts: Default number of execution contexts per unique engine
|
||||||
|
"""
|
||||||
|
self.gpu_id = gpu_id
|
||||||
|
self.device = torch.device(f'cuda:{gpu_id}')
|
||||||
|
self.default_num_contexts = default_num_contexts
|
||||||
|
|
||||||
|
# Model ID to engine mapping: model_id -> file_hash
|
||||||
|
self._model_to_hash: Dict[str, str] = {}
|
||||||
|
|
||||||
|
# Shared engines with context pools: file_hash -> SharedEngine
|
||||||
|
self._shared_engines: Dict[str, SharedEngine] = {}
|
||||||
|
|
||||||
|
# Locks for thread safety
|
||||||
|
self._repo_lock = threading.RLock()
|
||||||
|
|
||||||
|
# TensorRT logger
|
||||||
|
self.trt_logger = trt.Logger(trt.Logger.WARNING)
|
||||||
|
|
||||||
|
print(f"TensorRT Model Repository initialized on GPU {gpu_id}")
|
||||||
|
print(f"Default context pool size: {default_num_contexts} contexts per unique model")
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def compute_file_hash(file_path: str) -> str:
|
||||||
|
"""
|
||||||
|
Compute SHA256 hash of a file for deduplication.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to the file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Hexadecimal hash string
|
||||||
|
"""
|
||||||
|
sha256_hash = hashlib.sha256()
|
||||||
|
with open(file_path, "rb") as f:
|
||||||
|
# Read in chunks to handle large files efficiently
|
||||||
|
for byte_block in iter(lambda: f.read(65536), b""):
|
||||||
|
sha256_hash.update(byte_block)
|
||||||
|
return sha256_hash.hexdigest()
|
||||||
|
|
||||||
|
def _load_engine(self, file_path: str) -> trt.ICudaEngine:
|
||||||
|
"""
|
||||||
|
Load TensorRT engine from file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to .trt or .engine file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
TensorRT engine
|
||||||
|
"""
|
||||||
|
runtime = trt.Runtime(self.trt_logger)
|
||||||
|
|
||||||
|
with open(file_path, 'rb') as f:
|
||||||
|
engine_data = f.read()
|
||||||
|
|
||||||
|
engine = runtime.deserialize_cuda_engine(engine_data)
|
||||||
|
if engine is None:
|
||||||
|
raise RuntimeError(f"Failed to load TensorRT engine from {file_path}")
|
||||||
|
|
||||||
|
return engine
|
||||||
|
|
||||||
|
def _extract_metadata(self, engine: trt.ICudaEngine,
|
||||||
|
file_path: str, file_hash: str) -> ModelMetadata:
|
||||||
|
"""
|
||||||
|
Extract metadata from TensorRT engine.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
engine: TensorRT engine
|
||||||
|
file_path: Path to model file
|
||||||
|
file_hash: SHA256 hash of model file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ModelMetadata object
|
||||||
|
"""
|
||||||
|
input_shapes = {}
|
||||||
|
output_shapes = {}
|
||||||
|
input_names = []
|
||||||
|
output_names = []
|
||||||
|
input_dtypes = {}
|
||||||
|
output_dtypes = {}
|
||||||
|
|
||||||
|
# TensorRT dtype to PyTorch dtype mapping
|
||||||
|
trt_to_torch_dtype = {
|
||||||
|
trt.DataType.FLOAT: torch.float32,
|
||||||
|
trt.DataType.HALF: torch.float16,
|
||||||
|
trt.DataType.INT8: torch.int8,
|
||||||
|
trt.DataType.INT32: torch.int32,
|
||||||
|
trt.DataType.BOOL: torch.bool,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Iterate through all tensors (inputs and outputs) - TensorRT 10.x API
|
||||||
|
for i in range(engine.num_io_tensors):
|
||||||
|
name = engine.get_tensor_name(i)
|
||||||
|
shape = tuple(engine.get_tensor_shape(name))
|
||||||
|
dtype = trt_to_torch_dtype.get(engine.get_tensor_dtype(name), torch.float32)
|
||||||
|
mode = engine.get_tensor_mode(name)
|
||||||
|
|
||||||
|
if mode == trt.TensorIOMode.INPUT:
|
||||||
|
input_names.append(name)
|
||||||
|
input_shapes[name] = shape
|
||||||
|
input_dtypes[name] = dtype
|
||||||
|
else:
|
||||||
|
output_names.append(name)
|
||||||
|
output_shapes[name] = shape
|
||||||
|
output_dtypes[name] = dtype
|
||||||
|
|
||||||
|
return ModelMetadata(
|
||||||
|
file_path=file_path,
|
||||||
|
file_hash=file_hash,
|
||||||
|
input_shapes=input_shapes,
|
||||||
|
output_shapes=output_shapes,
|
||||||
|
input_names=input_names,
|
||||||
|
output_names=output_names,
|
||||||
|
input_dtypes=input_dtypes,
|
||||||
|
output_dtypes=output_dtypes
|
||||||
|
)
|
||||||
|
|
||||||
|
def load_model(self, model_id: str, file_path: str,
|
||||||
|
num_contexts: Optional[int] = None,
|
||||||
|
force_reload: bool = False) -> ModelMetadata:
|
||||||
|
"""
|
||||||
|
Load a TensorRT model with the given ID.
|
||||||
|
|
||||||
|
Deduplication: If a model with the same file hash is already loaded, the model_id
|
||||||
|
is simply mapped to the existing SharedEngine (no new engine or contexts created).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: User-defined identifier for this model (e.g., "camera_1")
|
||||||
|
file_path: Path to TensorRT engine file (.trt or .engine)
|
||||||
|
num_contexts: Number of execution contexts in pool (None = use default)
|
||||||
|
force_reload: If True, reload even if model_id exists
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ModelMetadata for the loaded model
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
FileNotFoundError: If model file doesn't exist
|
||||||
|
RuntimeError: If engine loading fails
|
||||||
|
ValueError: If model_id already exists and force_reload is False
|
||||||
|
"""
|
||||||
|
file_path = str(Path(file_path).resolve())
|
||||||
|
|
||||||
|
if not Path(file_path).exists():
|
||||||
|
raise FileNotFoundError(f"Model file not found: {file_path}")
|
||||||
|
|
||||||
|
if num_contexts is None:
|
||||||
|
num_contexts = self.default_num_contexts
|
||||||
|
|
||||||
|
with self._repo_lock:
|
||||||
|
# Check if model_id already exists
|
||||||
|
if model_id in self._model_to_hash and not force_reload:
|
||||||
|
raise ValueError(
|
||||||
|
f"Model ID '{model_id}' already exists. "
|
||||||
|
f"Use force_reload=True to reload or choose a different ID."
|
||||||
|
)
|
||||||
|
|
||||||
|
# Unload existing model if force_reload
|
||||||
|
if model_id in self._model_to_hash and force_reload:
|
||||||
|
self.unload_model(model_id)
|
||||||
|
|
||||||
|
# Compute file hash for deduplication
|
||||||
|
print(f"Computing hash for {file_path}...")
|
||||||
|
file_hash = self.compute_file_hash(file_path)
|
||||||
|
print(f"File hash: {file_hash[:16]}...")
|
||||||
|
|
||||||
|
# Check if this file is already loaded (deduplication)
|
||||||
|
if file_hash in self._shared_engines:
|
||||||
|
shared_engine = self._shared_engines[file_hash]
|
||||||
|
print(f"Engine already loaded (hash match), reusing engine and context pool...")
|
||||||
|
print(f" Existing model_ids using this engine: {shared_engine.model_ids}")
|
||||||
|
else:
|
||||||
|
# Load new engine
|
||||||
|
print(f"Loading TensorRT engine from {file_path}...")
|
||||||
|
engine = self._load_engine(file_path)
|
||||||
|
|
||||||
|
# Extract metadata
|
||||||
|
metadata = self._extract_metadata(engine, file_path, file_hash)
|
||||||
|
|
||||||
|
# Create shared engine with context pool
|
||||||
|
shared_engine = SharedEngine(
|
||||||
|
engine=engine,
|
||||||
|
file_hash=file_hash,
|
||||||
|
file_path=file_path,
|
||||||
|
num_contexts=num_contexts,
|
||||||
|
device=self.device,
|
||||||
|
metadata=metadata
|
||||||
|
)
|
||||||
|
self._shared_engines[file_hash] = shared_engine
|
||||||
|
|
||||||
|
# Add this model_id to the shared engine
|
||||||
|
shared_engine.add_model_id(model_id)
|
||||||
|
|
||||||
|
# Map model_id to file_hash
|
||||||
|
self._model_to_hash[model_id] = file_hash
|
||||||
|
|
||||||
|
print(f"Model '{model_id}' loaded successfully")
|
||||||
|
print(f" Inputs: {shared_engine.metadata.input_names}")
|
||||||
|
for name in shared_engine.metadata.input_names:
|
||||||
|
print(f" {name}: {shared_engine.metadata.input_shapes[name]} ({shared_engine.metadata.input_dtypes[name]})")
|
||||||
|
print(f" Outputs: {shared_engine.metadata.output_names}")
|
||||||
|
for name in shared_engine.metadata.output_names:
|
||||||
|
print(f" {name}: {shared_engine.metadata.output_shapes[name]} ({shared_engine.metadata.output_dtypes[name]})")
|
||||||
|
print(f" Context pool size: {num_contexts}")
|
||||||
|
print(f" Model IDs sharing this engine: {shared_engine.get_reference_count()}")
|
||||||
|
print(f" Unique engines in VRAM: {len(self._shared_engines)}")
|
||||||
|
|
||||||
|
return shared_engine.metadata
|
||||||
|
|
||||||
|
def infer(self, model_id: str, inputs: Dict[str, torch.Tensor],
|
||||||
|
synchronize: bool = True, timeout: Optional[float] = 5.0) -> Dict[str, torch.Tensor]:
|
||||||
|
"""
|
||||||
|
Run GPU-to-GPU inference with the specified model using context pooling.
|
||||||
|
|
||||||
|
All inputs must be CUDA tensors and outputs will be CUDA tensors (stays in VRAM).
|
||||||
|
Thread-safe: Borrows an execution context from the pool with mutex locking.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier
|
||||||
|
inputs: Dictionary mapping input names to CUDA tensors
|
||||||
|
synchronize: If True, wait for inference to complete. If False, async execution.
|
||||||
|
timeout: Max time to wait for available context (seconds)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary mapping output names to CUDA tensors (in VRAM)
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
KeyError: If model_id not found
|
||||||
|
ValueError: If inputs don't match expected shapes or are not on GPU
|
||||||
|
RuntimeError: If no context available within timeout
|
||||||
|
"""
|
||||||
|
# Get shared engine
|
||||||
|
if model_id not in self._model_to_hash:
|
||||||
|
raise KeyError(f"Model '{model_id}' not found. Available: {list(self._model_to_hash.keys())}")
|
||||||
|
|
||||||
|
file_hash = self._model_to_hash[model_id]
|
||||||
|
shared_engine = self._shared_engines[file_hash]
|
||||||
|
metadata = shared_engine.metadata
|
||||||
|
|
||||||
|
# Validate inputs
|
||||||
|
for name in metadata.input_names:
|
||||||
|
if name not in inputs:
|
||||||
|
raise ValueError(f"Missing required input: {name}")
|
||||||
|
|
||||||
|
tensor = inputs[name]
|
||||||
|
if not tensor.is_cuda:
|
||||||
|
raise ValueError(f"Input '{name}' must be a CUDA tensor (on GPU)")
|
||||||
|
|
||||||
|
# Check device
|
||||||
|
if tensor.device != self.device:
|
||||||
|
print(f"Warning: Input '{name}' on {tensor.device}, moving to {self.device}")
|
||||||
|
inputs[name] = tensor.to(self.device)
|
||||||
|
|
||||||
|
# Acquire context from pool (mutex-based)
|
||||||
|
exec_ctx = shared_engine.acquire_context(timeout=timeout)
|
||||||
|
if exec_ctx is None:
|
||||||
|
raise RuntimeError(
|
||||||
|
f"No execution context available for model '{model_id}' within {timeout}s. "
|
||||||
|
f"All {shared_engine.num_contexts} contexts are busy."
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Prepare output tensors
|
||||||
|
outputs = {}
|
||||||
|
|
||||||
|
# Set input tensors - TensorRT 10.x API
|
||||||
|
for name in metadata.input_names:
|
||||||
|
input_tensor = inputs[name].contiguous()
|
||||||
|
exec_ctx.context.set_tensor_address(name, input_tensor.data_ptr())
|
||||||
|
|
||||||
|
# Allocate and set output tensors
|
||||||
|
for name in metadata.output_names:
|
||||||
|
output_shape = metadata.output_shapes[name]
|
||||||
|
output_dtype = metadata.output_dtypes[name]
|
||||||
|
|
||||||
|
output_tensor = torch.empty(
|
||||||
|
output_shape,
|
||||||
|
dtype=output_dtype,
|
||||||
|
device=self.device
|
||||||
|
)
|
||||||
|
|
||||||
|
outputs[name] = output_tensor
|
||||||
|
exec_ctx.context.set_tensor_address(name, output_tensor.data_ptr())
|
||||||
|
|
||||||
|
# Execute inference on context's stream - TensorRT 10.x API
|
||||||
|
with torch.cuda.stream(exec_ctx.stream):
|
||||||
|
success = exec_ctx.context.execute_async_v3(
|
||||||
|
stream_handle=exec_ctx.stream.cuda_stream
|
||||||
|
)
|
||||||
|
|
||||||
|
if not success:
|
||||||
|
raise RuntimeError(f"Inference failed for model '{model_id}'")
|
||||||
|
|
||||||
|
# Synchronize if requested
|
||||||
|
if synchronize:
|
||||||
|
exec_ctx.stream.synchronize()
|
||||||
|
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Always release context back to pool
|
||||||
|
shared_engine.release_context(exec_ctx)
|
||||||
|
|
||||||
|
def infer_batch(self, model_id: str, batch_inputs: List[Dict[str, torch.Tensor]],
|
||||||
|
synchronize: bool = True) -> List[Dict[str, torch.Tensor]]:
|
||||||
|
"""
|
||||||
|
Run inference on multiple inputs.
|
||||||
|
Contexts are borrowed/returned for each input, enabling parallel processing.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier
|
||||||
|
batch_inputs: List of input dictionaries
|
||||||
|
synchronize: If True, wait for all inferences to complete
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of output dictionaries
|
||||||
|
"""
|
||||||
|
results = []
|
||||||
|
for inputs in batch_inputs:
|
||||||
|
outputs = self.infer(model_id, inputs, synchronize=synchronize)
|
||||||
|
results.append(outputs)
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
def unload_model(self, model_id: str):
|
||||||
|
"""
|
||||||
|
Unload a model from the repository.
|
||||||
|
|
||||||
|
Removes the model_id reference from the shared engine. If this was the last
|
||||||
|
reference, the engine and all its contexts will be fully unloaded from VRAM.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier to unload
|
||||||
|
"""
|
||||||
|
with self._repo_lock:
|
||||||
|
if model_id not in self._model_to_hash:
|
||||||
|
print(f"Warning: Model '{model_id}' not found")
|
||||||
|
return
|
||||||
|
|
||||||
|
file_hash = self._model_to_hash[model_id]
|
||||||
|
|
||||||
|
# Remove model_id from shared engine
|
||||||
|
if file_hash in self._shared_engines:
|
||||||
|
shared_engine = self._shared_engines[file_hash]
|
||||||
|
remaining_refs = shared_engine.remove_model_id(model_id)
|
||||||
|
|
||||||
|
# If no more references, cleanup engine and contexts
|
||||||
|
if remaining_refs == 0:
|
||||||
|
shared_engine.cleanup()
|
||||||
|
del self._shared_engines[file_hash]
|
||||||
|
print(f"Model '{model_id}' unloaded, engine removed from VRAM (0 references)")
|
||||||
|
else:
|
||||||
|
print(f"Model '{model_id}' unloaded, engine kept in VRAM ({remaining_refs} references)")
|
||||||
|
|
||||||
|
# Remove from model_id mapping
|
||||||
|
del self._model_to_hash[model_id]
|
||||||
|
|
||||||
|
def get_metadata(self, model_id: str) -> Optional[ModelMetadata]:
|
||||||
|
"""
|
||||||
|
Get metadata for a loaded model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ModelMetadata or None if not found
|
||||||
|
"""
|
||||||
|
if model_id not in self._model_to_hash:
|
||||||
|
return None
|
||||||
|
|
||||||
|
file_hash = self._model_to_hash[model_id]
|
||||||
|
if file_hash not in self._shared_engines:
|
||||||
|
return None
|
||||||
|
|
||||||
|
return self._shared_engines[file_hash].metadata
|
||||||
|
|
||||||
|
def list_models(self) -> Dict[str, ModelMetadata]:
|
||||||
|
"""
|
||||||
|
List all loaded models.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary mapping model_id to ModelMetadata
|
||||||
|
"""
|
||||||
|
with self._repo_lock:
|
||||||
|
result = {}
|
||||||
|
for model_id, file_hash in self._model_to_hash.items():
|
||||||
|
if file_hash in self._shared_engines:
|
||||||
|
result[model_id] = self._shared_engines[file_hash].metadata
|
||||||
|
return result
|
||||||
|
|
||||||
|
def get_model_info(self, model_id: str) -> Optional[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
Get detailed information about a loaded model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: Model identifier
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with model information or None if not found
|
||||||
|
"""
|
||||||
|
if model_id not in self._model_to_hash:
|
||||||
|
return None
|
||||||
|
|
||||||
|
file_hash = self._model_to_hash[model_id]
|
||||||
|
if file_hash not in self._shared_engines:
|
||||||
|
return None
|
||||||
|
|
||||||
|
shared_engine = self._shared_engines[file_hash]
|
||||||
|
metadata = shared_engine.metadata
|
||||||
|
|
||||||
|
return {
|
||||||
|
'model_id': model_id,
|
||||||
|
'file_path': metadata.file_path,
|
||||||
|
'file_hash': metadata.file_hash[:16] + '...',
|
||||||
|
'engine_references': shared_engine.get_reference_count(),
|
||||||
|
'context_pool_size': shared_engine.num_contexts,
|
||||||
|
'shared_with_model_ids': list(shared_engine.model_ids),
|
||||||
|
'inputs': {
|
||||||
|
name: {
|
||||||
|
'shape': metadata.input_shapes[name],
|
||||||
|
'dtype': str(metadata.input_dtypes[name])
|
||||||
|
}
|
||||||
|
for name in metadata.input_names
|
||||||
|
},
|
||||||
|
'outputs': {
|
||||||
|
name: {
|
||||||
|
'shape': metadata.output_shapes[name],
|
||||||
|
'dtype': str(metadata.output_dtypes[name])
|
||||||
|
}
|
||||||
|
for name in metadata.output_names
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
def get_stats(self) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Get repository statistics.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with stats about loaded models and memory usage
|
||||||
|
"""
|
||||||
|
with self._repo_lock:
|
||||||
|
total_contexts = sum(
|
||||||
|
engine.num_contexts
|
||||||
|
for engine in self._shared_engines.values()
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
'total_model_ids': len(self._model_to_hash),
|
||||||
|
'unique_engines': len(self._shared_engines),
|
||||||
|
'total_contexts': total_contexts,
|
||||||
|
'memory_efficiency': f"{len(self._model_to_hash)} model IDs using only {len(self._shared_engines)} engines",
|
||||||
|
'gpu_id': self.gpu_id,
|
||||||
|
'models': list(self._model_to_hash.keys())
|
||||||
|
}
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
with self._repo_lock:
|
||||||
|
return (f"TensorRTModelRepository(gpu={self.gpu_id}, "
|
||||||
|
f"model_ids={len(self._model_to_hash)}, "
|
||||||
|
f"unique_engines={len(self._shared_engines)})")
|
||||||
|
|
||||||
|
def __del__(self):
|
||||||
|
"""Cleanup all models on deletion"""
|
||||||
|
with self._repo_lock:
|
||||||
|
model_ids = list(self._model_to_hash.keys())
|
||||||
|
for model_id in model_ids:
|
||||||
|
self.unload_model(model_id)
|
||||||
481
services/stream_decoder.py
Normal file
@@ -0,0 +1,481 @@
import threading
|
||||||
|
from typing import Optional
|
||||||
|
from collections import deque
|
||||||
|
from enum import Enum
|
||||||
|
import torch
|
||||||
|
import PyNvVideoCodec as nvc
|
||||||
|
import av
|
||||||
|
import numpy as np
|
||||||
|
from cuda.bindings import driver as cuda_driver
|
||||||
|
from .jpeg_encoder import encode_frame_to_jpeg
|
||||||
|
|
||||||
|
|
||||||
|
def nv12_to_rgb_gpu(nv12_tensor: torch.Tensor, height: int, width: int) -> torch.Tensor:
|
||||||
|
"""
|
||||||
|
Convert NV12 format to RGB on GPU using PyTorch operations.
|
||||||
|
|
||||||
|
NV12 format:
|
||||||
|
- Y plane: height x width (luminance)
|
||||||
|
- UV plane: (height/2) x width (interleaved U and V, subsampled by 2)
|
||||||
|
|
||||||
|
Total tensor size: (height * 3/2) x width
|
||||||
|
|
||||||
|
Args:
|
||||||
|
nv12_tensor: Input tensor in NV12 format, shape (H*3/2, W)
|
||||||
|
height: Original frame height
|
||||||
|
width: Original frame width
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
RGB tensor, shape (3, H, W) in range [0, 255]
|
||||||
|
"""
|
||||||
|
device = nv12_tensor.device
|
||||||
|
|
||||||
|
# Split Y and UV planes
|
||||||
|
y_plane = nv12_tensor[:height, :].float() # (H, W)
|
||||||
|
uv_plane = nv12_tensor[height:, :].float() # (H/2, W)
|
||||||
|
|
||||||
|
# Reshape UV plane to separate U and V channels
|
||||||
|
# UV is interleaved: U0V0U1V1... we need to deinterleave
|
||||||
|
uv_plane = uv_plane.reshape(height // 2, width // 2, 2) # (H/2, W/2, 2)
|
||||||
|
u_plane = uv_plane[:, :, 0] # (H/2, W/2)
|
||||||
|
v_plane = uv_plane[:, :, 1] # (H/2, W/2)
|
||||||
|
|
||||||
|
# Upsample U and V to full resolution using bilinear interpolation
|
||||||
|
u_upsampled = torch.nn.functional.interpolate(
|
||||||
|
u_plane.unsqueeze(0).unsqueeze(0), # (1, 1, H/2, W/2)
|
||||||
|
size=(height, width),
|
||||||
|
mode='bilinear',
|
||||||
|
align_corners=False
|
||||||
|
).squeeze(0).squeeze(0) # (H, W)
|
||||||
|
|
||||||
|
v_upsampled = torch.nn.functional.interpolate(
|
||||||
|
v_plane.unsqueeze(0).unsqueeze(0), # (1, 1, H/2, W/2)
|
||||||
|
size=(height, width),
|
||||||
|
mode='bilinear',
|
||||||
|
align_corners=False
|
||||||
|
).squeeze(0).squeeze(0) # (H, W)
|
||||||
|
|
||||||
|
# YUV to RGB conversion using BT.601 standard
|
||||||
|
# R = Y + 1.402 * (V - 128)
|
||||||
|
# G = Y - 0.344136 * (U - 128) - 0.714136 * (V - 128)
|
||||||
|
# B = Y + 1.772 * (U - 128)
|
||||||
|
|
||||||
|
y = y_plane
|
||||||
|
u = u_upsampled - 128.0
|
||||||
|
v = v_upsampled - 128.0
|
||||||
|
|
||||||
|
r = y + 1.402 * v
|
||||||
|
g = y - 0.344136 * u - 0.714136 * v
|
||||||
|
b = y + 1.772 * u
|
||||||
|
|
||||||
|
# Clamp to [0, 255] and convert to uint8
|
||||||
|
r = torch.clamp(r, 0, 255).to(torch.uint8)
|
||||||
|
g = torch.clamp(g, 0, 255).to(torch.uint8)
|
||||||
|
b = torch.clamp(b, 0, 255).to(torch.uint8)
|
||||||
|
|
||||||
|
# Stack to (3, H, W)
|
||||||
|
rgb = torch.stack([r, g, b], dim=0)
|
||||||
|
|
||||||
|
return rgb
|
||||||
|
|
||||||
|
|
||||||
|
class ConnectionStatus(Enum):
|
||||||
|
DISCONNECTED = "disconnected"
|
||||||
|
CONNECTING = "connecting"
|
||||||
|
CONNECTED = "connected"
|
||||||
|
ERROR = "error"
|
||||||
|
RECONNECTING = "reconnecting"
|
||||||
|
|
||||||
|
|
||||||
|
class StreamDecoderFactory:
|
||||||
|
"""
|
||||||
|
Factory for creating StreamDecoder instances with shared CUDA context.
|
||||||
|
This minimizes VRAM overhead by sharing the CUDA context across all decoders.
|
||||||
|
"""
|
||||||
|
|
||||||
|
_instance = None
|
||||||
|
_lock = threading.Lock()
|
||||||
|
|
||||||
|
def __new__(cls, gpu_id: int = 0):
|
||||||
|
if cls._instance is None:
|
||||||
|
with cls._lock:
|
||||||
|
if cls._instance is None:
|
||||||
|
cls._instance = super(StreamDecoderFactory, cls).__new__(cls)
|
||||||
|
cls._instance._initialized = False
|
||||||
|
return cls._instance
|
||||||
|
|
||||||
|
def __init__(self, gpu_id: int = 0):
|
||||||
|
if self._initialized:
|
||||||
|
return
|
||||||
|
|
||||||
|
self.gpu_id = gpu_id
|
||||||
|
|
||||||
|
# Initialize CUDA and get device
|
||||||
|
err, = cuda_driver.cuInit(0)
|
||||||
|
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||||
|
raise RuntimeError(f"Failed to initialize CUDA: {err}")
|
||||||
|
|
||||||
|
# Get CUDA device
|
||||||
|
err, self.cuda_device = cuda_driver.cuDeviceGet(gpu_id)
|
||||||
|
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||||
|
raise RuntimeError(f"Failed to get CUDA device {gpu_id}: {err}")
|
||||||
|
|
||||||
|
# Retain primary context (shared across all decoders)
|
||||||
|
err, self.cuda_context = cuda_driver.cuDevicePrimaryCtxRetain(self.cuda_device)
|
||||||
|
if err != cuda_driver.CUresult.CUDA_SUCCESS:
|
||||||
|
raise RuntimeError(f"Failed to retain CUDA primary context: {err}")
|
||||||
|
|
||||||
|
self._initialized = True
|
||||||
|
print(f"StreamDecoderFactory initialized with shared CUDA context on GPU {gpu_id}")
|
||||||
|
|
||||||
|
def create_decoder(self, rtsp_url: str, buffer_size: int = 30,
|
||||||
|
codec: str = "h264") -> 'StreamDecoder':
|
||||||
|
"""
|
||||||
|
Create a new StreamDecoder instance with shared CUDA context.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rtsp_url: RTSP stream URL
|
||||||
|
buffer_size: Number of frames to buffer in VRAM
|
||||||
|
codec: Video codec (h264, hevc, etc.)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
StreamDecoder instance
|
||||||
|
"""
|
||||||
|
return StreamDecoder(
|
||||||
|
rtsp_url=rtsp_url,
|
||||||
|
cuda_context=self.cuda_context,
|
||||||
|
gpu_id=self.gpu_id,
|
||||||
|
buffer_size=buffer_size,
|
||||||
|
codec=codec
|
||||||
|
)
|
||||||
|
|
||||||
|
def __del__(self):
|
||||||
|
"""Cleanup shared CUDA context on factory destruction"""
|
||||||
|
if hasattr(self, 'cuda_device') and hasattr(self, 'gpu_id'):
|
||||||
|
cuda_driver.cuDevicePrimaryCtxRelease(self.cuda_device)
|
||||||
|
|
||||||
|
|
||||||
|
class StreamDecoder:
|
||||||
|
"""
|
||||||
|
Decodes RTSP stream using NVDEC and maintains a ring buffer of frames in GPU VRAM.
|
||||||
|
Thread-safe for concurrent read/write operations.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, rtsp_url: str, cuda_context, gpu_id: int,
|
||||||
|
buffer_size: int = 30, codec: str = "h264"):
|
||||||
|
"""
|
||||||
|
Initialize StreamDecoder.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rtsp_url: RTSP stream URL
|
||||||
|
cuda_context: Shared CUDA context handle
|
||||||
|
gpu_id: GPU device ID
|
||||||
|
buffer_size: Number of frames to keep in ring buffer
|
||||||
|
codec: Video codec type
|
||||||
|
"""
|
||||||
|
self.rtsp_url = rtsp_url
|
||||||
|
self.cuda_context = cuda_context
|
||||||
|
self.gpu_id = gpu_id
|
||||||
|
self.buffer_size = buffer_size
|
||||||
|
self.codec = codec
|
||||||
|
|
||||||
|
# Connection status
|
||||||
|
self.status = ConnectionStatus.DISCONNECTED
|
||||||
|
self._status_lock = threading.Lock()
|
||||||
|
|
||||||
|
# Frame buffer (ring buffer) - stores CUDA device pointers
|
||||||
|
self.frame_buffer = deque(maxlen=buffer_size)
|
||||||
|
self._buffer_lock = threading.RLock()
|
||||||
|
|
||||||
|
# Decoder and container instances
|
||||||
|
self.decoder = None
|
||||||
|
self.container = None
|
||||||
|
|
||||||
|
# Decode thread
|
||||||
|
self._decode_thread: Optional[threading.Thread] = None
|
||||||
|
self._stop_flag = threading.Event()
|
||||||
|
|
||||||
|
# Frame metadata
|
||||||
|
self.frame_width: Optional[int] = None
|
||||||
|
self.frame_height: Optional[int] = None
|
||||||
|
self.frame_count: int = 0
|
||||||
|
|
||||||
|
def start(self):
|
||||||
|
"""Start the RTSP stream decoding in background thread"""
|
||||||
|
if self._decode_thread is not None and self._decode_thread.is_alive():
|
||||||
|
print(f"Decoder already running for {self.rtsp_url}")
|
||||||
|
return
|
||||||
|
|
||||||
|
self._stop_flag.clear()
|
||||||
|
self._decode_thread = threading.Thread(target=self._decode_loop, daemon=True)
|
||||||
|
self._decode_thread.start()
|
||||||
|
print(f"Started decoder thread for {self.rtsp_url}")
|
||||||
|
|
||||||
|
def stop(self):
|
||||||
|
"""Stop the decoding thread and cleanup resources"""
|
||||||
|
self._stop_flag.set()
|
||||||
|
if self._decode_thread is not None:
|
||||||
|
self._decode_thread.join(timeout=5.0)
|
||||||
|
self._cleanup()
|
||||||
|
print(f"Stopped decoder for {self.rtsp_url}")
|
||||||
|
|
||||||
|
def _set_status(self, status: ConnectionStatus):
|
||||||
|
"""Thread-safe status update"""
|
||||||
|
with self._status_lock:
|
||||||
|
self.status = status
|
||||||
|
|
||||||
|
def get_status(self) -> ConnectionStatus:
|
||||||
|
"""Get current connection status"""
|
||||||
|
with self._status_lock:
|
||||||
|
return self.status
|
||||||
|
|
||||||
|
def _init_rtsp_connection(self) -> bool:
|
||||||
|
"""Initialize RTSP connection using PyAV + PyNvVideoCodec"""
|
||||||
|
try:
|
||||||
|
self._set_status(ConnectionStatus.CONNECTING)
|
||||||
|
|
||||||
|
# Open RTSP stream with PyAV
|
||||||
|
options = {
|
||||||
|
'rtsp_transport': 'tcp',
|
||||||
|
'max_delay': '500000', # 500ms
|
||||||
|
'rtsp_flags': 'prefer_tcp',
|
||||||
|
'timeout': '5000000', # 5 seconds
|
||||||
|
}
|
||||||
|
|
||||||
|
self.container = av.open(self.rtsp_url, options=options)
|
||||||
|
|
||||||
|
# Get video stream
|
||||||
|
video_stream = self.container.streams.video[0]
|
||||||
|
self.frame_width = video_stream.width
|
||||||
|
self.frame_height = video_stream.height
|
||||||
|
|
||||||
|
print(f"RTSP connected: {self.frame_width}x{self.frame_height}")
|
||||||
|
|
||||||
|
# Map codec name to PyNvVideoCodec codec enum
|
||||||
|
codec_map = {
|
||||||
|
'h264': nvc.cudaVideoCodec.H264,
|
||||||
|
'hevc': nvc.cudaVideoCodec.HEVC,
|
||||||
|
'h265': nvc.cudaVideoCodec.HEVC,
|
||||||
|
}
|
||||||
|
|
||||||
|
codec_id = codec_map.get(self.codec.lower(), nvc.cudaVideoCodec.H264)
|
||||||
|
|
||||||
|
# Initialize NVDEC decoder with shared CUDA context
|
||||||
|
self.decoder = nvc.CreateDecoder(
|
||||||
|
gpuid=self.gpu_id,
|
||||||
|
codec=codec_id,
|
||||||
|
cudacontext=self.cuda_context,
|
||||||
|
usedevicememory=True
|
||||||
|
)
|
||||||
|
|
||||||
|
self._set_status(ConnectionStatus.CONNECTED)
|
||||||
|
return True
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Failed to connect to RTSP stream {self.rtsp_url}: {e}")
|
||||||
|
self._set_status(ConnectionStatus.ERROR)
|
||||||
|
return False
|
||||||
|
|
||||||
|
def _decode_loop(self):
|
||||||
|
"""Main decode loop running in background thread"""
|
||||||
|
retry_count = 0
|
||||||
|
max_retries = 5
|
||||||
|
|
||||||
|
while not self._stop_flag.is_set():
|
||||||
|
# Initialize connection
|
||||||
|
if not self._init_rtsp_connection():
|
||||||
|
retry_count += 1
|
||||||
|
if retry_count >= max_retries:
|
||||||
|
print(f"Max retries reached for {self.rtsp_url}")
|
||||||
|
self._set_status(ConnectionStatus.ERROR)
|
||||||
|
break
|
||||||
|
|
||||||
|
self._set_status(ConnectionStatus.RECONNECTING)
|
||||||
|
self._stop_flag.wait(timeout=2.0)
|
||||||
|
continue
|
||||||
|
|
||||||
|
retry_count = 0 # Reset on successful connection
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Decode loop - iterate through packets from PyAV
|
||||||
|
for packet in self.container.demux(video=0):
|
||||||
|
if self._stop_flag.is_set():
|
||||||
|
break
|
||||||
|
|
||||||
|
if packet.dts is None:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Convert packet to numpy array
|
||||||
|
packet_data = np.frombuffer(bytes(packet), dtype=np.uint8)
|
||||||
|
|
||||||
|
# Create PacketData and pass numpy array pointer
|
||||||
|
pkt = nvc.PacketData()
|
||||||
|
pkt.bsl_data = packet_data.ctypes.data
|
||||||
|
pkt.bsl = len(packet_data)
|
||||||
|
|
||||||
|
# Decode using NVDEC
|
||||||
|
decoded_frames = self.decoder.Decode(pkt)
|
||||||
|
|
||||||
|
if not decoded_frames:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Add frames to ring buffer (thread-safe)
|
||||||
|
with self._buffer_lock:
|
||||||
|
for frame in decoded_frames:
|
||||||
|
self.frame_buffer.append(frame)
|
||||||
|
self.frame_count += 1
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error in decode loop for {self.rtsp_url}: {e}")
|
||||||
|
self._set_status(ConnectionStatus.RECONNECTING)
|
||||||
|
self._cleanup()
|
||||||
|
self._stop_flag.wait(timeout=2.0)
|
||||||
|
|
||||||
|
def _cleanup(self):
|
||||||
|
"""Cleanup resources"""
|
||||||
|
if self.container:
|
||||||
|
try:
|
||||||
|
self.container.close()
|
||||||
|
except Exception:
    # Container may already be closed; ignore errors during cleanup
    pass
|
||||||
|
self.container = None
|
||||||
|
|
||||||
|
self.decoder = None
|
||||||
|
|
||||||
|
with self._buffer_lock:
|
||||||
|
self.frame_buffer.clear()
|
||||||
|
|
||||||
|
def get_frame(self, index: int = -1, rgb: bool = True) -> Optional[torch.Tensor]:
|
||||||
|
"""
|
||||||
|
Get a frame from the buffer as a CUDA tensor (in VRAM).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
index: Frame index in buffer (-1 for latest, -2 for second latest, etc.)
|
||||||
|
rgb: If True, convert NV12 to RGB. If False, return raw NV12 format.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
torch.Tensor in CUDA memory (device tensor) or None if buffer empty
|
||||||
|
- If rgb=True: Shape (3, H, W) in RGB format, dtype uint8
|
||||||
|
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||||
|
"""
|
||||||
|
with self._buffer_lock:
|
||||||
|
if len(self.frame_buffer) == 0:
|
||||||
|
return None
|
||||||
|
|
||||||
|
try:
|
||||||
|
decoded_frame = self.frame_buffer[index]
|
||||||
|
|
||||||
|
# Convert DecodedFrame to PyTorch tensor using DLPack (zero-copy)
|
||||||
|
# This keeps the data in GPU memory
|
||||||
|
nv12_tensor = torch.from_dlpack(decoded_frame)
|
||||||
|
|
||||||
|
if not rgb:
|
||||||
|
# Return raw NV12 format
|
||||||
|
return nv12_tensor
|
||||||
|
|
||||||
|
# Convert NV12 to RGB on GPU
|
||||||
|
if self.frame_height is None or self.frame_width is None:
|
||||||
|
print("Frame dimensions not available")
|
||||||
|
return None
|
||||||
|
|
||||||
|
rgb_tensor = nv12_to_rgb_gpu(nv12_tensor, self.frame_height, self.frame_width)
|
||||||
|
return rgb_tensor
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error getting frame: {e}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
def get_latest_frame(self, rgb: bool = True) -> Optional[torch.Tensor]:
|
||||||
|
"""
|
||||||
|
Get the most recent decoded frame as CUDA tensor.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rgb: If True, convert to RGB. If False, return raw NV12.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
torch.Tensor on GPU in RGB (3, H, W) or NV12 (H*3/2, W) format
|
||||||
|
"""
|
||||||
|
return self.get_frame(-1, rgb=rgb)
|
||||||
|
|
||||||
|
def get_frame_cpu(self, index: int = -1, rgb: bool = True) -> Optional[np.ndarray]:
|
||||||
|
"""
|
||||||
|
Get a frame from the buffer and copy it to CPU memory as numpy array.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
index: Frame index in buffer (-1 for latest, -2 for second latest, etc.)
|
||||||
|
rgb: If True, convert NV12 to RGB. If False, return raw NV12 format.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
numpy.ndarray in CPU memory or None if buffer empty
|
||||||
|
- If rgb=True: Shape (H, W, 3) in RGB format, dtype uint8 (HWC format for easy display)
|
||||||
|
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||||
|
"""
|
||||||
|
# Get frame on GPU
|
||||||
|
gpu_frame = self.get_frame(index=index, rgb=rgb)
|
||||||
|
|
||||||
|
if gpu_frame is None:
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Transfer from GPU to CPU
|
||||||
|
cpu_tensor = gpu_frame.cpu()
|
||||||
|
|
||||||
|
# Convert to numpy array
|
||||||
|
if rgb:
|
||||||
|
# Convert from (3, H, W) to (H, W, 3) for standard image format
|
||||||
|
cpu_array = cpu_tensor.permute(1, 2, 0).numpy()
|
||||||
|
else:
|
||||||
|
# Keep NV12 format as-is
|
||||||
|
cpu_array = cpu_tensor.numpy()
|
||||||
|
|
||||||
|
return cpu_array
|
||||||
|
|
||||||
|
def get_latest_frame_cpu(self, rgb: bool = True) -> Optional[np.ndarray]:
|
||||||
|
"""
|
||||||
|
Get the most recent decoded frame as CPU numpy array.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
rgb: If True, convert to RGB. If False, return raw NV12.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
numpy.ndarray in CPU memory
|
||||||
|
- If rgb=True: Shape (H, W, 3) in RGB format, dtype uint8
|
||||||
|
- If rgb=False: Shape (H*3/2, W) in NV12 format, dtype uint8
|
||||||
|
"""
|
||||||
|
return self.get_frame_cpu(-1, rgb=rgb)
|
||||||
|
|
||||||
|
def get_buffer_size(self) -> int:
|
||||||
|
"""Get current number of frames in buffer"""
|
||||||
|
with self._buffer_lock:
|
||||||
|
return len(self.frame_buffer)
|
||||||
|
|
||||||
|
def is_connected(self) -> bool:
|
||||||
|
"""Check if stream is actively connected"""
|
||||||
|
return self.get_status() == ConnectionStatus.CONNECTED
|
||||||
|
|
||||||
|
def get_frame_as_jpeg(self, index: int = -1, quality: int = 95) -> Optional[bytes]:
|
||||||
|
"""
|
||||||
|
Get a frame from the buffer and encode to JPEG.
|
||||||
|
|
||||||
|
This method:
|
||||||
|
1. Gets RGB frame from buffer (stays on GPU)
|
||||||
|
2. Encodes to JPEG using nvJPEG (GPU operation via shared encoder)
|
||||||
|
3. Transfers JPEG bytes to CPU
|
||||||
|
4. Returns bytes for saving to disk
|
||||||
|
|
||||||
|
Args:
|
||||||
|
index: Frame index in buffer (-1 for latest)
|
||||||
|
quality: JPEG quality (0-100, default 95)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JPEG encoded bytes or None if frame unavailable
|
||||||
|
"""
|
||||||
|
# Get RGB frame (on GPU)
|
||||||
|
rgb_frame = self.get_frame(index=index, rgb=True)
|
||||||
|
|
||||||
|
# Use the shared JPEG encoder from jpeg_encoder module
|
||||||
|
return encode_frame_to_jpeg(rgb_frame, quality=quality)
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return (f"StreamDecoder(url={self.rtsp_url}, status={self.status.value}, "
|
||||||
|
f"buffer={self.get_buffer_size()}/{self.buffer_size}, "
|
||||||
|
f"frames_decoded={self.frame_count})")
|
||||||
174
test_jpeg_encode.py
Executable file
@@ -0,0 +1,174 @@
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Test script for JPEG encoding with nvImageCodec
|
||||||
|
Tests GPU-accelerated JPEG encoding from RTSP stream frames
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
from services import StreamDecoderFactory
|
||||||
|
|
||||||
|
# Load environment variables from .env file
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser(description='Test JPEG encoding from RTSP stream')
|
||||||
|
parser.add_argument(
|
||||||
|
'--rtsp-url',
|
||||||
|
type=str,
|
||||||
|
default=None,
|
||||||
|
help='RTSP stream URL (defaults to CAMERA_URL_1 from .env)'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--output-dir',
|
||||||
|
type=str,
|
||||||
|
default='./snapshots',
|
||||||
|
help='Output directory for JPEG files'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--num-frames',
|
||||||
|
type=int,
|
||||||
|
default=10,
|
||||||
|
help='Number of frames to capture'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--interval',
|
||||||
|
type=float,
|
||||||
|
default=1.0,
|
||||||
|
help='Interval between captures in seconds'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--quality',
|
||||||
|
type=int,
|
||||||
|
default=95,
|
||||||
|
help='JPEG quality (0-100)'
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
'--gpu-id',
|
||||||
|
type=int,
|
||||||
|
default=0,
|
||||||
|
help='GPU device ID'
|
||||||
|
)
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Get RTSP URL from command line or environment
|
||||||
|
rtsp_url = args.rtsp_url
|
||||||
|
if not rtsp_url:
|
||||||
|
rtsp_url = os.getenv('CAMERA_URL_1')
|
||||||
|
if not rtsp_url:
|
||||||
|
print("Error: No RTSP URL provided")
|
||||||
|
print("Please either:")
|
||||||
|
print(" 1. Use --rtsp-url argument, or")
|
||||||
|
print(" 2. Add CAMERA_URL_1 to your .env file")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
# Create output directory
|
||||||
|
output_dir = Path(args.output_dir)
|
||||||
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
print("=" * 80)
|
||||||
|
print("RTSP Stream JPEG Encoding Test")
|
||||||
|
print("=" * 80)
|
||||||
|
print(f"RTSP URL: {rtsp_url}")
|
||||||
|
print(f"Output Directory: {output_dir}")
|
||||||
|
print(f"Number of Frames: {args.num_frames}")
|
||||||
|
print(f"Capture Interval: {args.interval}s")
|
||||||
|
print(f"JPEG Quality: {args.quality}")
|
||||||
|
print(f"GPU ID: {args.gpu_id}")
|
||||||
|
print("=" * 80)
|
||||||
|
print()
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Initialize factory and decoder
|
||||||
|
print("[1/3] Initializing StreamDecoderFactory...")
|
||||||
|
factory = StreamDecoderFactory(gpu_id=args.gpu_id)
|
||||||
|
print("✓ Factory initialized\n")
|
||||||
|
|
||||||
|
print("[2/3] Creating and starting decoder...")
|
||||||
|
decoder = factory.create_decoder(
|
||||||
|
rtsp_url=rtsp_url,
|
||||||
|
buffer_size=30
|
||||||
|
)
|
||||||
|
decoder.start()
|
||||||
|
print("✓ Decoder started\n")
|
||||||
|
|
||||||
|
# Wait for connection
|
||||||
|
print("[3/3] Waiting for stream to connect...")
|
||||||
|
max_wait = 10
|
||||||
|
for i in range(max_wait):
|
||||||
|
if decoder.is_connected():
|
||||||
|
print("✓ Stream connected\n")
|
||||||
|
break
|
||||||
|
time.sleep(1)
|
||||||
|
print(f" Waiting... {i+1}/{max_wait}s")
|
||||||
|
else:
|
||||||
|
print("✗ Failed to connect to stream")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
# Capture frames
|
||||||
|
print(f"Capturing {args.num_frames} frames...")
|
||||||
|
print("-" * 80)
|
||||||
|
|
||||||
|
captured = 0
|
||||||
|
for i in range(args.num_frames):
|
||||||
|
# Get frame as JPEG
|
||||||
|
start_time = time.time()
|
||||||
|
jpeg_bytes = decoder.get_frame_as_jpeg(quality=args.quality)
|
||||||
|
encode_time = (time.time() - start_time) * 1000 # ms
|
||||||
|
|
||||||
|
if jpeg_bytes:
|
||||||
|
# Save to file
|
||||||
|
filename = output_dir / f"frame_{i:04d}.jpg"
|
||||||
|
with open(filename, 'wb') as f:
|
||||||
|
f.write(jpeg_bytes)
|
||||||
|
|
||||||
|
size_kb = len(jpeg_bytes) / 1024
|
||||||
|
print(f"[{i+1}/{args.num_frames}] Saved {filename.name} "
|
||||||
|
f"({size_kb:.1f} KB, encoded in {encode_time:.2f}ms)")
|
||||||
|
captured += 1
|
||||||
|
else:
|
||||||
|
print(f"[{i+1}/{args.num_frames}] Failed to get frame")
|
||||||
|
|
||||||
|
# Wait before next capture (except for last frame)
|
||||||
|
if i < args.num_frames - 1:
|
||||||
|
time.sleep(args.interval)
|
||||||
|
|
||||||
|
print("-" * 80)
|
||||||
|
|
||||||
|
# Summary
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("Capture Complete")
|
||||||
|
print("=" * 80)
|
||||||
|
print(f"Successfully captured: {captured}/{args.num_frames} frames")
|
||||||
|
print(f"Output directory: {output_dir.absolute()}")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
print("\n\n✗ Interrupted by user")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n\n✗ Error: {e}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Cleanup
|
||||||
|
if 'decoder' in locals():
|
||||||
|
print("\nCleaning up...")
|
||||||
|
decoder.stop()
|
||||||
|
print("✓ Decoder stopped")
|
||||||
|
|
||||||
|
print("\n✓ Test completed successfully")
|
||||||
|
sys.exit(0)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
310
test_model_inference.py
Normal file
@@ -0,0 +1,310 @@
"""
|
||||||
|
Test script for TensorRT Model Repository with multi-camera inference.
|
||||||
|
|
||||||
|
This demonstrates:
|
||||||
|
1. Loading the same model for multiple cameras (deduplication)
|
||||||
|
2. Context pool load balancing
|
||||||
|
3. GPU-to-GPU inference from RTSP streams
|
||||||
|
4. Memory efficiency with shared engines
|
||||||
|
"""
|
||||||
|
|
||||||
|
import time
|
||||||
|
import torch
|
||||||
|
from services.model_repository import TensorRTModelRepository
|
||||||
|
from services.stream_decoder import StreamDecoderFactory
|
||||||
|
|
||||||
|
|
||||||
|
def test_multi_camera_inference():
|
||||||
|
"""
|
||||||
|
Simulate multi-camera inference scenario.
|
||||||
|
|
||||||
|
Example: 100 cameras, all using the same YOLOv8 model
|
||||||
|
- Without pooling: 100 engines + 100 contexts in VRAM
|
||||||
|
- With pooling: 1 engine + 4 contexts in VRAM (huge savings!)
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Initialize model repository with context pooling
|
||||||
|
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||||
|
|
||||||
|
# Camera configurations (simulated)
|
||||||
|
camera_configs = [
|
||||||
|
{"id": "camera_1", "rtsp_url": "rtsp://camera1.local/stream"},
|
||||||
|
{"id": "camera_2", "rtsp_url": "rtsp://camera2.local/stream"},
|
||||||
|
{"id": "camera_3", "rtsp_url": "rtsp://camera3.local/stream"},
|
||||||
|
# ... imagine 100 cameras here
|
||||||
|
]
|
||||||
|
|
||||||
|
# Load the same model for all cameras
|
||||||
|
model_file = "models/yolov8n.trt" # Same file for all cameras
|
||||||
|
|
||||||
|
print("=" * 80)
|
||||||
|
print("LOADING MODELS FOR MULTIPLE CAMERAS")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
for config in camera_configs:
|
||||||
|
try:
|
||||||
|
# Each camera gets its own model_id, but shares the same engine!
|
||||||
|
metadata = repo.load_model(
|
||||||
|
model_id=config["id"],
|
||||||
|
file_path=model_file,
|
||||||
|
num_contexts=4 # 4 contexts shared across all cameras
|
||||||
|
)
|
||||||
|
print(f"\n✓ Loaded model for {config['id']}")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n✗ Failed to load model for {config['id']}: {e}")
|
||||||
|
|
||||||
|
# Show repository stats
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("REPOSITORY STATISTICS")
|
||||||
|
print("=" * 80)
|
||||||
|
stats = repo.get_stats()
|
||||||
|
print(f"Total model IDs: {stats['total_model_ids']}")
|
||||||
|
print(f"Unique engines in VRAM: {stats['unique_engines']}")
|
||||||
|
print(f"Total contexts: {stats['total_contexts']}")
|
||||||
|
print(f"Memory efficiency: {stats['memory_efficiency']}")
|
||||||
|
|
||||||
|
# Get detailed info for one camera
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("DETAILED MODEL INFO (camera_1)")
|
||||||
|
print("=" * 80)
|
||||||
|
info = repo.get_model_info("camera_1")
|
||||||
|
if info:
|
||||||
|
print(f"Model ID: {info['model_id']}")
|
||||||
|
print(f"File: {info['file_path']}")
|
||||||
|
print(f"File hash: {info['file_hash']}")
|
||||||
|
print(f"Engine references: {info['engine_references']}")
|
||||||
|
print(f"Context pool size: {info['context_pool_size']}")
|
||||||
|
print(f"Shared with: {info['shared_with_model_ids']}")
|
||||||
|
print(f"\nInputs:")
|
||||||
|
for name, spec in info['inputs'].items():
|
||||||
|
print(f" {name}: {spec['shape']} ({spec['dtype']})")
|
||||||
|
print(f"\nOutputs:")
|
||||||
|
for name, spec in info['outputs'].items():
|
||||||
|
print(f" {name}: {spec['shape']} ({spec['dtype']})")
|
||||||
|
|
||||||
|
# Simulate inference from multiple cameras
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("RUNNING INFERENCE (GPU-to-GPU)")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
# Create dummy input tensors (simulating frames from cameras)
|
||||||
|
# In real scenario, these come from StreamDecoder.get_frame()
|
||||||
|
batch_size = 1
|
||||||
|
channels = 3
|
||||||
|
height = 640
|
||||||
|
width = 640
|
||||||
|
|
||||||
|
for config in camera_configs:
|
||||||
|
try:
|
||||||
|
# Simulate getting frame from camera (already on GPU)
|
||||||
|
input_tensor = torch.rand(
|
||||||
|
batch_size, channels, height, width,
|
||||||
|
dtype=torch.float32,
|
||||||
|
device='cuda:0'
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run inference (stays in GPU)
|
||||||
|
start = time.time()
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id=config["id"],
|
||||||
|
inputs={"images": input_tensor}, # Adjust input name based on your model
|
||||||
|
synchronize=True,
|
||||||
|
timeout=5.0
|
||||||
|
)
|
||||||
|
elapsed = (time.time() - start) * 1000 # Convert to ms
|
||||||
|
|
||||||
|
print(f"\n{config['id']}: Inference completed in {elapsed:.2f}ms")
|
||||||
|
for name, tensor in outputs.items():
|
||||||
|
print(f" Output '{name}': {tensor.shape} on {tensor.device}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n{config['id']}: Inference failed: {e}")
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("CLEANUP")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
for config in camera_configs:
|
||||||
|
repo.unload_model(config["id"])
|
||||||
|
|
||||||
|
print("\nAll models unloaded.")
|
||||||
|
|
||||||
|
|
||||||
|
def test_rtsp_stream_with_inference():
|
||||||
|
"""
|
||||||
|
Real-world example: Decode RTSP stream and run inference.
|
||||||
|
Everything stays in GPU memory (zero CPU transfers).
|
||||||
|
"""
|
||||||
|
|
||||||
|
print("=" * 80)
|
||||||
|
print("RTSP STREAM + TENSORRT INFERENCE (GPU-to-GPU)")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
# Initialize components
|
||||||
|
decoder_factory = StreamDecoderFactory(gpu_id=0)
|
||||||
|
model_repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||||
|
|
||||||
|
# Setup camera stream
|
||||||
|
rtsp_url = "rtsp://your-camera-ip/stream"
|
||||||
|
decoder = decoder_factory.create_decoder(rtsp_url, buffer_size=30)
|
||||||
|
decoder.start()
|
||||||
|
|
||||||
|
# Load inference model
|
||||||
|
try:
|
||||||
|
model_repo.load_model(
|
||||||
|
model_id="camera_main",
|
||||||
|
file_path="models/yolov8n.trt"
|
||||||
|
)
|
||||||
|
except FileNotFoundError:
|
||||||
|
print("\n⚠ Model file not found. Please export your model to TensorRT:")
|
||||||
|
print(" Example: yolo export model=yolov8n.pt format=engine device=0")
|
||||||
|
return
|
||||||
|
|
||||||
|
print("\nWaiting for stream to buffer frames...")
|
||||||
|
time.sleep(3)
|
||||||
|
|
||||||
|
# Process frames
|
||||||
|
for i in range(10):
|
||||||
|
# Get frame from decoder (already on GPU)
|
||||||
|
frame_gpu = decoder.get_latest_frame(rgb=True) # Returns torch.Tensor on CUDA
|
||||||
|
|
||||||
|
if frame_gpu is None:
|
||||||
|
print(f"Frame {i}: No frame available")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Preprocess if needed (stays on GPU)
|
||||||
|
# For YOLOv8: normalize, resize, etc.
|
||||||
|
# Example preprocessing (adjust for your model):
|
||||||
|
frame_gpu = frame_gpu.float() / 255.0 # Normalize to [0, 1]
|
||||||
|
frame_gpu = frame_gpu.unsqueeze(0) # Add batch dimension: (1, 3, H, W)
|
||||||
|
|
||||||
|
# Run inference (GPU-to-GPU, zero copy)
|
||||||
|
try:
|
||||||
|
outputs = model_repo.infer(
|
||||||
|
model_id="camera_main",
|
||||||
|
inputs={"images": frame_gpu},
|
||||||
|
synchronize=True
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"\nFrame {i}: Inference successful")
|
||||||
|
for name, tensor in outputs.items():
|
||||||
|
print(f" {name}: {tensor.shape} on {tensor.device}")
|
||||||
|
|
||||||
|
# Post-process results (can stay on GPU or move to CPU as needed)
|
||||||
|
# Example: NMS, bounding box extraction, etc.
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\nFrame {i}: Inference failed: {e}")
|
||||||
|
|
||||||
|
time.sleep(0.1) # Simulate processing interval
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
decoder.stop()
|
||||||
|
model_repo.unload_model("camera_main")
|
||||||
|
print("\n✓ Test completed successfully")
|
||||||
|
|
||||||
|
|
||||||
|
def test_concurrent_inference():
|
||||||
|
"""
|
||||||
|
Test concurrent inference from multiple threads.
|
||||||
|
Demonstrates context pool load balancing.
|
||||||
|
"""
|
||||||
|
import threading
|
||||||
|
|
||||||
|
print("=" * 80)
|
||||||
|
print("CONCURRENT INFERENCE TEST (Context Pool Load Balancing)")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=4)
|
||||||
|
|
||||||
|
# Load model
|
||||||
|
try:
|
||||||
|
repo.load_model("shared_model", "models/yolov8n.trt", num_contexts=4)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Failed to load model: {e}")
|
||||||
|
return
|
||||||
|
|
||||||
|
def worker(worker_id: int, num_inferences: int):
|
||||||
|
"""Worker thread performing inference"""
|
||||||
|
for i in range(num_inferences):
|
||||||
|
try:
|
||||||
|
# Create dummy input
|
||||||
|
input_tensor = torch.rand(1, 3, 640, 640, device='cuda:0', dtype=torch.float32)
|
||||||
|
|
||||||
|
# Acquire context from pool, run inference, release context
|
||||||
|
outputs = repo.infer(
|
||||||
|
model_id="shared_model",
|
||||||
|
inputs={"images": input_tensor},
|
||||||
|
timeout=10.0
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"Worker {worker_id}, Inference {i}: SUCCESS")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Worker {worker_id}, Inference {i}: FAILED - {e}")
|
||||||
|
|
||||||
|
time.sleep(0.01) # Small delay
|
||||||
|
|
||||||
|
# Launch multiple worker threads (more workers than contexts!)
|
||||||
|
threads = []
|
||||||
|
num_workers = 10 # 10 workers sharing 4 contexts
|
||||||
|
inferences_per_worker = 5
|
||||||
|
|
||||||
|
print(f"\nLaunching {num_workers} workers (only 4 contexts available)")
|
||||||
|
print("Contexts will be borrowed/returned automatically\n")
|
||||||
|
|
||||||
|
start_time = time.time()
|
||||||
|
|
||||||
|
for worker_id in range(num_workers):
|
||||||
|
t = threading.Thread(target=worker, args=(worker_id, inferences_per_worker))
|
||||||
|
threads.append(t)
|
||||||
|
t.start()
|
||||||
|
|
||||||
|
# Wait for all workers
|
||||||
|
for t in threads:
|
||||||
|
t.join()
|
||||||
|
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
total_inferences = num_workers * inferences_per_worker
|
||||||
|
|
||||||
|
print(f"\n✓ Completed {total_inferences} inferences in {elapsed:.2f}s")
|
||||||
|
print(f" Throughput: {total_inferences / elapsed:.2f} inferences/sec")
|
||||||
|
print(f" With only 4 contexts for {num_workers} workers!")
|
||||||
|
|
||||||
|
repo.unload_model("shared_model")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("TENSORRT MODEL REPOSITORY - TEST SUITE")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
# Test 1: Multi-camera model loading
|
||||||
|
print("\n\nTEST 1: Multi-Camera Model Loading with Deduplication")
|
||||||
|
print("-" * 80)
|
||||||
|
try:
|
||||||
|
test_multi_camera_inference()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Test 1 failed: {e}")
|
||||||
|
|
||||||
|
# Test 2: RTSP stream + inference (commented out by default)
|
||||||
|
# Uncomment if you have a real RTSP stream
|
||||||
|
# print("\n\nTEST 2: RTSP Stream + Inference")
|
||||||
|
# print("-" * 80)
|
||||||
|
# try:
|
||||||
|
# test_rtsp_stream_with_inference()
|
||||||
|
# except Exception as e:
|
||||||
|
# print(f"Test 2 failed: {e}")
|
||||||
|
|
||||||
|
# Test 3: Concurrent inference
|
||||||
|
print("\n\nTEST 3: Concurrent Inference with Context Pooling")
|
||||||
|
print("-" * 80)
|
||||||
|
try:
|
||||||
|
test_concurrent_inference()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Test 3 failed: {e}")
|
||||||
|
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print("ALL TESTS COMPLETED")
|
||||||
|
print("=" * 80)
|
||||||
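The inference calls above assume a 640×640 input named `images` and feed synthetic `torch.rand` tensors. As a hedged sketch only (the real preprocessing depends on how the engine was exported), a frame from `StreamDecoder.get_latest_frame(rgb=True)` could be shaped into that input roughly as follows; `preprocess_for_yolov8` is a hypothetical helper, not part of the repository API:

```python
import torch
import torch.nn.functional as F


def preprocess_for_yolov8(frame_chw_uint8: torch.Tensor, size: int = 640) -> torch.Tensor:
    """Hypothetical helper: turn a decoded RGB frame (3, H, W, uint8, on CUDA)
    into the (1, 3, size, size) float32 tensor the tests feed as 'images'.
    Plain resize only; add letterboxing if your exported engine expects it."""
    x = frame_chw_uint8.unsqueeze(0).float() / 255.0           # (1, 3, H, W), values in [0, 1]
    x = F.interpolate(x, size=(size, size), mode='bilinear',   # resize entirely on the GPU
                      align_corners=False)
    return x.contiguous()
```

A call such as `repo.infer(model_id="camera_1", inputs={"images": preprocess_for_yolov8(frame_gpu)})` would then keep the whole path GPU-resident, matching the zero-copy design described earlier.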
255
test_multi_stream.py
Executable file
@ -0,0 +1,255 @@
#!/usr/bin/env python3
"""
Multi-stream test script to verify CUDA context sharing efficiency.
Tests multiple RTSP streams simultaneously and monitors VRAM usage.
"""

import argparse
import time
import sys
import subprocess
import os
from pathlib import Path
from dotenv import load_dotenv
from services import StreamDecoderFactory, ConnectionStatus

# Load environment variables from .env file
load_dotenv()


def get_gpu_memory_usage(gpu_id: int = 0) -> int:
    """Get current GPU memory usage in MB using nvidia-smi"""
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits', f'--id={gpu_id}'],
            capture_output=True,
            text=True,
            check=True
        )
        return int(result.stdout.strip())
    except Exception as e:
        print(f"Warning: Could not get GPU memory usage: {e}")
        return 0


def main():
    parser = argparse.ArgumentParser(description='Test multi-stream decoding with context sharing')
    parser.add_argument(
        '--gpu-id',
        type=int,
        default=0,
        help='GPU device ID'
    )
    parser.add_argument(
        '--duration',
        type=int,
        default=20,
        help='Test duration in seconds'
    )
    parser.add_argument(
        '--capture-snapshots',
        action='store_true',
        help='Capture JPEG snapshots during test'
    )
    parser.add_argument(
        '--output-dir',
        type=str,
        default='./multi_stream_snapshots',
        help='Output directory for snapshots'
    )

    args = parser.parse_args()

    # Load camera URLs from environment
    camera_urls = []
    i = 1
    while True:
        url = os.getenv(f'CAMERA_URL_{i}')
        if url:
            camera_urls.append(url)
            i += 1
        else:
            break

    if not camera_urls:
        print("Error: No camera URLs found in .env file")
        print("Please add CAMERA_URL_1, CAMERA_URL_2, etc. to your .env file")
        sys.exit(1)

    # Create output directory if capturing snapshots
    if args.capture_snapshots:
        output_dir = Path(args.output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 80)
    print("Multi-Stream RTSP Decoder Test - Context Sharing Verification")
    print("=" * 80)
    print(f"Number of Streams: {len(camera_urls)}")
    print(f"GPU ID: {args.gpu_id}")
    print(f"Test Duration: {args.duration} seconds")
    print(f"Capture Snapshots: {args.capture_snapshots}")
    print("=" * 80)
    print()

    try:
        # Get baseline GPU memory
        print("[Baseline] Measuring initial GPU memory usage...")
        baseline_memory = get_gpu_memory_usage(args.gpu_id)
        print(f"✓ Baseline VRAM: {baseline_memory} MB\n")

        # Initialize factory (shared CUDA context)
        print("[1/4] Initializing StreamDecoderFactory with shared CUDA context...")
        factory = StreamDecoderFactory(gpu_id=args.gpu_id)

        factory_memory = get_gpu_memory_usage(args.gpu_id)
        factory_overhead = factory_memory - baseline_memory
        print(f"✓ Factory initialized")
        print(f" VRAM after factory: {factory_memory} MB (+{factory_overhead} MB)\n")

        # Create all decoders
        print(f"[2/4] Creating {len(camera_urls)} StreamDecoder instances...")
        decoders = []
        for i, url in enumerate(camera_urls):
            decoder = factory.create_decoder(
                rtsp_url=url,
                buffer_size=30,
                codec='h264'
            )
            decoders.append(decoder)
            print(f" ✓ Decoder {i+1} created for camera {url.split('@')[1].split('/')[0]}")

        decoders_memory = get_gpu_memory_usage(args.gpu_id)
        decoders_overhead = decoders_memory - factory_memory
        print(f"\n VRAM after creating {len(decoders)} decoders: {decoders_memory} MB (+{decoders_overhead} MB)")
        print(f" Average per decoder: {decoders_overhead / len(decoders):.1f} MB\n")

        # Start all decoders
        print(f"[3/4] Starting all {len(decoders)} decoders...")
        for i, decoder in enumerate(decoders):
            decoder.start()
            print(f" ✓ Decoder {i+1} started")

        started_memory = get_gpu_memory_usage(args.gpu_id)
        started_overhead = started_memory - decoders_memory
        print(f"\n VRAM after starting decoders: {started_memory} MB (+{started_overhead} MB)")
        print(f" Average per running decoder: {started_overhead / len(decoders):.1f} MB\n")

        # Wait for all streams to connect
        print("[4/4] Waiting for all streams to connect...")
        max_wait = 15
        for wait_time in range(max_wait):
            connected = sum(1 for d in decoders if d.is_connected())
            print(f" Connected: {connected}/{len(decoders)} streams", end='\r')

            if connected == len(decoders):
                print(f"\n✓ All {len(decoders)} streams connected!\n")
                break

            time.sleep(1)
        else:
            connected = sum(1 for d in decoders if d.is_connected())
            print(f"\n⚠ Only {connected}/{len(decoders)} streams connected after {max_wait}s\n")

        connected_memory = get_gpu_memory_usage(args.gpu_id)
        connected_overhead = connected_memory - started_memory
        print(f" VRAM after connection: {connected_memory} MB (+{connected_overhead} MB)\n")

        # Monitor streams
        print(f"Monitoring streams for {args.duration} seconds...")
        print("=" * 80)
        print(f"{'Time':<8} {'VRAM':<10} {'Stream 1':<12} {'Stream 2':<12} {'Stream 3':<12} {'Stream 4':<12}")
        print("-" * 80)

        start_time = time.time()
        snapshot_interval = args.duration // 3 if args.capture_snapshots else 0
        last_snapshot = 0

        while time.time() - start_time < args.duration:
            elapsed = time.time() - start_time
            current_memory = get_gpu_memory_usage(args.gpu_id)

            # Get stats for each decoder
            stats = []
            for decoder in decoders:
                status = decoder.get_status().value[:8]
                buffer = decoder.get_buffer_size()
                frames = decoder.frame_count
                stats.append(f"{status:8s} {buffer:2d}/30 {frames:4d}")

            print(f"{elapsed:6.1f}s {current_memory:6d}MB {stats[0]:<12} {stats[1]:<12} {stats[2]:<12} {stats[3]:<12}")

            # Capture snapshots
            if args.capture_snapshots and snapshot_interval > 0:
                if elapsed - last_snapshot >= snapshot_interval:
                    print("\n → Capturing snapshots from all streams...")
                    for i, decoder in enumerate(decoders):
                        jpeg_bytes = decoder.get_frame_as_jpeg(quality=85)
                        if jpeg_bytes:
                            filename = output_dir / f"camera_{i+1}_t{int(elapsed)}s.jpg"
                            with open(filename, 'wb') as f:
                                f.write(jpeg_bytes)
                            print(f" Saved {filename.name} ({len(jpeg_bytes)/1024:.1f} KB)")
                    print()
                    last_snapshot = elapsed

            time.sleep(1)

        print("=" * 80)

        # Final memory analysis
        final_memory = get_gpu_memory_usage(args.gpu_id)
        total_overhead = final_memory - baseline_memory

        print("\n" + "=" * 80)
        print("Memory Usage Analysis")
        print("=" * 80)
        print(f"Baseline VRAM: {baseline_memory:6d} MB")
        print(f"After Factory Init: {factory_memory:6d} MB (+{factory_overhead:4d} MB)")
        print(f"After Creating {len(decoders)} Decoders: {decoders_memory:6d} MB (+{decoders_overhead:4d} MB)")
        print(f"After Starting Decoders: {started_memory:6d} MB (+{started_overhead:4d} MB)")
        print(f"After Connection: {connected_memory:6d} MB (+{connected_overhead:4d} MB)")
        print(f"Final (after {args.duration}s): {final_memory:6d} MB (+{total_overhead:4d} MB total)")
        print("-" * 80)
        print(f"Average VRAM per stream: {total_overhead / len(decoders):6.1f} MB")
        print(f"Context sharing efficiency: {'EXCELLENT' if total_overhead < 500 else 'GOOD' if total_overhead < 800 else 'POOR'}")
        print("=" * 80)

        # Final stats
        print("\nFinal Stream Statistics:")
        print("-" * 80)
        for i, decoder in enumerate(decoders):
            status = decoder.get_status().value
            buffer = decoder.get_buffer_size()
            frames = decoder.frame_count
            fps = frames / args.duration if args.duration > 0 else 0
            print(f"Stream {i+1}: {status:12s} | Buffer: {buffer:2d}/{decoder.buffer_size} | "
                  f"Frames: {frames:5d} | Avg FPS: {fps:5.2f}")
        print("=" * 80)

    except KeyboardInterrupt:
        print("\n\n✗ Interrupted by user")
        sys.exit(1)

    except Exception as e:
        print(f"\n\n✗ Error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

    finally:
        # Cleanup
        if 'decoders' in locals():
            print("\nCleaning up...")
            for i, decoder in enumerate(decoders):
                decoder.stop()
                print(f" ✓ Decoder {i+1} stopped")

            cleanup_memory = get_gpu_memory_usage(args.gpu_id)
            print(f"\nVRAM after cleanup: {cleanup_memory} MB")

        print("\n✓ Multi-stream test completed successfully")
        sys.exit(0)


if __name__ == '__main__':
    main()
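Given the per-stream figures this test prints, a rough capacity estimate follows from the linear model total ≈ baseline + per-stream × n. The helper below is a sketch under that assumption; the 60 MB default slope and the `headroom_mb` reservation are illustrative values to be replaced with numbers from your own run, and the function is not part of the test script:

```python
def estimate_stream_capacity(total_vram_mb: int,
                             baseline_mb: int,
                             per_stream_mb: float = 60.0,
                             headroom_mb: int = 1024) -> int:
    """Rough capacity estimate assuming the linear scaling this test measures:
    total ≈ baseline + per_stream_mb * n. per_stream_mb defaults to the ~60 MB
    per-stream figure claimed for this pipeline; headroom_mb reserves VRAM for
    inference engines and transient allocations."""
    budget = total_vram_mb - baseline_mb - headroom_mb
    return max(0, int(budget // per_stream_mb))


# Example: a 24 GB card with a ~500 MB baseline leaves room for roughly
# estimate_stream_capacity(24576, 500) == 384 decode-only streams.
```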
152
test_stream.py
Executable file
@ -0,0 +1,152 @@
#!/usr/bin/env python3
"""
CLI test script for StreamDecoder
Tests RTSP stream decoding with NVDEC hardware acceleration
"""

import argparse
import time
import sys
from services.stream_decoder import StreamDecoderFactory, ConnectionStatus


def main():
    parser = argparse.ArgumentParser(description='Test RTSP stream decoder with NVDEC')
    parser.add_argument(
        '--rtsp-url',
        type=str,
        required=True,
        help='RTSP stream URL (e.g., rtsp://user:pass@host/path)'
    )
    parser.add_argument(
        '--gpu-id',
        type=int,
        default=0,
        help='GPU device ID'
    )
    parser.add_argument(
        '--buffer-size',
        type=int,
        default=30,
        help='Frame buffer size'
    )
    parser.add_argument(
        '--duration',
        type=int,
        default=30,
        help='Test duration in seconds'
    )
    parser.add_argument(
        '--check-interval',
        type=float,
        default=1.0,
        help='Status check interval in seconds'
    )

    args = parser.parse_args()

    print("=" * 80)
    print("RTSP Stream Decoder Test")
    print("=" * 80)
    print(f"RTSP URL: {args.rtsp_url}")
    print(f"GPU ID: {args.gpu_id}")
    print(f"Buffer Size: {args.buffer_size} frames")
    print(f"Test Duration: {args.duration} seconds")
    print("=" * 80)
    print()

    try:
        # Create factory with shared CUDA context
        print("[1/4] Initializing StreamDecoderFactory...")
        factory = StreamDecoderFactory(gpu_id=args.gpu_id)
        print("✓ Factory initialized with shared CUDA context\n")

        # Create decoder
        print("[2/4] Creating StreamDecoder...")
        decoder = factory.create_decoder(
            rtsp_url=args.rtsp_url,
            buffer_size=args.buffer_size,
            codec='h264'
        )
        print(f"✓ Decoder created: {decoder}\n")

        # Start decoding
        print("[3/4] Starting decoder thread...")
        decoder.start()
        print("✓ Decoder thread started\n")

        # Monitor for specified duration
        print(f"[4/4] Monitoring stream for {args.duration} seconds...")
        print("-" * 80)

        start_time = time.time()
        last_frame_count = 0

        while time.time() - start_time < args.duration:
            time.sleep(args.check_interval)

            # Get status
            status = decoder.get_status()
            buffer_size = decoder.get_buffer_size()
            frame_count = decoder.frame_count
            fps = (frame_count - last_frame_count) / args.check_interval
            last_frame_count = frame_count

            # Print status
            elapsed = time.time() - start_time
            print(f"[{elapsed:6.1f}s] Status: {status.value:12s} | "
                  f"Buffer: {buffer_size:2d}/{args.buffer_size:2d} | "
                  f"Frames: {frame_count:5d} | "
                  f"FPS: {fps:5.1f}")

            # Try to get latest frame
            if status == ConnectionStatus.CONNECTED:
                frame = decoder.get_latest_frame()
                if frame is not None:
                    print(f" Frame shape: {frame.shape}, dtype: {frame.dtype}, "
                          f"device: {frame.device}")

            # Check for errors
            if status == ConnectionStatus.ERROR:
                print("\n✗ ERROR: Stream connection failed!")
                break

        print("-" * 80)

        # Final statistics
        print("\n" + "=" * 80)
        print("Test Complete - Final Statistics")
        print("=" * 80)
        print(f"Total Frames Decoded: {decoder.frame_count}")
        print(f"Average FPS: {decoder.frame_count / args.duration:.2f}")
        print(f"Final Status: {decoder.get_status().value}")
        print(f"Buffer Utilization: {decoder.get_buffer_size()}/{args.buffer_size}")

        if decoder.frame_width and decoder.frame_height:
            print(f"Frame Resolution: {decoder.frame_width}x{decoder.frame_height}")

        print("=" * 80)

    except KeyboardInterrupt:
        print("\n\n✗ Interrupted by user")
        sys.exit(1)

    except Exception as e:
        print(f"\n\n✗ Error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

    finally:
        # Cleanup
        if 'decoder' in locals():
            print("\nCleaning up...")
            decoder.stop()
            print("✓ Decoder stopped")

        print("\n✓ Test completed successfully")
        sys.exit(0)


if __name__ == '__main__':
    main()
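Both CLI tests open-code a connect-and-poll loop around `get_status()`. A reusable variant might look like the sketch below; it only assumes the `ConnectionStatus` values and `get_status()` method already used above, and is not part of the committed scripts:

```python
import time

from services.stream_decoder import ConnectionStatus


def wait_until_connected(decoder, timeout: float = 15.0, poll: float = 0.5) -> bool:
    """Sketch of a reusable version of the polling loops above: block until the
    decoder reports CONNECTED, bail out early on ERROR, give up after timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = decoder.get_status()
        if status == ConnectionStatus.CONNECTED:
            return True
        if status == ConnectionStatus.ERROR:
            return False
        time.sleep(poll)
    return False
```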
143
test_vram_process.py
Normal file
@ -0,0 +1,143 @@
#!/usr/bin/env python3
"""
VRAM scaling test - measures Python process memory usage for 1, 2, 3, and 4 streams.
"""

import os
import time
import subprocess
from dotenv import load_dotenv
from services import StreamDecoderFactory

# Load environment variables from .env file
load_dotenv()

# Load camera URLs from environment
camera_urls = []
i = 1
while True:
    url = os.getenv(f'CAMERA_URL_{i}')
    if url:
        camera_urls.append(url)
        i += 1
    else:
        break

if not camera_urls:
    print("Error: No camera URLs found in .env file")
    print("Please add CAMERA_URL_1, CAMERA_URL_2, etc. to your .env file")
    exit(1)


def get_python_gpu_memory():
    """Get Python process GPU memory usage in MB"""
    try:
        pid = os.getpid()
        result = subprocess.run(
            ['nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv,noheader,nounits'],
            capture_output=True, text=True, check=True
        )
        for line in result.stdout.strip().split('\n'):
            if line:
                parts = line.split(',')
                if len(parts) >= 2 and int(parts[0].strip()) == pid:
                    return int(parts[1].strip())
        return 0
    except Exception:
        return 0


def test_n_streams(n, wait_time=15):
    """Test with n streams"""
    print(f"\n{'='*80}")
    print(f"Testing with {n} stream(s)")
    print('='*80)

    mem_before = get_python_gpu_memory()
    print(f"Python process VRAM before: {mem_before} MB")

    # Create factory
    factory = StreamDecoderFactory(gpu_id=0)
    time.sleep(1)
    mem_after_factory = get_python_gpu_memory()
    print(f"After factory: {mem_after_factory} MB (+{mem_after_factory - mem_before} MB)")

    # Create decoders
    decoders = []
    for i in range(n):
        decoder = factory.create_decoder(camera_urls[i], buffer_size=30)
        decoders.append(decoder)

    time.sleep(1)
    mem_after_create = get_python_gpu_memory()
    print(f"After creating {n} decoder(s): {mem_after_create} MB (+{mem_after_create - mem_after_factory} MB)")

    # Start decoders
    for decoder in decoders:
        decoder.start()

    time.sleep(2)
    mem_after_start = get_python_gpu_memory()
    print(f"After starting {n} decoder(s): {mem_after_start} MB (+{mem_after_start - mem_after_create} MB)")

    # Wait for connection
    print(f"Waiting {wait_time}s for streams to connect and stabilize...")
    time.sleep(wait_time)

    # Check connection status
    connected = sum(1 for d in decoders if d.is_connected())
    mem_stable = get_python_gpu_memory()

    print(f"Connected: {connected}/{n} streams")
    print(f"Python process VRAM (stable): {mem_stable} MB")

    # Get frame stats
    for i, decoder in enumerate(decoders):
        print(f" Stream {i+1}: {decoder.get_status().value:10s} "
              f"Buffer: {decoder.get_buffer_size()}/30 "
              f"Frames: {decoder.frame_count}")

    # Cleanup
    for decoder in decoders:
        decoder.stop()

    time.sleep(2)
    mem_after_cleanup = get_python_gpu_memory()
    print(f"After cleanup: {mem_after_cleanup} MB")

    return mem_stable


if __name__ == '__main__':
    print("Python VRAM Scaling Test")
    print(f"PID: {os.getpid()}")

    baseline = get_python_gpu_memory()
    print(f"Baseline Python process VRAM: {baseline} MB\n")

    results = {}
    for n in [1, 2, 3, 4]:
        mem = test_n_streams(n, wait_time=15)
        results[n] = mem
        print(f"\n→ {n} stream(s): {mem} MB (process total)")

        # Give time between tests
        if n < 4:
            print("\nWaiting 5s before next test...")
            time.sleep(5)

    # Summary
    print("\n" + "="*80)
    print("Python Process VRAM Scaling Summary")
    print("="*80)
    print(f"Baseline: {baseline:4d} MB")
    for n in [1, 2, 3, 4]:
        total = results[n]
        overhead = total - baseline
        per_stream = overhead / n if n > 0 else 0
        print(f"{n} stream(s): {total:4d} MB (+{overhead:3d} MB total, {per_stream:5.1f} MB per stream)")

    # Calculate marginal cost
    print("\nMarginal cost per additional stream:")
    for n in [2, 3, 4]:
        marginal = results[n] - results[n-1]
        print(f" Stream {n}: +{marginal} MB")

    print("="*80)
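The summary above reports per-stream and marginal VRAM costs directly from the four measurements. If a single intercept/slope is wanted from the same `results` dict (fixed overhead plus marginal cost per stream), an ordinary least-squares fit is enough. This is a sketch, not part of the test script:

```python
def fit_linear_vram_model(results: dict[int, int]) -> tuple[float, float]:
    """Least-squares fit of total_mb ≈ fixed_mb + per_stream_mb * n over the
    measurements gathered above (results maps stream count -> process VRAM in MB).
    Quantifies the linear-scaling claim from this test; needs >= 2 data points."""
    ns = list(results.keys())
    ys = [results[n] for n in ns]
    mean_n = sum(ns) / len(ns)
    mean_y = sum(ys) / len(ys)
    slope = sum((n - mean_n) * (y - mean_y) for n, y in zip(ns, ys)) / \
            sum((n - mean_n) ** 2 for n in ns)
    intercept = mean_y - slope * mean_n
    return intercept, slope  # (fixed overhead in MB, marginal MB per stream)
```

On a run matching the figures quoted for this pipeline, the returned slope should land near the ~60 MB-per-stream marginal cost.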
85
verify_tensorrt_model.py
Normal file
@ -0,0 +1,85 @@
#!/usr/bin/env python3
"""
Quick verification script for TensorRT model
"""

import torch
from services.model_repository import TensorRTModelRepository


def verify_model():
    print("=" * 80)
    print("TensorRT Model Verification")
    print("=" * 80)

    # Initialize repository
    repo = TensorRTModelRepository(gpu_id=0, default_num_contexts=2)

    # Load the model
    print("\nLoading YOLOv8n TensorRT engine...")
    try:
        metadata = repo.load_model(
            model_id="yolov8n_test",
            file_path="models/yolov8n.trt",
            num_contexts=2
        )
        print("✓ Model loaded successfully!")
    except Exception as e:
        print(f"✗ Failed to load model: {e}")
        return

    # Get model info
    print("\n" + "=" * 80)
    print("Model Information")
    print("=" * 80)
    info = repo.get_model_info("yolov8n_test")
    if info:
        print(f"Model ID: {info['model_id']}")
        print(f"File: {info['file_path']}")
        print(f"File hash: {info['file_hash']}")
        print(f"\nInputs:")
        for name, spec in info['inputs'].items():
            print(f" {name}: {spec['shape']} ({spec['dtype']})")
        print(f"\nOutputs:")
        for name, spec in info['outputs'].items():
            print(f" {name}: {spec['shape']} ({spec['dtype']})")

    # Run test inference
    print("\n" + "=" * 80)
    print("Running Test Inference")
    print("=" * 80)

    try:
        # Create dummy input (simulating a 640x640 image)
        input_tensor = torch.rand(1, 3, 640, 640, dtype=torch.float32, device='cuda:0')
        print(f"Input tensor: {input_tensor.shape} on {input_tensor.device}")

        # Run inference
        outputs = repo.infer(
            model_id="yolov8n_test",
            inputs={"images": input_tensor},
            synchronize=True
        )

        print("\n✓ Inference successful!")
        print("\nOutputs:")
        for name, tensor in outputs.items():
            print(f" {name}: {tensor.shape} on {tensor.device} ({tensor.dtype})")

    except Exception as e:
        print(f"\n✗ Inference failed: {e}")
        import traceback
        traceback.print_exc()

    # Cleanup
    print("\n" + "=" * 80)
    print("Cleanup")
    print("=" * 80)
    repo.unload_model("yolov8n_test")
    print("✓ Model unloaded")

    print("\n" + "=" * 80)
    print("Verification Complete!")
    print("=" * 80)


if __name__ == "__main__":
    verify_model()
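Both inference tests expect a prebuilt engine at `models/yolov8n.trt` and print the `yolo export model=yolov8n.pt format=engine device=0` hint when it is missing. A Python-side export might look like the sketch below; it relies on the Ultralytics `YOLO(...).export()` API (argument names may vary by version), and the rename step is an assumption about where this project expects the file:

```python
# Minimal export sketch mirroring the CLI hint printed by the tests.
# The exported file is typically written next to the .pt as yolov8n.engine,
# so it is moved/renamed to the path the repository loads from.
from pathlib import Path
from shutil import move

from ultralytics import YOLO

engine_path = YOLO("yolov8n.pt").export(format="engine", device=0)  # build TensorRT engine
Path("models").mkdir(exist_ok=True)
move(str(engine_path), "models/yolov8n.trt")
```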