diff --git a/claude.md b/claude.md
index 75f833d..8547b53 100644
--- a/claude.md
+++ b/claude.md
@@ -122,7 +122,7 @@ jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
 
 ## Performance Metrics
 
-### VRAM Usage (Python Process)
+### VRAM Usage (at 720p)
 
 | Streams | Total VRAM | Overhead | Per Stream | Marginal Cost |
 |---------|-----------|----------|------------|---------------|
@@ -134,24 +134,10 @@ jpeg_data = encoder.encode(nv_image, "jpeg", encode_params)
 
 **Result:** Perfect linear scaling at ~60 MB per stream
 
-### Capacity Estimates
-
-With 60 MB per stream + 216 MB baseline:
-
-- **16GB GPU**: ~269 cameras (conservative: ~250)
-- **24GB GPU**: ~407 cameras (conservative: ~380)
-- **48GB GPU**: ~815 cameras (conservative: ~780)
-- **For 1000 streams**: ~60GB VRAM required
-
-### Throughput
-
-- **Frame Rate**: 7-7.5 FPS per stream @ 720p
-- **JPEG Encoding**: 1-2ms per frame
-- **Connection Time**: ~15s for stream stabilization
-
 ## Project Structure
 
 ```
+python-rtsp-worker/
 ├── app.py                   # FastAPI application
 ├── services/
@@ -166,23 +152,11 @@ python-rtsp-worker/
 ├── requirements.txt         # Python dependencies
 ├── .env                     # Camera URLs (gitignored)
 ├── .env.example             # Template for camera URLs
+└── .gitignore
 ```
 
-## Dependencies
-
-```
-fastapi                    # Web framework
-uvicorn[standard]          # ASGI server
-torch                      # GPU tensor operations
-PyNvVideoCodec             # NVDEC hardware decoding
-av                         # FFmpeg/RTSP client
-cuda-python                # CUDA driver bindings
-nvidia-nvimgcodec-cu12     # nvJPEG encoding
-python-dotenv              # Environment variables
-```
-
 ## Configuration
 
 ### Environment Variables (.env)
@@ -319,48 +293,6 @@ python test_jpeg_encode.py
 **Cause**: Likely CUDA context cleanup order issues
 **Workaround**: Functionality works correctly; cleanup errors can be ignored
 
-## Technical Decisions
-
-### Why PyNvVideoCodec?
-- Direct access to NVDEC hardware decoder
-- Minimal overhead compared to FFmpeg/torchaudio
-- Returns GPU tensors via DLPack
-- Better control over decode sessions
-
-### Why Shared CUDA Context?
-- Reduces VRAM from ~200MB to ~60MB per stream (70% savings)
-- Enables 1000-stream target on 60GB GPU
-- Minimal complexity overhead with singleton pattern
-
-### Why nvImageCodec?
-- GPU-native JPEG encoding (nvJPEG)
-- Zero-copy with PyTorch via `__cuda_array_interface__`
-- 1-2ms encoding time per 720p frame
-- Keeps data on GPU until final compression
-
-### Why Thread-Safe Ring Buffer?
-- Decouples decoding from inference pipeline
-- Prevents frame drops during processing spikes
-- Allows async frame access
-- Configurable buffer size per stream
-
-## Future Considerations
-
-### Hardware Decode Session Limits
-- NVIDIA GPUs typically support 5-30 concurrent decode sessions
-- May need multiple GPUs for 1000 streams
-- Test with actual hardware to verify limits
-
-### Scaling Beyond 1000 Streams
-- Multi-GPU support with context per GPU
-- Load balancing across GPUs
-- Network bandwidth considerations
-
-### TensorRT Integration
-- Next step: Integrate with TensorRT inference pipeline
-- GPU frames → TensorRT → Results
-- Keep entire pipeline on GPU
-
 ## References
 
 - [PyNvVideoCodec Documentation](https://developer.nvidia.com/pynvvideocodec)