
feat: add GPU inference option to web UI #34

Open
EruditionHerta wants to merge 1 commit into OpenMOSS:main from EruditionHerta:feature/gpu-web-inference

@EruditionHerta

Summary

Allow users to select a CUDA device for inference in the web interface.

Previously the app forced CPU-only mode regardless of the --device flag.

The app now supports --device cuda, --device auto, and per-request device selection via a dropdown in the web UI.

Changes

main(): Removed the forced CPU override; added cuda/auto device resolution with CUDA availability detection.

RequestRuntimeManager: Extended the normalize_requested_execution_device() whitelist to accept cuda and cuda:N; added _build_cuda_runtime_locked() for dynamic GPU runtime creation.

Web UI: Added a Device selector dropdown (Default/CPU/CUDA:N) under Generation Options.

API endpoints: /api/generate and /api/generate-stream/start accept an execution_device parameter; /health reports cuda_available.
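
For reviewers, the whitelist change described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the function name `normalize_device`, the treatment of an empty or "default" value as "use the server default", and the canonicalization of the index are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: the accepted values are "cpu", "cuda", and "cuda:N",
# where N is a non-negative integer device index.
_CUDA_INDEXED = re.compile(r"cuda:(\d+)")


def normalize_device(requested: Optional[str]) -> Optional[str]:
    """Return a canonical device string, or None for 'use the server default'.

    Hypothetical helper; the PR's normalize_requested_execution_device()
    may behave differently.
    """
    if requested is None:
        return None
    value = requested.strip().lower()
    if value in ("", "default"):
        return None
    if value in ("cpu", "cuda"):
        return value
    match = _CUDA_INDEXED.fullmatch(value)
    if match:
        # int() strips leading zeros so "cuda:01" and "cuda:1" compare equal.
        return f"cuda:{int(match.group(1))}"
    raise ValueError(f"unsupported execution device: {requested!r}")
```

Rejecting anything outside the whitelist keeps arbitrary strings from a request from being passed on as a device name.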

Test Plan

  • python app.py --device cpu (default behavior unchanged)
  • python app.py --device cuda (GPU mode)
  • python app.py --device auto (auto-detect)
  • Web UI shows the Device selector with CUDA options when a GPU is available
  • Switching between CPU/CUDA in the web UI works correctly
  • Streaming generation works with GPU device
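
To make the Device selector check concrete, here is a rough sketch of how the dropdown entries could be derived from CUDA availability. The labels Default/CPU/CUDA:N come from the description above; the function name `build_device_options` and the `cuda_device_count` parameter are hypothetical and not taken from this PR.

```python
from typing import List


def build_device_options(cuda_available: bool, cuda_device_count: int = 0) -> List[str]:
    """List the dropdown labels to show (hypothetical helper).

    "Default" and "CPU" are always offered; one "CUDA:N" entry per
    device is added only when CUDA is reported as available.
    """
    options = ["Default", "CPU"]
    if cuda_available:
        options.extend(f"CUDA:{index}" for index in range(cuda_device_count))
    return options


# Example: a machine with two GPUs versus a CPU-only machine.
print(build_device_options(True, 2))   # ['Default', 'CPU', 'CUDA:0', 'CUDA:1']
print(build_device_options(False, 2))  # ['Default', 'CPU']
```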

Allow users to select a CUDA device for inference in the web interface.
Previously the app was forced into CPU-only mode. It now supports:
- `--device cuda` to start on GPU
- `--device auto` to auto-detect (GPU if available, else CPU)
- Device selector dropdown in the web UI
- Dynamic GPU runtime creation when requested per-request
- `/health` endpoint reports `cuda_available`
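
The `--device` resolution listed above could look roughly like this. It is a sketch under stated assumptions rather than the PR's actual `main()` logic: it assumes PyTorch is the backend (detected via `torch.cuda.is_available()`), the names `detect_cuda` and `resolve_device` are hypothetical, and raising an error when `cuda` is requested without a usable GPU is an assumed policy (the PR may fall back to CPU instead).

```python
def detect_cuda() -> bool:
    """Report whether CUDA is usable; False if PyTorch is not installed."""
    try:
        import torch  # assumption: the app runs on PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def resolve_device(flag: str, cuda_available: bool) -> str:
    """Map a --device value ("cpu", "cuda", "cuda:N", "auto") to a device."""
    value = flag.strip().lower()
    if value == "auto":
        # Auto-detect: GPU if available, else CPU.
        return "cuda" if cuda_available else "cpu"
    if value == "cpu":
        return "cpu"
    if value == "cuda" or value.startswith("cuda:"):
        if not cuda_available:
            # Assumed policy: fail loudly rather than silently using the CPU.
            raise RuntimeError("CUDA was requested but is not available")
        return value
    raise ValueError(f"unknown --device value: {flag!r}")
```

At startup this would be called as `resolve_device(args.device, detect_cuda())`, and the same `detect_cuda()` result could back the `cuda_available` field reported by `/health`.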