Open WebUI Benchmark Suite

A comprehensive benchmarking framework for testing Open WebUI performance under various load conditions.

Overview

This benchmark suite is designed to:

  1. Measure concurrent user capacity - Test how many users can simultaneously use features like Channels
  2. Identify performance limits - Find the point where response times degrade
  3. Compare compute profiles - Test performance across different resource configurations
  4. Generate actionable reports - Provide detailed metrics and recommendations

Quick Start

Prerequisites

  • Python 3.11+
  • Docker and Docker Compose
  • A running Open WebUI instance (or use the provided Docker setup)
  • Chromium browser (installed automatically via Playwright for UI benchmarks)

Installation

cd benchmark
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

# Install Playwright browsers (required for UI benchmarks)
playwright install chromium

Configuration

Copy the example environment file and configure your admin credentials:

cp .env.example .env

Edit .env with your Open WebUI admin credentials:

OPEN_WEBUI_URL=http://localhost:8080
ADMIN_USER_EMAIL=your-admin@example.com
ADMIN_USER_PASSWORD=your-password
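
Before starting a run, it can save time to confirm the target instance is reachable from the benchmark host. A minimal sketch with httpx (the /health endpoint is an assumption here; a plain GET to the base URL works just as well):

# Quick reachability check before benchmarking (illustrative sketch)
import os

import httpx

url = os.environ.get("OPEN_WEBUI_URL", "http://localhost:8080")

# /health is assumed as a lightweight liveness endpoint; fall back to a
# plain GET against the base URL if your deployment does not expose it.
resp = httpx.get(f"{url}/health", timeout=10)
print(f"{url} -> HTTP {resp.status_code}")
resp.raise_for_status()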

Running Benchmarks

  1. Start Open WebUI with benchmark configuration:
cd docker
./run.sh default  # Use the default compute profile (2 CPU, 8GB RAM)
  2. Run the benchmark:
# Run the default benchmark (chat-ui with auto-scaling), which automatically finds max sustainable users based on P95 response time
owb run

# Set a custom response time threshold (default: 1000ms)
owb run --response-threshold 2000

# Run with a fixed number of users (disables auto-scaling)
owb run -m 50

# Run with visible browsers for debugging
owb run --headed
owb run --headed --slow-mo 500  # Slow down for visual inspection
  3. List available benchmarks:
owb list
  4. Run other benchmarks:
# API-based chat benchmark (no browser)
owb run chat-api -m 50

# Channel API concurrency
owb run channels-api -m 50

# Channel WebSocket benchmark  
owb run channels-ws -m 50

# Run all benchmarks
owb run all
  5. View results:

Results are organized by benchmark name and timestamp:

results/
└── chat_ui_concurrency/
    └── 20260126_014205/
        ├── result.json    # Detailed benchmark data
        ├── results.csv    # Tabular results
        └── summary.txt    # Human-readable summary

Compute Profiles

Compute profiles define the resource constraints for the Open WebUI container:

| Profile      | CPUs | Memory | Use Case              |
|--------------|------|--------|-----------------------|
| default      | 2    | 8GB    | Local MacBook testing |
| minimal      | 1    | 4GB    | Testing lower bounds  |
| cloud_small  | 2    | 4GB    | Small cloud VM        |
| cloud_medium | 4    | 8GB    | Medium cloud VM       |
| cloud_large  | 8    | 16GB   | Large cloud VM        |

List available profiles:

owb profiles
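
These limits map onto standard Docker container constraints. Purely for illustration, applying the default profile's limits through the Docker SDK might look like the sketch below (the suite itself launches Open WebUI via Docker Compose under docker/, so this is not its actual launch path, and the image name is an assumption):

# Illustrative only: how a profile's limits translate to container settings.
# The benchmark suite itself uses Docker Compose (see docker/run.sh).
import docker

client = docker.from_env()
container = client.containers.run(
    "ghcr.io/open-webui/open-webui:main",  # image name assumed for illustration
    detach=True,
    ports={"8080/tcp": 8080},
    nano_cpus=2_000_000_000,  # 2 CPUs (default profile)
    mem_limit="8g",           # 8GB RAM (default profile)
)
print(container.short_id)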

Available Benchmarks

Chat API Concurrency (chat-api)

Tests concurrent AI chat performance via the OpenAI-compatible API:

  • Creates test users and makes a model publicly available
  • Each user sends chat requests via the /api/chat endpoint
  • Measures response times, throughput, and error rates
  • Tests the backend's ability to handle concurrent LLM requests

Usage:

owb run chat-api -m 50 --model gpt-4o-mini
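
Outside the CLI, a single request of the kind each simulated user issues can be approximated with httpx. The sketch below uses the OpenAI-style message format and the /api/chat path mentioned above; treat both as illustrative and adjust for your deployment:

# Illustrative single chat request; the benchmark drives many of these concurrently
import asyncio
import time

import httpx

async def send_chat(base_url: str, token: str) -> float:
    """Send one chat request and return the response time in milliseconds."""
    payload = {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
    }
    async with httpx.AsyncClient(base_url=base_url, timeout=60) as client:
        start = time.perf_counter()
        resp = await client.post(
            "/api/chat",  # endpoint as described above; adjust if yours differs
            json=payload,
            headers={"Authorization": f"Bearer {token}"},
        )
        resp.raise_for_status()
        return (time.perf_counter() - start) * 1000

# asyncio.run(send_chat("http://localhost:8080", "YOUR_API_TOKEN"))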

Chat UI Concurrency (chat-ui) - Default

Tests concurrent AI chat performance through actual browser UI using Playwright. This is the default benchmark and runs in auto-scale mode by default.

Auto-scale mode (default):

  • Progressively adds users until P95 response time exceeds threshold
  • Automatically finds maximum sustainable concurrent users
  • Reports performance at each level tested

Fixed mode:

  • Test with a specific number of concurrent users
  • Enabled by specifying --max-users / -m

How it works:

  • Launches real Chromium browser instances (or contexts)
  • Each browser logs in as a different user
  • Sends chat messages and waits for streaming responses
  • Measures actual user-experienced response times including rendering
  • Tests full stack performance: UI, backend, and LLM together
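
Conceptually, the auto-scale loop steps the user count upward until the P95 response time crosses the threshold; a simplified sketch of that logic (not the benchmark's actual implementation, and run_level is a hypothetical async callable):

# Simplified sketch of the auto-scaling logic: increase the user count in steps
# until the P95 response time exceeds the configured threshold.
async def auto_scale(run_level, step: int = 20, threshold_ms: float = 1000.0,
                     max_users: int = 200) -> int:
    """run_level(n) runs one level with n concurrent users and returns its P95 in ms."""
    max_sustainable = 0
    users = step
    while users <= max_users:
        p95 = await run_level(users)
        print(f"{users} users -> P95 {p95:.0f} ms ({p95 / threshold_ms:.0%} of threshold)")
        if p95 > threshold_ms:
            break
        max_sustainable = users
        users += step
    return max_sustainable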

Usage:

# Auto-scale mode (default) - finds max sustainable users
owb run
owb run --response-threshold 2000  # Custom threshold (default: 1000ms)

# Fixed mode - test specific user count
owb run -m 50
owb run -m 50 --model gpt-4o-mini

# Debugging options
owb run --headed                    # Visible browsers
owb run --headed --slow-mo 500      # Slow down for inspection

Configuration:

chat_ui:
  headless: true              # Run browsers in headless mode
  slow_mo: 0                  # Slow down operations by ms (debugging)
  viewport_width: 1280        # Browser viewport width
  viewport_height: 720        # Browser viewport height
  browser_timeout: 30000      # Default timeout in ms
  screenshot_on_error: true   # Capture screenshots on failure
  use_isolated_browsers: false # Use separate browser instances vs contexts

Notes:

  • Browser benchmarks require more resources than API benchmarks
  • For high concurrency (50+), use headless mode and browser contexts
  • Headed mode is useful for debugging UI issues
  • The benchmark measures actual streaming response detection

Channel Concurrency (channels-api)

Tests concurrent user capacity in Open WebUI Channels:

  • Creates a test channel
  • Progressively adds users (10, 20, 30, ... up to max)
  • Each user sends messages at a configured rate
  • Measures response times and error rates
  • Identifies the maximum sustainable user count

Configuration options:

channels:
  max_concurrent_users: 100  # Maximum users to test
  user_step_size: 10         # Increment users by this amount
  sustain_time: 30           # Seconds to run at each level
  message_frequency: 0.5     # Messages per second per user
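
Together, message_frequency and sustain_time define each simulated user's send loop at a given level; roughly, each user does something like the following (an illustrative sketch, with send_message standing in for the actual API call):

# Rough sketch of one simulated channel user: send messages at a fixed rate
# for the sustain period and record each send's latency in milliseconds.
import asyncio
import time

async def simulate_user(send_message, message_frequency: float = 0.5,
                        sustain_time: float = 30.0) -> list[float]:
    """send_message() posts one channel message (hypothetical async callable)."""
    latencies: list[float] = []
    interval = 1.0 / message_frequency  # 0.5 msg/s -> one message every 2 seconds
    deadline = time.monotonic() + sustain_time
    while time.monotonic() < deadline:
        start = time.perf_counter()
        await send_message()
        latencies.append((time.perf_counter() - start) * 1000)
        await asyncio.sleep(interval)
    return latencies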

Channel WebSocket (channels-ws)

Tests WebSocket scalability for real-time message delivery in Channels:

  • Establishes WebSocket connections for multiple users
  • Tests real-time message broadcasting
  • Measures message delivery latency
  • Identifies WebSocket connection limits
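
For a sense of what is being measured, a bare-bones round-trip check with python-socketio might look like the sketch below; the event names and auth payload are placeholders, not Open WebUI's actual protocol (the real scenario lives in benchmark/scenarios/):

# Bare-bones sketch: connect a Socket.IO client and time one round trip.
# Event names and the auth payload are placeholders for illustration.
import asyncio
import time

import socketio

async def measure_roundtrip(url: str, token: str) -> float:
    sio = socketio.AsyncClient()
    received = asyncio.Event()

    @sio.on("channel-events")  # placeholder event name
    async def on_event(data):
        received.set()

    await sio.connect(url, auth={"token": token}, transports=["websocket"])
    start = time.perf_counter()
    await sio.emit("channel-message", {"content": "ping"})  # placeholder emit
    await asyncio.wait_for(received.wait(), timeout=10)
    latency_ms = (time.perf_counter() - start) * 1000
    await sio.disconnect()
    return latency_ms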

Configuration

Configuration files are located in config/:

  • benchmark_config.yaml - Main benchmark settings
  • compute_profiles.yaml - Resource profiles for Docker containers

Environment Variables

All configuration can be set via environment variables (loaded from .env file):

| Variable             | Description                     | Default                           |
|----------------------|---------------------------------|-----------------------------------|
| OPEN_WEBUI_URL       | Open WebUI URL for benchmarking | http://localhost:8080             |
| OLLAMA_BASE_URL      | Ollama API URL                  | http://host.docker.internal:11434 |
| ENABLE_CHANNELS      | Enable the Channels feature     | true                              |
| ADMIN_USER_EMAIL     | Admin email                     | -                                 |
| ADMIN_USER_PASSWORD  | Admin password                  | -                                 |
| MAX_CONCURRENT_USERS | Max concurrent users            | 50                                |
| USER_STEP_SIZE       | User increment step             | 10                                |
| SUSTAIN_TIME_SECONDS | Test duration per level (s)     | 30                                |
| MESSAGE_FREQUENCY    | Messages per second per user    | 0.5                               |
| OPEN_WEBUI_PORT      | Container port                  | 8080                              |
| CPU_LIMIT            | CPU limit                       | 2.0                               |
| MEMORY_LIMIT         | Memory limit                    | 8g                                |

Adding a New Benchmark

  1. Create a new file in benchmark/scenarios/:
from benchmark.core.base import BaseBenchmark
from benchmark.core.metrics import BenchmarkResult

class MyNewBenchmark(BaseBenchmark):
    name = "My New Benchmark"
    description = "Tests something new"
    version = "1.0.0"
    
    async def setup(self) -> None:
        # Set up test environment
        pass
    
    async def run(self) -> BenchmarkResult:
        # Execute the benchmark
        # Use self.metrics to record timings
        return self.metrics.get_result(self.name)
    
    async def teardown(self) -> None:
        # Clean up
        pass
  2. Register the benchmark in benchmark/cli.py

  3. Add configuration options if needed in config/benchmark_config.yaml

Custom Metrics Collection

from benchmark.core.metrics import MetricsCollector

metrics = MetricsCollector()
metrics.start()

# Time individual operations
with metrics.time_operation("my_operation"):
    await do_something()

# Or record manually
metrics.record_timing(
    operation="api_call",
    duration_ms=150.5,
    success=True,
)

metrics.stop()
result = metrics.get_result("My Benchmark")

Understanding Results

Key Metrics

| Metric                | Description                   | Good Threshold |
|-----------------------|-------------------------------|----------------|
| avg_response_time_ms  | Average response time         | < 2000ms       |
| p95_response_time_ms  | 95th percentile response time | < 3000ms       |
| error_rate_percent    | Percentage of failed requests | < 1%           |
| requests_per_second   | Throughput                    | > 10           |
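
The percentile figures can be reproduced from raw timings with the standard library, for example:

# Quick sketch: deriving the reported aggregates from a list of raw timings (ms)
from statistics import mean, quantiles

timings_ms = [640, 655, 670, 690, 710, 725, 740, 790, 850, 1180]

avg_response_time_ms = mean(timings_ms)
# quantiles(..., n=100) yields the 1st..99th percentile cut points; index 94 is P95
p95_response_time_ms = quantiles(timings_ms, n=100)[94]

print(f"avg={avg_response_time_ms:.0f}ms  p95={p95_response_time_ms:.0f}ms")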

Result Files

  • *.json - Detailed results for each benchmark run
  • benchmark_results_*.csv - Combined results in CSV format
  • summary_*.txt - Human-readable summary
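
For post-processing, a run's output can be pulled into pandas and compared across runs; a small sketch (the exact JSON fields and CSV columns vary by benchmark, so inspect the files first):

# Small sketch: load one run's output for further analysis.
# Exact JSON fields and CSV columns vary by benchmark; inspect the files first.
import json
from pathlib import Path

import pandas as pd

run_dir = Path("results/chat_ui_concurrency/20260126_014205")  # example run from above

detail = json.loads((run_dir / "result.json").read_text())
table = pd.read_csv(run_dir / "results.csv")

print(detail.get("max_sustainable_users"))
print(table.describe())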

Interpreting Chat UI Benchmark Results

The chat-ui benchmark in auto-scale mode reports:

  • max_sustainable_users: Maximum users where P95 stays under threshold
  • levels_tested: Performance data at each user count level
  • % of Threshold: How close P95 is to the configured limit

Example auto-scale result:

                   Auto-Scale Results                    
┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Users ┃ P95 (ms) ┃ Avg (ms) ┃ % of Threshold ┃ Errors ┃
┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│    10 │      731 │      662 │            37% │   0.0% │
│    30 │      881 │      748 │            44% │   0.0% │
│    50 │     1178 │     1064 │            59% │   0.0% │
│    70 │     2133 │     1854 │           107% │   0.8% │
└───────┴──────────┴──────────┴────────────────┴────────┘

P95 Threshold: 2000ms
Maximum Sustainable Users: 50

Interpreting Channel Benchmark Results

The channel benchmark reports:

  • max_sustainable_users: Maximum users where performance thresholds are met
  • results_by_level: Performance at each user count level
  • tested_levels: All user counts that were tested

Example result analysis:

Users: 10  | P95: 150ms  | Errors: 0%    | ✓ PASS
Users: 20  | P95: 280ms  | Errors: 0.1%  | ✓ PASS
Users: 30  | P95: 520ms  | Errors: 0.3%  | ✓ PASS
Users: 40  | P95: 1200ms | Errors: 0.8%  | ✓ PASS
Users: 50  | P95: 3500ms | Errors: 2.1%  | ✗ FAIL

Maximum sustainable users: 40

Architecture

benchmark/
├── benchmark/
│   ├── core/           # Core framework
│   │   ├── base.py     # Base benchmark class
│   │   ├── config.py   # Configuration management
│   │   ├── metrics.py  # Metrics collection
│   │   └── runner.py   # Benchmark orchestration
│   ├── clients/        # API clients
│   │   ├── http_client.py      # HTTP/REST client
│   │   ├── websocket_client.py # WebSocket client
│   │   └── browser_client.py   # Playwright browser automation
│   ├── scenarios/      # Benchmark implementations
│   │   ├── channels.py # Channel benchmarks
│   │   └── chat_ui.py  # Browser-based chat benchmark
│   ├── utils/          # Utilities
│   │   └── docker.py   # Docker management
│   └── cli.py          # Command-line interface
├── config/             # Configuration files
├── docker/             # Docker Compose for benchmarking
└── results/            # Benchmark output organized by {benchmark}/{timestamp}/

Dependencies

The benchmark suite reuses Open WebUI dependencies where possible:

From Open WebUI:

  • httpx - HTTP client
  • aiohttp - Async HTTP
  • python-socketio - WebSocket client
  • pydantic - Data validation
  • pandas - Data analysis

Benchmark-specific:

  • playwright - Browser automation for UI testing
  • locust - Load testing (optional, for advanced scenarios)
  • rich - Terminal output
  • docker - Docker SDK
  • matplotlib - Plotting results

Troubleshooting

Common Issues

  1. Connection refused: Ensure Open WebUI is running and accessible
  2. Authentication errors: Check admin credentials in config
  3. Docker resource errors: Ensure Docker has enough resources allocated
  4. WebSocket timeout: Increase websocket_timeout in config
  5. Browser launch failures: Run playwright install chromium to install browsers
  6. Login timeout in browser tests: Check that .env contains the correct admin credentials (quote values that contain spaces)
  7. High browser concurrency fails: Run in headless mode (the default), use browser contexts, and ensure sufficient system resources

Debug Mode

Set logging level to DEBUG:

export BENCHMARK_LOG_LEVEL=DEBUG
owb run channels-api

Contributing

When adding new benchmarks:

  1. Follow the BaseBenchmark interface
  2. Add tests for the new benchmark
  3. Update configuration schema if needed
  4. Add documentation to this README

License

MIT License - See LICENSE file
