Vexel is a high-performance inference engine for Large Language Models (LLMs) on Apple Silicon. It uses Metal for hardware acceleration, optimized kernels (FlashAttention-2), and a custom scheduler to deliver fast token generation; measured throughput is reported in the benchmarks below.
- Metal Acceleration: Fully optimized for Apple M-series chips (M1/M2/M3/M4) using custom Metal kernels.
- FlashAttention-2: Implements FlashAttention-2 for efficient attention computation.
- Continuous Batching: Event-driven scheduler for high-throughput concurrent inference.
- Streaming Support: Server-Sent Events (SSE) for real-time token streaming.
- Go Client Library: High-level Go client (`vexel/client`) for easy integration.
- Direct Inference API: Use the runtime directly for custom pipelines and benchmarking.
- GGUF Support: Compatible with GGUF model format (Q4_0, Q4_K_M, Q6_K quantizations).
- Multi-Architecture: Supports LLaMA family (LLaMA 2/3, Mistral) and Phi family (Phi-2, Phi-3).
- macOS 14.0 (Sonoma) or later.
- Go 1.22+.
- Xcode Command Line Tools (for Metal compilation).
Build the unified vexel binary:

```bash
make build
```

This produces a single `./vexel` binary with all subcommands. Requires macOS with Metal support.

```text
vexel [global flags] <subcommand> [flags]

Global flags:
  --model     Path to GGUF model file (required for serve/generate/chat)
  --verbose   Enable verbose logging

Subcommands:
  serve       Start the HTTP inference server
  generate    One-shot text generation from a prompt
  chat        Interactive chat REPL
  bench       Run scheduling benchmarks
  tokenize    Tokenize text to token IDs
```

```bash
./vexel --model models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf serve --port 8080
```

Serve flags:

- `--port`: HTTP port (default: 8080).
- `--max-tokens`: Max tokens per request (default: 256).
- `--max-batch-size`: Max batch size for scheduler (default: 1).

```bash
# One-shot text generation from a prompt
./vexel --model models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf generate \
  --prompt "Once upon a time" --max-tokens 100 --temperature 0.7

# Interactive chat REPL
./vexel --model models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf chat \
  --system-prompt "You are a helpful assistant."

# Tokenize text to token IDs
./vexel --model models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf tokenize \
  --input "Hello world"
# Or pipe from stdin:
echo "Hello world" | ./vexel tokenize --tokenizer path/to/tokenizer.json

# Scheduling benchmarks
./vexel bench --batch 128 --seq-len 128 --num-seqs 256
```

Vexel provides a Go client library in `client/` for communicating with a running server.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"vexel/client"
)

func main() {
	c := client.New(client.Config{BaseURL: "http://localhost:8080"})
	ctx := context.Background()

	// Blocking generation
	text, err := c.Generate(ctx, "What is Go?", nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)

	// Streaming generation (token-by-token)
	tokenChan, err := c.Stream(ctx, "Hello, world!", nil)
	if err != nil {
		log.Fatal(err)
	}
	for token := range tokenChan {
		fmt.Print(token)
	}
}
```

See `examples/client/main.go` for a full example with options.

For custom pipelines, you can use the runtime directly without the HTTP layer:
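```go
// Load model and run prefill + decode loop.
// model, promptTokens, maxTokens, and argmax are defined by the caller.
logits, _ := model.DecodeWithGPUKV(promptTokens, 0) // prefill
nextToken := argmax(logits)
// Assumes the second argument is the start position of the given tokens.
pos := len(promptTokens)
for i := 0; i < maxTokens; i++ {
	logits, _ = model.DecodeWithGPUKV([]int{nextToken}, pos) // decode
	nextToken = argmax(logits)
	pos++
}
```

See `examples/generate/main.go` for a complete working example.
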
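The loop above relies on an `argmax` helper that is not shown. A minimal sketch of greedy token selection could look like the following; the `[]float32` logits type is an assumption made for illustration, and the actual type returned by `DecodeWithGPUKV` may differ.

```go
package main

import "fmt"

// argmax returns the index of the largest logit, i.e. the greedy token choice.
// It returns -1 for an empty slice; ties resolve to the lowest index.
func argmax(logits []float32) int {
	best := -1
	for i, v := range logits {
		if best < 0 || v > logits[best] {
			best = i
		}
	}
	return best
}

func main() {
	logits := []float32{-1.2, 0.4, 3.1, 0.4}
	fmt.Println(argmax(logits)) // prints 2
}
```
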
You can embed Vexel's HTTP server into your own Go application:
```go
sched, _ := scheduler.NewScheduler(model, tok, scheduler.Config{
	MaxBatchSize:  1,
	MaxSequences:  64,
	MaxTokens:     256,
	SamplerConfig: sampler.DefaultConfig(),
})
go sched.Run(ctx)
srv := serve.NewServer(sched)
http.ListenAndServe(":8080", srv)
```

See `examples/server/main.go` for a complete working example with model loading and graceful shutdown.

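To give a feel for what a continuous batching scheduler buys over fixed batches, here is a toy simulation. It is a sketch under stated assumptions, not Vexel's scheduler: `continuousSteps` and `staticSteps` are hypothetical names, every decode step is assumed to cost the same, and each sequence is described only by the number of tokens it still has to generate.

```go
package main

import "fmt"

// continuousSteps simulates continuous batching: after every decode step,
// finished sequences leave the batch and queued ones take the free slots.
// remaining[i] is the (positive) number of tokens sequence i must generate.
func continuousSteps(remaining []int, maxBatch int) int {
	if maxBatch < 1 {
		maxBatch = 1
	}
	queue := append([]int(nil), remaining...)
	var active []int
	steps := 0
	for len(queue) > 0 || len(active) > 0 {
		// Admit queued sequences into free slots.
		for len(active) < maxBatch && len(queue) > 0 {
			active = append(active, queue[0])
			queue = queue[1:]
		}
		// One decode step produces one token per active sequence.
		steps++
		next := active[:0]
		for _, r := range active {
			if r > 1 {
				next = append(next, r-1)
			}
		}
		active = next
	}
	return steps
}

// staticSteps simulates fixed batches: each batch runs until its longest
// sequence finishes before the next batch starts.
func staticSteps(remaining []int, maxBatch int) int {
	if maxBatch < 1 {
		maxBatch = 1
	}
	steps := 0
	for start := 0; start < len(remaining); start += maxBatch {
		end := start + maxBatch
		if end > len(remaining) {
			end = len(remaining)
		}
		longest := 0
		for _, r := range remaining[start:end] {
			if r > longest {
				longest = r
			}
		}
		steps += longest
	}
	return steps
}

func main() {
	lengths := []int{8, 2, 2, 6}
	fmt.Println("static:    ", staticSteps(lengths, 2))     // prints 14
	fmt.Println("continuous:", continuousSteps(lengths, 2)) // prints 10
}
```

With a batch size of 1, as in the configuration above, both strategies take the same number of steps; the difference only appears once several sequences share a batch.
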
Generate (Blocking):

```bash
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me a joke"}'
```

Stream (SSE):

```bash
curl -N -X POST http://localhost:8080/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me a joke"}'
```

| Example | Description | Build Tags |
|---|---|---|
| `examples/client/` | HTTP client with Generate and Stream | none |
| `examples/server/` | Embed Vexel server in your Go app | metal |
| `examples/generate/` | Direct inference loop (no HTTP) | metal |

Run examples with:
```bash
# Client (requires a running server)
go run ./examples/client

# Server (Metal required)
go run -tags metal ./examples/server -model path/to/model.gguf

# Direct generation (Metal required)
go run -tags metal ./examples/generate -model path/to/model.gguf -prompt "Hello!"
```

```text
client/              Go HTTP client library (no Metal dependency)
inference/
cmd/vexel/           Unified CLI binary (serve, generate, chat, bench, tokenize)
cmd/vexel/internal   REPL and chat loop implementation
backend/metal/       Metal GPU backend and kernel dispatch
runtime/             Model loading, weight management, inference loop
scheduler/           Continuous batching scheduler with metrics
serve/               HTTP server (/generate, /stream endpoints)
pkg/
gguf/                GGUF model format parser
tokenizer/           Tokenizer with chat template support
sampler/             Temperature, top-k, top-p sampling
examples/            Usage examples (client, server, direct generation)
scripts/             Performance harness and benchmarking tools
```

Benchmarks on Apple M3 Max (128 GB) with LLaMA 2 7B:
Q4_0 (3.56 GB):
| Metric | Vexel | llama.cpp | MLX | Gap (vs llama.cpp) |
|---|---|---|---|---|
| Decode throughput | 64.8 tok/s | 76.3 tok/s | 83.5 tok/s | -15.1% |
| Decode ctx degradation (16→512) | ~-10% | -2.7% | -2.5% | — |
| Prefill (128 tokens) | 377 tok/s | 803 tok/s | 725 tok/s | -53% |
| Model load time | ~885 ms | ~1100 ms | — | +20% faster |
Q4_K_M (4.08 GB) — optimized kernels:
| Metric | Vexel | vs Q4_0 | Before Optimization |
|---|---|---|---|
| Decode throughput | 53.5 tok/s | 0.83x | 14.2 tok/s (3.8x faster) |
| Prefill (128 tokens) | 220.4 tok/s | 0.58x | 16.6 tok/s (13.3x faster) |
| Context degradation (16→512) | -10.8% | — | — |
Vexel achieves 85% of llama.cpp's decode throughput on Q4_0 and supports Q4_K_M with custom sub-block strided decode and simdgroup tiled prefill kernels.
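The gap figures in the Q4_0 table are relative differences against the llama.cpp baseline. A small sketch of that arithmetic, using the numbers reported above (`relativeGap` is a hypothetical helper name):

```go
package main

import "fmt"

// relativeGap returns how far value is from baseline, in percent of the baseline.
// Negative means value is lower than the baseline.
func relativeGap(value, baseline float64) float64 {
	return (value - baseline) / baseline * 100
}

func main() {
	// Throughput in tok/s: higher is better, so a negative gap is a deficit.
	fmt.Printf("decode gap:  %.1f%%\n", relativeGap(64.8, 76.3))
	fmt.Printf("prefill gap: %.1f%%\n", relativeGap(377, 803))
	// Load time in ms: lower is better, so a negative gap means faster loading.
	fmt.Printf("load time:   %.1f%%\n", relativeGap(885, 1100))
	// Decode throughput as a share of llama.cpp's.
	fmt.Printf("decode share: %.0f%%\n", 64.8/76.3*100)
}
```

The decode gap comes out at about -15.1% and the prefill gap at about -53%, matching the table; the load-time difference of roughly 20% is in Vexel's favor because a shorter time is better.
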
Flash Attention: Tiled split-KV decode with online softmax reduced context scaling degradation from -24.5% to ~-10% (ctx=16 → ctx=512).
See `benchmarks/RESULTS.md` for detailed analysis, context-length scaling data, and the optimization roadmap.
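The split-KV decode with online softmax mentioned above rests on one identity: attention over a long KV sequence can be computed chunk by chunk, and the partial results can be merged exactly by rescaling each to a common maximum. The following is an illustrative CPU sketch in float64 for a single query, not Vexel's Metal kernel; the `partial` type and function names are hypothetical, and scores are assumed to be finite.

```go
package main

import (
	"fmt"
	"math"
)

// partial is the running state of an online softmax over one KV chunk:
// the max score seen, the sum of exp(score-max), and the weighted value sum.
type partial struct {
	max float64
	sum float64
	acc []float64
}

// attendChunk folds one chunk of scores and value vectors into a partial.
func attendChunk(scores []float64, values [][]float64, dim int) partial {
	p := partial{max: math.Inf(-1), acc: make([]float64, dim)}
	for i, s := range scores {
		newMax := math.Max(p.max, s)
		scale := math.Exp(p.max - newMax) // rescale old state to the new max
		w := math.Exp(s - newMax)
		p.sum = p.sum*scale + w
		for d := range p.acc {
			p.acc[d] = p.acc[d]*scale + w*values[i][d]
		}
		p.max = newMax
	}
	return p
}

// merge combines the partial results of two non-empty chunks.
func merge(a, b partial) partial {
	m := math.Max(a.max, b.max)
	sa, sb := math.Exp(a.max-m), math.Exp(b.max-m)
	out := partial{max: m, sum: a.sum*sa + b.sum*sb, acc: make([]float64, len(a.acc))}
	for d := range out.acc {
		out.acc[d] = a.acc[d]*sa + b.acc[d]*sb
	}
	return out
}

// output normalizes the accumulator into the final attention output.
func (p partial) output() []float64 {
	out := make([]float64, len(p.acc))
	for d := range out {
		out[d] = p.acc[d] / p.sum
	}
	return out
}

func main() {
	scores := []float64{1, 3, 2, 0.5}
	values := [][]float64{{1, 0}, {0, 1}, {1, 1}, {2, 2}}

	whole := attendChunk(scores, values, 2)
	split := merge(
		attendChunk(scores[:2], values[:2], 2),
		attendChunk(scores[2:], values[2:], 2),
	)
	// Both lines should agree up to floating-point rounding.
	fmt.Println(whole.output())
	fmt.Println(split.output())
}
```

Because the merge is exact, the chunks can be processed independently (for example by separate threadgroups on a GPU) and reduced at the end, which is what makes splitting the KV sequence attractive at longer context lengths.
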
Where Vexel differentiates:
- Go scheduler with continuous batching (multi-client throughput)
- gRPC + HTTP dual protocol for production serving
- Single binary deployment — no Python dependency chain
- Go codebase (with Metal kernels for the GPU backend) that is easy to embed, extend, and deploy
- Faster cold start than llama.cpp (~885ms vs ~1100ms)
Run the benchmark suite:
```bash
# Unit-level benchmarks
go test -tags metal -run TestThroughput -v ./inference/runtime/
go test -tags metal -run TestLatency -v ./inference/runtime/

# Competitive benchmarks (vs llama.cpp, MLX, Ollama)
./benchmarks/run_benchmark.sh --model path/to/model.gguf --engines vexel,llama

# Scheduling benchmarks via CLI
./vexel bench --batch 128 --seq-len 128 --num-seqs 256
```

Use the provided harness to compare Vexel (Metal) against llama.cpp on a fixed prompt set.

- Model: default `models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf` (override with `MODEL_PATH`).
- Vexel binary: default `./vexel` (override with `VEXEL_BIN`).
- llama.cpp binary: default `llama-cli` (override with `LLAMA_BIN`).

```bash
bash scripts/perf_harness.sh
```

The harness uses deterministic sampling (temp=0) to make correctness comparisons meaningful.

Environment overrides:

- `MODEL_PATH`: GGUF model path
- `VEXEL_BIN`: path to Vexel binary
- `LLAMA_BIN`: path to llama.cpp binary
- `OUT_DIR`: output directory (default `perf_reports`)
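
Why temp=0 makes the harness output comparable can be seen from how temperature reshapes the token distribution. The sketch below is an assumption-laden illustration, not Vexel's sampler: `temperatureProbs` is a hypothetical helper, and treating a temperature of 0 as greedy selection is a common convention rather than something this document specifies.

```go
package main

import (
	"fmt"
	"math"
)

// temperatureProbs converts logits to probabilities at the given temperature.
// temp <= 0 is treated as greedy decoding: all mass goes to the largest logit.
func temperatureProbs(logits []float64, temp float64) []float64 {
	probs := make([]float64, len(logits))
	if len(logits) == 0 {
		return probs
	}
	best := 0
	for i, v := range logits {
		if v > logits[best] {
			best = i
		}
	}
	if temp <= 0 {
		probs[best] = 1
		return probs
	}
	var sum float64
	for i, v := range logits {
		// Subtracting the max keeps exp() from overflowing.
		probs[i] = math.Exp((v - logits[best]) / temp)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	logits := []float64{2.0, 1.0, 0.5}
	for _, t := range []float64{1.0, 0.7, 0.1, 0} {
		fmt.Printf("temp=%.1f -> %.3f\n", t, temperatureProbs(logits, t))
	}
}
```

As the temperature falls, probability mass concentrates on the top token; at 0 the choice no longer depends on a random draw, so two engines given the same weights and prompt can be compared token by token.
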