Vexel

Vexel is a high-performance inference engine for Large Language Models (LLMs) on Apple Silicon. It leverages Metal for hardware acceleration, optimized kernels (FlashAttention-2), and a custom scheduler to deliver state-of-the-art token generation speeds.

Features

Metal Acceleration: Fully optimized for Apple M-series chips (M1/M2/M3/M4) using custom Metal kernels.
FlashAttention-2: Implements FlashAttention-2 for efficient attention computation.
Continuous Batching: Event-driven scheduler for high-throughput concurrent inference.
Streaming Support: Server-Sent Events (SSE) for real-time token streaming.
Go Client Library: High-level Go client (vexel/client) for easy integration.
Direct Inference API: Use the runtime directly for custom pipelines and benchmarking.
GGUF Support: Compatible with GGUF model format (Q4_0, Q4_K_M, Q6_K quantizations).
Multi-Architecture: Supports LLaMA family (LLaMA 2/3, Mistral) and Phi family (Phi-2, Phi-3).

Getting Started

Prerequisites

macOS 14.0+ (Sonoma) or later.
Go 1.22+.
Xcode Command Line Tools (for Metal compilation).

Building

Build the unified vexel binary:

make build

This produces a single ./vexel binary with all subcommands. Requires macOS with Metal support.

CLI Subcommands

vexel [global flags] <subcommand> [flags]

Global flags:
  --model    Path to GGUF model file (required for serve/generate/chat)
  --verbose  Enable verbose logging

Subcommands:
  serve      Start the HTTP inference server
  generate   One-shot text generation from a prompt
  chat       Interactive chat REPL
  bench      Run scheduling benchmarks
  tokenize   Tokenize text to token IDs

Running the Server

./vexel --model models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf serve --port 8080

Serve flags:

--port: HTTP port (default: 8080).
--max-tokens: Max tokens per request (default: 256).
--max-batch-size: Max batch size for scheduler (default: 1).

Text Generation

./vexel --model models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf generate \
  --prompt "Once upon a time" --max-tokens 100 --temperature 0.7

Interactive Chat

./vexel --model models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf chat \
  --system-prompt "You are a helpful assistant."

Tokenization

./vexel --model models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf tokenize \
  --input "Hello world"
# Or pipe from stdin:
echo "Hello world" | ./vexel tokenize --tokenizer path/to/tokenizer.json

Benchmarking

./vexel bench --batch 128 --seq-len 128 --num-seqs 256

Usage

Go Client

Vexel provides a Go client library in client/ for communicating with a running server.

package main

import (
	"context"
	"fmt"
	"vexel/client"
)

func main() {
	c := client.New(client.Config{BaseURL: "http://localhost:8080"})

	// Blocking generation
	text, _ := c.Generate(context.Background(), "What is Go?", nil)
	fmt.Println(text)

	// Streaming generation (token-by-token)
	tokenChan, _ := c.Stream(context.Background(), "Hello, world!", nil)
	for token := range tokenChan {
		fmt.Print(token)
	}
}

See examples/client/main.go for a full example with options.

Direct Inference

For custom pipelines, you can use the runtime directly without the HTTP layer:

// Load model and run prefill + decode loop
logits, _ := model.DecodeWithGPUKV(promptTokens, 0)  // prefill
nextToken := argmax(logits)

for i := 0; i < maxTokens; i++ {
	logits, _ = model.DecodeWithGPUKV([]int{nextToken}, pos)  // decode
	nextToken = argmax(logits)
	pos++
}

See examples/generate/main.go for a complete working example.

Embedding the Server

You can embed Vexel's HTTP server into your own Go application:

sched, _ := scheduler.NewScheduler(model, tok, scheduler.Config{
	MaxBatchSize:  1,
	MaxSequences:  64,
	MaxTokens:     256,
	SamplerConfig: sampler.DefaultConfig(),
})
go sched.Run(ctx)

srv := serve.NewServer(sched)
http.ListenAndServe(":8080", srv)

See examples/server/main.go for a complete working example with model loading and graceful shutdown.

HTTP API

Generate (Blocking):

curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me a joke"}'

Stream (SSE):

curl -N -X POST http://localhost:8080/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me a joke"}'

Examples

Example	Description	Build Tags
`examples/client/`	HTTP client with Generate and Stream	none
`examples/server/`	Embed Vexel server in your Go app	`metal`
`examples/generate/`	Direct inference loop (no HTTP)	`metal`

Run examples with:

# Client (requires a running server)
go run ./examples/client

# Server (Metal required)
go run -tags metal ./examples/server -model path/to/model.gguf

# Direct generation (Metal required)
go run -tags metal ./examples/generate -model path/to/model.gguf -prompt "Hello!"

Architecture

client/              Go HTTP client library (no Metal dependency)
inference/
  cmd/vexel/         Unified CLI binary (serve, generate, chat, bench, tokenize)
  cmd/vexel/internal REPL and chat loop implementation
  backend/metal/     Metal GPU backend and kernel dispatch
  runtime/           Model loading, weight management, inference loop
  scheduler/         Continuous batching scheduler with metrics
  serve/             HTTP server (/generate, /stream endpoints)
  pkg/
    gguf/            GGUF model format parser
    tokenizer/       Tokenizer with chat template support
    sampler/         Temperature, top-k, top-p sampling
examples/            Usage examples (client, server, direct generation)
scripts/             Performance harness and benchmarking tools

Performance

Benchmarks on Apple M3 Max (128 GB) with LLaMA 2 7B:

Q4_0 (3.56 GB):

Metric	Vexel	llama.cpp	MLX	Gap (vs llama.cpp)
Decode throughput	64.8 tok/s	76.3 tok/s	83.5 tok/s	-15.1%
Decode ctx degradation (16→512)	~-10%	-2.7%	-2.5%	—
Prefill (128 tokens)	377 tok/s	803 tok/s	725 tok/s	-53%
Model load time	~885 ms	~1100 ms	—	+20% faster

Q4_K_M (4.08 GB) — optimized kernels:

Metric	Vexel	vs Q4_0	Before Optimization
Decode throughput	53.5 tok/s	0.84x	14.2 tok/s (3.8x faster)
Prefill (128 tokens)	220.4 tok/s	0.58x	16.6 tok/s (13.3x faster)
Context degradation (16→512)	-10.8%	—	—

Vexel achieves 85% of llama.cpp's decode throughput on Q4_0 and supports Q4_K_M with custom sub-block strided decode and simdgroup tiled prefill kernels.

Flash Attention: Tiled split-KV decode with online softmax reduced context scaling degradation from -24.5% to ~-10% (ctx=16 → ctx=512).

See benchmarks/RESULTS.md for detailed analysis, context-length scaling data, and optimization roadmap.

Where Vexel differentiates:

Go scheduler with continuous batching (multi-client throughput)
gRPC + HTTP dual protocol for production serving
Single binary deployment — no Python dependency chain
Pure Go codebase — easy to embed, extend, and deploy
Faster cold start than llama.cpp (~885ms vs ~1100ms)

Run the benchmark suite:

# Unit-level benchmarks
go test -tags metal -run TestThroughput -v ./inference/runtime/
go test -tags metal -run TestLatency -v ./inference/runtime/

# Competitive benchmarks (vs llama.cpp, MLX, Ollama)
./benchmarks/run_benchmark.sh --model path/to/model.gguf --engines vexel,llama

# Scheduling benchmarks via CLI
./vexel bench --batch 128 --seq-len 128 --num-seqs 256

Performance & Correctness Harness

Use the provided harness to compare Vexel (Metal) against llama.cpp on a fixed prompt set.

Requirements

Model: default models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf (override with MODEL_PATH).
Vexel binary: default ./vexel (override with VEXEL_BIN).
llama.cpp binary: default llama-cli (override with LLAMA_BIN).

Run

bash scripts/perf_harness.sh

The harness uses deterministic sampling (temp=0) to make correctness comparisons meaningful.

Environment overrides:

MODEL_PATH – GGUF model path
VEXEL_BIN – path to Vexel binary
LLAMA_BIN – path to llama.cpp binary
OUT_DIR – output directory (default perf_reports)

Name		Name	Last commit message	Last commit date
Latest commit History 395 Commits
.old-conductor		.old-conductor
benchmarks		benchmarks
client		client
conductor		conductor
data		data
docs		docs
examples		examples
inference		inference
perf_reports		perf_reports
scripts		scripts
starling_messages		starling_messages
test/golden		test/golden
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
conductor.md		conductor.md
coverage.out		coverage.out
debug-inference.sh		debug-inference.sh
go.mod		go.mod
go.sum		go.sum
long_prompt.txt		long_prompt.txt
metal.test		metal.test
mod_test.go		mod_test.go
structure_test.go		structure_test.go
vexel-debug		vexel-debug
vexel-test		vexel-test

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Vexel

Features

Getting Started

Prerequisites

Building

CLI Subcommands

Running the Server

Text Generation

Interactive Chat

Tokenization

Benchmarking

Usage

Go Client

Direct Inference

Embedding the Server

HTTP API

Examples

Architecture

Performance

Performance & Correctness Harness

Requirements

Run

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Vexel

Features

Getting Started

Prerequisites

Building

CLI Subcommands

Running the Server

Text Generation

Interactive Chat

Tokenization

Benchmarking

Usage

Go Client

Direct Inference

Embedding the Server

HTTP API

Examples

Architecture

Performance

Performance & Correctness Harness

Requirements

Run

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages