tiny-duo-infer

Project site: https://jameswei.github.io/tiny-duo-infer/

A learning-first LLM inference engine built from scratch in pure Python on Apple Silicon (MLX). Every inference concept — prefill, decode, KV cache, GQA, RoPE, SwiGLU, weight-only quantization — is explicitly implemented in readable code rather than hidden behind transformers, mlx-lm, or vLLM.

Inspired by nano-vllm, MinivLLM, and tiny-llm.

What's Implemented

Prefill & decode loop — full generation lifecycle with EOS detection and stop-string support
Grouped-query attention (GQA) — explicit head expansion, causal masking, KV cache update/advance protocol
Rotary position embeddings (RoPE) — frequency precomputation and pair-wise rotation
SwiGLU feed-forward networks — gate/up/down projections with explicit SiLU activation
KV cache — pre-allocated per-layer buffers with position-consistent write/commit semantics
Sampling — greedy, temperature scaling, top-k, top-p nucleus
Weight-only quantization — INT4/INT8 via MLX-native quantized_matmul; per-run memory accounting
HTTP serving — single-request FastAPI server with JSON and NDJSON streaming endpoints
Observability — per-request TTFT, decode throughput, KV-cache memory, context-budget policy
Profiling — repeatable latency/throughput benchmarks across prompts and quantization modes
Multi-model support — Llama-3.2-1B and Qwen3-0.6B on the same engine
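To make the sampling item above concrete, here is a minimal pure-Python sketch of how greedy, temperature, top-k, and top-p sampling can be chained. It is an illustration only: the function name `sample_next_token` and its parameters are hypothetical, it works on plain lists rather than MLX arrays, and the engine's own sampler may order or implement the filters differently.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick a token id from raw logits (a list of floats). Hypothetical helper."""
    # Greedy decoding: a temperature of 0 means "take the argmax".
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, so no explicit renormalization is needed.
    rng = rng or random.Random()
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Passing a seeded `random.Random(seed)` as `rng` makes the draw repeatable, which is the same idea as a `--seed` option.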

Models

Model	HuggingFace
Llama-3.2-1B (base)	meta-llama/Llama-3.2-1B
Qwen3-0.6B	Qwen/Qwen3-0.6B

Getting Started

Requirements: Python 3.12, uv, Apple Silicon Mac (MLX)

# Clone and install
git clone https://github.com/jameswei/tiny-duo-infer.git
cd tiny-duo-infer
uv sync

# Download a model (example: Qwen3-0.6B)
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./models/qwen3-0.6b

# Run your first generation
uv run python -m tiny_duo_infer.cli \
  --model-path ./models/qwen3-0.6b \
  --prompt "What is attention in transformers?" \
  --chat --max-new-tokens 64

CLI

Plain completion (Llama-3.2-1B):

uv run python -m tiny_duo_infer.cli \
  --model-path ./models/llama-3.2-1b \
  --prompt "The capital of France is" \
  --max-new-tokens 32 --temperature 0.0

Qwen3 chat with explicit messages:

uv run python -m tiny_duo_infer.cli \
  --model-path ./models/qwen3-0.6b \
  --message system:"You are a concise assistant." \
  --message user:"Explain KV cache in one paragraph." \
  --max-new-tokens 128 --temperature 0.7

INT4 weight-only quantization with stats:

uv run python -m tiny_duo_infer.cli \
  --model-path ./models/llama-3.2-1b \
  --prompt "The capital of France is" \
  --max-new-tokens 32 --temperature 0.0 \
  --quantization int4 --show-stats

Key flags:

Flag	Description
`--chat`	Wrap prompt as a user message and apply ChatML template (Qwen3 only).
`--message ROLE:TEXT`	Explicit system/user messages (repeatable).
`--quantization MODE`	`none` (default), `int4`, or `int8` weight-only quantization.
`--quant-group-size N`	Group size along the input dimension. Default `64`.
`--show-stats`	Print timing, KV-cache memory, and quantization stats to stderr.
`--context-policy POLICY`	`allow_context_stop` (default), `reject`, `truncate_left`, `truncate_right`, `reserve_generation`.
`--stop TEXT`	Stop when TEXT appears in output (repeatable).
`--seed N`	Seed for deterministic sampling.
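The `--quantization` and `--quant-group-size` flags refer to group-wise weight quantization. The sketch below shows the general idea in pure Python under stated assumptions: each group of consecutive weights along the input dimension shares one scale and one bias (affine quantization), and both are stored as 16-bit values. All function names are hypothetical, and MLX's native kernels pack the codes and may round differently.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to unsigned integer codes."""
    levels = (1 << bits) - 1              # 15 for INT4, 255 for INT8
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo               # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Recover approximate floats from codes plus the group's scale and bias."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=4):
    """Split one weight row into groups along the input dimension."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def bits_per_weight(bits=4, group_size=64, meta_bits=32):
    """Effective storage cost per weight.

    meta_bits assumes one 16-bit scale and one 16-bit bias per group.
    """
    return bits + meta_bits / group_size
```

Under these assumptions, INT4 with the default group size of 64 costs 4.5 bits per weight and INT8 costs 8.5. A smaller group size tracks the weights more closely but spends more bits on scales and biases.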

HTTP Server

Start the server:

uv run python -m tiny_duo_infer.serving.api \
  --model-path ./models/qwen3-0.6b \
  --max-seq-len 2048

The server also accepts --quantization {none,int4,int8} and --quant-group-size N.
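Because the KV cache is pre-allocated per layer, --max-seq-len largely determines its memory footprint. The following back-of-the-envelope estimate is a sketch, not the engine's own accounting: the function name is hypothetical, and the layer count, KV-head count, and head dimension should be read from the model's config.json.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_seq_len,
                   bytes_per_element=2, batch_size=1):
    """Bytes needed to pre-allocate K and V buffers for every layer."""
    per_buffer = batch_size * num_kv_heads * max_seq_len * head_dim * bytes_per_element
    # Factor 2: each layer holds one buffer for keys and one for values.
    return 2 * num_layers * per_buffer


# Assumed example values (verify against config.json): 28 layers,
# 8 KV heads, head_dim 128, 16-bit cache entries, max_seq_len 2048.
example_mib = kv_cache_bytes(28, 8, 128, 2048) / (1024 * 1024)
```

With those assumed values the estimate comes to 224 MiB. Note that GQA already shrinks this figure, since the cache stores only the KV heads rather than one K/V pair per query head.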

Full-response generation:

curl -s http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "max_new_tokens": 16, "temperature": 0.0}'
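The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, the payload fields are the ones shown in the curl example, and the shape of the response JSON is not documented here, so the code simply returns whatever object the server sends back.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=16, temperature=0.0,
                           base_url="http://127.0.0.1:8000"):
    """Build a POST request for /generate without sending it."""
    payload = {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON body as-is."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `print(generate("The capital of France is"))` prints the full response object.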

Streaming (NDJSON, one object per line):

curl -s http://127.0.0.1:8000/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_new_tokens": 32}'
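NDJSON means each line of the response body is a complete JSON object, so a client can decode events as they arrive. The sketch below shows one way to consume the stream with the standard library. The helper names are hypothetical, and the fields inside each event are not documented here, so the code yields whole objects.

```python
import json
import urllib.request


def iter_ndjson(lines):
    """Yield one decoded object per non-empty line (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        yield json.loads(line)


def stream_generate(prompt, max_new_tokens=32, base_url="http://127.0.0.1:8000"):
    """POST to /generate/stream and yield events as the server emits them."""
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens})
    request = urllib.request.Request(
        f"{base_url}/generate/stream",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        # The response object is iterable line by line.
        yield from iter_ndjson(resp)
```

With the server running, `for event in stream_generate("Once upon a time"): print(event)` prints each event as it is received.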

Health check: curl http://127.0.0.1:8000/health

The server handles one request at a time; concurrent requests receive a 503 response.
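One common way to get this behavior is a non-blocking lock around the generation call: a request that cannot take the lock is rejected immediately instead of queued. The sketch below illustrates that pattern with the standard library only. The class and function names are hypothetical, and this is not necessarily how the project's FastAPI server implements it.

```python
import threading


class SingleRequestGate:
    """Admits one request at a time and refuses the rest without waiting."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_enter(self):
        # blocking=False returns False immediately if the lock is already held.
        return self._lock.acquire(blocking=False)

    def leave(self):
        self._lock.release()


def handle(gate, work):
    """Run work() if the gate is free, else report 503 (Service Unavailable)."""
    if not gate.try_enter():
        return 503, None
    try:
        return 200, work()
    finally:
        gate.leave()  # always free the slot, even if work() raises
```

Clients that may overlap requests should therefore treat a 503 as "busy, retry later" rather than as a fatal error.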

Profiling

Measure latency, throughput, and KV-cache memory across prompt sets:

uv run python scripts/profile_generation.py \
  --model-path ./models/qwen3-0.6b \
  --max-seq-len 512 --max-new-tokens 64 \
  --runs 5 --warmup-runs 1

Add --quantization int8 to compare quantized vs full-precision runs side by side. Use --json for machine-readable output.
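For reading the numbers, it helps to pin down what the two headline metrics usually mean. The sketch below uses common definitions, which are an assumption here and may differ from what scripts/profile_generation.py reports: TTFT is the delay from request start to the first generated token (dominated by prefill), and decode throughput counts only the tokens after the first. The function name is hypothetical.

```python
def summarize_run(start_time, token_times):
    """Compute TTFT and decode throughput from per-token timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens were generated")
    # Time to first token: covers prefill plus the first sampling step.
    ttft = token_times[0] - start_time
    # Decode phase: every token after the first one.
    decode_tokens = len(token_times) - 1
    decode_seconds = token_times[-1] - token_times[0]
    tokens_per_s = decode_tokens / decode_seconds if decode_seconds > 0 else 0.0
    return {"ttft_s": ttft, "decode_tokens_per_s": tokens_per_s}
```

Timestamps would typically come from `time.perf_counter()` taken right after each token is produced. Warmup runs are excluded so that one-time costs such as weight loading and kernel compilation do not distort the result.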

Development

# Install dev dependencies
uv sync --group dev

# Run tests
uv run pytest -q

# Check for whitespace issues
git diff --check

GitHub Actions runs the same test suite on every push and pull request.
