bitnet.cpp

License: MIT

This is a fork of the bitnet.cpp repo; bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU.

Acknowledgements

This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.

Official Models

The available CPU kernels are I2_S, TL1, and TL2.

Model                Parameters   CPU
BitNet-b1.58-2B-4T   2.4B         x86, ARM
Supported Models

❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of large-scale 1-bit LLMs, in terms of both model size and training tokens.

Model                        Parameters   CPU
bitnet_b1_58-large           0.7B         x86, ARM
bitnet_b1_58-3B              3.3B         x86, ARM
Llama3-8B-1.58-100B-tokens   8.0B         x86, ARM
Falcon3 Family               1B-10B       x86, ARM
Falcon-E Family              1B-3B        x86, ARM

Installation

Requirements

  • python>=3.9,<=3.11
  • cmake>=3.22
  • llvm/clang>=18
  • uv
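
You can quickly confirm that the toolchain meets these requirements before building; the checks below are a minimal sketch and assume the tools are already on your PATH:

# Verify toolchain versions (expect python 3.9–3.11, cmake >= 3.22, clang >= 18)
python3 --version
cmake --version
clang --version
uv --version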

Build from source

  1. Clone the repo

git clone --recursive https://github.com/buixuanloc/BitNet.git
cd BitNet

  2. Install the dependencies

uv python pin 3.11
uv init --bare
uv add -r requirements.txt
source .venv/bin/activate

  3. Build the project

# Manually download the model and run with local path
hf download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s -uv
usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd] [--use-pretuned] [--use-uv]

Setup the environment for running inference

optional arguments:
  -h, --help            show this help message and exit
  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
                        Model used for inference
  --model-dir MODEL_DIR, -md MODEL_DIR
                        Directory to save/load the model
  --log-dir LOG_DIR, -ld LOG_DIR
                        Directory to save the logging info
  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
                        Quantization type
  --quant-embd          Quantize the embeddings to f16
  --use-pretuned, -p    Use the pretuned kernel parameters
  --use-uv, -uv         Use uv to install python packages
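
As an alternative to downloading the model manually, setup_env.py can also be pointed at one of the supported Hugging Face repositories via --hf-repo. A minimal sketch, assuming the script fetches the chosen checkpoint into the directory given by -md (check the script's output for the exact behaviour on your version):

# Prepare bitnet_b1_58-large with the I2_S kernel, letting setup_env.py handle the download
python setup_env.py --hf-repo 1bitLLM/bitnet_b1_58-large -md models/bitnet_b1_58-large -q i2_s -uv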

Usage

Basic usage

# Run inference with the quantized model
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]

Run inference

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
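
For example, a single-shot (non-conversation) run that pins the thread count, context size, and sampling temperature could look like the following; the prompt and parameter values are arbitrary:

python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Explain 1-bit quantization in one paragraph." -n 128 -t 4 -c 2048 -temp 0.8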

Benchmark

We provide scripts to run the inference benchmark for a given model.

usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]

Setup the environment for running the inference

required arguments:
  -m MODEL, --model MODEL
                        Path to the model file.

optional arguments:
  -h, --help
                        Show this help message and exit.
  -n N_TOKEN, --n-token N_TOKEN
                        Number of generated tokens.
  -p N_PROMPT, --n-prompt N_PROMPT
                        Prompt to generate text from.
  -t THREADS, --threads THREADS
                        Number of threads to use.

Here's a brief explanation of each argument:

  • -m, --model: The path to the model file. This is a required argument that must be provided when running the script.
  • -n, --n-token: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
  • -p, --n-prompt: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
  • -t, --threads: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
  • -h, --help: Show the help message and exit. Use this argument to display usage information.

For example:

python utils/e2e_benchmark.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -n 200 -p 256 -t 4

This command runs the inference benchmark using the model located at models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf, generating 200 tokens from a 256-token prompt with 4 threads.

For model layouts not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:

python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M

# Run benchmark with the generated model; use -m to specify the model path, -p to specify the number of prompt tokens, -n to specify the number of tokens to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
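
To see how throughput scales with the thread count, the documented -t flag can be swept in a small shell loop; this is just a convenience sketch:

for t in 1 2 4 8; do
  echo "=== threads: $t ==="
  python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128 -t "$t"
done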

Convert from .safetensors Checkpoints

# Prepare the .safetensors model file
hf download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16

# Convert to gguf model
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
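
Once converted, the resulting GGUF file can be passed to run_inference.py like any other model. The exact output filename is determined by convert-helper-bitnet.py, so the path below is an assumption; use whatever path the script reports when it finishes:

# Hypothetical output path; adjust to the file actually produced by the conversion helper
python run_inference.py -m ./models/bitnet-b1.58-2B-4T-bf16/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv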

GPU support

See gpu/README.md
