This is a fork of the bitnet.cpp repo; bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU.
This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.
| Model | Parameters | CPU | Kernel |  |  |
|---|---|---|---|---|---|
|  |  |  | I2_S | TL1 | TL2 |
| BitNet-b1.58-2B-4T | 2.4B | x86 | ✅ | ❌ | ✅ |
|  |  | ARM | ✅ | ✅ | ❌ |
❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs at larger scales, in terms of both model size and training tokens.
| Model | Parameters | CPU | Kernel |  |  |
|---|---|---|---|---|---|
|  |  |  | I2_S | TL1 | TL2 |
| bitnet_b1_58-large | 0.7B | x86 | ✅ | ❌ | ✅ |
|  |  | ARM | ✅ | ✅ | ❌ |
| bitnet_b1_58-3B | 3.3B | x86 | ❌ | ❌ | ✅ |
|  |  | ARM | ❌ | ✅ | ❌ |
| Llama3-8B-1.58-100B-tokens | 8.0B | x86 | ✅ | ❌ | ✅ |
|  |  | ARM | ✅ | ✅ | ❌ |
| Falcon3 Family | 1B-10B | x86 | ✅ | ❌ | ✅ |
|  |  | ARM | ✅ | ✅ | ❌ |
| Falcon-E Family | 1B-3B | x86 | ✅ | ❌ | ✅ |
|  |  | ARM | ✅ | ✅ | ❌ |
- python>=3.9,<=3.11
- cmake>=3.22
- llvm/clang>=18
- uv
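Before installing, you may want to confirm the prerequisites above are on your PATH. This is just an optional sanity check, not part of the official setup:

```bash
# Optional: verify the tool versions listed above
python3 --version   # expect 3.9–3.11
cmake --version     # expect >= 3.22
clang --version     # expect >= 18
uv --version
```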
- Clone the repo

```bash
git clone --recursive https://github.com/buixuanloc/BitNet.git
cd BitNet
```

- Install the dependencies

```bash
uv python pin 3.11
uv init --bare
uv add -r requirements.txt
source .venv/bin/activate
```

- Build the project
```bash
# Manually download the model and run with local path
hf download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s -uv
```

```
usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd] [--use-pretuned] [--use-uv]

Setup the environment for running inference

optional arguments:
  -h, --help            show this help message and exit
  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
                        Model used for inference
  --model-dir MODEL_DIR, -md MODEL_DIR
                        Directory to save/load the model
  --log-dir LOG_DIR, -ld LOG_DIR
                        Directory to save the logging info
  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
                        Quantization type
  --quant-embd          Quantize the embeddings to f16
  --use-pretuned, -p    Use the pretuned kernel parameters
  --use-uv, -uv         Use uv to install python packages
```
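As an alternative to downloading the GGUF manually, setup_env.py can fetch one of the Hugging Face repos listed in its --hf-repo choices. A minimal sketch, assuming network access to Hugging Face and using only the flags documented above:

```bash
# Let setup_env.py pull the model itself and build with pretuned kernel parameters
python setup_env.py --hf-repo tiiuae/Falcon3-7B-Instruct-1.58bit -q i2_s --use-pretuned -uv
```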
```bash
# Run inference with the quantized model
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
```

```
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]

Run inference

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
```
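For a non-interactive run you can drop -cnv and control generation length, context size, threads, and sampling temperature with the flags above. A sketch, assuming the BitNet-b1.58-2B-4T model was quantized as shown earlier (the prompt is only illustrative):

```bash
# Single-shot generation: 128 new tokens, 4 threads, 2048-token context, mild sampling
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Explain 1.58-bit quantization in one paragraph." \
  -n 128 -t 4 -c 2048 -temp 0.7
```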
We provide scripts to run the inference benchmark for a given model.
```
usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]

Setup the environment for running the inference

required arguments:
  -m MODEL, --model MODEL
                        Path to the model file.

optional arguments:
  -h, --help            Show this help message and exit.
  -n N_TOKEN, --n-token N_TOKEN
                        Number of generated tokens.
  -p N_PROMPT, --n-prompt N_PROMPT
                        Prompt to generate text from.
  -t THREADS, --threads THREADS
                        Number of threads to use.
```
Here's a brief explanation of each argument:
- `-m`, `--model`: The path to the model file. This is a required argument that must be provided when running the script.
- `-n`, `--n-token`: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
- `-p`, `--n-prompt`: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
- `-t`, `--threads`: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
- `-h`, `--help`: Show the help message and exit. Use this argument to display usage information.
For example:
```bash
python utils/e2e_benchmark.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -n 200 -p 256 -t 4
```

This command would run the inference benchmark using the model located at models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf, generating 200 tokens from a 256-token prompt, utilizing 4 threads.
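If you want to compare throughput across thread counts, a simple loop over the `-t` flag works. This is only a convenience sketch, not part of the provided tooling; it assumes the same model path as in the example above:

```bash
# Benchmark the same model at several thread counts and print a header per run
MODEL=models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
for T in 1 2 4 8; do
  echo "=== threads: $T ==="
  python utils/e2e_benchmark.py -m "$MODEL" -n 128 -p 512 -t "$T"
done
```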
For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:

```bash
python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M

# Run benchmark with the generated model; use -m to specify the model path, -p the number of prompt tokens, -n the number of tokens to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
```
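To get a feel for how the TL1 kernel scales with model size, you could generate several dummy models and benchmark each. A sketch, assuming generate-dummy-bitnet-model.py accepts `--model-size` values other than the 125M shown above (an assumption, not confirmed by this README):

```bash
# Hypothetical sweep over dummy model sizes; sizes beyond 125M are an assumption
for SIZE in 125M 350M 1B; do
  python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large \
    --outfile "models/dummy-bitnet-${SIZE}.tl1.gguf" --outtype tl1 --model-size "$SIZE"
  python utils/e2e_benchmark.py -m "models/dummy-bitnet-${SIZE}.tl1.gguf" -p 512 -n 128
done
```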
```bash
# Prepare the .safetensors model file
hf download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16

# Convert to gguf model
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
```

See gpu/README.md