bitnet_b1_58-3B-vlut-gguf

This repository contains state-of-the-art ternary-packed versions of bitnet_b1_58-3B in GGUF format, optimized for efficient on-device inference using the Vec-LUT method.

Key Features

  • 🎯 SOTA Compression: Achieves BPW (bits per weight) as low as 1.60 through lossless sub-2-bit ternary packing.
  • ⚡ SOTA Performance: Delivers superior throughput (4.2x speedup) in parallel inference scenarios via vector lookup table (LUT).
  • 🔌 Drop-in Ready: Seamless integration with vlut.cpp for immediate deployment on edge devices.

Available Model Variants

Models are named ggml-model-{PACKING}_{TILE}.gguf; the _{TILE} suffix is omitted when the tile size is 1:

File Name               Packing (BPW)   Tile Size   Comment
ggml-model-I1_V.gguf    I1_V (1.60)     1
ggml-model-I1_V_2.gguf  I1_V (1.60)     2           Recommended
ggml-model-I2_V.gguf    I2_V (2.00)     1
ggml-model-I2_V_4.gguf  I2_V (2.00)     4           Recommended
ggml-model-I2_V_8.gguf  I2_V (2.00)     8

Selection Guide

  • BPW vs. Speed: I1_V uses less memory (1.60 vs. 2.00 BPW) but does not always outperform I2_V in throughput.
  • Tiling Trade-off: Tiled variants (tile size > 1) deliver higher throughput but require more cache capacity.
  • Starting Point: Begin with I1_V_2 or I2_V_4, then benchmark on your target hardware.

For detailed tiling parameter analysis, see Evaluation.md and the paper.
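The BPW figures translate directly into a rough weight-memory footprint: size ≈ parameter count × BPW / 8 bytes. A back-of-the-envelope check, assuming 3B parameters and ignoring embeddings, metadata, and the KV cache:

```shell
# Weight size ≈ params * BPW / 8 bytes (excludes embeddings, metadata, KV cache)
awk 'BEGIN { params = 3.0e9; bpw = 1.60; printf "%.2f GB\n", params * bpw / 8 / 1e9 }'  # I1_V  → 0.60 GB
awk 'BEGIN { params = 3.0e9; bpw = 2.00; printf "%.2f GB\n", params * bpw / 8 / 1e9 }'  # I2_V  → 0.75 GB
```

So the I1_V packing saves roughly 150 MB of weights over I2_V at this model size; whether that outweighs any throughput difference depends on your device.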

Usage

Prerequisites

Install vlut.cpp (these models require vlut.cpp, not vanilla llama.cpp):

git clone https://github.com/Cipherxzc/vlut.cpp.git
cd vlut.cpp
cmake -B build && cmake --build build --config Release -j4

Download & Run

# Download the recommended variant, e.g., I2_V_4
hf download <repo_id> \
  ggml-model-I2_V_4.gguf --local-dir ./models

# Run parallel inference
./build/bin/llama-batched \
  -m ./models/ggml-model-I2_V_4.gguf \
  -p "I believe the meaning of life is" \
  -np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5

# Benchmark performance
./build/bin/llama-bench \
  -m ./models/ggml-model-I2_V_4.gguf \
  -t 1 -p 128 -n 0
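To decide between packings and tile sizes empirically, you can benchmark every downloaded variant with identical settings (a sketch, assuming the files were saved to ./models and vlut.cpp was built as above):

```shell
# Run llama-bench over each variant so the throughput numbers are directly comparable
for m in ./models/ggml-model-*.gguf; do
  echo "== $m =="
  ./build/bin/llama-bench -m "$m" -t 1 -p 128 -n 0
done
```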

For comprehensive usage instructions, refer to the vlut.cpp Quick Start Guide.

Citation

If you use these models, please cite our paper:

@article{li2025veclut,
  title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
  author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
  journal={arXiv preprint arXiv:2512.06443},
  year={2025},
  url={https://arxiv.org/abs/2512.06443}
}