BitDecoding is a high-performance, GPU-optimized system
designed to accelerate long-context LLM decoding with a low-bit KV
cache. It achieves a 3-9x speedup over Flash Attention v2.
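To make "low-bit KV cache" concrete, here is a minimal pure-Python sketch of 4-bit quantization with two values packed per byte. It is illustrative only and is not BitDecoding's kernel or API: the function names, the single scale and zero point per group, and the nibble layout are all assumptions chosen for readability.

```python
# Illustrative sketch only: asymmetric 4-bit quantization of one group of
# values, packed two codes per byte. All names here are hypothetical and
# do not come from the BitDecoding codebase.

def quantize_int4(values):
    """Map floats to integer codes in [0, 15] using one scale and zero point."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 15 = 2**4 - 1 steps; fall back if all values are equal
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def pack_int4(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_int4(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    codes = []
    for b in packed:
        codes.append(b & 0xF)
        codes.append(b >> 4)
    return codes[:n]

def dequantize_int4(codes, scale, zero):
    """Approximate the original floats from codes, scale, and zero point."""
    return [c * scale + zero for c in codes]

group = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 0.75, -0.25]
codes, scale, zero = quantize_int4(group)
packed = pack_int4(codes)
restored = dequantize_int4(unpack_int4(packed, len(group)), scale, zero)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(len(packed), "bytes for", len(group), "values; max abs error:", max_err)
```

Eight values fit in four bytes, and the reconstruction error is bounded by half a quantization step. A GPU implementation would do the unpacking and dequantization inside the attention kernel rather than in Python.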
git clone --recursive https://github.com/DD-DuDa/BitDecoding.git
conda create -n bitdecode python=3.10
conda activate bitdecode
pip install -r requirements.txt
python setup.py install
- See benchmark/bench_single_decode.ipynb
- (Optional) Play with libtorch C++
# download libtorch
cd BitDecoding/csrc/bit_decode
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=<libtorch_path> ..
make -j12
- For an end-to-end inference example, see e2e
If you find BitDecoding useful or want to use it in your projects, please cite our paper:
@misc{du2025bitdecodingunlockingtensorcores,
title={BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache},
author={Dayou Du and Shijie Cao and Jianyi Cheng and Ting Cao and Mao Yang},
year={2025},
eprint={2503.18773},
archivePrefix={arXiv},
primaryClass={cs.AR},
url={https://arxiv.org/abs/2503.18773},
}
BitDecoding is inspired by many open-source libraries, including (but not limited to) flash-attention, flute, Atom, omniserve, KIVI.