DD-DuDa/BitDecoding

BitDecoding

BitDecoding is a high-performance, GPU-optimized system that accelerates long-context LLM decoding with a low-bit KV cache. It achieves a 3-9x speedup over Flash Attention v2. [Figures: overview, scheme]
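
To make "low-bit KV cache" concrete, here is a minimal sketch of 4-bit asymmetric quantization with two codes packed per byte. It is an illustration only, not BitDecoding's kernels or API: the function names `quantize_4bit` and `dequantize_4bit`, the grouping, and the min/max scaling scheme are all assumptions chosen for clarity. Compared with FP16 (2 bytes per element), 4-bit storage takes 0.5 bytes per element, so the cache shrinks roughly 4x before scale and offset metadata are counted.

```python
from typing import List, Tuple


def quantize_4bit(values: List[float]) -> Tuple[bytes, float, float]:
    """Quantize one group of floats to 4-bit codes, packing two codes per byte.

    Hypothetical helper for illustration; not part of BitDecoding.
    """
    lo, hi = min(values), max(values)
    # 4 bits give 16 levels (codes 0..15); fall back to 1.0 for constant input.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so the codes pair up evenly
    # Low nibble holds the even-indexed code, high nibble the odd-indexed one.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_4bit(packed: bytes, scale: float, lo: float, n: int) -> List[float]:
    """Unpack 4-bit codes and map them back to approximate float values."""
    codes: List[int] = []
    for b in packed:
        codes.append(b & 0x0F)
        codes.append(b >> 4)
    return [c * scale + lo for c in codes[:n]]


# Toy "key" vector standing in for one quantization group of a KV cache.
group = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.25, -0.75, 0.9]
packed, scale, lo = quantize_4bit(group)
restored = dequantize_4bit(packed, scale, lo, len(group))
max_err = max(abs(a - b) for a, b in zip(group, restored))
```

Eight values fit in four bytes here, and the reconstruction error of each value is bounded by half a quantization step (`scale / 2`). A real system must also perform the attention computation efficiently on the packed data, which is the part BitDecoding addresses on the GPU.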

Benchmark

  • Kernel performance on RTX 4090 [figure: overview]
  • Kernel performance on A100 [figure: overview]

Installation

git clone --recursive https://github.com/DD-DuDa/BitDecoding.git
conda create -n bitdecode python=3.10
conda activate bitdecode
pip install -r requirements.txt
python setup.py install

Quick Start

  1. See benchmark/bench_single_decode.ipynb
  2. (Optional) Try the libtorch C++ build
    # download libtorch 
    
    cd BitDecoding/csrc/bit_decode
    mkdir build && cd build
    cmake -DCMAKE_PREFIX_PATH=<libtorch_path> ..
    make -j12
    
  3. For an end-to-end inference example, see e2e

Citation

If you find BitDecoding useful or want to use it in your projects, please cite our paper:

@misc{du2025bitdecodingunlockingtensorcores,
      title={BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache}, 
      author={Dayou Du and Shijie Cao and Jianyi Cheng and Ting Cao and Mao Yang},
      year={2025},
      eprint={2503.18773},
      archivePrefix={arXiv},
      primaryClass={cs.AR},
      url={https://arxiv.org/abs/2503.18773}, 
}

Acknowledgement

BitDecoding is inspired by many open-source libraries, including (but not limited to) flash-attention, flute, Atom, omniserve, KIVI.
