Releases · flashinfer-ai/flashinfer
v0.2.7.post1
What's Changed
- [feat] optimize persistent batch attention perf. by @happierpig in #1200
- Feature/cudnn dynamic cubin by @Anerudhan in #1187
- Fix flashinfer.comm module missing by @BBuf in #1203
- chore: bump flashinfer v0.2.7.post1 by @zhyncs in #1205
New Contributors
- @Anerudhan made their first contribution in #1187
- @BBuf made their first contribution in #1203
Full Changelog: v0.2.7...v0.2.7.post1
v0.2.7
What's Changed
- ci: Update images for self-hosted ARM64 runner by @yongwww in #1128
- Fix pointer dtype bug in rope by @Edenzzzz in #1129
- feat: update and test create_ipc_buffer by @yyihuang in #1130
- misc: update runllm widget by @yzh119 in #1132
- misc: correct runllm widget (again) by @MasterJH5574 in #1133
- [Feature] Support PDL for batch Prefill and Decode by @Edenzzzz in #1117
- fix: negative zero by type trait --> binary value by @yyihuang in #1136
- fix: sync after create_workspace by @yyihuang in #1138
- refactor: use functools.cache instead of global dict for caching modules by @yzh119 in #1135
- [feat] add unified batch attention w/ correctness tests. by @happierpig in #1137
- Fix FA2 and FA3 multi-item scoring and cuda illegal memory access error by @arde171 in #1140
- feat: Add support for FLASHINFER_EXTRA_LDFLAGS environment variable by @jennifgcrl in #1144 (see the usage sketch after this list)
- misc: remove sync between persistent runners and use packed_causal_kv_end for SM90Plan by @Edenzzzz in #1146
- [fix] fix precision errors when applying causal mask on Qwen-2.5 series models by @happierpig in #1148
- ci: Install mpi4py by @yongwww in #1149
- feat: add trtllm moe_allreduce_fusion by @yyihuang in #1108
- feat: add trtllm all-reduce fusion by @yyihuang in #1131
- Add more logging to TRTLLM-GEN debug trace (NFC) by @joker-eph in #1158
- feat: update non-fused moe by @yyihuang in #1161
- Add fp4 quantization swizzling tests by @wenscarl in #1157
- refactor: communication module by @yyihuang in #1162
- feat: add finalize_moe_allreduce from trtllm by @yyihuang in #1159
- feat: experimental support of green ctx by @yzh119 in #1163
- feat: Fused temperature online softmax kernel by @xslingcn in #1153
- MNNVL MoE All-to-All Support by @cyx-6 in #1134
- feat: nvshmem python bindings by @yzh119 in #1160
- Fix missing symbols in trtllm_utils.so by @tiran in #1168
- feat: logits processor fusion rule for temperature softmax by @xslingcn in #1170
- Expose fp4 blockscale swizzling kernel by @wenscarl in #1176
- add nvshmem sum_reduce for mnnvl allreduce by @Amir-19 in #1152
- bugfix: softmax NaN results caused by large -inf masks by @xslingcn in #1178
- [CI] Update is_last_build by @yongwww in #1183
- [feat] support block sparse attention w/ variable block sizes and head-wise sparse patterns by @happierpig in #1177
- bugfix: fix invalid blackwell fmha unittests by @yzh119 in #1181
- feat: support green ctx creation by a list of SM counts by @Conless in #1190
- fix: trtllm_comm module aot arch issues by @yyihuang in #1196
- bugfix: fix broken docs build by adding missing dependencies by @Conless in #1197
- chore: bump v0.2.7 by @zhyncs in #1199
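As a usage note for #1144 above, the sketch below shows one way the new environment variable might be used to inject extra linker flags into FlashInfer's JIT builds. The variable name comes from the PR title; the flag values and the set-before-import ordering are illustrative assumptions, not documented behavior:

```python
import os

# Hypothetical extra linker flags for JIT-compiled FlashInfer ops; the
# library path below is a placeholder, not part of the release notes.
os.environ["FLASHINFER_EXTRA_LDFLAGS"] = "-L/opt/custom/lib -Wl,-rpath,/opt/custom/lib"

import flashinfer  # JIT builds triggered after this point should see the flags
```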
New Contributors
- @jennifgcrl made their first contribution in #1144
- @tiran made their first contribution in #1168
- @Amir-19 made their first contribution in #1152
- @Conless made their first contribution in #1190
Full Changelog: v0.2.6.post1...v0.2.7
v0.2.6.post1
What's Changed
- [CI] Add x86_64 tag for x86 self-hosted runner by @yongwww in #1126
- hotfix: fix installation script behavior by @yzh119 in #1125
Full Changelog: v0.2.6...v0.2.6.post1
v0.2.6
What's Changed
- ci: select 2_28 manylinux builder for new torch+cuda versions by @yzh119 in #1000
- misc: update README.md by @yzh119 in #1003
- bugfix: Fix illegal memory access due to custom mask ptr by @yongchaoding in #1008
- misc: fix kv-layout doc references by @Edenzzzz in #1009
- misc: more benchmark scripts in Python by @yzh119 in #1010
- misc: fix instrument code for mla profiler by @yzh119 in #1014
- bugfix: import wrapper of mla decode by @dhy2000 in #1013
- feat: update decode attention APIs by @yzh119 in #1007
- doc: use latest protobuf for profiler by @xslingcn in #1021
- feat: SM-constraint Communication Kernels by @yyihuang in #994
- feat: ragged tensor padding kernel for blackwell kernel alignment by @yzh119 in #1025
- bugfix: fix custom mask not being reset after converting custom mask into causal or non-causal by @yongchaoding in #1028
- fix: add zero init for KV tiled copy by @happierpig in #1029
- [NVIDIA] Add Cutlass MLA backend by @kaixih in #1031
- Add workflow to build aarch64 wheel by @yongwww in #1036
- Non-blocking host-to-device copy in the ragged prefill wrapper by @nandor in #1040
- fix: remove default ubuntu user in Lunar/Noble by @rickyfeng0119 in #1042
- feat: Softmax-free sampling by @kf-zhang in #1035
- feat: add functional per-head FP8 quantization for FA3 by @happierpig in #1033
- add multi-item scoring by @arde171 in #1015
- [nvidia] cutlass fp8 blockwise/groupwise gemm support by @cyx-6 in #1045
- [nvidia] cutlass fp8 groupwise grouped gemm support by @cyx-6 in #1047
- fix: top_k_mask_logits hangs on -inf inputs by @xslingcn in #1050
- Benchmark: POD vs batched prefill by @Edenzzzz in #1052
- [nvidia] initial support for blackwell kernels by @yzh119 in #1039
- Fix KV chunking for POD. by @AKKamath in #1054
- bugfix: temporarily disable split-kv in blackwell mla by @yzh119 in #1055
- bugfix: remove device allocation by @yzh119 in #1056
- Parameterize prefix mask call (needed by POD-Attention) by @AKKamath in #1059
- bugfix: move `cum_m` calculation inside kernels by @yzh119 in #1060
- misc: add pull request template by @yzh119 in #1062
- bugfix: Cast build paths to str before setuputils Extension by @farnasirim in #1058
- Add PyTorch 2.7.0 build by @huydhn in #1063
- bugfix: adding lse output to blackwell fmha kernels by @yzh119 in #1071
- bugfix: follow user-specified sm_scale for blackwell cutlass fmha by @yzh119 in #1072
- misc: jit: Introduce JitSpec and Generate ninja file by @abcdabcd987 in #1065
- fix: fix a typo in docs by @acelyc111 in #1077
- misc: jit: Deprecate `load_cuda_ops()` by @abcdabcd987 in #1066
- misc: jit: fix missing _get_glibcxx_abi_build_flags by @abcdabcd987 in #1080
- misc: jit: Refactor gen JitSpec out of get_xxx_module by @abcdabcd987 in #1069
- misc: jit: Replace parallel_load_modules() with build_jit_specs() by @abcdabcd987 in #1070
- misc: jit: Import jit_env as a module by @abcdabcd987 in #1073
- misc: aot: Add script to build all AOT ops by @abcdabcd987 in #1067
- misc: aot: Refactor AOT packaging by @abcdabcd987 in #1075
- misc: aot: Remove has_prebuilt_ops by @abcdabcd987 in #1076
- ci: upgrade docker ci image by @yzh119 in #1082
- bugfix: fix custom allreduce compilation in AOT mode by @yzh119 in #1083
- perf: accelerate blackwell grouped gemm by @yzh119 in #1086
- misc: update pull request template by @yzh119 in #1088
- Fix Cutlass grouped GEMM stride by @cyx-6 in #1081
- bugfix: fix fp8 attention kernels aot compilation issue by @yzh119 in #1087
- comm: refactor and initialize `flashinfer.comm` module by @yzh119 in #1089
- misc: cleanup by @b8zhong in #1092
- misc: followup by @b8zhong in #1093
- [nvidia] Add Blackwell FMHA decode kernel from TRT-LLM by @joker-eph in #1051
- bugfix: fix ninja generation rule for non-cuda input by @yzh119 in #1097
- jit: Update TVM JIT binding with the latest FFI refactor by @MasterJH5574 in #1100
- SM100 Groupwise GeMM K-Major Scale Supports by @cyx-6 in #1102
- misc: aot: Add platform tag to wheel by @abcdabcd987 in #1105
- feat: composable logits processor by @xslingcn in #1099
- feat: add trtllm all-reduce (non-MoE) by @yyihuang in #1096
- bugfix: host-precomputed plan function for blackwell fmha by @yzh119 in #1106
- doc: fix LogitsPipe example by @xslingcn in #1110
- bugfix: bugfix for blackwell mla split-k by @yzh119 in #1109
- Add CUTLASS fused moe kernels from TensorRT-LLM. by @wenscarl in #1113
- fix: initialize lamport buffer only once after creating new workspace by @yyihuang in #1111
- hotfix: fix the blackwell fmha stream by @yzh119 in #1116
- fix head_dim not defined if sm_scale is not None by @majian4work in #1119
- doc: add Ask-AI widget by @xslingcn in #1121
- bugfix: Fix test and output shape of fp4 quantize by @wenscarl in #1114
- misc: update slack link by @yzh119 in #1120
- release: bump version to v0.2.6 by @yzh119 in #1122
New Contributors
- @yongchaoding made their first contribution in #1008
- @Edenzzzz made their first contribution in #1009
- @dhy2000 made their first contribution in #1013
- @kaixih made their first contribution in #1031
- @yongwww made their first contribution in #1036
- @rickyfeng0119 made their first contribution in #1042
- @kf-zhang made their first contribution in #1035
- @arde171 made their first contribution in #1015
- @farnasirim made their first contribution in #1058
- @huydhn made their first contribution in #1063
- @acelyc111 made their first contribution in #1077
- @b8zhong made their first contribution in #1092
- @joker-eph made their first contribution in #1051
- @wenscarl made their first contribution in #1113
- @majian4work made their first contribution in #1119
Full Changelog: v0.2.5...v0.2.6
v0.2.5
What's Changed
- Fix compilation with FP16_QK_REDUCTION enabled. by @diptorupd in #962
- misc: Use environment variable to control JIT verbose flag by @yzh119 in #981
- Triton `rms_norm` kernels by @nandor in #983
- Allow passing workspace base directory via environment variable by @jsuchome in #973
- [CHORE] Rename `output_emitted_token_num` -> `output_emitted_draft_token_num` by @jon-chuang in #977
- ci: switch to on-demand instances if spot instance is interrupted by @yzh119 in #987
- misc: update devcontainer by @yzh119 in #986
- ci: add torch 2.6+cu126 wheel by @yzh119 in #985
- misc: fix devcontainer conda path by @yzh119 in #989
- perf: prefetch page indices for mla kernel by @yzh119 in #991
- SM-constraint-GEMM by triton persistent kernel by @yyihuang in #982
- 3rdparty: upgrade cutlass to 3.9 by @yzh119 in #997
- perf: add `-DNDEBUG` compilation flag by @yzh119 in #998
- release: bump version to v0.2.5 by @yzh119 in #999
New Contributors
- @jsuchome made their first contribution in #973
- @jon-chuang made their first contribution in #977
- @yyihuang made their first contribution in #982
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
- typo: fix pdl terminology by @yzh119 in #933
- Fix "specutate" typo by @markmc in #934
- typo: fix target_probs docs after uniform_samples removal by @markmc in #935
- typo: remove another uniform samples leftover by @markmc in #937
- Fix/precommit issues by @diptorupd in #931
- ci: setup Jenkins by @yzh119 in #874
- bugfix: fix include header name conflict by @yzh119 in #939
- fix: Fix MLA TVM binding for the latest changes by @MasterJH5574 in #940
- feat - support mla kvcache store by @baowendin in #888
- Add POD-Attention to FlashInfer by @AKKamath in #858
- bugfix: fix potential issues of FA3 template loading nans for PageAttention by @yzh119 in #945
- fix - fix bug when a non-relevant seq has NaN data by @baowendin in #942
- misc: add ci-badge, update blog list by @yzh119 in #948
- bugfix: Fix missing PyModuleDef field initializers by @sampan26 in #946
- fix: fix pod-attention compilation time by @yzh119 in #954
- bugfix: bugfix to #949 by @yzh119 in #951
- misc: Temporarily disable POD from AOT wheels by @abcdabcd987 in #956
- ci: improve jenkins by @yzh119 in #943
- Fix compilation on cuda 12.2 by @goliaro in #961
- doc: remove misleading docstring about `non_blocking` by @yzh119 in #966
- perf: reduce torch.library dispatch overhead by @yzh119 in #968
- [TVM] Added tvm binding for sampling kernel by @annanyapr in #958
- perf: Fix python API overhead when CUDAGraph is not enabled by @yzh119 in #969
- Fix POD JIT bugs by @AKKamath in #971
- benchmark: add sampling.renorm benchmarks by @xslingcn in #970
- perf: dual pivot top-p/top-k renorm by @xslingcn in #974
- perf: Use 2WG pipeline design for MLA implementation on Hopper by @yzh119 in #952
- release: bump version to v0.2.4 by @yzh119 in #980
New Contributors
- @markmc made their first contribution in #934
- @diptorupd made their first contribution in #931
- @AKKamath made their first contribution in #858
- @sampan26 made their first contribution in #946
- @goliaro made their first contribution in #961
- @annanyapr made their first contribution in #958
Full Changelog: v0.2.3...v0.2.4
v0.2.3
Breaking Changes
We changed the interface for the sampling APIs; more specifically (see #912):
- The `success` return value of all sampling APIs has been removed, which is not compatible with the earlier design.
- Instead of passing a `uniform` tensor, the sampling interface now accepts an optional `torch.Generator` (https://pytorch.org/docs/stable/generated/torch.Generator.html) to align with the behavior of torch.
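To make the migration concrete, here is a minimal sketch of top-p sampling under the new interface. The exact keyword names (`top_p`, `generator`) are assumptions inferred from the description above, not a verbatim copy of the released signature:

```python
import torch
import flashinfer

# Toy probability tensor (batch of 4, vocab of 128), purely for illustration.
probs = torch.softmax(torch.randn(4, 128, device="cuda"), dim=-1)

# Pre-v0.2.3 style (removed): the call took a caller-provided `uniform`
# tensor and returned (samples, success).
# v0.2.3 style: no `success` return value; randomness comes from an
# optional torch.Generator, matching torch's own sampling APIs.
gen = torch.Generator(device="cuda")
gen.manual_seed(42)
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9, generator=gen)
print(samples.shape)  # (4,) — one sampled token id per row
```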
What's Changed
- release: bump version v0.2.2.post1 by @yzh119 in #902
- Naive Support for Hopper FP8 Prefill Kernel with Per-Head Quantization by @happierpig in #869
- bugfix: Fix no return type error by @yzh119 in #904
- ci: add dockerfile for CI by @yzh119 in #909
- ci: bugfix on release-ci-docker github action by @yzh119 in #910
- feat: flashinfer intra-kernel profiler by @yzh119 in #913
- [Package] Add tvm binding to `flashinfer.data` when packaging by @MasterJH5574 in #917
- refactor: move triton dependency to flashinfer.triton by @yzh119 in #918
- sampling: dual pivot rejection sampling algorithm to improve top-p/top-k sampling efficiency by @yzh119 in #912
- feat: support non-contiguous input/output in normalization functions by @yzh119 in #921
- feat: improve sampling algorithm robustness by @yzh119 in #923
- perf: use max probability instead of 1 as upper bound in top-p/k sampling by @yzh119 in #925
- fix: add install step of profiler's dependency by @zobinHuang in #929
- fix: undefined symbol cudaGetDriverEntryPointByVersion with CUDA >= 12.5 by @zobinHuang in #928
- feat: experimental support of PDL by @yzh119 in #930
- release: bump version to v0.2.3 by @yzh119 in #932
New Contributors
- @happierpig made their first contribution in #869
- @zobinHuang made their first contribution in #929
Full Changelog: v0.2.2.post1...v0.2.3
v0.2.2.post1
What's Changed
- bump version to v0.2.2 by @yzh119 in #891
- perf: fix the performance of second stage of split-k by @yzh119 in #894
- fix: pin_memory use cpu as default device by @KnowingNothing in #895
- perf: tweak register amount for producer/consumer in MLA template by @yzh119 in #896
- perf: fix MLA split-k performance bug by @yzh119 in #898
- perf: use f16 as split-k partial output data type by @yzh119 in #900
- perf: tweak the pipeline design of mla kernel by @yzh119 in #901
Full Changelog: v0.2.2...v0.2.2.post1
v0.2.2
What's Changed
- fix cu121 torch2.6 by @zhyncs in #867
- unittest: add MLA test cases where kv_len is evenly divided by page_size. by @foreverlms in #861
- bugfix: fix the behavior of MLA kernel when kv-length is 0 by @yzh119 in #868
- Merge of previous typo-fix PRs into a single one, as requested by @didier-durand in #862
- add lightllm adoption by @zhyncs in #871
- fix geneate_dispatch_inc args from parser by @baowendin in #870
- [API] Fix top_k_top_p_sampling_from_logits param typo by @kasohrab in #875
- misc: Remove unused k_smem_offset_w update in MLA kernel by @muoshuosha in #878
- JIT compilation support for TVM by @MasterJH5574 in #880
- [Hotfix] Add flashinfer.jit.attention into packages by @zhouye in #881
- perf: FlashAttention-3 style MLA PageAttention by @yzh119 in #887
- [JIT] Fix MLA header in TVM binding by @MasterJH5574 in #889
- Fixing several typos in doc file kv_layout.rst by @didier-durand in #884
- unittest: add unittests for MLA + cudagraph by @yzh119 in #890
New Contributors
- @baowendin made their first contribution in #870
- @kasohrab made their first contribution in #875
- @zhouye made their first contribution in #881
Full Changelog: v0.2.1.post2...v0.2.2
v0.2.1.post2
What's Changed
- use 3 latest pytorch version by @youkaichao in #835
- docs: update installation by @zhyncs in #839
- Update README.md: fixing a typo for "hierical" by @didier-durand in #836
- Update page.rst: fixing 1 typo by @didier-durand in #841
- Update README.md: fixing 1 typo by @didier-durand in #842
- adds `TensorRT-LLM` to the list of projects adopting FlashInfer by @yzh119 in #843
- perf: MLA decode kernel implemented by CuTe targeted to SM80 by @tsu-bin in #844
- Update installation.rst: fixing 2 typos by @didier-durand in #840
- fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() by @sfc-gh-yewang in #808
- bugfix: Fix inline RoPE in decode kernels by @MasterJH5574 in #847
- misc: Remove duplicate param set in MLA kernel by @MasterJH5574 in #850
- feat: adding `out` and `lse` parameters to `run` functions to allow user-allocated output buffer by @yzh119 in #854
- Unique the symbol of maybe_q_rope_offset_v. by @foreverlms in #855
- typo: update `decode_maybe_q_rope_offset` by @MasterJH5574 in #856
- update ci by @zhyncs in #857
- fix some compiler pre-check. by @foreverlms in #859
- perf: dynamic split-k for MLA by @yzh119 in #863
- Revert "fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() (… by @zhyncs in #864
- chore: bump v0.2.1.post2 by @zhyncs in #865
- fix compile by @zhyncs in #866
New Contributors
- @didier-durand made their first contribution in #836
- @sfc-gh-yewang made their first contribution in #808
- @foreverlms made their first contribution in #855
Full Changelog: v0.2.1.post1...v0.2.1.post2