Releases · flashinfer-ai/flashinfer
v0.2.7.post1
What's Changed
- [feat] optimize persistent batch attention perf. by @happierpig in #1200
- Feature/cudnn dynamic cubin by @Anerudhan in #1187
- Fix flashinfer.comm module missing by @BBuf in #1203
- chore: bump flashinfer v0.2.7.post1 by @zhyncs in #1205
New Contributors
- @Anerudhan made their first contribution in #1187
- @BBuf made their first contribution in #1203
Full Changelog: v0.2.7...v0.2.7.post1
v0.2.7
What's Changed
- ci: Update images for self-hosted ARM64 runner by @yongwww in #1128
- Fix pointer dtype bug in rope by @Edenzzzz in #1129
- feat: update and test create_ipc_buffer by @yyihuang in #1130
- misc: update runllm widget by @yzh119 in #1132
- misc: correct runllm widget (again) by @MasterJH5574 in #1133
- [Feature] Support PDL for batch Prefill and Decode by @Edenzzzz in #1117
- fix: negative zero by type trait --> binary value by @yyihuang in #1136
- fix: sync after create_workspace by @yyihuang in #1138
- refactor: use functools.cache instead of global dict for caching modules by @yzh119 in #1135
- [feat] add unified batch attention w/ correctness tests. by @happierpig in #1137
- Fix FA2 and FA3 multi-item scoring and cuda illegal memory access error by @arde171 in #1140
- feat: Add support for FLASHINFER_EXTRA_LDFLAGS environment variable by @jennifgcrl in #1144 (see the usage sketch after this list)
- misc: remove sync between persistent runners and use packed_causal_kv_end for SM90Plan by @Edenzzzz in #1146
- [fix] fix precision errors when applying causal mask on Qwen-2.5 series models by @happierpig in #1148
- ci: Install mpi4py by @yongwww in #1149
- feat: add trtllm moe_allreduce_fusion by @yyihuang in #1108
- feat: add trtllm all-reduce fusion by @yyihuang in #1131
- Add more logging to TRTLLM-GEN debug trace (NFC) by @joker-eph in #1158
- feat: update non-fused moe by @yyihuang in #1161
- Add fp4 quantization swizzling tests by @wenscarl in #1157
- refactor: communication module by @yyihuang in #1162
- feat: add finalize_moe_allreduce from trtllm by @yyihuang in #1159
- feat: experimental support of green ctx by @yzh119 in #1163
- feat: Fused temperature online softmax kernel by @xslingcn in #1153
- MNNVL MoE All-to-All Support by @cyx-6 in #1134
- feat: nvshmem python bindings by @yzh119 in #1160
- Fix missing symbols in trtllm_utils.so by @tiran in #1168
- feat: logits processor fusion rule for temperature softmax by @xslingcn in #1170
- Expose fp4 blockscale swizzling kernel by @wenscarl in #1176
- add nvshmem sum_reduce for mnnvl allreduce by @Amir-19 in #1152
- bugfix: softmax NaN results caused by large -inf masks by @xslingcn in #1178
- [CI] Update is_last_build by @yongwww in #1183
- [feat] support block sparse attention w/ variable block sizes and head-wise sparse patterns by @happierpig in #1177
- bugfix: fix invalid blackwell fmha unittests by @yzh119 in #1181
- feat: support green ctx creation by a list of SM counts by @Conless in #1190
- fix: trtllm_comm module aot arch issues by @yyihuang in #1196
- bugfix: fix broken docs build by adding missing dependencies by @Conless in #1197
- chore: bump v0.2.7 by @zhyncs in #1199
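As a usage note for #1144 above, the sketch below shows one way the new environment variable might be used to inject extra linker flags into FlashInfer's JIT builds. The variable name comes from the PR title; the flag values and the set-before-import ordering are illustrative assumptions, not documented behavior:

```python
import os

# Hypothetical extra linker flags for JIT-compiled FlashInfer ops; the
# library path below is a placeholder, not part of the release notes.
os.environ["FLASHINFER_EXTRA_LDFLAGS"] = "-L/opt/custom/lib -Wl,-rpath,/opt/custom/lib"

import flashinfer  # JIT builds triggered after this point should see the flags
```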
New Contributors
- @jennifgcrl made their first contribution in #1144
- @tiran made their first contribution in #1168
- @Amir-19 made their first contribution in #1152
- @Conless made their first contribution in #1190
Full Changelog: v0.2.6.post1...v0.2.7
v0.2.6.post1
What's Changed
- [CI] Add x86_64 tag for x86 self-hosted runner by @yongwww in #1126
- hotfix: fix installation script behavior by @yzh119 in #1125
Full Changelog: v0.2.6...v0.2.6.post1
v0.2.6
What's Changed
- ci: select 2_28 manylinux builder for new torch+cuda versions by @yzh119 in #1000
- misc: update README.md by @yzh119 in #1003
- bugfix: Fix illegal memory access due to custom mask ptr by @yongchaoding in #1008
- misc: fix kv-layout doc references by @Edenzzzz in #1009
- misc: more benchmark scripts in Python by @yzh119 in #1010
- misc: fix instrument code for mla profiler by @yzh119 in #1014
- bugfix: import wrapper of mla decode by @dhy2000 in #1013
- feat: update decode attention APIs by @yzh119 in #1007
- doc: use latest protobuf for profiler by @xslingcn in #1021
- feat: SM-constraint Communication Kernels by @yyihuang in #994
- feat: ragged tensor padding kernel for blackwell kernel alignment by @yzh119 in #1025
- bugfix: fix custom mask not being reset after converting custom mask into causal or non-causal by @yongchaoding in #1028
- fix: add zero init for KV tiled copy by @happierpig in #1029
- [NVIDIA] Add Cutlass MLA backend by @kaixih in #1031
- Add workflow to build aarch64 wheel by @yongwww in #1036
- Non-blocking host-to-device copy in the ragged prefill wrapper by @nandor in #1040
- fix: remove default ubuntu user in Lunar/Noble by @rickyfeng0119 in #1042
- feat: Softmax-free sampling by @kf-zhang in #1035
- feat: add functional per-head FP8 quantization for FA3 by @happierpig in #1033
- add multi-item scoring by @arde171 in #1015
- [nvidia] cutlass fp8 blockwise/groupwise gemm support by @cyx-6 in #1045
- [nvidia] cutlass fp8 groupwise grouped gemm support by @cyx-6 in #1047
- fix: top_k_mask_logits hangs on -inf inputs by @xslingcn in #1050
- Benchmark: POD vs batched prefill by @Edenzzzz in #1052
- [nvidia] initial support for blackwell kernels by @yzh119 in #1039
- Fix KV chunking for POD. by @AKKamath in #1054
- bugfix: temporarily disable split-kv in blackwell mla by @yzh119 in #1055
- bugfix: remove device allocation by @yzh119 in #1056
- Parameterize prefix mask call (needed by POD-Attention) by @AKKamath in #1059
- bugfix: move `cum_m` calculation inside kernels by @yzh119 in #1060
- misc: add pull request template by @yzh119 in #1062
- bugfix: Cast build paths to str before setuputils Extension by @farnasirim in #1058
- Add PyTorch 2.7.0 build by @huydhn in #1063
- bugfix: adding lse output to blackwell fmha kernels by @yzh119 in #1071
- bugfix: follow user-specified sm_scale for blackwell cutlass fmha by @yzh119 in #1072
- misc: jit: Introduce JitSpec and Generate ninja file by @abcdabcd987 in #1065
- fix: fix a typo in docs by @acelyc111 in #1077
- misc: jit: Deprecate `load_cuda_ops()` by @abcdabcd987 in #1066
- misc: jit: fix missing _get_glibcxx_abi_build_flags by @abcdabcd987 in #1080
- misc: jit: Refactor gen JitSpec out of get_xxx_module by @abcdabcd987 in #1069
- misc: jit: Replace parallel_load_modules() with build_jit_specs() by @abcdabcd987 in #1070
- misc: jit: Import jit_env as a module by @abcdabcd987 in #1073
- misc: aot: Add script to build all AOT ops by @abcdabcd987 in #1067
- misc: aot: Refactor AOT packaging by @abcdabcd987 in #1075
- misc: aot: Remove has_prebuilt_ops by @abcdabcd987 in #1076
- ci: upgrade docker ci image by @yzh119 in #1082
- bugfix: fix custom allreduce compilation in AOT mode by @yzh119 in #1083
- perf: accelerate blackwell grouped gemm by @yzh119 in #1086
- misc: update pull request template by @yzh119 in #1088
- Fix Cutlass grouped GEMM stride by @cyx-6 in #1081
- bugfix: fix fp8 attention kernels aot compilation issue by @yzh119 in #1087
- comm: refactor and initialize `flashinfer.comm` module by @yzh119 in #1089
- misc: cleanup by @b8zhong in #1092
- misc: followup by @b8zhong in #1093
- [nvidia] Add Blackwell FMHA decode kernel from TRT-LLM by @joker-eph in #1051
- bugfix: fix ninja generation rule for non-cuda input by @yzh119 in #1097
- jit: Update TVM JIT binding with the latest FFI refactor by @MasterJH5574 in #1100
- SM100 Groupwise GeMM K-Major Scale Supports by @cyx-6 in #1102
- misc: aot: Add platform tag to wheel by @abcdabcd987 in #1105
- feat: composable logits processor by @xslingcn in #1099
- feat: add trtllm all-reduce (non-MoE) by @yyihuang in #1096
- bugfix: host-precomputed plan function for blackwell fmha by @yzh119 in #1106
- doc: fix LogitsPipe example by @xslingcn in #1110
- bugfix: bugfix for blackwell mla split-k by @yzh119 in #1109
- Add CUTLASS fused moe kernels from TensorRT-LLM. by @wenscarl in #1113
- fix: initialize lamport buffer only once after creating new workspace by @yyihuang in #1111
- hotfix: fix the blackwell fmha stream by @yzh119 in #1116
- fix head_dim not defined if sm_scale is not None by @majian4work in #1119
- doc: add Ask-AI widget by @xslingcn in #1121
- bugfix: Fix test and output shape of fp4 quantize by @wenscarl in #1114
- misc: update slack link by @yzh119 in #1120
- release: bump version to v0.2.6 by @yzh119 in #1122
New Contributors
- @yongchaoding made their first contribution in #1008
- @Edenzzzz made their first contribution in #1009
- @dhy2000 made their first contribution in #1013
- @kaixih made their first contribution in #1031
- @yongwww made their first contribution in #1036
- @rickyfeng0119 made their first contribution in #1042
- @kf-zhang made their first contribution in #1035
- @arde171 made their first contribution in #1015
- @farnasirim made their first contribution in #1058
- @huydhn made their first contribution in #1063
- @acelyc111 made their first contribution in #1077
- @b8zhong made their first contribution in #1092
- @joker-eph made their first contribution in #1051
- @wenscarl made their first contribution in #1113
- @majian4work made their first contribution in #1119
Full Changelog: v0.2.5...v0.2.6
v0.2.5
What's Changed
- Fix compilation with FP16_QK_REDUCTION enabled. by @diptorupd in #962
- misc: Use environment variable to control JIT verbose flag by @yzh119 in #981
- Triton `rms_norm` kernels by @nandor in #983
- Allow passing workspace base directory via environment variable by @jsuchome in #973
- [CHORE] Rename `output_emitted_token_num` -> `output_emitted_draft_token_num` by @jon-chuang in #977
- ci: switch to on-demand instances if spot instance is interrupted by @yzh119 in #987
- misc: update devcontainer by @yzh119 in #986
- ci: add torch 2.6+cu126 wheel by @yzh119 in #985
- misc: fix devcontainer conda path by @yzh119 in #989
- perf: prefetch page indices for mla kernel by @yzh119 in #991
- SM-constraint-GEMM by triton persistent kernel by @yyihuang in #982
- 3rdparty: upgrade cutlass to 3.9 by @yzh119 in #997
- perf: add `-DNDEBUG` compilation flag by @yzh119 in #998
- release: bump version to v0.2.5 by @yzh119 in #999
New Contributors
- @jsuchome made their first contribution in #973
- @jon-chuang made their first contribution in #977
- @yyihuang made their first contribution in #982
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
- typo: fix pdl terminology by @yzh119 in #933
- Fix "specutate" typo by @markmc in #934
- typo: fix target_probs docs after uniform_samples removal by @markmc in #935
- typo: remove another uniform samples leftover by @markmc in #937
- Fix/precommit issues by @diptorupd in #931
- ci: setup Jenkins by @yzh119 in #874
- bugfix: fix include header name conflict by @yzh119 in #939
- fix: Fix MLA TVM binding for the latest changes by @MasterJH5574 in #940
- feat - support mla kvcache store by @baowendin in #888
- Add POD-Attention to FlashInfer by @AKKamath in #858
- bugfix: fix potential issues of FA3 template loading nans for PageAttention by @yzh119 in #945
- fix - fix bug when a non-relevant seq has NaN data by @baowendin in #942
- misc: add ci-badge, update blog list by @yzh119 in #948
- bugfix: Fix missing PyModuleDef field initializers by @sampan26 in #946
- fix: fix pod-attention compilation time by @yzh119 in #954
- bugfix: bugfix to #949 by @yzh119 in #951
- misc: Temporarily disable POD from AOT wheels by @abcdabcd987 in #956
- ci: improve jenkins by @yzh119 in #943
- Fix compilation on cuda 12.2 by @goliaro in #961
- doc: remove misleading docstring about `non_blocking` by @yzh119 in #966
- perf: reduce torch.library dispatch overhead by @yzh119 in #968
- [TVM] Added tvm binding for sampling kernel by @annanyapr in #958
- perf: Fix python API overhead when CUDAGraph is not enabled by @yzh119 in #969
- Fix POD JIT bugs by @AKKamath in #971
- benchmark: add sampling.renorm benchmarks by @xslingcn in #970
- perf: dual pivot top-p/top-k renorm by @xslingcn in #974
- perf: Use 2WG pipeline design for MLA implementation on Hopper by @yzh119 in #952
- release: bump version to v0.2.4 by @yzh119 in #980
New Contributors
- @markmc made their first contribution in #934
- @diptorupd made their first contribution in #931
- @AKKamath made their first contribution in #858
- @sampan26 made their first contribution in #946
- @goliaro made their first contribution in #961
- @annanyapr made their first contribution in #958
Full Changelog: v0.2.3...v0.2.4
v0.2.3
Breaking Changes
We changed the interface for the sampling APIs; more specifically (see #912):
- The `success` return value of all sampling APIs has been removed, which is not compatible with the earlier design.
- Instead of passing a `uniform` tensor, the sampling interface now accepts an optional `torch.Generator` (https://pytorch.org/docs/stable/generated/torch.Generator.html) to align with the behavior of torch.
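To make the migration concrete, here is a minimal sketch of top-p sampling under the new interface. The exact keyword names (`top_p`, `generator`) are assumptions inferred from the description above, not a verbatim copy of the released signature:

```python
import torch
import flashinfer

# Toy probability tensor (batch of 4, vocab of 128), purely for illustration.
probs = torch.softmax(torch.randn(4, 128, device="cuda"), dim=-1)

# Pre-v0.2.3 style (removed): the call took a caller-provided `uniform`
# tensor and returned (samples, success).
# v0.2.3 style: no `success` return value; randomness comes from an
# optional torch.Generator, matching torch's own sampling APIs.
gen = torch.Generator(device="cuda")
gen.manual_seed(42)
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9, generator=gen)
print(samples.shape)  # (4,) — one sampled token id per row
```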
What's Changed
- release: bump version v0.2.2.post1 by @yzh119 in #902
- Naive Support for Hopper FP8 Prefill Kernel with Per-Head Quantization by @happierpig in #869
- bugfix: Fix no return type error by @yzh119 in #904
- ci: add dockerfile for CI by @yzh119 in #909
- ci: bugfix on release-ci-docker github action by @yzh119 in #910
- feat: flashinfer intra-kernel profiler by @yzh119 in #913
- [Package] Add tvm binding to `flashinfer.data` when packaging by @MasterJH5574 in #917
- refactor: move triton dependency to flashinfer.triton by @yzh119 in #918
- sampling: dual pivot rejection sampling algorithm to improve top-p/top-k sampling efficiency by @yzh119 in #912
- feat: support non-contiguous input/output in normalization functions by @yzh119 in #921
- feat: improve sampling algorithm robustness by @yzh119 in #923
- perf: use max probability instead of 1 as upper bound in top-p/k sampling by @yzh119 in #925
- fix: add install step of profiler's dependency by @zobinHuang in #929
- fix: undefined symbol cudaGetDriverEntryPointByVersion with CUDA >= 12.5 by @zobinHuang in #928
- feat: experimental support of PDL by @yzh119 in #930
- release: bump version to v0.2.3 by @yzh119 in #932
New Contributors
- @happierpig made their first contribution in #869
- @zobinHuang made their first contribution in #929
Full Changelog: v0.2.2.post1...v0.2.3
v0.2.2.post1
What's Changed
- bump version to v0.2.2 by @yzh119 in #891
- perf: fix the performance of second stage of split-k by @yzh119 in #894
- fix: pin_memory use cpu as default device by @KnowingNothing in #895
- perf: tweak register amount for producer/consumer in MLA template by @yzh119 in #896
- perf: fix MLA split-k performance bug by @yzh119 in #898
- perf: use f16 as split-k partial output data type by @yzh119 in #900
- perf: tweak the pipeline design of mla kernel by @yzh119 in #901
Full Changelog: v0.2.2...v0.2.2.post1
v0.2.2
What's Changed
- fix cu121 torch2.6 by @zhyncs in #867
- unittest: add MLA test cases where kv_len is evenly divided by page_size. by @foreverlms in #861
- bugfix: fix the behavior of MLA kernel when kv-length is 0 by @yzh119 in #868
- Merge of previous typo-fix PRs into a single one, as requested by @didier-durand in #862
- add lightllm adoption by @zhyncs in #871
- fix geneate_dispatch_inc args from parser by @baowendin in #870
- [API] Fix top_k_top_p_sampling_from_logits param typo by @kasohrab in #875
- misc: Remove unused k_smem_offset_w update in MLA kernel by @muoshuosha in #878
- JIT compilation support for TVM by @MasterJH5574 in #880
- [Hotfix] Add flashinfer.jit.attention into packages by @zhouye in #881
- perf: FlashAttention-3 style MLA PageAttention by @yzh119 in #887
- [JIT] Fix MLA header in TVM binding by @MasterJH5574 in #889
- Fixing several typos in doc file kv_layout.rst by @didier-durand in #884
- unittest: add unittests for MLA + cudagraph by @yzh119 in #890
New Contributors
- @baowendin made their first contribution in #870
- @kasohrab made their first contribution in #875
- @zhouye made their first contribution in #881
Full Changelog: v0.2.1.post2...v0.2.2
v0.2.1.post2
What's Changed
- use 3 latest pytorch version by @youkaichao in #835
- docs: update installation by @zhyncs in #839
- Update README.md: fixing a typo for "hierical" by @didier-durand in #836
- Update page.rst: fixing 1 typo by @didier-durand in #841
- Update README.md: fixing 1 typo by @didier-durand in #842
- adds `TensorRT-LLM` to the list of projects adopting FlashInfer by @yzh119 in #843
- perf: MLA decode kernel implemented by CuTe targeted to SM80 by @tsu-bin in #844
- Update installation.rst: fixing 2 typos by @didier-durand in #840
- fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() by @sfc-gh-yewang in #808
- bugfix: Fix inline RoPE in decode kernels by @MasterJH5574 in #847
- misc: Remove duplicate param set in MLA kernel by @MasterJH5574 in #850
- feat: adding `out` and `lse` parameters to `run` functions to allow user-allocated output buffer by @yzh119 in #854
- Unique the symbol of maybe_q_rope_offset_v. by @foreverlms in #855
- typo: update `decode_maybe_q_rope_offset` by @MasterJH5574 in #856
- update ci by @zhyncs in #857
- fix some compiler pre-check. by @foreverlms in #859
- perf: dynamic split-k for MLA by @yzh119 in #863
- Revert "fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() (… by @zhyncs in #864
- chore: bump v0.2.1.post2 by @zhyncs in #865
- fix compile by @zhyncs in #866
New Contributors
- @didier-durand made their first contribution in #836
- @sfc-gh-yewang made their first contribution in #808
- @foreverlms made their first contribution in #855
Full Changelog: v0.2.1.post1...v0.2.1.post2