Triton Backend
- support nextn II (multiple MTP heads) (WIP @pankajroark )
FlashInfer Backend
- compatible with MLA disabled
- support FlashInfer nightly MLA ragged prefill and CUDA Core MLA decoding
- support FlashInfer v0.2.0.post3 MLA ragged, paged prefill and decoding (@zhyncs @yzh119 )
- share the nextn parts with the Triton Backend
EAGLE 2
- implement the sampling kernel in sgl-kernel (drop cutex): kernel part, Python part
- a set of fixes: non-greedy fix, disabled CUDA graph fix 1, fix 2, cleanup 1, cleanup 2, CUDA graph capture failure fix, fix 2, reduce one draft forward pass
- compatible with radix cache and chunked prefill (WIP @Ying1123 )