sgl-kernel use cutlass latest version for fp8 blockwise gemm #5207
Merged
Conversation
Benchmark result (test time interval; smaller is better). In each table below the columns are: row index, batch_size, then timings for vllm, sgl-kernel, sglang triton, and deepgemm.
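For context on what these kernels compute, here is a minimal PyTorch reference sketch of an fp8 blockwise scaled matmul, assuming the DeepSeek-V3-style granularity of per-1x128-group activation scales and per-128x128-block weight scales. This is only illustrative and is not the benchmarked kernel or script:

```python
import torch

def ref_blockwise_fp8_matmul(a_fp8, a_scale, b_fp8, b_scale, block=128):
    """Dequantize-then-matmul reference (illustrative only; assumes K and N divisible by `block`).

    a_fp8:   [M, K]                   torch.float8_e4m3fn activations
    a_scale: [M, K // block]          float32, one scale per 1 x block group of A
    b_fp8:   [N, K]                   torch.float8_e4m3fn weights
    b_scale: [N // block, K // block] float32, one scale per block x block tile of B
    """
    # Expand the per-group / per-tile scales back to element granularity and dequantize.
    a = a_fp8.to(torch.float32) * a_scale.repeat_interleave(block, dim=1)
    b = b_fp8.to(torch.float32) * b_scale.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return (a @ b.t()).to(torch.bfloat16)  # [M, N]
```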
Skip N=576, K=7168 now
deepseek-ai/DeepSeek-V3 N=24576 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 106.895998 110.079996 101.631999 82.208000
1 8.0 100.896001 104.255997 90.208001 78.432001
2 16.0 102.720000 105.984002 80.704004 75.999998
3 32.0 101.071998 104.223996 80.544002 75.392000
4 64.0 101.888001 105.184004 80.799997 75.616002
5 128.0 97.184002 101.631999 116.832003 78.847997
6 256.0 123.903997 127.168000 165.695995 94.655998
7 512.0 236.256003 229.855999 315.295994 162.368000
8 1024.0 463.551998 438.015997 617.231965 284.447998
9 2048.0 918.528020 860.607982 1222.831964 563.935995
10 4096.0 1787.647963 1570.639968 2254.159927 1182.047963
deepseek-ai/DeepSeek-V3 N=32768 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 19.168001 22.624001 22.720000 19.200001
1 8.0 19.040000 22.816001 26.240001 17.792000
2 16.0 19.168001 22.879999 25.248000 18.400000
3 32.0 19.328000 22.976000 25.280001 19.040000
4 64.0 19.455999 23.040000 24.208000 18.816000
5 128.0 19.648001 23.424000 26.176000 22.336001
6 256.0 25.664000 28.640000 33.183999 24.192000
7 512.0 41.343998 42.847998 57.376001 40.784001
8 1024.0 74.560001 71.776003 96.896000 62.495999
9 2048.0 143.808007 128.191993 181.375995 111.840002
10 4096.0 276.320010 236.831993 340.384007 196.047992
deepseek-ai/DeepSeek-V3 N=7168 K=16384:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 91.903999 93.759999 130.528003 63.712001
1 8.0 91.104001 93.216002 78.127995 61.055999
2 16.0 90.304002 93.024001 78.143999 59.487998
3 32.0 91.232002 93.152002 78.304000 56.864001
4 64.0 90.687998 93.344003 78.207999 56.543998
5 128.0 90.464003 93.120001 96.320003 58.079999
6 256.0 93.631998 97.360000 128.703997 69.440000
7 512.0 178.432003 180.447996 243.456006 134.143993
8 1024.0 347.808003 343.775988 559.552014 200.479999
9 2048.0 602.656007 587.199986 830.208004 379.247993
10 4096.0 1197.296023 1163.712025 1650.943995 767.567992
deepseek-ai/DeepSeek-V3 N=7168 K=18432:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 103.615999 103.583999 135.199994 68.928003
1 8.0 101.088002 102.399997 85.823998 67.776002
2 16.0 101.503998 102.431998 90.976000 66.496000
3 32.0 100.607999 103.008002 89.823999 62.688001
4 64.0 101.503998 103.136003 87.392002 60.543999
5 128.0 100.768000 102.944002 108.800001 62.112000
6 256.0 104.192004 107.744001 138.687998 77.760004
7 512.0 199.744001 201.376006 262.912005 147.136003
8 1024.0 389.759988 383.087993 615.743995 211.935997
9 2048.0 675.935984 660.303950 898.368001 449.136019
10 4096.0 1343.136072 1304.656029 1790.912032 840.416014
deepseek-ai/DeepSeek-V3 N=36864 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 153.824002 155.072004 167.728007 102.144003
1 8.0 145.855993 149.887994 144.319996 102.623999
2 16.0 150.399998 153.760001 133.215994 103.295997
3 32.0 148.448005 151.488006 132.640004 103.615999
4 64.0 150.879994 153.983995 134.207994 104.928002
5 128.0 140.928000 143.792003 171.664000 109.407999
6 256.0 199.423999 198.208004 266.335994 154.367998
7 512.0 350.816011 334.223986 467.727989 224.544004
8 1024.0 690.559983 648.064017 868.592024 435.663998
9 2048.0 1335.999966 1173.792005 1687.423944 876.703978
10 4096.0 2653.664112 2329.312086 3347.504139 1774.703979
deepseek-ai/DeepSeek-V3 N=24576 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 106.335998 108.447999 99.519998 82.144000
1 8.0 100.832000 104.000002 90.176001 78.464001
2 16.0 103.040002 106.495999 80.640003 75.808004
3 32.0 101.439998 104.543999 79.967998 75.199999
4 64.0 102.048002 105.120003 81.568003 76.095998
5 128.0 96.639998 99.951997 117.536001 79.328001
6 256.0 124.736004 126.208007 165.631995 95.104001
7 512.0 236.448005 231.296003 315.279990 163.455993
8 1024.0 463.936001 440.607995 617.504001 299.488008
9 2048.0 918.175995 860.127985 1229.552031 592.320025
10 4096.0 1787.296057 1566.239953 2395.391941 1190.368056
deepseek-ai/DeepSeek-V3 N=32768 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 19.328000 22.752000 22.911999 18.528000
1 8.0 19.072000 22.720000 22.464000 17.247999
2 16.0 19.231999 22.911999 22.480000 17.568000
3 32.0 19.360000 23.008000 23.232000 21.536000
4 64.0 19.552000 23.135999 22.624001 18.816000
5 128.0 19.648001 23.264000 23.424000 21.856001
6 256.0 25.696000 28.832000 32.928001 23.871999
7 512.0 41.503999 42.911999 56.095999 40.064000
8 1024.0 74.592002 72.927997 97.759999 60.608000
9 2048.0 136.639997 136.255994 189.408004 105.375998
10 4096.0 257.728010 254.976004 346.303999 196.224004
deepseek-ai/DeepSeek-V3 N=24576 K=1536:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 34.304000 37.023999 38.816001 25.760001
1 8.0 32.639999 35.936002 27.519999 25.312001
2 16.0 32.960001 36.063999 27.360000 26.016001
3 32.0 33.504002 36.736000 27.680000 25.536001
4 64.0 33.376001 36.672000 27.904000 26.944000
5 128.0 31.904001 35.071999 37.535999 30.592000
6 256.0 37.439998 40.128000 48.640002 34.400001
7 512.0 63.840002 65.343998 86.144000 51.263999
8 1024.0 118.336000 116.319999 161.791995 91.104001
9 2048.0 232.271999 207.231998 309.184015 155.328006
10 4096.0 442.912012 386.224002 562.687993 293.119997
deepseek-ai/DeepSeek-V3 N=4096 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 46.271998 48.767999 53.536002 28.960001
1 8.0 45.600001 48.735999 48.928000 28.255999
2 16.0 45.407999 48.735999 48.864000 29.184001
3 32.0 45.632001 48.864000 48.831999 24.800001
4 64.0 45.472000 48.896000 49.215999 25.504000
5 128.0 45.536000 49.024001 49.376000 26.192000
6 256.0 45.823999 49.279999 49.952000 29.088000
7 512.0 45.759998 50.175998 59.680000 36.543999
8 1024.0 84.767997 87.679997 110.335998 59.136000
9 2048.0 160.383999 161.040008 212.255999 109.568000
10 4096.0 312.000006 313.199997 414.687991 208.767995
deepseek-ai/DeepSeek-V3 N=7168 K=18432:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 102.240004 103.632003 135.263994 68.832003
1 8.0 100.575998 103.200004 87.488003 67.968003
2 16.0 100.736000 103.519998 89.599997 66.271998
3 32.0 100.992002 102.976002 89.535996 62.624000
4 64.0 100.864001 103.359997 87.807998 60.543999
5 128.0 100.096002 103.072003 109.375998 62.112000
6 256.0 104.415998 108.704001 138.912007 77.855997
7 512.0 199.328005 201.184005 262.751997 146.783993
8 1024.0 389.759988 382.912010 615.631998 211.968005
9 2048.0 676.352024 658.976018 902.112007 434.768021
10 4096.0 1342.720032 1306.879997 1687.744021 854.272008
deepseek-ai/DeepSeek-V3 N=7168 K=16384:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 93.248002 93.567997 130.080000 63.712001
1 8.0 89.919999 92.639998 78.175999 61.055999
2 16.0 90.304002 92.960000 78.015998 59.519999
3 32.0 90.559997 93.312003 78.383997 56.639999
4 64.0 90.847999 93.695998 78.400001 56.543998
5 128.0 90.655997 93.184002 96.543998 57.856001
6 256.0 90.655997 93.952000 129.023999 69.728002
7 512.0 178.207994 181.439996 243.456006 135.135993
8 1024.0 349.151999 342.335999 561.183989 193.151996
9 2048.0 602.912009 587.360024 834.111989 392.832011
10 4096.0 1197.504044 1160.464048 1651.551962 764.720023
deepseek-ai/DeepSeek-V3 N=7168 K=2048:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 21.312000 24.896000 25.087999 26.303999
1 8.0 20.832000 24.383999 23.615999 22.112001
2 16.0 20.959999 24.544001 23.615999 21.504000
3 32.0 20.927999 24.480000 23.088001 20.976000
4 64.0 20.992000 24.416000 23.232000 21.280000
5 128.0 20.959999 24.288001 24.335999 26.128000
6 256.0 20.864001 24.416000 24.639999 27.807999
7 512.0 32.224000 35.392001 39.136000 28.255999
8 1024.0 55.392001 56.063998 83.456002 38.160000
9 2048.0 92.207998 90.240002 123.103999 66.239998
10 4096.0 167.935997 167.120010 233.567998 114.720002
Skip N=576, K=7168 now
deepseek-ai/DeepSeek-V3 N=24576 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 106.416002 108.191997 99.327996 82.176000
1 8.0 100.256003 103.712000 90.080000 78.464001
2 16.0 102.816001 106.239997 80.608003 75.839996
3 32.0 101.024002 104.223996 79.839997 75.199999
4 64.0 101.503998 104.832001 81.472002 75.903997
5 128.0 98.399997 101.792000 118.047997 79.264000
6 256.0 124.672003 126.047999 166.063994 95.168002
7 512.0 237.120003 231.391996 315.775990 164.223999
8 1024.0 463.551998 438.847989 618.848026 299.199998
9 2048.0 918.304026 861.760020 1224.671960 560.383976
10 4096.0 1782.431960 1668.192029 2259.360075 1151.328087
deepseek-ai/DeepSeek-V3 N=32768 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 19.168001 22.720000 22.208000 18.848000
1 8.0 19.104000 22.752000 22.560000 18.015999
2 16.0 19.392001 22.944000 22.112001 18.560000
3 32.0 19.360000 23.040000 22.080000 17.664000
4 64.0 19.520000 23.072001 22.240000 18.048000
5 128.0 19.711999 23.296000 23.488000 21.568000
6 256.0 25.920000 28.928000 32.992002 23.552001
7 512.0 41.407999 42.879999 56.127999 40.064000
8 1024.0 74.592002 72.864003 97.471997 60.447998
9 2048.0 137.024000 135.168001 181.247994 104.287997
10 4096.0 256.159991 254.592001 345.775992 196.255997
deepseek-ai/DeepSeek-V3 N=7168 K=16384:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 92.639998 94.240002 130.303994 63.616000
1 8.0 90.368003 93.056001 78.304000 61.087999
2 16.0 90.240002 93.120001 78.143999 59.487998
3 32.0 90.496004 93.184002 78.143999 56.639999
4 64.0 91.040000 93.440004 78.304000 56.543998
5 128.0 90.016000 93.216002 96.383996 57.856001
6 256.0 90.304002 97.439997 129.695997 69.760002
7 512.0 179.168001 180.255994 243.423998 135.232002
8 1024.0 347.743988 342.128009 560.320020 195.712000
9 2048.0 602.656007 587.360024 829.343975 398.784012
10 4096.0 1196.768045 1156.991959 1545.439959 765.743971
deepseek-ai/DeepSeek-V3 N=7168 K=18432:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 104.800001 103.840001 135.008007 68.896003
1 8.0 100.383997 103.104003 86.144000 67.840002
2 16.0 100.447997 103.136003 89.328006 66.496000
3 32.0 100.511998 102.976002 89.535996 62.560000
4 64.0 100.607999 103.104003 87.839998 60.575999
5 128.0 100.128002 103.455998 109.088004 62.080000
6 256.0 100.064002 103.519998 133.471996 77.760004
7 512.0 199.967995 202.448010 262.560010 144.639999
8 1024.0 389.535993 383.616000 615.423977 221.760005
9 2048.0 676.064014 659.888029 898.624003 443.360001
10 4096.0 1342.911959 1306.976080 1787.968040 826.367974
deepseek-ai/DeepSeek-V3 N=18432 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 102.944002 101.856001 114.288002 60.368001
1 8.0 100.639999 103.391998 73.087998 60.288001
2 16.0 101.471998 104.560003 72.768003 60.479999
3 32.0 100.192003 99.519998 72.864003 60.463998
4 64.0 101.487994 104.399994 73.536001 60.896002
5 128.0 95.760003 98.976001 105.839998 63.167997
6 256.0 123.328000 124.672003 162.783995 80.959998
7 512.0 198.431998 194.656000 262.879997 152.224004
8 1024.0 349.631995 334.271997 466.432005 292.656004
9 2048.0 690.880001 650.815964 864.768028 426.560014
10 4096.0 1337.327957 1253.455997 1683.552027 882.016003
deepseek-ai/DeepSeek-V3 N=12288 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 56.256000 60.128000 91.328003 48.767999
1 8.0 54.976001 59.360001 61.280001 45.536000
2 16.0 60.511999 64.336002 61.055999 45.456000
3 32.0 50.624002 54.559998 61.120000 45.664001
4 64.0 57.472002 61.728001 61.439998 45.568001
5 128.0 50.175998 54.016002 65.920003 47.040001
6 256.0 84.256001 88.895999 112.255998 62.399998
7 512.0 122.879997 124.416001 162.655994 85.968003
8 1024.0 235.808000 229.471996 313.840002 152.544007
9 2048.0 463.584006 439.007998 618.319988 288.672000
10 4096.0 919.264019 862.335980 1226.896048 564.736009
deepseek-ai/DeepSeek-V3 N=16384 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 14.464000 17.728001 19.872000 21.472000
1 8.0 14.304000 17.888000 19.392001 21.695999
2 16.0 14.304000 17.920000 18.719999 21.888001
3 32.0 14.368000 17.728001 20.320000 17.983999
4 64.0 14.112000 17.664000 17.344000 22.112001
5 128.0 14.656000 17.888000 17.728001 23.167999
6 256.0 17.408000 20.288000 20.800000 23.776000
7 512.0 24.960000 27.424000 31.104000 26.720000
8 1024.0 41.216001 42.495999 54.079998 40.415999
9 2048.0 73.664002 71.872003 97.407997 62.816001
10 4096.0 136.608005 126.816005 190.528005 106.144004
deepseek-ai/DeepSeek-V3 N=12288 K=1536:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 21.600001 25.087999 28.128000 21.504000
1 8.0 20.032000 24.143999 20.671999 24.639999
2 16.0 20.736000 24.383999 21.280000 23.391999
3 32.0 19.520000 23.328001 20.864001 21.215999
4 64.0 19.487999 23.167999 21.280000 20.768000
5 128.0 19.487999 23.040000 22.208000 23.712000
6 256.0 27.968001 30.719999 32.960001 23.647999
7 512.0 36.768001 38.816001 46.016000 31.744000
8 1024.0 63.040003 62.912002 81.632003 49.568001
9 2048.0 122.288004 115.840003 159.904003 86.080000
10 4096.0 230.304003 205.504000 308.319986 156.448007
deepseek-ai/DeepSeek-V3 N=2048 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 27.904000 40.511999 47.839999 27.488001
1 8.0 27.519999 40.640000 45.568001 27.616000
2 16.0 27.648000 40.704001 45.664001 29.376000
3 32.0 27.712001 40.080000 45.504000 26.880000
4 64.0 27.775999 40.863998 45.791999 25.504000
5 128.0 27.584000 40.479999 46.144001 27.744001
6 256.0 31.776000 37.312001 46.592001 32.800000
7 512.0 39.840002 42.656001 46.879999 40.736001
8 1024.0 55.071998 52.416001 58.527999 34.944002
9 2048.0 96.992001 92.448004 110.816002 59.103999
10 4096.0 175.712004 170.432001 212.927997 110.048003
deepseek-ai/DeepSeek-V3 N=7168 K=9216:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 57.119999 60.479999 86.336002 39.360002
1 8.0 56.607999 59.360001 64.912006 39.296001
2 16.0 55.615999 59.103999 64.928003 38.272001
3 32.0 56.127999 59.424002 65.056004 37.567999
4 64.0 56.031998 59.328001 65.407999 37.376001
5 128.0 56.000002 59.551999 65.471999 38.911998
6 256.0 57.296000 61.376002 75.808004 45.120001
7 512.0 105.375998 108.672000 137.247995 77.983998
8 1024.0 201.407999 199.456006 265.311986 118.079998
9 2048.0 346.623987 337.632000 460.000008 229.951993
10 4096.0 686.671972 660.272002 911.311984 441.471994
deepseek-ai/DeepSeek-V3 N=7168 K=8192:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 52.800000 54.719999 77.791996 36.832001
1 8.0 50.880000 54.079998 59.296001 36.031999
2 16.0 51.456001 54.143999 58.720000 35.360001
3 32.0 50.655998 54.079998 58.880001 34.432001
4 64.0 51.104002 54.528002 59.487998 34.688000
5 128.0 51.008001 54.591998 59.904002 35.680000
6 256.0 51.167998 54.848000 67.872003 41.375998
7 512.0 94.783999 98.687999 128.160000 71.488000
8 1024.0 180.831999 179.183990 246.879995 109.728001
9 2048.0 310.112000 300.464004 427.807987 207.359999
10 4096.0 613.344014 584.223986 846.463978 398.048013
deepseek-ai/DeepSeek-V3 N=7168 K=1024:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 16.384000 19.520000 24.672000 23.135999
1 8.0 16.480001 19.743999 24.512000 25.936000
2 16.0 15.776001 19.007999 25.056001 23.040000
3 32.0 15.872000 19.104000 24.192000 23.584001
4 64.0 15.872000 19.104000 25.664000 24.000000
5 128.0 15.776001 19.007999 24.496000 25.024001
6 256.0 15.584000 18.880000 25.280001 26.176000
7 512.0 21.824000 24.607999 28.352000 31.008000
8 1024.0 34.655999 36.575999 51.711999 27.712001
9 2048.0 55.167999 54.143999 70.175998 44.799998
10 4096.0 100.319996 95.264003 135.775998 74.207999
Skip N=576, K=7168 now
deepseek-ai/DeepSeek-V3 N=24576 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 106.207997 108.351998 99.391997 82.176000
1 8.0 100.383997 103.583999 90.080000 78.272000
2 16.0 102.752000 106.271997 80.831997 75.967997
3 32.0 101.135999 104.192004 79.791993 75.167999
4 64.0 101.439998 104.672000 81.248000 75.839996
5 128.0 96.192002 99.647999 114.416003 78.879997
6 256.0 120.512001 126.496002 165.536001 95.040001
7 512.0 237.056002 230.943993 315.423995 163.167998
8 1024.0 463.519990 439.999998 618.463993 300.240010
9 2048.0 918.335974 860.480011 1229.840040 592.480004
10 4096.0 1787.392020 1565.888047 2399.712086 1168.352008
deepseek-ai/DeepSeek-V3 N=32768 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 19.360000 22.816001 24.256000 19.040000
1 8.0 19.040000 22.720000 24.224000 19.392001
2 16.0 19.200001 22.911999 24.095999 20.512000
3 32.0 19.360000 23.040000 24.351999 19.680001
4 64.0 19.552000 23.040000 24.831999 20.671999
5 128.0 19.680001 23.264000 26.432000 23.776000
6 256.0 25.664000 28.736001 33.087999 24.383999
7 512.0 41.503999 42.879999 56.127999 40.192001
8 1024.0 74.304000 72.480001 96.928000 60.288001
9 2048.0 137.344003 129.040003 181.088001 111.584000
10 4096.0 276.735991 241.183996 343.136013 198.175997
deepseek-ai/DeepSeek-V3 N=7168 K=16384:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 94.144002 93.759999 130.528003 63.648000
1 8.0 90.496004 93.312003 78.432001 61.152000
2 16.0 90.496004 93.280002 78.288004 59.712000
3 32.0 90.879999 93.408003 78.432001 56.607999
4 64.0 91.072001 93.503997 78.336000 56.511998
5 128.0 90.080000 93.440004 96.256003 57.760000
6 256.0 89.695998 93.567997 125.152007 69.920003
7 512.0 178.335994 181.823999 243.584007 134.368002
8 1024.0 347.647995 343.488008 560.383976 199.839994
9 2048.0 602.735996 587.487996 831.055999 391.647995
10 4096.0 1197.247982 1158.207893 1545.248032 716.624022
deepseek-ai/DeepSeek-V3 N=7168 K=18432:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 103.519998 103.423998 135.135993 68.928003
1 8.0 100.575998 103.391998 86.208001 67.808002
2 16.0 100.479998 103.391998 89.583993 66.271998
3 32.0 100.511998 103.136003 89.535996 62.624000
4 64.0 100.703999 103.359997 87.935999 60.736001
5 128.0 101.120003 102.848001 108.832002 61.856002
6 256.0 100.000001 103.712000 133.824006 77.791996
7 512.0 199.744001 200.864002 263.424009 144.768000
8 1024.0 389.887989 383.359998 614.447951 223.808005
9 2048.0 675.967991 658.976018 902.239978 438.688010
10 4096.0 1342.720032 1306.671977 1787.775993 856.288016
deepseek-ai/DeepSeek-V3 N=9216 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 47.488000 50.912000 80.959998 38.495999
1 8.0 47.136001 50.464001 56.992002 37.408002
2 16.0 47.456000 51.295999 56.896001 37.471998
3 32.0 46.464000 50.271999 56.928001 37.344001
4 64.0 46.879999 50.560001 57.503998 37.535999
5 128.0 46.367999 49.952000 62.752001 39.071999
6 256.0 83.424002 88.032000 99.104002 45.791999
7 512.0 122.175999 122.592002 160.512000 79.456002
8 1024.0 198.111996 193.087995 262.208015 150.976002
9 2048.0 349.808007 335.391998 464.895993 273.472011
10 4096.0 692.767978 654.415965 920.383990 428.591996
deepseek-ai/DeepSeek-V3 N=6144 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 47.008000 49.263999 65.536000 38.176000
1 8.0 45.952000 49.408000 52.544001 36.063999
2 16.0 46.080001 49.568001 52.671999 36.607999
3 32.0 45.984000 49.568001 52.576002 35.583999
4 64.0 45.887999 49.408000 52.832000 33.408001
5 128.0 45.887999 49.600001 53.264000 38.240001
6 256.0 46.176001 49.759999 59.039999 38.575999
7 512.0 84.959999 88.128000 110.335998 56.320000
8 1024.0 123.103999 124.448001 161.216006 83.200000
9 2048.0 235.072002 229.568005 312.864006 158.848003
10 4096.0 464.464009 443.520010 617.151976 295.455992
deepseek-ai/DeepSeek-V3 N=8192 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 13.248000 16.640000 20.959999 26.208000
1 8.0 13.280000 16.704001 21.632001 26.032001
2 16.0 13.280000 16.640000 19.776000 25.024001
3 32.0 13.120000 16.608000 20.736000 23.840001
4 64.0 13.184000 16.608000 20.288000 26.176000
5 128.0 13.184000 16.704001 19.936001 28.608000
6 256.0 13.120000 16.480001 20.800000 28.960001
7 512.0 16.352000 19.200001 21.536000 27.008001
8 1024.0 24.672000 27.200000 30.848000 29.759999
9 2048.0 41.503999 42.560000 53.215999 40.832002
10 4096.0 74.816003 71.295999 96.783996 61.184000
deepseek-ai/DeepSeek-V3 N=6144 K=1536:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 18.495999 21.792000 27.071999 30.015999
1 8.0 18.144000 21.632001 23.903999 28.031999
2 16.0 18.080000 21.600001 23.264000 27.488001
3 32.0 18.176001 21.695999 24.016000 27.264001
4 64.0 18.112000 21.663999 23.776000 29.568000
5 128.0 18.304000 21.856001 23.744000 27.872000
6 256.0 18.208001 21.695999 23.104001 26.496001
7 512.0 27.264001 30.176001 32.416001 28.896000
8 1024.0 36.063999 38.431998 45.024000 33.199999
9 2048.0 63.744001 62.495999 81.504002 49.632002
10 4096.0 119.199999 111.071996 152.799994 90.623997
deepseek-ai/DeepSeek-V3 N=1024 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 25.567999 38.624000 45.919999 25.152000
1 8.0 25.312001 38.336001 43.871999 24.863999
2 16.0 25.504000 38.975999 43.839999 25.664000
3 32.0 25.248000 38.176000 43.903999 24.256000
4 64.0 25.184000 38.495999 44.160001 25.087999
5 128.0 25.216000 38.400002 44.447999 25.728000
6 256.0 25.952000 38.688000 44.319998 23.488000
7 512.0 31.328000 36.832001 45.056000 27.807999
8 1024.0 39.967999 42.784002 46.335999 33.504002
9 2048.0 54.752000 51.711999 57.376001 40.608000
10 4096.0 96.864000 92.607997 111.584000 59.200000
deepseek-ai/DeepSeek-V3 N=7168 K=4608:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 34.495998 37.376001 48.032001 30.239999
1 8.0 33.631999 37.312001 38.592000 28.192000
2 16.0 33.920001 37.503999 38.527999 28.640000
3 32.0 33.472002 37.216000 38.688000 28.543999
4 64.0 33.888001 37.471998 39.168000 32.224000
5 128.0 33.599999 37.439998 39.391998 32.063998
6 256.0 34.047998 37.792001 43.104000 33.183999
7 512.0 57.280000 60.031999 73.856004 44.992000
8 1024.0 107.712001 106.720001 140.543997 67.936003
9 2048.0 183.167994 174.592003 241.888002 121.728003
10 4096.0 356.736004 333.248019 472.575992 216.447994
deepseek-ai/DeepSeek-V3 N=7168 K=4096:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 32.000002 35.328001 43.712001 34.591999
1 8.0 31.136001 34.688000 34.559999 33.792000
2 16.0 30.784000 34.527998 34.591999 35.007998
3 32.0 31.104000 34.559999 34.784000 29.216001
4 64.0 30.912001 34.623999 34.784000 34.944002
5 128.0 30.848000 34.688000 35.103999 32.159999
6 256.0 31.168001 34.784000 39.328001 38.431998
7 512.0 52.928001 55.071998 69.311999 43.136001
8 1024.0 97.792000 97.152002 132.192001 62.176000
9 2048.0 164.928004 157.856002 226.879999 110.495999
10 4096.0 320.479989 300.080001 442.400008 196.160004
deepseek-ai/DeepSeek-V3 N=7168 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 13.216000 16.448000 24.512000 31.808000
1 8.0 13.120000 16.543999 25.664000 30.112000
2 16.0 13.248000 16.576000 24.544001 31.711999
3 32.0 13.024000 16.416000 24.863999 31.744000
4 64.0 13.024000 16.448000 24.480000 32.800000
5 128.0 13.120000 16.511999 25.152000 32.352000
6 256.0 13.024000 16.352000 25.087999 33.696000
7 512.0 16.352000 19.136000 24.896000 31.199999
8 1024.0 24.672000 27.136000 30.272000 29.088000
9 2048.0 37.376001 38.816001 47.359999 34.047998
10 4096.0 66.560000 63.904002 86.144000 52.960001
Skip N=576, K=7168 now
deepseek-ai/DeepSeek-V3 N=24576 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 106.335998 108.736001 99.776000 82.368001
1 8.0 100.960001 104.111999 90.144001 78.272000
2 16.0 103.168003 106.527999 80.640003 75.839996
3 32.0 101.503998 104.479998 80.063999 75.199999
4 64.0 101.856001 106.463999 81.855997 76.063998
5 128.0 98.463997 101.728000 117.728002 79.424001
6 256.0 124.799997 126.432002 165.887997 95.296003
7 512.0 237.184003 231.135994 315.391988 163.167998
8 1024.0 463.440001 440.384001 618.560016 321.343988
9 2048.0 918.447971 859.824002 1226.112008 566.303968
10 4096.0 1787.375927 1570.816040 2396.192074 1192.224026
deepseek-ai/DeepSeek-V3 N=32768 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 19.328000 22.720000 29.056000 28.543999
1 8.0 19.040000 22.720000 28.960001 23.264000
2 16.0 19.200001 22.911999 29.440001 30.528000
3 32.0 19.328000 23.008000 30.944001 23.456000
4 64.0 19.487999 23.008000 29.216001 26.335999
5 128.0 19.648001 23.264000 29.552000 33.408001
6 256.0 25.664000 28.736001 35.615999 29.120000
7 512.0 41.423999 42.847998 55.840001 40.128000
8 1024.0 73.632002 71.456000 101.311997 60.864002
9 2048.0 138.671994 136.224002 181.439996 112.063996
10 4096.0 275.119990 235.936001 340.223998 196.480006
deepseek-ai/DeepSeek-V3 N=7168 K=16384:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 94.048001 93.984000 129.951999 63.648000
1 8.0 90.527996 93.216002 78.592002 61.152000
2 16.0 90.655997 93.024001 78.240000 59.519999
3 32.0 90.591997 93.248002 78.207999 56.671999
4 64.0 91.008000 93.503997 78.368001 56.511998
5 128.0 90.496004 93.344003 96.447997 57.792000
6 256.0 90.304002 93.791999 123.839997 70.096001
7 512.0 178.880006 181.375995 243.711993 132.640004
8 1024.0 348.383993 343.679994 561.007977 200.736001
9 2048.0 603.232026 588.544011 830.272019 378.080010
10 4096.0 1197.280049 1158.112049 1658.975959 759.679973
deepseek-ai/DeepSeek-V3 N=7168 K=18432:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 103.391998 103.136003 135.232002 69.055997
1 8.0 100.415997 103.040002 86.080000 67.840002
2 16.0 100.415997 103.104003 89.472003 66.303998
3 32.0 100.543998 102.144003 89.120001 62.592000
4 64.0 100.639999 103.072003 87.360002 60.608000
5 128.0 100.607999 103.391998 109.407999 62.144000
6 256.0 104.255997 108.415999 138.720006 77.776000
7 512.0 199.552000 202.304006 262.847990 146.847993
8 1024.0 389.759988 382.991999 615.856051 213.760003
9 2048.0 675.743997 660.511971 897.696018 425.024003
10 4096.0 1342.479944 1306.848049 1687.008023 856.800020
deepseek-ai/DeepSeek-V3 N=4608 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 46.592001 48.703998 58.063999 33.472002
1 8.0 45.632001 48.992001 38.943999 35.664000
2 16.0 45.696001 48.992001 38.832001 36.448002
3 32.0 45.919999 48.703998 38.768001 35.135999
4 64.0 45.696001 49.343999 39.423998 31.872001
5 128.0 46.208002 49.279999 46.847999 36.448002
6 256.0 46.080001 49.536001 57.952002 38.896002
7 512.0 84.192000 87.119997 98.352000 44.319998
8 1024.0 122.063994 122.272000 164.928004 79.264000
9 2048.0 198.287994 193.087995 266.400009 150.720000
10 4096.0 350.224018 337.568015 466.143996 270.687997
deepseek-ai/DeepSeek-V3 N=3072 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 45.600001 48.576001 48.432000 31.840000
1 8.0 45.600001 48.928000 39.391998 36.224000
2 16.0 45.536000 48.640002 37.344001 34.944002
3 32.0 45.488000 48.672002 37.824001 32.448001
4 64.0 45.632001 48.992001 38.800001 33.151999
5 128.0 45.696001 48.480000 41.616000 33.856001
6 256.0 45.823999 49.152002 45.823999 36.192000
7 512.0 46.112001 48.864000 57.696000 40.128000
8 1024.0 84.480003 87.647997 111.904003 53.695999
9 2048.0 122.911997 124.159999 161.216006 84.624000
10 4096.0 236.335993 239.695996 316.224009 159.263998
deepseek-ai/DeepSeek-V3 N=4096 K=512:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 12.352000 15.616000 30.656001 26.912000
1 8.0 12.320000 15.552000 30.080000 29.711999
2 16.0 12.352000 15.584000 29.856000 25.567999
3 32.0 12.352000 15.648000 30.528000 29.632000
4 64.0 12.352000 15.552000 29.376000 31.488001
5 128.0 12.448000 15.744001 29.408000 31.199999
6 256.0 12.480000 15.744001 33.280000 30.975999
7 512.0 12.256000 15.440000 30.592000 28.640000
8 1024.0 16.160000 19.040000 29.120000 31.808000
9 2048.0 24.704000 27.104000 35.039999 31.008000
10 4096.0 41.792002 43.264002 51.743999 37.248001
deepseek-ai/DeepSeek-V3 N=3072 K=1536:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 17.568000 20.832000 25.823999 30.688001
1 8.0 17.568000 20.896001 24.623999 29.136000
2 16.0 17.600000 20.992000 25.696000 27.327999
3 32.0 17.600000 20.992000 25.184000 30.048000
4 64.0 17.632000 21.024000 23.808001 32.256000
5 128.0 17.728001 21.120001 24.383999 32.671999
6 256.0 17.759999 21.024000 25.536001 30.975999
7 512.0 17.503999 20.864001 24.320001 31.936001
8 1024.0 26.880000 29.632000 32.384001 31.583998
9 2048.0 36.256000 38.368002 44.672001 34.623999
10 4096.0 63.904002 65.311998 80.640003 48.608001
deepseek-ai/DeepSeek-V3 N=512 K=7168:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 24.496000 37.344001 45.024000 31.072000
1 8.0 24.607999 37.248001 43.488000 30.495999
2 16.0 24.351999 37.280001 43.327998 30.784000
3 32.0 24.480000 37.087999 43.423999 29.472001
4 64.0 24.416000 37.439998 43.488000 26.240001
5 128.0 24.576001 37.248001 44.032000 27.200000
6 256.0 24.768000 37.696000 43.712001 26.720000
7 512.0 25.920000 38.368002 44.000000 29.552000
8 1024.0 31.328000 37.087999 45.248002 36.192000
9 2048.0 39.616000 42.592000 47.904000 35.776000
10 4096.0 56.000002 51.936001 59.744000 40.911999
deepseek-ai/DeepSeek-V3 N=7168 K=2304:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 22.752000 26.240001 33.344001 32.607999
1 8.0 22.048000 25.599999 32.543998 32.175999
2 16.0 22.175999 25.760001 32.575998 31.296000
3 32.0 22.431999 25.920000 31.136001 31.264000
4 64.0 22.112001 25.696000 30.816000 34.015998
5 128.0 22.208000 25.888000 31.552002 28.416000
6 256.0 22.336001 25.984000 37.440002 30.624000
7 512.0 34.880001 37.696000 43.391999 39.296001
8 1024.0 59.808001 61.071999 95.264003 42.879999
9 2048.0 97.888000 98.880000 133.487999 77.119999
10 4096.0 194.271997 173.984006 242.944002 140.064001
deepseek-ai/DeepSeek-V3 N=7168 K=2048:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 21.407999 24.896000 31.520002 34.047998
1 8.0 20.864001 24.512000 30.560000 33.599999
2 16.0 20.927999 24.512000 31.663999 34.272000
3 32.0 20.832000 24.480000 31.264000 33.760000
4 64.0 20.992000 24.416000 29.503999 32.896001
5 128.0 20.800000 24.320001 36.256000 34.272000
6 256.0 20.864001 24.480000 31.360000 35.808001
7 512.0 32.384001 35.424002 39.584000 39.551999
8 1024.0 55.583999 55.936001 83.871998 38.656000
9 2048.0 92.096001 86.879998 118.752003 69.343999
10 4096.0 177.312002 167.359993 233.664006 114.656001
deepseek-ai/DeepSeek-V3 N=7168 K=256:
fp8 blockwise scaled matmul:
batch_size vllm sgl-kernel sglang triton deepgemm
0 1.0 11.168000 14.496000 26.400000 31.168001
1 8.0 11.168000 14.496000 26.016001 29.968001
2 16.0 11.168000 14.496000 25.696000 29.600000
3 32.0 11.168000 14.464000 26.272001 30.944001
4 64.0 11.040000 14.368000 24.000000 26.912000
5 128.0 11.040000 14.368000 29.888000 33.535998
6 256.0 11.168000 14.464000 25.408000 29.408000
7 512.0 13.824000 16.704001 25.312001 32.768000
8 1024.0 19.264000 22.208000 24.192000 23.520000
9 2048.0 28.128000 30.304000 36.160000 29.088000
10 4096.0 48.640002 49.247999 64.000003 45.632001
Benchmark finished!
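A minimal sketch of how per-shape timings like those above could be collected, assuming `triton.testing.do_bench` and reporting microseconds; the actual script that produced the tables may differ:

```python
import torch
import triton

def bench_us(fn, warmup=25, rep=100):
    # triton.testing.do_bench returns milliseconds; convert to microseconds
    # to match the scale of the tables above.
    return triton.testing.do_bench(fn, warmup=warmup, rep=rep) * 1000.0

if __name__ == "__main__":
    # Placeholder workload: a bf16 matmul stands in for the fp8 blockwise kernels;
    # swap in the sgl-kernel / vLLM / DeepGEMM calls to reproduce a comparison.
    M, N, K = 1024, 24576, 7168
    a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)
    print(f"bf16 matmul: {bench_us(lambda: a @ b.t()):.3f} us")
```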
Follow-ups
zhyncs approved these changes on Apr 9, 2025.
Motivation
TODO
Test
Checklist