
sgl-kernel use cutlass latest version for fp8 blockwise gemm #5207


Merged: 1 commit merged into main from update-fp8-blockwise-kernel on Apr 9, 2025

Conversation

yizhang2077 (Collaborator) commented Apr 9, 2025

Motivation

  1. Use the latest CUTLASS to implement fp8 blockwise GEMM for sgl-kernel (see the quantization sketch below).
  2. Remove the related vLLM files from sgl-kernel.
  3. Update the benchmark to cover Triton and DeepGEMM.
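
For context, "fp8 blockwise" here follows the DeepSeek-V3 scaling scheme: the weight carries one scale per 128x128 block and the activation one scale per 1x128 group along K. A minimal PyTorch sketch of that quantization (helper names, the 448.0 e4m3 max, and the clamp are illustrative assumptions, not the kernel's actual helpers; requires a PyTorch build with float8 dtypes):

import torch

FP8_MAX = 448.0   # assumed max representable value of float8_e4m3fn
BLOCK = 128

def quant_weight_blockwise(w: torch.Tensor):
    # Quantize an (N, K) weight with one scale per 128x128 block.
    n, k = w.shape
    wb = w.reshape(n // BLOCK, BLOCK, k // BLOCK, BLOCK)
    amax = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4)
    scale = amax / FP8_MAX                                    # (N/128, 1, K/128, 1)
    w_fp8 = (wb / scale).to(torch.float8_e4m3fn).reshape(n, k)
    return w_fp8, scale[:, 0, :, 0]                           # scales: (N/128, K/128)

def quant_act_groupwise(a: torch.Tensor):
    # Quantize an (M, K) activation with one scale per 1x128 group along K.
    m, k = a.shape
    ag = a.reshape(m, k // BLOCK, BLOCK)
    amax = ag.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scale = amax / FP8_MAX                                    # (M, K/128, 1)
    a_fp8 = (ag / scale).to(torch.float8_e4m3fn).reshape(m, k)
    return a_fp8, scale[..., 0]                               # scales: (M, K/128)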

TODO

  1. Use padding to support N/K that are not divisible by 128 (see the padding sketch after this list).
  2. Fine-tune for more shapes (only TileShape M and the cluster shape can be tuned).
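
One plausible shape of the padding TODO, assuming the kernel only accepts N and K that are multiples of 128 (a sketch, not the committed implementation; the high-precision weight is padded before quantization so the block scales stay well-defined):

import torch
import torch.nn.functional as F

BLOCK = 128

def pad_weight_for_blockwise(w: torch.Tensor) -> torch.Tensor:
    # Zero-pad an (N, K) weight up to multiples of 128 before quantization.
    n, k = w.shape
    pad_n, pad_k = (-n) % BLOCK, (-k) % BLOCK
    # F.pad pads the last dim first: (K_left, K_right, N_top, N_bottom)
    return F.pad(w, (0, pad_k, 0, pad_n))

# Usage sketch: pad and quantize the weight once at load time, run the
# 128-aligned GEMM on the padded operands, then slice the output back:
#   out = blockwise_gemm(a_fp8, w_fp8_padded, a_scale, w_scale)[:, :n]   # hypothetical call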

Test

# correctness
python3 tests/test_fp8_blockwise_gemm.py
# benchmark
python3 benchmark/bench_fp8_blockwise_gemm.py --tp-sizes 1 2 4 8 
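
In spirit, the correctness test checks the kernel output against a dequantize-and-matmul reference. A rough sketch of such a reference under the scaling layout above (names and tolerances are illustrative; the real assertions live in tests/test_fp8_blockwise_gemm.py):

import torch

def ref_blockwise_matmul(a_fp8, a_scale, b_fp8, b_scale, out_dtype=torch.bfloat16):
    # a_fp8: (M, K) float8, a_scale: (M, K/128)      -- one scale per 1x128 group
    # b_fp8: (N, K) float8, b_scale: (N/128, K/128)  -- one scale per 128x128 block
    m, k = a_fp8.shape
    n, _ = b_fp8.shape
    a = (a_fp8.float().reshape(m, k // 128, 128) * a_scale[..., None]).reshape(m, k)
    b = (b_fp8.float().reshape(n // 128, 128, k // 128, 128)
         * b_scale[:, None, :, None]).reshape(n, k)
    return (a @ b.t()).to(out_dtype)

# The test would then assert something like:
#   torch.testing.assert_close(kernel_out,
#                              ref_blockwise_matmul(a_fp8, a_scale, b_fp8, b_scale),
#                              rtol=2e-2, atol=1e-1)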

Checklist

yizhang2077 changed the title from "sgl-kernel use cutlass latest version for fp8 blockwise" to "sgl-kernel use cutlass latest version for fp8 blockwise gemm" on Apr 9, 2025
yizhang2077 (Collaborator, Author):

Benchmark results (smaller is better; a generic timing-loop sketch follows after the tables)

# test time interval, smaller is better
Skip N=576, K=7168 now
deepseek-ai/DeepSeek-V3 N=24576 K=7168: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton     deepgemm
0          1.0   106.895998   110.079996     101.631999    82.208000
1          8.0   100.896001   104.255997      90.208001    78.432001
2         16.0   102.720000   105.984002      80.704004    75.999998
3         32.0   101.071998   104.223996      80.544002    75.392000
4         64.0   101.888001   105.184004      80.799997    75.616002
5        128.0    97.184002   101.631999     116.832003    78.847997
6        256.0   123.903997   127.168000     165.695995    94.655998
7        512.0   236.256003   229.855999     315.295994   162.368000
8       1024.0   463.551998   438.015997     617.231965   284.447998
9       2048.0   918.528020   860.607982    1222.831964   563.935995
10      4096.0  1787.647963  1570.639968    2254.159927  1182.047963
deepseek-ai/DeepSeek-V3 N=32768 K=512: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   19.168001   22.624001      22.720000   19.200001
1          8.0   19.040000   22.816001      26.240001   17.792000
2         16.0   19.168001   22.879999      25.248000   18.400000
3         32.0   19.328000   22.976000      25.280001   19.040000
4         64.0   19.455999   23.040000      24.208000   18.816000
5        128.0   19.648001   23.424000      26.176000   22.336001
6        256.0   25.664000   28.640000      33.183999   24.192000
7        512.0   41.343998   42.847998      57.376001   40.784001
8       1024.0   74.560001   71.776003      96.896000   62.495999
9       2048.0  143.808007  128.191993     181.375995  111.840002
10      4096.0  276.320010  236.831993     340.384007  196.047992
deepseek-ai/DeepSeek-V3 N=7168 K=16384: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0    91.903999    93.759999     130.528003   63.712001
1          8.0    91.104001    93.216002      78.127995   61.055999
2         16.0    90.304002    93.024001      78.143999   59.487998
3         32.0    91.232002    93.152002      78.304000   56.864001
4         64.0    90.687998    93.344003      78.207999   56.543998
5        128.0    90.464003    93.120001      96.320003   58.079999
6        256.0    93.631998    97.360000     128.703997   69.440000
7        512.0   178.432003   180.447996     243.456006  134.143993
8       1024.0   347.808003   343.775988     559.552014  200.479999
9       2048.0   602.656007   587.199986     830.208004  379.247993
10      4096.0  1197.296023  1163.712025    1650.943995  767.567992
deepseek-ai/DeepSeek-V3 N=7168 K=18432: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0   103.615999   103.583999     135.199994   68.928003
1          8.0   101.088002   102.399997      85.823998   67.776002
2         16.0   101.503998   102.431998      90.976000   66.496000
3         32.0   100.607999   103.008002      89.823999   62.688001
4         64.0   101.503998   103.136003      87.392002   60.543999
5        128.0   100.768000   102.944002     108.800001   62.112000
6        256.0   104.192004   107.744001     138.687998   77.760004
7        512.0   199.744001   201.376006     262.912005  147.136003
8       1024.0   389.759988   383.087993     615.743995  211.935997
9       2048.0   675.935984   660.303950     898.368001  449.136019
10      4096.0  1343.136072  1304.656029    1790.912032  840.416014
deepseek-ai/DeepSeek-V3 N=36864 K=7168: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton     deepgemm
0          1.0   153.824002   155.072004     167.728007   102.144003
1          8.0   145.855993   149.887994     144.319996   102.623999
2         16.0   150.399998   153.760001     133.215994   103.295997
3         32.0   148.448005   151.488006     132.640004   103.615999
4         64.0   150.879994   153.983995     134.207994   104.928002
5        128.0   140.928000   143.792003     171.664000   109.407999
6        256.0   199.423999   198.208004     266.335994   154.367998
7        512.0   350.816011   334.223986     467.727989   224.544004
8       1024.0   690.559983   648.064017     868.592024   435.663998
9       2048.0  1335.999966  1173.792005    1687.423944   876.703978
10      4096.0  2653.664112  2329.312086    3347.504139  1774.703979
deepseek-ai/DeepSeek-V3 N=24576 K=7168: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton     deepgemm
0          1.0   106.335998   108.447999      99.519998    82.144000
1          8.0   100.832000   104.000002      90.176001    78.464001
2         16.0   103.040002   106.495999      80.640003    75.808004
3         32.0   101.439998   104.543999      79.967998    75.199999
4         64.0   102.048002   105.120003      81.568003    76.095998
5        128.0    96.639998    99.951997     117.536001    79.328001
6        256.0   124.736004   126.208007     165.631995    95.104001
7        512.0   236.448005   231.296003     315.279990   163.455993
8       1024.0   463.936001   440.607995     617.504001   299.488008
9       2048.0   918.175995   860.127985    1229.552031   592.320025
10      4096.0  1787.296057  1566.239953    2395.391941  1190.368056
deepseek-ai/DeepSeek-V3 N=32768 K=512: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   19.328000   22.752000      22.911999   18.528000
1          8.0   19.072000   22.720000      22.464000   17.247999
2         16.0   19.231999   22.911999      22.480000   17.568000
3         32.0   19.360000   23.008000      23.232000   21.536000
4         64.0   19.552000   23.135999      22.624001   18.816000
5        128.0   19.648001   23.264000      23.424000   21.856001
6        256.0   25.696000   28.832000      32.928001   23.871999
7        512.0   41.503999   42.911999      56.095999   40.064000
8       1024.0   74.592002   72.927997      97.759999   60.608000
9       2048.0  136.639997  136.255994     189.408004  105.375998
10      4096.0  257.728010  254.976004     346.303999  196.224004
deepseek-ai/DeepSeek-V3 N=24576 K=1536: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   34.304000   37.023999      38.816001   25.760001
1          8.0   32.639999   35.936002      27.519999   25.312001
2         16.0   32.960001   36.063999      27.360000   26.016001
3         32.0   33.504002   36.736000      27.680000   25.536001
4         64.0   33.376001   36.672000      27.904000   26.944000
5        128.0   31.904001   35.071999      37.535999   30.592000
6        256.0   37.439998   40.128000      48.640002   34.400001
7        512.0   63.840002   65.343998      86.144000   51.263999
8       1024.0  118.336000  116.319999     161.791995   91.104001
9       2048.0  232.271999  207.231998     309.184015  155.328006
10      4096.0  442.912012  386.224002     562.687993  293.119997
deepseek-ai/DeepSeek-V3 N=4096 K=7168: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   46.271998   48.767999      53.536002   28.960001
1          8.0   45.600001   48.735999      48.928000   28.255999
2         16.0   45.407999   48.735999      48.864000   29.184001
3         32.0   45.632001   48.864000      48.831999   24.800001
4         64.0   45.472000   48.896000      49.215999   25.504000
5        128.0   45.536000   49.024001      49.376000   26.192000
6        256.0   45.823999   49.279999      49.952000   29.088000
7        512.0   45.759998   50.175998      59.680000   36.543999
8       1024.0   84.767997   87.679997     110.335998   59.136000
9       2048.0  160.383999  161.040008     212.255999  109.568000
10      4096.0  312.000006  313.199997     414.687991  208.767995
deepseek-ai/DeepSeek-V3 N=7168 K=18432: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0   102.240004   103.632003     135.263994   68.832003
1          8.0   100.575998   103.200004      87.488003   67.968003
2         16.0   100.736000   103.519998      89.599997   66.271998
3         32.0   100.992002   102.976002      89.535996   62.624000
4         64.0   100.864001   103.359997      87.807998   60.543999
5        128.0   100.096002   103.072003     109.375998   62.112000
6        256.0   104.415998   108.704001     138.912007   77.855997
7        512.0   199.328005   201.184005     262.751997  146.783993
8       1024.0   389.759988   382.912010     615.631998  211.968005
9       2048.0   676.352024   658.976018     902.112007  434.768021
10      4096.0  1342.720032  1306.879997    1687.744021  854.272008
deepseek-ai/DeepSeek-V3 N=7168 K=16384: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0    93.248002    93.567997     130.080000   63.712001
1          8.0    89.919999    92.639998      78.175999   61.055999
2         16.0    90.304002    92.960000      78.015998   59.519999
3         32.0    90.559997    93.312003      78.383997   56.639999
4         64.0    90.847999    93.695998      78.400001   56.543998
5        128.0    90.655997    93.184002      96.543998   57.856001
6        256.0    90.655997    93.952000     129.023999   69.728002
7        512.0   178.207994   181.439996     243.456006  135.135993
8       1024.0   349.151999   342.335999     561.183989  193.151996
9       2048.0   602.912009   587.360024     834.111989  392.832011
10      4096.0  1197.504044  1160.464048    1651.551962  764.720023
deepseek-ai/DeepSeek-V3 N=7168 K=2048: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   21.312000   24.896000      25.087999   26.303999
1          8.0   20.832000   24.383999      23.615999   22.112001
2         16.0   20.959999   24.544001      23.615999   21.504000
3         32.0   20.927999   24.480000      23.088001   20.976000
4         64.0   20.992000   24.416000      23.232000   21.280000
5        128.0   20.959999   24.288001      24.335999   26.128000
6        256.0   20.864001   24.416000      24.639999   27.807999
7        512.0   32.224000   35.392001      39.136000   28.255999
8       1024.0   55.392001   56.063998      83.456002   38.160000
9       2048.0   92.207998   90.240002     123.103999   66.239998
10      4096.0  167.935997  167.120010     233.567998  114.720002
Skip N=576, K=7168 now
deepseek-ai/DeepSeek-V3 N=24576 K=7168: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton     deepgemm
0          1.0   106.416002   108.191997      99.327996    82.176000
1          8.0   100.256003   103.712000      90.080000    78.464001
2         16.0   102.816001   106.239997      80.608003    75.839996
3         32.0   101.024002   104.223996      79.839997    75.199999
4         64.0   101.503998   104.832001      81.472002    75.903997
5        128.0    98.399997   101.792000     118.047997    79.264000
6        256.0   124.672003   126.047999     166.063994    95.168002
7        512.0   237.120003   231.391996     315.775990   164.223999
8       1024.0   463.551998   438.847989     618.848026   299.199998
9       2048.0   918.304026   861.760020    1224.671960   560.383976
10      4096.0  1782.431960  1668.192029    2259.360075  1151.328087
deepseek-ai/DeepSeek-V3 N=32768 K=512: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   19.168001   22.720000      22.208000   18.848000
1          8.0   19.104000   22.752000      22.560000   18.015999
2         16.0   19.392001   22.944000      22.112001   18.560000
3         32.0   19.360000   23.040000      22.080000   17.664000
4         64.0   19.520000   23.072001      22.240000   18.048000
5        128.0   19.711999   23.296000      23.488000   21.568000
6        256.0   25.920000   28.928000      32.992002   23.552001
7        512.0   41.407999   42.879999      56.127999   40.064000
8       1024.0   74.592002   72.864003      97.471997   60.447998
9       2048.0  137.024000  135.168001     181.247994  104.287997
10      4096.0  256.159991  254.592001     345.775992  196.255997
deepseek-ai/DeepSeek-V3 N=7168 K=16384: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0    92.639998    94.240002     130.303994   63.616000
1          8.0    90.368003    93.056001      78.304000   61.087999
2         16.0    90.240002    93.120001      78.143999   59.487998
3         32.0    90.496004    93.184002      78.143999   56.639999
4         64.0    91.040000    93.440004      78.304000   56.543998
5        128.0    90.016000    93.216002      96.383996   57.856001
6        256.0    90.304002    97.439997     129.695997   69.760002
7        512.0   179.168001   180.255994     243.423998  135.232002
8       1024.0   347.743988   342.128009     560.320020  195.712000
9       2048.0   602.656007   587.360024     829.343975  398.784012
10      4096.0  1196.768045  1156.991959    1545.439959  765.743971
deepseek-ai/DeepSeek-V3 N=7168 K=18432: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0   104.800001   103.840001     135.008007   68.896003
1          8.0   100.383997   103.104003      86.144000   67.840002
2         16.0   100.447997   103.136003      89.328006   66.496000
3         32.0   100.511998   102.976002      89.535996   62.560000
4         64.0   100.607999   103.104003      87.839998   60.575999
5        128.0   100.128002   103.455998     109.088004   62.080000
6        256.0   100.064002   103.519998     133.471996   77.760004
7        512.0   199.967995   202.448010     262.560010  144.639999
8       1024.0   389.535993   383.616000     615.423977  221.760005
9       2048.0   676.064014   659.888029     898.624003  443.360001
10      4096.0  1342.911959  1306.976080    1787.968040  826.367974
deepseek-ai/DeepSeek-V3 N=18432 K=7168: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0   102.944002   101.856001     114.288002   60.368001
1          8.0   100.639999   103.391998      73.087998   60.288001
2         16.0   101.471998   104.560003      72.768003   60.479999
3         32.0   100.192003    99.519998      72.864003   60.463998
4         64.0   101.487994   104.399994      73.536001   60.896002
5        128.0    95.760003    98.976001     105.839998   63.167997
6        256.0   123.328000   124.672003     162.783995   80.959998
7        512.0   198.431998   194.656000     262.879997  152.224004
8       1024.0   349.631995   334.271997     466.432005  292.656004
9       2048.0   690.880001   650.815964     864.768028  426.560014
10      4096.0  1337.327957  1253.455997    1683.552027  882.016003
deepseek-ai/DeepSeek-V3 N=12288 K=7168: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   56.256000   60.128000      91.328003   48.767999
1          8.0   54.976001   59.360001      61.280001   45.536000
2         16.0   60.511999   64.336002      61.055999   45.456000
3         32.0   50.624002   54.559998      61.120000   45.664001
4         64.0   57.472002   61.728001      61.439998   45.568001
5        128.0   50.175998   54.016002      65.920003   47.040001
6        256.0   84.256001   88.895999     112.255998   62.399998
7        512.0  122.879997  124.416001     162.655994   85.968003
8       1024.0  235.808000  229.471996     313.840002  152.544007
9       2048.0  463.584006  439.007998     618.319988  288.672000
10      4096.0  919.264019  862.335980    1226.896048  564.736009
deepseek-ai/DeepSeek-V3 N=16384 K=512: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   14.464000   17.728001      19.872000   21.472000
1          8.0   14.304000   17.888000      19.392001   21.695999
2         16.0   14.304000   17.920000      18.719999   21.888001
3         32.0   14.368000   17.728001      20.320000   17.983999
4         64.0   14.112000   17.664000      17.344000   22.112001
5        128.0   14.656000   17.888000      17.728001   23.167999
6        256.0   17.408000   20.288000      20.800000   23.776000
7        512.0   24.960000   27.424000      31.104000   26.720000
8       1024.0   41.216001   42.495999      54.079998   40.415999
9       2048.0   73.664002   71.872003      97.407997   62.816001
10      4096.0  136.608005  126.816005     190.528005  106.144004
deepseek-ai/DeepSeek-V3 N=12288 K=1536: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   21.600001   25.087999      28.128000   21.504000
1          8.0   20.032000   24.143999      20.671999   24.639999
2         16.0   20.736000   24.383999      21.280000   23.391999
3         32.0   19.520000   23.328001      20.864001   21.215999
4         64.0   19.487999   23.167999      21.280000   20.768000
5        128.0   19.487999   23.040000      22.208000   23.712000
6        256.0   27.968001   30.719999      32.960001   23.647999
7        512.0   36.768001   38.816001      46.016000   31.744000
8       1024.0   63.040003   62.912002      81.632003   49.568001
9       2048.0  122.288004  115.840003     159.904003   86.080000
10      4096.0  230.304003  205.504000     308.319986  156.448007
deepseek-ai/DeepSeek-V3 N=2048 K=7168: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   27.904000   40.511999      47.839999   27.488001
1          8.0   27.519999   40.640000      45.568001   27.616000
2         16.0   27.648000   40.704001      45.664001   29.376000
3         32.0   27.712001   40.080000      45.504000   26.880000
4         64.0   27.775999   40.863998      45.791999   25.504000
5        128.0   27.584000   40.479999      46.144001   27.744001
6        256.0   31.776000   37.312001      46.592001   32.800000
7        512.0   39.840002   42.656001      46.879999   40.736001
8       1024.0   55.071998   52.416001      58.527999   34.944002
9       2048.0   96.992001   92.448004     110.816002   59.103999
10      4096.0  175.712004  170.432001     212.927997  110.048003
deepseek-ai/DeepSeek-V3 N=7168 K=9216: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   57.119999   60.479999      86.336002   39.360002
1          8.0   56.607999   59.360001      64.912006   39.296001
2         16.0   55.615999   59.103999      64.928003   38.272001
3         32.0   56.127999   59.424002      65.056004   37.567999
4         64.0   56.031998   59.328001      65.407999   37.376001
5        128.0   56.000002   59.551999      65.471999   38.911998
6        256.0   57.296000   61.376002      75.808004   45.120001
7        512.0  105.375998  108.672000     137.247995   77.983998
8       1024.0  201.407999  199.456006     265.311986  118.079998
9       2048.0  346.623987  337.632000     460.000008  229.951993
10      4096.0  686.671972  660.272002     911.311984  441.471994
deepseek-ai/DeepSeek-V3 N=7168 K=8192: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   52.800000   54.719999      77.791996   36.832001
1          8.0   50.880000   54.079998      59.296001   36.031999
2         16.0   51.456001   54.143999      58.720000   35.360001
3         32.0   50.655998   54.079998      58.880001   34.432001
4         64.0   51.104002   54.528002      59.487998   34.688000
5        128.0   51.008001   54.591998      59.904002   35.680000
6        256.0   51.167998   54.848000      67.872003   41.375998
7        512.0   94.783999   98.687999     128.160000   71.488000
8       1024.0  180.831999  179.183990     246.879995  109.728001
9       2048.0  310.112000  300.464004     427.807987  207.359999
10      4096.0  613.344014  584.223986     846.463978  398.048013
deepseek-ai/DeepSeek-V3 N=7168 K=1024: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton   deepgemm
0          1.0   16.384000   19.520000      24.672000  23.135999
1          8.0   16.480001   19.743999      24.512000  25.936000
2         16.0   15.776001   19.007999      25.056001  23.040000
3         32.0   15.872000   19.104000      24.192000  23.584001
4         64.0   15.872000   19.104000      25.664000  24.000000
5        128.0   15.776001   19.007999      24.496000  25.024001
6        256.0   15.584000   18.880000      25.280001  26.176000
7        512.0   21.824000   24.607999      28.352000  31.008000
8       1024.0   34.655999   36.575999      51.711999  27.712001
9       2048.0   55.167999   54.143999      70.175998  44.799998
10      4096.0  100.319996   95.264003     135.775998  74.207999
Skip N=576, K=7168 now
deepseek-ai/DeepSeek-V3 N=24576 K=7168: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton     deepgemm
0          1.0   106.207997   108.351998      99.391997    82.176000
1          8.0   100.383997   103.583999      90.080000    78.272000
2         16.0   102.752000   106.271997      80.831997    75.967997
3         32.0   101.135999   104.192004      79.791993    75.167999
4         64.0   101.439998   104.672000      81.248000    75.839996
5        128.0    96.192002    99.647999     114.416003    78.879997
6        256.0   120.512001   126.496002     165.536001    95.040001
7        512.0   237.056002   230.943993     315.423995   163.167998
8       1024.0   463.519990   439.999998     618.463993   300.240010
9       2048.0   918.335974   860.480011    1229.840040   592.480004
10      4096.0  1787.392020  1565.888047    2399.712086  1168.352008
deepseek-ai/DeepSeek-V3 N=32768 K=512: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   19.360000   22.816001      24.256000   19.040000
1          8.0   19.040000   22.720000      24.224000   19.392001
2         16.0   19.200001   22.911999      24.095999   20.512000
3         32.0   19.360000   23.040000      24.351999   19.680001
4         64.0   19.552000   23.040000      24.831999   20.671999
5        128.0   19.680001   23.264000      26.432000   23.776000
6        256.0   25.664000   28.736001      33.087999   24.383999
7        512.0   41.503999   42.879999      56.127999   40.192001
8       1024.0   74.304000   72.480001      96.928000   60.288001
9       2048.0  137.344003  129.040003     181.088001  111.584000
10      4096.0  276.735991  241.183996     343.136013  198.175997
deepseek-ai/DeepSeek-V3 N=7168 K=16384: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0    94.144002    93.759999     130.528003   63.648000
1          8.0    90.496004    93.312003      78.432001   61.152000
2         16.0    90.496004    93.280002      78.288004   59.712000
3         32.0    90.879999    93.408003      78.432001   56.607999
4         64.0    91.072001    93.503997      78.336000   56.511998
5        128.0    90.080000    93.440004      96.256003   57.760000
6        256.0    89.695998    93.567997     125.152007   69.920003
7        512.0   178.335994   181.823999     243.584007  134.368002
8       1024.0   347.647995   343.488008     560.383976  199.839994
9       2048.0   602.735996   587.487996     831.055999  391.647995
10      4096.0  1197.247982  1158.207893    1545.248032  716.624022
deepseek-ai/DeepSeek-V3 N=7168 K=18432: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0   103.519998   103.423998     135.135993   68.928003
1          8.0   100.575998   103.391998      86.208001   67.808002
2         16.0   100.479998   103.391998      89.583993   66.271998
3         32.0   100.511998   103.136003      89.535996   62.624000
4         64.0   100.703999   103.359997      87.935999   60.736001
5        128.0   101.120003   102.848001     108.832002   61.856002
6        256.0   100.000001   103.712000     133.824006   77.791996
7        512.0   199.744001   200.864002     263.424009  144.768000
8       1024.0   389.887989   383.359998     614.447951  223.808005
9       2048.0   675.967991   658.976018     902.239978  438.688010
10      4096.0  1342.720032  1306.671977    1787.775993  856.288016
deepseek-ai/DeepSeek-V3 N=9216 K=7168: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   47.488000   50.912000      80.959998   38.495999
1          8.0   47.136001   50.464001      56.992002   37.408002
2         16.0   47.456000   51.295999      56.896001   37.471998
3         32.0   46.464000   50.271999      56.928001   37.344001
4         64.0   46.879999   50.560001      57.503998   37.535999
5        128.0   46.367999   49.952000      62.752001   39.071999
6        256.0   83.424002   88.032000      99.104002   45.791999
7        512.0  122.175999  122.592002     160.512000   79.456002
8       1024.0  198.111996  193.087995     262.208015  150.976002
9       2048.0  349.808007  335.391998     464.895993  273.472011
10      4096.0  692.767978  654.415965     920.383990  428.591996
deepseek-ai/DeepSeek-V3 N=6144 K=7168: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   47.008000   49.263999      65.536000   38.176000
1          8.0   45.952000   49.408000      52.544001   36.063999
2         16.0   46.080001   49.568001      52.671999   36.607999
3         32.0   45.984000   49.568001      52.576002   35.583999
4         64.0   45.887999   49.408000      52.832000   33.408001
5        128.0   45.887999   49.600001      53.264000   38.240001
6        256.0   46.176001   49.759999      59.039999   38.575999
7        512.0   84.959999   88.128000     110.335998   56.320000
8       1024.0  123.103999  124.448001     161.216006   83.200000
9       2048.0  235.072002  229.568005     312.864006  158.848003
10      4096.0  464.464009  443.520010     617.151976  295.455992
deepseek-ai/DeepSeek-V3 N=8192 K=512: 
fp8 blockwise scaled matmul:
    batch_size       vllm  sgl-kernel  sglang triton   deepgemm
0          1.0  13.248000   16.640000      20.959999  26.208000
1          8.0  13.280000   16.704001      21.632001  26.032001
2         16.0  13.280000   16.640000      19.776000  25.024001
3         32.0  13.120000   16.608000      20.736000  23.840001
4         64.0  13.184000   16.608000      20.288000  26.176000
5        128.0  13.184000   16.704001      19.936001  28.608000
6        256.0  13.120000   16.480001      20.800000  28.960001
7        512.0  16.352000   19.200001      21.536000  27.008001
8       1024.0  24.672000   27.200000      30.848000  29.759999
9       2048.0  41.503999   42.560000      53.215999  40.832002
10      4096.0  74.816003   71.295999      96.783996  61.184000
deepseek-ai/DeepSeek-V3 N=6144 K=1536: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton   deepgemm
0          1.0   18.495999   21.792000      27.071999  30.015999
1          8.0   18.144000   21.632001      23.903999  28.031999
2         16.0   18.080000   21.600001      23.264000  27.488001
3         32.0   18.176001   21.695999      24.016000  27.264001
4         64.0   18.112000   21.663999      23.776000  29.568000
5        128.0   18.304000   21.856001      23.744000  27.872000
6        256.0   18.208001   21.695999      23.104001  26.496001
7        512.0   27.264001   30.176001      32.416001  28.896000
8       1024.0   36.063999   38.431998      45.024000  33.199999
9       2048.0   63.744001   62.495999      81.504002  49.632002
10      4096.0  119.199999  111.071996     152.799994  90.623997
deepseek-ai/DeepSeek-V3 N=1024 K=7168: 
fp8 blockwise scaled matmul:
    batch_size       vllm  sgl-kernel  sglang triton   deepgemm
0          1.0  25.567999   38.624000      45.919999  25.152000
1          8.0  25.312001   38.336001      43.871999  24.863999
2         16.0  25.504000   38.975999      43.839999  25.664000
3         32.0  25.248000   38.176000      43.903999  24.256000
4         64.0  25.184000   38.495999      44.160001  25.087999
5        128.0  25.216000   38.400002      44.447999  25.728000
6        256.0  25.952000   38.688000      44.319998  23.488000
7        512.0  31.328000   36.832001      45.056000  27.807999
8       1024.0  39.967999   42.784002      46.335999  33.504002
9       2048.0  54.752000   51.711999      57.376001  40.608000
10      4096.0  96.864000   92.607997     111.584000  59.200000
deepseek-ai/DeepSeek-V3 N=7168 K=4608: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   34.495998   37.376001      48.032001   30.239999
1          8.0   33.631999   37.312001      38.592000   28.192000
2         16.0   33.920001   37.503999      38.527999   28.640000
3         32.0   33.472002   37.216000      38.688000   28.543999
4         64.0   33.888001   37.471998      39.168000   32.224000
5        128.0   33.599999   37.439998      39.391998   32.063998
6        256.0   34.047998   37.792001      43.104000   33.183999
7        512.0   57.280000   60.031999      73.856004   44.992000
8       1024.0  107.712001  106.720001     140.543997   67.936003
9       2048.0  183.167994  174.592003     241.888002  121.728003
10      4096.0  356.736004  333.248019     472.575992  216.447994
deepseek-ai/DeepSeek-V3 N=7168 K=4096: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   32.000002   35.328001      43.712001   34.591999
1          8.0   31.136001   34.688000      34.559999   33.792000
2         16.0   30.784000   34.527998      34.591999   35.007998
3         32.0   31.104000   34.559999      34.784000   29.216001
4         64.0   30.912001   34.623999      34.784000   34.944002
5        128.0   30.848000   34.688000      35.103999   32.159999
6        256.0   31.168001   34.784000      39.328001   38.431998
7        512.0   52.928001   55.071998      69.311999   43.136001
8       1024.0   97.792000   97.152002     132.192001   62.176000
9       2048.0  164.928004  157.856002     226.879999  110.495999
10      4096.0  320.479989  300.080001     442.400008  196.160004
deepseek-ai/DeepSeek-V3 N=7168 K=512: 
fp8 blockwise scaled matmul:
    batch_size       vllm  sgl-kernel  sglang triton   deepgemm
0          1.0  13.216000   16.448000      24.512000  31.808000
1          8.0  13.120000   16.543999      25.664000  30.112000
2         16.0  13.248000   16.576000      24.544001  31.711999
3         32.0  13.024000   16.416000      24.863999  31.744000
4         64.0  13.024000   16.448000      24.480000  32.800000
5        128.0  13.120000   16.511999      25.152000  32.352000
6        256.0  13.024000   16.352000      25.087999  33.696000
7        512.0  16.352000   19.136000      24.896000  31.199999
8       1024.0  24.672000   27.136000      30.272000  29.088000
9       2048.0  37.376001   38.816001      47.359999  34.047998
10      4096.0  66.560000   63.904002      86.144000  52.960001
Skip N=576, K=7168 now
deepseek-ai/DeepSeek-V3 N=24576 K=7168: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton     deepgemm
0          1.0   106.335998   108.736001      99.776000    82.368001
1          8.0   100.960001   104.111999      90.144001    78.272000
2         16.0   103.168003   106.527999      80.640003    75.839996
3         32.0   101.503998   104.479998      80.063999    75.199999
4         64.0   101.856001   106.463999      81.855997    76.063998
5        128.0    98.463997   101.728000     117.728002    79.424001
6        256.0   124.799997   126.432002     165.887997    95.296003
7        512.0   237.184003   231.135994     315.391988   163.167998
8       1024.0   463.440001   440.384001     618.560016   321.343988
9       2048.0   918.447971   859.824002    1226.112008   566.303968
10      4096.0  1787.375927  1570.816040    2396.192074  1192.224026
deepseek-ai/DeepSeek-V3 N=32768 K=512: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   19.328000   22.720000      29.056000   28.543999
1          8.0   19.040000   22.720000      28.960001   23.264000
2         16.0   19.200001   22.911999      29.440001   30.528000
3         32.0   19.328000   23.008000      30.944001   23.456000
4         64.0   19.487999   23.008000      29.216001   26.335999
5        128.0   19.648001   23.264000      29.552000   33.408001
6        256.0   25.664000   28.736001      35.615999   29.120000
7        512.0   41.423999   42.847998      55.840001   40.128000
8       1024.0   73.632002   71.456000     101.311997   60.864002
9       2048.0  138.671994  136.224002     181.439996  112.063996
10      4096.0  275.119990  235.936001     340.223998  196.480006
deepseek-ai/DeepSeek-V3 N=7168 K=16384: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0    94.048001    93.984000     129.951999   63.648000
1          8.0    90.527996    93.216002      78.592002   61.152000
2         16.0    90.655997    93.024001      78.240000   59.519999
3         32.0    90.591997    93.248002      78.207999   56.671999
4         64.0    91.008000    93.503997      78.368001   56.511998
5        128.0    90.496004    93.344003      96.447997   57.792000
6        256.0    90.304002    93.791999     123.839997   70.096001
7        512.0   178.880006   181.375995     243.711993  132.640004
8       1024.0   348.383993   343.679994     561.007977  200.736001
9       2048.0   603.232026   588.544011     830.272019  378.080010
10      4096.0  1197.280049  1158.112049    1658.975959  759.679973
deepseek-ai/DeepSeek-V3 N=7168 K=18432: 
fp8 blockwise scaled matmul:
    batch_size         vllm   sgl-kernel  sglang triton    deepgemm
0          1.0   103.391998   103.136003     135.232002   69.055997
1          8.0   100.415997   103.040002      86.080000   67.840002
2         16.0   100.415997   103.104003      89.472003   66.303998
3         32.0   100.543998   102.144003      89.120001   62.592000
4         64.0   100.639999   103.072003      87.360002   60.608000
5        128.0   100.607999   103.391998     109.407999   62.144000
6        256.0   104.255997   108.415999     138.720006   77.776000
7        512.0   199.552000   202.304006     262.847990  146.847993
8       1024.0   389.759988   382.991999     615.856051  213.760003
9       2048.0   675.743997   660.511971     897.696018  425.024003
10      4096.0  1342.479944  1306.848049    1687.008023  856.800020
deepseek-ai/DeepSeek-V3 N=4608 K=7168: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   46.592001   48.703998      58.063999   33.472002
1          8.0   45.632001   48.992001      38.943999   35.664000
2         16.0   45.696001   48.992001      38.832001   36.448002
3         32.0   45.919999   48.703998      38.768001   35.135999
4         64.0   45.696001   49.343999      39.423998   31.872001
5        128.0   46.208002   49.279999      46.847999   36.448002
6        256.0   46.080001   49.536001      57.952002   38.896002
7        512.0   84.192000   87.119997      98.352000   44.319998
8       1024.0  122.063994  122.272000     164.928004   79.264000
9       2048.0  198.287994  193.087995     266.400009  150.720000
10      4096.0  350.224018  337.568015     466.143996  270.687997
deepseek-ai/DeepSeek-V3 N=3072 K=7168: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   45.600001   48.576001      48.432000   31.840000
1          8.0   45.600001   48.928000      39.391998   36.224000
2         16.0   45.536000   48.640002      37.344001   34.944002
3         32.0   45.488000   48.672002      37.824001   32.448001
4         64.0   45.632001   48.992001      38.800001   33.151999
5        128.0   45.696001   48.480000      41.616000   33.856001
6        256.0   45.823999   49.152002      45.823999   36.192000
7        512.0   46.112001   48.864000      57.696000   40.128000
8       1024.0   84.480003   87.647997     111.904003   53.695999
9       2048.0  122.911997  124.159999     161.216006   84.624000
10      4096.0  236.335993  239.695996     316.224009  159.263998
deepseek-ai/DeepSeek-V3 N=4096 K=512: 
fp8 blockwise scaled matmul:
    batch_size       vllm  sgl-kernel  sglang triton   deepgemm
0          1.0  12.352000   15.616000      30.656001  26.912000
1          8.0  12.320000   15.552000      30.080000  29.711999
2         16.0  12.352000   15.584000      29.856000  25.567999
3         32.0  12.352000   15.648000      30.528000  29.632000
4         64.0  12.352000   15.552000      29.376000  31.488001
5        128.0  12.448000   15.744001      29.408000  31.199999
6        256.0  12.480000   15.744001      33.280000  30.975999
7        512.0  12.256000   15.440000      30.592000  28.640000
8       1024.0  16.160000   19.040000      29.120000  31.808000
9       2048.0  24.704000   27.104000      35.039999  31.008000
10      4096.0  41.792002   43.264002      51.743999  37.248001
deepseek-ai/DeepSeek-V3 N=3072 K=1536: 
fp8 blockwise scaled matmul:
    batch_size       vllm  sgl-kernel  sglang triton   deepgemm
0          1.0  17.568000   20.832000      25.823999  30.688001
1          8.0  17.568000   20.896001      24.623999  29.136000
2         16.0  17.600000   20.992000      25.696000  27.327999
3         32.0  17.600000   20.992000      25.184000  30.048000
4         64.0  17.632000   21.024000      23.808001  32.256000
5        128.0  17.728001   21.120001      24.383999  32.671999
6        256.0  17.759999   21.024000      25.536001  30.975999
7        512.0  17.503999   20.864001      24.320001  31.936001
8       1024.0  26.880000   29.632000      32.384001  31.583998
9       2048.0  36.256000   38.368002      44.672001  34.623999
10      4096.0  63.904002   65.311998      80.640003  48.608001
deepseek-ai/DeepSeek-V3 N=512 K=7168: 
fp8 blockwise scaled matmul:
    batch_size       vllm  sgl-kernel  sglang triton   deepgemm
0          1.0  24.496000   37.344001      45.024000  31.072000
1          8.0  24.607999   37.248001      43.488000  30.495999
2         16.0  24.351999   37.280001      43.327998  30.784000
3         32.0  24.480000   37.087999      43.423999  29.472001
4         64.0  24.416000   37.439998      43.488000  26.240001
5        128.0  24.576001   37.248001      44.032000  27.200000
6        256.0  24.768000   37.696000      43.712001  26.720000
7        512.0  25.920000   38.368002      44.000000  29.552000
8       1024.0  31.328000   37.087999      45.248002  36.192000
9       2048.0  39.616000   42.592000      47.904000  35.776000
10      4096.0  56.000002   51.936001      59.744000  40.911999
deepseek-ai/DeepSeek-V3 N=7168 K=2304: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   22.752000   26.240001      33.344001   32.607999
1          8.0   22.048000   25.599999      32.543998   32.175999
2         16.0   22.175999   25.760001      32.575998   31.296000
3         32.0   22.431999   25.920000      31.136001   31.264000
4         64.0   22.112001   25.696000      30.816000   34.015998
5        128.0   22.208000   25.888000      31.552002   28.416000
6        256.0   22.336001   25.984000      37.440002   30.624000
7        512.0   34.880001   37.696000      43.391999   39.296001
8       1024.0   59.808001   61.071999      95.264003   42.879999
9       2048.0   97.888000   98.880000     133.487999   77.119999
10      4096.0  194.271997  173.984006     242.944002  140.064001
deepseek-ai/DeepSeek-V3 N=7168 K=2048: 
fp8 blockwise scaled matmul:
    batch_size        vllm  sgl-kernel  sglang triton    deepgemm
0          1.0   21.407999   24.896000      31.520002   34.047998
1          8.0   20.864001   24.512000      30.560000   33.599999
2         16.0   20.927999   24.512000      31.663999   34.272000
3         32.0   20.832000   24.480000      31.264000   33.760000
4         64.0   20.992000   24.416000      29.503999   32.896001
5        128.0   20.800000   24.320001      36.256000   34.272000
6        256.0   20.864001   24.480000      31.360000   35.808001
7        512.0   32.384001   35.424002      39.584000   39.551999
8       1024.0   55.583999   55.936001      83.871998   38.656000
9       2048.0   92.096001   86.879998     118.752003   69.343999
10      4096.0  177.312002  167.359993     233.664006  114.656001
deepseek-ai/DeepSeek-V3 N=7168 K=256: 
fp8 blockwise scaled matmul:
    batch_size       vllm  sgl-kernel  sglang triton   deepgemm
0          1.0  11.168000   14.496000      26.400000  31.168001
1          8.0  11.168000   14.496000      26.016001  29.968001
2         16.0  11.168000   14.496000      25.696000  29.600000
3         32.0  11.168000   14.464000      26.272001  30.944001
4         64.0  11.040000   14.368000      24.000000  26.912000
5        128.0  11.040000   14.368000      29.888000  33.535998
6        256.0  11.168000   14.464000      25.408000  29.408000
7        512.0  13.824000   16.704001      25.312001  32.768000
8       1024.0  19.264000   22.208000      24.192000  23.520000
9       2048.0  28.128000   30.304000      36.160000  29.088000
10      4096.0  48.640002   49.247999      64.000003  45.632001
Benchmark finished!
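
Each entry above is a measured kernel time for the given batch size and weight shape (smaller is better). The actual harness is benchmark/bench_fp8_blockwise_gemm.py; a generic CUDA-event timing loop of the kind such benchmarks use looks roughly like this (a sketch, not the script's code):

import torch

def bench_gemm(fn, warmup=10, iters=100):
    # Average the latency of a CUDA callable using events.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call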

zhyncs (Member) commented Apr 9, 2025

Follow-ups

  • padding logic
  • tuning performance
  • make it compatible with DeepEP

zhyncs merged commit ebf495f into main on Apr 9, 2025 (9 checks passed)
zhyncs deleted the update-fp8-blockwise-kernel branch on April 9, 2025 at 18:47
cnwenf pushed a commit to cnwenf/sglang that referenced this pull request Apr 10, 2025
* main: (29 commits)
  reduce moe_align_block_size_kernel small batch mode overhead (sgl-project#5086)
  Fix DeepSeek error when using DeepEP mode (sgl-project#5190)
  [metrics] Add in queue metrics (sgl-project#4444)
  fix: log warning when disable cuda graph (sgl-project#5209)
  Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5196)
  sgl-kernel use cutlass latest version for fp8 blockwise gemm (sgl-project#5207)
  update grok test (sgl-project#5171)
  model: support mllama4 (sgl-project#5144)
  [ci] fix ci test fused_moe op (sgl-project#5102)
  Support Llama4 fp8 inference (sgl-project#5194)
  Optimize topk operation in llama4 (sgl-project#5128)
  Fix ci test "test_eval_fp8_accuracy" failed (sgl-project#5185)
  [Misc] clean up vllm in sgl-kernel test (sgl-project#5189)
  Let `bench_one_batch` support `enable_dp_attention` (sgl-project#4058)
  [DeepEP] fix: import buffer error (sgl-project#5179)
  fix: use DeepEPDispatcher on CUDA (sgl-project#5180)
  feat: add DeepGEMM build warning (sgl-project#5176)
  docs: remove the use of Downward API for LWS_WORKER_INDEX (sgl-project#5110)
  [Fix] DeepEP Compatibility with Low Latency (sgl-project#5068)
  [Bugfix] Fix index out of bounds in local attention with large sequences (sgl-project#5173)
  ...

# Conflicts:
#	python/sglang/srt/disaggregation/mini_lb.py
#	python/sglang/srt/managers/scheduler.py
finger92 pushed a commit to protagolabs/sglang that referenced this pull request Apr 10, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
* Support with_stack and record_shapes in profiler (sgl-project#4740)

Co-authored-by: Lianmin Zheng <[email protected]>

* test: reduce `mem_fraction_static` for gemma3 vision test (sgl-project#4840)

* Fix CI tests (sgl-project#4853)

* Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (sgl-project#4855)

* Revert "get the python version from env (sgl-project#4729)" (sgl-project#4863)

* [Feature] add multi-rank support for Lora (sgl-project#4492)

Co-authored-by: rudy152 <[email protected]>

* Clean up `import vllm` in quantization/__init__.py (sgl-project#4834)

* Fix wrong variable name when stopping memory profile (sgl-project#4772)

* [Feat] support deepgemm for cmake (sgl-project#4864)

* Make torch compile configurable for biased_grouped_topk (sgl-project#4749)

* update sgl-kernel test ci (sgl-project#4866)

* fix sampling issue (sgl-project#4871)

* bump sgl-kernel 0.0.5.post4 (sgl-project#4768)

* fix sgl-kernel cu118 build (sgl-project#4872)

* [Feature] Support FA3 backend for MLA (sgl-project#4831)

* upgrade sgl-kernel 0.0.5.post4 (sgl-project#4873)

* update torch compile doc (sgl-project#4874)

* bump v0.4.4.post3 (sgl-project#4878)

* Fix BadRequestError wrong arguments and remove openai dependency (sgl-project#4882)

* Improve stack trace of retry errors (sgl-project#4845)

* Tiny fix doc error (sgl-project#4795)

* [Docs] Update DeepGEMM at README.md (sgl-project#4886)

* Update CODEOWNERS (sgl-project#4889)

* Delete test_deep_gemm.py (sgl-project#4891)

* Add deepseek style fused moe group gate selection kernel (sgl-project#4530)

* quick fix: add default for new kernel (sgl-project#4898)

* remove setup for sgl-kernel (sgl-project#4899)

* [Misc] Clean m.def and add Development Tips (sgl-project#4890)

* fix allreduce test (sgl-project#4909)

* Support page size > 1 + eagle (sgl-project#4908)

* Fix retract for page size > 1 (sgl-project#4914)

* [Feature] use pytest for sgl-kernel (sgl-project#4896)

* fix bmm fp8 (sgl-project#4926)

* Fix the timeout for unit-test-2-gpu in pr-test.yml (sgl-project#4927)

* Fix 2-gpu CI test and suppress some warnings (sgl-project#4930)

* [feat] add fa3 in sgl-kernel (sgl-project#4902)

Co-authored-by: Sleepcoo <[email protected]>

* Fix sglang frontend's incorrect dependency on torch (sgl-project#4931)

* [Fix] avoid stream sync and torch compile in prefill for fa3 backend (sgl-project#4932)

* cleanup sgl-kernel (sgl-project#4933)

* [Fix] Improve Lora tests and reduce CI runtime (sgl-project#4925)

* Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (sgl-project#4883)

Co-authored-by: ch-wan <[email protected]>

* [Fix] Add torch compile for torch.clamp back (sgl-project#4936)

* Fix oom error for large page size (sgl-project#4913)

Co-authored-by: Lianmin Zheng <[email protected]>

* [feat] interface for platforms abstraction (sgl-project#4928)

* [Fix] revert clean m.def for cudagraph (sgl-project#4944)

* refactor: multimodal data (sgl-project#4754)

* bump sgl-kernel v0.0.6 (sgl-project#4950)

* [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (sgl-project#4953)

* use fa3 in sgl-kernel (sgl-project#4954)

* Revert PR 4764 & 4813 related to R1 RoPE (sgl-project#4959)

* [Feature] Support DeepEP Low Latency (sgl-project#4767)

Co-authored-by: sleepcoo <[email protected]>
Co-authored-by: laixinn <[email protected]>
Co-authored-by: ch-wan <[email protected]>

* update bench_serving (sgl-project#4958)

* Prevent memory leak of retract_decode when page_size > 1 (sgl-project#4977)

* [VLM RLHF] Take Image input for verl vlm rollout (sgl-project#4915)

Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: GeLee <[email protected]>

* Large page size aligned hierarchical caching (sgl-project#4581)

* bug fix for hicache host eviction (sgl-project#4989)

* sgl scaled_fp8_quant support output padding (sgl-project#4861)

* Add Eagle Speculative Decoding to FA3 Backend (sgl-project#4951)

Co-authored-by: hebiao064 <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: zcnrex <[email protected]>

* Update tokenizer_manager.py (sgl-project#5008)

* [sgl-kernel] per token group quant support COLUMN MAJOR (sgl-project#4817)

* update cutlass tag (sgl-project#5011)

* Feature/revise docs ci (sgl-project#5009)

* fix: fix illegal cuda memory access at fused_moe_kernel (sgl-project#4727)

Co-authored-by: yuethe <[email protected]>

* [Build] Support build sgl-kernel with ccache (sgl-project#5020)

* fix deepgemm as well (sgl-project#5030)

* try to fix ci oserror (sgl-project#5024)

* Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5005)

* Small refactor DeepEPMode to clean up code a bit (sgl-project#4992)

* [Fix] fix fa3 build at cu118 (sgl-project#5036)

* Revert "Replace enable_flashinfer_mla argument with attention_backend" (sgl-project#5048)

* bump sgl-kernel v0.0.7 (sgl-project#5046)

* update eagle-3 docs (sgl-project#4796)

Co-authored-by: Yifan Zhang <[email protected]>

* Add LlavaLlamaForCausaLM in MultiModal Processors (sgl-project#5039)

Co-authored-by: Ravi Theja Desetty <[email protected]>

* Update the retry count (sgl-project#5051)

* upgrade sgl-kernel v0.0.7 (sgl-project#5049)

* [2/3] fix dsv3 awq issue  (sgl-project#4625)

Co-authored-by: 晟海 <[email protected]>
Co-authored-by: laixinn <[email protected]>

* Feature/revise docs ci (sgl-project#5056)

* Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5057)

* [fix] remove `cuda_device_count_stateless` (sgl-project#5060)

* Small refactor DeepEPDispatcher into subclasses (sgl-project#4994)

* Support async DeepEP by splitting into two stages (sgl-project#4995)

* Cleanup unused resources after DeepEP operation (sgl-project#4996)

* Add DeepSeek V3/R1 shared experts fusion (sgl-project#4918)

* [deepep] fix: shared experts are not initialized when shared experts fusion is enabled (sgl-project#5072)

* fix dummy-load deepseekv2 (sgl-project#4535)

* support sgl-kernel on blackwell (sgl-project#5074)

* FA3 Spec Decoding to support top k = 1 and add cuda graph support (sgl-project#5050)

Co-authored-by: Qingquan Song <[email protected]>
Co-authored-by: Chunan Zeng <[email protected]>

* [Revision] Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5052)

* upgrade transformers 4.51.0 (sgl-project#5088)

* sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5079)

* bump sgl-kernel 0.0.8 (sgl-project#5089)

* python transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5080)

* bump v0.4.4.post4 (sgl-project#5091)

* Fix: Reduce the number of document ci attempts to avoid long ci running (sgl-project#5097)

Co-authored-by: shuaills <[email protected]>

* Add Llama4 support (sgl-project#5092)

Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: ispobock <[email protected]>

* Fix refactor error - fp8.py (sgl-project#5106)

Co-authored-by: Lianmin Zheng <[email protected]>

* bump v0.4.5 (sgl-project#5117)

* [ci] fix llama4 ci error (sgl-project#5126)

* Refactor and Optimize FA3 Code (sgl-project#5090)

Co-authored-by: Qingquan Song <[email protected]>

* Add Llama4 user guide (sgl-project#5133)

Co-authored-by: Cheng Wan <[email protected]>

* [Misc] Use pytest.mark.skipif in sgl-kernel test (sgl-project#5137)

* feat: disable grammar restrictions within reasoning sections (sgl-project#4984)

Co-authored-by: tianhaoyu <[email protected]>
Co-authored-by: DarkSharpness <[email protected]>

* [modelopt] automatically inspect if model is ModelOpt quantized and set quantization method (sgl-project#5145)

* [AMD] Fix missing per_token_group_quant_fp8 for ROCm (sgl-project#5140)

* fix multimodal hash feature (sgl-project#5083)

* Fix run time error in ROCm platform (sgl-project#5147)

Co-authored-by: wunhuang <[email protected]>
Co-authored-by: root <[email protected]>

* [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (sgl-project#5103)

* Add unit test on page_size > 1 and mla and  integration test for Flash Attention 3 (sgl-project#4760)

* Use public model for FA3 speculative decode testing (sgl-project#5152)

* Add dummy grok test to amd CI. (sgl-project#5115)

* fix empty_cache error in pt_weights_iterator (sgl-project#5151)

Co-authored-by: dangkai.dk <[email protected]>

* Fix torch compile errors (sgl-project#5158)

* Fix loading KV quantization scale; Enable modelopt kv cache (sgl-project#4686)

Co-authored-by: qingquansong <[email protected]>

* [PD] Fix unclosed prefill connection warning of mini_lb (sgl-project#5155)

Signed-off-by: Shangming Cai <[email protected]>

* Add optimized native kernels in sgl-kernel (sgl-project#5150)

Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: blzheng <[email protected]>

* [PD] Simplify mini LB (sgl-project#4911)

Co-authored-by: Liangsheng Yin <[email protected]>

* Small improvement of native api docs (sgl-project#5139)

Co-authored-by: zhaochenyang20 <[email protected]>

* [feat&refactor] Enhance multimodal input support with refactor io_struct (sgl-project#4938)

Signed-off-by: Xinyuan Tong <[email protected]>

* Support 2x8xH100 for Llama 4 (sgl-project#5159)

* FP4 weight loading and inference (2/2) (sgl-project#3972)

* Fix multimodal hashing error (sgl-project#5174)

* Tiny disable model that does not work (sgl-project#5175)

* [Bugfix] Fix index out of bounds in local attention with large sequences (sgl-project#5173)

* [Fix] DeepEP Compatibility with Low Latency (sgl-project#5068)

Co-authored-by: ch-wan <[email protected]>

* docs: remove the use of Downward API for LWS_WORKER_INDEX (sgl-project#5110)

Signed-off-by: Kay Yan <[email protected]>

* feat: add DeepGEMM build warning (sgl-project#5176)

Co-authored-by: grimoire <[email protected]>

* fix: use DeepEPDispatcher on CUDA (sgl-project#5180)

* [DeepEP] fix: import buffer error (sgl-project#5179)

* Let `bench_one_batch` support `enable_dp_attention` (sgl-project#4058)

* [Misc] clean up vllm in sgl-kernel test (sgl-project#5189)

* Fix ci test "test_eval_fp8_accuracy" failed (sgl-project#5185)

Co-authored-by: wunhuang <[email protected]>

* Optimize topk operation in llama4 (sgl-project#5128)

* Support Llama4 fp8 inference (sgl-project#5194)

Co-authored-by: laixinn <[email protected]>
Co-authored-by: sleepcoo <[email protected]>
Co-authored-by: zhyncs <[email protected]>

* [ci] fix ci test fused_moe op (sgl-project#5102)

* model: support mllama4 (sgl-project#5144)

* update grok test (sgl-project#5171)

* sgl-kernel use cutlass latest version for fp8 blockwise gemm (sgl-project#5207)

* Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5196)

* fix: log warning when disable cuda graph (sgl-project#5209)

* [metrics] Add in queue metrics (sgl-project#4444)

* Fix DeepSeek error when using DeepEP mode (sgl-project#5190)

* reduce moe_align_block_size_kernel small batch mode overhead (sgl-project#5086)

* [PD] Support KV transfer with mooncake (sgl-project#4880)

Signed-off-by: Shangming Cai <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>
Co-authored-by: Xuchun Shang <[email protected]>
Co-authored-by: shangmingc <[email protected]>

* [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (sgl-project#5204)

* Update deps for mllama4 (sgl-project#5215)

* Fix deepseek-v3 with torch.compile in PyTorch 2.6. (sgl-project#5213)

* ROCm sgl-kernel: compatible to later torch (sgl-project#5167)

* [Misc] Clean sgl-kernel test  (sgl-project#5216)

* Update Makefile / build script to avoid installing incompatible torch dependency (sgl-project#5245)

* Fix torch.compile cacheing (sgl-project#5259)

Co-authored-by: zhyncs <[email protected]>

* ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations (sgl-project#5228)

* Optimize attention in llama4 (sgl-project#5127)

* Optimize GPU memory usage in FlashAttentionBackend's strided indexing (sgl-project#5262)

Co-authored-by: ch-wan <[email protected]>

* Support `--enable-llama4-multimodal` (sgl-project#5254)

* [fix] fix mrope positions not picked up (sgl-project#5265)

* doc: nested loop code for offline engine (sgl-project#5244)

* fix: examples for token_in_token_out_vlm (sgl-project#5193)

* Fix a 404 link in send_request.ipynb (sgl-project#5280)

Signed-off-by: windsonsea <[email protected]>

* fix: enable fp4 compilation on cu128 (sgl-project#5286)

* feat: add cu128 identifier for sgl-kernel (sgl-project#5287)

* chore: relax the torch version restriction for sgl-kernel compilation (sgl-project#5288)

* chore: bump sgl-kernel v0.0.8.post1 (sgl-project#5289)

* [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout (sgl-project#5292)

* [Docs] Supported Model Docs - Major restructuring (sgl-project#5290)

Co-authored-by: zhaochenyang20 <[email protected]>

* fix: update update_wheel_index for cu128 (sgl-project#5300)

* [Docs] Remove the older supported docs section (sgl-project#5301)

* remove moe_align_block_size torch.zeros in small batch/expert mode (sgl-project#5298)

* feat: add blackwell Dockerfile (sgl-project#5302)

* feat: add blackwell workflow (sgl-project#5303)

* fix: use fa3 unit test on hopper only (sgl-project#5304)

* misc: update blackwell Dockerfile (sgl-project#5306)

* fix: remove cublas_grouped_gemm (sgl-project#5307)

* fix: update flash attn (sgl-project#5308)

* fix: use deepgemm only on hopper (sgl-project#5310)

* [VLM] Adopt fast image processor by default (sgl-project#5065)

* Adjust ci test threshold (sgl-project#5271)

* Blackwell Cutlass MLA kernel (sgl-project#5142)

* misc: cleanup 3rdparty (sgl-project#5311)

* update variable naming and comments for rocm (sgl-project#5299)

* Fix w8a8_int8 model shared experts fusion load weights error (sgl-project#5120)

* Add flash_attn_varlen_func to sgl-kernel (sgl-project#5315)

* Fix fa3 window size setup (sgl-project#5316)

* chore: bump sgl-kernel v0.0.8.post2 (sgl-project#5317)

* feat: use fa3 mla by default on hopper (sgl-project#5210)

Co-authored-by: yundai424 <[email protected]>
Co-authored-by: hebiao064 <[email protected]>

* Fix: docs/backend/structured_outputs.ipynb (sgl-project#4884)

* Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… (sgl-project#5321)

* refine fused_moe tuning docs (sgl-project#5294)

* Support server based rollout in Verlengine (sgl-project#4848)

Co-authored-by: Jin Pan <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Jinn <[email protected]>

* [Feat] Add sparse attn to sgl-kernel (sgl-project#5327)

* fix: solve cu118 issue for cutlass mla (sgl-project#5331)

* chore: bump sgl-kernel v0.0.8.post3 (sgl-project#5332)

* ci: update release node (sgl-project#5333)

* fix: determine if flashinfer is installed (sgl-project#5336)

* feat: adapt merge_state (sgl-project#5337)

* misc: update sagemaker Dockerfile (sgl-project#5341)

* Fix: Ensure tensors for dist.broadcast match NCCL backend device (sgl-project#5322)

* docs: update adoption and sponsorship list with Oracle (sgl-project#5343)

* chore: upgrade sgl-kernel 0.0.8.post3 (sgl-project#5342)

* Fix typo: infight -> inflight (sgl-project#5357)

* [PD] Add transfer backend abstraction (sgl-project#5328)

* fix MLATokenToKVPoolHost get_size_per_token bug (sgl-project#5161)

Co-authored-by: AniZpZ <[email protected]>

* fix sgl-project#5322 (sgl-project#5359)

* feat: update experiment_runner (sgl-project#5360)

* [DeepEP] Reduce routed scaling overhead (sgl-project#5277)

Co-authored-by: Cheng Wan <[email protected]>

* Free metadata_buffer_index after transfer finished (sgl-project#5364)

* Fix DeepSeek DP Attention + torch compile (sgl-project#5367)

Co-authored-by: ispobock <[email protected]>

* Support for Qwen2.5-VL Model in bitsandbytes Format (sgl-project#5003)

* Fix PD disaggregation bugs (sgl-project#5326)

* [PD Bug] fix MLA get_contiguous_buf_infos error (sgl-project#5384)

* [perf] experimental enhance fp8 per-tensor quant (sgl-project#5370)

* Apply deepseek cuda rope (sgl-project#5385)

Co-authored-by: Yineng Zhang <[email protected]>

* apply fused moe gate in ds v3/r1 (sgl-project#5371)

Co-authored-by: Yineng Zhang <[email protected]>

* fix: update test config (sgl-project#5392)

* [Fix] Turn off DeepGEMM by default (sgl-project#5263)

* minor clean up of sgl-kernel/CMakeLists.txt (sgl-project#5393)

* Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5368)

* Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5291)

Co-authored-by: ximing.wxm <[email protected]>

* [fix/misc] remove duplicate row in deepseek v2 model (sgl-project#5279)

* chore: upgrade DeepGEMM (sgl-project#5395)

* fix: update pr-test-sgl-kernel (sgl-project#5399)

* kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381)

* chore: bump sgl-kernel 0.0.9 (sgl-project#5400)

* chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401)

* Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406)

* Fix bench_serving with random-ids (sgl-project#5214)

* [misc] fix ci flaky case (sgl-project#5352)

* [FIX] Fix concatenation error in capture_bs when --disable-cuda-graph-padding is enabled and MTP is not used (sgl-project#5412)

* Support dynamic connection and TP 16 (sgl-project#5351)

Co-authored-by: luoyuan.luo <[email protected]>

* Fix broadcast using CUDA device leading to unbalanced memory capacity (sgl-project#5416)

* [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415)

Signed-off-by: Shangming Cai <[email protected]>
Co-authored-by: ybyang <[email protected]>

* Distinguish bootstrap key only in decode server (sgl-project#5422)

* [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423)

* [minor] cleanup cmakelists.txt (sgl-project#5420)

* bugfix: fix merge_state_v2 cuda graph (sgl-project#5419)

* chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430)

* fix: solve release issue (sgl-project#5434)

* Blackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431)

* feat: update model_specific_adjustment (sgl-project#5344)

Co-authored-by: hebiao064 <[email protected]>

* chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436)

* Fix ignore_eos parameter when loading a chat template (sgl-project#5264)

* add attention backend support matrix in the doc (sgl-project#5211)

Co-authored-by: Stefan He <[email protected]>

* Support BNB quantization for llama/mllama (sgl-project#5038)

Co-authored-by: Yuhao Yang <[email protected]>

* [Docs] Update start/install.md (sgl-project#5398)

* [Minor] Move torch.compile patch to a better place (sgl-project#5397)

* [Bug fix] need to record start time in PD mode (sgl-project#5425)

* Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113)

* chore: bump v0.4.5.post1 (sgl-project#5445)

* Revert "[SW-226289] rebase sglang to tag v0.4.5 (sgl-project#12)"

This reverts commit 0eac714.

---------

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Juwan Yoo <[email protected]>
Co-authored-by: Qingquan Song <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: chaobo jia <[email protected]>
Co-authored-by: rudy152 <[email protected]>
Co-authored-by: Fr4nk1in <[email protected]>
Co-authored-by: yinfan98 <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Adarsh Shirawalmath <[email protected]>
Co-authored-by: Sleepcoo <[email protected]>
Co-authored-by: SEPLOS <[email protected]>
Co-authored-by: ch-wan <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: laixinn <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: GeLee <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: hebiao064 <[email protected]>
Co-authored-by: zcnrex <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: renxin <[email protected]>
Co-authored-by: saltyfish66 <[email protected]>
Co-authored-by: yuethe <[email protected]>
Co-authored-by: simveit <[email protected]>
Co-authored-by: Yifan Zhang <[email protected]>
Co-authored-by: Ravi Theja <[email protected]>
Co-authored-by: Ravi Theja Desetty <[email protected]>
Co-authored-by: AniZpZ <[email protected]>
Co-authored-by: 晟海 <[email protected]>
Co-authored-by: Tommy Yang <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: tianhaoyu <[email protected]>
Co-authored-by: DarkSharpness <[email protected]>
Co-authored-by: Yun Dai <[email protected]>
Co-authored-by: Hubert Lu <[email protected]>
Co-authored-by: huangtingwei <[email protected]>
Co-authored-by: kk <[email protected]>
Co-authored-by: wunhuang <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: saienduri <[email protected]>
Co-authored-by: DangKai <[email protected]>
Co-authored-by: dangkai.dk <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: Ma Mingfei <[email protected]>
Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: blzheng <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: zhaochenyang20 <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: Kay Yan <[email protected]>
Co-authored-by: grimoire <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: Zhaoyang Hao <[email protected]>
Co-authored-by: Teng Ma <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>
Co-authored-by: Xuchun Shang <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Elfie Guo <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Yusong Gao <[email protected]>
Co-authored-by: Zhaoyi Li <[email protected]>
Co-authored-by: lambert0312 <[email protected]>
Co-authored-by: tianlian yi <[email protected]>
Co-authored-by: Jin Pan <[email protected]>
Co-authored-by: Jinn <[email protected]>
Co-authored-by: yulei <[email protected]>
Co-authored-by: Yongtong Wu <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: Yangcheng Li <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: luoyuan.luo <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: mRSun15 <[email protected]>
Co-authored-by: ryang <[email protected]>
Co-authored-by: Yuhao Yang <[email protected]>