
DeepGemm integrate to sgl-kernel #4165


Merged 30 commits on Mar 10, 2025


laixinn
Contributor

@laixinn laixinn commented Mar 7, 2025

Motivation

Integrate DeepGemm into the sgl-kernel setup.
Usage in linear layers: #4199.

Modifications

Checklist

@zhyncs
Member

zhyncs commented Mar 7, 2025

@HandH1998
Collaborator

Please fix build error https://github.com/sgl-project/sglang/actions/runs/13715681407/job/38360041673?pr=4165

It seems that the setuptools version in CI is too old. We built it successfully with setuptools >= 75.0.0 locally. Updating setuptools in CI may solve this issue.
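One way to surface this kind of mismatch early is to check the installed setuptools version before building. The sketch below is illustrative only: the helper names are hypothetical, the 75.0.0 threshold is taken from the comment above, and the small parser assumes plain dotted numeric versions rather than the full PEP 440 grammar.

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '75.0.0' into (75, 0, 0), padding short versions with zeros."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def setuptools_is_new_enough(minimum: str = "75.0.0") -> bool:
    """Return True if the installed setuptools is at least `minimum`."""
    try:
        installed = metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        return False  # setuptools is not installed at all
    return parse_version(installed) >= parse_version(minimum)
```

A CI job could call `setuptools_is_new_enough()` before the build step and upgrade setuptools when it returns False.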

@shuaills shuaills force-pushed the jit-deep-gemm branch 3 times, most recently from 0db1e43 to 269b6c0 Compare March 8, 2025 04:37
@sleepcoo sleepcoo mentioned this pull request Mar 8, 2025
@yinfan98
Collaborator

yinfan98 commented Mar 8, 2025

Test

We fixed the DeepGEMM build with the JIT module.

  1. Install command:
     cd sgl-kernel
     make build
  2. The test script is copied from deepgemm/tests/test_core.py

DeepGEMM test results:

Library path:
 > ['/usr/local/lib/python3.10/dist-packages/deep_gemm']

Testing GEMM:
 > Performance (m=   64, n= 2112, k= 7168):   10 us | throughput:  193 TFLOPS, 1583 GB/s
 > Performance (m=   64, n=24576, k= 1536):   13 us | throughput:  363 TFLOPS, 3083 GB/s
 > Performance (m=   64, n=32768, k=  512):   10 us | throughput:  226 TFLOPS, 2210 GB/s
 > Performance (m=   64, n= 7168, k=16384):   36 us | throughput:  423 TFLOPS, 3363 GB/s
 > Performance (m=   64, n= 4096, k= 7168):   12 us | throughput:  313 TFLOPS, 2531 GB/s
 > Performance (m=   64, n= 7168, k= 2048):    7 us | throughput:  281 TFLOPS, 2349 GB/s
 > Performance (m=  128, n= 2112, k= 7168):   11 us | throughput:  347 TFLOPS, 1488 GB/s
 > Performance (m=  128, n=24576, k= 1536):   15 us | throughput:  648 TFLOPS, 2964 GB/s
 > Performance (m=  128, n=32768, k=  512):   11 us | throughput:  383 TFLOPS, 2247 GB/s
 > Performance (m=  128, n= 7168, k=16384):   38 us | throughput:  789 TFLOPS, 3184 GB/s
 > Performance (m=  128, n= 4096, k= 7168):   13 us | throughput:  566 TFLOPS, 2361 GB/s
 > Performance (m=  128, n= 7168, k= 2048):    8 us | throughput:  481 TFLOPS, 2147 GB/s
 > Performance (m= 4096, n= 2112, k= 7168):  118 us | throughput: 1053 TFLOPS,  525 GB/s
 > Performance (m= 4096, n=24576, k= 1536):  315 us | throughput:  980 TFLOPS,  778 GB/s
 > Performance (m= 4096, n=32768, k=  512):  231 us | throughput:  595 TFLOPS, 1245 GB/s
 > Performance (m= 4096, n= 7168, k=16384):  691 us | throughput: 1392 TFLOPS,  352 GB/s
 > Performance (m= 4096, n= 4096, k= 7168):  179 us | throughput: 1343 TFLOPS,  515 GB/s
 > Performance (m= 4096, n= 7168, k= 2048):  118 us | throughput: 1016 TFLOPS,  691 GB/s

Testing grouped contiguous GEMM:
 > Performance (num_groups=4, m_per_group=8192, n=4096, k=7168): 1418 us | throughput: 1357 TFLOPS,  438 GB/s
 > Performance (num_groups=4, m_per_group=8192, n=7168, k=2048):  883 us | throughput: 1089 TFLOPS,  674 GB/s
 > Performance (num_groups=8, m_per_group=4096, n=4096, k=7168): 1427 us | throughput: 1348 TFLOPS,  517 GB/s
 > Performance (num_groups=8, m_per_group=4096, n=7168, k=2048):  884 us | throughput: 1089 TFLOPS,  740 GB/s

Testing grouped masked GEMM:
 > Performance (num_groups=1, m_per_group=1024, n=4096, k=7168):   48 us | throughput: 1261 TFLOPS,  945 GB/s
 > Performance (num_groups=1, m_per_group=1024, n=7168, k=2048):   32 us | throughput:  925 TFLOPS,  968 GB/s
 > Performance (num_groups=2, m_per_group= 512, n=4096, k=7168):   49 us | throughput: 1216 TFLOPS, 1505 GB/s
 > Performance (num_groups=2, m_per_group= 512, n=7168, k=2048):   32 us | throughput:  931 TFLOPS, 1429 GB/s
 > Performance (num_groups=4, m_per_group= 256, n=4096, k=7168):   54 us | throughput: 1105 TFLOPS, 2448 GB/s
 > Performance (num_groups=4, m_per_group= 256, n=7168, k=2048):   34 us | throughput:  878 TFLOPS, 2205 GB/s
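For readers who want to sanity-check the plain GEMM rows, the throughput columns can be approximately reproduced from the shape and the printed time. This is a sketch under stated assumptions: a GEMM costs 2*m*n*k floating-point operations, the inputs are FP8 (1 byte per element), the output is BF16 (2 bytes per element), and scaling factors are ignored. The function names are hypothetical. Because the log rounds the time to whole microseconds, the results only roughly match the printed values.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Compute throughput in TFLOPS: one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k / seconds / 1e12


def gemm_gbps(m: int, n: int, k: int, seconds: float) -> float:
    """Memory throughput in GB/s, assuming FP8 inputs and a BF16 output."""
    bytes_moved = m * k + k * n + 2 * m * n  # A (FP8) + B (FP8) + output (BF16)
    return bytes_moved / seconds / 1e9


# First GEMM row of the log: m=64, n=2112, k=7168, about 10 us
print(round(gemm_tflops(64, 2112, 7168, 10e-6)))  # roughly 194 (log reports 193)
print(round(gemm_gbps(64, 2112, 7168, 10e-6)))    # roughly 1587 (log reports 1583)
```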

How we build

We use setup.py to customize the DeepGEMM installation process.

Since DeepGEMM uses JIT compilation, we've integrated it as a third-party library. During the setup.py build process for our wheel package, we first build the AOT sgl-kernel, and then simply copy the DeepGEMM files into the Python package.
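The copy step described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function name, the directory layout, and the `deep_gemm` default are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def copy_jit_sources(src_root: Path, build_lib: Path, package: str = "deep_gemm") -> Path:
    """Copy a JIT package tree into the directory where the wheel is staged.

    src_root  : directory containing the package folder (hypothetical layout)
    build_lib : staging directory for the wheel's Python packages
    """
    src = src_root / package
    dst = build_lib / package
    # symlinks=False copies the files that symlinks point to, so the staged
    # package contains real files rather than links into the source checkout.
    shutil.copytree(src, dst, symlinks=False, dirs_exist_ok=True)
    return dst
```

In a setup.py, a helper like this would typically be called from a custom build command after the AOT extension has been compiled; that wiring is omitted here because it depends on the project's build configuration.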

How to use

import deep_gemm
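Before relying on the import, it can be useful to confirm that the package was actually installed alongside sgl-kernel. The sketch below uses only the standard library to locate a package without importing it; the helper name is hypothetical, and the returned list corresponds to the kind of value shown under "Library path" in the test log.

```python
import importlib.util
from typing import List, Optional


def find_package_paths(name: str = "deep_gemm") -> Optional[List[str]]:
    """Return the package's search paths, or None if it is not installed."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None or spec.submodule_search_locations is None:
        return None  # not found, or a plain module rather than a package
    return list(spec.submodule_search_locations)


paths = find_package_paths()
print("deep_gemm found at:" if paths else "deep_gemm not installed", paths or "")
```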

@yinfan98
Collaborator

yinfan98 commented Mar 8, 2025

cc: @sleepcoo @HandH1998 @laixinn
The PR is ready for review. cc: @zhyncs

@laixinn laixinn marked this pull request as ready for review March 8, 2025 12:30
@laixinn laixinn changed the title DeepGemm gemm_fp8_fp8_bf16_nt in JIT DeepGemm integrate to sgl-kernel Mar 8, 2025
@zhyncs
Member

zhyncs commented Mar 8, 2025

We should also test the make build command and use the wheel directly.

@zhyncs zhyncs self-assigned this Mar 8, 2025
@zhyncs
Member

zhyncs commented Mar 9, 2025

Please help rebase onto the latest main.

@laixinn
Contributor Author

laixinn commented Mar 9, 2025

@zhyncs Symlinks are necessary for the JIT header files. The DeepGemm tests have been copied into the sgl-kernel tests.

@zhyncs
Member

zhyncs commented Mar 10, 2025

Thank you all! The code is functional but messy. I will work on improving it later, but for now I will merge it.

@zhyncs zhyncs merged commit c553e16 into sgl-project:main Mar 10, 2025
3 of 5 checks passed
@inkhare

inkhare commented Mar 10, 2025

Has anyone compared the performance improvement of Deepseek r1 after using DeepGemm?

@sleepcoo
Collaborator

Has anyone compared the performance improvement of Deepseek r1 after using DeepGemm?

Here #4199

aoshen524 pushed a commit to aoshen524/sglang that referenced this pull request Mar 10, 2025
Co-authored-by: sleepcoo
Co-authored-by: HandH1998
Co-authored-by: shuaills
Co-authored-by: yinfan98
Co-authored-by: Yineng Zhang
@lishicheng1996

lishicheng1996 commented Mar 11, 2025

SGL_ENABLE_JIT_DEEPGEMM

@laixinn Hi, I don't see the env variable SGL_ENABLE_JIT_DEEPGEMM in the code. May I ask where to find it?

@laixinn
Contributor Author

laixinn commented Mar 12, 2025

@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.

@CUHKSZzxy

@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.

To use DeepGEMM, do we still need SGL_ENABLE_JIT_DEEPGEMM, or is this enabled by default?

@sleepcoo sleepcoo deleted the jit-deep-gemm branch March 26, 2025 10:21
@tbzhang
Contributor

tbzhang commented Mar 26, 2025

@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.

To use DeepGEMM, do we still need SGL_ENABLE_JIT_DEEPGEMM, or is this enabled by default?

DeepGEMM will be used on the Hopper architecture; you can check out this pull request: #4613
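As a rough illustration of an architecture gate, the sketch below treats a device as Hopper when its CUDA compute capability has major version 9 (for example, sm_90). The function name is hypothetical, and this thread does not say how sglang actually decides; #4613 is the place to check.

```python
from typing import Tuple


def is_hopper(capability: Tuple[int, int]) -> bool:
    """Return True for CUDA compute capability 9.x (Hopper)."""
    major, _minor = capability
    return major == 9


# With PyTorch installed, the tuple could come from
# torch.cuda.get_device_capability(); fixed values are used here
# so the example runs without a GPU.
print(is_hopper((9, 0)))  # Hopper-class capability
print(is_hopper((8, 0)))  # Ampere-class capability
```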


10 participants