DeepGemm integrate to sgl-kernel #4165
It seems that the version of |
0db1e43 to 269b6c0
Test
We fixed the DeepGEMM build with the JIT module.
cd sgl-kernel
make build
DeepGEMM test results below:
Library path:
> ['/usr/local/lib/python3.10/dist-packages/deep_gemm']
Testing GEMM:
> Performance (m= 64, n= 2112, k= 7168): 10 us | throughput: 193 TFLOPS, 1583 GB/s
> Performance (m= 64, n=24576, k= 1536): 13 us | throughput: 363 TFLOPS, 3083 GB/s
> Performance (m= 64, n=32768, k= 512): 10 us | throughput: 226 TFLOPS, 2210 GB/s
> Performance (m= 64, n= 7168, k=16384): 36 us | throughput: 423 TFLOPS, 3363 GB/s
> Performance (m= 64, n= 4096, k= 7168): 12 us | throughput: 313 TFLOPS, 2531 GB/s
> Performance (m= 64, n= 7168, k= 2048): 7 us | throughput: 281 TFLOPS, 2349 GB/s
> Performance (m= 128, n= 2112, k= 7168): 11 us | throughput: 347 TFLOPS, 1488 GB/s
> Performance (m= 128, n=24576, k= 1536): 15 us | throughput: 648 TFLOPS, 2964 GB/s
> Performance (m= 128, n=32768, k= 512): 11 us | throughput: 383 TFLOPS, 2247 GB/s
> Performance (m= 128, n= 7168, k=16384): 38 us | throughput: 789 TFLOPS, 3184 GB/s
> Performance (m= 128, n= 4096, k= 7168): 13 us | throughput: 566 TFLOPS, 2361 GB/s
> Performance (m= 128, n= 7168, k= 2048): 8 us | throughput: 481 TFLOPS, 2147 GB/s
> Performance (m= 4096, n= 2112, k= 7168): 118 us | throughput: 1053 TFLOPS, 525 GB/s
> Performance (m= 4096, n=24576, k= 1536): 315 us | throughput: 980 TFLOPS, 778 GB/s
> Performance (m= 4096, n=32768, k= 512): 231 us | throughput: 595 TFLOPS, 1245 GB/s
> Performance (m= 4096, n= 7168, k=16384): 691 us | throughput: 1392 TFLOPS, 352 GB/s
> Performance (m= 4096, n= 4096, k= 7168): 179 us | throughput: 1343 TFLOPS, 515 GB/s
> Performance (m= 4096, n= 7168, k= 2048): 118 us | throughput: 1016 TFLOPS, 691 GB/s
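As a sanity check on how the figures above relate to one another, here is a minimal sketch that recomputes throughput from a shape and a timing. It assumes 2·m·n·k floating-point operations per GEMM, 1-byte FP8 inputs, and a 2-byte BF16 output, with scale factors ignored; the helper name `gemm_throughput` is hypothetical and not part of DeepGEMM. Under those assumptions it reproduces the 1392 TFLOPS and 352 GB/s reported for m=4096, n=7168, k=16384. Rows with very short runtimes may differ slightly because the printed time is rounded to whole microseconds.

```python
def gemm_throughput(m: int, n: int, k: int, elapsed_us: float) -> tuple[float, float]:
    """Return (TFLOPS, GB/s) for an (m x k) by (k x n) GEMM that took elapsed_us microseconds."""
    seconds = elapsed_us * 1e-6
    # One multiply and one add per inner-product term.
    flops = 2 * m * n * k
    # Assumption: 1-byte FP8 inputs and a 2-byte BF16 output; scale factors are ignored.
    bytes_moved = m * k + k * n + 2 * m * n
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


# Shape and timing taken from the m=4096, n=7168, k=16384 row of the log.
tflops, gbps = gemm_throughput(4096, 7168, 16384, 691)
print(f"{tflops:.0f} TFLOPS, {gbps:.0f} GB/s")
```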
Testing grouped contiguous GEMM:
> Performance (num_groups=4, m_per_group=8192, n=4096, k=7168): 1418 us | throughput: 1357 TFLOPS, 438 GB/s
> Performance (num_groups=4, m_per_group=8192, n=7168, k=2048): 883 us | throughput: 1089 TFLOPS, 674 GB/s
> Performance (num_groups=8, m_per_group=4096, n=4096, k=7168): 1427 us | throughput: 1348 TFLOPS, 517 GB/s
> Performance (num_groups=8, m_per_group=4096, n=7168, k=2048): 884 us | throughput: 1089 TFLOPS, 740 GB/s
Testing grouped masked GEMM:
> Performance (num_groups=1, m_per_group=1024, n=4096, k=7168): 48 us | throughput: 1261 TFLOPS, 945 GB/s
> Performance (num_groups=1, m_per_group=1024, n=7168, k=2048): 32 us | throughput: 925 TFLOPS, 968 GB/s
> Performance (num_groups=2, m_per_group= 512, n=4096, k=7168): 49 us | throughput: 1216 TFLOPS, 1505 GB/s
> Performance (num_groups=2, m_per_group= 512, n=7168, k=2048): 32 us | throughput: 931 TFLOPS, 1429 GB/s
> Performance (num_groups=4, m_per_group= 256, n=4096, k=7168): 54 us | throughput: 1105 TFLOPS, 2448 GB/s
> Performance (num_groups=4, m_per_group= 256, n=7168, k=2048): 34 us | throughput: 878 TFLOPS, 2205 GB/s
How we build
We use setup.py to customize the DeepGEMM installation process. Since DeepGEMM uses JIT compilation, we've integrated it as a third-party library. During the setup.py build process for our wheel package, we first build the AOT sgl-kernel, and then simply copy the DeepGEMM files into the Python package.
How to use
import deep_gemm
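The copy step described under "How we build" could be sketched roughly as follows. This is not the setup.py from this PR: the helper `copy_jit_package` and every path in the demo are hypothetical. It only illustrates copying a JIT library's whole source tree, headers included, into the directory from which the wheel is assembled, using the standard library.

```python
import shutil
import tempfile
from pathlib import Path


def copy_jit_package(src_pkg: Path, build_lib: Path) -> Path:
    """Copy a JIT package tree (Python files plus kernel headers) into build_lib."""
    dst = build_lib / src_pkg.name
    # A JIT library compiles its kernels at runtime, so the whole tree is copied,
    # not only the .py files.
    shutil.copytree(src_pkg, dst, dirs_exist_ok=True)
    return dst


# Demo on a throwaway directory layout (all paths here are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "third_party" / "deep_gemm"
    (src / "include").mkdir(parents=True)
    (src / "__init__.py").write_text("# placeholder\n")
    (src / "include" / "kernel.cuh").write_text("// placeholder\n")

    build_lib = root / "build" / "lib"
    build_lib.mkdir(parents=True)
    dst = copy_jit_package(src, build_lib)
    copied = sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())
    print(copied)
```

In a real build, a function like this would be called from a custom setuptools build step after the AOT extension has been compiled, so that both end up in the same wheel.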
cc: @sleepcoo @HandH1998 @laixinn
We should also test the |
Please help rebase onto the latest main.
@zhyncs Symlinks are necessary for the header files used by JIT. The DeepGEMM tests are forked into the sgl-kernel tests.
…& build DeepGemm in setup.py Co-authored-by: sleepcoo <[email protected]>
Thank you all! The code is functional but messy. I will work on improving it later, but for now I will merge it.
Has anyone compared the performance improvement of DeepSeek R1 after using DeepGEMM?
Here: #4199
Co-authored-by: sleepcoo <[email protected]> Co-authored-by: HandH1998 <[email protected]> Co-authored-by: shuaills <[email protected]> Co-authored-by: yinfan98 <[email protected]> Co-authored-by: Yineng Zhang <[email protected]>
@laixinn Hi, I don't see this env variable |
@lishicheng1996 This env variable is deprecated. We have updated the description; please refer to #4199 for the actual kernel integration.
To use DeepGEMM, do we still need |
DeepGEMM will be used on the Hopper architecture; you can check out this pull request: #4613
Motivation
Integrate DeepGEMM in setup.
Linear usage: #4199.