[Kernel] moe wna16 marlin kernel #14447
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 5d6921e to e6896d3
This pull request has merge conflicts that must be resolved before it can be merged.
And maybe most importantly, for the case where we were using Marlin MoE before, this kernel is now the best choice for Mixtral 8x7B as well.
```cpp
const int scales_expert_stride = prob_n * prob_k / group_size / 8;
const int zp_expert_stride =
    is_zp_float ? prob_n * prob_k / group_size / 8
                : prob_n * prob_k / group_size / (pack_factor * 4);
```
Can `prob_n * prob_k` overflow an int32? If so, could you use `int64_t` instead? (This looks like it's probably fine, since we'd only overflow if a single expert was > 4GB, but int32 overflows are common enough in vLLM that I look for these in every kernel PR.)
The expression `prob_n * prob_k` is hard to overflow: even with prob_n = 32768 and prob_k = 65536, their product only just reaches the maximum value of a 32-bit integer, and such values are almost impossible to encounter in an MoE model. However, if you're still concerned, we could consider using `prob_k / group_size * prob_n / 8` instead of `prob_n * prob_k / group_size / 8`.

For the parts of the code that can easily overflow, I've already switched to int64. However, int64 consumes more registers, so I'd prefer to avoid using it unless absolutely necessary.
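For reference, here is a minimal standalone sketch of the reordering suggested above. It is only an illustration (not the kernel's actual code) and assumes `prob_k` is divisible by `group_size`, which holds for grouped quantization:

```cpp
#include <cstdio>

int main() {
  // MoE shapes close to the worst case discussed above.
  const int prob_n = 32768;
  const int prob_k = 65536;
  const int group_size = 128;

  // Original order (commented out): the intermediate prob_n * prob_k is
  // 2^31 for these shapes, right at the limit of a signed 32-bit int.
  // const int stride = prob_n * prob_k / group_size / 8;

  // Reordered form: dividing before multiplying keeps every intermediate
  // small (512, then 16777216, then 2097152).
  const int stride = prob_k / group_size * prob_n / 8;
  std::printf("scales_expert_stride = %d\n", stride);
  return 0;
}
```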
I seem to be facing some overflow issues with this PR, which I don't face with mainline, possibly due to env:

Log (collapsed; contents omitted)
This problem is hard to reproduce; you could try to debug it by removing some arguments, for example, try to remove
Your error may be due to OOM; 16×24G is really pushing the limits for running this model. I have tested the kernel with the shapes of deepseek-r1-awq + tp16, but unfortunately I am unable to reproduce this. Maybe you can look into more things. Your screenshot shows that the output is mostly correct, but there are a few instances of garbled text, so for now I suspect the issue might lie in the attention, multithreading, or some other component, rather than this kernel itself.
Well, I can normally run the DeepSeek R1 model at 16k ctx with this setup on AWQ, no issues. I can try that model again with the latest build. I don't have issues with mainline compared to this PR. Happy to keep testing if there's anything you can suggest. I'll trial the new build when it's done on CI. Could it be
Yes, this kernel supports group_size=32/64/128 and channelwise quantization.
This is with the previous build before CI, but it looks to be a VRAM issue, that without

Log (collapsed; contents omitted)
I will test the new CI build now with
@mgoin This PR is ready now. The failed tests seem unrelated to this PR.
What's the meaning of "Performance on DeepSeek-V3-AWQ (on 8*A800)"? Output tokens per second?
Yes, but these values are copied from the statistics logs printed by vLLM, not from an end-to-end benchmark test, so the actual results may be more complex. Additionally, the benchmark results are from a month ago, and in the past month I have optimized the operators multiple times, so the current results should be slightly better than those.
#12185 and #13321 introduced the triton/cuda moe wna16 kernels to optimize the performance of moe gptq/awq. However, the best-performing gptq/awq kernel currently is the marlin kernel, and I hope to combine it with moe. Although there is already an implementation of a moe + marlin kernel, it fails to fully leverage the performance advantages of the marlin kernel (especially when the number of experts is large).
This PR introduces a new moe wna16 marlin kernel that utilizes the m-parallel mechanism inherent to the marlin kernel to process all moe blocks in parallel. To prevent an excessive number of moe blocks from causing high `workspace` and `c_tmp` capacity requirements, I have updated the utilization logic for `workspace` and `c_tmp` (considering that the number of `slice_col_par` slices shared by different SMs is at most the number of SMs, we only need a `workspace` of fixed length equal to the number of SMs; see the sketch below).

This kernel is based on vLLM's gptq_marlin implementation and fully supports all of its features (bfloat16/int8/act_order/...). It also supports expert parallelism.
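Below is a minimal host-side sketch of the fixed-size workspace idea described above. It is an illustration under the assumption of one synchronization slot per SM, not the PR's actual allocation code (the real kernel's workspace layout and launch path differ):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  int num_sms = 0;
  cudaGetDevice(&device);
  // Query how many SMs the current GPU has.
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

  // Fixed-size workspace: one slot per SM (assumption made for this sketch),
  // independent of how many moe blocks the grid processes.
  int* workspace = nullptr;
  cudaMalloc(&workspace, num_sms * sizeof(int));
  cudaMemset(workspace, 0, num_sms * sizeof(int));

  std::printf("workspace length = %d ints (one per SM)\n", num_sms);

  cudaFree(workspace);
  return 0;
}
```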
EDIT: Benchmarks copied from comments in this PR
(The following benchmark results are outdated; after posting this, I made several rounds of optimizations in this PR. The final benchmark results are shown in #16850 (comment), in the "main" section.)
kernel benchmarks (on A800): https://gist.github.com/jinzhen-lin/d5228895171a8970631dc953296cec0a
- shapes of DeepSeek-V3-AWQ (with TP=8)
- shapes of Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 (with TP=1)
- shapes of Mixtral-8x7B-Instruct-v0.1-AWQ (with TP=1)
Summary:
Performance on DeepSeek-V3-AWQ (on 8*A800), with `VLLM_MARLIN_USE_ATOMIC_ADD=1`:
Accuracy Test on DeepSeek-R1-AWQ: