Improve DP attention #4390
Conversation
99% of the code is done by @dhou-xai.

Co-authored-by: dhou-xai <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>

Can we run 671B models with --dp 2 --tp 8 on 16 x H100?

@xihuai18 Yes. You can use --dp and --tp. The constraint is that --dp should not be larger than --tp. You can first set --tp to the total number of GPUs you have, then tune --dp to trade off between latency and KV cache capacity (or throughput). For example, to achieve better latency at small batch sizes, you can use --tp 8 --dp 2. To allow more KV cache capacity for larger batch sizes, you can use --tp 8 --dp 8. An example command: …

Please share the command here and in the docs once you finish the testing.

We should have hyperparameter tuning best practices in the documentation.

I also tried to run it with the tp16 dp2 setting, and found that capturing the CUDA graph causes a segmentation fault. I can run it with --disable-cuda-graph or after updating NCCL. Also, for older sglang versions, an older NCCL works fine. Is it necessary for me to update NCCL for this version update?

[2025-03-14 14:50:57 DP0 TP4] Scheduler hit an exception: Traceback (most recent call last): …

Is this not compatible with MTP? Will it be supported in the future?

The following options were tested but failed: …

I also ran into OOM.

Do we still need … for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct at TP=8 and bs=1?
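The example command itself is elided in the thread. As a rough sketch only (the model path, port, and exact flag spellings are assumptions based on sglang's usual CLI, not taken from this PR — check `python3 -m sglang.launch_server --help` against your installed version), a DP-attention launch on a single 8-GPU node following the `--tp 8 --dp 2` advice above might look like:

```shell
# Hypothetical sketch: serve a model with tensor parallelism across all
# 8 GPUs (--tp 8) and 2-way data-parallel attention (--dp 2), as suggested
# in the thread for better small-batch latency. Flag names and the model
# path are assumptions; verify them against the sglang documentation.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --dp 2 \
  --enable-dp-attention \
  --trust-remote-code \
  --port 30000
```

Per the discussion, keep --dp no larger than --tp; raising --dp toward --tp (e.g. --tp 8 --dp 8) trades per-request latency for more KV cache capacity and throughput at larger batch sizes.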