Support overlapping two batches #4068
Conversation
Hi, is this a minimal usable version of two-batch overlap? I.e., could we directly run/test it on two H800 nodes?
@agiping Hi, this PR is currently still in the "Draft PR" state, i.e. I am working on it. When it is done, I will convert it to non-draft. Indeed, I continued programming today; I had been waiting for the DeepGEMM and DeepEP integrations for several weeks, which are prerequisites of this PR.
[2025-04-10 06:54:25 TP4] MLA optimization is turned on. Use flashmla decode.
[2025-04-10 06:54:25 TP4] DeepEP is turned on. DeepEP mode: None
  File "/workspace/github/sglang/python/sglang/srt/models/deepseek_v2.py", line 1227, in __init__
    self.mlp = DeepseekV2MoE(
  File "/workspace/github/sglang/python/sglang/srt/models/deepseek_v2.py", line 220, in __init__
    dict(deepep_mode=DeepEPMode[global_server_args_dict["deepep_mode"]])
  File "/usr/lib/python3.10/enum.py", line 440, in __getitem__
    return cls._member_map_[name]
KeyError: None
Looks buggy here.
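For context on the error above, here is a minimal standalone sketch (not sglang's actual code; the enum members are illustrative) of why looking up an `Enum` member by a `None` name raises `KeyError: None`, plus one possible guard:

```python
from enum import Enum

# Simplified stand-in for sglang's DeepEPMode enum; member names are illustrative.
class DeepEPMode(Enum):
    normal = "normal"
    low_latency = "low_latency"
    auto = "auto"

deepep_mode = None  # e.g. what the server arg holds when no DeepEP mode was set

# DeepEPMode[deepep_mode]  # Enum.__getitem__(None) -> KeyError: None, as in the traceback

# One possible guard (illustrative only, not necessarily the PR's fix):
mode = DeepEPMode[deepep_mode] if deepep_mode is not None else DeepEPMode.auto
print(mode)  # DeepEPMode.auto
```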
missing
BTW, is there a chance to decouple this feature's dependency on DeepEP MoE? For non-NVIDIA chips, there is no easy replacement for IB GDR/NVSHMEM yet. Thanks!
After the series of PRs is merged, you can have a check; there are also some tools that may be useful for other kinds of two-batch overlap.
Have you tried testing with the …? I tested using the latest branch from your repository and found that it ran into an error:
The command I use is as follows.
I tested the following cases:
The environment I used is a single machine with 8 H800 cards, and the model's layer count has been reduced (down to 20 hidden layers) to ensure there is no OOM issue.
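(The exact layer-reduction method isn't shown in the thread, but one simple way to shrink a local model copy for smoke tests is to edit `num_hidden_layers` in its `config.json`, roughly as in this hypothetical sketch; the path is a placeholder and the checkpoint must still contain weights for the remaining layers.)

```python
import json

# Hypothetical helper: shrink a local model copy by lowering num_hidden_layers.
config_path = "/path/to/local/model/config.json"  # placeholder path

with open(config_path) as f:
    cfg = json.load(f)

cfg["num_hidden_layers"] = 20  # matches the 20-hidden-layer setup mentioned above

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)
```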
I will get back to two-batch overlap after EPLB.
# Conflicts:
#	python/sglang/srt/operations_strategy.py
Hello! Your work is great! May I ask whether you have considered splitting the input into multiple chunks before the GEMM and hiding the communication with multiple streams? I experimented with this and found that, although it is a coarse-grained approach, there are some throughput gains.
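For readers curious what that chunked, multi-stream idea can look like, here is a rough PyTorch sketch (illustrative only, not this PR's implementation; the CPU copy stands in for a real communication op such as all-to-all):

```python
import torch

def chunked_gemm_overlap(x: torch.Tensor, w: torch.Tensor, num_chunks: int = 2) -> torch.Tensor:
    """Split x along the batch dim and run each chunk's GEMM on its own CUDA stream,
    so one chunk's 'communication' can overlap with another chunk's computation."""
    streams = [torch.cuda.Stream() for _ in range(num_chunks)]
    outputs = []
    for chunk, stream in zip(x.chunk(num_chunks, dim=0), streams):
        with torch.cuda.stream(stream):
            y = chunk @ w                                   # compute for this chunk
            outputs.append(y.to("cpu", non_blocking=True))  # stand-in for a comm op
    for stream in streams:
        torch.cuda.current_stream().wait_stream(stream)
    torch.cuda.synchronize()  # make the host-side copies safe to read
    return torch.cat(outputs, dim=0)

if torch.cuda.is_available():
    x = torch.randn(8, 1024, device="cuda", dtype=torch.float16)
    w = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    print(chunked_gemm_overlap(x, w).shape)  # torch.Size([8, 1024])
```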
How do you split when the input batch size is 1, e.g. during warmup or for a single request?
There doesn't seem to be a need to split in this case; I've made a simple example: #6923
Hello, I'd like to ask a question. Where can I find the code for the scheduling of the two micro-batches in the decode stage? I want to learn about its implementation. Thanks! @fzyzcjy
Just check the code diff.
Update
If you want to try PD + EPLB + two-batch-overlap + ..., here is the branch that merges everything before they are merged into master: https://github.com/fzyzcjy/sglang/tree/feat/dev_branch
2025.03.26
Just now I ran some benchmarks on 8xH200, and there seem to be performance improvements. Note that I have not done careful tuning, because I am still waiting for the kernels and features (e.g. DeepGEMM for grouped GEMM, DeepEP low-latency). Also, other orthogonal techniques, such as reducing imbalance between GPUs, may also help.
Experiment setup
Command
For the baseline and this PR, change `{{extra_args}}` to an empty string and `--enable-two-batch-overlap`, respectively. The `random-output` is set to 1 deliberately to disable the decode phase, because decode relies on low-latency kernels and CUDA Graph support, which is not there yet. The bench-serving script is repeated 5 times, and the 1st run is thrown away (because it contains JIT compilation, etc.).
Experiment result
Throughput
On average, it improves throughput by 6.4%. Again, since the dependent PRs are not there yet, this is a very preliminary number without the real kernels and careful optimization.
2025.03.20
Current status
Since both the DeepGEMM and DeepEP integrations are finally ready (they are prerequisites of this PR), I updated the code today. Now it seems to work with the new DeepEP and also uses vanilla, non-generator-based code (because the yield grammar for torch.compile will not be available until the next PyTorch release).
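As a side note, here is a tiny, self-contained sketch of what the generator-based style mentioned above looks like (illustrative only; the stage names are made up and this is not the PR's code). Each `yield` marks a point where control switches to the other micro-batch, e.g. while the first one's communication is in flight:

```python
def forward_one_batch(name: str):
    # Each yield is a switch point; in a real system it would sit right after
    # launching an async communication (dispatch/combine) for this micro-batch.
    print(f"{name}: attention + MoE dispatch launched")
    yield
    print(f"{name}: expert GEMMs + MoE combine launched")
    yield
    print(f"{name}: remaining layers")

def run_two_batch_overlap():
    generators = [forward_one_batch("batch A"), forward_one_batch("batch B")]
    finished = set()
    # Round-robin the two generators so their stages interleave.
    while len(finished) < len(generators):
        for gen in generators:
            if gen in finished:
                continue
            try:
                next(gen)
            except StopIteration:
                finished.add(gen)

run_two_batch_overlap()
```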
What to do next
- More correctness tests (awaiting H100 GPU to be free) ---> 2025.03.21 morning: H100 is free now, MMLU passes
- Check profile results to see there does exist overlap (awaiting H100 GPU to be free) ---> 2025.03.21 morning: Yes
- Code cleanup and make PR ready (awaiting correctness tests) ---> 2025.03.21 morning: done
- … (awaiting correctness tests above, awaiting kernels)
2025.03.04 (Outdated)
Currently, this is just a draft, hacky implementation, because I need to wait for the integration of DeepEP/DeepGEMM/etc. before doing careful performance tuning.
The generation output looks roughly reasonable:
The profile timeline shows the two batches interleaving, with one batch's communication overlapping with the other batch's computation. (CUDA Graph is not enabled yet, since I hacked the part that will be replaced by DeepEP etc., and it does not seem to be CUDA-graph compatible.)
The code is quite hacky and will be refactored later.
Motivation
Modifications
Checklist