Support overlapping two batches #4068


Merged: 1,357 commits into sgl-project:main, May 25, 2025

Conversation

@fzyzcjy (Collaborator) commented Mar 4, 2025

Update

If you want to try PD + EPLB + two-batch-overlap + ..., here is the branch that merges everything before the individual PRs land in master: https://github.com/fzyzcjy/sglang/tree/feat/dev_branch

2025.03.26

I just ran some benchmarks on 8xH200 and there seem to be performance improvements. Note that I have not done careful tuning yet, because I am still waiting for the remaining kernels and features (e.g. DeepGEMM for grouped GEMM, DeepEP low-latency). Other orthogonal techniques, such as reducing imbalance between GPUs, may also help.

Experiment setup

Command

sglang-bench-serving-launch-server extra_args:
    PYTHONUNBUFFERED=1 SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local python3 -m sglang.launch_server \
        --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code \
        --tp 8 --dp 8 \
        --host 0.0.0.0 --port 32123 --decode-log-interval 1 \
        --enable-dp-attention --enable-deepep-moe --disable-cuda-graph \
        --enable-flashmla \
        --chunked-prefill-size 65536 \
        {{extra_args}}

sglang-bench-serving-random:
    python3 -m sglang.bench_serving \
        --backend sglang --host 127.0.0.1 --port 32123 \
        --dataset-name random \
        --num-prompt 1024 \
        --random-input 1000 --random-output 1 --random-range-ratio 1 \
        --max-concurrency 1024

For the baseline and this PR, set {{extra_args}} to the empty string and to --enable-two-batch-overlap, respectively.
The random-output is deliberately set to 1 to disable the decode phase, because decode relies on the low-latency kernel and CUDA Graph support, which are not there yet.

The bench-serving script is run 5 times, and the 1st run is thrown away (because it includes JIT compilation, etc.).

Experiment result

Throughput

  • baseline: 14492.08, 14657.42, 14515.49, 14590.32
  • ours: 15446.02, 15528.12, 15322.22, 15724.91

On average, this improves throughput by about 6.4%. Again, since the dependent PRs are not merged yet, this is a very preliminary number without real kernels and careful optimization.
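
For reference, the quoted improvement is just the ratio of the run averages; a quick sanity check in plain Python with the numbers above:

    from statistics import mean

    baseline = [14492.08, 14657.42, 14515.49, 14590.32]
    ours = [15446.02, 15528.12, 15322.22, 15724.91]
    gain = mean(ours) / mean(baseline) - 1
    print(f"{mean(baseline):.1f} -> {mean(ours):.1f} (+{gain:.1%})")  # prints roughly +6.5%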

2025.03.20

Current status

Since both the DeepGEMM and DeepEP integrations are finally ready (they are prerequisites of this PR), I updated the code today. It now works with the new DeepEP and uses vanilla non-generator-based code (because torch.compile support for the yield grammar will not be available until the next PyTorch release).
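
For readers who have not seen the generator-based style mentioned above, here is a minimal, purely illustrative sketch (not the PR's actual code; the stage names are hypothetical placeholders) of how two micro-batches can be interleaved so that one computes while the other sits in its communication phase:

    # Illustrative only: each micro-batch's per-layer forward is written as a
    # generator that yields at communication boundaries, and a small round-robin
    # loop alternates the two generators, so one micro-batch's compute can run
    # while the other's dispatch/combine communication is in flight.

    def micro_batch_forward(name):
        print(f"{name}: attention + gating (compute)")
        yield                                  # next step would launch MoE dispatch
        print(f"{name}: MoE dispatch/combine in flight (communication)")
        yield                                  # the other micro-batch computes here
        print(f"{name}: expert output + residual (compute)")

    def run_two_batch_overlap():
        gens = [micro_batch_forward("micro-batch A"), micro_batch_forward("micro-batch B")]
        done = [False, False]
        i = 0
        while not all(done):
            try:
                next(gens[i])                  # advance one micro-batch by one stage
            except StopIteration:
                done[i] = True
            i ^= 1                             # switch to the other micro-batch

    run_two_batch_overlap()

The vanilla non-generator version referred to above achieves the same interleaving without yield, which is what keeps it usable with torch.compile for now.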

What to do next

  • More correctness tests (awaiting a free H100 GPU) ---> 2025.03.21 morning: H100 is free now, MMLU passes
  • Check profile results to confirm the overlap does exist (awaiting a free H100 GPU) ---> 2025.03.21 morning: Yes
  • Code cleanup and making the PR ready (awaiting correctness tests) ---> 2025.03.21 morning: done
  • Tune performance on H200 for the DeepSeek-V3 model (awaiting the correctness tests above and the kernels)
  • Test CUDA Graph and torch.compile (my code is roughly done, but needs to wait for the DeepEP integration's CUDA Graph support)

2025.03.04 (Outdated)

Currently, this is just a hacky draft implementation, because I need to wait for the DeepEP/DeepGEMM/etc. integrations before doing careful performance tuning.

The generation output looks roughly reasonable:

[screenshot: sample generation output]

The profile timeline shows the two batches interleaving, with one batch's communication overlapping the other batch's computation. (CUDA Graph is not enabled yet, since I hacked the part that will be replaced by DeepEP etc., and it does not seem CUDA Graph compatible.)

[screenshot: profile timeline showing the two batches interleaving]

The code is quite hacky and will be refactored later.

Motivation

Modifications

Checklist

@merrymercy mentioned this pull request Mar 13, 2025
@agiping commented Mar 20, 2025

Hi, is this a minimal working version of two-batch overlap? I.e., could we directly run/test it on two H800 nodes?

@fzyzcjy (Collaborator, Author) commented Mar 20, 2025

@agiping Hi, this PR is currently still a draft, i.e. I am still working on it. When it is done, I will mark it as ready for review.

Indeed, I continued programming today; I had been waiting several weeks for the DeepGEMM and DeepEP integrations, which are prerequisites of this PR.

@fzyzcjy marked this pull request as ready for review March 21, 2025 01:08
@fzyzcjy changed the title from "[WIP] Support overlapping two batches" to "Support overlapping two batches" Mar 21, 2025
@ch-wan mentioned this pull request Mar 25, 2025
@ZJLi2013 commented Apr 10, 2025

[2025-04-10 06:54:25 TP4] MLA optimization is turned on. Use flashmla decode.
[2025-04-10 06:54:25 TP4] DeepEP is turned on. DeepEP mode: None

  File "/workspace/github/sglang/python/sglang/srt/models/deepseek_v2.py", line 1227, in __init__
    self.mlp = DeepseekV2MoE(
  File "/workspace/github/sglang/python/sglang/srt/models/deepseek_v2.py", line 220, in __init__
    dict(deepep_mode=DeepEPMode[global_server_args_dict["deepep_mode"]])
  File "/usr/lib/python3.10/enum.py", line 440, in __getitem__
    return cls._member_map_[name]
KeyError: None

looks buggy here

@ZJLi2013

I was missing --deepep-mode normal; with it, the issue no longer reproduces. Thanks.
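
For context, the KeyError above is ordinary Python Enum name-lookup behavior when the mode ends up as None (as the "DeepEP mode: None" log line shows); a minimal stand-in sketch, with illustrative member names:

    from enum import Enum, auto

    class DeepEPMode(Enum):        # stand-in for sglang's DeepEPMode; members illustrative
        normal = auto()
        low_latency = auto()

    deepep_mode = None             # value seen when --deepep-mode is not passed, per the log
    try:
        DeepEPMode[deepep_mode]    # Enum.__getitem__ looks up members by name
    except KeyError as e:
        print("KeyError:", e)      # prints "KeyError: None", matching the traceback

Passing --deepep-mode normal (or low_latency), as noted above, avoids the None lookup.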

@ZJLi2013

btw, is there a chance to decouple this feature's dependency on deepep-moe? For non-NVIDIA chips, there is no easy replacement for IBGDA/NVSHMEM yet. thanks bro.

@fzyzcjy (Collaborator, Author) commented Apr 10, 2025

After this series of PRs is merged, you can take a look; there are some tools that may be useful for other kinds of two-batch overlap.

@FrontierSetter commented Apr 11, 2025

Have you tried testing with the --random-output parameter set to greater than 1?

I tested using the latest branch from your repository and found that it ran into an error:

Assertion failed: /usr/local/lib/python3.10/dist-packages/deep_gemm/jit/../include/deep_gemm/fp8_gemm.cuh:435, condition: status == cudaSuccess                                                     [175/1002]
Fatal Python error: PyThreadState_Get: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
Python runtime state: initialized

Thread 0x00007f440bfff640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f4417fff640 (most recent call first):
  File "/usr/lib/python3.10/socket.py", line 293 in accept
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 609 in accept
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 463 in accept
  File "/usr/lib/python3.10/multiprocessing/resource_sharer.py", line 138 in _serve
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f4423fff640 (most recent call first):
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/managers/scheduler.py", line 1660 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f442ffff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/deep_gemm/jit/runtime.py", line 45 in __call__
  File "/usr/local/lib/python3.10/dist-packages/deep_gemm/jit_kernels/gemm.py", line 205 in gemm_fp8_fp8_bf16_nt
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/layers/quantization/fp8_kernel.py", line 62 in deep_gemm_fp8_fp8_bf16_nt
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116 in __call__
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/layers/quantization/fp8_kernel.py", line 783 in w8a8_block_fp8_matmul
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/layers/quantization/fp8_utils.py", line 156 in apply_w8a8_block_fp8_linear
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/layers/quantization/fp8.py", line 422 in apply
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/layers/linear.py", line 1276 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/models/deepseek_v2.py", line 1000 in forward_absorb_stage_core
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/models/deepseek_v2.py", line 922 in forward_absorb
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/models/deepseek_v2.py", line 869 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/models/deepseek_v2.py", line 1315 in forward_mode_mlp_all
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/models/deepseek_v2.py", line 1289 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/models/deepseek_v2.py", line 1604 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/models/deepseek_v2.py", line 1701 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/model_executor/model_runner.py", line 961 in forward_decode
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/model_executor/model_runner.py", line 1023 in _forward_raw
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/model_executor/model_runner.py", line 1005 in forward
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/managers/tp_worker.py", line 176 in forward_batch_generation
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 143 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/share/test/sglang_fzyzcjy/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 112 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
...

The commands I used are as follows.

CUDA_LAUNCH_BLOCKING=1 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path /share/model/DeepSeek-R1/ --tp 8 --dp 8 --trust-remote-code --disable-radix-cache --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --enable-flashmla --port 20000 --disable-cuda-graph --max-running-requests 128 --chunked-prefill-size 1024 --max-prefill-tokens 128 --stream-output --enable-two-batch-overlap

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 512 --random-input 1000 --random-output 10 --random-range-ratio 1 --host 127.0.0.1 --port 20000 --max-concurrency 128

I tested the following cases:

  1. Without --enable-two-batch-overlap and with --random-output set to 1000: no error
  2. With --enable-two-batch-overlap and --random-output set to 1: no error
  3. With --enable-two-batch-overlap and --random-output set to 10 (or 1000): the error occurred.

The environment I used is a single machine with 8 H800 cards, and the model has been reduced in layers (down to 20 hidden layers) to ensure that there is no OOM issue.

@fzyzcjy (Collaborator, Author) commented Apr 11, 2025

I will get back to two batch overlap after EPLB

# Conflicts:
#	python/sglang/srt/operations_strategy.py
@zhyncs merged commit 0d47788 into sgl-project:main May 25, 2025
3 of 42 checks passed
@Jacki1223
Hello! Your work is great! May I ask whether you have considered splitting the input into multiple chunks before the GEMM and hiding the communication with multiple streams? I experimented with this and found that, although it is a coarse-grained approach, there are some throughput gains.
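
For illustration, a rough sketch of that coarse-grained alternative (not code from this PR): split the input into chunks and alternate two CUDA streams so that one chunk's all-reduce can overlap with the next chunk's GEMM. It assumes a CUDA device and an already-initialized NCCL process group; names and shapes are placeholders.

    import torch
    import torch.distributed as dist

    def chunked_matmul_with_overlap(x, weight, num_chunks=4):
        """Illustrative sketch: interleave per-chunk GEMM and all-reduce on two streams."""
        streams = [torch.cuda.Stream(), torch.cuda.Stream()]
        outputs = []
        for i, chunk in enumerate(x.chunk(num_chunks, dim=0)):
            s = streams[i % 2]
            s.wait_stream(torch.cuda.current_stream())     # this chunk's input is ready
            with torch.cuda.stream(s):
                y = chunk @ weight                         # compute for this chunk
                work = dist.all_reduce(y, async_op=True)   # communication for this chunk
                work.wait()                                # stream-level wait; the host moves on
                outputs.append(y)
        for s in streams:
            torch.cuda.current_stream().wait_stream(s)     # rejoin the default stream
        return torch.cat(outputs, dim=0)

A real implementation would also need to handle cross-stream memory lifetimes (e.g. Tensor.record_stream) and pick the chunk size carefully, which is presumably why this stays coarse-grained compared to the per-layer overlap in this PR.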

Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
@GreatBryan
How do you split when the input batch size is 1, e.g. during warm-up or for a single request?

@Jacki1223

There doesn't seem to be a need to split in this case; I've made a simple example: #6923

xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
@nannaer commented Jun 26, 2025

Hello, I'd like to ask a question: where can I find the code that schedules the two micro-batches in the decode stage? I want to learn about its implementation. Thanks! @fzyzcjy

@fzyzcjy (Collaborator, Author) commented Jun 26, 2025

just check code diff
