Let bench_one_batch support enable_dp_attention #4058

Merged

Conversation

fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Mar 4, 2025

Motivation

Before the fix, with enable_dp_attention=false, it works:

Details
python3 -m sglang.bench_one_batch --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --tp 2 --correct --trust-remote-code
[2025-03-04 00:23:46 TP1] MLA optimization is turned on. Use triton backend.
[2025-03-04 00:23:46 TP1] Init torch distributed begin.
[2025-03-04 00:23:47 TP0] MLA optimization is turned on. Use triton backend.
[2025-03-04 00:23:47 TP0] Init torch distributed begin.
[2025-03-04 00:23:47 TP0] sglang is using nccl==2.21.5
[2025-03-04 00:23:47 TP1] sglang is using nccl==2.21.5
[2025-03-04 00:23:48 TP0] Load weight begin. avail mem=77.18 GB
[2025-03-04 00:23:48 TP1] Load weight begin. avail mem=77.18 GB
[2025-03-04 00:23:48 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-04 00:23:48 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-04 00:23:49 TP0] Using model weights format ['*.safetensors']
[2025-03-04 00:23:49 TP1] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.57it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.16it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.05it/s]

[2025-03-04 00:23:53 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=62.40 GB
[2025-03-04 00:23:53 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=62.40 GB
[2025-03-04 00:23:53 TP1] Memory pool end. avail mem=7.58 GB
[2025-03-04 00:23:53 TP0] Memory pool end. avail mem=7.58 GB
[2025-03-04 00:23:54 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=7.49 GB
  0%|                                                                                                                    | 0/23 [00:00<?, ?it/s][2025-03-04 00:23:54 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=7.49 GB
loc("/host_home/primary_synced/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/host_home/primary_synced/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
[2025-03-04 00:23:58 TP1] Using default MoE config. Performance might be sub-optimal! Config file not found at /host_home/primary_synced/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=64,N=704,device_name=NVIDIA_H100_80GB_HBM3.json
[2025-03-04 00:23:58 TP0] Using default MoE config. Performance might be sub-optimal! Config file not found at /host_home/primary_synced/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=64,N=704,device_name=NVIDIA_H100_80GB_HBM3.json
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:18<00:00,  1.21it/s]
[2025-03-04 00:24:13 TP1] Registering 1265 cuda graph addresses
[2025-03-04 00:24:13 TP0] Registering 1265 cuda graph addresses
[2025-03-04 00:24:13 TP1] Capture cuda graph end. Time elapsed: 18.99 s. avail mem=5.88 GB
[2025-03-04 00:24:13 TP0] Capture cuda graph end. Time elapsed: 19.00 s. avail mem=5.88 GB
max_total_num_tokens=1805324

input_ids=[[100000, 549, 6077, 280, 7239, 317], [100000, 549, 6077, 280, 254, 4794, 20531, 283, 317], [100000, 16148, 317, 245, 28758, 1492, 285, 304, 837]]

prefill logits (first half): tensor([[ 7.1250,  5.9062,  4.0625,  ..., -3.1562, -3.1719, -3.2500],
        [ 7.1250,  5.9062,  4.0625,  ..., -3.1562, -3.1719, -3.2500],
        [ 7.2812,  9.9375, 10.5000,  ..., -1.7266, -1.5859, -1.6641]],
       device='cuda:0') 

prefill logits (final): tensor([[ 9.8750, 14.0625,  6.2188,  ..., -4.6875, -4.4375, -4.6562],
        [10.8125, 12.7500,  7.5938,  ..., -3.7969, -3.5781, -3.7500],
        [13.5000, 10.2500,  9.2500,  ..., -2.7031, -2.7031, -2.7031]],
       device='cuda:0') 

========== Prompt 0 ==========
<|begin▁of▁sentence|>The capital of France is Paris.

The capital of France is indeed Paris. Paris is the largest 

========== Prompt 1 ==========
<|begin▁of▁sentence|>The capital of the United Kindom is London.

The population of London is approximately 8.982 

========== Prompt 2 ==========
<|begin▁of▁sentence|>Today is a sunny day and I like to go out and enjoy the sun. I am going to the park to read 

[rank0]:[W304 00:24:19.095513178 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

Before the fix, with enable_dp_attention=true, it errors:

Details
python3 -m sglang.bench_one_batch --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --tp 2 --correct --trust-remote-code --enable-dp-attention
DP attention is enabled. The chunked prefill size is adjusted to 4096 to avoid MoE kernel issues. The schedule conservativeness is adjusted to 0.3. Data parallel size is adjusted to be the same as tensor parallel size. 
[2025-03-04 00:24:48 TP0] MLA optimization is turned on. Use triton backend.
[2025-03-04 00:24:48 TP0] Init torch distributed begin.
[2025-03-04 00:24:48 TP1] MLA optimization is turned on. Use triton backend.
[2025-03-04 00:24:48 TP1] Init torch distributed begin.
[2025-03-04 00:24:49 TP0] sglang is using nccl==2.21.5
[2025-03-04 00:24:49 TP1] sglang is using nccl==2.21.5
[2025-03-04 00:24:50 TP1] Load weight begin. avail mem=77.18 GB
[2025-03-04 00:24:50 TP0] Load weight begin. avail mem=77.18 GB
[2025-03-04 00:24:50 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-04 00:24:50 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-04 00:24:51 TP1] Using model weights format ['*.safetensors']
[2025-03-04 00:24:51 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.60it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.14it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.03it/s]

[2025-03-04 00:24:55 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=61.62 GB
[2025-03-04 00:24:55 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=61.62 GB
[2025-03-04 00:24:55 TP1] Memory pool end. avail mem=7.60 GB
[2025-03-04 00:24:55 TP0] Memory pool end. avail mem=7.60 GB
[2025-03-04 00:24:55 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=7.51 GB
[2025-03-04 00:24:55 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=7.51 GB
  0%|                                                                                                                    | 0/23 [00:00<?, ?it/s]
loc("/host_home/primary_synced/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/host_home/primary_synced/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
[2025-03-04 00:25:00 TP0] Using default MoE config. Performance might be sub-optimal! Config file not found at /host_home/primary_synced/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=64,N=704,device_name=NVIDIA_H100_80GB_HBM3.json
[2025-03-04 00:25:00 TP1] Using default MoE config. Performance might be sub-optimal! Config file not found at /host_home/primary_synced/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=64,N=704,device_name=NVIDIA_H100_80GB_HBM3.json
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:17<00:00,  1.33it/s]
[2025-03-04 00:25:13 TP1] Registering 621 cuda graph addresses
[2025-03-04 00:25:13 TP0] Registering 621 cuda graph addresses
[2025-03-04 00:25:13 TP1] Capture cuda graph end. Time elapsed: 17.35 s. avail mem=6.02 GB
[2025-03-04 00:25:13 TP0] Capture cuda graph end. Time elapsed: 17.35 s. avail mem=6.02 GB
max_total_num_tokens=1778556

input_ids=[[100000, 549, 6077, 280, 7239, 317], [100000, 549, 6077, 280, 254, 4794, 20531, 283, 317], [100000, 16148, 317, 245, 28758, 1492, 285, 304, 837]]

Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/bench_one_batch.py", line 278, in correctness_test
    next_token_ids, next_token_logits, batch = extend(reqs, model_runner)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/bench_one_batch.py", line 243, in extend
    logits_output = model_runner.forward(forward_batch)
  File "/host_home/primary_synced/sglang/python/sglang/srt/model_executor/model_runner.py", line 850, in forward
    return self.forward_extend(forward_batch)
  File "/host_home/primary_synced/sglang/python/sglang/srt/model_executor/model_runner.py", line 815, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 1036, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 997, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 946, in forward
    hidden_states, start_idx, end_idx = all_gather(
  File "/host_home/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 822, in all_gather
    max_len = max(forward_batch.global_num_tokens)
TypeError: 'NoneType' object is not iterable
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/bench_one_batch.py", line 278, in correctness_test
    next_token_ids, next_token_logits, batch = extend(reqs, model_runner)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/bench_one_batch.py", line 243, in extend
    logits_output = model_runner.forward(forward_batch)
  File "/host_home/primary_synced/sglang/python/sglang/srt/model_executor/model_runner.py", line 850, in forward
    return self.forward_extend(forward_batch)
  File "/host_home/primary_synced/sglang/python/sglang/srt/model_executor/model_runner.py", line 815, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 1036, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 997, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/host_home/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 946, in forward
    hidden_states, start_idx, end_idx = all_gather(
  File "/host_home/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 822, in all_gather
    max_len = max(forward_batch.global_num_tokens)
TypeError: 'NoneType' object is not iterable
[rank0]:[W304 00:25:16.747805092 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
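
The traceback shows the root cause: bench_one_batch builds its ForwardBatch without the DP-attention metadata, so forward_batch.global_num_tokens is still None when the all_gather helper in deepseek_v2.py computes the padding length. A minimal Python illustration of that failure mode follows; the list values are hypothetical, only the None case matches the log.

# Minimal illustration of the failure mode in the traceback above.
global_num_tokens = None
try:
    max(global_num_tokens)
except TypeError as e:
    print(e)  # 'NoneType' object is not iterable

# With the field populated (hypothetical per-DP-rank token counts),
# the padding-length computation in all_gather works:
global_num_tokens = [9, 9]
max_len = max(global_num_tokens)  # 9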

After the fix, with enable_dp_attention=true, it works:

Details
python3 -m sglang.bench_one_batch --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --tp 2 --correct --trust-remote-code --enable-dp-attention
DP attention is enabled. The chunked prefill size is adjusted to 4096 to avoid MoE kernel issues. The schedule conservativeness is adjusted to 0.3. Data parallel size is adjusted to be the same as tensor parallel size. 
[2025-03-04 00:36:45 TP0] MLA optimization is turned on. Use triton backend.
[2025-03-04 00:36:45 TP0] Init torch distributed begin.
[2025-03-04 00:36:45 TP1] MLA optimization is turned on. Use triton backend.
[2025-03-04 00:36:45 TP1] Init torch distributed begin.
[2025-03-04 00:36:46 TP0] sglang is using nccl==2.21.5
[2025-03-04 00:36:46 TP1] sglang is using nccl==2.21.5
[2025-03-04 00:36:47 TP0] Load weight begin. avail mem=77.18 GB
[2025-03-04 00:36:47 TP1] Load weight begin. avail mem=77.18 GB
[2025-03-04 00:36:47 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-04 00:36:47 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-04 00:36:48 TP1] Using model weights format ['*.safetensors']
[2025-03-04 00:36:48 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.61it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.15it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.03it/s]

[2025-03-04 00:36:52 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=61.62 GB
[2025-03-04 00:36:52 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=61.62 GB
[2025-03-04 00:36:52 TP0] Memory pool end. avail mem=7.60 GB
[2025-03-04 00:36:52 TP1] Memory pool end. avail mem=7.60 GB
[2025-03-04 00:36:52 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=7.51 GB
[2025-03-04 00:36:52 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=7.51 GB
  0%|                                                                                                                    | 0/23 [00:00<?, ?it/s]loc("/host_home/primary_synced/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/host_home/primary_synced/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
[2025-03-04 00:36:57 TP1] Using default MoE config. Performance might be sub-optimal! Config file not found at /host_home/primary_synced/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=64,N=704,device_name=NVIDIA_H100_80GB_HBM3.json
[2025-03-04 00:36:57 TP0] Using default MoE config. Performance might be sub-optimal! Config file not found at /host_home/primary_synced/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=64,N=704,device_name=NVIDIA_H100_80GB_HBM3.json
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:16<00:00,  1.36it/s]
[2025-03-04 00:37:09 TP1] Registering 621 cuda graph addresses
[2025-03-04 00:37:09 TP0] Registering 621 cuda graph addresses
[2025-03-04 00:37:09 TP1] Capture cuda graph end. Time elapsed: 16.93 s. avail mem=6.02 GB
[2025-03-04 00:37:09 TP0] Capture cuda graph end. Time elapsed: 16.94 s. avail mem=6.02 GB
max_total_num_tokens=1778556

input_ids=[[100000, 549, 6077, 280, 7239, 317], [100000, 549, 6077, 280, 254, 4794, 20531, 283, 317], [100000, 16148, 317, 245, 28758, 1492, 285, 304, 837]]

prefill logits (first half): tensor([[ 7.0625,  5.9062,  4.1562,  ..., -3.1250, -3.1406, -3.2344],
        [ 7.0625,  5.9062,  4.1562,  ..., -3.1250, -3.1406, -3.2344],
        [ 7.3125,  9.9375, 10.5000,  ..., -1.7891, -1.6484, -1.7266]],
       device='cuda:0') 

prefill logits (final): tensor([[ 9.8125, 14.0000,  6.2188,  ..., -4.6875, -4.4062, -4.6562],
        [10.7500, 12.6875,  7.5938,  ..., -3.7812, -3.5625, -3.7344],
        [13.4375, 10.2500,  9.2500,  ..., -2.6875, -2.6875, -2.6875]],
       device='cuda:0') 

========== Prompt 0 ==========
<|begin▁of▁sentence|>The capital of France is Paris.

The capital of France is indeed Paris. Paris is the most 

========== Prompt 1 ==========
<|begin▁of▁sentence|>The capital of the United Kindom is London.

The population of the United Kingdom is approximately 67 million 

========== Prompt 2 ==========
<|begin▁of▁sentence|>Today is a sunny day and I like to go out and enjoy the sun. I am going to the park to read 

[rank0]:[W304 00:37:16.789540315 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

EDIT: P.S. This is needed to test #4068, #4232, etc., which use DP attention.

Modifications
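
A rough sketch of the kind of change this requires, inferred from the traceback above: when --enable-dp-attention is set, the batch built by bench_one_batch has to carry per-rank token counts so that forward_batch.global_num_tokens is a list instead of None. Everything below except the global_num_tokens field name is an assumption for illustration, not the actual patch.

# Hypothetical sketch only; the helper and argument names are assumptions.
def fill_dp_attention_metadata(batch, input_ids, dp_size, enable_dp_attention):
    """Give the batch the per-rank token counts that DP attention's
    all_gather step reads from forward_batch.global_num_tokens."""
    if enable_dp_attention:
        # In bench_one_batch every DP rank runs the same synthetic batch,
        # so each rank contributes the same number of tokens.
        num_tokens = sum(len(ids) for ids in input_ids)
        batch.global_num_tokens = [num_tokens] * dp_size
    return batch

With something along those lines in place, max(forward_batch.global_num_tokens) in deepseek_v2.py sees a real list and the correctness test completes, as the "After the fix" log above shows.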

Checklist

@merrymercy
Contributor

Please hold this for now. We will upstream more DP attention stuff later.

@fzyzcjy fzyzcjy requested a review from xiezhq-hermann as a code owner March 21, 2025 03:19
@merrymercy
Contributor

All DP attention changes are merged into main now. We can merge this PR.

@zhyncs zhyncs merged commit 61970b0 into sgl-project:main Apr 9, 2025
0 of 19 checks passed
cnwenf pushed a commit to cnwenf/sglang that referenced this pull request Apr 10, 2025
finger92 pushed a commit to protagolabs/sglang that referenced this pull request Apr 10, 2025
thyecust pushed a commit to thyecust/sglang that referenced this pull request Apr 11, 2025
jianan-gu pushed a commit to jianan-gu/sglang that referenced this pull request Apr 13, 2025
DiweiSun pushed a commit to DiweiSun/sglang that referenced this pull request Apr 16, 2025
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025