SGLang + Verl #3852


Merged: 124 commits into sgl-project:main on Feb 28, 2025

Conversation

@fzyzcjy (Collaborator) commented Feb 25, 2025

Motivation

Still WIP; marked as "ready for review" just to check CI.
Ready for review

Modifications

Checklist

class VerlEngine:
def __init__(
self,
device_mesh_cpu: DeviceMesh,

Collaborator:

This device mesh has only one dimension. Can we use ProcessGroup instead?

@fzyzcjy (Collaborator, author) commented Feb 26, 2025:

I am personally OK with whatever API here, but the original feature request #2736 seems to pass in a 1D DeviceMesh, so my default is to align with that.

EDIT: Btw, I quickly searched, but ProcessGroup does not seem to have an API like device_mesh_cpu.mesh[0].
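
For concreteness, a sketch of how such a 1D CPU mesh might be constructed (the tp size of 4 and the "tp" dim name are example values, not taken from #2736):

```python
from torch.distributed.device_mesh import init_device_mesh  # torch >= 2.2

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).
tp_size = 4  # example value
device_mesh_cpu = init_device_mesh("cpu", mesh_shape=(tp_size,), mesh_dim_names=("tp",))
# device_mesh_cpu.mesh is a 1D tensor of global ranks; mesh[0] is the first rank.
```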

@ch-wan (Collaborator) commented Feb 26, 2025:

> ProcessGroup does not seem to have an API like device_mesh_cpu.mesh[0].

Can we use dist.get_global_rank(group, 0) or dist.get_process_group_ranks(group)[0]?

I feel that the SGLang community is more familiar with ProcessGroup. It would be great if we could keep that consistency.
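
For reference, a minimal sketch of the two spellings being compared, assuming an initialized process group (the helper names are hypothetical, not from the PR):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh  # torch >= 2.2

def first_rank_from_mesh(device_mesh_cpu: DeviceMesh) -> int:
    # Current approach in the PR: read the first entry of the 1D mesh tensor.
    return int(device_mesh_cpu.mesh.tolist()[0])

def first_rank_from_group(group: dist.ProcessGroup) -> int:
    # ch-wan's suggestion: derive the same global rank from a ProcessGroup.
    return dist.get_global_rank(group, 0)
    # equivalently: dist.get_process_group_ranks(group)[0]
```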

@fzyzcjy (Collaborator, author):

Looks reasonable. Just now I realized another tiny issue: the DTensor weights live on the FSDP DeviceMesh, so if we want to use DTensor.redistribute to move them to the SGLang mesh, we may need a DeviceMesh object. (Currently I call full_tensor(), following Verl's vLLM path; redistribute across meshes is not yet supported in torch, so this is not in the code and will wait for profiling to avoid premature optimization.)
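
As a rough illustration of the trade-off described above (a sketch only; the helper name is hypothetical and does not reproduce the PR's _unwrap_tensor exactly):

```python
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older torch

def unwrap_weight_sketch(tensor):
    """Hypothetical helper mirroring the full_tensor() path described above."""
    if isinstance(tensor, DTensor):
        # Gather the FSDP shards into a plain torch.Tensor on every rank.
        # The alternative, tensor.redistribute(<sglang_mesh>, placements=...),
        # would need a DeviceMesh for SGLang, and cross-mesh redistribute is
        # not yet supported in torch, hence full_tensor() for now.
        return tensor.full_tensor()
    return tensor
```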

(name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
for name, tensor in named_tensors
]
# TODO should we name it "direct" or "megatron"?

Collaborator:

Based on its implementation, I recommend "direct".

Collaborator (author):

P.S. #2736 named it "megatron", but I feel "direct" may be a bit more suitable, hence I leave the question here.

@@ -269,6 +212,79 @@ def __exit__(self, exc_type, exc_value, traceback):
self.model_proc.terminate()
self.in_queue = self.out_queue = None

@staticmethod

Collaborator:

Is this refactor necessary?

Collaborator (author):

(see below)

@@ -408,6 +374,84 @@ def __exit__(self, exc_type, exc_value, traceback):
self.engine.shutdown()
del self.engine

@staticmethod

Collaborator:

Is this refactor necessary?

Collaborator (author):

(see below)

-mem_fraction_static=mem_fraction_static,
-trust_remote_code=False,
+mem_fraction_static=0.65,
+trust_remote_code=True,

Collaborator:

Is this change necessary? Many other tests use this code. It would be better to keep the original version.

@fzyzcjy (Collaborator, author) commented Feb 26, 2025:

For the changes in test/runners.py:

Firstly, I am OK with either refactoring (to avoid code duplication) or copying (to avoid changing existing code), though I personally slightly prefer refactoring; that is why I left # TODO Ask: is it ok to refactor test code like this in the code. Indeed, zhaochenyang20 above seems to say LGTM.

Secondly, it is refactored because, in test_verl_engine.py, I added comparison tests to ensure HuggingFace outputs match SGLang outputs. test_verl_engine.py roughly mimics adhoc_verl_torchrun.py, which is a minimal modification of guangming's Verl integration test script. This is quite similar to how comparison tests are done in test_generation_models.py, so the common logic was extracted.

For trust_remote_code, IIRC it is needed because some model (maybe THUDM/glm-4-9b-chat?) requires it. I copied the list of models from test_generation_models.py into test_verl_engine.py and tested them, and this model comes from that list.
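
For context, a minimal sketch of what the flag changes when loading such a model with transformers (assuming, as suspected above, that the repo ships custom modeling code on the Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/glm-4-9b-chat"  # from the model list copied into test_verl_engine.py

# Repos that ship their own modeling code cannot be loaded unless
# trust_remote_code=True is passed explicitly.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```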

Collaborator:

I leave the decision to @zhaochenyang20 as he is more knowledgeable about this refactor's potential impact.

Collaborator:

I don't think we need to change trust_remote_code. THUDM/glm-4-9b-chat is not a widely used LLM; if we have to change the parameter just for this model, we'd better remove the model from the test cases.

Collaborator:

To me, DRY (Don't Repeat Yourself). I agree with tom's refactor.

Collaborator (author):

I copied it from test_generation_models.py (the ALL_OTHER_MODELS section) - not sure whether we are allowed to delete an existing test.

@@ -130,7 +130,7 @@ def start_model_process(self, in_queue, out_queue, model_path, torch_dtype):
 self.base_model = AutoModelForCausalLM.from_pretrained(
     model_path,
     torch_dtype=torch_dtype,
-    trust_remote_code=False,
+    trust_remote_code=True,

Collaborator:

Is this change necessary? Many other tests use this code. It would be better to keep the original version.

Collaborator (author):

(see above)

-self.tokenizer = get_tokenizer(model_path, torch_dtype=torch.dtype)
+self.tokenizer = get_tokenizer(
+    model_path, torch_dtype=torch.dtype, trust_remote_code=True
+)

Collaborator:

Is this change necessary? Many other tests use this code. It would be better to keep the original version.

Collaborator (author):

(see above)

dist.gather_object(
obj=serialized_tensor,
object_gather_list=gathered_serialized_tensors,
dst=self._device_mesh_cpu.mesh.tolist()[0],

@yangw1234 commented Feb 27, 2025:

nitpick: (I am not familiar with device mesh, so this might be a stupid question.) Does self._device_mesh_cpu.mesh.tolist()[0] return the global rank for local_rank=0? Would it be clearer to use group_dst=0?

Collaborator (author):

Yes, it gets the first global rank in the group. About group_dst, I am a bit confused; gather_object does not seem to provide that parameter.

[screenshot of the torch.distributed.gather_object documentation]

Oh, it is introduced in pytorch 2.6. Maybe we can just stick to dst.
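
For reference, a sketch of the two call shapes discussed here (serialized_tensor and group are stand-ins; the group_dst keyword is only assumed to exist on torch >= 2.6, per the screenshot above):

```python
import torch.distributed as dist

def gather_on_first_rank(serialized_tensor, group):
    """Sketch: gather onto the group's first rank using dst vs. group_dst."""
    first_global_rank = dist.get_process_group_ranks(group)[0]
    is_dst = dist.get_rank() == first_global_rank
    gathered = [None] * dist.get_world_size(group) if is_dst else None

    # Portable form used in the PR: dst takes a *global* rank.
    dist.gather_object(
        obj=serialized_tensor,
        object_gather_list=gathered,
        dst=first_global_rank,
        group=group,
    )

    # torch >= 2.6 only: group_dst takes a rank *within* `group`, so 0 would
    # mean the group's first rank without the extra lookup.
    # dist.gather_object(obj=serialized_tensor, object_gather_list=gathered,
    #                    group=group, group_dst=0)
    return gathered
```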

@yangw1234 left a comment:
LGTM

@shanyu-sys (Collaborator) left a comment:
LGTM

@ch-wan (Collaborator) left a comment:
I approve this PR. Avoiding full_tensor can be explored in future PRs.

@zhaochenyang20 (Collaborator):
Great. All set.

@PeterSH6:
Great!!

@zhaochenyang20 merged commit e3e0bc5 into sgl-project:main on Feb 28, 2025
4 of 18 checks passed
@zhyncs mentioned this pull request on Mar 4, 2025
aoshen524 pushed a commit to aoshen524/sglang that referenced this pull request Mar 10, 2025