Fuse shared experts in Llama 4 #5101

Closed

Conversation

fzyzcjy (Collaborator) commented on Apr 6, 2025

Motivation

Please subtract the diff of #5092 when reviewing this PR.

Modifications
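
This section was left empty in the PR body. For context on what the title refers to: Llama 4's MoE layers combine routed experts with a shared expert, and "fusing" the shared expert means folding it into the same grouped expert computation instead of running it as a separate dense MLP pass. The sketch below only illustrates that equivalence; the function names (`run_experts`, `moe_unfused`, `moe_fused`) and the simplified single-projection MLP are assumptions for exposition, not the actual sglang `fused_moe_triton` kernels.

```python
import torch
import torch.nn.functional as F

def run_experts(x, w1, w2, ids, weights):
    # Reference loop standing in for a fused/grouped expert kernel.
    out = torch.zeros_like(x)
    for e in range(w1.shape[0]):
        hit = ids == e                       # (tokens, slots)
        tok = hit.any(dim=1)
        if not tok.any():
            continue
        h = F.silu(x[tok] @ w1[e]) @ w2[e]   # simplified MLP (no gate projection)
        out[tok] += h * (weights[tok] * hit[tok]).sum(dim=1, keepdim=True)
    return out

def moe_unfused(x, ew1, ew2, sw1, sw2, ids, weights):
    # Baseline: routed experts, plus a separate dense pass for the shared expert.
    return run_experts(x, ew1, ew2, ids, weights) + F.silu(x @ sw1) @ sw2

def moe_fused(x, ew1, ew2, sw1, sw2, ids, weights):
    # Fused idea: treat the shared expert as one extra expert that every
    # token selects with weight 1.0, so a single kernel covers both.
    w1 = torch.cat([ew1, sw1[None]], dim=0)
    w2 = torch.cat([ew2, sw2[None]], dim=0)
    shared_id = torch.full_like(ids[:, :1], ew1.shape[0])
    all_ids = torch.cat([ids, shared_id], dim=1)
    all_w = torch.cat([weights, torch.ones_like(weights[:, :1])], dim=1)
    return run_experts(x, w1, w2, all_ids, all_w)

# The two formulations agree numerically.
T, d, h, E, k = 4, 8, 16, 4, 2
x = torch.randn(T, d)
ew1, ew2 = torch.randn(E, d, h), torch.randn(E, h, d)
sw1, sw2 = torch.randn(d, h), torch.randn(h, d)
ids = torch.randint(0, E, (T, k))
weights = torch.softmax(torch.randn(T, k), dim=-1)
assert torch.allclose(moe_unfused(x, ew1, ew2, sw1, sw2, ids, weights),
                      moe_fused(x, ew1, ew2, sw1, sw2, ids, weights), atol=1e-5)
```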

Checklist

fzyzcjy and others added 26 commits on April 6, 2025 at 18:18
Merge conflicts resolved in `python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py`
1. Adds a `use_irope` parameter to the RadixAttention class to indicate whether a layer should use local attention based on iRoPE
2. Modifies Llama4Attention to pass `use_irope=not self.nope` to RadixAttention, leveraging the existing NoPE flag
3. Updates FlashAttentionBackend.forward_extend to check for the `use_irope` flag when determining if local attention should be used
4. Simplifies the local-attention activation logic by directly checking `attention_chunk_size is not None` instead of using a separate flag (see the sketch after the commit list below)
This reverts commit 82ee700.
This reverts commit fc81086.
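
For reference, here is a minimal sketch of the attention wiring described in the commit notes above. The class names (`RadixAttention`, `Llama4Attention`, `FlashAttentionBackend`) and the `use_irope` / `attention_chunk_size` names come from those notes; the simplified constructors and string return values are illustrative assumptions, not the real sglang signatures.

```python
# Sketch only: simplified stand-ins for the sglang classes named above.

class RadixAttention:
    def __init__(self, layer_id: int, use_irope: bool = False):
        self.layer_id = layer_id
        # New flag: True for layers that should run local (chunked)
        # attention under iRoPE.
        self.use_irope = use_irope


class Llama4Attention:
    def __init__(self, layer_id: int, nope: bool):
        # NoPE layers attend globally, so the existing NoPE flag is
        # inverted when constructing the attention layer.
        self.nope = nope
        self.attn = RadixAttention(layer_id, use_irope=not nope)


class FlashAttentionBackend:
    def __init__(self, attention_chunk_size=None):
        self.attention_chunk_size = attention_chunk_size

    def forward_extend(self, layer: RadixAttention) -> str:
        # Local attention is used only when the layer opts in via
        # use_irope and a chunk size is configured; no separate
        # "local attention enabled" flag is needed.
        if layer.use_irope and self.attention_chunk_size is not None:
            return "local"
        return "global"


# A NoPE layer stays global, an iRoPE layer goes local:
backend = FlashAttentionBackend(attention_chunk_size=8192)
print(backend.forward_extend(Llama4Attention(0, nope=True).attn))   # global
print(backend.forward_extend(Llama4Attention(1, nope=False).attn))  # local
```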
fzyzcjy (Collaborator, Author) commented on Apr 21, 2025

No speedup was observed, thus closing this.

fzyzcjy closed this on Apr 21, 2025
5 participants