Fuse shared experts in Llama 4 #5101

Closed

Conversation

fzyzcjy (Collaborator) commented on Apr 6, 2025

Motivation

Please subtract the diff of #5092 when reviewing this PR.

Modifications
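
This section was left empty in the PR body. For context on what the title refers to: Llama 4's MoE layers combine routed experts with a shared expert, and "fusing" the shared expert means folding it into the same grouped expert computation instead of running it as a separate dense MLP pass. The sketch below only illustrates that equivalence; the function names (`run_experts`, `moe_unfused`, `moe_fused`) and the simplified single-projection MLP are assumptions for exposition, not the actual sglang `fused_moe_triton` kernels.

```python
import torch
import torch.nn.functional as F

def run_experts(x, w1, w2, ids, weights):
    # Reference loop standing in for a fused/grouped expert kernel.
    out = torch.zeros_like(x)
    for e in range(w1.shape[0]):
        hit = ids == e                       # (tokens, slots)
        tok = hit.any(dim=1)
        if not tok.any():
            continue
        h = F.silu(x[tok] @ w1[e]) @ w2[e]   # simplified MLP (no gate projection)
        out[tok] += h * (weights[tok] * hit[tok]).sum(dim=1, keepdim=True)
    return out

def moe_unfused(x, ew1, ew2, sw1, sw2, ids, weights):
    # Baseline: routed experts, plus a separate dense pass for the shared expert.
    return run_experts(x, ew1, ew2, ids, weights) + F.silu(x @ sw1) @ sw2

def moe_fused(x, ew1, ew2, sw1, sw2, ids, weights):
    # Fused idea: treat the shared expert as one extra expert that every
    # token selects with weight 1.0, so a single kernel covers both.
    w1 = torch.cat([ew1, sw1[None]], dim=0)
    w2 = torch.cat([ew2, sw2[None]], dim=0)
    shared_id = torch.full_like(ids[:, :1], ew1.shape[0])
    all_ids = torch.cat([ids, shared_id], dim=1)
    all_w = torch.cat([weights, torch.ones_like(weights[:, :1])], dim=1)
    return run_experts(x, w1, w2, all_ids, all_w)

# The two formulations agree numerically.
T, d, h, E, k = 4, 8, 16, 4, 2
x = torch.randn(T, d)
ew1, ew2 = torch.randn(E, d, h), torch.randn(E, h, d)
sw1, sw2 = torch.randn(d, h), torch.randn(h, d)
ids = torch.randint(0, E, (T, k))
weights = torch.softmax(torch.randn(T, k), dim=-1)
assert torch.allclose(moe_unfused(x, ew1, ew2, sw1, sw2, ids, weights),
                      moe_fused(x, ew1, ew2, sw1, sw2, ids, weights), atol=1e-5)
```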

Checklist

fzyzcjy and others added 26 commits on April 6, 2025 at 18:18
Merge conflicts resolved in `python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py`
1. Adds a `use_irope` parameter to the RadixAttention class to indicate whether a layer should use local attention based on iRoPE
2. Modifies Llama4Attention to pass `use_irope=not self.nope` to RadixAttention, leveraging the existing NoPE flag
3. Updates FlashAttentionBackend.forward_extend to check for the `use_irope` flag when determining if local attention should be used
4. Simplifies the local-attention activation logic by directly checking `attention_chunk_size is not None` instead of using a separate flag (see the sketch after the commit list below)
This reverts commit 82ee700.
This reverts commit fc81086.
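
For reference, here is a minimal sketch of the attention wiring described in the commit notes above. The class names (`RadixAttention`, `Llama4Attention`, `FlashAttentionBackend`) and the `use_irope` / `attention_chunk_size` names come from those notes; the simplified constructors and string return values are illustrative assumptions, not the real sglang signatures.

```python
# Sketch only: simplified stand-ins for the sglang classes named above.

class RadixAttention:
    def __init__(self, layer_id: int, use_irope: bool = False):
        self.layer_id = layer_id
        # New flag: True for layers that should run local (chunked)
        # attention under iRoPE.
        self.use_irope = use_irope


class Llama4Attention:
    def __init__(self, layer_id: int, nope: bool):
        # NoPE layers attend globally, so the existing NoPE flag is
        # inverted when constructing the attention layer.
        self.nope = nope
        self.attn = RadixAttention(layer_id, use_irope=not nope)


class FlashAttentionBackend:
    def __init__(self, attention_chunk_size=None):
        self.attention_chunk_size = attention_chunk_size

    def forward_extend(self, layer: RadixAttention) -> str:
        # Local attention is used only when the layer opts in via
        # use_irope and a chunk size is configured; no separate
        # "local attention enabled" flag is needed.
        if layer.use_irope and self.attention_chunk_size is not None:
            return "local"
        return "global"


# A NoPE layer stays global, an iRoPE layer goes local:
backend = FlashAttentionBackend(attention_chunk_size=8192)
print(backend.forward_extend(Llama4Attention(0, nope=True).attn))   # global
print(backend.forward_extend(Llama4Attention(1, nope=False).attn))  # local
```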
fzyzcjy (Collaborator, Author) commented on Apr 21, 2025

No speedup was observed, thus closing this.

fzyzcjy closed this on Apr 21, 2025
5 participants