
Support tuning moe for llama 4 model #5109


Closed · wants to merge 41 commits

Conversation

fzyzcjy (Collaborator) commented Apr 7, 2025

Motivation

Tune outputs will be in #5092; here I only include the script updates, to avoid making #5092 too big.

Modifications

Checklist

CatherineSue and others added 30 commits April 4, 2025 16:36
# Conflicts:
#	python/sglang/srt/layers/attention/flashattention_backend.py
This reverts commit ac4cca3.
1. Adds a `use_irope` parameter to the RadixAttention class to indicate whether a layer should use local attention based on iRoPE
2. Modifies Llama4Attention to pass `use_irope=not self.nope` to RadixAttention, leveraging the existing NoPE flag
3. Updates FlashAttentionBackend.forward_extend to check for the `use_irope` flag when determining if local attention should be used
4. Simplifies local attention activation logic by directly checking `attention_chunk_size is not None` instead of using a separate flag
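Taken together, these changes route Llama 4's iRoPE layers to chunked local attention whenever a chunk size is configured. The sketch below only illustrates that control flow under heavily simplified, assumed constructors; the real `RadixAttention`, `Llama4Attention`, and `FlashAttentionBackend` classes in sglang take many more arguments and perform the actual attention computation.

```python
from typing import Optional

# Minimal sketch of the wiring described above, NOT the actual sglang code:
# constructors are simplified and the return value stands in for the choice
# of attention kernel.

class RadixAttention:
    def __init__(self, layer_id: int, use_irope: bool = False):
        # (1) New flag marking layers that should use local (chunked)
        # attention based on iRoPE.
        self.layer_id = layer_id
        self.use_irope = use_irope


class Llama4Attention:
    def __init__(self, layer_id: int, nope: bool):
        # (2) Layers without NoPE use iRoPE, so they opt into local attention.
        self.nope = nope
        self.attn = RadixAttention(layer_id, use_irope=not self.nope)


class FlashAttentionBackend:
    def __init__(self, attention_chunk_size: Optional[int] = None):
        self.attention_chunk_size = attention_chunk_size

    def forward_extend(self, layer: RadixAttention) -> bool:
        # (3) + (4) Local attention is active only when the layer requests
        # iRoPE and a chunk size is configured; no separate enable flag.
        return layer.use_irope and self.attention_chunk_size is not None


# Example: only the iRoPE layer (nope=False) gets local attention.
backend = FlashAttentionBackend(attention_chunk_size=8192)
for layer_id, nope in [(0, False), (1, True)]:
    layer = Llama4Attention(layer_id, nope=nope)
    print(layer_id, backend.forward_extend(layer.attn))
```

Deriving the flag from `not self.nope` reuses the existing NoPE configuration rather than introducing a second per-layer knob.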