
Support Phi-4 Multi-Modal (text + vision only) #6494


Merged · 12 commits · May 25, 2025

Conversation

@lifuhuang (Collaborator) commented May 21, 2025

Motivation

Support Phi4-MM model with text + vision.

Modifications

This change introduces basic text + image support.

It's worth noting that the current MMMU run (without LoRA) scores lower than advertised, because Phi4-MM relies on LoRA for its full image understanding capabilities. However, LoRA support requires refactoring and generalizing the existing SGL LoRA handling, which will be addressed in a separate PR: #6585

Example of degraded image understanding without LoRA (MMMU is only 38); for comparison, in our local branch (#6585) with LoRA, MMMU is boosted to ~50.

TODO in this PR:

  • Add unit tests
  • Clean up styling issues

TODO in follow-up PR (ordered by priority):

  1. Precomputed feature support
  2. LoRA support (required for multi-image understanding)
  3. SGLang LoRA compatibility with CUDA Graph and Radix Attention
  4. Refactor SGL MM processor logic to support the model's original variable image tokens (e.g., <image_1>)
  5. Perf optimization
  6. Audio support
  7. Pipeline parallelism support

Tracked in #6544
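As a rough illustration of follow-up item 4, the MM processor could normalize Phi4-MM's numbered image tokens (e.g., `<image_1>`, `<image_2>`) into a single generic placeholder while remembering their order. This is a hypothetical sketch, not the actual SGLang implementation; the function name and placeholder token are assumptions.

```python
import re

# Hypothetical helper: matches Phi4-MM-style numbered image tokens.
IMAGE_TOKEN_RE = re.compile(r"<image_(\d+)>")

def normalize_image_tokens(prompt: str, placeholder: str = "<image>"):
    """Replace numbered image tokens with a generic placeholder and
    return the normalized prompt plus the original image indices,
    in the order they appeared."""
    indices = [int(m) for m in IMAGE_TOKEN_RE.findall(prompt)]
    normalized = IMAGE_TOKEN_RE.sub(placeholder, prompt)
    return normalized, indices

normalized, order = normalize_image_tokens("Compare <image_1> with <image_2>.")
# normalized == "Compare <image> with <image>.", order == [1, 2]
```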


@lifuhuang lifuhuang mentioned this pull request May 23, 2025
@zhaochenyang20 (Collaborator)
@mickqian @yizhang2077

@mickqian (Collaborator)
Better to be merged after #4969, due to some changes to the omni model processing and testing.

@lifuhuang (Collaborator, Author)
> Better to be merged after #4969, due to some changes to the omni model processing and testing.

Hi @mickqian, thank you so much for reviewing my PR :)

Can you share more details about your concerns so that I can test them locally? JFYI, I was able to merge your branch mickqian:qwen2.5-omni locally without conflicts and got a green TestOpenAIVisionServer run for phi4mm.
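For context, TestOpenAIVisionServer exercises the OpenAI-compatible chat completions endpoint with image inputs. A request body for a text + image prompt looks roughly like the sketch below; the model name and image URL are illustrative assumptions, not taken from the test itself.

```python
# Illustrative OpenAI-compatible vision request body (field names
# follow the OpenAI chat completions API; model name and URL are
# placeholders, not the actual test fixtures).
payload = {
    "model": "microsoft/Phi-4-multimodal-instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
}
```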

@mickqian (Collaborator) commented May 23, 2025

> Can you share more details

It's mostly that, for omni models, there's a new TestOpenaiOmniServer. And yes, you can cherry-pick it.

I just noticed audio input is not supported this time.

@lifuhuang lifuhuang requested a review from mickqian May 24, 2025 00:36
@zhyncs zhyncs merged commit 022012a into sgl-project:main May 25, 2025
1 of 19 checks passed
@lifuhuang lifuhuang self-assigned this May 25, 2025
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
@lifuhuang lifuhuang mentioned this pull request Jun 23, 2025
5 participants