Support Phi-4 Multi-Modal (text + vision only) #6494
Conversation
Better to merge after #4969, due to some changes to the omni model processing and testing.
Hi @mickqian, thank you so much for reviewing my PR :) Can you share more details about the concerns you have so that I can test them locally? JFYI, I was able to merge your branch mickqian:qwen2.5-omni locally without conflicts and got a green TestOpenAIVisionServer run for phi4mm.
I just noticed audio input is not supported this time.
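For anyone who wants to reproduce the TestOpenAIVisionServer check mentioned above, here is a minimal sketch run from the repo root; the test directory and file name are assumptions based on SGLang's test layout and may need adjusting:

```python
# Minimal sketch: discover and run the vision OpenAI-server tests.
# The start directory and file pattern are assumptions; adjust to the actual test layout.
import unittest

suite = unittest.defaultTestLoader.discover(
    start_dir="test/srt", pattern="test_vision_openai_server.py"
)
unittest.TextTestRunner(verbosity=2).run(suite)
```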
Motivation
Support the Phi4-MM model with text + vision inputs.
Modifications
This change introduces basic text + image support.
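As a quick illustration of the new path, here is a hedged sketch of sending a text + image request through SGLang's OpenAI-compatible server. The launch command, model ID (microsoft/Phi-4-multimodal-instruct), port, and image URL are assumptions for illustration, not something this PR prescribes:

```python
# Assumes a server was launched roughly like (flags and model ID are assumptions):
#   python -m sglang.launch_server --model-path microsoft/Phi-4-multimodal-instruct \
#     --trust-remote-code --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; the server serves whichever model it was launched with
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    # placeholder URL; swap in any reachable image
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```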

It's worth noting that the current MMMU run (without LoRA) is lower than advertised because Phi4MM relies on LoRA for full image understanding capabilities. However, LoRA support requires refactoring / generalizing the existing SGL LoRA handling, which will hopefully be addressed in this separate PR: #6585
For example, image understanding is degraded without LoRA (MMMU is only 38); for comparison, in our local branch (#6585) with LoRA, MMMU is boosted to ~50.

TODO in this PR:
TODO in follow-up PR (ordered by priority):
Tracked in #6544