[Feature] Phi-4-MM support #6544

Open
1 of 1 issue completed
@lifuhuang

Description

Update

Currently we have added text & vision support.

Repeated MMMU benchmark runs score between 53.6 and 55.5, consistent with the benchmark reported in the original paper (55).

Known limitations (see the Execution Plan below for the full list):

  1. Audio capabilities: currently we do not support audio at all.
  2. LoRA / Image quality: Phi4MM depends on LoRA for full image capability, but there were compatibility issues with the native SGL LoRA solution. These have been fixed by refactoring / generalizing SGL's LoRA capabilities (Refactor LoRA handling to support adapter tensors in fused format #6585, Fix incorrect LoRA weight loading for fused gate_up_proj #6734, Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug #6861).
  3. Image token: Phi4MM supports two image-token conventions (<|image1|> and <|endoftext10|>); currently we only support the latter. If you use the default chat template, it will automatically pick the supported one.
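For prompts built by hand rather than through the default chat template, the unsupported convention can be rewritten to the supported one before sending the request. This is a minimal sketch, not part of SGL itself; the helper name and the assumption that only <|image1|> needs rewriting are illustrative:

```python
# Hypothetical helper: rewrite the unsupported Phi4MM image-token
# convention (<|image1|>) into the supported one (<|endoftext10|>).
# The default chat template already does this selection for you.

SUPPORTED_IMAGE_TOKEN = "<|endoftext10|>"
UNSUPPORTED_IMAGE_TOKEN = "<|image1|>"


def normalize_image_tokens(prompt: str) -> str:
    """Replace the unsupported image-token convention with the supported one."""
    return prompt.replace(UNSUPPORTED_IMAGE_TOKEN, SUPPORTED_IMAGE_TOKEN)


prompt = "<|image1|>\nDescribe this picture."
print(normalize_image_tokens(prompt))
# → <|endoftext10|>
#   Describe this picture.
```

If you rely on the default chat template, no such preprocessing is needed.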

Motivation

Supporting the Phi4 Multimodal model (https://huggingface.co/microsoft/Phi-4-multimodal-instruct) in SGL.

Execution Plan:

Related resources

No response
