[Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation #16878
Conversation
Signed-off-by: imkero <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Thank you for your bug fix! I will take a look at it.
I will update the test snippet to use the original implementations from transformers and vLLM (instead of the currently hand-copied code). The PR huggingface/transformers#37631 has been merged, so we can consider it the ground truth for video-with-audio now.
Merging this to unblock followup PRs.
@fyabc Please take a look when you have time and let us know if you find any problem!
@WoosukKwon @imkero Sorry for the late reply. This bug fix is correct and consistent with transformers; thank you very much for the fix!
What this PR does
This PR fixes the result of `MRotaryEmbedding::_omni_get_input_positions_tensor` when `use_audio_in_video=True` and there is an image or a video in the input.
Root cause
The main branch code incorrectly handles the position shift among `vision_bos`, `audio_bos`, `audio_eos` and `vision_eos`, causing a 1-position delta between the HF output and the vLLM output. Specifically, when `use_audio_in_video=True`:

- The position id shift should only apply to video, but the current code applies it regardless of whether `vision_end` belongs to an image or a video
  (vllm/vllm/model_executor/layers/rotary_embedding.py, Lines 1226 to 1234 in 205d84a)
- `audio_bos` & `audio_eos` are handled twice when `use_audio_in_video=True`
  (vllm/vllm/model_executor/layers/rotary_embedding.py, Lines 1226 to 1234 in 205d84a; Lines 1271 to 1296 in 205d84a)
Comparison to the ground truth
The current vLLM implementation produces a different result from the HuggingFace transformers implementation; the following snippet runs these tests:
https://gist.github.com/imkero/0377d5cc175bf068af237465a861dc3d
Case 1
`use_audio_in_video=True` and an image
Case 2
`use_audio_in_video=True` and a video
NOTE: the HuggingFace transformers main branch code currently has another bug in it and will not produce a reliable ground truth; see huggingface/transformers#37631.
The following result was generated by the transformers code in huggingface/transformers#37631.
For a detailed explanation of the input and output, please refer to https://github.com/huggingface/transformers/blob/b16b93d6857cc1a3d318d3f9ae9957413a6e976c/tests/models/qwen2_5_omni/test_modeling_qwen2_5_omni.py#L384-L510
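As a sketch of what such a comparison can look like (a hypothetical helper, not the gist's actual code): M-RoPE position ids form a `(3, seq_len)` tensor (temporal/height/width axes), so one can diff the HF and vLLM tensors elementwise and report where they disagree.

```python
import torch

def compare_position_ids(hf_pos, vllm_pos):
    """Return the (axis, token_index) pairs where two M-RoPE
    position-id tensors of shape (3, seq_len) differ."""
    if hf_pos.shape != vllm_pos.shape:
        raise ValueError(f"shape mismatch: {hf_pos.shape} vs {vllm_pos.shape}")
    return (hf_pos != vllm_pos).nonzero(as_tuple=False)

# Example with a simulated 1-position delta on the temporal axis:
hf = torch.arange(12).reshape(3, 4)
vllm = hf.clone()
vllm[0, 2:] += 1  # simulate a +1 shift after a special token
print(compare_position_ids(hf, vllm).tolist())  # [[0, 2], [0, 3]]
```

An empty result means the two implementations agree, which is what the gist verifies against the transformers ground truth.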
cc @ywang96 @fyabc