[Model][VLM] Add Qwen2.5-Omni model support (thinker only) #15130
Merged (+1,852 −82)
Changes shown are from 33 commits.

Commits (39):
- 732ec71 Initial commit for Qwen2.5-Omni (thinker only). (fyabc)
- 30d8ac6 update doc and typing (fyabc)
- f8668bf Merge branch 'refs/heads/main' into qwen2_omni_public_v1 (fyabc)
- 3108e36 fix typing error (fyabc)
- 42ab3b7 fix typing (fyabc)
- 139d305 fix typing (fyabc)
- 04fe220 fix bug in multi-audio merging (fyabc)
- 7bbd238 add more examples (fyabc)
- d4a61c9 adapt for transformers update (fyabc)
- d40f54f fix bug of 'use_audio_in_video' (fyabc)
- a6f878e Merge branch 'main' into qwen2_omni_public_v1 (ywang96)
- d3eb60d precommit (ywang96)
- 0cd5aa8 update V1 interface (ywang96)
- 98226ad add TODO (ywang96)
- 6b4c705 Update docs/source/models/supported_models.md (ywang96)
- 286e755 assert VLLM_USE_V1=0 audio in video example (ywang96)
- 9cf9d26 adapt for transformers PR (fyabc)
- 53501f3 multiple fixes (ywang96)
- 512f874 squeeze only one dimension (ywang96)
- 9c984d0 fix squeezing (ywang96)
- 3908518 minor refactoring (ywang96)
- 864accf precommit (ywang96)
- adc5cdf Merge branch 'main' into qwen2_omni_public_v1 (ywang96)
- 7108ba3 reformat (fyabc)
- 71f96e4 add omni to chat utils (ywang96)
- ebd8b88 fix model type (ywang96)
- 512bd41 fix typo (ywang96)
- 68004d8 Merge remote-tracking branch 'upstream/main' into qwen2_omni_public_v1 (ywang96)
- 1dac918 Merge branch 'refs/heads/main' into qwen2_omni_public_v1 (fyabc)
- 1b0bf89 Fix vision attention qkv (fyabc)
- f5def01 fix hard code (fyabc)
- 753858b fix hidden_size (fyabc)
- ed6dca1 Merge branch 'refs/heads/main' into qwen2_omni_public_v1 (fyabc)
- 976fbf0 fix tests (fyabc)
- f726ac8 refactor dummy inputs builder (fyabc)
- e168e09 Update qwen2_5_omni_thinker.py (wangxiongts)
- 7994068 Update qwen2_5_omni_thinker.py (wangxiongts)
- 41c5855 fix test registry (fyabc)
- d1e2046 Merge remote-tracking branch 'origin/qwen2_omni_public_v1' into qwen2… (fyabc)
@@ -0,0 +1,32 @@
# Qwen2.5-Omni Offline Inference Examples

This folder provides several example scripts showing how to run offline inference with Qwen2.5-Omni.

## Thinker Only

```bash
# Audio + image + video
python examples/offline_inference/qwen2_5_omni/only_thinker.py -q mixed_modalities

# Read vision and audio inputs from a single video file
# NOTE: V1 engine does not support interleaved modalities yet.
VLLM_USE_V1=0 python examples/offline_inference/qwen2_5_omni/only_thinker.py -q use_audio_in_video

# Multiple audios
VLLM_USE_V1=0 python examples/offline_inference/qwen2_5_omni/only_thinker.py -q multi_audios
```

This script runs the thinker part of Qwen2.5-Omni and generates a text response.

You can also test Qwen2.5-Omni on a single modality:

```bash
# Process audio inputs
python examples/offline_inference/audio_language.py --model-type qwen2_5_omni

# Process image inputs
python examples/offline_inference/vision_language.py --modality image --model-type qwen2_5_omni

# Process video inputs
python examples/offline_inference/vision_language.py --modality video --model-type qwen2_5_omni
```
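The `use_audio_in_video` and `multi_audios` commands above prefix the run with `VLLM_USE_V1=0` because the V1 engine does not support those paths yet. As a minimal sketch (not part of this PR, and assuming the variable is read when the engine is constructed), the same switch can be applied from Python by setting the environment variable before vLLM picks an engine:

```python
import os

# Mirror the `VLLM_USE_V1=0` shell prefix used above; this must happen
# before vLLM selects an engine, otherwise it may have no effect.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

# Model name and limits follow the example script in this PR.
llm = LLM(model="Qwen/Qwen2.5-Omni-7B", max_model_len=5632, max_num_seqs=5)
```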
examples/offline_inference/qwen2_5_omni/only_thinker.py (160 additions, 0 deletions)
@@ -0,0 +1,160 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use vLLM for running offline inference
with the correct prompt format on Qwen2.5-Omni (thinker only).
"""

from typing import NamedTuple

import vllm.envs as envs
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.assets.image import ImageAsset
from vllm.assets.video import VideoAsset
from vllm.utils import FlexibleArgumentParser


class QueryResult(NamedTuple):
    inputs: dict
    limit_mm_per_prompt: dict[str, int]


# NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
# lower-end GPUs.
# Unless specified, these settings have been tested to work on a single L4.

default_system = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba "
    "Group, capable of perceiving auditory and visual inputs, as well as "
    "generating text and speech.")


def get_mixed_modalities_query() -> QueryResult:
    question = ("What is recited in the audio? "
                "What is the content of this image? Why is this video funny?")
    prompt = (f"<|im_start|>system\n{default_system}<|im_end|>\n"
              "<|im_start|>user\n<|audio_bos|><|AUDIO|><|audio_eos|>"
              "<|vision_bos|><|IMAGE|><|vision_eos|>"
              "<|vision_bos|><|VIDEO|><|vision_eos|>"
              f"{question}<|im_end|>\n"
              f"<|im_start|>assistant\n")
    return QueryResult(
        inputs={
            "prompt": prompt,
            "multi_modal_data": {
                "audio":
                AudioAsset("mary_had_lamb").audio_and_sample_rate,
                "image":
                ImageAsset("cherry_blossom").pil_image.convert("RGB"),
                "video":
                VideoAsset(name="sample_demo_1.mp4",
                           num_frames=16).np_ndarrays,
            },
        },
        limit_mm_per_prompt={
            "audio": 1,
            "image": 1,
            "video": 1
        },
    )


def get_use_audio_in_video_query() -> QueryResult:
    question = ("Describe the content of the video, "
                "then convert what the baby say into text.")
    prompt = (f"<|im_start|>system\n{default_system}<|im_end|>\n"
              "<|im_start|>user\n<|vision_bos|><|VIDEO|><|vision_eos|>"
              f"{question}<|im_end|>\n"
              f"<|im_start|>assistant\n")
    asset = VideoAsset(name="sample_demo_1.mp4", num_frames=16)
    audio = asset.get_audio(sampling_rate=16000)
    assert not envs.VLLM_USE_V1, ("V1 does not support use_audio_in_video. "
                                  "Please launch this example with "
                                  "`VLLM_USE_V1=0`.")
    return QueryResult(
        inputs={
            "prompt": prompt,
            "multi_modal_data": {
                "video": asset.np_ndarrays,
                "audio": audio,
            },
            "mm_processor_kwargs": {
                "use_audio_in_video": True,
            },
        },
        limit_mm_per_prompt={
            "audio": 1,
            "video": 1
        },
    )


def get_multi_audios_query() -> QueryResult:
    question = "Are these two audio clips the same?"
    prompt = (f"<|im_start|>system\n{default_system}<|im_end|>\n"
              "<|im_start|>user\n<|audio_bos|><|AUDIO|><|audio_eos|>"
              "<|audio_bos|><|AUDIO|><|audio_eos|>"
              f"{question}<|im_end|>\n"
              f"<|im_start|>assistant\n")
    return QueryResult(
        inputs={
            "prompt": prompt,
            "multi_modal_data": {
                "audio": [
                    AudioAsset("winning_call").audio_and_sample_rate,
                    AudioAsset("mary_had_lamb").audio_and_sample_rate,
                ],
            },
        },
        limit_mm_per_prompt={
            "audio": 2,
        },
    )


query_map = {
    "mixed_modalities": get_mixed_modalities_query,
    "use_audio_in_video": get_use_audio_in_video_query,
    "multi_audios": get_multi_audios_query,
}


def main(args):
    model_name = "Qwen/Qwen2.5-Omni-7B"
    query_result = query_map[args.query_type]()

    llm = LLM(model=model_name,
              max_model_len=5632,
              max_num_seqs=5,
              limit_mm_per_prompt=query_result.limit_mm_per_prompt,
              seed=args.seed)

    # We set temperature to 0.2 so that outputs can be different
    # even when all prompts are identical when running batch inference.
    sampling_params = SamplingParams(temperature=0.2, max_tokens=64)

    outputs = llm.generate(query_result.inputs,
                           sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    parser = FlexibleArgumentParser(
        description='Demo on using vLLM for offline inference with '
        'audio language models')
    parser.add_argument('--query-type',
                        '-q',
                        type=str,
                        default="mixed_modalities",
                        choices=query_map.keys(),
                        help='Query type.')
    parser.add_argument("--seed",
                        type=int,
                        default=None,
                        help="Set the seed when initializing `vllm.LLM`.")

    args = parser.parse_args()
    main(args)
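For reference, adding another query type follows the same `QueryResult` pattern used above. The sketch below is illustrative only (the `image_only` name and prompt are not part of this PR) and assumes it is placed next to the other query builders in `only_thinker.py`, before the argument parser reads `query_map.keys()`, so that `default_system`, `ImageAsset`, and `QueryResult` are in scope:

```python
def get_image_only_query() -> QueryResult:
    # Single-image query reusing the thinker prompt format from this example.
    question = "What is the content of this image?"
    prompt = (f"<|im_start|>system\n{default_system}<|im_end|>\n"
              "<|im_start|>user\n<|vision_bos|><|IMAGE|><|vision_eos|>"
              f"{question}<|im_end|>\n"
              f"<|im_start|>assistant\n")
    return QueryResult(
        inputs={
            "prompt": prompt,
            "multi_modal_data": {
                "image": ImageAsset("cherry_blossom").pil_image.convert("RGB"),
            },
        },
        limit_mm_per_prompt={"image": 1},
    )


# Hypothetical registration; the script could then be invoked with
# `-q image_only`.
query_map["image_only"] = get_image_only_query
```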