model: support gemma-3-it #4424


Merged
1 commit merged into sgl-project:main on Mar 17, 2025

Conversation

@mickqian (Collaborator) commented Mar 14, 2025

Motivation

Support gemma3-it.

FYI, gemma3-1b-it is a text-only LLM, and the gemma3-pt series are not chat models.

Modifications

Checklist

for image_index, (image, estimated_frames) in enumerate(
    zip(image_data, estimated_frames_list)
):
    if len(all_frames) >= MAX_NUM_FRAMES:
Collaborator

I think the base image processor doesn't need to change?
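
For context on the hunk quoted above: it is part of a frame-budgeting loop that caps the total number of sampled frames across all inputs. Below is a minimal, self-contained sketch of that pattern; MAX_NUM_FRAMES and the loop shape mirror the hunk, while decode_frames, the cap value, and the usage at the end are illustrative assumptions, not the actual sglang implementation.

MAX_NUM_FRAMES = 30  # assumed cap, for illustration only

def decode_frames(image, estimated_frames):
    # Stand-in for real video decoding: pretend each input yields
    # `estimated_frames` frames.
    return [f"{image}#frame{i}" for i in range(estimated_frames)]

def sample_frames(image_data, estimated_frames_list):
    all_frames = []
    for image_index, (image, estimated_frames) in enumerate(
        zip(image_data, estimated_frames_list)
    ):
        if len(all_frames) >= MAX_NUM_FRAMES:
            break  # overall frame budget already spent
        frames = decode_frames(image, estimated_frames)
        # Never exceed the global budget, even within a single input.
        all_frames.extend(frames[: MAX_NUM_FRAMES - len(all_frames)])
    return all_frames

print(len(sample_frames(["a.mp4", "b.mp4"], [20, 25])))  # -> 30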

@mickqian force-pushed the gemma branch 4 times, most recently from f266005 to 70ae46c (March 14, 2025 13:08)
@mickqian

This comment was marked as resolved.

@mickqian force-pushed the gemma branch 3 times, most recently from 062911b to 7936216 (March 14, 2025 15:40)
@zhaochenyang20 (Collaborator)

@mickqian @yizhang2077 is this ready to merge? approved?

@yizhang2077 (Collaborator) commented Mar 15, 2025

Typically in PRs when adding models, accuracy is reported, for example, on MMLU implemented in the PR compared to the implementation in transformers, and for multimodal models, some multimodal benchmark. Could you advise how you verified the implementation?

@Swipe4057 we have an MMMU benchmark, and for each VLM we need to verify this benchmark and compare against the transformers implementation. Besides, if transformers supports the model, we suggest adding a unit test that compares logits with HF.
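
For reference, a minimal sketch of the kind of logits comparison described above, using the text-only gemma3-1b-it variant against the Hugging Face implementation. get_sglang_logits is a hypothetical placeholder for however the SGLang side exposes next-token logits; it is not a real API, and the prompt and model id are only examples.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # text-only variant mentioned in this PR

def get_sglang_logits(model_id, prompt):
    # Hypothetical placeholder: wire this to however the SGLang side exposes
    # next-token logits (e.g. an engine/debug hook). Not a real sglang API.
    raise NotImplementedError

tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    hf_logits = hf_model(input_ids).logits[0, -1]  # next-token logits from HF

sglang_logits = torch.as_tensor(get_sglang_logits(model_id, prompt))

# The two implementations should agree within numerical tolerance.
print(torch.max(torch.abs(hf_logits - sglang_logits)))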

@yizhang2077 (Collaborator)

@mickqian please paste the MMMU benchmark results here.

@yizhang2077 (Collaborator)

I added an issue to keep track of current VLM models' performance on the MMMU benchmark. We can update benchmark results in #4456. @mickqian @zhaochenyang20

text_parts = input_text.split(image_token)
import re

pattern = "(" + "|".join(re.escape(sep) for sep in [image_token]) + ")"
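
For context on the hunk above, the capturing group in the pattern is what makes the regex split useful: re.split with a capturing group keeps the separator (the image token) as its own element in the result, whereas str.split drops it. A small illustration, with an example token string (the real token comes from the processor configuration):

import re

image_token = "<image>"  # example token, not the actual gemma-3 token string
input_text = "describe <image> briefly"

# str.split discards the separator:
print(input_text.split(image_token))   # ['describe ', ' briefly']

# re.split with a capturing group keeps it as a separate element:
pattern = "(" + "|".join(re.escape(sep) for sep in [image_token]) + ")"
print(re.split(pattern, input_text))   # ['describe ', '<image>', ' briefly']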
Collaborator

Why do we need to add a regex here?

@mickqian force-pushed the gemma branch 3 times, most recently from e6ac032 to 7b94f77 (March 16, 2025 03:53)
Co-authored-by: Yuhao Chen <[email protected]>
@yizhang2077 (Collaborator)

@zhaochenyang20 this PR can be merged

@mickqian (Collaborator, Author)

@zhaochenyang20 This is ready. Many thanks

@zhaochenyang20 (Collaborator)

@mickqian @yizhang2077 thanks. I will tell Lianmin!

@zhaochenyang20 (Collaborator)

@Ying1123 hey Ying, this can be merged. It's high priority.

@zhaochenyang20 merged commit 9d02bb3 into sgl-project:main on Mar 17, 2025
21 checks passed
@AkazaAkane mentioned this pull request on Mar 17, 2025
@Swipe4057 (Contributor) commented Mar 19, 2025

@zhaochenyang20 @mickqian Hello, I've been running the Gemma 3 27B it model in vLLM 0.8.0 and in sglang installed from source, with the original weights, on an H100 GPU. For the same long text-only query, the outputs of the two differ significantly. In vLLM, generation proceeds normally, but in sglang, long generations degrade into garbage output and continue indefinitely. This happens with long queries and with queries containing code. Could someone else test this as well?

Also, a question: is prefix caching supported in sglang for multimodal models, particularly Gemma 3?

@zhaochenyang20 (Collaborator)

Prefix caching is supported. Could you give us a reproducible script? We will fix this ASAP.

@Swipe4057 (Contributor)

@zhaochenyang20
Here's my launch command: [screenshot]

Then ask a simple question:
How would you advise fixing high vulnerabilities in a Docker container that are found in the base Debian image?
Temperature 1 and top_p 0.95.

You'll see that the generation doesn't stop and something like this will begin: [screenshot]

In the logs, the generation continues: [screenshot]
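
Since the launch command and outputs above survive only as screenshots, here is a hedged sketch of a comparable repro; the launch command, model path, and port are assumptions, not the reporter's exact setup, and only the sampling settings (temperature 1, top_p 0.95) come from the report.

# Assumed server launch (not the reporter's exact command):
#   python -m sglang.launch_server --model-path google/gemma-3-27b-it --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": "How would you advise fixing high vulnerabilities in a Docker "
                   "container that are found in the base Debian image?",
    }],
    temperature=1.0,
    top_p=0.95,
)
print(resp.choices[0].message.content)  # watch for non-terminating, degraded output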

@zhaochenyang20 (Collaborator)


cc @mickqian

@mickqian (Collaborator, Author) commented Mar 20, 2025

A fix is on the way.

@rangehow commented Apr 2, 2025

Gemma3's generation speed is surprisingly slow compared to other 3B/4B models like Qwen2.5-3B. Is the current Gemma3 implementation correct?

@zhaochenyang20 (Collaborator)

great!
