[WIP] add vlm cache and support chunk prefill for vlm #5456

Closed · wants to merge 1 commit from vlm-support-chunk-prefill

Conversation

@yizhang2077 (Collaborator) commented on Apr 16, 2025

Motivation

  1. Add a multimodal cache for the VLM encoder to avoid repeated computation during chunked prefill; in the future this cache can also be used to save encoder embeddings across different requests.
  2. Use the prefix length, the extend sequence length, and each multimodal item's begin-end offsets to compute the VLM encoder embedding needed by each request's current chunk (see the sketch after this list).
  3. Enable chunked prefill for VLMs (some older VLMs remain limited).
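Item 2 is essentially interval arithmetic: a prefill chunk covers token positions [prefix_len, prefix_len + extend_len), and any multimodal item whose [begin, end) span overlaps that window contributes only the overlapping slice of its encoder embedding, while the cache ensures the vision encoder runs at most once per item across chunks. Below is a minimal sketch of that idea; the names (MultimodalEmbeddingCache, MultimodalItem, items_for_chunk, encoder_fn) are illustrative assumptions, not the actual SGLang APIs.

```python
# Illustrative sketch only; not the code in this PR.
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class MultimodalItem:
    # Hypothetical fields: token span [begin, end) of this image/video item
    # inside the request's input sequence, plus a key for embedding lookup.
    begin: int
    end: int
    cache_key: str


class MultimodalEmbeddingCache:
    """Stores encoder outputs so later chunks never re-run the vision encoder."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        return self._store.get(key)

    def put(self, key: str, embedding: Any) -> None:
        self._store[key] = embedding


def items_for_chunk(
    items: List[MultimodalItem], prefix_len: int, extend_len: int
) -> List[Tuple[MultimodalItem, int, int]]:
    """Return items overlapping the chunk [prefix_len, prefix_len + extend_len),
    with the overlapping [start, end) slice inside each item's own embedding."""
    chunk_start, chunk_end = prefix_len, prefix_len + extend_len
    selected = []
    for item in items:
        start = max(item.begin, chunk_start)
        end = min(item.end, chunk_end)
        if start < end:  # the item overlaps this chunk
            selected.append((item, start - item.begin, end - item.begin))
    return selected


def embeddings_for_chunk(cache, encoder_fn, items, prefix_len, extend_len):
    """Fetch (or compute and cache) each item's embedding, then slice it to the chunk."""
    slices = []
    for item, s, e in items_for_chunk(items, prefix_len, extend_len):
        emb = cache.get(item.cache_key)
        if emb is None:
            emb = encoder_fn(item)          # run the vision encoder once per item
            cache.put(item.cache_key, emb)  # reuse on later chunks of this request
        slices.append(emb[s:e])             # keep only the tokens inside this chunk
    return slices
```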

TODO:

  1. Requires vlm: enable radix cache for qwen-vl models #5349 to be merged first.
  2. Adapt more VLMs.
  3. Add more tests.

Modifications

Checklist

@ch-wan linked an issue on Apr 16, 2025 that may be closed by this pull request
@yizhang2077 mentioned this pull request on Apr 16, 2025
@zhyncs closed this on Apr 21, 2025
@zhyncs deleted the vlm-support-chunk-prefill branch on Apr 21, 2025 at 00:17
Development

Successfully merging this pull request may close these issues.

[Feature] support and turn on chunked prefill by default for VLM
3 participants