
[V1][Performance] Implement custom serialization for MultiModalKwargs [Rebased] #16432


Merged
31 commits merged on Apr 17, 2025

Conversation

p88h
Contributor

@p88h p88h commented Apr 10, 2025

FIX #16185 (link existing issues this PR will resolve)

This is a rebase of #16279 which had too entangled commits.
Implements additional handling of MultimodalKwargs on top of #13790
Further improves memory usage on top of improvements in #16273 by another 50%
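The custom serialization described above can be illustrated with a small standalone sketch. This is not vLLM's actual code: `encode_kwargs`, `decode_kwargs`, and the use of `array.array` as a stand-in for `torch.Tensor` are illustrative assumptions. The idea is that tensor-like leaves in the nested kwargs are replaced by small placeholders while their raw buffers are collected separately, so they can travel out-of-band (and potentially zero-copy) between processes.

```python
# Minimal sketch (not vLLM's actual implementation) of out-of-band
# tensor encoding for nested multimodal kwargs. array.array stands in
# for torch.Tensor so the example stays stdlib-only.
from array import array

def encode_kwargs(obj, buffers):
    """Replace array leaves with placeholders; collect raw buffers aside."""
    if isinstance(obj, array):
        buffers.append(obj.tobytes())
        return {"__tensor__": True,
                "typecode": obj.typecode,
                "index": len(buffers) - 1}
    if isinstance(obj, dict):
        return {k: encode_kwargs(v, buffers) for k, v in obj.items()}
    if isinstance(obj, list):
        return [encode_kwargs(v, buffers) for v in obj]
    return obj  # ints/floats/strings pass through unchanged

def decode_kwargs(obj, buffers):
    """Inverse of encode_kwargs: rebuild array leaves from raw buffers."""
    if isinstance(obj, dict):
        if obj.get("__tensor__"):
            t = array(obj["typecode"])
            t.frombytes(buffers[obj["index"]])
            return t
        return {k: decode_kwargs(v, buffers) for k, v in obj.items()}
    if isinstance(obj, list):
        return [decode_kwargs(v, buffers) for v in obj]
    return obj
```

Keeping the raw buffers out of the metadata structure is what lets a transport layer ship them without an extra copy, which is the gist of the memory savings discussed in this PR.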


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 10, 2025
@p88h p88h force-pushed the serialize-multimodal-kwargs branch from 3268c77 to 43d87ec Compare April 10, 2025 21:15
@p88h p88h force-pushed the serialize-multimodal-kwargs branch from 43d87ec to f4832a7 Compare April 10, 2025 21:41
@ywang96
Member

ywang96 commented Apr 10, 2025

@p88h This is amazing! Have you tried running some benchmarks to see the throughput performance impact of this PR?

@p88h
Contributor Author

p88h commented Apr 10, 2025

@ywang96 I've added a benchmark table to the linked bug #16185

My benchmark focused on memory performance rather than throughput, and only used a single model. It shouldn't change throughput much, though, except in cases that actually run into memory issues.

I'll try running some throughput checks tomorrow

Member

@njhill njhill left a comment


Thanks @p88h! I think this looks good.

The main thing I think is to add custom serialization for the field. And we'll probably want to add a few more comments since it's tightly coupled with the custom tensor encoding format.

Also, I haven't looked closely at the entire flow, but in the case of MMKs created from items, it might make sense to defer the population of their data (via the "reduce" operations). Since that will be repeated in the receiving process and causes extra cpu and mem overhead since tensors may get stacked etc. It would be nice if there was a way for this to happen lazily but I guess that depends on how the data is later accessed.

cc @ywang96 @DarkLight1337

@njhill
Member

njhill commented Apr 11, 2025

> Also, I haven't looked closely at the entire flow, but in the case of MMKs created from items, it might make sense to defer the population of their data (via the "reduce" operations). Since that will be repeated in the receiving process and causes extra cpu and mem overhead since tensors may get stacked etc. It would be nice if there was a way for this to happen lazily but I guess that depends on how the data is later accessed.

FYI I've opened another PR to help with this: #16440. It should in theory help all of the cases not just the multi-proc case.

It would still be additionally beneficial to postpone doing this reduce operation until after being transferred to the engine though.
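The deferral idea could look roughly like the following sketch. The `LazyStacked` class is hypothetical (plain lists stand in for tensors and `torch.stack`): the receiving process keeps the per-item parts as received and only combines them on first access, so the reduce cost is paid once, after transfer, and only if the combined form is actually needed.

```python
# Hypothetical sketch of deferring the "reduce" (stacking) step until
# after IPC transfer. Lists stand in for tensors; the combine step
# stands in for torch.stack.
class LazyStacked:
    def __init__(self, parts):
        self._parts = parts      # per-item pieces, as received
        self._combined = None    # built on demand, cached afterwards

    def get(self):
        if self._combined is None:
            # Stand-in for the real reduce (e.g. torch.stack):
            # flatten the per-item lists into one combined list.
            self._combined = [x for part in self._parts for x in part]
        return self._combined
```

Whether this helps in practice depends, as noted above, on how the data is accessed downstream: if every consumer immediately needs the stacked form, laziness only moves the cost rather than removing it.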

@p88h
Contributor Author

p88h commented Apr 11, 2025

I have some experimental data with this PR in place. Interestingly, it performs much better with zero-copy disabled.

In this new benchmark, I am feeding gradually increasing document sets to the engine. It turns out custom serialization helps less than expected; I think previously it was augmented by the cache, but now all files are unique, so the results are a bit different.

The 'mix' performance case measures running all prompts together (15 total, with 128 images total) after they have been initially processed one-by-one, so it's expected that it's performing much better / cached.

config / benchmark case       | 4 images | 8 images | 16 images | 32 images | t.max | t.mix
------------------------------+----------+----------+-----------+-----------+-------+-------
baseline (zero-copy disabled) | 3.55 GB  | 5.11 GB  | 9.96 GB   | 22.54 GB  | 90.4s | 44.1s
baseline (zero-copy enabled)  | 3.50 GB  | 5.01 GB  | 9.87 GB   | 22.56 GB  | 75.3s | 39.4s
#16432 (zero-copy enabled)    | 3.40 GB  | 4.75 GB  | 8.53 GB   | 22.02 GB  | 13.8s | 36.1s
#16432 (zero-copy disabled)   | 3.28 GB  | 3.95 GB  | 4.76 GB   | 5.85 GB   | 14.4s | 36.3s

@p88h p88h force-pushed the serialize-multimodal-kwargs branch from d56435a to 408f36b Compare April 11, 2025 12:03
@mergify mergify bot added documentation Improvements or additions to documentation ci/build tpu Related to Google TPUs labels Apr 11, 2025
p88h and others added 4 commits April 11, 2025 14:04
In addition to serializing base Tensors, this now allows Tensors embedded
in MultiModalKwargs to be passed correctly.

Handles both V0 and V1 style args.

Improves memory usage with large multimodal payloads by a further
50% (but still not on par with single-threaded behavior).

Signed-off-by: Staszek Pasko <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
@p88h p88h force-pushed the serialize-multimodal-kwargs branch from 408f36b to 6641584 Compare April 11, 2025 12:05
@mergify mergify bot removed the tpu Related to Google TPUs label Apr 11, 2025
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Apr 11, 2025
@p88h p88h requested a review from njhill April 16, 2025 09:39
Signed-off-by: Staszek Pasko <[email protected]>
@p88h p88h requested a review from njhill April 16, 2025 15:00
Member

@njhill njhill left a comment


Thanks for the great work @p88h!

@njhill njhill added ready ONLY add when PR is ready to merge/full CI is needed performance Performance-related issues labels Apr 16, 2025
@p88h p88h force-pushed the serialize-multimodal-kwargs branch from 1f2779a to 48ab2d9 Compare April 16, 2025 19:35
@njhill
Member

njhill commented Apr 16, 2025

Looks like a CI test is failing - but unfortunately the root cause is obscured (the OOM failure of the subsequent test is a result of improper cleanup after the original failure). This should hopefully be addressed by #11737.

In the meantime I can try running this test locally.

P.S. There's no need to keep rebasing on latest main; this just causes all the tests to start over.

@njhill
Member

njhill commented Apr 16, 2025

It turns out it was because sometimes MMKwargs can contain non-tensor data (specifically "second_per_grid_ts": [1.0] in this case). So I pushed an update to allow floats and ints too.
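A minimal sketch of that fix (hypothetical `encode_leaf` helper, not the actual vLLM encoder hook): plain scalars pass through unchanged, tensor-like objects are flattened to raw bytes, and anything else is rejected explicitly rather than failing obscurely.

```python
# Sketch of the fix described above: the per-leaf encoder originally
# assumed every value was a tensor; allowing plain floats and ints
# (e.g. "second_per_grid_ts": [1.0]) avoids the failure. Hypothetical
# helper, not vLLM's actual code.
def encode_leaf(value):
    if isinstance(value, (int, float, bool, str)):
        return value  # plain scalars pass through as-is
    if hasattr(value, "tobytes"):  # tensor-like: ship raw bytes
        return {"__raw__": value.tobytes()}
    raise TypeError(f"unsupported multimodal leaf type: {type(value)!r}")
```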

@njhill njhill merged commit 3092375 into vllm-project:main Apr 17, 2025
42 checks passed
@p88h
Contributor Author

p88h commented Apr 17, 2025

Thank you! I was about to go back to debugging this morning ;)

lionelvillard pushed a commit to lionelvillard/vllm that referenced this pull request Apr 17, 2025
…[Rebased] (vllm-project#16432)

Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
@p88h p88h deleted the serialize-multimodal-kwargs branch April 18, 2025 20:22
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Labels
ci/build · documentation (Improvements or additions to documentation) · multi-modality (Related to multi-modality (#4194)) · performance (Performance-related issues) · ready (ONLY add when PR is ready to merge/full CI is needed) · v1
Development

Successfully merging this pull request may close these issues.

[Bug]: Huge memory overhead with V1 (multiprocessing) when handling several multimodal inputs
4 participants