
[V1][Performance] Implement custom serialization for MultiModalKwargs [Rebased] #16432


Merged
31 commits merged on Apr 17, 2025

Conversation

p88h
Contributor

@p88h p88h commented Apr 10, 2025

FIX #16185 (link existing issues this PR will resolve)

This is a rebase of #16279 which had too entangled commits.
Implements additional handling of MultimodalKwargs on top of #13790
Further improves memory usage on top of improvements in #16273 by another 50%
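The custom serialization described above can be illustrated with a small standalone sketch. This is not vLLM's actual code: `encode_kwargs`, `decode_kwargs`, and the use of `array.array` as a stand-in for `torch.Tensor` are illustrative assumptions. The idea is that tensor-like leaves in the nested kwargs are replaced by small placeholders while their raw buffers are collected separately, so they can travel out-of-band (and potentially zero-copy) between processes.

```python
# Minimal sketch (not vLLM's actual implementation) of out-of-band
# tensor encoding for nested multimodal kwargs. array.array stands in
# for torch.Tensor so the example stays stdlib-only.
from array import array

def encode_kwargs(obj, buffers):
    """Replace array leaves with placeholders; collect raw buffers aside."""
    if isinstance(obj, array):
        buffers.append(obj.tobytes())
        return {"__tensor__": True,
                "typecode": obj.typecode,
                "index": len(buffers) - 1}
    if isinstance(obj, dict):
        return {k: encode_kwargs(v, buffers) for k, v in obj.items()}
    if isinstance(obj, list):
        return [encode_kwargs(v, buffers) for v in obj]
    return obj  # ints/floats/strings pass through unchanged

def decode_kwargs(obj, buffers):
    """Inverse of encode_kwargs: rebuild array leaves from raw buffers."""
    if isinstance(obj, dict):
        if obj.get("__tensor__"):
            t = array(obj["typecode"])
            t.frombytes(buffers[obj["index"]])
            return t
        return {k: decode_kwargs(v, buffers) for k, v in obj.items()}
    if isinstance(obj, list):
        return [decode_kwargs(v, buffers) for v in obj]
    return obj
```

Keeping the raw buffers out of the metadata structure is what lets a transport layer ship them without an extra copy, which is the gist of the memory savings discussed in this PR.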


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 10, 2025
@p88h p88h force-pushed the serialize-multimodal-kwargs branch from 3268c77 to 43d87ec Compare April 10, 2025 21:15
@p88h p88h force-pushed the serialize-multimodal-kwargs branch from 43d87ec to f4832a7 Compare April 10, 2025 21:41
@ywang96
Member

ywang96 commented Apr 10, 2025

@p88h This is amazing! Have you tried running some benchmarks to see the throughput performance impact of this PR?

@p88h
Contributor Author

p88h commented Apr 10, 2025

@ywang96 I've added a benchmark table to the linked bug #16185

My benchmark focused on memory performance rather than throughput, and only used a single model. It shouldn't change throughput much, though, except in cases that actually run into memory issues.

I'll try running some throughput checks tomorrow

Member

@njhill njhill left a comment


Thanks @p88h! I think this looks good.

The main thing I think is to add custom serialization for the field. And we'll probably want to add a few more comments since it's tightly coupled with the custom tensor encoding format.

Also, I haven't looked closely at the entire flow, but in the case of MMKs created from items, it might make sense to defer the population of their data (via the "reduce" operations). Since that will be repeated in the receiving process and causes extra cpu and mem overhead since tensors may get stacked etc. It would be nice if there was a way for this to happen lazily but I guess that depends on how the data is later accessed.

cc @ywang96 @DarkLight1337

@njhill
Member

njhill commented Apr 11, 2025

> Also, I haven't looked closely at the entire flow, but in the case of MMKs created from items, it might make sense to defer the population of their data (via the "reduce" operations). Since that will be repeated in the receiving process and causes extra cpu and mem overhead since tensors may get stacked etc. It would be nice if there was a way for this to happen lazily but I guess that depends on how the data is later accessed.

FYI I've opened another PR to help with this: #16440. It should in theory help all of the cases not just the multi-proc case.

It would still be additionally beneficial to postpone doing this reduce operation until after being transferred to the engine though.
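The deferral idea could look roughly like the following sketch. The `LazyStacked` class is hypothetical (plain lists stand in for tensors and `torch.stack`): the receiving process keeps the per-item parts as received and only combines them on first access, so the reduce cost is paid once, after transfer, and only if the combined form is actually needed.

```python
# Hypothetical sketch of deferring the "reduce" (stacking) step until
# after IPC transfer. Lists stand in for tensors; the combine step
# stands in for torch.stack.
class LazyStacked:
    def __init__(self, parts):
        self._parts = parts      # per-item pieces, as received
        self._combined = None    # built on demand, cached afterwards

    def get(self):
        if self._combined is None:
            # Stand-in for the real reduce (e.g. torch.stack):
            # flatten the per-item lists into one combined list.
            self._combined = [x for part in self._parts for x in part]
        return self._combined
```

Whether this helps in practice depends, as noted above, on how the data is accessed downstream: if every consumer immediately needs the stacked form, laziness only moves the cost rather than removing it.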

@p88h
Contributor Author

p88h commented Apr 11, 2025

I have some experimental data with this PR in place. Interestingly, it performs much better with zero-copy disabled.

In this new benchmark, I am feeding gradually increasing document sets to the engine. It turns out custom serialization helps less than expected; I think previously it was augmented by the cache, but now all files are unique, so the results are a bit different.

The 'mix' performance case measures running all prompts together (15 total, with 128 images total) after they have been initially processed one-by-one, so it's expected that it's performing much better / cached.

config / benchmark case       | 4 images | 8 images | 16 images | 32 images | t.max | t.mix
------------------------------+----------+----------+-----------+-----------+-------+-------
baseline (zero-copy disabled) | 3.55 GB  | 5.11 GB  | 9.96 GB   | 22.54 GB  | 90.4s | 44.1s
baseline (zero-copy enabled)  | 3.50 GB  | 5.01 GB  | 9.87 GB   | 22.56 GB  | 75.3s | 39.4s
#16432 (zero-copy enabled)    | 3.40 GB  | 4.75 GB  | 8.53 GB   | 22.02 GB  | 13.8s | 36.1s
#16432 (zero-copy disabled)   | 3.28 GB  | 3.95 GB  | 4.76 GB   | 5.85 GB   | 14.4s | 36.3s

@p88h p88h force-pushed the serialize-multimodal-kwargs branch from d56435a to 408f36b Compare April 11, 2025 12:03
@mergify mergify bot added documentation Improvements or additions to documentation ci/build tpu Related to Google TPUs labels Apr 11, 2025
p88h and others added 4 commits April 11, 2025 14:04
In addition to serializing base Tensors, this now allows Tensors embedded
in MultiModalKwargs to be passed correctly.

Handles both V0 and V1 style args.

Improves memory usage with large multimodal payloads by a further
50% (but still not on par with single-threaded behavior).

Signed-off-by: Staszek Pasko <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
@p88h p88h force-pushed the serialize-multimodal-kwargs branch from 408f36b to 6641584 Compare April 11, 2025 12:05
@mergify mergify bot removed the tpu Related to Google TPUs label Apr 11, 2025
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Apr 11, 2025
@p88h p88h requested a review from njhill April 16, 2025 09:39
Signed-off-by: Staszek Pasko <[email protected]>
@p88h p88h requested a review from njhill April 16, 2025 15:00
Member

@njhill njhill left a comment


Thanks for the great work @p88h!

@njhill njhill added ready ONLY add when PR is ready to merge/full CI is needed performance Performance-related issues labels Apr 16, 2025
@p88h p88h force-pushed the serialize-multimodal-kwargs branch from 1f2779a to 48ab2d9 Compare April 16, 2025 19:35
@njhill
Member

njhill commented Apr 16, 2025

Looks like a CI test is failing - but unfortunately the root cause is obscured (the OOM failure of the subsequent test is a result of improper cleanup after the original failure). This should hopefully be addressed by #11737.

In the meantime I can try running this test locally.

P.S. There's no need to keep rebasing on latest main; this just causes all the tests to start over.

@njhill
Member

njhill commented Apr 16, 2025

It turns out it was because sometimes MMKwargs can contain non-tensor data (specifically "second_per_grid_ts": [1.0] in this case). So I pushed an update to allow floats and ints too.
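A minimal sketch of that fix (hypothetical `encode_leaf` helper, not the actual vLLM encoder hook): plain scalars pass through unchanged, tensor-like objects are flattened to raw bytes, and anything else is rejected explicitly rather than failing obscurely.

```python
# Sketch of the fix described above: the per-leaf encoder originally
# assumed every value was a tensor; allowing plain floats and ints
# (e.g. "second_per_grid_ts": [1.0]) avoids the failure. Hypothetical
# helper, not vLLM's actual code.
def encode_leaf(value):
    if isinstance(value, (int, float, bool, str)):
        return value  # plain scalars pass through as-is
    if hasattr(value, "tobytes"):  # tensor-like: ship raw bytes
        return {"__raw__": value.tobytes()}
    raise TypeError(f"unsupported multimodal leaf type: {type(value)!r}")
```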

@njhill njhill merged commit 3092375 into vllm-project:main Apr 17, 2025
42 checks passed
@p88h
Contributor Author

p88h commented Apr 17, 2025

Thank you! I was about to go back to debugging this morning ;)

lionelvillard pushed a commit to lionelvillard/vllm that referenced this pull request Apr 17, 2025
…[Rebased] (vllm-project#16432)

Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
@p88h p88h deleted the serialize-multimodal-kwargs branch April 18, 2025 20:22
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Labels
ci/build · documentation (Improvements or additions to documentation) · multi-modality (Related to multi-modality (#4194)) · performance (Performance-related issues) · ready (ONLY add when PR is ready to merge/full CI is needed) · v1
Development

Successfully merging this pull request may close these issues.

[Bug]: Huge memory overhead with V1 (multiprocessing) when handling several multimodal inputs
4 participants