[V1][Performance] Implement custom serialization for MultiModalKwargs [Rebased] #16432
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, they only run … Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Force-pushed 3268c77 to 43d87ec (compare)
Force-pushed 43d87ec to f4832a7 (compare)
@p88h This is amazing! Have you tried running some benchmarks to see the throughput impact of this PR?
@ywang96 I've added a benchmark table to the linked bug #16185. My benchmark focused on memory performance rather than throughput, and only used a single model. It should not really change throughput much, though, other than in cases that do run into memory issues. I'll try running some throughput checks tomorrow.
Thanks @p88h! I think this looks good.
The main thing, I think, is to add custom serialization for the field. And we'll probably want to add a few more comments, since it's tightly coupled with the custom tensor encoding format.
Also, I haven't looked closely at the entire flow, but in the case of MMKs created from items, it might make sense to defer the population of their data (via the "reduce" operations), since that would otherwise be repeated in the receiving process and cause extra CPU and memory overhead (tensors may get stacked, etc.). It would be nice if there was a way for this to happen lazily, but I guess that depends on how the data is later accessed.
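The custom tensor encoding discussed above can be sketched roughly as follows. This is illustrative only, not vLLM's actual implementation: `array.array` stands in for `torch.Tensor`, and the `encode_value`/`decode_value` names and the `"__tensor__"` tag are hypothetical.

```python
# Illustrative sketch: recursively encode nested kwargs so tensor-like
# payloads travel as (tag, dtype, raw bytes) while plain values pass through.
import array


def encode_value(value):
    """Recursively encode dicts/lists; turn array.array into a tagged tuple."""
    if isinstance(value, array.array):
        # Stand-in for torch.Tensor -> (tag, dtype code, raw buffer).
        return ("__tensor__", value.typecode, value.tobytes())
    if isinstance(value, dict):
        return {k: encode_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [encode_value(v) for v in value]
    return value  # non-tensor leaves (ints, strings, ...) pass through


def decode_value(value):
    """Inverse of encode_value: rebuild tensor-like leaves from raw bytes."""
    if isinstance(value, tuple) and len(value) == 3 and value[0] == "__tensor__":
        _, typecode, raw = value
        out = array.array(typecode)
        out.frombytes(raw)
        return out
    if isinstance(value, dict):
        return {k: decode_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decode_value(v) for v in value]
    return value
```

A real implementation would plug hooks like these into the msgpack encoder/decoder pair used for inter-process messages, so the raw buffers can be sent without an extra copy.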
FYI I've opened another PR to help with this: #16440. It should in theory help all of the cases, not just the multi-proc case. It would still be additionally beneficial to postpone the reduce operation until after the transfer to the engine, though.
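Deferring the reduce until after transfer, as suggested above, could look roughly like this. The class name and shape of the data are hypothetical; plain lists stand in for per-item tensors, and the dict-of-lists merge stands in for `torch.stack`/`torch.cat`.

```python
# Hypothetical sketch: defer the expensive "reduce" (stacking per-item data)
# until the merged view is actually needed on the receiving side.
class LazyMultiModalKwargs:
    def __init__(self, items):
        self._items = items    # per-item dicts, cheap to transfer as-is
        self._reduced = None   # merged view, populated on first access

    def _reduce(self):
        # Stand-in for stacking/concatenating per-item tensors by key.
        merged = {}
        for item in self._items:
            for key, value in item.items():
                merged.setdefault(key, []).append(value)
        return merged

    @property
    def data(self):
        # Pay the CPU/memory cost of reducing only once, and only if asked.
        if self._reduced is None:
            self._reduced = self._reduce()
        return self._reduced
```

With this shape, the sending process serializes only the cheap per-item payloads, and the receiving process pays the stacking cost once, on first access.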
I have some experimental data with this PR in place. Interestingly, it performs much better with zero-copy disabled. In this new benchmark, I am feeding gradually increasing document sets to the engine. It turns out custom serialization helps less than expected; I think previously it was augmented by the cache, but now all files are unique, so results are a bit different. The 'mix' performance case measures running all prompts together (15 total, with 128 images total) after they have been initially processed one-by-one, so it's expected to perform much better / be cached.
Force-pushed d56435a to 408f36b (compare)
In addition to serializing base Tensors, this now allows passing Tensors embedded in MultiModalKwargs correctly. Handles both V0- and V1-style args. Improves memory usage with large multimodal payloads by a further 50% (but still not on par with single-threaded behavior). Signed-off-by: Staszek Pasko <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
Force-pushed 408f36b to 6641584 (compare)
Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
Thanks for the great work @p88h!
Force-pushed 1f2779a to 48ab2d9 (compare)
Looks like a CI test is failing, but unfortunately the root cause is obscured (the OOM failure of the subsequent test is a result of improper cleanup after the original failure). This should hopefully be addressed by #11737. In the meantime I can try running this test locally. P.S. There's no need to keep rebasing on latest main; this just causes all the tests to start over.
Signed-off-by: Nick Hill <[email protected]>
It turns out it was because sometimes MMKwargs can contain non-tensor data (specifically …)
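The failure mode described here, where a serializer assumes every leaf is a tensor and breaks on plain values, suggests a fallback branch for unknown types. A minimal sketch, with hypothetical hook names, `memoryview` standing in for `torch.Tensor`, and a generic pickle payload as the catch-all:

```python
# Sketch: encode tensor-like buffers compactly, and fall back to a generic
# pickle payload for any non-tensor data instead of failing on it.
import pickle


def enc_hook(obj):
    """Encode one leaf value for transport."""
    if isinstance(obj, memoryview):            # stand-in for torch.Tensor
        return ("__tensor__", obj.tobytes())   # compact raw-bytes encoding
    return ("__pickle__", pickle.dumps(obj))   # e.g. field metadata objects


def dec_hook(obj):
    """Decode one leaf value on the receiving side."""
    tag, payload = obj
    if tag == "__tensor__":
        return memoryview(payload)
    return pickle.loads(payload)
```

Hook pairs like these are typically registered with the message encoder/decoder so that only the types the fast path doesn't know about pay the pickle cost.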
Thank you! I was about to go back to debugging this morning ;)
…[Rebased] (vllm-project#16432) Signed-off-by: Staszek Pasko <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Nick Hill <[email protected]>
…[Rebased] (vllm-project#16432) Signed-off-by: Staszek Pasko <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Yang Wang <[email protected]>
…[Rebased] (vllm-project#16432) Signed-off-by: Staszek Pasko <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Agata Dobrzyniewicz <[email protected]>
…[Rebased] (vllm-project#16432) Signed-off-by: Staszek Pasko <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Mu Huai <[email protected]>
FIX #16185
This is a rebase of #16279, whose commits had become too entangled.
Implements additional handling of MultiModalKwargs on top of #13790.
Further improves memory usage, on top of the improvements in #16273, by another 50%.