# Support both paged and non-paged attention #162
Conversation
This PR is the result of running both `uv lock`, to upgrade all the dependencies in the lockfile, and additionally `uv lock --upgrade-package aiohttp`, because at least one platform in the lockfile had a yanked version locked. This updates the locked version of vllm to match our pyproject.toml, from 0.8.5 to 0.9.0.1. It does mean that all of our unit tests will start using 0.9.0.1 by default, so technically we won't be testing compatibility with 0.9.0. I'm not too concerned about that at the moment, though, since we aren't guaranteeing backwards compatibility yet, and I'd rather have our docker builds and other locked installs with uv provide the latest released version of vllm.
### [CB] refactor left padding removal
- calls the function `reduce_left_padding` at every step (prefill and decode)
- removes the dependency on cached requests
- adjusts the CB tests to cover exactly that case
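As a rough illustration of the idea only (this is not the actual vllm-spyre code), a helper like `reduce_left_padding` could trim the pad columns shared by every sequence in the batch; the signature, tensor layout, and pad-token handling below are assumptions based on the commit description above.

```python
import torch


def reduce_left_padding(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Toy sketch: drop leading columns that are padding for every sequence."""
    non_pad = (input_ids != pad_token_id).any(dim=0)  # per column: any real token?
    if not bool(non_pad.any()):
        return input_ids  # batch is all padding; nothing to trim
    first = int(non_pad.nonzero()[0])  # first column with at least one real token
    return input_ids[:, first:]


# Per the commit above, the trimming happens at every step (prefill and decode),
# not only when requests are newly cached.
batch = torch.tensor([[0, 0, 5, 6],
                      [0, 7, 8, 9]])
print(reduce_left_padding(batch, pad_token_id=0))
# tensor([[0, 5, 6],
#         [7, 8, 9]])
```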
The warmup context needs to capture both prefill and decode. This PR fixes that issue.
Fixing DCO and getting this ready to merge with fms 1.1
bot:test
tests/e2e/test_spyre_cb.py (Outdated)
new test incoming - test_cb_max_tokens
- We will need to uncomment the v1 marker on this test.
I had to go over the test matrix several times to see which tests get run in which grouping. Maybe we can simplify those in the future.
LGTM
vllm_spyre/v1/worker/spyre_worker.py (Outdated)
```diff
@@ -63,7 +63,8 @@ def compile_or_warm_up_model(self) -> None:
         """Prepare model for execution through compilation/warmup."""
         # TO DO: implement warmup for continuous batching
```
> `# TO DO: implement warmup for continuous batching`

Is this comment still valid? Besides warming up with dynamic sizes, what else has to be done?
no, should be fine, I will remove it
```diff
@@ -89,8 +90,10 @@ def compile_or_warm_up_model(self) -> None:
             logger.info(
                 "Warming up for prompt length %d, decoding %d tokens with "
                 "batch size %d", prompt_len, num_decode_tokens, batch_size)
-            self._warmup_spyre_fixed_size(prompt_len, num_decode_tokens,
-                                          self.restricted_tokens, batch_size)
+            with _maybe_warmup_context():
```
Maybe the `with _maybe_warmup_context()` can be used one level up, before the for loop, instead of importing `warmup_mode` from torch_sendnn for each warmup shape. Maybe not that important since we usually don't warm up for many shapes.
I think it needs to be done on each shape here so that the context closes and runs `update_lazyhandle()` after each shape, right?
Aah ok, you are probably right, I don't really understand what is happening with that context to be honest. Btw, I was looking for `update_lazyhandle` in the code to understand, and couldn't find it, even with a general "find" in the directory. When did that disappear?
The call to `update_lazyhandle` is now hidden in the `warmup_mode` context provided by torch_sendnn, such that it gets called when the context exits. Use of `warmup_mode` was part of the change to remove the deprecated `sendnn_decoder` backend in #186.

So you'd have to look in the torch_sendnn code, but the `warmup_mode` context is basically just:
```python
class warmup_mode:
    def __init__(self, enabled: bool = True):
        global _warmup_mode
        self._old_mode = _warmup_mode
        _warmup_mode = enabled

    def __enter__(self):
        return None

    def __exit__(self, exc_type, exc_val, exc_tb):
        global _warmup_mode
        update_lazyhandle()
        _warmup_mode = self._old_mode
```
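To make the discussion above concrete, here is a minimal sketch of a `_maybe_warmup_context` helper entered once per warmup shape, so that `update_lazyhandle()` runs each time the context exits. Only the names `_maybe_warmup_context`, `warmup_mode`, and the torch_sendnn import come from the thread above; the fallback behavior and the example shapes are assumptions, not the actual vllm-spyre implementation.

```python
from contextlib import contextmanager, nullcontext


@contextmanager
def _maybe_warmup_context():
    # Assumption: use torch_sendnn's warmup_mode when the backend is installed,
    # otherwise fall back to a no-op context (e.g. for CPU-only test runs).
    try:
        from torch_sendnn import warmup_mode  # import path as referenced above
        ctx = warmup_mode()
    except ImportError:
        ctx = nullcontext()
    with ctx:
        yield


# Entering the context inside the loop means update_lazyhandle() runs after
# every shape (when warmup_mode exits), which is why it is not simply hoisted
# above the for loop. The shapes below are made-up example values.
warmup_shapes = [(64, 20, 1), (128, 20, 4)]
for prompt_len, num_decode_tokens, batch_size in warmup_shapes:
    with _maybe_warmup_context():
        pass  # placeholder for the per-shape warmup call
```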
Merging now that we have driver changes to support this 🎉
Nice to have that finally in main!
Final FMS API supporting paged and non-paged attention
This final FMS API will support both the static and continuous batching APIs. We are currently consuming this fms branch. Only minimal changes are required on the vLLM Spyre side.

Note: do not merge until this is in fms main.