
Support both paged and non-paged attention #162


Merged
31 commits merged into main from ysc-final-fms-api on Jun 18, 2025

Conversation

yannicks1
Collaborator

Final FMS API supporting paged and non-paged attention

This final FMS API supports both the static and continuous batching APIs. We are currently consuming this fms branch, and only minimal changes are required on the vLLM Spyre side.

Note: Do not merge until in fms main.

Signed-off-by: Yannick Schnider <[email protected]>

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

```
uv sync --frozen --group lint
bash format.sh
```

Or this can be done with pip:

```
uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh
```

Now you are good to go 🚀

yannicks1 and others added 8 commits June 13, 2025 14:43
Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
This PR is the result of running both
```
uv lock
```
to upgrade all the dependencies in the lockfile, and additionally
```
uv lock --upgrade-package aiohttp
```
because there was at least one platform in the lockfile that had a
yanked version locked.

This updates the locked version of vllm to match our pyproject.toml, from 0.8.5 to 0.9.0.1. This does mean that all of our unit tests will start using 0.9.0.1 by default, so technically we won't be testing compatibility with 0.9.0. I'm not too concerned at the moment, though, because we aren't guaranteeing backwards compatibility yet, and I'd rather have our docker builds and other locked installs with uv provide the latest released version of vllm.

Signed-off-by: Joe Runde <[email protected]>
### [CB] refactor left padding removal

- calls the function `reduce_left_padding` at every step (prefill and decode)
- removes the dependency on cached requests
- adjusts the CB tests to cover that exact case (a rough sketch of the padding-removal idea follows after this commit message)

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
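A rough sketch of the padding-removal idea referenced above (the function name, tensor layout, and block granularity are assumptions for illustration, not the actual vLLM Spyre implementation): only the left padding shared by every sequence in the batch can be trimmed, and doing it on every step keeps it independent of any cached request state.

```python
import torch

def reduce_left_padding_sketch(input_ids: torch.Tensor,
                               mask: torch.Tensor,
                               block: int = 64) -> tuple[torch.Tensor, torch.Tensor]:
    """Drop the left-padding columns shared by the whole batch.

    `mask` is (batch, seq_len) with 0 marking padding; removal is rounded
    down to a multiple of `block` to keep shapes aligned.
    """
    # number of leading pad positions in each row
    leading_pads = (mask.ne(0).long().cumsum(dim=-1) == 0).sum(dim=-1)
    # only padding common to every sequence can be removed
    removable = int(leading_pads.min().item())
    removable = (removable // block) * block
    if removable == 0:
        return input_ids, mask
    return input_ids[:, removable:], mask[:, removable:]
```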
Warmup context needs to capture both prefill/decode. This PR fixes that
issue.

Signed-off-by: Joshua Rosenkranz <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
joerunde force-pushed the ysc-final-fms-api branch from bf1a1a3 to cd06756 on June 13, 2025 20:44
@joerunde
Collaborator

Fixing DCO and getting this ready to merge with fms 1.1

@joerunde
Collaborator

bot:test
MARKERS="cb and spyre"

Signed-off-by: Joe Runde <[email protected]>
@joerunde
Collaborator

bot:test
MARKERS="cb and spyre"

joerunde marked this pull request as ready for review on June 13, 2025 22:16
@joerunde
Collaborator

bot:test
TEST_FILE=tests/e2e/test_spyre_online.py

@prashantgupta24
Collaborator

bot:test
TEST_FILE=tests/e2e/test_spyre_basic.py
MARKERS="v1 and spyre"

joerunde changed the title from "[do not merge][SB/CB] Final FMS API supporting paged and non-paged attention" to "Support both paged and non-paged attention" on Jun 13, 2025
Collaborator


new test incoming - test_cb_max_tokens - We will need to uncomment the v1 marker on this test.

Collaborator

ckadner left a comment


I had to go over the test matrix several times and look at what tests get run in which grouping. Maybe we can simplify those in the future.

Collaborator

sducouedic left a comment


LGTM

```
@@ -63,7 +63,8 @@ def compile_or_warm_up_model(self) -> None:
        """Prepare model for execution through compilation/warmup."""
        # TO DO: implement warmup for continuous batching
```
Collaborator


# TO DO: implement warmup for continuous batching

Is this comment still valid? Besides warming up with dynamic sizes, what has to be done?

Collaborator Author


no, should be fine, I will remove it

```
@@ -89,8 +90,10 @@ def compile_or_warm_up_model(self) -> None:
            logger.info(
                "Warming up for prompt length %d, decoding %d tokens with "
                "batch size %d", prompt_len, num_decode_tokens, batch_size)
            self._warmup_spyre_fixed_size(prompt_len, num_decode_tokens,
                                          self.restricted_tokens, batch_size)
            with _maybe_warmup_context():
```
Collaborator


Maybe the `with _maybe_warmup_context()` can be used one level up, before the for loop, instead of importing warmup_mode from torch_sendnn for each warmup shape. Maybe not that important, since we usually don't warm up for many shapes.

Collaborator


I think it needs to be done on each shape here so that the context closes and runs update_lazyhandle() after each shape, right?
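To make the per-shape point concrete, here is a toy sketch (all names are stand-ins for illustration, not the torch_sendnn or vllm-spyre code): entering and exiting the context once per warmup shape triggers one handle update per shape, whereas hoisting the context above the loop would trigger only one update at the very end.

```python
from contextlib import contextmanager

calls = []

def update_lazyhandle():
    # stand-in for the torch_sendnn call that runs when warmup mode exits
    calls.append("update_lazyhandle")

@contextmanager
def maybe_warmup_context():
    # stand-in for _maybe_warmup_context(): the handle update happens on exit
    try:
        yield
    finally:
        update_lazyhandle()

warmup_shapes = [(64, 20, 1), (128, 20, 4)]  # (prompt_len, new_tokens, batch)

for prompt_len, new_tokens, batch_size in warmup_shapes:
    with maybe_warmup_context():
        pass  # per-shape warmup would run here

# one handle update per shape, which is the point made in this thread
assert calls == ["update_lazyhandle"] * len(warmup_shapes)
```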

Collaborator


Aah ok, you are probably right; I don't really understand what is happening with that context, to be honest.
Btw, I was looking for update_lazyhandle in the code to understand it, and couldn't find it, even with a general "find" in the directory. When did that disappear?

Collaborator


The call to update_lazyhandle is now hidden in the warmup_mode context provided by torch_sendnn such that it gets called when the context exits. Use of warmup_mode was part of the change to remove the deprecated sendnn_decoder backend in #186.

So you'd have to look in the torch_sendnn code, but the warmup_mode context is basically just:

```python
class warmup_mode:
    def __init__(self, enabled: bool = True):
        # remember the previous value and flip the module-level flag
        global _warmup_mode
        self._old_mode = _warmup_mode
        _warmup_mode = enabled

    def __enter__(self):
        return None

    def __exit__(self, exc_type, exc_val, exc_tb):
        # on exit, run update_lazyhandle() and restore the previous flag
        global _warmup_mode
        update_lazyhandle()
        _warmup_mode = self._old_mode
```
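For completeness, usage is simply the following (the import path is an assumption based on "provided by torch_sendnn" above, and run_first_inference is a placeholder):

```python
from torch_sendnn import warmup_mode  # assumed import path

with warmup_mode():
    run_first_inference()  # placeholder: compilation/tracing happens here
# update_lazyhandle() has now been called and the previous flag restored
```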

@joerunde
Collaborator

Merging now that we have driver changes to support this 🎉

joerunde merged commit eeedaf4 into main on Jun 18, 2025
18 checks passed
joerunde deleted the ysc-final-fms-api branch on June 18, 2025 17:06
@yannicks1
Collaborator Author

Nice to have that finally in main!

joerunde restored the ysc-final-fms-api branch on June 19, 2025 16:39