# Support both paged and non-paged attention #162
Conversation
This PR is the result of running both `uv lock`, to upgrade all the dependencies in the lockfile, and additionally `uv lock --upgrade-package aiohttp`, because at least one platform in the lockfile had a yanked version locked. This updates the locked version of vllm to match our pyproject.toml, from 0.8.5 to 0.9.0.1. It does mean that all of our unit tests will start using 0.9.0.1 by default, so technically we won't be testing compatibility with 0.9.0. I'm not too concerned about that at the moment, though, since we aren't guaranteeing backwards compatibility yet, and I'd rather have our docker builds and other locked installs with uv provide the latest released version of vllm.
### [CB] refactor left padding removal
- calls the function `reduce_left_padding` at every step (prefill and decode)
- removes the dependency on cached requests
- adjusts the CB tests to cover exactly that case
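As a rough illustration of the idea only (this is not the actual vllm-spyre code), a helper like `reduce_left_padding` could trim the pad columns shared by every sequence in the batch; the signature, tensor layout, and pad-token handling below are assumptions based on the commit description above.

```python
import torch


def reduce_left_padding(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Toy sketch: drop leading columns that are padding for every sequence."""
    non_pad = (input_ids != pad_token_id).any(dim=0)  # per column: any real token?
    if not bool(non_pad.any()):
        return input_ids  # batch is all padding; nothing to trim
    first = int(non_pad.nonzero()[0])  # first column with at least one real token
    return input_ids[:, first:]


# Per the commit above, the trimming happens at every step (prefill and decode),
# not only when requests are newly cached.
batch = torch.tensor([[0, 0, 5, 6],
                      [0, 7, 8, 9]])
print(reduce_left_padding(batch, pad_token_id=0))
# tensor([[0, 5, 6],
#         [7, 8, 9]])
```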
The warmup context needs to capture both prefill and decode. This PR fixes that issue.
Fixing DCO and getting this ready to merge with fms 1.1
bot:test
tests/e2e/test_spyre_cb.py (Outdated)
new test incoming - test_cb_max_tokens
- We will need to uncomment the v1 marker on this test.
I had to go over the test matrix several times to see which tests get run in which grouping. Maybe we can simplify those in the future.
LGTM
vllm_spyre/v1/worker/spyre_worker.py (Outdated)
```diff
@@ -63,7 +63,8 @@ def compile_or_warm_up_model(self) -> None:
         """Prepare model for execution through compilation/warmup."""
         # TO DO: implement warmup for continuous batching
```
> `# TO DO: implement warmup for continuous batching`

Is this comment still valid? Besides warming up with dynamic sizes, what else has to be done?
no, should be fine, I will remove it
```diff
@@ -89,8 +90,10 @@ def compile_or_warm_up_model(self) -> None:
             logger.info(
                 "Warming up for prompt length %d, decoding %d tokens with "
                 "batch size %d", prompt_len, num_decode_tokens, batch_size)
-            self._warmup_spyre_fixed_size(prompt_len, num_decode_tokens,
-                                          self.restricted_tokens, batch_size)
+            with _maybe_warmup_context():
```
Maybe the `with _maybe_warmup_context()` can be used one level up, before the for loop, instead of importing `warmup_mode` from torch_sendnn for each warmup shape. Maybe not that important since we usually don't warm up for many shapes.
I think it needs to be done on each shape here so that the context closes and runs `update_lazyhandle()` after each shape, right?
Aah ok, you are probably right, I don't really understand what is happening with that context to be honest. Btw, I was looking for `update_lazyhandle` in the code to understand, and couldn't find it, even with a general "find" in the directory. When did that disappear?
The call to `update_lazyhandle` is now hidden in the `warmup_mode` context provided by torch_sendnn, such that it gets called when the context exits. Use of `warmup_mode` was part of the change to remove the deprecated `sendnn_decoder` backend in #186.

So you'd have to look in the torch_sendnn code, but the `warmup_mode` context is basically just:
```python
class warmup_mode:
    def __init__(self, enabled: bool = True):
        global _warmup_mode
        self._old_mode = _warmup_mode
        _warmup_mode = enabled

    def __enter__(self):
        return None

    def __exit__(self, exc_type, exc_val, exc_tb):
        global _warmup_mode
        update_lazyhandle()
        _warmup_mode = self._old_mode
```
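To make the discussion above concrete, here is a minimal sketch of a `_maybe_warmup_context` helper entered once per warmup shape, so that `update_lazyhandle()` runs each time the context exits. Only the names `_maybe_warmup_context`, `warmup_mode`, and the torch_sendnn import come from the thread above; the fallback behavior and the example shapes are assumptions, not the actual vllm-spyre implementation.

```python
from contextlib import contextmanager, nullcontext


@contextmanager
def _maybe_warmup_context():
    # Assumption: use torch_sendnn's warmup_mode when the backend is installed,
    # otherwise fall back to a no-op context (e.g. for CPU-only test runs).
    try:
        from torch_sendnn import warmup_mode  # import path as referenced above
        ctx = warmup_mode()
    except ImportError:
        ctx = nullcontext()
    with ctx:
        yield


# Entering the context inside the loop means update_lazyhandle() runs after
# every shape (when warmup_mode exits), which is why it is not simply hoisted
# above the for loop. The shapes below are made-up example values.
warmup_shapes = [(64, 20, 1), (128, 20, 4)]
for prompt_len, num_decode_tokens, batch_size in warmup_shapes:
    with _maybe_warmup_context():
        pass  # placeholder for the per-shape warmup call
```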
Merging now that we have driver changes to support this 🎉
Nice to have that finally in main!
Final FMS API supporting paged and non-paged attention
This final FMS API will support both the static and continuous batching APIs. We are currently consuming this fms branch. Only minimal changes are required on the vLLM Spyre side.

Note: do not merge until this is in fms main.