[Attention] Update to latest FA3 code #13111
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@khluu can we run the perf CI on this? Would be nice to check for regressions since there are a lot of FA changes.
Force-pushed from 61a411e to c65ddce.
    if has_context:
        if not current_platform.is_cuda():
            raise NotImplementedError(
                "Chunked Prefill for MLA is not currently supported on "
                "non-cuda platforms")
-       output = self.flash_attn_varlen_func(
+       output = self.flash_attn_varlen_diff_headdims(
Do we need to handle the fact that flash_attn_varlen_diff_headdims returns both output and *rest in this case?
We did, but the return was slicing a tensor.
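For reference, here is a minimal sketch of the pad-then-slice pattern such a wrapper can use when the underlying kernel requires q, k and v to share a head dim. The helper name, import, and return handling below are illustrative assumptions, not the exact vLLM implementation:

```python
# Illustrative sketch only: pad v up to the q/k head dim, run the kernel,
# then slice the padding back off the output tensor.
import torch
from flash_attn import flash_attn_varlen_func  # assumed import


def varlen_attn_diff_headdims(q, k, v, return_softmax_lse=False, **kwargs):
    # q, k: [total_tokens, num_heads, head_dim_qk]
    # v:    [total_tokens, num_heads, head_dim_v], head_dim_v <= head_dim_qk
    head_dim_v = v.shape[-1]
    pad = q.shape[-1] - head_dim_v
    if pad > 0:
        v = torch.nn.functional.pad(v, [0, pad], value=0.0)

    out = flash_attn_varlen_func(
        q, k, v, return_attn_probs=return_softmax_lse, **kwargs)

    if isinstance(out, tuple):
        # The kernel returned (output, lse, ...): slice only the output
        # tensor and pass the rest through unchanged.
        output, *rest = out
        return (output[..., :head_dim_v], *rest)
    return out[..., :head_dim_v]
```

Under these assumptions, the caller can keep unpacking output and *rest; the slicing happens inside the wrapper before the values are returned.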
LGTM
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from c69b970 to d0ce155.
        causal=True,
        return_attn_probs=has_context,
    )
    if has_context and not current_platform.is_cuda():
Why does this not work for ROCm?
Sorry, this was cruft from an earlier Slack discussion that cast doubt on whether return_softmax_lse was supported on ROCm.
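For background on why the softmax LSE is requested when there is context: chunked prefill computes attention over the cached context and over the new tokens separately, and the two partial results are then merged using their log-sum-exp values. Below is a minimal PyTorch sketch of that standard merge; the function name and tensor shapes are illustrative assumptions, and vLLM's actual implementation may differ (e.g. a fused kernel):

```python
# Illustrative sketch only: merge two partial attention outputs computed over
# disjoint key/value partitions (e.g. cached context vs. new tokens) using
# their softmax log-sum-exp (LSE) values. Assumed shapes:
#   o*:   [num_tokens, num_heads, head_dim]
#   lse*: [num_tokens, num_heads]
import torch


def merge_attn_outputs(o1, lse1, o2, lse2):
    # Combined log-sum-exp over both partitions.
    lse = torch.logaddexp(lse1, lse2)
    # Rescale each partial output by its share of the total softmax mass.
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)
    return w1 * o1 + w2 * o2
```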
Force-pushed from eb958e7 to fcc54d8.
Hi, @LucasWilkinson @simon-mo. This PR uses a new function, so it seems a new release of vllm-flash-attn will be needed.
https://pypi.org/project/vllm-flash-attn/ It's been a long time since the last release.
We currently ship …
Hi, @LucasWilkinson. Does the latest FA3 still fail on Lovelace GPUs due to shared memory limits for some shapes?
NOTE: Tested MLA on AMD: V0 is working; V1 is broken, but it is also broken on main.
Perf: https://docs.google.com/spreadsheets/d/1U5lsoCKuWq99Cz1QbWkc0dBn1bij1Ifb3tphE2UXJj0/edit?usp=sharing
Main: …
This PR: …
See accuracy drops for: … (due to the dynamic split scheduler).