[FEAT][ROCm] Integrate Paged Attention Kernel from AITER #15001
Conversation
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
…nImmp Signed-off-by: vllmellm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
…opriate paged attention module Signed-off-by: vllmellm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: tjtanaa <[email protected]> Signed-off-by: vllmellm <[email protected]>
…granite model test as well Signed-off-by: vllmellm <[email protected]>
Can you post lm_eval results for the main models that this kernel supports?
@@ -15,6 +15,7 @@
                                             CommonMetadataBuilder)
 from vllm.attention.ops.paged_attn import (PagedAttention,
                                            PagedAttentionMetadata)
+from vllm.attention.ops.rocm_aiter_paged_attn import AITERPagedAttention
Does this always attempt to import AITER even if it's disabled?
@SageMoore Are you suggesting to move this line to line 50 after checking whether it is enabled?
@SageMoore @hongxiayang AITER is now imported only when the flag is set.
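For readers following along, here is a minimal sketch of the deferred-import pattern discussed in this thread. Only the import path and the flag names come from the diff and discussion above; the helper function itself is illustrative, not the PR's exact code.

```python
import vllm.envs as envs


def _get_paged_attn_impl():
    # Only touch the AITER module when both ROCm AITER flags are enabled,
    # so environments without AITER installed never attempt the import.
    if envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_PAGED_ATTN:
        from vllm.attention.ops.rocm_aiter_paged_attn import (
            AITERPagedAttention)
        return AITERPagedAttention
    from vllm.attention.ops.paged_attn import PagedAttention
    return PagedAttention
```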
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: vllmellm <[email protected]>
…gration Signed-off-by: vllmellm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: tjtanaa <[email protected]>
@sunway513 @hongxiayang We have just updated the PR with lm_eval and performance values for …
Hi @SageMoore, @tjtanaa has updated the description, included the lm_eval results, and addressed the review feedback. Can you please review again at your earliest convenience? As you already know, this is blocking the decommissioning of the ROCm fork. Thanks a lot.
Great! @gshtras, who can help expedite the review?
Signed-off-by: vllmellm <[email protected]>
cc @DarkLight1337 Can you help expedite the review and merge of this PR?
Does the AITER kernel support fused output quantization?
@vllmellm please fix the pre-commit
Signed-off-by: vllmellm <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: vllmellm <[email protected]>
…gration Signed-off-by: vllmellm <[email protected]>
…t#15001) Signed-off-by: vllmellm <[email protected]> Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]> Signed-off-by: Frieda (Jingying) Huang <[email protected]>
…t#15001) Signed-off-by: vllmellm <[email protected]> Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]>
…t#15001) Signed-off-by: vllmellm <[email protected]> Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]>
…t#15001) Signed-off-by: vllmellm <[email protected]> Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]> Signed-off-by: Agata Dobrzyniewicz <[email protected]>
…t#15001) Signed-off-by: vllmellm <[email protected]> Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]> Signed-off-by: Mu Huai <[email protected]>
…t#15001) Signed-off-by: vllmellm <[email protected]> Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]> Signed-off-by: minpeter <[email protected]>
This PR integrates the Paged Attention kernel from AITER (AI Tensor Engine for ROCm).

The pa_fwd_asm kernel from AITER is added as a new paged attention op in /vllm/attention/ops/rocm_aiter_paged_attn.py and wired into the ROCm attention backend in /vllm/attention/backends/rocm_flash_attn.py. This feature is disabled by default, even when the parent switch (VLLM_ROCM_USE_AITER=1) is enabled. To use this kernel, both the parent switch and its dedicated environment variable VLLM_ROCM_USE_AITER_PAGED_ATTN must be enabled.
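For example, a minimal offline-inference sketch with both switches enabled. This is not taken from the PR: the model name is a placeholder, and it assumes a ROCm build of vLLM with AITER installed.

```python
import os

# Both flags must be set for the AITER paged attention kernel to be used.
os.environ["VLLM_ROCM_USE_AITER"] = "1"             # parent AITER switch
os.environ["VLLM_ROCM_USE_AITER_PAGED_ATTN"] = "1"  # flag added by this PR

from vllm import LLM, SamplingParams

# Placeholder model; fp8 kv-cache matches the benchmark setup further below.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")
outputs = llm.generate(["Hello from ROCm!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```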
Note: The supported kv_cache_dtypes are int8, "fp8", "fp8_e4m3", bfloat16, and float16. For the float16 and bfloat16 kv_cache_dtype settings, the module currently does not support decoding of models with more than 1 kv_head, so a fallback to the original v1/v2 paged attention is added for those cases.
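An illustrative sketch of the dispatch rule described in the note (the function and parameter names are hypothetical, not the PR's code):

```python
def _use_aiter_decode(kv_cache_dtype: str, num_kv_heads: int) -> bool:
    # Hypothetical helper mirroring the note above: int8/fp8 kv-caches always
    # take the AITER path; fp16/bf16 kv-caches only do so for a single
    # kv_head, otherwise decode falls back to v1/v2 paged attention.
    if kv_cache_dtype in ("int8", "fp8", "fp8_e4m3"):
        return True
    if kv_cache_dtype in ("float16", "bfloat16"):
        return num_kv_heads == 1
    return False
```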
Performance Improvement Tables

The https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py script was used to evaluate performance on the following models.

Dataset: Random
Input length: 1024
Output length: 128

The ROCm custom paged attention method, which can be enabled with the VLLM_ROCM_CUSTOM_PAGED_ATTN=1 flag, was used as the baseline for comparison. All benchmarks were run with the --quantization fp8 and --kv-cache-dtype fp8 args.
args.Request throughput (req/s)
Output token throughput (tok/s)
Total Token throughput (tok/s)
Mean TTFT (ms)
Mean TPOT (ms)
Mean ITL (ms)
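For reference, a short sketch of how the latency metrics above are commonly derived from per-token timestamps. These are the usual definitions, not code from benchmark_serving.py.

```python
def request_latency_metrics(send_time: float,
                            token_times: list[float]) -> dict[str, float]:
    # token_times holds the wall-clock arrival time of each output token.
    ttft = token_times[0] - send_time                # time to first token
    e2e = token_times[-1] - send_time                # end-to-end latency
    n = len(token_times)
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0  # time per output token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(itls) / len(itls) if itls else 0.0  # inter-token latency
    return {"ttft": ttft, "tpot": tpot, "mean_itl": mean_itl}
```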
lm_eval Results
AITER Operations Testing Overview

1. High-Level Integration Tests
The integration of AITER ops is tested at a higher module level in the following files under /tests/models/decoder_only/language (a sketch of a local invocation with the AITER flags enabled follows this section):
- test_models.py
- test_phimoe.py
- test_mistral.py
- test_granite.py
These tests run various models end to end to ensure overall functionality.

2. AITER MoE Specific Test
/tests/kernels/test_moe.py

3. Quantization Testing
/tests/quantization/test_fp8.py

4. Kernel Function Dispatch Testing
/tests/model_executor/test_enabled_custom_ops.py
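A minimal sketch of invoking these test files locally with the AITER flags enabled (paths are relative to the vLLM repo root; the PR's own test setup may enable the flags differently, e.g. via fixtures):

```python
import os
import sys

import pytest

# Enable the AITER code paths before the vLLM modules under test are imported.
os.environ["VLLM_ROCM_USE_AITER"] = "1"
os.environ["VLLM_ROCM_USE_AITER_PAGED_ATTN"] = "1"

sys.exit(pytest.main([
    "tests/models/decoder_only/language/test_models.py",
    "tests/kernels/test_moe.py",
    "tests/quantization/test_fp8.py",
    "tests/model_executor/test_enabled_custom_ops.py",
    "-q",
]))
```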
Environment Settings

Updates in Dockerfile.rocm_base:
- Added the AITER package, pinned to AITER_BRANCH: 7e1ed08

Note:
- When setting up AITER, it is crucial to clone with git clone --recursive, because the package depends on a third-party package (Composable Kernel).
- For building and installing the AITER Python package, you must use the PREBUILD_KERNELS=1 flag along with the command python3 setup.py develop. This ensures that all kernels in the AITER package are built successfully.

The following branches were used as references for this integration: