
Commit 00ae250

[V1][eagle3] Support eagle3 proposer for v1 (#1032)
### What this PR does / why we need it?

This PR implements the Eagle Proposer feature for vLLM v1, which enables more efficient speculative decoding by using a draft model to predict potential future tokens.

- The implementation integrates the core Eagle algorithm with vLLM's existing architecture, allowing faster inference while maintaining output quality.
- This is needed to significantly improve the generation speed of large language models without compromising the quality of the generated text.

### Does this PR introduce any user-facing change?

Yes, this PR introduces a new speculative decoding mode that can be enabled via configuration.

- Users can now opt into the Eagle Proposer by setting the appropriate flags in the inference configuration.
- The API remains backward compatible, with the new functionality being opt-in.

### How was this patch tested?

CI passed with new unit tests added for the Eagle Proposer functionality.

- Benchmark tests compared generation speed and quality with and without the Eagle Proposer.
- Integration tests were performed with various model architectures to ensure compatibility.
- Manual testing was done with different prompt scenarios to verify that output quality remains consistent.
- We tested the acceptance rate on one Ascend 910B NPU; the results are basically consistent with those shown in vllm-project/vllm#16937.
- Currently, we support scenarios where num_spec_tokens <= 2. When num_spec_tokens > 2, issues such as insufficient GPU memory and operator computation errors may occur. We will address this in subsequent updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script

```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```

### Acceptance Test Results

```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```

Closes: #1004

---------

Signed-off-by: yuancaoyaoHW <[email protected]>
1 parent 45be1aa commit 00ae250
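
As a usage note: below is a minimal sketch of enabling the new opt-in mode from Python. It mirrors the `speculative_config` fields exercised in the test diff further down; the target and draft checkpoints are the ones named in the acceptance test script, and the remaining arguments are illustrative assumptions rather than required settings.

```python
import os

# The eagle3 proposer in this PR targets the v1 engine
# (assumption: set the flag before importing vLLM).
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Target and Eagle3 draft checkpoints as used in the acceptance test script;
# substitute local paths or model IDs as appropriate.
llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",               # opt in to the Eagle3 proposer
        "model": "EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 2,      # num_spec_tokens <= 2 is supported for now
        "max_model_len": 2048,
    },
    max_model_len=2048,
    tensor_parallel_size=1,
    enforce_eager=True,
)

outputs = llm.generate(
    ["The future of speculative decoding is"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Since the target model verifies every proposed token, the draft model mainly affects speed; accepted outputs still follow the target model's distribution.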

File tree

5 files changed: +734 -25 lines changed


.github/workflows/vllm_ascend_test_long_term.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -100,7 +100,7 @@ jobs:
           # spec decode test
           VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/long_term/spec_decode/e2e/test_v1_mtp_correctness.py
           # TODO: revert me when test_v1_spec_decode.py::test_ngram_correctness is fixed
-          # VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/long_term/spec_decode/e2e/test_v1_spec_decode.py
+          VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/long_term/spec_decode/e2e/test_v1_spec_decode.py
           VLLM_USE_MODELSCOPE=True pytest -sv tests/e2e/long_term/spec_decode/e2e/test_mtp_correctness.py # it needs a clean process
           pytest -sv tests/e2e/long_term/spec_decode --ignore=tests/e2e/long_term/spec_decode/e2e/test_mtp_correctness.py --ignore=tests/e2e/long_term/spec_decode/e2e/test_v1_spec_decode.py --ignore=tests/e2e/long_term/spec_decode/e2e/test_v1_mtp_correctness.py
           pytest -sv tests/e2e/long_term/test_accuracy.py
```

tests/e2e/long_term/spec_decode/e2e/test_v1_spec_decode.py

Lines changed: 12 additions & 6 deletions
```diff
@@ -11,7 +11,7 @@
 @pytest.fixture
 def test_prompts():
     prompt_types = ["repeat", "sentence"]
-    num_prompts = 100
+    num_prompts = 10
     prompts = []
 
     random.seed(0)
@@ -69,6 +69,7 @@ def test_ngram_correctness(
     Compare the outputs of a original LLM and a speculative LLM
     should be the same when using ngram speculative decoding.
     '''
+    pytest.skip("Not current support for the test.")
     with monkeypatch.context() as m:
         m.setenv("VLLM_USE_V1", "1")
 
@@ -116,11 +117,12 @@ def test_eagle_correctness(
     Compare the outputs of a original LLM and a speculative LLM
     should be the same when using eagle speculative decoding.
     '''
-    pytest.skip("Not current support for the test.")
+    if not use_eagle3:
+        pytest.skip("Not current support for the test.")
     with monkeypatch.context() as m:
         m.setenv("VLLM_USE_V1", "1")
 
-        ref_llm = LLM(model=model_name, max_model_len=2048)
+        ref_llm = LLM(model=model_name, max_model_len=2048, enforce_eager=True)
         ref_outputs = ref_llm.chat(test_prompts, sampling_config)
         del ref_llm
 
@@ -129,13 +131,17 @@ def test_eagle_correctness(
         spec_llm = LLM(
             model=model_name,
             trust_remote_code=True,
+            enable_chunked_prefill=True,
+            max_num_seqs=1,
+            max_num_batched_tokens=2048,
+            gpu_memory_utilization=0.6,
             speculative_config={
                 "method": "eagle3" if use_eagle3 else "eagle",
                 "model": spec_model_name,
-                "num_speculative_tokens": 3,
-                "max_model_len": 2048,
+                "num_speculative_tokens": 2,
+                "max_model_len": 128,
             },
-            max_model_len=2048,
+            max_model_len=128,
             enforce_eager=True,
         )
         spec_outputs = spec_llm.chat(test_prompts, sampling_config)
```
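
The assertion that concludes `test_eagle_correctness` sits below this hunk and is unchanged, so it does not appear in the diff. For orientation only, here is a rough sketch of the kind of check such correctness tests make; the 66% threshold and exact variable handling are assumptions, not taken from this file.

```python
# Sketch only (not part of this diff): compare reference and speculative
# outputs prompt by prompt and require most of them to match exactly,
# since speculative decoding can still perturb a small fraction of outputs.
matches = sum(
    ref.outputs[0].text == spec.outputs[0].text
    for ref, spec in zip(ref_outputs, spec_outputs))
assert matches > int(0.66 * len(ref_outputs))  # threshold is an assumption
```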

tests/e2e/long_term/test_deepseek_v2_lite_tp2_accuracy.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -38,7 +38,7 @@
 
 
 def run_test(model_name, queue, more_args=None):
-    model_args = f"pretrained={model_name},max_model_len=4096,trust_remote_code=True,tensor_parallel_size=4"
+    model_args = f"pretrained={model_name},max_model_len=4096,trust_remote_code=True,tensor_parallel_size=4,enforce_eager=True"
     if more_args is not None:
         model_args = f"{model_args},{more_args}"
     results = lm_eval.simple_evaluate(
```
