[Bug] fix: unexpectedly high TTFT in benchmark results of reasoning (Chain-of-Thought) LLMs #1573
Conversation
Summary of Changes
Hello @jwjwjw3, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request resolves a bug in the benchmarking system where Time-to-First-Token (TTFT) measurements for reasoning Large Language Models (LLMs) were significantly inflated. The previous calculation incorrectly waited for the first non-reasoning token, thereby including the generation time of internal reasoning steps. The fix adjusts the TTFT logic to capture the arrival of any initial token, whether it's reasoning or non-reasoning content, providing a more accurate performance metric for these models.
Highlights
- Bug Fix for TTFT in Reasoning LLM Benchmarks: Addresses an issue where Time-to-First-Token (TTFT) was inaccurately high (10-30s) for reasoning (Chain-of-Thought) LLMs in benchmarks. This occurred because the TTFT calculation only considered the first non-reasoning token, effectively including the generation time of reasoning tokens in the TTFT.
- TTFT Calculation Logic Update: The TTFT calculation has been updated to correctly measure the time until either the first non-reasoning token or the first reasoning token is received. This ensures that the benchmark accurately reflects the initial response time for reasoning LLMs.
- client.py Modification: The `benchmarks/client/client.py` file was modified to include a check for `chunk.choices[0].delta.reasoning_content` when determining the `first_response_time`, alongside the existing check for `chunk.choices[0].delta.content` (see the sketch below).
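For orientation, here is a minimal, self-contained sketch of the timing logic this highlight describes. It is not the benchmark's actual code: the chunk objects are faked with `SimpleNamespace`, and only the shape of the check (the first token may arrive in `delta.content` or `delta.reasoning_content`) mirrors the fix.

```python
import asyncio
from types import SimpleNamespace


async def measure_ttft(chunks) -> float:
    """Return TTFT for an async iterator of OpenAI-style streaming chunks."""
    start = asyncio.get_event_loop().time()
    first_response_time = None

    async for chunk in chunks:
        delta = chunk.choices[0].delta
        token = delta.content
        if token is None:
            # Reasoning (CoT) models stream their chain of thought here first.
            token = getattr(delta, "reasoning_content", None)
        if token is not None and first_response_time is None:
            first_response_time = asyncio.get_event_loop().time()

    return first_response_time - start


async def fake_stream():
    """Yield a reasoning chunk, then a content chunk, with a small delay each."""
    for content, reasoning in [(None, "thinking..."), ("Hello!", None)]:
        await asyncio.sleep(0.01)
        yield SimpleNamespace(
            choices=[SimpleNamespace(
                delta=SimpleNamespace(content=content, reasoning_content=reasoning))])


print(f"TTFT: {asyncio.run(measure_ttft(fake_stream())):.3f}s")  # ~0.01s, not ~0.02s
```

With the pre-fix logic (only `delta.content` counted), the same fake stream would report roughly twice the TTFT, since the reasoning chunk would be ignored.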
Code Review
This pull request addresses a bug causing unexpectedly high Time-to-First-Token (TTFT) in benchmarks for reasoning-based LLMs. The fix correctly adjusts the TTFT calculation to consider the first token received, whether it's a reasoning or a content token. The change is logical and well-explained. My review includes a suggestion to refactor the implementation to reduce code duplication and improve maintainability.
benchmarks/client/client.py
Outdated
elif chunk.choices[0].delta.reasoning_content is not None:
    if not first_response_time:
        first_response_time = asyncio.get_event_loop().time()
    output_text = chunk.choices[0].delta.reasoning_content
    text_chunks.append(output_text)
This `elif` block correctly fixes the TTFT calculation for reasoning models, but it introduces code duplication. The logic for setting `first_response_time` and appending the text chunk is identical to the `if` block for `delta.content`.
To improve maintainability and avoid redundancy, consider refactoring this section to handle both `content` and `reasoning_content` in a more unified way. Since this would involve changing lines outside the current diff, I'm providing a suggestion here for you to consider applying across lines 65-74:
delta = chunk.choices[0].delta
output_text = delta.content
if output_text is None:
    # Use getattr for safety as reasoning_content is not a standard field
    output_text = getattr(delta, 'reasoning_content', None)
if output_text is not None:
    if not first_response_time:
        first_response_time = asyncio.get_event_loop().time()
    text_chunks.append(output_text)

This refactoring removes the repeated code, making it cleaner and easier to maintain.
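For readers unfamiliar with the `getattr` fallback used in the suggestion above, a tiny illustration of why it is safer than direct attribute access when `reasoning_content` may be absent (the deltas here are stand-ins built with `SimpleNamespace`, not the OpenAI client's own types):

```python
from types import SimpleNamespace

# A delta from a non-reasoning model: no reasoning_content attribute at all.
plain_delta = SimpleNamespace(content="Hello")
print(getattr(plain_delta, "reasoning_content", None))  # None, no AttributeError

# A delta from a reasoning model during its thinking phase: content is empty,
# while reasoning_content carries the chain-of-thought tokens.
thinking_delta = SimpleNamespace(content=None, reasoning_content="Let me think...")
print(getattr(thinking_delta, "reasoning_content", None))  # "Let me think..."
```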
@jwjwjw3 Could you take a look at the code-assist comment above -- I think the comment actually makes sense.
@happyandslow Thanks for the suggestion, I think the code block in the code-assist comment is also a good solution. I just tested the code-assist solution and it also resolves the "TTFT too high" bug. I agree with using the code-assist version instead of my initial version. I just did a second commit and replaced my initial version with the code-assist version.
/cc @varungup90
This is a benchmark fix. @happyandslow, please take a look.
Signed-off-by: Jinghua Wang <[email protected]>
@jwjwjw3 thanks for the comprehensive testing and bug fix. awesome!
…(Chain-of-Thought) LLMs (vllm-project#1573)

* bug fix on benchmarking TTFT of reasoning LLMs
* adapt the solution in the code-assist comments

Signed-off-by: Jinghua Wang <[email protected]>
Co-authored-by: jwbook <[email protected]>
Signed-off-by: ChethanUK <[email protected]>
Pull Request Description
Bug Description:
When using `benchmarks/benchmark.py` to run benchmarks according to the guide in `benchmarks/README.md` on reasoning (CoT) LLMs, I always get an unexpectedly high Time-to-First-Token (TTFT). On NVIDIA GPUs like A30 and A100 running Qwen3 LLMs with SGLang, despite a very low Time-per-Output-Token (TPOT) of 0.01s-0.05s, TTFT stays at 10-30s. This "TTFT too high" bug does not occur with non-reasoning LLMs (whose TTFT is 0.2-2s), and it also disappears if I turn off reasoning on the same Qwen3 LLM (same GPU, same server node, same LLM inference backend, etc.).

Details of my setup:
Here are three benchmark analysis results (generated by `python benchmark.py --stage analysis --config config.yaml`) from running Qwen3-14B with SGLang v0.4.7, using a benchmark config of "synthetic_multiturn" prompts with "output_token_limit"=8192 (the rest of the configuration, such as "workload_configs" and "dataset_configs", is the same as `benchmarks/config.yaml` on the `main` branch) on the same server machine; each run takes ~5 minutes. With the fix applied to `benchmarks/client/client.py`, TTFT stays in 0.1-0.6s and the "TTFT too high" bug is fixed.

Solution Description:
The "TTFT too high" bug arises because the benchmark client Python script uses the time difference from "request sent" to "received first non-reasoning token response" as TTFT. This is fine for non-reasoning LLMs, but wrong for reasoning LLMs, because the measured TTFT then includes the 10-20s spent decoding reasoning tokens before the first non-reasoning token is generated.
My fix changes TTFT to the time difference from "request sent" to "received either the first non-reasoning token response or the first reasoning token response".
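To make the difference concrete, a small illustration with made-up timestamps (chosen to match the ranges reported in this PR, not measured values):

```python
# Hypothetical timeline for one request to a reasoning LLM, in seconds.
request_sent = 0.0
first_reasoning_token = 0.4   # the model starts streaming its chain of thought
first_content_token = 18.0    # first non-reasoning token, after the thinking phase

ttft_before_fix = first_content_token - request_sent    # 18.0 s: inflated by reasoning decode time
ttft_after_fix = first_reasoning_token - request_sent   # 0.4 s: first token of either kind

print(ttft_before_fix, ttft_after_fix)
```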
The OpenAI API client streams LLM-generated non-reasoning content and reasoning content in two different fields: non-reasoning content arrives at `choices[0].delta.content`, and reasoning content arrives at `choices[0].delta.reasoning_content`. Therefore, I added another `elif` branch after the `if chunk.choices[0].delta.content is not None:` branch in `benchmarks/client/client.py` when detecting `first_response_time` and updating `text_chunks`.

Related Issues
Resolves: None