Conversation


@jwjwjw3 (Contributor) commented Sep 15, 2025

Pull Request Description

Bug Description:

When using benchmarks/benchmark.py to run benchmarks against reasoning (CoT) LLMs, following the guide in benchmarks/README.md, I always get an unexpectedly high Time-to-First-Token (TTFT). On NVIDIA GPUs such as the A30 and A100 running Qwen3 LLMs with SGLang, TTFT stays at 10-30 s even though the Time-per-Output-Token (TPOT) is only 0.01-0.05 s. This "TTFT too high" bug does not occur with non-reasoning LLMs (TTFT 0.2-2 s), and it disappears if I turn off reasoning on the same Qwen3 LLM (same GPU, same server node, same LLM inference backend, etc.).

Details of my setup:

  • GPU hardware: bug reproduced on A30 and A100.
  • LLM inference backend: SGLang.
  • LLMs: bug reproduced with Qwen3-8B, Qwen3-14B, and Qwen3-32B.

Below are three benchmark analysis results (generated with python benchmark.py --stage analysis --config config.yaml) for Qwen3-14B served by SGLang v0.4.7. The benchmark config uses "synthetic_multiturn" prompts with "output_token_limit"=8192; the remaining settings (e.g. "workload_configs" and "dataset_configs") are identical to benchmarks/config.yaml on the main branch. All runs are on the same server machine and each run takes ~5 minutes:

  • Run 1, reasoning disabled in SGLang: TTFT stays in 0.1-1.0 s; there is no "TTFT too high" bug.
    (screenshot: benchmark_analysis_reasoning_disabled)
  • Run 2, reasoning enabled in SGLang: TTFT stays in 8-22 s; the "TTFT too high" bug appears.
    (screenshot: benchmark_analysis_reasoning_with_bug)
  • Run 3, reasoning enabled in SGLang, with my fix (described below in "Solution Description") applied to benchmarks/client/client.py: TTFT stays in 0.1-0.6 s; the "TTFT too high" bug is fixed.
    (screenshot: benchmark_analysis_reasoning_bug_fixed)

Solution Description:

The bug of "TTFT too high" is because the benchmark client Python script uses the time difference from "request sent" to "received first non-reasoning token response" as TTFT, this is fine for non-reasoning LLMs, but buggy for reasoning LLMs (because this TTFT includes 10-20s of reasoning tokens decode time before the first non-reasoning token is generated).

My fix changes TTFT to the time from "request sent" to the first token received, whether that token is a reasoning token or a non-reasoning token.

The OpenAI API client streams non-reasoning content and reasoning content in two different fields: non-reasoning content arrives in choices[0].delta.content, and reasoning content arrives in choices[0].delta.reasoning_content. I therefore added another elif branch after the if chunk.choices[0].delta.content is not None: branch in benchmarks/client/client.py so that first_response_time is detected and text_chunks is updated for reasoning content as well.
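
For reference, here is a minimal sketch of what this first-token detection looks like inside a streaming loop. This is an illustration assuming an AsyncOpenAI-style client; aside from first_response_time, text_chunks, output_text, and the delta fields named above, the function and variable names are hypothetical and not taken from benchmarks/client/client.py:

    import asyncio
    from openai import AsyncOpenAI

    # Illustrative sketch only; names other than those quoted in the
    # description above are hypothetical.
    async def measure_ttft(client: AsyncOpenAI, model: str, prompt: str):
        loop = asyncio.get_event_loop()
        start_time = loop.time()
        first_response_time = None
        text_chunks = []

        stream = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            # Non-reasoning tokens arrive in delta.content; reasoning tokens
            # arrive in delta.reasoning_content (a backend-specific field,
            # hence getattr for safety).
            output_text = delta.content
            if output_text is None:
                output_text = getattr(delta, "reasoning_content", None)
            if output_text is not None:
                if not first_response_time:
                    # TTFT is taken at the first token of either kind.
                    first_response_time = loop.time()
                text_chunks.append(output_text)

        ttft = first_response_time - start_time if first_response_time else None
        return ttft, "".join(text_chunks)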

Related Issues

Resolves: None


@gemini-code-assist (bot) left a comment
Summary of Changes

Hello @jwjwjw3, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a bug in the benchmarking system where Time-to-First-Token (TTFT) measurements for reasoning Large Language Models (LLMs) were significantly inflated. The previous calculation incorrectly waited for the first non-reasoning token, thereby including the generation time of internal reasoning steps. The fix adjusts the TTFT logic to capture the arrival of any initial token, whether it's reasoning or non-reasoning content, providing a more accurate performance metric for these models.

Highlights

  • Bug Fix for TTFT in Reasoning LLM Benchmarks: Addresses an issue where Time-to-First-Token (TTFT) was inaccurately high (10-30s) for reasoning (Chain-of-Thought) LLMs in benchmarks. This occurred because the TTFT calculation only considered the first non-reasoning token, effectively including the generation time of reasoning tokens in the TTFT.
  • TTFT Calculation Logic Update: The TTFT calculation has been updated to correctly measure the time until either the first non-reasoning token or the first reasoning token is received. This ensures that the benchmark accurately reflects the initial response time for reasoning LLMs.
  • client.py Modification: The benchmarks/client/client.py file was modified to include a check for chunk.choices[0].delta.reasoning_content when determining the first_response_time, alongside the existing check for chunk.choices[0].delta.content.

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a bug causing unexpectedly high Time-to-First-Token (TTFT) in benchmarks for reasoning-based LLMs. The fix correctly adjusts the TTFT calculation to consider the first token received, whether it's a reasoning or a content token. The change is logical and well-explained. My review includes a suggestion to refactor the implementation to reduce code duplication and improve maintainability.

Comment on lines 70 to 74

    elif chunk.choices[0].delta.reasoning_content is not None:
        if not first_response_time:
            first_response_time = asyncio.get_event_loop().time()
        output_text = chunk.choices[0].delta.reasoning_content
        text_chunks.append(output_text)
@gemini-code-assist (bot) commented (severity: medium):

This elif block correctly fixes the TTFT calculation for reasoning models, but it introduces code duplication. The logic for setting first_response_time and appending the text chunk is identical to the if block for delta.content.

To improve maintainability and avoid redundancy, consider refactoring this section to handle both content and reasoning_content in a more unified way. Since this would involve changing lines outside the current diff, I'm providing a suggestion here for you to consider applying across lines 65-74:

                    delta = chunk.choices[0].delta
                    output_text = delta.content
                    if output_text is None:
                        # Use getattr for safety as reasoning_content is not a standard field
                        output_text = getattr(delta, 'reasoning_content', None)

                    if output_text is not None:
                        if not first_response_time:
                            first_response_time = asyncio.get_event_loop().time()
                        text_chunks.append(output_text)

This refactoring removes the repeated code, making it cleaner and easier to maintain.

@happyandslow (Collaborator) commented:
@jwjwjw3 Could you take a look at the code-assist comment above -- I think the comment actually makes sense.

@jwjwjw3 (Contributor, Author) commented Sep 19, 2025

@happyandslow Thanks for the suggestion; I agree that the code block in the code-assist comment is a good solution. I tested it and it also resolves the "TTFT too high" bug, so I pushed a second commit replacing my initial version with the code-assist version.

@Jeffwan (Collaborator) commented Sep 16, 2025

/cc @varungup90

@Jeffwan (Collaborator) commented Sep 18, 2025

This is a benchmark fix. @happyandslow please take a look.

@Jeffwan (Collaborator) commented Sep 19, 2025

@jwjwjw3 Thanks for the comprehensive testing and bug fix. Awesome!

@Jeffwan merged commit 7f909cb into vllm-project:main on Sep 19, 2025 (3 checks passed).
chethanuk pushed a commit to chethanuk/aibrix that referenced this pull request Sep 22, 2025
…(Chain-of-Thought) LLMs (vllm-project#1573)

* bug fix on benchmarking TTFT of reasoning LLMs
* adapt the solution in the code-assist comments

Signed-off-by: Jinghua Wang <[email protected]>
Co-authored-by: jwbook <[email protected]>
Signed-off-by: ChethanUK <[email protected]>