[Bug] fix: unexpectedly high TTFT in benchmark results of reasoning (Chain-of-Thought) LLMs #1573
Conversation
Summary of Changes
Hello @jwjwjw3, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request resolves a bug in the benchmarking system where Time-to-First-Token (TTFT) measurements for reasoning Large Language Models (LLMs) were significantly inflated. The previous calculation incorrectly waited for the first non-reasoning token, thereby including the generation time of internal reasoning steps. The fix adjusts the TTFT logic to capture the arrival of any initial token, whether it's reasoning or non-reasoning content, providing a more accurate performance metric for these models.
Highlights
- Bug Fix for TTFT in Reasoning LLM Benchmarks: Addresses an issue where Time-to-First-Token (TTFT) was inaccurately high (10-30s) for reasoning (Chain-of-Thought) LLMs in benchmarks. This occurred because the TTFT calculation only considered the first non-reasoning token, effectively including the generation time of reasoning tokens in the TTFT.
- TTFT Calculation Logic Update: The TTFT calculation has been updated to correctly measure the time until either the first non-reasoning token or the first reasoning token is received. This ensures that the benchmark accurately reflects the initial response time for reasoning LLMs.
- client.py Modification: The `benchmarks/client/client.py` file was modified to include a check for `chunk.choices[0].delta.reasoning_content` when determining the `first_response_time`, alongside the existing check for `chunk.choices[0].delta.content` (see the sketch below).
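For orientation, here is a minimal, self-contained sketch of the timing logic this highlight describes. It is not the benchmark's actual code: the chunk objects are faked with `SimpleNamespace`, and only the shape of the check (the first token may arrive in `delta.content` or `delta.reasoning_content`) mirrors the fix.

```python
import asyncio
from types import SimpleNamespace


async def measure_ttft(chunks) -> float:
    """Return TTFT for an async iterator of OpenAI-style streaming chunks."""
    start = asyncio.get_event_loop().time()
    first_response_time = None

    async for chunk in chunks:
        delta = chunk.choices[0].delta
        token = delta.content
        if token is None:
            # Reasoning (CoT) models stream their chain of thought here first.
            token = getattr(delta, "reasoning_content", None)
        if token is not None and first_response_time is None:
            first_response_time = asyncio.get_event_loop().time()

    return first_response_time - start


async def fake_stream():
    """Yield a reasoning chunk, then a content chunk, with a small delay each."""
    for content, reasoning in [(None, "thinking..."), ("Hello!", None)]:
        await asyncio.sleep(0.01)
        yield SimpleNamespace(
            choices=[SimpleNamespace(
                delta=SimpleNamespace(content=content, reasoning_content=reasoning))])


print(f"TTFT: {asyncio.run(measure_ttft(fake_stream())):.3f}s")  # ~0.01s, not ~0.02s
```

With the pre-fix logic (only `delta.content` counted), the same fake stream would report roughly twice the TTFT, since the reasoning chunk would be ignored.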
Code Review
This pull request addresses a bug causing unexpectedly high Time-to-First-Token (TTFT) in benchmarks for reasoning-based LLMs. The fix correctly adjusts the TTFT calculation to consider the first token received, whether it's a reasoning or a content token. The change is logical and well-explained. My review includes a suggestion to refactor the implementation to reduce code duplication and improve maintainability.
benchmarks/client/client.py
Outdated
elif chunk.choices[0].delta.reasoning_content is not None:
    if not first_response_time:
        first_response_time = asyncio.get_event_loop().time()
    output_text = chunk.choices[0].delta.reasoning_content
    text_chunks.append(output_text)
This `elif` block correctly fixes the TTFT calculation for reasoning models, but it introduces code duplication. The logic for setting `first_response_time` and appending the text chunk is identical to the `if` block for `delta.content`.
To improve maintainability and avoid redundancy, consider refactoring this section to handle both `content` and `reasoning_content` in a more unified way. Since this would involve changing lines outside the current diff, I'm providing a suggestion here for you to consider applying across lines 65-74:
delta = chunk.choices[0].delta
output_text = delta.content
if output_text is None:
    # Use getattr for safety as reasoning_content is not a standard field
    output_text = getattr(delta, 'reasoning_content', None)
if output_text is not None:
    if not first_response_time:
        first_response_time = asyncio.get_event_loop().time()
    text_chunks.append(output_text)

This refactoring removes the repeated code, making it cleaner and easier to maintain.
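For readers unfamiliar with the `getattr` fallback used in the suggestion above, a tiny illustration of why it is safer than direct attribute access when `reasoning_content` may be absent (the deltas here are stand-ins built with `SimpleNamespace`, not the OpenAI client's own types):

```python
from types import SimpleNamespace

# A delta from a non-reasoning model: no reasoning_content attribute at all.
plain_delta = SimpleNamespace(content="Hello")
print(getattr(plain_delta, "reasoning_content", None))  # None, no AttributeError

# A delta from a reasoning model during its thinking phase: content is empty,
# while reasoning_content carries the chain-of-thought tokens.
thinking_delta = SimpleNamespace(content=None, reasoning_content="Let me think...")
print(getattr(thinking_delta, "reasoning_content", None))  # "Let me think..."
```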
@jwjwjw3 Could you take a look at the code-assist comment above -- I think the comment actually makes sense.
@happyandslow Thanks for the suggestion, I think the code block in the code-assist comment is also a good solution. I just tested the code-assist solution and it also resolves the "TTFT too high" bug. I agree with using the code-assist version instead of my initial version. I just did a second commit and replaced my initial version with the code-assist version.
/cc @varungup90
This is a benchmark fix. @happyandslow, please take a look.
Signed-off-by: Jinghua Wang <[email protected]>
@jwjwjw3 thanks for the comprehensive testing and bug fix. awesome!
…(Chain-of-Thought) LLMs (vllm-project#1573)

* bug fix on benchmarking TTFT of reasoning LLMs
* adapt the solution in the code-assist comments

Signed-off-by: Jinghua Wang <[email protected]>
Co-authored-by: jwbook <[email protected]>
Signed-off-by: ChethanUK <[email protected]>
Pull Request Description
Bug Description:
When using `benchmarks/benchmark.py` to run benchmarks according to the guide in `benchmarks/README.md` on reasoning (CoT) LLMs, I always get an unexpectedly high Time-to-First-Token (TTFT). On NVIDIA GPUs like A30 and A100 running Qwen3 LLMs with SGLang, despite a very low Time-per-Output-Token (TPOT) of 0.01s-0.05s, TTFT stays at 10-30s. This "TTFT too high" bug does not occur with non-reasoning LLMs (whose TTFT is 0.2-2s), and it also disappears if I turn off reasoning on the same Qwen3 LLM (same GPU, same server node, same LLM inference backend, etc.).

Details of my setup:
Here are three benchmark analysis results (generated by `python benchmark.py --stage analysis --config config.yaml`) from running Qwen3-14B with SGLang v0.4.7, using a benchmark config of "synthetic_multiturn" prompts with "output_token_limit"=8192 (the rest of the configuration, such as "workload_configs" and "dataset_configs", is the same as `benchmarks/config.yaml` on the `main` branch) on the same server machine; each run takes ~5 minutes. With the fix applied to `benchmarks/client/client.py`, TTFT stays in 0.1-0.6s and the "TTFT too high" bug is fixed.

Solution Description:
The "TTFT too high" bug arises because the benchmark client Python script uses the time difference from "request sent" to "received first non-reasoning token response" as TTFT. This is fine for non-reasoning LLMs, but wrong for reasoning LLMs, because the measured TTFT then includes the 10-20s spent decoding reasoning tokens before the first non-reasoning token is generated.
My fix changes TTFT to the time difference from "request sent" to "received either the first non-reasoning token response or the first reasoning token response".
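To make the difference concrete, a small illustration with made-up timestamps (chosen to match the ranges reported in this PR, not measured values):

```python
# Hypothetical timeline for one request to a reasoning LLM, in seconds.
request_sent = 0.0
first_reasoning_token = 0.4   # the model starts streaming its chain of thought
first_content_token = 18.0    # first non-reasoning token, after the thinking phase

ttft_before_fix = first_content_token - request_sent    # 18.0 s: inflated by reasoning decode time
ttft_after_fix = first_reasoning_token - request_sent   # 0.4 s: first token of either kind

print(ttft_before_fix, ttft_after_fix)
```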
The OpenAI API client streams LLM-generated non-reasoning content and reasoning content in two different fields: non-reasoning content arrives at `choices[0].delta.content`, and reasoning content arrives at `choices[0].delta.reasoning_content`. Therefore, I added another `elif` branch after the `if chunk.choices[0].delta.content is not None:` branch in `benchmarks/client/client.py` when detecting `first_response_time` and updating `text_chunks`.

Related Issues
Resolves: None