Gemini-2.5-flash - support reasoning cost calc + return reasoning content #10141

krrishdholakia · 2025-04-18T18:49:04Z

build(model_prices_and_context_window.json): add vertex ai gemini-2.5-flash pricing
build(model_prices_and_context_window.json): add gemini reasoning token pricing
fix(vertex_and_google_ai_studio_gemini.py): support counting thinking tokens for gemini

allows accurate cost calc

fix(utils.py): add reasoning token cost calc to generic cost calc

ensures gemini-2.5-flash cost calculation is accurate

build(model_prices_and_context_window.json): mark gemini-2.5-flash as 'supports_reasoning'
feat(gemini/): support 'thinking' + 'reasoning_effort' params + new unit tests

allow controlling thinking effort for gemini-2.5-flash models

test: update unit testing
feat(vertex_and_google_ai_studio_gemini.py): return reasoning content if given in gemini response

vercel · 2025-04-18T18:49:09Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
litellm	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Apr 19, 2025 4:03pm

awesie · 2025-04-19T18:28:22Z

litellm/litellm_core_utils/llm_cost_calc/utils.py

+        and reasoning_tokens
+        and reasoning_tokens > 0
+    ):
+        completion_cost += float(reasoning_tokens) * _output_cost_per_reasoning_token


My understanding of the Gemini 2.5 Flash pricing is that the output cost per token is binary, based on whether reasoning is enabled (i.e., whether the thinking budget is 0). I believe it is not relevant whether a token is an output token or a reasoning token; both are just output tokens.

I see reasoning tokens are charged separately -

It's a good idea to add a unit test to make sure these are calculated separately only if the model has 'output_cost_per_reasoning_token' set.

thanks @awesie

Added here - dacc712
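The behaviour that test is meant to pin down can be sketched roughly as follows. This is a minimal illustration, not the code in utils.py: the function name completion_cost_sketch and its signature are hypothetical, and it assumes completion_tokens already includes the reasoning tokens.

```python
from typing import Optional


def completion_cost_sketch(
    completion_tokens: int,
    reasoning_tokens: int,
    output_cost_per_token: float,
    output_cost_per_reasoning_token: Optional[float] = None,
) -> float:
    """Hypothetical helper; assumes completion_tokens includes reasoning_tokens."""
    # No separate reasoning price configured: everything is plain output.
    if output_cost_per_reasoning_token is None or reasoning_tokens <= 0:
        return completion_tokens * output_cost_per_token

    # Separate reasoning price configured: split text and reasoning tokens.
    text_tokens = completion_tokens - reasoning_tokens
    return (
        text_tokens * output_cost_per_token
        + reasoning_tokens * output_cost_per_reasoning_token
    )
```

With no reasoning price set, the reasoning token count has no effect on the result, which is the case the unit test should cover.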

Reasoning tokens are not calculated separately from output tokens; both are considered output. Disabling thinking with thinkingBudget = 0 is what switches output pricing from $3.50/million tokens to $0.60/million tokens.

https://x.com/OfficialLoganK/status/1912986097765789782

The thinking budget = 0 is what triggers the billing switch.

@cheahjs that's how we handle it if i understand this correctly

completion_tokens = total tokens (maps to the candidate token count from vertex)

text tokens (non-thinking tokens) = candidate token count - thinking token count

reasoning tokens = thinking token count

What am i missing here?

The pricing for Gemini 2.5 Flash does not distinguish between text tokens and reasoning tokens; they are billed at the same price. The price depends on whether thinking is enabled. Thinking is enabled if the thinking budget in the request is non-zero or if the budget is not set (the default thinking budget is 8192 tokens).

Oh, so the case we're missing is when thinking budget = 0?

In which case Gemini would return a thinking token count, but the billing uses the output_cost_per_token cost? @cheahjs @awesie

Appreciate your help on this!

if thinkingBudget != 0 (thinking enabled):

completion cost = candidatesTokenCount * $3.50/million, where candidatesTokenCount includes thinking tokens and response tokens.

if thinkingBudget == 0 (thinking disabled):

completion cost = candidatesTokenCount * $0.60/million, where candidatesTokenCount includes both thinking tokens and response tokens. The model isn't meant to produce thinking tokens, but if it does, it's billed at the non-thinking rate.
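The two cases above can be sketched as a single function. This is only an illustration of the rule as described in this thread: the names flash_output_cost, THINKING_RATE and NON_THINKING_RATE are hypothetical, the rates are the ones quoted here, and output_tokens is assumed to already contain both thinking and response tokens.

```python
from typing import Optional

# Rates quoted in this thread, in USD per token.
THINKING_RATE = 3.50 / 1_000_000
NON_THINKING_RATE = 0.60 / 1_000_000


def flash_output_cost(output_tokens: int, thinking_budget: Optional[int]) -> float:
    """Hypothetical helper; output_tokens = thinking tokens + response tokens."""
    # Thinking counts as enabled when the budget is unset (default) or non-zero.
    thinking_enabled = thinking_budget is None or thinking_budget != 0
    rate = THINKING_RATE if thinking_enabled else NON_THINKING_RATE
    return output_tokens * rate
```

The split between thinking and response tokens never enters the calculation; only the requested budget selects the rate.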

Note that the Gemini API returns different usage metadata than Vertex AI. With the Gemini API, candidatesTokenCount includes thinking tokens, but on Vertex AI, candidatesTokenCount does not include thinking tokens.

Gemini API:

{
  "completion_tokens": "usageMetadata['candidatesTokenCount']",
  "completion_tokens_details": {
    "reasoning_tokens": "usageMetadata['thoughtsTokenCount']"
  }
}

Vertex AI:

{
  "completion_tokens": "usageMetadata['candidatesTokenCount'] + usageMetadata['thoughtsTokenCount']",
  "completion_tokens_details": {
    "reasoning_tokens": "usageMetadata['thoughtsTokenCount']"
  }
}
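A rough sketch of the two mappings above, assuming the inclusive/exclusive behaviour is exactly as described in this comment. The function name normalize_usage and the candidates_include_thoughts flag are hypothetical and only serve the illustration.

```python
from typing import Any, Dict


def normalize_usage(
    usage_metadata: Dict[str, Any], candidates_include_thoughts: bool
) -> Dict[str, Any]:
    """Hypothetical helper mapping usageMetadata to an OpenAI-style usage dict."""
    candidates = usage_metadata.get("candidatesTokenCount", 0)
    thoughts = usage_metadata.get("thoughtsTokenCount", 0)

    # Inclusive count: use it as-is. Exclusive count: add the thinking tokens.
    if candidates_include_thoughts:
        completion_tokens = candidates
    else:
        completion_tokens = candidates + thoughts

    return {
        "completion_tokens": completion_tokens,
        "completion_tokens_details": {"reasoning_tokens": thoughts},
    }
```

Either way, completion_tokens ends up as the total billable output and reasoning_tokens as the thinking portion of it.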

awesie · 2025-04-19T18:30:21Z

litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py

+
+        params: GeminiThinkingConfig = {}
+        if thinking_enabled:
+            params["includeThoughts"] = True


I think this is incorrect. The includeThoughts parameter only determines whether thoughts are returned by the API; it does not affect whether reasoning is used at all. If you want to "disable" thinking, you must set the thinking budget to 0 tokens. As such, this function should probably only set thinkingBudget, and ditto for the function above.

i'm following their spec - https://ai.google.dev/api/generate-content#ThinkingConfig

and can confirm this works

awesie · 2025-04-19T18:39:00Z

litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py

+        if thinking_enabled:
+            params["includeThoughts"] = True
+        if thinking_budget:
+            params["thinkingBudget"] = thinking_budget


Do we need the conditional to be if thinking_budget is not None: so that it is possible to set thinking_budget to 0? My understanding is that the thinking budget is non-zero by default, so users need a way to set it to zero explicitly.

i don't follow what the issue here is

The issue is that there is no way to disable thinking for gemini-2.5-flash. If users want to disable thinking, they must set thinkingBudget to 0, which is skipped because of the if thinking_budget: check.

This is exactly the bug we are having right now:

includeThoughts only controls whether thoughts are returned by the API call, not whether thinking is on or off.

Setting thinkingBudget to 0 is the only way to disable thinking, and right now 0 never gets through the if check because 0 is falsy.

We've already tried -1, but the Vertex API returns a 400 for that setting, and a budget of 1 gets auto-converted to the minimum of 1024 by Google.
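A minimal sketch of the check being asked for, using an explicit None comparison so that a budget of 0 is passed through. The function name build_thinking_config is hypothetical; the two keys are the ones shown in the diff above.

```python
from typing import Dict, Optional, Union


def build_thinking_config(
    thinking_enabled: bool, thinking_budget: Optional[int]
) -> Dict[str, Union[bool, int]]:
    """Hypothetical helper; mirrors the diff but keeps an explicit budget of 0."""
    params: Dict[str, Union[bool, int]] = {}
    if thinking_enabled:
        # Only controls whether thoughts are returned, not whether thinking runs.
        params["includeThoughts"] = True
    # `is not None` instead of a truthiness check, so 0 is not dropped.
    if thinking_budget is not None:
        params["thinkingBudget"] = thinking_budget
    return params
```

For example, build_thinking_config(False, 0) yields a config containing thinkingBudget set to 0, whereas a plain truthiness check would omit the key.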

ok '0' works

Fixed e434ccc

you are a beast @krrishdholakia. Thanks!

#10165)
* test(utils.py): handle scenario where text tokens + reasoning tokens set, but reasoning tokens not charged separately. Addresses #10141 (comment)
* fix(vertex_and_google_ai_studio.py): only set content if non-empty str

* fix(vertex_and_google_ai_studio_gemini.py): allow thinking budget = 0. Fixes #10121
* fix(vertex_and_google_ai_studio_gemini.py): handle nuance in counting exclusive vs. inclusive tokens. Addresses #10141 (comment)

krrishdholakia added 8 commits April 18, 2025 09:48

build(model_prices_and_context_window.json): add vertex ai gemini-2.5-flash pricing

41483b9

build(model_prices_and_context_window.json): add gemini reasoning token pricing

273355f

fix(vertex_and_google_ai_studio_gemini.py): support counting thinking tokens for gemini

3471dcd

allows accurate cost calc

fix(utils.py): add reasoning token cost calc to generic cost calc

037b7da

ensures gemini-2.5-flash cost calculation is accurate

build(model_prices_and_context_window.json): mark gemini-2.5-flash as 'supports_reasoning'

380691b

feat(gemini/): support 'thinking' + 'reasoning_effort' params + new unit tests

d1aadb5

allow controlling thinking effort for gemini-2.5-flash models

test: update unit testing

be79061

feat(vertex_and_google_ai_studio_gemini.py): return reasoning content if given in gemini response

1e0e7b0

test: update model name

0cbca55

vercel bot deployed to Preview April 18, 2025 19:01 View deployment

fix: fix ruff check

e077752

vercel bot deployed to Preview April 19, 2025 15:44 View deployment

test(test_spend_management_endpoints.py): update tests to be less sensitive to new keys / updates to usage object

211fffc

vercel bot deployed to Preview April 19, 2025 16:01 View deployment

fix(vertex_and_google_ai_studio_gemini.py): fix translation

7bd2c11

vercel bot deployed to Preview April 19, 2025 16:03 View deployment

krrishdholakia merged commit 36308a3 into main Apr 19, 2025
31 of 39 checks passed

awesie reviewed Apr 19, 2025

krrishdholakia added a commit that referenced this pull request Apr 19, 2025

test(utils.py): handle scenario where text tokens + reasoning tokens set, but reasoning tokens not charged separately. Addresses #10141 (comment)

dacc712

krrishdholakia mentioned this pull request Apr 19, 2025

test(utils.py): handle scenario where text tokens + reasoning tokens … #10165

Merged

4 tasks

krrishdholakia deleted the litellm_gemini_2_5_flash branch April 19, 2025 19:34

emc2314 mentioned this pull request Apr 21, 2025

[Feature]: Gemini 2.5 Flash - Vertex AI to be added to LiteLLM #10121

Closed

awesie mentioned this pull request Apr 21, 2025

Fix gemini 2.5 flash on Vertex AI #10189

Open

4 tasks

krrishdholakia added a commit that referenced this pull request Apr 22, 2025

fix(vertex_and_google_ai_studio_gemini.py): handle nuance in counting exclusive vs. inclusive tokens. Addresses #10141 (comment)

72d89cc

krrishdholakia mentioned this pull request Apr 22, 2025

Gemini-2.5-flash improvements #10198

Merged

4 tasks

simonw mentioned this pull request May 8, 2025

Token counting is incorrect, thinking tokens should be added to output simonw/llm-gemini#75

Closed
