Gemini-2.5-flash - support reasoning cost calc + return reasoning content #10141

Merged
merged 12 commits into from
Apr 19, 2025

Conversation

krrishdholakia
Contributor

@krrishdholakia krrishdholakia commented Apr 18, 2025

  • build(model_prices_and_context_window.json): add vertex ai gemini-2.5-flash pricing

  • build(model_prices_and_context_window.json): add gemini reasoning token pricing

  • fix(vertex_and_google_ai_studio_gemini.py): support counting thinking tokens for gemini

allows accurate cost calc

  • fix(utils.py): add reasoning token cost calc to generic cost calc

ensures gemini-2.5-flash cost calculation is accurate

  • build(model_prices_and_context_window.json): mark gemini-2.5-flash as 'supports_reasoning'

  • feat(gemini/): support 'thinking' + 'reasoning_effort' params + new unit tests

allow controlling thinking effort for gemini-2.5-flash models

  • test: update unit testing

  • feat(vertex_and_google_ai_studio_gemini.py): return reasoning content if given in gemini response

vercel bot commented Apr 18, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
litellm ✅ Ready (Inspect) Visit Preview 💬 Add feedback Apr 19, 2025 4:03pm

…sitive to new keys / updates to usage object
@krrishdholakia krrishdholakia merged commit 36308a3 into main Apr 19, 2025
31 of 39 checks passed
        and reasoning_tokens
        and reasoning_tokens > 0
    ):
        completion_cost += float(reasoning_tokens) * _output_cost_per_reasoning_token

My understanding of the Gemini 2.5 Flash pricing is that the output cost per token is binary, based on whether reasoning is enabled (i.e., whether the thinking budget is 0). I believe whether a token is an output token or a reasoning token is not relevant; both are just output tokens.

Contributor Author

@krrishdholakia krrishdholakia Apr 19, 2025

I see reasoning tokens are charged separately:
[Screenshot 2025-04-19 at 12 02 18 PM]
[Screenshot 2025-04-19 at 12 02 37 PM]

Contributor Author

it's a good idea to add a unit test to make sure these are calculated separately only if the model has 'output_cost_per_reasoning_token' set.

thanks @awesie
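
A minimal sketch of the behaviour such a test would pin down. The helper `output_cost` and its signature are hypothetical and are not taken from the litellm codebase; only the key name `output_cost_per_reasoning_token` comes from this thread. It assumes `completion_tokens` already includes the reasoning tokens.

```python
from typing import Optional


def output_cost(
    completion_tokens: int,
    reasoning_tokens: int,
    output_cost_per_token: float,
    output_cost_per_reasoning_token: Optional[float] = None,
) -> float:
    """Hypothetical helper: completion_tokens is assumed to include reasoning_tokens."""
    if output_cost_per_reasoning_token is None or reasoning_tokens <= 0:
        # No separate reasoning rate configured: bill everything at the output rate.
        return completion_tokens * output_cost_per_token
    # A separate rate is configured: split text tokens from reasoning tokens.
    text_tokens = completion_tokens - reasoning_tokens
    return (
        text_tokens * output_cost_per_token
        + reasoning_tokens * output_cost_per_reasoning_token
    )
```

With no reasoning rate set, the reasoning token count has no effect on the result, which is the case the unit test is meant to guard.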

Contributor Author

Added here - dacc712

Contributor

Reasoning tokens are not calculated separately from output tokens; both are considered output. Disabling thinking with thinkingBudget = 0 is what switches output pricing from $3.50/million tokens to $0.60/million tokens.

https://x.com/OfficialLoganK/status/1912986097765789782

The thinking budget = 0 is what triggers the billing switch.

Contributor Author

@cheahjs that's how we handle it, if I understand this correctly:

  • completion_tokens = total tokens (maps to the candidate token count from vertex)
  • text tokens (non-thinking tokens) = candidate token count - thinking token count
  • reasoning tokens = thinking token count

What am i missing here?

The pricing for Gemini Flash 2.5 does not distinguish between text tokens and reasoning tokens. They are billed at the same price. The price is determined by whether thinking is enabled. Thinking is enabled if the thinking budget in the request is non-zero or if the budget is not set (the default thinking budget is 8192 tokens).
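
A rough sketch of the pricing rule as described in this comment, not litellm's implementation. The function and constant names are hypothetical; the two rates are the per-million output prices quoted elsewhere in this thread.

```python
from typing import Optional

# Output rates quoted in this thread, converted from USD per million tokens.
THINKING_RATE = 3.50 / 1_000_000
NON_THINKING_RATE = 0.60 / 1_000_000


def thinking_enabled(thinking_budget: Optional[int]) -> bool:
    """Thinking is on unless the request explicitly sets the budget to 0."""
    return thinking_budget is None or thinking_budget != 0


def flash_output_cost(total_output_tokens: int, thinking_budget: Optional[int]) -> float:
    """Bill text and reasoning tokens at a single rate chosen by the budget."""
    rate = THINKING_RATE if thinking_enabled(thinking_budget) else NON_THINKING_RATE
    return total_output_tokens * rate
```

Note that an unset budget (`None`) counts as thinking enabled, so only an explicit 0 selects the lower rate.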

Contributor Author

@krrishdholakia krrishdholakia Apr 20, 2025

Oh, so is the case we're missing the one where thinking budget = 0?

In which case Gemini would return a thinking token count, but the billing is the output_cost_per_token cost? @cheahjs @awesie

Appreciate your help on this!

Contributor

if thinkingBudget != 0 (thinking enabled):

completion cost = candidatesTokenCount * $3.50/million, where candidatesTokenCount includes thinking tokens and response tokens.

if thinkingBudget == 0 (thinking disabled):

completion cost = candidatesTokenCount * $0.60/million, where candidatesTokenCount includes both thinking tokens and response tokens. The model isn't meant to produce thinking tokens, but if it does, it's billed at the non-thinking rate.

Note that the Gemini API returns different usage metadata than Vertex AI. With the Gemini API, candidatesTokenCount includes thinking tokens, but on Vertex AI, candidatesTokenCount does not include thinking tokens.

Gemini API:

{
  "completion_tokens": "usageMetadata['candidatesTokenCount']",
  "completion_tokens_details": {
    "reasoning_tokens": "usageMetadata['thoughtsTokenCount']"
  }
}

Vertex AI:

{
  "completion_tokens": "usageMetadata['candidatesTokenCount'] + usageMetadata['thoughtsTokenCount']",
  "completion_tokens_details": {
    "reasoning_tokens": "usageMetadata['thoughtsTokenCount']"
  }
}
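
The two mappings above could be folded into one normalizing function, sketched here under the assumption that the commenter's description of both APIs is accurate. The function name and the `candidates_include_thoughts` flag are hypothetical; the `usageMetadata` field names are the ones quoted above.

```python
from typing import Any, Dict


def to_openai_usage(
    usage_metadata: Dict[str, Any], candidates_include_thoughts: bool
) -> Dict[str, Any]:
    """Map Gemini-style usageMetadata to an OpenAI-style usage dict."""
    candidates = usage_metadata.get("candidatesTokenCount") or 0
    thoughts = usage_metadata.get("thoughtsTokenCount") or 0
    # Gemini API (per this thread): candidates already include thoughts.
    # Vertex AI (per this thread): thoughts must be added on top.
    completion = candidates if candidates_include_thoughts else candidates + thoughts
    return {
        "completion_tokens": completion,
        "completion_tokens_details": {"reasoning_tokens": thoughts},
    }
```

Either way, `completion_tokens` ends up as the inclusive total, so downstream cost code can treat both providers the same.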


params: GeminiThinkingConfig = {}
if thinking_enabled:
    params["includeThoughts"] = True

I think this is incorrect. The includeThoughts parameter only determines whether thoughts are returned by the API, it does not affect whether reasoning is used at all. If you want to "disable" thinking, you must set the thinking budget to 0 tokens. As such, this function should probably only set thinkingBudget and ditto for the function above as well.

Contributor Author

@krrishdholakia krrishdholakia Apr 19, 2025

i'm following their spec - https://ai.google.dev/api/generate-content#ThinkingConfig

and can confirm this works

if thinking_enabled:
    params["includeThoughts"] = True
if thinking_budget:
    params["thinkingBudget"] = thinking_budget

Do we need the conditional to be if thinking_budget is not None: so that it is possible to set thinking_budget to 0? My understanding is that the thinking budget is non-zero by default, so users need a way to set it to zero explicitly.

Contributor Author

i don't follow what the issue here is

> i don't follow what the issue here is

The issue is that there is no way to disable thinking for gemini-2.5-flash. If users want to disable thinking, they must set thinkingBudget to 0, which is skipped because of the if thinking_budget: check (0 is falsy).

Contributor

@mdonaj mdonaj Apr 21, 2025

This is exactly the bug we are having right now:

  • includeThoughts only controls whether thoughts are returned by the API call, not whether thinking is on or off
  • setting thinkingBudget to 0 is the only way to disable thinking, and right now 0 does not pass the if check because 0 evaluates to False.

We've already tried -1, but the Vertex API returns a 400 with that setting, and a budget of 1 gets auto-converted to the minimum of 1024 by Google.
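
A minimal sketch of the fix being asked for, assuming a simplified stand-in for the config type. `ThinkingConfig` and `build_thinking_config` are hypothetical names, not litellm's; the keys `includeThoughts` and `thinkingBudget` are the ones discussed in this thread.

```python
from typing import Optional, TypedDict


class ThinkingConfig(TypedDict, total=False):
    # Hypothetical stand-in for the config type used in the PR.
    includeThoughts: bool
    thinkingBudget: int


def build_thinking_config(
    include_thoughts: bool, thinking_budget: Optional[int]
) -> ThinkingConfig:
    params: ThinkingConfig = {}
    if include_thoughts:
        # Only controls whether thoughts are returned, not whether thinking runs.
        params["includeThoughts"] = True
    if thinking_budget is not None:
        # "is not None" lets an explicit 0 through; a bare truthiness check drops it.
        params["thinkingBudget"] = thinking_budget
    return params
```

For example, `build_thinking_config(False, 0)` keeps `thinkingBudget: 0` in the result, whereas the truthiness check would have dropped it.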

Contributor

ok '0' works

Contributor Author

Ack

Contributor Author

Fixed e434ccc

Contributor

you are a beast @krrishdholakia. Thanks!

krrishdholakia added a commit that referenced this pull request Apr 19, 2025
…set, but reasoning tokens not charged separately

Addresses #10141 (comment)
krrishdholakia added a commit that referenced this pull request Apr 19, 2025
#10165)

* test(utils.py): handle scenario where text tokens + reasoning tokens set, but reasoning tokens not charged separately

Addresses #10141 (comment)

* fix(vertex_and_google_ai_studio.py): only set content if non-empty str
@krrishdholakia krrishdholakia deleted the litellm_gemini_2_5_flash branch April 19, 2025 19:34
krrishdholakia added a commit that referenced this pull request Apr 22, 2025
krrishdholakia added a commit that referenced this pull request Apr 22, 2025
* fix(vertex_and_google_ai_studio_gemini.py): allow thinking budget = 0

Fixes #10121

* fix(vertex_and_google_ai_studio_gemini.py): handle nuance in counting exclusive vs. inclusive tokens

Addresses #10141 (comment)
minh-thai-gfg pushed a commit to GFG/litellm that referenced this pull request May 7, 2025
* fix(vertex_and_google_ai_studio_gemini.py): allow thinking budget = 0

Fixes BerriAI#10121

* fix(vertex_and_google_ai_studio_gemini.py): handle nuance in counting exclusive vs. inclusive tokens

Addresses BerriAI#10141 (comment)
5 participants