Gemini-2.5-flash - support reasoning cost calc + return reasoning content #10141
Conversation
        and reasoning_tokens
        and reasoning_tokens > 0
    ):
        completion_cost += float(reasoning_tokens) * _output_cost_per_reasoning_token
My understanding of the Gemini 2.5 Flash pricing is that the output cost per token is binary, based on whether reasoning is enabled (i.e., whether the thinking budget was 0). I believe whether a token is an output token or a reasoning token is not relevant; both are just output tokens.
It's a good idea to add a unit test to make sure these are calculated separately only if the model has 'output_cost_per_reasoning_token' set.
Thanks @awesie
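As a rough illustration of the guarded cost logic discussed above, here is a hedged sketch. The helper `calc_completion_cost` and its `model_info` dict are hypothetical stand-ins (not litellm's actual function or signature); the key names mirror the pricing keys mentioned in this thread.

```python
def calc_completion_cost(
    completion_tokens: int,
    reasoning_tokens: int,
    model_info: dict,
) -> float:
    """Bill reasoning tokens separately only when the model defines a rate for them."""
    cost = completion_tokens * model_info["output_cost_per_token"]
    reasoning_rate = model_info.get("output_cost_per_reasoning_token")
    # Mirrors the reviewed diff: add reasoning cost only if the model sets a
    # distinct 'output_cost_per_reasoning_token' and reasoning tokens are present.
    if reasoning_rate is not None and reasoning_tokens and reasoning_tokens > 0:
        cost += float(reasoning_tokens) * reasoning_rate
    return cost

# Model with a distinct reasoning rate: reasoning tokens add extra cost.
with_rate = {"output_cost_per_token": 0.6e-6, "output_cost_per_reasoning_token": 3.5e-6}
# Model without one: this branch leaves reasoning tokens uncharged separately.
without_rate = {"output_cost_per_token": 0.6e-6}

print(calc_completion_cost(100, 50, with_rate))     # 100*0.6e-6 + 50*3.5e-6
print(calc_completion_cost(100, 50, without_rate))  # 100*0.6e-6 only
```

A unit test along these lines would cover both branches: a model with the key set and one without.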
Added here - dacc712
Reasoning tokens are not calculated separately from output tokens; both are considered output. Disabling thinking with thinkingBudget = 0 is what switches output pricing from $3.50/million tokens to $0.60/million tokens.
https://x.com/OfficialLoganK/status/1912986097765789782
The thinking budget = 0 is what triggers the billing switch.
@cheahjs that's how we handle it, if I understand this correctly:
- completion_tokens = total tokens (maps to the candidate token count from Vertex)
- text tokens (non-thinking tokens) = candidate token count - thinking token count
- reasoning tokens = thinking token count
What am I missing here?
The pricing for Gemini Flash 2.5 does not distinguish between text tokens and reasoning tokens. They are billed at the same price. The price is determined based on whether thinking is enabled or not. Thinking is enabled if the thinking budget in the request is non-zero or if the budget is not set (the default thinking budget is 8192 tokens).
If thinkingBudget != 0 (thinking enabled): completion cost = candidatesTokenCount * $3.50/million, where candidatesTokenCount includes thinking tokens and response tokens.
If thinkingBudget == 0 (thinking disabled): completion cost = candidatesTokenCount * $0.60/million, where candidatesTokenCount includes both thinking tokens and response tokens. The model isn't meant to produce thinking tokens, but if it does, they are billed at the non-thinking rate.
Note that the Gemini API returns different usage metadata than Vertex AI. With the Gemini API, candidatesTokenCount includes thinking tokens, but on Vertex AI, candidatesTokenCount does not include thinking tokens.
Gemini API:
{
"completion_tokens": "usageMetadata['candidatesTokenCount']",
"completion_tokens_details": {
"reasoning_tokens": "usageMetadata['thoughtsTokenCount']"
}
}
Vertex AI:
{
"completion_tokens": "usageMetadata['candidatesTokenCount'] + usageMetadata['thoughtsTokenCount']",
"completion_tokens_details": {
"reasoning_tokens": "usageMetadata['thoughtsTokenCount']"
}
}
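The two mappings above can be sketched as a small normalization helper. This is a hedged illustration, not litellm's implementation: the function names are hypothetical, and the per-token rates are the figures quoted in this thread for gemini-2.5-flash.

```python
from typing import Optional

def normalize_completion_tokens(usage_metadata: dict, is_vertex: bool) -> dict:
    """Map raw Gemini/Vertex usageMetadata to OpenAI-style completion token counts.

    Gemini API: candidatesTokenCount already includes thinking tokens.
    Vertex AI: candidatesTokenCount excludes them, so add thoughtsTokenCount back.
    """
    candidates = usage_metadata.get("candidatesTokenCount", 0)
    thoughts = usage_metadata.get("thoughtsTokenCount", 0)
    completion = candidates + thoughts if is_vertex else candidates
    return {
        "completion_tokens": completion,
        "completion_tokens_details": {"reasoning_tokens": thoughts},
    }

def completion_cost(usage_metadata: dict, is_vertex: bool,
                    thinking_budget: Optional[int]) -> float:
    # Per the thread, gemini-2.5-flash output pricing is binary: one flat rate
    # when thinking is enabled (budget unset or non-zero), a lower one when it is 0.
    RATE_THINKING = 3.50 / 1_000_000     # assumed $/output token, from the discussion
    RATE_NO_THINKING = 0.60 / 1_000_000
    rate = RATE_NO_THINKING if thinking_budget == 0 else RATE_THINKING
    tokens = normalize_completion_tokens(usage_metadata, is_vertex)["completion_tokens"]
    return tokens * rate

usage = {"candidatesTokenCount": 100, "thoughtsTokenCount": 40}
print(normalize_completion_tokens(usage, is_vertex=True)["completion_tokens"])   # 140
print(normalize_completion_tokens(usage, is_vertex=False)["completion_tokens"])  # 100
```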
params: GeminiThinkingConfig = {}
if thinking_enabled:
    params["includeThoughts"] = True
I think this is incorrect. The includeThoughts parameter only determines whether thoughts are returned by the API; it does not affect whether reasoning is used at all. If you want to "disable" thinking, you must set the thinking budget to 0 tokens. As such, this function should probably only set thinkingBudget, and ditto for the function above as well.
I'm following their spec - https://ai.google.dev/api/generate-content#ThinkingConfig - and can confirm this works.
if thinking_enabled:
    params["includeThoughts"] = True
if thinking_budget:
    params["thinkingBudget"] = thinking_budget
Do we need the conditional to be if thinking_budget is not None: so that it is possible to set thinking_budget to 0? My understanding is that the thinking budget is non-zero by default, so users need a way to set it to zero explicitly.
i don't follow what the issue here is
i don't follow what the issue here is

The issue is that there is no way to disable thinking for gemini-2.5-flash. If users want to disable thinking, they must set thinkingBudget to 0, which is skipped because of the if thinking_budget: statement.
This is exactly the bug we are having right now:
- includeThoughts only controls whether thoughts will be returned by the API call, not whether thinking is on/off.
- thinkingBudget set to 0 is the only way to disable thinking, and right now 0 will not pass through the conditional, because if 0 is False.
We've already tried -1, but the Vertex API gives a 400 with that setting, and a budget of 1 gets auto-converted to the minimum of 1024 by Google.
ok '0' works
Ack
Fixed e434ccc
you are a beast @krrishdholakia. Thanks!
* test(utils.py): handle scenario where text tokens + reasoning tokens set, but reasoning tokens not charged separately. Addresses #10141 (comment) (#10165)
* fix(vertex_and_google_ai_studio.py): only set content if non-empty str
* fix(vertex_and_google_ai_studio_gemini.py): allow thinking budget = 0. Fixes #10121
* fix(vertex_and_google_ai_studio_gemini.py): handle nuance in counting exclusive vs. inclusive tokens. Addresses #10141 (comment)
build(model_prices_and_context_window.json): add vertex ai gemini-2.5-flash pricing
build(model_prices_and_context_window.json): add gemini reasoning token pricing
fix(vertex_and_google_ai_studio_gemini.py): support counting thinking tokens for gemini
allows accurate cost calc
ensures gemini-2.5-flash cost calculation is accurate
build(model_prices_and_context_window.json): mark gemini-2.5-flash as 'supports_reasoning'
feat(gemini/): support 'thinking' + 'reasoning_effort' params + new unit tests
allow controlling thinking effort for gemini-2.5-flash models
test: update unit testing
feat(vertex_and_google_ai_studio_gemini.py): return reasoning content if given in gemini response