Add a buffer to the maximum number of tokens generated (to fix #353) #354
Description
Issue #353 found that the LiteLLM agent now reports that our model's request exceeds the maximum allowable number of tokens.
The problem stems from a disparity between `tiktoken`'s token counts and the counts used elsewhere:
- For the full prompt parser prompt, `tiktoken` counts 2569 tokens, so we set `max_tokens` for LiteLLM to 4097 - 2569 = 1528.
- However, OpenAI's API counts 2576 tokens for the same prompt, so prompt plus completion exceeds the 4097-token limit (OpenAI's online tokenizer counts 2862 tokens).
A naive solution is to add a buffer, e.g. generate 300 fewer tokens than the computed limit (so `max_tokens` becomes 1228 instead of 1528). This PR implements that solution; a sketch of the idea follows.
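For illustration only, here is a minimal sketch of the buffer approach. The names `compute_max_tokens`, `MODEL_CONTEXT_LIMIT`, and `TOKEN_BUFFER` are hypothetical and do not come from the repo; the actual change may live elsewhere and use different constants.

```python
import tiktoken

# Hypothetical constants for illustration; the repo's actual values may differ.
MODEL_CONTEXT_LIMIT = 4097  # context window of the target model (e.g. gpt-3.5-turbo)
TOKEN_BUFFER = 300          # safety margin for tokenizer count discrepancies


def compute_max_tokens(prompt: str, model: str = "gpt-3.5-turbo") -> int:
    """Return a max_tokens value that leaves a buffer below the context limit."""
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = len(encoding.encode(prompt))
    # Subtract the buffer so that small disagreements between tiktoken's count
    # and the count enforced by OpenAI's API cannot push the request over the limit.
    return max(MODEL_CONTEXT_LIMIT - prompt_tokens - TOKEN_BUFFER, 1)
```

With the prompt parser prompt from the issue (2569 tokens per `tiktoken`), this would yield 4097 - 2569 - 300 = 1228, matching the value described above.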
References
N/A
Blocked by
N/A