feat(oai refactor): Replace openai_api with entrypoints/openai #7351
Conversation
Summary of Changes
Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request undertakes a significant refactoring of the OpenAI-compatible API serving infrastructure and template management within the system. The primary goal is to enhance modularity, improve code organization, and centralize template handling, leading to a more maintainable and extensible codebase.
Highlights
- API Module Reorganization: The core OpenAI API serving logic has been moved from `sglang.srt.openai_api` to `sglang.srt.entrypoints.openai`, establishing a clearer separation of concerns and improving code structure.
- Centralized Template Management: A new `TemplateManager` class is introduced to consolidate the loading and management of both chat and completion templates, replacing scattered logic and global state for improved modularity.
- Dedicated API Handlers: The HTTP server now utilizes specialized `OpenAIServing` classes for different OpenAI API endpoints (e.g., chat, completions, embeddings, rerank, score), promoting a more object-oriented approach to API request handling.
- Expanded API Functionality: New OpenAI-compatible endpoints for reranking (`/v1/rerank`) and scoring (`/v1/score`) are added, leveraging the new serving class architecture (an illustrative call sketch follows this list).
- Streamlined Protocol Definitions: Minor adjustments are made to the OpenAI protocol definitions, such as making `finish_reason` and `dimensions` optional, for increased flexibility and robustness.
- Removal of Batch API: Existing batch API endpoints (`/v1/files`, `/v1/batches`) and their associated test cases have been removed, simplifying the API surface.
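For orientation, a minimal sketch of what calling the new rerank endpoint could look like. The request fields (`query`, `documents`), the model name, and the port are assumptions for illustration only; the actual schema is defined by the new protocol classes in this PR.

```python
import requests

# Hypothetical request shape for the new /v1/rerank endpoint; field names and
# the port below are assumptions, not taken from this PR's diff.
resp = requests.post(
    "http://localhost:30000/v1/rerank",
    json={
        "model": "my-reranker",
        "query": "What does this PR change?",
        "documents": ["It refactors the OpenAI entrypoints.", "Unrelated text."],
    },
)
print(resp.json())
```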
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request primarily refactors the OpenAI API integration by moving code from an `openai_api` directory to `entrypoints/openai` and introducing a `TemplateManager` for better organization. It also adds new serving classes for different OpenAI-compatible endpoints like rerank and score.
Overall, the refactoring appears to be a positive step towards better code organization and maintainability. My main concerns revolve around potential API contract changes in `protocol.py` and the removal of batch file processing functionality, which should be clarified if it's an intended change for this PR.
Key Changes & Observations:
- Refactoring: Successfully moved OpenAI-related logic to the `entrypoints/openai` path.
- `TemplateManager`: Introduced `TemplateManager` to centralize chat and completion template loading and management, which is a good improvement over global state or scattered logic.
- New Serving Classes: Added dedicated classes (`OpenAIServingChat`, `OpenAIServingCompletion`, `OpenAIServingEmbedding`, `OpenAIServingRerank`, `OpenAIServingScore`) to handle specific OpenAI endpoints. This enhances modularity.
- Endpoint Removal: The `/v1/files` and `/v1/batches` endpoints, along with their associated functionality and tests, have been removed. This is a significant functional change and needs confirmation that it is within the scope of this refactor.
- Protocol Changes: `finish_reason` in `CompletionResponseChoice` and `ChatCompletionResponseChoice` is now `Optional`. `ChatCompletionRequest`'s `validate_messages_not_empty` validator was removed. `ChatCompletionRequest`'s `set_tool_choice_default` validator was simplified. A new `RerankResponse` class was added.
- Logging: Improved error logging in `serving_base.py` by using `logger.exception`.
- Docstrings: Added a more detailed module docstring to `python/sglang/srt/conversation.py`.
Recommendations:
- Clarify whether the removal of the batch file processing APIs (`/v1/files`, `/v1/batches`) is an intended part of this refactoring PR.
- Review the API contract changes in `protocol.py` to ensure they are intentional and to understand any potential impact on clients.
- Consider re-adding Pydantic-level validation for non-empty messages in `ChatCompletionRequest` if it's a strict requirement (a sketch follows below).
The changes generally improve the codebase structure. Addressing the points above will help ensure the quality and clarity of this refactor.
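If the non-empty-messages check were re-added at the Pydantic level, a minimal sketch could look like the following. The model here is a trimmed-down stand-in; only the validator name comes from this PR, the field typing is simplified for illustration.

```python
from typing import Dict, List

from pydantic import BaseModel, field_validator


class ChatCompletionRequest(BaseModel):
    """Trimmed-down stand-in for the real request model."""

    messages: List[Dict]

    @field_validator("messages")
    @classmethod
    def validate_messages_not_empty(cls, v):
        # Reject requests whose messages list is empty, mirroring the
        # validator that was removed in this refactor.
        if not v:
            raise ValueError("messages cannot be empty")
        return v
```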
```diff
@@ -200,7 +200,7 @@ class CompletionResponseChoice(BaseModel):
     index: int
     text: str
     logprobs: Optional[LogProbs] = None
-    finish_reason: Literal["stop", "length", "content_filter", "abort"]
+    finish_reason: Optional[Literal["stop", "length", "content_filter", "abort"]] = None
```
Making `finish_reason` optional is an API contract change. Is it now possible for `finish_reason` to be `None` in some scenarios? If so, could you briefly explain these scenarios or point to where this change is handled downstream? This also applies to `ChatCompletionResponseChoice.finish_reason` on line 428.
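For context, a small sketch of how a client might guard against the now-optional field when consuming responses; this is illustrative only and not taken from the PR.

```python
def choice_is_done(choice: dict) -> bool:
    # With finish_reason now Optional, a chunk may carry None; only a
    # non-None value marks the choice as finished.
    return choice.get("finish_reason") is not None


assert not choice_is_done({"index": 0, "text": "partial", "finish_reason": None})
assert choice_is_done({"index": 0, "text": "", "finish_reason": "stop"})
```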
```diff
     @model_validator(mode="before")
     @classmethod
     def set_tool_choice_default(cls, values):
-        if isinstance(values, dict):
-            if values.get("tool_choice") is None:
-                if values.get("tools") is None:
-                    values["tool_choice"] = "none"
-                else:
-                    values["tool_choice"] = "auto"
+        if values.get("tool_choice") is None:
+            if values.get("tools") is None:
+                values["tool_choice"] = "none"
+            else:
+                values["tool_choice"] = "auto"
         return values
```
The `isinstance(values, dict)` check was removed from the `set_tool_choice_default` validator. This makes the validator slightly less robust if `ChatCompletionRequest` could be instantiated with non-dict-like data that Pydantic can't coerce to a dict before this validator runs. This could lead to an `AttributeError` on `values.get()`.
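A minimal sketch of the defensive pattern this comment suggests; the model is trimmed down to the two relevant fields and is not the real class definition.

```python
from typing import List, Optional

from pydantic import BaseModel, model_validator


class ChatCompletionRequest(BaseModel):
    """Trimmed-down stand-in for the real request model."""

    tool_choice: Optional[str] = None
    tools: Optional[List[dict]] = None

    @model_validator(mode="before")
    @classmethod
    def set_tool_choice_default(cls, values):
        # In "before" mode the raw input may not be a dict, so guard before
        # calling .get() to avoid an AttributeError on non-dict input.
        if isinstance(values, dict) and values.get("tool_choice") is None:
            values["tool_choice"] = "none" if values.get("tools") is None else "auto"
        return values
```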
```diff
@@ -51,7 +51,7 @@ async def handle_request(
             )

         except Exception as e:
-            logger.error(f"Error in request: {e}")
+            logger.exception(f"Error in request: {e}")
```
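The practical difference, as a small standalone sketch: `logger.exception` logs at ERROR level and appends the active traceback, which `logger.error` does not do unless `exc_info=True` is passed.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    1 / 0
except Exception as e:
    logger.error(f"Error in request: {e}")      # message only
    logger.exception(f"Error in request: {e}")  # message plus full traceback
```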
```diff
         if self.template_manager.chat_template_name is None:
             prompt, prompt_ids, image_data, audio_data, modalities, stop = (
                 self._apply_jinja_template(request, tools, is_multimodal)
             )
         else:
-            prompt, image_data, audio_data, modalities, stop = (
+            prompt, prompt_ids, image_data, audio_data, modalities, stop = (
                 self._apply_conversation_template(request)
             )
```
```diff
         )

-        template_content_format = self._cached_template_format
+        template_content_format = self.template_manager.jinja_template_content_format
```
```python
        stop = request.stop
        image_data = image_data if image_data else None
        audio_data = audio_data if audio_data else None
        modalities = modalities if modalities else []
        return prompt, prompt_ids, image_data, audio_data, modalities, stop
```
The assignment `stop = request.stop` is a bit more direct than `request.stop or []`. If `request.stop` can be `None`, this change is fine, as `None` is often a valid value for stop sequences (meaning use the default or no specific stop sequences beyond EOS). The conditional initialization of `image_data`, `audio_data`, and `modalities` is also clearer.
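For illustration, the behavioral difference between the two forms when the client omits stop sequences:

```python
stop = None                 # request.stop when the client sends nothing
normalized = stop or []     # old style: None collapses to []
direct = stop               # new style: None is preserved and passed through
assert normalized == [] and direct is None
```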
```python
        if not is_multimodal:
            prompt_ids = self.tokenizer_manager.tokenizer.encode(prompt)

        return prompt, prompt_ids, image_data, audio_data, modalities, stop
```
```python
        if request.chat_template_kwargs is not None:
            enable_thinking = request.chat_template_kwargs.get(
                "enable_thinking", True
            )
```
The original issue is that the default is `None`; it will override `{}` and raise an error on `None.get()`.
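In other words, with a `None` default the attribute has no `.get()` method, so a guard is needed before reading it. A minimal reproduction of the failure and the guarded access (variable names here are illustrative):

```python
chat_template_kwargs = None  # default when the client sends nothing

# The old code effectively did this and crashed:
#   enable_thinking = chat_template_kwargs.get("enable_thinking", True)
#   AttributeError: 'NoneType' object has no attribute 'get'

enable_thinking = True
if chat_template_kwargs is not None:
    enable_thinking = chat_template_kwargs.get("enable_thinking", True)
```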
"""Template utilities for Jinja template processing. | ||
|
||
This module provides utilities for analyzing and processing Jinja chat templates, | ||
including content format detection and message processing. | ||
""" |
- Replace all imports with sglang/srt/entrypoints/openai
- Remove batch and file endpoints in http_server.py
- Update openai-compatible endpoints in http_server.py
- Add TODO find a better way for the chat template management
- logger.error won't print detailed exec details
- prompt and prompt_ids are not created before being referenced
- modalities data should be None in `_apply_jinja_template`, otherwise there will be an error in GenerateReqInput.normalize
…utilities
* Create centralized TemplateManager class to eliminate global template state
  - Replace global chat_template_name/completion_template_name variables
  - Integrate TemplateManager into _GlobalState and inject into serving classes
  - Add proper type hints for TokenizerManager and TemplateManager parameters
* Extract and reorganize Jinja template utilities for better separation of concerns
  - Create jinja_template_utils.py with template detection and processing logic
  - Move utilities from openai/utils.py to avoid improper dependencies
  - Rename detect_template_content_format -> detect_jinja_template_content_format
* Optimize template content format detection for better performance
  - Detect format during template loading instead of on every request
  - Cache detected format in TemplateManager.jinja_template_content_format property
  - Add logging for template format detection results
* Clean up codebase and improve maintainability
  - Remove unused imports and clean up import organization
  - Simplify TemplateManager interface by removing unused is_initialized property
  - Update all serving classes (chat, completions, embedding) to use dependency injection
  - Improve code organization and eliminate architectural debt

Benefits:
- ✅ Eliminates global state pollution
- ✅ Better separation of concerns (generic vs OpenAI-specific utilities)
- ✅ Improved performance through caching
- ✅ Cleaner dependency injection pattern
- ✅ More testable and maintainable architecture
- The handling for a single string in a list can be removed now that #7396 is merged.
- Add UT cases in test_openai_server for this case
…ontent
- Remove enable_thinking from the check condition when handling reasoning content: enable_thinking is a flag only supported by Qwen3, in its chat_template. We pass this parameter via `self.tokenizer_manager.tokenizer.apply_chat_template`. When handling reasoning content, since other models don't support enable_thinking, this flag should be removed from the check condition.
- Add back `_get_enable_thinking_from_request` as a util function, as some reasoning-related parsers or backends may need it in the future.
I tested the embedding, rerank, and score endpoints; all seem good.
- Correct name should be `detect_jinja_template_content_format`
Qwen3 should not assume it is always in reasoning mode, as Qwen3 supports a parameter `enable_thinking=False` in its chat_template. In that case, it won't generate thinking content.
…ing is False
- Introduce a `self._in_reasoning` to handle the case when force_reasoning is False
We don't need to separate `_in_reasoning` and `_force_in_reasoning`.
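To make the `force_reasoning` / `_in_reasoning` discussion concrete, here is a simplified sketch of the state machine being described. The class name, marker tokens, and return shape are illustrative assumptions; this is not the actual `BaseReasoningFormatDetector` implementation.

```python
class ReasoningDetectorSketch:
    """Simplified stand-in for the detector behavior discussed above."""

    def __init__(self, force_reasoning: bool = False,
                 think_start: str = "<think>", think_end: str = "</think>") -> None:
        self.think_start = think_start
        self.think_end = think_end
        # One flag is enough: the detector starts "inside" reasoning only when forced.
        self._in_reasoning = force_reasoning

    def parse(self, text: str) -> tuple:
        """Return (reasoning_text, normal_text) for a complete model output."""
        if not self._in_reasoning and self.think_start not in text:
            return "", text  # no reasoning markers and not forced: plain output
        if self.think_start in text:
            self._in_reasoning = True
            text = text.split(self.think_start, 1)[-1]
        if self.think_end in text:
            reasoning, normal = text.split(self.think_end, 1)
            self._in_reasoning = False
            return reasoning, normal
        return text, ""  # still inside an unterminated reasoning block


detector = ReasoningDetectorSketch(force_reasoning=False)
print(detector.parse("Hello!"))                                # ('', 'Hello!')
print(detector.parse("<think>plan steps</think>The answer."))  # ('plan steps', 'The answer.')
```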
- If there is already a tool, the next token can simply be a `tool_call_separator`, before following with a `{`
- Add `retrieve_model` in `http_server.py`. It retrieves a model's information and returns 404 if the model is not in the served model list.
- Add UT
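A rough sketch of the described behavior. The route path follows the standard OpenAI models API; the `served_model_names` lookup and the response fields below are placeholders, not the actual `http_server.py` code.

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
served_model_names = {"qwen3-8b"}  # placeholder for the real served-model list


@app.get("/v1/models/{model_id}")
async def retrieve_model(model_id: str):
    # Return the model's information, or 404 if it is not being served.
    if model_id not in served_model_names:
        raise HTTPException(status_code=404, detail=f"Model '{model_id}' not found")
    return {"id": model_id, "object": "model", "owned_by": "sglang"}
```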
…gl-project#7351) Co-authored-by: Jin Pan <[email protected]>
@CatherineSue Hello, may I ask why the files/batch APIs have been removed?
Motivation
This PR aims to refactor and modernize the OpenAI API implementation in SGLang by removing the legacy openai_api module and consolidating it into the entrypoints/openai structure.
Modifications
1. Remove Legacy OpenAI API Module
- Removed the `sglang/srt/openai_api` directory (commit 5a7df078)
- Replaced all imports with `sglang/srt/entrypoints/openai`
- Removed batch and file endpoints in `http_server.py`
- Updated OpenAI-compatible endpoints in `http_server.py`
- Updated `openai/api_server.py` and `conftest.py` (commit d6d7351c)
2. Introduce Centralized Template Management (commit 7d343747)
- Created a centralized `TemplateManager` class to eliminate global template state (see the sketch after this list)
- Replaced the global `chat_template_name`/`completion_template_name` variables
- Integrated `TemplateManager` into `_GlobalState` with proper dependency injection
- Added type hints for `TokenizerManager` and `TemplateManager` parameters
- Created `jinja_template_utils.py` with template detection and processing logic (moved from `utils.py`)
- Renamed `detect_template_content_format` → `detect_jinja_template_content_format`
- Cached the detected format in `TemplateManager.jinja_template_content_format`
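As a rough illustration of the shape this centralizes (property and method names are taken from the bullets above, but the actual class in `entrypoints/openai` may differ):

```python
from typing import Optional


class TemplateManager:
    """Sketch of the centralized template state described above."""

    def __init__(self) -> None:
        self._chat_template_name: Optional[str] = None
        self._completion_template_name: Optional[str] = None
        self._jinja_template_content_format: Optional[str] = None

    @property
    def chat_template_name(self) -> Optional[str]:
        return self._chat_template_name

    @property
    def completion_template_name(self) -> Optional[str]:
        return self._completion_template_name

    @property
    def jinja_template_content_format(self) -> Optional[str]:
        # Detected once when the template is loaded and cached here,
        # instead of being re-detected on every request.
        return self._jinja_template_content_format

    def load_chat_template(self, tokenizer_manager, chat_template_arg: Optional[str]) -> None:
        # Resolve either a named conversation template or a Jinja template file,
        # then record its name and detected content format for later requests.
        ...
```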
3. Error Fixes and Improvements
- Added `validation_exception_handler` to `http_server.py`. This will enforce `Content-Type: application/json` in the request header, which is an OpenAI standard. It also enables FastAPI to automatically decode the payload, bypassing the need to handle it manually in each endpoint (a minimal sketch of this pattern follows item 4 below).
- Fixed `is_multimodal` not found error
- Fixed `enable_thinking` parameter issues
4. Code Cleanup and Optimization
- Cleaned up the `entrypoints/openai` code
- Added tests under `test/srt/openai` and updated `run_suite.py`
- Removed `V1RerankReqInput` duplication between `openai.protocol` and `io_struct`
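For reference, a minimal FastAPI sketch of the `validation_exception_handler` pattern mentioned in item 3. The status code and error-body shape here are assumptions for illustration and are not copied from this PR.

```python
from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse

app = FastAPI()


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
    # Convert FastAPI/Pydantic validation failures into an OpenAI-style error body
    # instead of FastAPI's default 422 payload.
    return JSONResponse(
        status_code=400,
        content={"error": {"message": str(exc), "type": "invalid_request_error"}},
    )
```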
5. Bug Fixes for Specific Tests
- Updated `test_openai_server.py` to handle separated usage chunk and finish_reason chunk
- Fixed `enable_thinking` in reasoning content handling (commits e583fd9, (b22eb4a...2dc0341))
  - Previously the request had an `enable_thinking` parameter and it was used as a check for reasoning content parsing: `if reasoning_parser and request.separate_reasoning and enable_thinking:`. See `adapter.py`, `serving_chat.py`.
  - Not all models have an `enable_thinking` parameter. It defaults to `True` unless it is explicitly set through `chat_template_kwargs`. However, this brings confusion to developers, as not all models support `enable_thinking`.
  - Removed `enable_thinking` as a condition for reasoning content parsing since it's specific to Qwen3's chat template, not a general OpenAI-compatible parameter.
  - The reasoning detector previously used `force_thinking=True`, which causes models like Qwen3 to treat even the initial user message as part of the reasoning, leading to unexpected behavior when `enable_thinking` is irrelevant.
  - Changed `force_thinking=True` to `False` in `Qwen3Detector` and fixed the logic in `BaseReasoningFormatDetector` to enable it to handle cases when `force_thinking=False`.
  - `test_reasoning_content` is passed.

Checklist