
feat(oai refactor): Replace openai_api with entrypoints/openai #7351


Merged

zhyncs merged 36 commits into main from chang/remove-adapter on Jun 21, 2025

Conversation

@CatherineSue (Collaborator) commented Jun 19, 2025

Motivation

This PR aims to refactor and modernize the OpenAI API implementation in SGLang by removing the legacy openai_api module and consolidating it into the entrypoints/openai structure.

Modifications

1. Remove Legacy OpenAI API Module

  • Removed sglang/srt/openai_api directory (commit 5a7df078)
  • Replaced all imports with sglang/srt/entrypoints/openai
  • Removed batch and file endpoints from http_server.py
  • Updated OpenAI-compatible endpoints in http_server.py
  • Removed openai/api_server.py and conftest.py (commit d6d7351c)
  • Consolidated functionality into the new structure

2. Introduce Centralized Template Management (commit 7d343747)

  • Created TemplateManager class to eliminate global template state
  • Replaced global chat_template_name/completion_template_name variables
  • Integrated TemplateManager into _GlobalState with proper dependency injection
  • Added proper type hints for TokenizerManager and TemplateManager parameters
  • Extracted and reorganized Jinja template utilities
    • Created jinja_template_utils.py with template detection and processing logic (moved from utils.py)
  • Moved utilities from openai/utils.py to avoid improper dependencies
  • Renamed detect_template_content_format → detect_jinja_template_content_format
  • Optimized template content format detection
    • Detect format during template loading instead of on every request
    • Cache detected format in TemplateManager.jinja_template_content_format (a minimal sketch of this design follows this list)
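
To make the design concrete, here is a minimal sketch of the TemplateManager pattern described above. The class and attribute names (TemplateManager, chat_template_name, jinja_template_content_format, detect_jinja_template_content_format) come from this PR; the loading and detection internals below are simplified assumptions, not the actual implementation.

from typing import Optional


def detect_jinja_template_content_format(template: str) -> str:
    # Stand-in for the real detector in jinja_template_utils.py: report "openai"
    # when the template appears to iterate over structured content parts.
    return "openai" if "for content" in template else "string"


class TemplateManager:
    """Holds chat/completion template state that previously lived in module globals."""

    def __init__(self) -> None:
        self._chat_template_name: Optional[str] = None
        self._completion_template_name: Optional[str] = None
        self._jinja_template_content_format: Optional[str] = None

    @property
    def chat_template_name(self) -> Optional[str]:
        return self._chat_template_name

    @property
    def jinja_template_content_format(self) -> Optional[str]:
        # Cached once at load time, so request handlers never re-run detection.
        return self._jinja_template_content_format

    def load_chat_template(self, tokenizer, chat_template_name: Optional[str]) -> None:
        self._chat_template_name = chat_template_name
        if chat_template_name is None and getattr(tokenizer, "chat_template", None):
            # Detect the content format exactly once, during template loading.
            self._jinja_template_content_format = detect_jinja_template_content_format(
                tokenizer.chat_template
            )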

3. Error Fixes and Improvements

  • Fixed validation and error handling (commit b0c940cb)
  • Added validation_exception_handler to http_server.py. This enforces Content-Type: application/json in the request header (an OpenAI standard) and enables FastAPI to automatically decode the payload, removing the need to handle it manually in each endpoint (see the sketch after this list).
  • Fixed import and parameter errors (commits 90cc976e, 40a5c5a2)
  • Fixed is_multimodal not found error
  • Fixed enable_thinking parameter issues
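
As a rough illustration of the validation change (a sketch, not the actual http_server.py code): once the endpoints accept Pydantic request models, FastAPI decodes and validates the JSON body itself, and a RequestValidationError handler can translate malformed payloads into an OpenAI-style error instead of FastAPI's default 422 response. The error body shape below is an assumption.

from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse

app = FastAPI()


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
    """Map FastAPI/Pydantic validation failures to an OpenAI-style error body."""
    return JSONResponse(
        status_code=400,
        content={"error": {"message": str(exc), "type": "invalid_request_error"}},
    )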

4. Code Cleanup and Optimization

  • Cleaned up unused imports and code (commit f9add484)
  • Removed unused imports in entrypoints/openai
  • Removed unnecessary tests in test/srt/openai
  • Updated test timings in run_suite.py
  • Removed V1RerankReqInput duplication between openai.protocol and io_struct
  • Improved logging (commits c3be8638, a7d3b35c)
  • Use logger.exception for better error details
  • Skip embeddings in log_requests for more concise logs

5. Bug Fixes for Specific Tests

  • Fixed streaming response format (commit f9add484)
  • Updated test_openai_server.py to handle separated usage chunk and finish_reason chunk
  • Removed batch test in test_openai_server.py
  • Remove enable_thinking in reasoning content handling (commits e583fd9, b22eb4a…2dc0341)
    • The enable_thinking parameter was used as a condition for reasoning content parsing: `if reasoning_parser and request.separate_reasoning and enable_thinking:` (see adapter.py, serving_chat.py).
    • Some models like DeepSeek do not support the enable_thinking parameter.
    • This parameter is only supported by Qwen3 in its chat_template.
    • Our logic set this to True unless it was explicitly set through chat_template_kwargs. However, this confuses developers, as not all models support enable_thinking.
    • Suggestion:
      • The server should not use enable_thinking as a condition for reasoning content parsing since it’s specific to Qwen3’s chat template, not a general OpenAI-compatible parameter.
      • It’s more appropriate to treat it as a sampling parameter, controlled entirely by the chat template.
      • ReasoningParser has force_thinking=True, which causes models like Qwen3 to treat even the initial user message as part of the reasoning, leading to unexpected behavior.
      • For models like DeepSeek-R1, the `<think>` tag was removed in their templates, so enable_thinking is irrelevant.
      • This PR changed force_thinking=True to False in Qwen3Detector and fixed the logic in BaseReasoningFormatDetector so it can handle the case when force_thinking=False (a simplified sketch follows this list).
      • Test: test_reasoning_content currently passes.
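
A simplified sketch of the detector change described above. The class names (BaseReasoningFormatDetector, Qwen3Detector) and the force_reasoning/_in_reasoning flags come from this PR's commits; the tag names and the non-streaming parse logic are illustrative assumptions, not the real implementation.

class BaseReasoningFormatDetector:
    def __init__(self, think_start="<think>", think_end="</think>", force_reasoning=False):
        self.think_start = think_start
        self.think_end = think_end
        # With force_reasoning=False, we only enter reasoning mode after actually
        # seeing the start tag, instead of assuming output begins as reasoning.
        self._in_reasoning = force_reasoning

    def parse(self, text: str):
        """Return (reasoning_content, normal_content) for a complete output."""
        if not self._in_reasoning and text.lstrip().startswith(self.think_start):
            self._in_reasoning = True
            text = text.lstrip()[len(self.think_start):]
        if self._in_reasoning and self.think_end in text:
            reasoning, normal = text.split(self.think_end, 1)
            self._in_reasoning = False
            return reasoning.strip(), normal.strip()
        return (text, "") if self._in_reasoning else ("", text)


class Qwen3Detector(BaseReasoningFormatDetector):
    def __init__(self):
        # Changed from force_reasoning=True: with enable_thinking=False in its
        # chat_template, Qwen3 emits no <think> block at all.
        super().__init__(force_reasoning=False)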

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the OpenAI-compatible API serving infrastructure and template management within the system. The primary goal is to enhance modularity, improve code organization, and centralize template handling, leading to a more maintainable and extensible codebase.

Highlights

  • API Module Reorganization: The core OpenAI API serving logic has been moved from sglang.srt.openai_api to sglang.srt.entrypoints.openai, establishing a clearer separation of concerns and improving code structure.
  • Centralized Template Management: A new TemplateManager class is introduced to consolidate the loading and management of both chat and completion templates, replacing scattered logic and global state for improved modularity.
  • Dedicated API Handlers: The HTTP server now utilizes specialized OpenAIServing classes for different OpenAI API endpoints (e.g., chat, completions, embeddings, rerank, score), promoting a more object-oriented approach to API request handling.
  • Expanded API Functionality: New OpenAI-compatible endpoints for reranking (/v1/rerank) and scoring (/v1/score) are added, leveraging the new serving class architecture.
  • Streamlined Protocol Definitions: Minor adjustments are made to the OpenAI protocol definitions, such as making finish_reason and dimensions optional, for increased flexibility and robustness.
  • Removal of Batch API: Existing batch API endpoints (/v1/files, /v1/batches) and their associated test cases have been removed, simplifying the API surface.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request primarily refactors the OpenAI API integration by moving code from an openai_api directory to entrypoints/openai and introducing a TemplateManager for better organization. It also adds new serving classes for different OpenAI-compatible endpoints like rerank and score.

Overall, the refactoring appears to be a positive step towards better code organization and maintainability. My main concerns revolve around potential API contract changes in protocol.py and the removal of batch file processing functionality, which should be clarified if it's an intended change for this PR.

Key Changes & Observations:

  • Refactoring: Successfully moved OpenAI related logic to the entrypoints/openai path.
  • TemplateManager: Introduced TemplateManager to centralize chat and completion template loading and management, which is a good improvement over global state or scattered logic.
  • New Serving Classes: Added dedicated classes (OpenAIServingChat, OpenAIServingCompletion, OpenAIServingEmbedding, OpenAIServingRerank, OpenAIServingScore) to handle specific OpenAI endpoints. This enhances modularity.
  • Endpoint Removal: The /v1/files and /v1/batches endpoints, along with their associated functionality and tests, have been removed. This is a significant functional change and needs confirmation if it's within the scope of this refactor.
  • Protocol Changes:
    • finish_reason in CompletionResponseChoice and ChatCompletionResponseChoice is now Optional.
    • ChatCompletionRequest's validate_messages_not_empty validator was removed.
    • ChatCompletionRequest's set_tool_choice_default validator was simplified.
    • A new RerankResponse class was added.
  • Logging: Improved error logging in serving_base.py by using logger.exception.
  • Docstrings: Added a more detailed module docstring to python/sglang/srt/conversation.py.

Recommendations:

  • Clarify if the removal of batch file processing APIs (/v1/files, /v1/batches) is an intended part of this refactoring PR.
  • Review the API contract changes in protocol.py to ensure they are intentional and to understand any potential impact on clients.
  • Consider re-adding Pydantic-level validation for non-empty messages in ChatCompletionRequest if it's a strict requirement.

The changes generally improve the codebase structure. Addressing the points above will help ensure the quality and clarity of this refactor.

@@ -200,7 +200,7 @@ class CompletionResponseChoice(BaseModel):
     index: int
     text: str
     logprobs: Optional[LogProbs] = None
-    finish_reason: Literal["stop", "length", "content_filter", "abort"]
+    finish_reason: Optional[Literal["stop", "length", "content_filter", "abort"]] = None


medium

Making finish_reason optional is an API contract change. Is it now possible for finish_reason to be None in some scenarios? If so, could you briefly explain these scenarios or point to where this change is handled downstream? This also applies to ChatCompletionResponseChoice.finish_reason on line 428.
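
For context on this question (an illustration, not confirmed by the PR): one common case where finish_reason is None is streaming, where every chunk before the last carries finish_reason=null; this also matches the separated usage and finish_reason chunks handled in the test updates above. Assumed chunk shapes:

chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "lo"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
    # With stream_options.include_usage, a trailing usage-only chunk follows,
    # carrying an empty choices list and therefore no finish_reason at all.
    {"choices": [], "usage": {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5}},
]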

Comment on lines 380 to 412

 @model_validator(mode="before")
 @classmethod
 def set_tool_choice_default(cls, values):
-    if isinstance(values, dict):
-        if values.get("tool_choice") is None:
-            if values.get("tools") is None:
-                values["tool_choice"] = "none"
-            else:
-                values["tool_choice"] = "auto"
+    if values.get("tool_choice") is None:
+        if values.get("tools") is None:
+            values["tool_choice"] = "none"
+        else:
+            values["tool_choice"] = "auto"
     return values


medium

The isinstance(values, dict) check was removed from set_tool_choice_default validator. This makes the validator slightly less robust if ChatCompletionRequest could be instantiated with non-dict-like data that Pydantic can't coerce to a dict before this validator runs. This could lead to an AttributeError on values.get().
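
If the guard is worth keeping, a minimal version of the validator with the check restored could look like this (field types simplified from the real protocol.py):

from typing import Any, List, Optional

from pydantic import BaseModel, model_validator


class ChatCompletionRequest(BaseModel):
    tools: Optional[List[Any]] = None
    tool_choice: Optional[Any] = None  # simplified; the real field is a union type

    @model_validator(mode="before")
    @classmethod
    def set_tool_choice_default(cls, values):
        # The isinstance guard avoids AttributeError when Pydantic hands this
        # "before" validator something other than a plain dict.
        if isinstance(values, dict) and values.get("tool_choice") is None:
            values["tool_choice"] = "none" if values.get("tools") is None else "auto"
        return values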

@@ -51,7 +51,7 @@ async def handle_request(
     )

 except Exception as e:
-    logger.error(f"Error in request: {e}")
+    logger.exception(f"Error in request: {e}")


medium

Using logger.exception here instead of logger.error is a good change as it will automatically include the stack trace in the log, which is very helpful for debugging unexpected errors.

Comment on lines 225 to 152

 if self.template_manager.chat_template_name is None:
     prompt, prompt_ids, image_data, audio_data, modalities, stop = (
         self._apply_jinja_template(request, tools, is_multimodal)
     )
 else:
-    prompt, image_data, audio_data, modalities, stop = (
+    prompt, prompt_ids, image_data, audio_data, modalities, stop = (
         self._apply_conversation_template(request)
     )


medium

The logic for choosing between Jinja template and conversation template application has been updated to use self.template_manager.chat_template_name. This is a good centralization of template-related state and logic into the new TemplateManager.

     )

-template_content_format = self._cached_template_format
+template_content_format = self.template_manager.jinja_template_content_format


medium

Using self.template_manager.jinja_template_content_format instead of detecting and caching the format within this class is a good simplification and aligns with the TemplateManager's role.

Comment on lines +331 to 255
stop = request.stop
image_data = image_data if image_data else None
audio_data = audio_data if audio_data else None
modalities = modalities if modalities else []
return prompt, prompt_ids, image_data, audio_data, modalities, stop


medium

The assignment stop = request.stop is a bit more direct than request.stop or []. If request.stop can be None, this change is fine as None is often a valid value for stop sequences (meaning use default or no specific stop sequences beyond EOS). The conditional initialization of image_data, audio_data, and modalities is also clearer.

Comment on lines +381 to +306
if not is_multimodal:
prompt_ids = self.tokenizer_manager.tokenizer.encode(prompt)

return prompt, prompt_ids, image_data, audio_data, modalities, stop


medium

Moving the tokenization of prompt (if not multimodal) into _apply_conversation_template from its previous location in _process_messages makes sense, as _apply_conversation_template is now responsible for returning prompt_ids as well.

Comment on lines 686 to 650
if request.chat_template_kwargs is not None:
enable_thinking = request.chat_template_kwargs.get(
"enable_thinking", True
)


medium

Checking if request.chat_template_kwargs is not None: before accessing request.chat_template_kwargs.get(...) is a good defensive check. The previous getattr(request, "chat_template_kwargs", {}).get(...) was also safe, but this is arguably more explicit and readable.

@CatherineSue (Collaborator, Author) replied:

The original issue is that the default is None; it will override {} and raise an error on None.get().
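
A two-line illustration of that pitfall (hypothetical request object):

class Req:  # attribute is present but set to None
    chat_template_kwargs = None

# getattr's default only applies when the attribute is *missing*, so this
# returns None, and the .get() call raises AttributeError.
getattr(Req, "chat_template_kwargs", {}).get("enable_thinking", True)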

Comment on lines +1 to 5
"""Template utilities for Jinja template processing.

This module provides utilities for analyzing and processing Jinja chat templates,
including content format detection and message processing.
"""


medium

The file rename from openai_api/utils.py to jinja_template_utils.py and the updated module docstring make the purpose of this module clearer, focusing specifically on Jinja template utilities.

CatherineSue pushed a commit to woodx9/sglang that referenced this pull request Jun 20, 2025
@CatherineSue CatherineSue force-pushed the chang/remove-adapter branch 2 times, most recently from fa46bfc to 019e5e1 on June 20, 2025 23:28
@CatherineSue CatherineSue marked this pull request as ready for review June 21, 2025 01:47
@slin1237 slin1237 self-requested a review June 21, 2025 01:51
CatherineSue and others added 11 commits June 21, 2025 02:20
- Replace all imports with sglang/srt/entrypoints/openai
- Remove batch and file endpoints in http_server.py
- Update openai-compatible endpoints in http_server.py
- Add TODO find a better way for the chat template management
- logger.error won't print detailed exception details
- prompt and prompt_ids were referenced before being created
- modalities data should be None in `_apply_jinja_template` otherwise there will be an error in GenerateReqInput.normalize
…utilities

* Create centralized TemplateManager class to eliminate global template state
  - Replace global chat_template_name/completion_template_name variables
  - Integrate TemplateManager into _GlobalState and inject into serving classes
  - Add proper type hints for TokenizerManager and TemplateManager parameters

* Extract and reorganize Jinja template utilities for better separation of concerns
  - Create jinja_template_utils.py with template detection and processing logic
  - Move utilities from openai/utils.py to avoid improper dependencies
  - Rename detect_template_content_format -> detect_jinja_template_content_format

* Optimize template content format detection for better performance
  - Detect format during template loading instead of on every request
  - Cache detected format in TemplateManager.jinja_template_content_format property
  - Add logging for template format detection results

* Clean up codebase and improve maintainability
  - Remove unused imports and clean up import organization
  - Simplify TemplateManager interface by removing unused is_initialized property
  - Update all serving classes (chat, completions, embedding) to use dependency injection
  - Improve code organization and eliminate architectural debt

Benefits:
- ✅ Eliminates global state pollution
- ✅ Better separation of concerns (generic vs OpenAI-specific utilities)
- ✅ Improved performance through caching
- ✅ Cleaner dependency injection pattern
- ✅ More testable and maintainable architecture
CatherineSue and others added 4 commits June 21, 2025 05:09
- The handling for a single string in a list can be removed as #7396 is merged.
- Add UT cases in test_openai_server for such case
…ontent

- Remove enable_thinking in the check condition when handling reasoning content:

enable_thinking is a flag that is only supported by Qwen3 in its chat_template. We pass this parameter through `self.tokenizer_manager.tokenizer.apply_chat_template`.

When handling reasoning content, as other models don't support enable_thinking, this flag should be removed from the check condition.

- Add back `_get_enable_thinking_from_request` as a util function, as some reasoning-related parsers or backends may need it in the future.
@woodx9 (Contributor) commented Jun 21, 2025

I tested the embedding, rerank, and score endpoints; all seem good.

- Correct name should be `detect_jinja_template_content_format`
Qwen3 should not assume it is always in reasoning mode, since Qwen3 supports an `enable_thinking=False` parameter in its chat_template; in that case, it won't generate thinking content.
…ing is False

- Introduce self._in_reasoning to handle the case when force_reasoning is False
We don't need to separate _in_reasoning and _force_in_reasoning
- If there is already a tool, the next token can be simply a tool_call_separator, before following with a `{`
@CatherineSue CatherineSue changed the title feat(oai refactor): Remove openai_api with entrypoints/openai feat(oai refactor): Replace openai_api with entrypoints/openai Jun 21, 2025
- Add `retrieve_model` in `http_server.py`. It retrieves a model's information and returns 404 if the model is not in the served model list.
- Add UT
@CatherineSue (Collaborator, Author) commented:

unittest-test-backend-4-gpu

Local test (result screenshots in the original comment):
- test_local_attn
- test_pp_single_node

@zhyncs zhyncs merged commit 72676cd into main Jun 21, 2025
67 of 91 checks passed
@zhyncs zhyncs deleted the chang/remove-adapter branch June 21, 2025 20:21
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
yilian49 pushed a commit to yilian49/sglang that referenced this pull request Jun 24, 2025
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
@taegeonum commented:

@CatherineSue Hello, may I ask why the files/batch APIs have been removed?
