
feat(oai refactor): Replace openai_api with entrypoints/openai #7351


Merged

zhyncs merged 36 commits into main from chang/remove-adapter on Jun 21, 2025

Conversation

@CatherineSue (Collaborator) commented Jun 19, 2025

Motivation

This PR aims to refactor and modernize the OpenAI API implementation in SGLang by removing the legacy openai_api module and consolidating it into the entrypoints/openai structure.

Modifications

1. Remove Legacy OpenAI API Module

  • Removed sglang/srt/openai_api directory (commit 5a7df078)
  • Replaced all imports with sglang/srt/entrypoints/openai
  • Removed batch and file endpoints from http_server.py
  • Updated OpenAI-compatible endpoints in http_server.py
  • Removed openai/api_server.py and conftest.py (commit d6d7351c)
  • Consolidated functionality into the new structure

2. Introduce Centralized Template Management (commit 7d343747)

  • Created TemplateManager class to eliminate global template state
  • Replaced global chat_template_name/completion_template_name variables
  • Integrated TemplateManager into _GlobalState with proper dependency injection
  • Added proper type hints for TokenizerManager and TemplateManager parameters
  • Extracted and reorganized Jinja template utilities
    • Created jinja_template_utils.py with template detection and processing logic (moved from utils.py)
  • Moved utilities from openai/utils.py to avoid improper dependencies
  • Renamed detect_template_content_format → detect_jinja_template_content_format
  • Optimized template content format detection
    • Detect format during template loading instead of on every request
    • Cache detected format in TemplateManager.jinja_template_content_format (a minimal sketch of this design follows this list)
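
To make the design concrete, here is a minimal sketch of the TemplateManager pattern described above. The class and attribute names (TemplateManager, chat_template_name, jinja_template_content_format, detect_jinja_template_content_format) come from this PR; the loading and detection internals below are simplified assumptions, not the actual implementation.

from typing import Optional


def detect_jinja_template_content_format(template: str) -> str:
    # Stand-in for the real detector in jinja_template_utils.py: report "openai"
    # when the template appears to iterate over structured content parts.
    return "openai" if "for content" in template else "string"


class TemplateManager:
    """Holds chat/completion template state that previously lived in module globals."""

    def __init__(self) -> None:
        self._chat_template_name: Optional[str] = None
        self._completion_template_name: Optional[str] = None
        self._jinja_template_content_format: Optional[str] = None

    @property
    def chat_template_name(self) -> Optional[str]:
        return self._chat_template_name

    @property
    def jinja_template_content_format(self) -> Optional[str]:
        # Cached once at load time, so request handlers never re-run detection.
        return self._jinja_template_content_format

    def load_chat_template(self, tokenizer, chat_template_name: Optional[str]) -> None:
        self._chat_template_name = chat_template_name
        if chat_template_name is None and getattr(tokenizer, "chat_template", None):
            # Detect the content format exactly once, during template loading.
            self._jinja_template_content_format = detect_jinja_template_content_format(
                tokenizer.chat_template
            )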

3. Error Fixes and Improvements

  • Fixed validation and error handling (commit b0c940cb)
  • Added validation_exception_handler to http_server.py. This enforces Content-Type: application/json in the request header (an OpenAI standard) and enables FastAPI to automatically decode the payload, removing the need to handle it manually in each endpoint (see the sketch after this list).
  • Fixed import and parameter errors (commits 90cc976e, 40a5c5a2)
  • Fixed is_multimodal not found error
  • Fixed enable_thinking parameter issues
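
As a rough illustration of the validation change (a sketch, not the actual http_server.py code): once the endpoints accept Pydantic request models, FastAPI decodes and validates the JSON body itself, and a RequestValidationError handler can translate malformed payloads into an OpenAI-style error instead of FastAPI's default 422 response. The error body shape below is an assumption.

from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse

app = FastAPI()


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
    """Map FastAPI/Pydantic validation failures to an OpenAI-style error body."""
    return JSONResponse(
        status_code=400,
        content={"error": {"message": str(exc), "type": "invalid_request_error"}},
    )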

4. Code Cleanup and Optimization

  • Cleaned up unused imports and code (commit f9add484)
  • Removed unused imports in entrypoints/openai
  • Removed unnecessary tests in test/srt/openai
  • Updated test timings in run_suite.py
  • Removed V1RerankReqInput duplication between openai.protocol and io_struct
  • Improved logging (commits c3be8638, a7d3b35c)
  • Use logger.exception for better error details
  • Skip embeddings in log_requests for more concise logs

5. Bug Fixes for Specific Tests

  • Fixed streaming response format (commit f9add484)
  • Updated test_openai_server.py to handle separated usage chunk and finish_reason chunk
  • Removed batch test in test_openai_server.py
  • Remove enable_thinking in reasoning content handling (commits e583fd9, b22eb4a…2dc0341)
    • The enable_thinking parameter was used as a condition for reasoning content parsing: `if reasoning_parser and request.separate_reasoning and enable_thinking:` (see adapter.py, serving_chat.py).
    • Some models like DeepSeek do not support the enable_thinking parameter.
    • This parameter is only supported by Qwen3 in its chat_template.
    • Our logic set this to True unless it was explicitly set through chat_template_kwargs. However, this confuses developers, as not all models support enable_thinking.
    • Suggestion:
      • The server should not use enable_thinking as a condition for reasoning content parsing since it’s specific to Qwen3’s chat template, not a general OpenAI-compatible parameter.
      • It’s more appropriate to treat it as a sampling parameter, controlled entirely by the chat template.
      • ReasoningParser has force_thinking=True, which causes models like Qwen3 to treat even the initial user message as part of the reasoning, leading to unexpected behavior.
      • For models like DeepSeek-R1, the `<think>` tag was removed in their templates, so enable_thinking is irrelevant.
      • This PR changed force_thinking=True to False in Qwen3Detector and fixed the logic in BaseReasoningFormatDetector so it can handle the case when force_thinking=False (a simplified sketch follows this list).
      • Test: test_reasoning_content currently passes.
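
A simplified sketch of the detector change described above. The class names (BaseReasoningFormatDetector, Qwen3Detector) and the force_reasoning/_in_reasoning flags come from this PR's commits; the tag names and the non-streaming parse logic are illustrative assumptions, not the real implementation.

class BaseReasoningFormatDetector:
    def __init__(self, think_start="<think>", think_end="</think>", force_reasoning=False):
        self.think_start = think_start
        self.think_end = think_end
        # With force_reasoning=False, we only enter reasoning mode after actually
        # seeing the start tag, instead of assuming output begins as reasoning.
        self._in_reasoning = force_reasoning

    def parse(self, text: str):
        """Return (reasoning_content, normal_content) for a complete output."""
        if not self._in_reasoning and text.lstrip().startswith(self.think_start):
            self._in_reasoning = True
            text = text.lstrip()[len(self.think_start):]
        if self._in_reasoning and self.think_end in text:
            reasoning, normal = text.split(self.think_end, 1)
            self._in_reasoning = False
            return reasoning.strip(), normal.strip()
        return (text, "") if self._in_reasoning else ("", text)


class Qwen3Detector(BaseReasoningFormatDetector):
    def __init__(self):
        # Changed from force_reasoning=True: with enable_thinking=False in its
        # chat_template, Qwen3 emits no <think> block at all.
        super().__init__(force_reasoning=False)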

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the OpenAI-compatible API serving infrastructure and template management within the system. The primary goal is to enhance modularity, improve code organization, and centralize template handling, leading to a more maintainable and extensible codebase.

Highlights

  • API Module Reorganization: The core OpenAI API serving logic has been moved from sglang.srt.openai_api to sglang.srt.entrypoints.openai, establishing a clearer separation of concerns and improving code structure.
  • Centralized Template Management: A new TemplateManager class is introduced to consolidate the loading and management of both chat and completion templates, replacing scattered logic and global state for improved modularity.
  • Dedicated API Handlers: The HTTP server now utilizes specialized OpenAIServing classes for different OpenAI API endpoints (e.g., chat, completions, embeddings, rerank, score), promoting a more object-oriented approach to API request handling.
  • Expanded API Functionality: New OpenAI-compatible endpoints for reranking (/v1/rerank) and scoring (/v1/score) are added, leveraging the new serving class architecture.
  • Streamlined Protocol Definitions: Minor adjustments are made to the OpenAI protocol definitions, such as making finish_reason and dimensions optional, for increased flexibility and robustness.
  • Removal of Batch API: Existing batch API endpoints (/v1/files, /v1/batches) and their associated test cases have been removed, simplifying the API surface.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request primarily refactors the OpenAI API integration by moving code from an openai_api directory to entrypoints/openai and introducing a TemplateManager for better organization. It also adds new serving classes for different OpenAI-compatible endpoints like rerank and score.

Overall, the refactoring appears to be a positive step towards better code organization and maintainability. My main concerns revolve around potential API contract changes in protocol.py and the removal of batch file processing functionality, which should be clarified if it's an intended change for this PR.

Key Changes & Observations:

  • Refactoring: Successfully moved OpenAI related logic to the entrypoints/openai path.
  • TemplateManager: Introduced TemplateManager to centralize chat and completion template loading and management, which is a good improvement over global state or scattered logic.
  • New Serving Classes: Added dedicated classes (OpenAIServingChat, OpenAIServingCompletion, OpenAIServingEmbedding, OpenAIServingRerank, OpenAIServingScore) to handle specific OpenAI endpoints. This enhances modularity.
  • Endpoint Removal: The /v1/files and /v1/batches endpoints, along with their associated functionality and tests, have been removed. This is a significant functional change and needs confirmation if it's within the scope of this refactor.
  • Protocol Changes:
    • finish_reason in CompletionResponseChoice and ChatCompletionResponseChoice is now Optional.
    • ChatCompletionRequest's validate_messages_not_empty validator was removed.
    • ChatCompletionRequest's set_tool_choice_default validator was simplified.
    • A new RerankResponse class was added.
  • Logging: Improved error logging in serving_base.py by using logger.exception.
  • Docstrings: Added a more detailed module docstring to python/sglang/srt/conversation.py.

Recommendations:

  • Clarify if the removal of batch file processing APIs (/v1/files, /v1/batches) is an intended part of this refactoring PR.
  • Review the API contract changes in protocol.py to ensure they are intentional and to understand any potential impact on clients.
  • Consider re-adding Pydantic-level validation for non-empty messages in ChatCompletionRequest if it's a strict requirement.

The changes generally improve the codebase structure. Addressing the points above will help ensure the quality and clarity of this refactor.

@@ -200,7 +200,7 @@ class CompletionResponseChoice(BaseModel):
     index: int
     text: str
     logprobs: Optional[LogProbs] = None
-    finish_reason: Literal["stop", "length", "content_filter", "abort"]
+    finish_reason: Optional[Literal["stop", "length", "content_filter", "abort"]] = None


medium

Making finish_reason optional is an API contract change. Is it now possible for finish_reason to be None in some scenarios? If so, could you briefly explain these scenarios or point to where this change is handled downstream? This also applies to ChatCompletionResponseChoice.finish_reason on line 428.
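
For context on this question (an illustration, not confirmed by the PR): one common case where finish_reason is None is streaming, where every chunk before the last carries finish_reason=null; this also matches the separated usage and finish_reason chunks handled in the test updates above. Assumed chunk shapes:

chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "lo"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
    # With stream_options.include_usage, a trailing usage-only chunk follows,
    # carrying an empty choices list and therefore no finish_reason at all.
    {"choices": [], "usage": {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5}},
]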

Comment on lines 380 to 412

 @model_validator(mode="before")
 @classmethod
 def set_tool_choice_default(cls, values):
-    if isinstance(values, dict):
-        if values.get("tool_choice") is None:
-            if values.get("tools") is None:
-                values["tool_choice"] = "none"
-            else:
-                values["tool_choice"] = "auto"
+    if values.get("tool_choice") is None:
+        if values.get("tools") is None:
+            values["tool_choice"] = "none"
+        else:
+            values["tool_choice"] = "auto"
     return values


medium

The isinstance(values, dict) check was removed from set_tool_choice_default validator. This makes the validator slightly less robust if ChatCompletionRequest could be instantiated with non-dict-like data that Pydantic can't coerce to a dict before this validator runs. This could lead to an AttributeError on values.get().
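
If the guard is worth keeping, a minimal version of the validator with the check restored could look like this (field types simplified from the real protocol.py):

from typing import Any, List, Optional

from pydantic import BaseModel, model_validator


class ChatCompletionRequest(BaseModel):
    tools: Optional[List[Any]] = None
    tool_choice: Optional[Any] = None  # simplified; the real field is a union type

    @model_validator(mode="before")
    @classmethod
    def set_tool_choice_default(cls, values):
        # The isinstance guard avoids AttributeError when Pydantic hands this
        # "before" validator something other than a plain dict.
        if isinstance(values, dict) and values.get("tool_choice") is None:
            values["tool_choice"] = "none" if values.get("tools") is None else "auto"
        return values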

@@ -51,7 +51,7 @@ async def handle_request(
     )

 except Exception as e:
-    logger.error(f"Error in request: {e}")
+    logger.exception(f"Error in request: {e}")


medium

Using logger.exception here instead of logger.error is a good change as it will automatically include the stack trace in the log, which is very helpful for debugging unexpected errors.

Comment on lines 225 to 152

 if self.template_manager.chat_template_name is None:
     prompt, prompt_ids, image_data, audio_data, modalities, stop = (
         self._apply_jinja_template(request, tools, is_multimodal)
     )
 else:
-    prompt, image_data, audio_data, modalities, stop = (
+    prompt, prompt_ids, image_data, audio_data, modalities, stop = (
         self._apply_conversation_template(request)
     )


medium

The logic for choosing between Jinja template and conversation template application has been updated to use self.template_manager.chat_template_name. This is a good centralization of template-related state and logic into the new TemplateManager.

     )

-template_content_format = self._cached_template_format
+template_content_format = self.template_manager.jinja_template_content_format


medium

Using self.template_manager.jinja_template_content_format instead of detecting and caching the format within this class is a good simplification and aligns with the TemplateManager's role.

Comment on lines +331 to 255
stop = request.stop
image_data = image_data if image_data else None
audio_data = audio_data if audio_data else None
modalities = modalities if modalities else []
return prompt, prompt_ids, image_data, audio_data, modalities, stop


medium

The assignment stop = request.stop is a bit more direct than request.stop or []. If request.stop can be None, this change is fine as None is often a valid value for stop sequences (meaning use default or no specific stop sequences beyond EOS). The conditional initialization of image_data, audio_data, and modalities is also clearer.

Comment on lines +381 to +306
if not is_multimodal:
prompt_ids = self.tokenizer_manager.tokenizer.encode(prompt)

return prompt, prompt_ids, image_data, audio_data, modalities, stop


medium

Moving the tokenization of prompt (if not multimodal) into _apply_conversation_template from its previous location in _process_messages makes sense, as _apply_conversation_template is now responsible for returning prompt_ids as well.

Comment on lines 686 to 650
if request.chat_template_kwargs is not None:
enable_thinking = request.chat_template_kwargs.get(
"enable_thinking", True
)


medium

Checking if request.chat_template_kwargs is not None: before accessing request.chat_template_kwargs.get(...) is a good defensive check. The previous getattr(request, "chat_template_kwargs", {}).get(...) was also safe, but this is arguably more explicit and readable.

@CatherineSue (Collaborator, Author) replied:

The original issue is that the default is None; it will override {} and raise an error on None.get().
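
A two-line illustration of that pitfall (hypothetical request object):

class Req:  # attribute is present but set to None
    chat_template_kwargs = None

# getattr's default only applies when the attribute is *missing*, so this
# returns None, and the .get() call raises AttributeError.
getattr(Req, "chat_template_kwargs", {}).get("enable_thinking", True)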

Comment on lines +1 to 5
"""Template utilities for Jinja template processing.

This module provides utilities for analyzing and processing Jinja chat templates,
including content format detection and message processing.
"""


medium

The file rename from openai_api/utils.py to jinja_template_utils.py and the updated module docstring make the purpose of this module clearer, focusing specifically on Jinja template utilities.

CatherineSue pushed a commit to woodx9/sglang that referenced this pull request Jun 20, 2025
@CatherineSue CatherineSue force-pushed the chang/remove-adapter branch 2 times, most recently from fa46bfc to 019e5e1 on June 20, 2025 23:28
@CatherineSue CatherineSue marked this pull request as ready for review June 21, 2025 01:47
@slin1237 slin1237 self-requested a review June 21, 2025 01:51
CatherineSue and others added 11 commits June 21, 2025 02:20
- Replace all imports with sglang/srt/entrypoints/openai
- Remove batch and file endpoints in http_server.py
- Update openai-compatible endpoints in http_server.py
- Add TODO find a better way for the chat template management
- logger.error won't print detailed exception details
- prompt and prompt_ids were referenced before being created
- modalities data should be None in `_apply_jinja_template` otherwise there will be an error in GenerateReqInput.normalize
…utilities

* Create centralized TemplateManager class to eliminate global template state
  - Replace global chat_template_name/completion_template_name variables
  - Integrate TemplateManager into _GlobalState and inject into serving classes
  - Add proper type hints for TokenizerManager and TemplateManager parameters

* Extract and reorganize Jinja template utilities for better separation of concerns
  - Create jinja_template_utils.py with template detection and processing logic
  - Move utilities from openai/utils.py to avoid improper dependencies
  - Rename detect_template_content_format -> detect_jinja_template_content_format

* Optimize template content format detection for better performance
  - Detect format during template loading instead of on every request
  - Cache detected format in TemplateManager.jinja_template_content_format property
  - Add logging for template format detection results

* Clean up codebase and improve maintainability
  - Remove unused imports and clean up import organization
  - Simplify TemplateManager interface by removing unused is_initialized property
  - Update all serving classes (chat, completions, embedding) to use dependency injection
  - Improve code organization and eliminate architectural debt

Benefits:
- ✅ Eliminates global state pollution
- ✅ Better separation of concerns (generic vs OpenAI-specific utilities)
- ✅ Improved performance through caching
- ✅ Cleaner dependency injection pattern
- ✅ More testable and maintainable architecture
CatherineSue and others added 4 commits June 21, 2025 05:09
- The handling for a single string in a list can be removed as #7396 is merged.
- Add UT cases in test_openai_server for such case
…ontent

- Remove enable_thinking in the check condition when handling reasoning content:

enable_thinking is a flag that is only supported by Qwen3 in its chat_template. We pass this parameter through `self.tokenizer_manager.tokenizer.apply_chat_template`.

When handling reasoning content, as other models don't support enable_thinking, this flag should be removed from the check condition.

- Add back `_get_enable_thinking_from_request` as a util function, as some reasoning-related parsers or backends may need it in the future.
@woodx9 (Contributor) commented Jun 21, 2025

I tested the embedding, rerank, and score endpoints; all seem good.

- Correct name should be `detect_jinja_template_content_format`
Qwen3 should not assume it is always in reasoning mode, since Qwen3 supports an `enable_thinking=False` parameter in its chat_template; in that case, it won't generate thinking content.
…ing is False

- Introduce self._in_reasoning to handle the case when force_reasoning is False
We don't need to separate _in_reasoning and _force_in_reasoning
- If there is already a tool, the next token can be simply a tool_call_separator, before following with a `{`
@CatherineSue CatherineSue changed the title feat(oai refactor): Remove openai_api with entrypoints/openai feat(oai refactor): Replace openai_api with entrypoints/openai Jun 21, 2025
- Add `retrieve_model` in `http_server.py`. It retrieves a model's information and returns 404 if the model is not in the served model list.
- Add UT
@CatherineSue (Collaborator, Author) commented:

unittest-test-backend-4-gpu

Local test (result screenshots in the original comment):
- test_local_attn
- test_pp_single_node

@zhyncs zhyncs merged commit 72676cd into main Jun 21, 2025
67 of 91 checks passed
@zhyncs zhyncs deleted the chang/remove-adapter branch June 21, 2025 20:21
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
yilian49 pushed a commit to yilian49/sglang that referenced this pull request Jun 24, 2025
whybeyoung added a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
@taegeonum commented:

@CatherineSue Hello, may I ask why the files/batch APIs have been removed?
