
[RFC]: xDiT Video Generation API #1595

@happyandslow

Description


Summary

The goal of this API is to expose a scalable, distributed video generation service for xDiT using models such as CogVideoX, ConsisID, and Latte. The API allows clients to submit text prompts with optional generation parameters and receive video outputs either as a file path or base64-encoded data.

Motivation

The API is designed to enhance AIBrix's ability to generate images and videos, using xDiT as an example engine:

  • Support both disk-based and memory-based outputs.
  • Scale across multiple GPUs via Ray remote workers.
  • Provide a simple OpenAI-style endpoint for common video generation.
  • Allow future extensibility for multi-stage pipelines or DAG-based workflows.

Proposed Change

POST /generatevideo

Description: Generates a video from a text prompt.
Content-Type: application/json
Authentication: TBD (future optional API key support)

Request Model

{
  "prompt": "A fox running through a forest",
  "num_inference_steps": 50,
  "num_frames": 17,
  "seed": 42,
  "cfg": 7.5,
  "save_disk_path": "output/",
  "height": 1024,
  "width": 1024,
  "fps": 8
}

Example curl request with disk output

curl -X POST http://127.0.0.1:6000/generatevideo \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "A fox running through a forest",
           "num_frames": 17,
           "fps": 12,
           "save_disk_path": "output/"
         }'

Example curl request with base64 output (omit `save_disk_path` so the video is returned inline)

curl -X POST http://127.0.0.1:6000/generatevideo \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "A fox running through a forest",
           "num_frames": 17,
           "fps": 12
         }'
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `prompt` | string | N/A | Text prompt describing the video content. Required. |
| `num_inference_steps` | int | 50 | Number of denoising steps in the diffusion process. |
| `num_frames` | int | 17 | Number of frames in the video. |
| `seed` | int | 42 | RNG seed for reproducibility. |
| `cfg` | float | 7.5 | Guidance scale for the diffusion model. |
| `save_disk_path` | string | None | If provided, video is saved to this path; otherwise it is returned as base64. |
| `height` | int | 1024 | Video frame height in pixels. |
| `width` | int | 1024 | Video frame width in pixels. |
| `fps` | int | 8 | Frames per second of the output video. |
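For illustration, the request schema above could be modeled client-side as a plain dataclass; the field names and defaults mirror the table, while the class name `GenerateVideoRequest` is our own and not part of the API:

```python
from dataclasses import dataclass

@dataclass
class GenerateVideoRequest:
    """Hypothetical client-side model mirroring the /generatevideo request fields."""
    prompt: str                         # required: text describing the video content
    num_inference_steps: int = 50       # denoising steps in the diffusion process
    num_frames: int = 17                # number of frames in the video
    seed: int = 42                      # RNG seed for reproducibility
    cfg: float = 7.5                    # guidance scale for the diffusion model
    save_disk_path: str = None          # if set, save to this path; else return base64
    height: int = 1024                  # frame height in pixels
    width: int = 1024                   # frame width in pixels
    fps: int = 8                        # frames per second of the output video

# Only `prompt` is required; everything else falls back to the documented defaults.
req = GenerateVideoRequest(prompt="A fox running through a forest", fps=12)
```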

Response

  • Synchronous Response

    1. Response with path provided
    {
      "message": "Video generated successfully",
      "elapsed_time": "12.34 sec",
      "output": "output/generated_video_20250922-141500.mp4",
      "save_to_disk": true
    }
    
    2. Response without path provided
    {
      "message": "Video generated successfully",
      "elapsed_time": "12.34 sec",
      "output": "<base64-encoded-video>",
      "save_to_disk": false,
      "format": "mp4"
    }
    
  • Asynchronous Response

    • TBD
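Because the synchronous response shape depends on `save_to_disk`, a client has to branch on that flag. A minimal sketch, assuming the two response shapes above (the helper name `materialize_video` is our own; it takes the parsed JSON body, not a raw HTTP response):

```python
import base64
from pathlib import Path

def materialize_video(resp: dict, fallback_path: str = "generated.mp4") -> str:
    """Return a local file path for the video described by a /generatevideo response."""
    if resp.get("save_to_disk"):
        # The server already wrote the file; "output" is its path.
        return resp["output"]
    # Otherwise "output" holds the base64-encoded video payload.
    data = base64.b64decode(resp["output"])
    out = Path(fallback_path).with_suffix("." + resp.get("format", "mp4"))
    out.write_bytes(data)
    return str(out)
```

Either way the caller ends up with a playable file path, which keeps the disk-based and memory-based output modes interchangeable from the client's perspective.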

Error Handling

| HTTP Status | Condition | Response |
| --- | --- | --- |
| 400 | Missing or invalid prompt/parameters | `{"detail": "Prompt cannot be empty"}` |
| 500 | Model execution error | `{"detail": "Error generating video: <error_message>"}` |
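On the server side, the 400 case amounts to a pre-check before dispatching to the model. A hedged sketch of that validation (the empty-prompt message follows the table; the type check and its message are assumptions about behavior the RFC does not yet specify):

```python
from typing import Optional, Tuple

def validate_generate_request(body: dict) -> Tuple[int, Optional[dict]]:
    """Return (status, error_detail); (200, None) means the request may proceed."""
    prompt = body.get("prompt", "")
    if not isinstance(prompt, str) or not prompt.strip():
        # Mirrors the documented 400 response.
        return 400, {"detail": "Prompt cannot be empty"}
    # Assumed extra check: integer-typed parameters must actually be ints.
    for key in ("num_inference_steps", "num_frames", "seed", "height", "width", "fps"):
        if key in body and not isinstance(body[key], int):
            return 400, {"detail": f"Invalid value for '{key}': expected int"}
    return 200, None
```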

Alternatives Considered

ComfyUI style workflow request like:

POST /prompt
Content-Type: application/json

{
  "1": {
    "class_type": "TextPrompt",
    "inputs": {
      "text": "A fox running through a forest"
    }
  },
  "2": {
    "class_type": "CogVideoX",
    "inputs": {
      "prompt": ["1", 0],
      "duration": 5,
      "fps": 12,
      "width": 720,
      "height": 480,
      "seed": 123
    }
  }
}
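In this ComfyUI-style graph, an input like `["1", 0]` references output 0 of node `"1"`. A minimal sketch of how such references could be resolved (the two `class_type` handlers are stand-ins for illustration, not real ComfyUI nodes):

```python
def run_workflow(graph: dict) -> dict:
    """Evaluate a tiny ComfyUI-style node graph; returns each node's output tuple."""
    handlers = {
        # Stand-in nodes: TextPrompt emits its text; the video node just echoes
        # its resolved parameters instead of actually generating a video.
        "TextPrompt": lambda inputs: (inputs["text"],),
        "CogVideoX": lambda inputs: (inputs,),
    }
    results = {}

    def resolve(value):
        # A [node_id, output_index] pair is a reference to another node's output.
        if isinstance(value, list) and len(value) == 2 and isinstance(value[0], str):
            node_id, idx = value
            return evaluate(node_id)[idx]
        return value

    def evaluate(node_id):
        if node_id not in results:
            node = graph[node_id]
            inputs = {k: resolve(v) for k, v in node["inputs"].items()}
            results[node_id] = handlers[node["class_type"]](inputs)
        return results[node_id]

    for nid in graph:
        evaluate(nid)
    return results
```

The recursive `resolve` step is what turns the flat node dictionary into a DAG evaluation, which is the main extra machinery this alternative would require over the single-endpoint design above.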

This approach, however, would force ComfyUI to be deployed alongside AIBrix.

Deploying along with ComfyUI
ComfyUI could be used as a frontend client that connects to a third-party service API; see link1 and link2.

Currently, ComfyUI runs and manages models within a single instance (its model management code can be found here). To support a distributed environment, we would either need to update ComfyUI itself to support distributed serving (through a library like xfuser), or let ComfyUI call backend services through an API (an API node, see below), where the backend can deploy large models in a distributed fashion and/or run multiple replicas to serve requests. I found one project that attempted distributed ComfyUI here.

To adapt to ComfyUI, we could first expose this API as a service and forward ComfyUI's requests through a customized external API using ComfyUI's API node. It would look somewhat like the following:

Image
