
[RFC]: xDiT Video Generation API #1595

@happyandslow

Description


Summary

The goal of this API is to expose a scalable, distributed video generation service for xDiT using models such as CogVideoX, ConsisID, and Latte. The API allows clients to submit text prompts with optional generation parameters and receive video outputs either as a file path or base64-encoded data.

Motivation

The API is designed to enhance AIBrix's ability to generate images and videos, using xDiT as an example engine:

  • Support both disk-based and memory-based outputs.
  • Scale across multiple GPUs via Ray remote workers.
  • Provide a simple OpenAI-style endpoint for common video generation.
  • Allow future extensibility for multi-stage pipelines or DAG-based workflows.

Proposed Change

POST /generatevideo

Description: Generates a video from a text prompt.
Content-Type: application/json
Authentication: TBD (future optional API key support)

Request Model

{
  "prompt": "A fox running through a forest",
  "num_inference_steps": 50,
  "num_frames": 17,
  "seed": 42,
  "cfg": 7.5,
  "save_disk_path": "output/",
  "height": 1024,
  "width": 1024,
  "fps": 8
}

Example curl request with disk output

curl -X POST http://127.0.0.1:6000/generatevideo \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "A fox running through a forest",
           "num_frames": 17,
           "fps": 12,
           "save_disk_path": "output/"
         }'

Example curl request with base64 output (omit `save_disk_path` so the video is returned inline)

curl -X POST http://127.0.0.1:6000/generatevideo \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "A fox running through a forest",
           "num_frames": 17,
           "fps": 12
         }'
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `prompt` | string | N/A | Text prompt describing the video content. Required. |
| `num_inference_steps` | int | 50 | Number of denoising steps in the diffusion process. |
| `num_frames` | int | 17 | Number of frames in the video. |
| `seed` | int | 42 | RNG seed for reproducibility. |
| `cfg` | float | 7.5 | Guidance scale for the diffusion model. |
| `save_disk_path` | string | None | If provided, video is saved to this path; otherwise it is returned as base64. |
| `height` | int | 1024 | Video frame height in pixels. |
| `width` | int | 1024 | Video frame width in pixels. |
| `fps` | int | 8 | Frames per second of the output video. |
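For illustration, the request schema above could be modeled client-side as a plain dataclass; the field names and defaults mirror the table, while the class name `GenerateVideoRequest` is our own and not part of the API:

```python
from dataclasses import dataclass

@dataclass
class GenerateVideoRequest:
    """Hypothetical client-side model mirroring the /generatevideo request fields."""
    prompt: str                         # required: text describing the video content
    num_inference_steps: int = 50       # denoising steps in the diffusion process
    num_frames: int = 17                # number of frames in the video
    seed: int = 42                      # RNG seed for reproducibility
    cfg: float = 7.5                    # guidance scale for the diffusion model
    save_disk_path: str = None          # if set, save to this path; else return base64
    height: int = 1024                  # frame height in pixels
    width: int = 1024                   # frame width in pixels
    fps: int = 8                        # frames per second of the output video

# Only `prompt` is required; everything else falls back to the documented defaults.
req = GenerateVideoRequest(prompt="A fox running through a forest", fps=12)
```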

Response

  • Synchronous Response

    1. Response with path provided
    {
      "message": "Video generated successfully",
      "elapsed_time": "12.34 sec",
      "output": "output/generated_video_20250922-141500.mp4",
      "save_to_disk": true
    }
    
    2. Response without path provided
    {
      "message": "Video generated successfully",
      "elapsed_time": "12.34 sec",
      "output": "<base64-encoded-video>",
      "save_to_disk": false,
      "format": "mp4"
    }
    
  • Asynchronous Response

    • TBD
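Because the synchronous response shape depends on `save_to_disk`, a client has to branch on that flag. A minimal sketch, assuming the two response shapes above (the helper name `materialize_video` is our own; it takes the parsed JSON body, not a raw HTTP response):

```python
import base64
from pathlib import Path

def materialize_video(resp: dict, fallback_path: str = "generated.mp4") -> str:
    """Return a local file path for the video described by a /generatevideo response."""
    if resp.get("save_to_disk"):
        # The server already wrote the file; "output" is its path.
        return resp["output"]
    # Otherwise "output" holds the base64-encoded video payload.
    data = base64.b64decode(resp["output"])
    out = Path(fallback_path).with_suffix("." + resp.get("format", "mp4"))
    out.write_bytes(data)
    return str(out)
```

Either way the caller ends up with a playable file path, which keeps the disk-based and memory-based output modes interchangeable from the client's perspective.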

Error Handling

| HTTP Status | Condition | Response |
| --- | --- | --- |
| 400 | Missing or invalid prompt/parameters | `{"detail": "Prompt cannot be empty"}` |
| 500 | Model execution error | `{"detail": "Error generating video: <error_message>"}` |
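On the server side, the 400 case amounts to a pre-check before dispatching to the model. A hedged sketch of that validation (the empty-prompt message follows the table; the type check and its message are assumptions about behavior the RFC does not yet specify):

```python
from typing import Optional, Tuple

def validate_generate_request(body: dict) -> Tuple[int, Optional[dict]]:
    """Return (status, error_detail); (200, None) means the request may proceed."""
    prompt = body.get("prompt", "")
    if not isinstance(prompt, str) or not prompt.strip():
        # Mirrors the documented 400 response.
        return 400, {"detail": "Prompt cannot be empty"}
    # Assumed extra check: integer-typed parameters must actually be ints.
    for key in ("num_inference_steps", "num_frames", "seed", "height", "width", "fps"):
        if key in body and not isinstance(body[key], int):
            return 400, {"detail": f"Invalid value for '{key}': expected int"}
    return 200, None
```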

Alternatives Considered

ComfyUI style workflow request like:

POST /prompt
Content-Type: application/json

{
  "1": {
    "class_type": "TextPrompt",
    "inputs": {
      "text": "A fox running through a forest"
    }
  },
  "2": {
    "class_type": "CogVideoX",
    "inputs": {
      "prompt": ["1", 0],
      "duration": 5,
      "fps": 12,
      "width": 720,
      "height": 480,
      "seed": 123
    }
  }
}
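In this ComfyUI-style graph, an input like `["1", 0]` references output 0 of node `"1"`. A minimal sketch of how such references could be resolved (the two `class_type` handlers are stand-ins for illustration, not real ComfyUI nodes):

```python
def run_workflow(graph: dict) -> dict:
    """Evaluate a tiny ComfyUI-style node graph; returns each node's output tuple."""
    handlers = {
        # Stand-in nodes: TextPrompt emits its text; the video node just echoes
        # its resolved parameters instead of actually generating a video.
        "TextPrompt": lambda inputs: (inputs["text"],),
        "CogVideoX": lambda inputs: (inputs,),
    }
    results = {}

    def resolve(value):
        # A [node_id, output_index] pair is a reference to another node's output.
        if isinstance(value, list) and len(value) == 2 and isinstance(value[0], str):
            node_id, idx = value
            return evaluate(node_id)[idx]
        return value

    def evaluate(node_id):
        if node_id not in results:
            node = graph[node_id]
            inputs = {k: resolve(v) for k, v in node["inputs"].items()}
            results[node_id] = handlers[node["class_type"]](inputs)
        return results[node_id]

    for nid in graph:
        evaluate(nid)
    return results
```

The recursive `resolve` step is what turns the flat node dictionary into a DAG evaluation, which is the main extra machinery this alternative would require over the single-endpoint design above.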

This approach, however, would force ComfyUI to be deployed alongside AIBrix.

Deploying along with ComfyUI
ComfyUI could be used as a frontend client that connects to a third-party service API; see link1 and link2.

Currently, ComfyUI runs and manages models within a single instance (its model management code can be found here). To support a distributed environment, we would either need to update ComfyUI itself to support distributed serving (through a library like xfuser), or let ComfyUI call backend services through an API (an API node, see below), where the backend can deploy large models in a distributed fashion and/or run multiple replicas to serve requests. I found one project that attempted distributed ComfyUI here.

To adapt to ComfyUI, we could first expose this API as a service and forward ComfyUI's requests through a customized external API using ComfyUI's API node. It would look somewhat like the following:

Image
