In this document, we provide examples and steps to:
- Build your own Cosmos-Transfer1 models, training from scratch; or
- Post-train Cosmos-Transfer1 models from our checkpoint using your data.
The model is trained separately for each control input type.
We support the following Cosmos-Transfer1 models for pre-training and post-training. Review the available models and their compute requirements for training to determine the best model for your use case. We use Tensor Parallel of size 8 for training.
| Model Name | Model Status | Compute Requirements for Post-Training |
|---|---|---|
| Cosmos-Transfer1-7B [Depth] | Supported | 8 NVIDIA GPUs* |
| Cosmos-Transfer1-7B [Edge] | Supported | 8 NVIDIA GPUs* |
| Cosmos-Transfer1-7B [Keypoint] | Supported | 8 NVIDIA GPUs* |
| Cosmos-Transfer1-7B [Segmentation] | Supported | 8 NVIDIA GPUs* |
| Cosmos-Transfer1-7B [Vis] | Supported | 8 NVIDIA GPUs* |
* 80GB GPU memory required for training. H100-80GB or A100-80GB GPUs are recommended.
Please refer to the training section of INSTALL.md for instructions on environment setup.
- Generate a Hugging Face access token. Set the access token to 'Read' permission (default is 'Fine-grained').
- Log in to Hugging Face with the access token:
  huggingface-cli login
- Accept the LlamaGuard-7b terms.
- Download the Cosmos model weights from Hugging Face. Note that this requires about 300GB of free storage.
  PYTHONPATH=$(pwd) python scripts/download_checkpoints.py --output_dir checkpoints/
- The downloaded files should be in the following structure.
checkpoints/
├── nvidia
│ ├── Cosmos-Transfer1-7B
│ │ ├── base_model.pt
│ │ ├── vis_control.pt
│ │ ├── edge_control.pt
│ │ ├── edge_control_distilled.pt
│ │ ├── seg_control.pt
│ │ ├── depth_control.pt
│ │ ├── keypoint_control.pt
│ │ ├── 4kupscaler_control.pt
│ │ ├── config.json
│ │ └── guardrail
│ │ ├── aegis/
│ │ ├── blocklist/
│ │ ├── face_blur_filter/
│ │ └── video_content_safety_filter/
│ │
│ ├── Cosmos-Transfer1-7B-Sample-AV/
│ │ ├── base_model.pt
│ │ ├── hdmap_control.pt
│ │ └── lidar_control.pt
│ │
│ ├── Cosmos-Tokenize1-CV8x8x8-720p
│ │ ├── decoder.jit
│ │ ├── encoder.jit
│ │ ├── autoencoder.jit
│ │ └── mean_std.pt
│ │
│ └── Cosmos-UpsamplePrompt1-12B-Transfer
│ ├── depth
│ │ ├── consolidated.safetensors
│ │ ├── params.json
│ │ └── tekken.json
│ ├── README.md
│ ├── segmentation
│ │ ├── consolidated.safetensors
│ │ ├── params.json
│ │ └── tekken.json
│ ├── seg_upsampler_example.png
│ └── viscontrol
│ ├── consolidated.safetensors
│ ├── params.json
│ └── tekken.json
│
├── depth-anything/...
├── facebook/...
├── google-t5/...
└── IDEA-Research/
Checkpoint Requirements:
- Base model (base_model.pt) and tokenizer models (under Cosmos-Tokenize1-CV8x8x8-720p): Required for all training.
- Control modality-specific model checkpoint (e.g., seg_control.pt): Only needed for post-training that specific control. Not needed if training from scratch.
- Other folders such as depth-anything, facebook/sam2-hiera-large, etc.: Optional. These are helper modules to process the video data into the respective control modalities such as depth and segmentation.
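The checkpoint requirements above can be checked before launching a job. The following is a minimal sketch, not part of the repository: the helper names and the short modality keys (`vis`, `edge`, `seg`, `depth`, `keypoint`) are assumptions made for illustration, while the file names come from the directory tree shown earlier.

```python
from pathlib import Path

# File names are taken from the checkpoint tree above.
# The short modality keys are illustrative assumptions.
CONTROL_CHECKPOINTS = {
    "vis": "vis_control.pt",
    "edge": "edge_control.pt",
    "seg": "seg_control.pt",
    "depth": "depth_control.pt",
    "keypoint": "keypoint_control.pt",
}
MODEL_DIR = Path("nvidia/Cosmos-Transfer1-7B")
TOKENIZER_DIR = Path("nvidia/Cosmos-Tokenize1-CV8x8x8-720p")


def required_checkpoints(modality: str, post_train: bool) -> list[Path]:
    """Return the paths (relative to the checkpoints/ root) a run needs."""
    if modality not in CONTROL_CHECKPOINTS:
        raise ValueError(f"unknown control modality: {modality}")
    # The base model and the tokenizer are required for all training.
    required = [MODEL_DIR / "base_model.pt", TOKENIZER_DIR]
    # The control checkpoint is only needed when post-training that control.
    if post_train:
        required.append(MODEL_DIR / CONTROL_CHECKPOINTS[modality])
    return required


def missing_checkpoints(root: str, modality: str, post_train: bool) -> list[Path]:
    """List the required paths that do not exist under `root`."""
    return [
        path
        for path in required_checkpoints(modality, post_train)
        if not (Path(root) / path).exists()
    ]
```

For example, `missing_checkpoints("checkpoints", "seg", post_train=True)` returns an empty list once the base model, the tokenizer folder, and seg_control.pt are all in place.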
There are 3 steps to train a Cosmos-Transfer1 model: preparing a dataset, preparing checkpoints, and launching training.
In the example below, we use a subset of HD-VILA-100M dataset to demonstrate the steps for preparing the data and launching training. After preprocessing, your dataset directory should be structured as follows:
datasets/hdvila/
├── metas/
│ ├── *.json
│ ├── *.txt
├── videos/
│ ├── *.mp4
├── t5_xxl/
│ ├── *.pickle
├── keypoint/
│ ├── *.pickle
├── depth/
│ ├── *.mp4
├── seg/
│ ├── *.pickle
└── <your control input modality>/
├── <your files>
File naming must be consistent across modalities. For example, to train a SegControl model with a video named videos/example1.mp4, the corresponding annotation file should be seg/example1.pickle.
Note: Only the folder corresponding to your chosen control input modality is required. For example, if you're training with depth as the control input, only the depth/ subfolder is needed.
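The naming rule can be verified with a short script before training. This is a sketch under the assumption that the dataset follows the layout above; `find_unpaired_videos` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path


def find_unpaired_videos(dataset_root: str, modality: str, extension: str) -> list[str]:
    """Return the stems of videos that lack a matching control file.

    For example, videos/example1.mp4 pairs with seg/example1.pickle when
    modality="seg" and extension=".pickle".
    """
    root = Path(dataset_root)
    video_stems = {path.stem for path in (root / "videos").glob("*.mp4")}
    control_stems = {path.stem for path in (root / modality).glob(f"*{extension}")}
    # Any video stem without a control file of the same stem is unpaired.
    return sorted(video_stems - control_stems)
```

Calling `find_unpaired_videos("datasets/hdvila", "depth", ".mp4")` lists the videos that still need a depth video. An empty list means every video has its control input.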
The first step is to prepare a dataset with videos and captions. You must provide a folder containing a collection of videos in MP4 format, preferably 720p. These videos should focus on the subject throughout the entire video so that each video chunk contains the subject.
Here we use a subset of sample videos from HD-VILA-100M as an example:
# Download metadata with video urls and captions
mkdir -p datasets/hdvila
cd datasets/hdvila
wget https://huggingface.co/datasets/TempoFunk/hdvila-100M/resolve/main/hdvila-100M.jsonl
# Return to the repository root so that the scripts/ paths below resolve
cd ../..
Run the following command to download the sample videos used for training:
# Requirements for YouTube video downloads & video clipping
pip install pytubefix ffmpeg
# The script will download the original HD-VILA-100M videos and save the corresponding clips, captions, and metadata.
PYTHONPATH=$(pwd) python scripts/download_diffusion_example_data.py --dataset_path datasets/hdvila --N_videos 128 --do_download --do_clip
Run the following command to pre-compute T5-XXL embeddings for the video captions used for training:
# The script will read the captions, save the T5-XXL embeddings in pickle format.
PYTHONPATH=$(pwd) python scripts/get_t5_embeddings.py --dataset_path datasets/hdvila
Next, we generate the control input data corresponding to each video. If you already have accurate control input data (e.g., ground truth depth, segmentation masks, or human keypoints), you can skip this step. Just ensure your files are organized in the above structure and follow the data format detailed in Process Control Input Data.
Here, as an example, we show how to obtain the control input signals from the input RGB videos. Specifically:
- DepthControl requires a depth video that is frame-wise aligned with the corresponding RGB video. This can be obtained by, for example, running DepthAnythingV2 on the input videos.
- SegControl requires a .pickle file in the SAM2 output format containing per-frame segmentation masks. See Process Control Input Data for detailed format requirements.
- KeypointControl requires a .pickle file containing 2D human keypoint annotations for each frame. See Process Control Input Data for detailed format requirements.
For VisControl and EdgeControl models, training is self-supervised. These models derive their control inputs from the input videos on the fly during training (e.g., by applying blur or extracting Canny edges). Therefore, you do not need to prepare control input data separately for these modalities.
Due to the large model size, we leverage TensorParallel (TP) to split the model weights across multiple GPUs. We use 8 for the TP size.
# Will split the Base model checkpoint into 8 TP checkpoints
PYTHONPATH=. python scripts/convert_ckpt_fsdp_to_tp.py checkpoints/nvidia/Cosmos-Transfer1-7B/base_model.pt
# Example: split the EdgeControl checkpoint for post-training
PYTHONPATH=. python scripts/convert_ckpt_fsdp_to_tp.py checkpoints/nvidia/Cosmos-Transfer1-7B/edge_control.pt
This will generate the TP checkpoints under checkpoints/checkpoints_tp/*_mp_*.pt, which we load in the training below.
As a sanity check, run the following command to dry-run an example training job with the above data. The command will generate the full configuration of the experiment.
export OUTPUT_ROOT=checkpoints # default value
# Training from scratch
torchrun --nproc_per_node=1 -m cosmos_transfer1.diffusion.training.train --dryrun --config=cosmos_transfer1/diffusion/config/config_train.py -- experiment=CTRL_7Bv1pt3_lvg_tp_121frames_control_input_edge_block3_pretrain
# Post-train from our provided checkpoint (need to first split checkpoint into TP checkpoints as instructed above)
torchrun --nproc_per_node=1 -m cosmos_transfer1.diffusion.training.train --dryrun --config=cosmos_transfer1/diffusion/config/config_train.py -- experiment=CTRL_7Bv1pt3_lvg_tp_121frames_control_input_edge_block3_posttrain
Explanation of the command:
- The trainer and the passed (master) config script will, in the background, load the detailed experiment configurations defined in cosmos_transfer1/diffusion/config/training/experiment/ctrl_7b_tp_121frames.py, and register the experiment configurations for all hint_keys (control modalities), covering both pretrain and post-train. We use Hydra for advanced configuration composition and overriding.
- CTRL_7Bv1pt3_lvg_tp_121frames_control_input_edge_block3_pretrain corresponds to an experiment name registered in ctrl_7b_tp_121frames.py. By specifying this name, all the detailed config will be generated and then written to checkpoints/cosmos_transfer1_pretrain/CTRL_7Bv1_lvg/CTRL_7Bv1pt3_lvg_tp_121frames_control_input_edge_block3_pretrain/config.yaml.
- To customize your training, see cosmos_transfer1/diffusion/config/training/experiment/ctrl_7b_tp_121frames.py to understand how the detailed configs of the model, trainer, dataloader, etc. are defined, and edit as needed.
Now we can start a real training job! Removing --dryrun and setting --nproc_per_node=8 starts a real training job on 8 GPUs:
torchrun --nproc_per_node=8 -m cosmos_transfer1.diffusion.training.train --config=cosmos_transfer1/diffusion/config/config_train.py -- experiment=CTRL_7Bv1pt3_lvg_tp_121frames_control_input_edge_block3_pretrain
Config group and override. An experiment determines a complete group of configuration parameters (model architecture, data, trainer behavior, checkpointing, etc.). Changing the experiment value in the command above determines which ControlNet model is trained, and whether it is pre-trained or post-trained. For example, replacing the experiment name in the command with CTRL_7Bv1pt3_lvg_tp_121frames_control_input_depth_block3_posttrain will post-train the DepthControl model from the downloaded checkpoint instead.
To customize your training, see the job (experiment) config in cosmos_transfer1/diffusion/config/training/experiment/ctrl_7b_tp_121frames.py to understand how they are defined, and edit as needed.
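The experiment names quoted in this document share one pattern, which the sketch below composes from a control modality and a training stage. The helper is hypothetical. Only the `edge`, `depth`, and `seg` keys appear in the names above, and `vis` and `keypoint` are assumed to follow the same pattern, so confirm the registered names in ctrl_7b_tp_121frames.py before relying on them.

```python
# Keys that appear in experiment names quoted in this document.
KNOWN_HINT_KEYS = ("edge", "depth", "seg")
# Assumption: these modalities follow the same naming pattern.
ASSUMED_HINT_KEYS = ("vis", "keypoint")
STAGES = ("pretrain", "posttrain")


def experiment_name(hint_key: str, stage: str) -> str:
    """Compose an experiment name from a control modality and a stage."""
    if hint_key not in KNOWN_HINT_KEYS + ASSUMED_HINT_KEYS:
        raise ValueError(f"unknown hint key: {hint_key}")
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got: {stage}")
    return f"CTRL_7Bv1pt3_lvg_tp_121frames_control_input_{hint_key}_block3_{stage}"
```

The result is the value passed as `experiment=...` on the command line.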
It is also possible to modify config parameters from the command line. For example:
torchrun --nproc_per_node=8 -m cosmos_transfer1.diffusion.training.train --config=cosmos_transfer1/diffusion/config/config_train.py -- experiment=CTRL_7Bv1pt3_lvg_tp_121frames_control_input_edge_block3_pretrain trainer.max_iter=100 checkpoint.save_iter=40
This will update the maximum training iterations to 100 (default in the registered experiments: 999999999) and the checkpoint saving frequency to every 40 iterations (default: 1000).
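When launching many jobs from a script, the command and its overrides can be assembled programmatically. The sketch below only builds the argument list shown above. `build_train_command` is a hypothetical helper, and the override keys are the ones used in this document.

```python
import shlex
from typing import Optional


def build_train_command(
    experiment: str,
    nproc_per_node: int = 8,
    dry_run: bool = False,
    overrides: Optional[dict] = None,
) -> list[str]:
    """Assemble the torchrun invocation as an argument list."""
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "-m",
        "cosmos_transfer1.diffusion.training.train",
    ]
    if dry_run:
        cmd.append("--dryrun")
    cmd += [
        "--config=cosmos_transfer1/diffusion/config/config_train.py",
        "--",
        f"experiment={experiment}",
    ]
    # Overrides follow the experiment name as key=value pairs.
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_train_command(
    "CTRL_7Bv1pt3_lvg_tp_121frames_control_input_edge_block3_pretrain",
    overrides={"trainer.max_iter": 100, "checkpoint.save_iter": 40},
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run to launch the job
```

Building a list rather than a single string avoids shell quoting problems when an override value contains spaces or special characters.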
Saving Checkpoints and Resuming Training. During training, the checkpoints will be saved in the structure below. Since we use TensorParallel across 8 GPUs, 8 checkpoints will be saved each time.
checkpoints/cosmos_transfer1_pretrain/CTRL_7Bv1_lvg/CTRL_7Bv1pt3_lvg_tp_121frames_control_input_edge_block3_pretrain/checkpoints/
├── iter_{NUMBER}.pt # "master" checkpoint, saving metadata only
├── iter_{NUMBER}_model_mp_0.pt # real TP checkpoints
├── iter_{NUMBER}_model_mp_1.pt
├── ...
└── iter_{NUMBER}_model_mp_7.pt
Since the experiment is uniquely associated with its checkpoint directory, rerunning the same training command after an unexpected interruption will automatically resume from the latest saved checkpoint.
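To check which iteration a rerun could resume from, you can scan the checkpoint directory for the most recent iteration whose master file and all TP shards are present. This is an illustrative sketch based only on the file layout above. It is not the trainer's resume logic, and `latest_complete_iteration` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the "master" checkpoint only, e.g. iter_000000100.pt.
MASTER_PATTERN = re.compile(r"^iter_(\d+)\.pt$")


def latest_complete_iteration(checkpoint_dir: str, tp_size: int = 8) -> Optional[int]:
    """Return the highest iteration with a master file and all TP shards."""
    directory = Path(checkpoint_dir)
    complete = []
    for path in directory.glob("iter_*.pt"):
        match = MASTER_PATTERN.match(path.name)
        if match is None:
            continue  # skip the per-rank *_model_mp_*.pt shards
        prefix = path.name[: -len(".pt")]
        shards = [directory / f"{prefix}_model_mp_{rank}.pt" for rank in range(tp_size)]
        if all(shard.exists() for shard in shards):
            complete.append(int(match.group(1)))
    return max(complete, default=None)
```

The function returns `None` when no complete checkpoint exists, for example when a job was interrupted before its first save.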
Converting the TP checkpoints to FSDP checkpoint: To convert Tensor Parallel (TP) checkpoints to Fully Sharded Data Parallel (FSDP) format, use the conversion script convert_ckpt_tp_to_fsdp.py. This script requires the same number of GPUs as your TP size (e.g., if you trained with TP_SIZE=8, you need 8 GPUs for conversion).
Example usage:
torchrun --nproc_per_node=8 convert_ckpt_tp_to_fsdp.py \
--experiment CTRL_7Bv1pt3_lvg_tp_121frames_control_input_seg_block3_posttrain \
--checkpoint-path checkpoints/cosmos_transfer1_posttrain/CTRL_7Bv1_lvg/CTRL_7Bv1pt3_lvg_tp_121frames_control_input_seg_block3_posttrain/checkpoints/iter_000000100.pt
Optional arguments:
- --output-directory: Custom directory for saving FSDP checkpoints (default: automatically generated from checkpoint path)
- --include-base-model: Include base model in ControlNet checkpoint (default: False)
The script will create two files in the output directory:
- *_reg_model.pt: Regular model checkpoint
- *_ema_model.pt: EMA model checkpoint
The EMA model checkpoint (*_ema_model.pt) typically produces higher-quality results and is recommended for running inference in the next stage. For more details about the conversion process and available options, refer to the script's docstring.
Run inference: Follow the steps in the inference README.
Q1: What if I want to use my own control input type? How should I modify the code?
A1: Modify the following scripts:
- Add new condition in:
  - cosmos_transfer1/diffusion/conditioner.py
  - cosmos_transfer1/diffusion/config/transfer/conditioner.py
- Add data augmentor function in cosmos_transfer1/diffusion/datasets/augmentors/control_input.py
- Add new hint key in:
  - cosmos_transfer1/diffusion/inference/inference_utils.py
  - cosmos_transfer1/diffusion/inference/world_generation_pipeline.py
- If needed, add preprocessor in cosmos_transfer1/auxiliary/ and update cosmos_transfer1/diffusion/inference/preprocessors.py.