This pipeline provides a two-step process to augment robotic videos using Cosmos-Transfer1-7B. It leverages spatial-temporal control to modify backgrounds while preserving the shape and/or appearance of the robot foreground.
We propose two augmentation settings (a sketch of how these weights combine per pixel follows the list):

- Setting 1 (default):
  - Foreground Controls: Edge, Vis
  - Background Controls: Segmentation
  - Weights: `w_edge(FG) = 1`, `w_vis(FG) = 1`, `w_seg(BG) = 1`; all other weights = 0
- Setting 2:
  - Foreground Controls: Edge
  - Background Controls: Segmentation
  - Weights: `w_edge(FG) = 1`, `w_seg(BG) = 1`; all other weights = 0
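For intuition, below is a minimal sketch of how a per-pixel weight map for one control modality could be composed from a binary robot-foreground mask. The settings table and the helper `compose_weight_map` are illustrative assumptions, not the actual internals of the pipeline:

```python
import numpy as np

# Illustrative weight table: modality -> (foreground weight, background weight).
# Mirrors the two settings above; not necessarily the script's representation.
SETTINGS = {
    "setting1": {"edge": (1.0, 0.0), "vis": (1.0, 0.0), "seg": (0.0, 1.0)},
    "setting2": {"edge": (1.0, 0.0), "seg": (0.0, 1.0)},
}

def compose_weight_map(fg_mask: np.ndarray, fg_w: float, bg_w: float) -> np.ndarray:
    """Blend foreground and background weights into one H x W map.

    fg_mask is binary: 1 on robot pixels, 0 on background pixels.
    """
    return fg_w * fg_mask + bg_w * (1.0 - fg_mask)

# Example: per-frame weight maps for every modality under setting1.
fg_mask = np.zeros((480, 640), dtype=np.float32)  # stand-in robot mask
weight_maps = {
    modality: compose_weight_map(fg_mask, fg_w, bg_w)
    for modality, (fg_w, bg_w) in SETTINGS["setting1"].items()
}
```

Repeating this per frame yields the spatial-temporal weight matrix for each modality.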
This script extracts foreground (robot) and background information from semantic segmentation data. It processes per-frame segmentation masks and color-to-class mappings to generate spatial-temporal weight matrices for each control modality, based on the selected setting. It requires the following inputs:

- A `segmentation` folder containing per-frame segmentation masks in PNG format
- A `segmentation_label` folder containing color-to-class mapping JSON files for each frame, for example:

  ```json
  {
    "(29, 0, 0, 255)": {
      "class": "gripper0_right_r_palm_vis"
    },
    "(31, 0, 0, 255)": {
      "class": "gripper0_right_R_thumb_proximal_base_link_vis"
    },
    "(33, 0, 0, 255)": {
      "class": "gripper0_right_R_thumb_proximal_link_vis"
    }
  }
  ```

- An input video file
An example input directory is provided under `assets/robot_augmentation_example/`. A sketch of the mask-extraction step follows.
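To make the extraction concrete, here is a minimal sketch of how a per-frame PNG mask and its JSON label file could be combined into a binary robot-foreground mask. The color-key parsing and the function name `robot_mask_from_frame` are assumptions for illustration; the authoritative logic lives in `spatial_temporal_weight.py`:

```python
import json
from ast import literal_eval

import numpy as np
from PIL import Image

def robot_mask_from_frame(mask_png, label_json,
                          robot_keywords=("world_robot", "gripper", "robot")):
    """Return an H x W binary mask: 1 on robot pixels, 0 elsewhere."""
    mask = np.array(Image.open(mask_png).convert("RGBA"))
    with open(label_json) as f:
        color_to_class = json.load(f)

    robot = np.zeros(mask.shape[:2], dtype=np.float32)
    for color_key, info in color_to_class.items():
        # Keys look like "(29, 0, 0, 255)"; any class whose name contains
        # a robot keyword counts as foreground.
        if any(kw in info["class"] for kw in robot_keywords):
            rgba = np.array(literal_eval(color_key), dtype=mask.dtype)
            robot[np.all(mask == rgba, axis=-1)] = 1.0
    return robot
```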
```bash
PYTHONPATH=$(pwd) python cosmos_transfer1/auxiliary/robot_augmentation/spatial_temporal_weight.py \
    --setting setting1 \
    --robot-keywords world_robot gripper robot \
    --input-dir assets/robot_augmentation_example \
    --output-dir outputs/robot_augmentation_example
```
- `--setting`: Weight setting to use (choices: `setting1`, `setting2`; default: `setting1`)
  - `setting1`: emphasizes the robot in visual and edge features (vis: 1.0 foreground, edge: 1.0 foreground, seg: 1.0 background)
  - `setting2`: emphasizes the robot only in edge features (edge: 1.0 foreground, seg: 1.0 background)
- `--input-dir`: Input directory containing example folders (default: `assets/robot_augmentation_example`)
- `--output-dir`: Output directory for weight matrices (default: `outputs/robot_augmentation_example`)
- `--robot-keywords`: Keywords used to identify robot classes (default: `["world_robot", "gripper", "robot"]`); any semantic class containing one of these keywords is treated as robot foreground (see the matching sketch below)
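The substring-matching rule is simple; here is a minimal illustration (the function name `is_robot_class` is hypothetical):

```python
def is_robot_class(name: str, keywords=("world_robot", "gripper", "robot")) -> bool:
    # Substring match: "gripper0_right_r_palm_vis" qualifies because it
    # contains "gripper"; "table_surface" matches no keyword.
    return any(kw in name for kw in keywords)

assert is_robot_class("gripper0_right_r_palm_vis")
assert not is_robot_class("table_surface")
```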
Use the generated spatial-temporal weight matrices to perform video augmentation with the proper controls.
```bash
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=1}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 \
    cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/robot_example_spatial_temporal_setting1 \
    --controlnet_specs assets/robot_augmentation_example/example1/inference_cosmos_transfer1_robot_spatiotemporal_weights.json \
    --offload_text_encoder_model \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU
```

Augmented videos are saved in `outputs/robot_example_spatial_temporal_setting1/`.
Input video: `input_video.mp4`
You can run the pipeline multiple times with different prompts (e.g., `assets/robot_augmentation_example/example1/example1_prompts.json`) to obtain different augmentation results. A sketch of one way to automate such a prompt sweep follows.
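This hedged sketch shows one way to sweep prompts. It assumes `example1_prompts.json` maps result names to prompt strings and that the controlnet spec JSON carries a top-level `prompt` field that `transfer.py` reads; verify both file formats in the repo before relying on it:

```python
import json
import os
import subprocess
import tempfile

SPEC = "assets/robot_augmentation_example/example1/inference_cosmos_transfer1_robot_spatiotemporal_weights.json"
PROMPTS = "assets/robot_augmentation_example/example1/example1_prompts.json"

with open(SPEC) as f:
    spec = json.load(f)
with open(PROMPTS) as f:
    prompts = json.load(f)  # assumed format: {"name": "prompt text", ...}

for name, prompt in prompts.items():
    spec["prompt"] = prompt  # assumed top-level field read by transfer.py
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
        json.dump(spec, tmp)
        spec_path = tmp.name
    subprocess.run(
        [
            "torchrun", "--nproc_per_node=1",
            "cosmos_transfer1/diffusion/inference/transfer.py",
            "--checkpoint_dir", "./checkpoints",
            "--video_save_folder", f"outputs/robot_example_{name}",
            "--controlnet_specs", spec_path,
            "--offload_text_encoder_model",
            "--offload_guardrail_models",
            "--num_gpus", "1",
        ],
        check=True,
        env={**os.environ, "PYTHONPATH": os.getcwd()},
    )
```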