Paper | Website | HuggingFace
Cosmos-Reason1 is a suite of models, ontologies, and benchmarks that we develop with the goal of enabling multimodal LLMs to generate physically grounded responses. We release one multimodal LLMs: Cosmos-Reason1-7B post-trained with Physical AI SFT, and Physical AI reinforcement learning. We define ontologies for physical common sense and embodied reasoning, and also build benchmarks to evaluate Physical AI reasoning capabilities of multimodal LLMs.
- 2025-06-11: We enhance the model’s capability on judging the physical plausibility of a video. See this tutorial for details.
- 2025-05-17: We release model weights and training data under Hugging Face.
NOTE: We suggest using
fps=4
for the input video andmax_tokens=4096
to avoid truncated response.
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
# You can also replace the MODEL_PATH by a safetensors folder path mentioned above
MODEL_PATH = "nvidia/Cosmos-Reason1-7B"
llm = LLM(
model=MODEL_PATH,
limit_mm_per_prompt={"image": 10, "video": 10},
)
sampling_params = SamplingParams(
temperature=0.6,
top_p=0.95,
repetition_penalty=1.05,
max_tokens=4096,
)
video_messages = [
{"role": "system", "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."},
{"role": "user", "content": [
{"type": "text", "text": (
"Is it safe to turn right?"
)
},
{
"type": "video",
"video": "assets/sample.mp4",
"fps": 4,
}
]
},
]
# Here we use video messages as a demonstration
messages = video_messages
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
mm_data = {}
if image_inputs is not None:
mm_data["image"] = image_inputs
if video_inputs is not None:
mm_data["video"] = video_inputs
llm_inputs = {
"prompt": prompt,
"multi_modal_data": mm_data,
# FPS will be returned in video_kwargs
"mm_processor_kwargs": video_kwargs,
}
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
Using Cosmos-Reason1 as Video Critic for Rejection Sampling.
The nvidia-cosmos/cosmos-rl repository is an async post-training framework specialized for Supervised Fine-Tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). It prioritizes performance, scalability, and fault tolerance.
For detailed instructions on running SFT and PPO training, please refer to our User Guide.
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
NVIDIA Cosmos source code is released under the Apache 2 License.
NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact [email protected].