Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (NeurIPS 2025)
Official implementation of Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as implicit 2D image priors, which are natively aligned with the MLLM’s CLIP visual encoder.
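The core idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (all names, shapes, and functions below are ours, not the repo's API): the MLLM branch predicts a 2D grid of patch-level CLIP latents in the same space as its own CLIP visual encoder, and the diffusion branch consumes that grid as an image-shaped conditioning signal.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
PATCH_GRID = 16   # e.g. a 16x16 grid of image patches
CLIP_DIM = 1024   # dimensionality of each patch-level CLIP latent

def mllm_predict_patch_latents(prompt: str, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for the MLLM branch: in the real model these latents would be
    predicted from the prompt; here we just sample a (grid, grid, dim) array."""
    return rng.standard_normal((PATCH_GRID, PATCH_GRID, CLIP_DIM)).astype(np.float32)

def diffusion_conditioning(latents: np.ndarray) -> np.ndarray:
    """Stand-in for handing the latents to the diffusion branch: reshape the
    grid to channel-first feature maps, the layout a conditioning branch of a
    diffusion U-Net/DiT would typically expect."""
    return np.transpose(latents, (2, 0, 1))  # (CLIP_DIM, PATCH_GRID, PATCH_GRID)

rng = np.random.default_rng(0)
latents = mllm_predict_patch_latents("a red cube on a blue sphere", rng)
cond = diffusion_conditioning(latents)
print(cond.shape)  # (1024, 16, 16)
```

The point of the design is that the latents live in a space the MLLM's frozen CLIP encoder already understands, so the bridge needs no new alignment from scratch.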
Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
```bash
conda create -n bifrost1 python==3.11
conda activate bifrost1
pip install -r requirements.txt
```

The model checkpoint can be downloaded from Hugging Face [here](https://huggingface.co/hanlincs/Bifrost-1). You can download it to a `local_dir` of your choice with:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hanlincs/Bifrost-1",
    repo_type="model",
    local_dir="xxxxxxxx",  # replace with your local directory
    local_dir_use_symlinks=False,
)
```
Generate images from GenEval prompts:

```bash
python inference_geneval_dpgbench.py --eval_geneval --output_dir "./outputs" --local_checkpoint_path XXXXX
```
🌟 Please let us know via issues or PRs if you have any questions. If you find our project useful in your research or application development, citing our paper is the best way to support us!
```bibtex
@inproceedings{linbifrost,
  title={Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents},
  author={Lin, Han and Cho, Jaemin and Zadeh, Amir and Li, Chuan and Bansal, Mohit},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}
```
The development of Bifrost-1 has been greatly inspired by the following amazing works and teams:
We hope that releasing this model/codebase helps the community to continue pushing these creative tools forward in an open and responsible way.
