Xichen Pan Satya Narayan Shukla† Aashu Singh Zhuokai Zhao Shlok Kumar Mishra Jialiang Wang Zhiyang Xu Jiuhai Chen Kunpeng Li Felix Juefei-Xu Ji Hou† Saining Xie†
conda env create -f environment.yml
conda activate metaquery
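After activating the environment, you can optionally run a quick sanity check that PyTorch sees your GPUs (a simple check we suggest here, not part of the repo's scripts):
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"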
If you want to train the model on a single node, you can use the following command. `run_name` is the name that appears in the checkpoint path and wandb. `config_file` is the path to the yaml file that contains the training configs; you can find the provided configs here. If you want to specify the configs directly in the command line, you can also skip the `--config_file` argument. `base_dir` is the path to the directory where you wish to save data and checkpoints.
OMP_NUM_THREADS=12 torchrun --nproc-per-node=8 train.py \
--run_name test \
--config_file llavaov0p5_sana.yaml \
--base_dir /path/to/metaquery
Tip: To speed up data downloading, you can first run the following command to download the data in parallel (e.g., with 64 threads), then switch to the regular training command above.
OMP_NUM_THREADS=64 torchrun --nproc-per-node=1 train.py \
--run_name test \
--config_file llavaov0p5_sana.yaml \
--base_dir /path/to/metaquery
Note: For text-to-image pretraining, we only provide the code for cc12m, since it can be loaded directly with the `datasets` package. Training on this dataset alone cannot guarantee the same performance as reported in the paper. To use other datasets, you will need to modify the code to support them; for example, you can try loading the BLIP3o dataset for better performance.
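As a rough illustration of what "loaded directly with the `datasets` package" means, the snippet below streams cc12m from the Hugging Face Hub. The Hub id `pixparse/cc12m-wds` and the field names are assumptions; the actual data loading lives in the training code.

```python
# Minimal sketch: streaming cc12m with the Hugging Face `datasets` package.
# The Hub id "pixparse/cc12m-wds" is an assumption; substitute whichever
# cc12m mirror the training code expects.
from datasets import load_dataset

ds = load_dataset("pixparse/cc12m-wds", split="train", streaming=True)

for sample in ds.take(2):
    # Each sample should contain an image and its caption; the exact field
    # names (e.g., "jpg" / "txt") depend on the mirror.
    print(sample.keys())
```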
If you wish to train the model on multiple nodes, we also provide a sample SLURM script here for reference.
For the editing and instruction tuning training, you may also need to specify the `--resume_from_checkpoint` argument to resume from the previous checkpoint.
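For example, an instruction tuning run resuming from a pretraining checkpoint could look like the following (the config file name and checkpoint path are placeholders; substitute the ones from your own setup):
OMP_NUM_THREADS=12 torchrun --nproc-per-node=8 train.py \
--run_name test_instruct \
--config_file your_instruct_config.yaml \
--base_dir /path/to/metaquery \
--resume_from_checkpoint /path/to/previous/checkpoint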
When you have the checkpoint ready, you can run the following command to start the demo:
python app.py --checkpoint_path /path/to/checkpoint
For evaluation, please follow the instructions here.
In this work, we collect an instruction-tuning dataset, MetaQuery-Instruct-2.4M. We group images from web corpora based on caption similarity, then construct instruction-tuning data from these image pairs using an MLLM.
We provide the dataset curation code here for reference. The dataset is curated from mmc4.
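As a rough illustration of the caption-similarity grouping step (this is not the released curation code; the embedding model and similarity threshold below are assumptions for demonstration only):

```python
# Illustrative sketch of grouping images by caption similarity.
# NOT the released curation code: the embedding model and the 0.8
# threshold are assumptions for demonstration only.
from sentence_transformers import SentenceTransformer

captions = [
    "A golden retriever playing in the snow",
    "A golden retriever puppy running through snow",
    "A red sports car parked by the beach",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(captions, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between all caption pairs (embeddings are normalized).
sim = emb @ emb.T

# Keep image pairs whose captions are similar enough; an MLLM then turns
# each pair into an instruction-tuning example.
pairs = [
    (i, j)
    for i in range(len(captions))
    for j in range(i + 1, len(captions))
    if sim[i, j] > 0.8
]
print(pairs)  # e.g., [(0, 1)]
```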
After tuning on the MetaQuery-Instruct-2.4M dataset, the model achieves impressive zero-shot subject-driven generation performance (first row) and surprisingly unlocks novel capabilities, such as visual association and logo design, that go beyond copy-pasting (second row).
With a frozen MLLM and flexible MetaQueries, we can train state-of-the-art unified multimodal understanding and generation models as easily as fine-tuning a diffusion model.
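Schematically, the idea looks like the sketch below: learnable query tokens are appended to the frozen MLLM's input, and their output hidden states are projected by a trainable connector into the diffusion decoder's conditioning space. Module names, shapes, and the linear connector are illustrative assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class MetaQuerySketch(nn.Module):
    """Schematic sketch only; not the repo's implementation."""

    def __init__(self, mllm, hidden=896, cond_dim=1152, num_queries=64):
        super().__init__()
        self.mllm = mllm  # frozen multimodal LLM
        for p in self.mllm.parameters():
            p.requires_grad_(False)
        # Learnable query tokens appended to the MLLM input.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden))
        # Trainable connector into the diffusion model's conditioning space.
        self.connector = nn.Linear(hidden, cond_dim)

    def forward(self, prompt_embeds):
        # prompt_embeds: (batch, seq, hidden) multimodal prompt embeddings
        q = self.queries.expand(prompt_embeds.size(0), -1, -1)
        h = self.mllm(
            inputs_embeds=torch.cat([prompt_embeds, q], dim=1)
        ).last_hidden_state
        # Hidden states at the query positions condition the diffusion decoder.
        return self.connector(h[:, -q.size(1):])
```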
Methods | Base (M)LLM | MME-P | MMB | SEED | MMMU | MM-Vet | COCO FID ↓ | MJHQ FID ↓ | GenEval ↑ | DPG-Bench ↑ |
---|---|---|---|---|---|---|---|---|---|---|
Show-o-512 | Phi-1.5 1.3B | 1097.2 | - | - | 26.7 | - | 9.24 | 15.18 | 0.68 | - |
Emu3 | From Scratch 7B | - | 58.5 | 68.2 | 31.6 | 37.2 | 12.80 | - | 0.66† | 80.60 |
MetaMorph | LLaMA-3 8B | - | 75.2 | 71.8 | - | - | 11.8 | - | - | - |
Transfusion | From Scratch 7B | - | - | - | - | - | 8.70 | - | 0.63 | - |
LMFusion | LLaVA-Next 8B | 1603.7 | 72.1 | 72.5 | 41.7 | - | 8.20 | - | - | - |
Janus-Pro-1B | DeepSeek-LLM 1.5B | 1444.0 | 75.5 | 68.3 | 36.3 | 39.8 | - | 14.33‡ | 0.73 | 82.63 |
Janus-Pro-7B | DeepSeek-LLM 7B | 1567.1 | 79.2 | 72.1 | 41.0 | 50.0 | - | 13.48‡ | 0.80 | 84.19 |
MetaQuery-B | LLaVA-ov 0.5B | 1238.0 | 58.5 | 66.6 | 31.4 | 29.1 | 8.91 | 6.28 | 0.74† | 80.04 |
MetaQuery-L | Qwen2.5-VL 3B | 1574.3 | 78.6 | 73.8 | 53.1 | 63.2 | 8.87 | 6.35 | 0.78† | 81.10 |
MetaQuery-XL | Qwen2.5-VL 7B | 1685.2 | 83.5 | 76.9 | 58.6 | 66.6 | 8.69 | 6.02 | 0.80† | 82.05 |
† denotes rewritten prompts. ‡ denotes results tested by us under the same settings. We report the COCO FID with frozen Stable Diffusion v1.5, and other metrics with fine-tuned Sana 1.6B. Best results are shown in bold.
The data is licensed CC-by-NC. Third-party content pulled from other locations is subject to its own licenses, and you may have other legal obligations or restrictions that govern your use of that content.
The MetaQuery dataset is also released under ODC-BY and the Common Crawl terms of use, since it is sourced from mmc4.
If you find MetaQuery useful for your research and applications, please cite using this BibTeX:
@article{pan2025transfer,
title={Transfer between Modalities with MetaQueries},
author={Pan, Xichen and Shukla, Satya Narayan and Singh, Aashu and Zhao, Zhuokai and Mishra, Shlok Kumar and Wang, Jialiang and Xu, Zhiyang and Chen, Jiuhai and Li, Kunpeng and Juefei-Xu, Felix and Hou, Ji and Xie, Saining},
journal={arXiv preprint arXiv:2504.06256},
year={2025}
}