Xiaofeng Han · Shunpeng Chen · Zenghuang Fu · Zhe Feng · Lue Fan · Dong An · Changwei Wang · Li Guo · Weiliang Meng* · Xiaopeng Zhang · Rongtao Xu* · Shibiao Xu
This repository tracks research on multimodal fusion and vision–language models (VLMs) for robot vision, covering semantic scene understanding, 3D perception, SLAM, navigation & localization, and manipulation. We also summarize datasets, metrics, challenges (e.g., cross-modal alignment, efficient fusion, real-time deployment), and future directions.
Give this repo a ⭐ if you find it useful. We will continue to track relevant progress and update this repository.
[2025-09-03] Our survey has been accepted by Information Fusion (Vol. 126, 2026), and the camera-ready version is now available. DOI: 10.1016/j.inffus.2025.103652.
[2025-09-01] Featured by Embodied Intelligence Hub — article recap of our survey and repo. Read the post »
The overview figure illustrates the overall framework of multimodal fusion and VLMs for robot vision:
| Dataset | Scene | Multimodal Data | Venue | Year |
|---|---|---|---|---|
| 360+x | Indoor/Outdoor | Video/Audio | CVPR | 2024 |
| ScanQA | Indoor | RGB/Text | CVPR | 2022 |
| Hypersim | Indoor | RGB/Depth | ICCV | 2021 |
| nuScenes | Urban street | RGB/Lidar/Radar | CVPR | 2020 |
| Waymo | Outdoor | RGB/Lidar | CVPR | 2020 |
| SemanticKITTI | Urban street | RGB/Lidar | ICCV | 2019 |
| Matterport3D | Indoor | RGB/Depth | arXiv | 2017 |
| ScanNet | Indoor | RGB/Depth | CVPR | 2017 |
| Cityscapes | Urban street | RGB/Depth | CVPR | 2016 |
| NYUDv2 | Indoor | RGB/Depth | ECCV | 2012 |
| Dataset | Core Modalities | Data Scale | Main Application |
|---|---|---|---|
| DROID | RGB, Depth, Text | 76,000 trajectories | Multi-task scene adaptation |
| R2SGrasp | RGB-D, Point Cloud | 64,000 RGB-D images | Grasp detection |
| RT-1 | RGB, Text | 130,000 trajectories | Real-time task control |
| Touch and Go | RGB, Tactile | 3,971 real-world objects, 13,900 tactile interactions | Cross-modal perception |
| VisGel | GelSight Tactile, RGB | 12,000 tactile interactions | Tactile-enhanced manipulation |
| ObjectFolder 2.0 | RGB, Audio, Tactile | 1,000 virtual object models | Virtual-to-reality transfer |
| Grasp-Anything-6D | Point Cloud, Text | 1M point cloud scenes | Language-driven grasping |
| Grasp-Anything++ | Point Cloud, Text | 1M samples, 10M instructions | Fine-grained manipulation |
| Open X-Embodiment | RGB, Depth, Text, Multi-robot Data | Aggregated data from multiple institutions | Cross-robot system generalization |
- Grasp-Anything++: Language-driven Grasp Detection (CVPR, 2024) [paper]
- Grasp-Anything-6D: Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance (ECCV, 2024) [paper]
- Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection (CoRL, 2024) [paper]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models (ICRA, 2024) [paper]
- DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (ICRA, 2024) [paper]
- ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer (CVPR, 2022) [paper]
- Touch and Go: Learning from Human-Collected Vision and Touch (NeurIPS, 2022) [paper]
- Connecting Touch and Vision via Cross-Modal Prediction (CVPR, 2019) [paper]
- RT-1: Robotics Transformer for Real-World Control at Scale (arXiv, 2022) [paper]
| Dataset | Modalities | Unique Feature |
|---|---|---|
| Matterport3D | RGB-D, Semantic Annotations | Foundational dataset for navigation |
| R2R | RGB-D, Natural Language | Vision-and-Language Navigation |
| REVERIE | RGB-D, Object Annotations | Combines object grounding tasks |
| CVDN | RGB-D, Dialog | Introduces multi-turn interactions |
| SOON | RGB-D, Natural Language | Coarse-to-fine target localization |
| R3ED | Point Cloud, Object Labels | Real-world sensor-based data |
| Title | Venue | Date |
|---|---|---|
| **Manipulation in Home Environment** | | |
| HomeRobot: Open-Vocabulary Mobile Manipulation | CoRL 2023 | 2023-06-20 |
| ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks | CVPR 2020 | 2019-12-03 |
| **Manipulation in On-Table Environment** | | |
| OBSBench: Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning | NeurIPS 2024 | 2024-02-04 |
| LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | NeurIPS 2023 | 2023-06-05 |
| Benchmark | Simulator | # Tasks | Real-World Reproducibility | Applicable Algorithms | Key Evaluation Metrics |
|---|---|---|---|---|---|
| RLBench | RLBench | 100+ | ✗ | RL, IL, Traditional Control | Task Success Rate, Trajectory Efficiency, Task Completion Time |
| GemBench | RLBench | 44 | ✗ | RL, IL, VLM-based | Zero-shot Task Success, Object Recognition, Generalization |
| VLMbench | RLBench | 8 | ✗ | RL, VLM-based | Task Execution Success, Compositional Generalization |
| KitchenShift | Isaac Sim | 7 | ✗ | IL, RL | Performance Under Domain Shifts, Task Success Rate |
| CALVIN | PyBullet | 34 | ✗ | RL, IL | Long-Horizon Task Success, Multi-Task Adaptability |
| COLOSSEUM | RLBench | 20 | ✓ | RL, IL | Robustness to Perturbations, Multi-Task Learning Performance |
| VIMA | Ravens | 17 | ✗ | RL, VLM-based | Zero-shot Success, Multi-Modal Task Performance |
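Several metrics in the table above (task success rate, zero-shot success) reduce to the fraction of evaluation episodes that end in success. A minimal sketch in Python — the function name and the episode list are illustrative only, not part of any benchmark's API:

```python
def task_success_rate(episode_outcomes):
    """Fraction of evaluation episodes that ended in success.

    episode_outcomes: iterable of booleans, True if the episode succeeded.
    Returns 0.0 for an empty evaluation set.
    """
    outcomes = list(episode_outcomes)
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# Hypothetical results for 4 rollouts of one task: 3 successes, 1 failure.
rate = task_success_rate([True, False, True, True])  # 0.75
```

Benchmarks typically report this per task and then average across tasks (and, for generalization studies such as COLOSSEUM, across perturbation conditions).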
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv, 2023) [paper]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv, 2023) [paper]
- RT-H: Action Hierarchies Using Language (arXiv, 2024) [paper]
- AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents (arXiv, 2024) [paper]
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation (arXiv, 2024) [paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (arXiv, 2023) [paper]
- RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation (arXiv, 2024) [paper]
- ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation (arXiv, 2024) [paper]
- CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models (arXiv, 2024) [paper]
- OpenVLA: An Open-Source Vision-Language-Action Model (arXiv, 2024) [paper]
- PaLM-E: An Embodied Multimodal Language Model (arXiv, 2023) [paper]
- π0: A Vision-Language-Action Flow Model for General Robot Control (arXiv, 2024) [paper]
If you find our survey and repository useful for your research, please consider citing our paper:
@article{han2025multimodal,
title={Multimodal fusion and vision-language models: A survey for robot vision},
author={Han, Xiaofeng and Chen, Shunpeng and Fu, Zenghuang and Feng, Zhe and Fan, Lue and An, Dong and Wang, Changwei and Guo, Li and Meng, Weiliang and Zhang, Xiaopeng and others},
journal={Information Fusion},
volume={126},
pages={103652},
doi={10.1016/j.inffus.2025.103652},
year={2025},
publisher={Elsevier}
}
