Xiaofeng Han · Shunpeng Chen · Zenghuang Fu · Zhe Feng · Lue Fan · Dong An · Changwei Wang · Li Guo · Weiliang Meng* · Xiaopeng Zhang · Rongtao Xu* · Shibiao Xu
This repository tracks research on multimodal fusion and vision–language models (VLMs) for robot vision, covering semantic scene understanding, 3D perception, SLAM, navigation & localization, and manipulation. We also summarize datasets, metrics, challenges (e.g., cross-modal alignment, efficient fusion, real-time deployment), and future directions.
Give this repo a ⭐ if you find it useful. We will continue to track relevant progress and update this repository.
[2025-09-03] Our survey has been accepted by Information Fusion (Vol. 126, 2026), and the camera-ready version is now available. DOI: 10.1016/j.inffus.2025.103652.
[2025-09-01] Featured by Embodied Intelligence Hub — article recap of our survey and repo. Read the post »
The overview figure illustrates the overall framework of multimodal fusion and VLMs for robot vision:
| Dataset | Scene | Multimodal Data | Venue | Year |
|---|---|---|---|---|
| 360+x | Indoor/Outdoor | Video/Audio | CVPR | 2024 |
| ScanQA | Indoor | RGB/Text | CVPR | 2022 |
| Hypersim | Indoor | RGB/Depth | ICCV | 2021 |
| nuScenes | Urban street | RGB/Lidar/Radar | CVPR | 2020 |
| Waymo | Outdoor | RGB/Lidar | CVPR | 2020 |
| SemanticKITTI | Urban street | RGB/Lidar | ICCV | 2019 |
| Matterport3D | Indoor | RGB/Depth | arXiv | 2017 |
| ScanNet | Indoor | RGB/Depth | CVPR | 2017 |
| Cityscapes | Urban street | RGB/Depth | CVPR | 2016 |
| NYUDv2 | Indoor | RGB/Depth | ECCV | 2012 |
| Dataset | Core Modalities | Data Scale | Main Application |
|---|---|---|---|
| DROID | RGB, Depth, Text | 76,000 trajectories | Multi-task scene adaptation |
| R2SGrasp | RGB-D, Point Cloud | 64,000 RGB-D images | Grasp detection |
| RT-1 | RGB, Text | 130,000 trajectories | Real-time task control |
| Touch and Go | RGB, Tactile | 3,971 real-world objects, 13,900 tactile interactions | Cross-modal perception |
| VisGel | GelSight Tactile, RGB | 12,000 tactile interactions | Tactile-enhanced manipulation |
| ObjectFolder 2.0 | RGB, Audio, Tactile | 1,000 virtual object models | Virtual-to-reality transfer |
| Grasp-Anything-6D | Point Cloud, Text | 1M point cloud scenes | Language-driven grasping |
| Grasp-Anything++ | Point Cloud, Text | 1M samples, 10M instructions | Fine-grained manipulation |
| Open X-Embodiment | RGB, Depth, Text, Multi-robot Data | Aggregated data from multiple institutions | Cross-robot system generalization |
- Grasp-Anything++: Language-driven Grasp Detection (CVPR, 2024) [paper]
- Grasp-Anything-6D: Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance (ECCV, 2024) [paper]
- Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection (CoRL, 2024) [paper]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models (ICRA, 2024) [paper]
- DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (ICRA, 2024) [paper]
- ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer (CVPR, 2022) [paper]
- Touch and Go: Learning from Human-Collected Vision and Touch (NeurIPS, 2022) [paper]
- Connecting Touch and Vision via Cross-Modal Prediction (CVPR, 2019) [paper]
- RT-1: Robotics Transformer for Real-World Control at Scale (arXiv, 2022) [paper]
| Dataset | Modalities | Unique Feature |
|---|---|---|
| Matterport3D | RGB-D, Semantic Annotations | Foundational dataset for navigation |
| R2R | RGB-D, Natural Language | Vision-and-Language Navigation |
| REVERIE | RGB-D, Object Annotations | Combines object grounding tasks |
| CVDN | RGB-D, Dialog | Introduces multi-turn interactions |
| SOON | RGB-D, Natural Language | Coarse-to-fine target localization |
| R3ED | Point Cloud, Object Labels | Real-world sensor-based data |
| Title | Venue | Date |
|---|---|---|
| **Manipulation in Home Environment** | | |
| HomeRobot: Open-Vocabulary Mobile Manipulation | CoRL 2023 | 2023-06-20 |
| ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks | CVPR 2020 | 2019-12-03 |
| **Manipulation in On-Table Environment** | | |
| OBSBench: Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning | NeurIPS 2024 | 2024-02-04 |
| LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | NeurIPS 2023 | 2023-06-05 |
| Benchmark | Simulator | # Tasks | Real-World Reproducibility | Applicable Algorithms | Key Evaluation Metrics |
|---|---|---|---|---|---|
| RLBench | RLBench | 100+ | ✗ | RL, IL, Traditional Control | Task Success Rate, Trajectory Efficiency, Task Completion Time |
| GemBench | RLBench | 44 | ✗ | RL, IL, VLM-based | Zero-shot Task Success, Object Recognition, Generalization |
| VLMbench | RLBench | 8 | ✗ | RL, VLM-based | Task Execution Success, Compositional Generalization |
| KitchenShift | Isaac Sim | 7 | ✗ | IL, RL | Performance Under Domain Shifts, Task Success Rate |
| CALVIN | PyBullet | 34 | ✗ | RL, IL | Long-Horizon Task Success, Multi-Task Adaptability |
| COLOSSEUM | RLBench | 20 | ✓ | RL, IL | Robustness to Perturbations, Multi-Task Learning Performance |
| VIMA | Ravens | 17 | ✗ | RL, VLM-based | Zero-shot Success, Multi-Modal Task Performance |
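Several metrics in the table above (task success rate, zero-shot success) reduce to the fraction of evaluation episodes that end in success. A minimal sketch in Python — the function name and the episode list are illustrative only, not part of any benchmark's API:

```python
def task_success_rate(episode_outcomes):
    """Fraction of evaluation episodes that ended in success.

    episode_outcomes: iterable of booleans, True if the episode succeeded.
    Returns 0.0 for an empty evaluation set.
    """
    outcomes = list(episode_outcomes)
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# Hypothetical results for 4 rollouts of one task: 3 successes, 1 failure.
rate = task_success_rate([True, False, True, True])  # 0.75
```

Benchmarks typically report this per task and then average across tasks (and, for generalization studies such as COLOSSEUM, across perturbation conditions).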
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv, 2023) [paper]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv, 2023) [paper]
- RT-H: Action Hierarchies Using Language (arXiv, 2024) [paper]
- AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents (arXiv, 2024) [paper]
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation (arXiv, 2024) [paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (arXiv, 2023) [paper]
- RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation (arXiv, 2024) [paper]
- ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation (arXiv, 2024) [paper]
- CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models (arXiv, 2024) [paper]
- OpenVLA: An Open-Source Vision-Language-Action Model (arXiv, 2024) [paper]
- PaLM-E: An Embodied Multimodal Language Model (arXiv, 2023) [paper]
- π0: A Vision-Language-Action Flow Model for General Robot Control (arXiv, 2024) [paper]
If you find our survey and repository useful for your research, please consider citing our paper:
@article{han2025multimodal,
title={Multimodal fusion and vision-language models: A survey for robot vision},
author={Han, Xiaofeng and Chen, Shunpeng and Fu, Zenghuang and Feng, Zhe and Fan, Lue and An, Dong and Wang, Changwei and Guo, Li and Meng, Weiliang and Zhang, Xiaopeng and others},
journal={Information Fusion},
volume={126},
pages={103652},
doi={10.1016/j.inffus.2025.103652},
year={2025},
publisher={Elsevier}
}
