Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Xiaofeng Han · Shunpeng Chen · Zenghuang Fu · Zhe Feng · Lue Fan · Dong An · Changwei Wang · Li Guo · Weiliang Meng* · Xiaopeng Zhang · Rongtao Xu* · Shibiao Xu

License: MIT

This repository tracks research on multimodal fusion and vision–language models (VLMs) for robot vision, covering semantic scene understanding, 3D perception, SLAM, navigation & localization, and manipulation. We also summarize datasets, metrics, challenges (e.g., cross-modal alignment, efficient fusion, real-time deployment), and future directions.

⭐ Give Us a Star

If you're interested in this repo, please give it a ⭐. We will continue to track relevant progress and update this repository.

News

[2025-09-03] Our survey has been accepted by Information Fusion (Vol. 126, 2026); the camera-ready version is now available. DOI: 10.1016/j.inffus.2025.103652.

[2025-09-01] Our survey and this repository were featured in an article recap by Embodied Intelligence Hub. Read the post »

📖 Introduction

The overview figure illustrates the overall framework of multimodal fusion and VLMs for robot vision:

(Overview figure)

Table of Contents

Related Surveys

📊 Awesome Benchmarks

Scene Understanding Datasets

| Dataset | Scene | Multimodal Data | Venue | Year |
| --- | --- | --- | --- | --- |
| 360+x | Indoor/Outdoor | Video/Audio | CVPR | 2024 |
| ScanQA | Indoor | RGB/Text | CVPR | 2022 |
| Hypersim | Indoor | RGB/Depth | ICCV | 2021 |
| nuScenes | Urban street | RGB/LiDAR/Radar | CVPR | 2020 |
| Waymo | Outdoor | RGB/LiDAR | CVPR | 2020 |
| SemanticKITTI | Urban street | RGB/LiDAR | ICCV | 2019 |
| Matterport3D | Indoor | RGB/Depth | arXiv | 2017 |
| ScanNet | Indoor | RGB/Depth | CVPR | 2017 |
| Cityscapes | Urban street | RGB/Depth | CVPR | 2016 |
| NYUDv2 | Indoor | RGB/Depth | ECCV | 2012 |
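For the driving-scene datasets above, the official devkits expose synchronized multimodal samples. As a minimal sketch (assuming the nuscenes-devkit package and a local copy of the v1.0-mini split at /data/sets/nuscenes; adjust both to your setup), the following shows how one might fetch the paired front-camera image and LiDAR sweep for a single keyframe:

```python
# pip install nuscenes-devkit
from nuscenes.nuscenes import NuScenes

# Version string and data root are assumptions; point them at your own download.
nusc = NuScenes(version="v1.0-mini", dataroot="/data/sets/nuscenes", verbose=True)

sample = nusc.sample[0]                                        # one annotated keyframe
cam = nusc.get("sample_data", sample["data"]["CAM_FRONT"])     # RGB image record
lidar = nusc.get("sample_data", sample["data"]["LIDAR_TOP"])   # LiDAR sweep record

print("image file:", cam["filename"])
print("lidar file:", lidar["filename"])
```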

Robot Manipulation Datasets

| Dataset | Core Modalities | Data Scale | Main Application |
| --- | --- | --- | --- |
| DROID | RGB, Depth, Text | 76,000 trajectories | Multi-task scene adaptation |
| R2SGrasp | RGB-D, Point Cloud | 64,000 RGB-D images | Grasp detection |
| RT-1 | RGB, Text | 130,000 trajectories | Real-time task control |
| Touch and Go | RGB, Tactile | 3,971 objects, 13,900 tactile interactions | Cross-modal perception |
| VisGel | GelSight Tactile, RGB | 12,000 tactile interactions | Tactile-enhanced manipulation |
| ObjectFolder 2.0 | RGB, Audio, Tactile | 1,000 virtual object models | Virtual-to-reality transfer |
| Grasp-Anything-6D | Point Cloud, Text | 1M point cloud scenes | Language-driven grasping |
| Grasp-Anything++ | Point Cloud, Text | 1M samples, 10M instructions | Fine-grained manipulation |
| Open X-Embodiment | RGB, Depth, Text, Multi-robot Data | Aggregated data from multiple institutions | Cross-robot system generalization |
  • Grasp-Anything++: Language-driven Grasp Detection (CVPR, 2024) [paper]
  • Grasp-Anything-6D: Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance (ECCV, 2024) [paper]
  • Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection (CoRL, 2024) [paper]
  • Open X-Embodiment: Robotic Learning Datasets and RT-X Models (ICRA, 2024) [paper]
  • DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (ICRA, 2024) [paper]
  • ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer (CVPR, 2022) [paper]
  • Touch and Go: Learning from Human-Collected Vision and Touch (NeurIPS, 2022) [paper]
  • Connecting Touch and Vision via Cross-Modal Prediction (CVPR, 2019) [paper]
  • RT-1: Robotics Transformer for Real-World Control at Scale (arXiv, 2022) [paper]

Embodied Navigation Datasets

| Dataset | Modalities | Unique Feature |
| --- | --- | --- |
| Matterport3D | RGB-D, Semantic Annotations | Foundational dataset for navigation |
| R2R | RGB-D, Natural Language | Vision-and-Language Navigation |
| REVERIE | RGB-D, Object Annotations | Combines object grounding tasks |
| CVDN | RGB-D, Dialog | Introduces multi-turn interactions |
| SOON | RGB-D, Natural Language | Coarse-to-fine target localization |
| R3ED | Point Cloud, Object Labels | Real-world sensor-based data |
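To give a feel for how the language-conditioned navigation datasets are consumed, the sketch below iterates over a Room-to-Room (R2R) style annotation file. The file name and field names (scan, path, heading, instructions) are assumptions based on the public R2R release; verify them against the copy you download:

```python
import json

# Hypothetical local path; the R2R release ships files such as R2R_train.json.
with open("data/R2R_train.json") as f:
    episodes = json.load(f)

ep = episodes[0]
print("Matterport scan:", ep["scan"])      # building the trajectory lives in
print("viewpoint path:", ep["path"])       # sequence of panorama IDs to traverse
print("start heading (rad):", ep["heading"])
for instr in ep["instructions"]:           # several natural-language instructions per path
    print("-", instr)
```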

Manipulation Benchmarks

| Title | Venue | Date |
| --- | --- | --- |
| Manipulation in Home Environment | | |
| HomeRobot: Open-Vocabulary Mobile Manipulation | CoRL 2023 | 2023-06-20 |
| ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks | CVPR 2020 | 2019-12-03 |
| Manipulation in On-Table Environment | | |
| OBSBench: Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning | NeurIPS 2024 | 2024-02-04 |
| LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | NeurIPS 2023 | 2023-06-05 |

| Benchmark | Simulator | # Tasks | Real-World Reproducibility | Applicable Algorithms | Key Evaluation Metrics |
| --- | --- | --- | --- | --- | --- |
| RLBench | RLBench | 100+ | | RL, IL, Traditional Control | Task Success Rate, Trajectory Efficiency, Task Completion Time |
| GemBench | RLBench | 44 | | RL, IL, VLM-based | Zero-shot Task Success, Object Recognition, Generalization |
| VLMbench | RLBench | 8 | | RL, VLM-based | Task Execution Success, Compositional Generalization |
| KitchenShift | Isaac Sim | 7 | | IL, RL | Performance Under Domain Shifts, Task Success Rate |
| CALVIN | PyBullet | 34 | | RL, IL | Long-Horizon Task Success, Multi-Task Adaptability |
| COLOSSEUM | RLBench | 20 | | RL, IL | Robustness to Perturbations, Multi-Task Learning Performance |
| VIMA | Ravens | 17 | | RL, VLM-based | Zero-shot Success, Multi-Modal Task Performance |
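Most of the benchmarks above report task success rate as their primary metric. As a minimal, library-free sketch of how such results are typically aggregated (the episode records and task names here are made up purely for illustration):

```python
from collections import defaultdict

# Hypothetical evaluation log: (task_name, succeeded, steps_taken)
episodes = [
    ("open_drawer", True, 87),
    ("open_drawer", False, 200),
    ("stack_blocks", True, 143),
]

per_task = defaultdict(lambda: {"success": 0, "total": 0})
for task, ok, _steps in episodes:
    per_task[task]["total"] += 1
    per_task[task]["success"] += int(ok)

# Per-task and overall success rates
for task, counts in per_task.items():
    print(f"{task}: success rate = {counts['success'] / counts['total']:.2%}")
overall = sum(ok for _, ok, _ in episodes) / len(episodes)
print(f"overall: {overall:.2%}")
```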

Embodied Large Language Models

  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv, 2023) [paper]
  • Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv, 2023) [paper]
  • RT-H: Action Hierarchies Using Language (arXiv, 2024) [paper]
  • AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents (arXiv, 2024) [paper]
  • GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation (arXiv, 2024) [paper]
  • VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (arXiv, 2023) [paper]
  • RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation (arXiv, 2024) [paper]
  • ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation (arXiv, 2024) [paper]
  • CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models (arXiv, 2024) [paper]
  • OpenVLA: An Open-Source Vision-Language-Action Model (arXiv, 2024) [paper]
  • PaLM-E: An Embodied Multimodal Language Model (arXiv, 2023) [paper]
  • π0: A Vision-Language-Action Flow Model for General Robot Control (arXiv, 2024) [paper]
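Several of the models above, such as OpenVLA, release open checkpoints, so a robot-vision pipeline can query them directly for low-level actions. The snippet below is a condensed sketch along the lines of the OpenVLA model card; the Hugging Face repo id openvla/openvla-7b, the prompt template, and the unnorm_key value are assumptions taken from that card and should be verified against the official documentation:

```python
# pip install torch transformers pillow   (the remote code may also require timm / flash-attn)
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # assumed public checkpoint id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

# Placeholder for a real camera frame.
image = Image.open("frame.png").convert("RGB")

# Prompt format and un-normalization key follow the model card; both are assumptions here.
prompt = "In: What action should the robot take to pick up the red block?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# Returns a 7-DoF end-effector action (delta pose + gripper opening),
# un-normalized with statistics of the chosen training data mixture.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```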

📖 Citation

If you find our survey and repository useful for your research, please consider citing our paper:

@article{han2025multimodal,
  title={Multimodal fusion and vision-language models: A survey for robot vision},
  author={Han, Xiaofeng and Chen, Shunpeng and Fu, Zenghuang and Feng, Zhe and Fan, Lue and An, Dong and Wang, Changwei and Guo, Li and Meng, Weiliang and Zhang, Xiaopeng and others},
  journal={Information Fusion},
  pages={103652},
  year={2025},
  publisher={Elsevier}
}
