A curated collection of Retrieval-Augmented Generation (RAG) for Computer Vision papers, covering visual understanding, visual generation, video, documents, embodied AI, and more.

💡 Feel free to open a Pull Request to add your work on RAG for Vision!

Retrieval-Augmented Generation (RAG) integrates retrieval into generative models, enabling them to query external knowledge bases (or memory banks) at inference time.

In Computer Vision, RAG has been used for:

- Image captioning / VQA with external knowledge or retrieved exemplars
- Video QA and long-context understanding via retrieved transcripts or clips
- Visual generation with retrieved reference images, templates, or domain knowledge
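The retrieve-then-generate loop shared by most of the papers below can be summarized in a few lines. The sketch uses a toy bag-of-words embedding purely for illustration; real systems would substitute a vision-language encoder (e.g. a CLIP-style model) and an MLLM in place of the `embed` and prompt steps, and all names here (`embed`, `retrieve`, `build_prompt`) are illustrative assumptions, not APIs from any listed work.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real encoder: hash tokens into a normalized bag-of-words vector."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query: str, memory: list[str], k: int = 2) -> list[str]:
    """Rank the memory bank by cosine similarity and return the top-k entries."""
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in memory]
    order = np.argsort(scores)[::-1][:k]
    return [memory[i] for i in order]

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Prepend retrieved evidence to the question before calling the generator."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Memory bank of (here, textual) entries; in vision RAG these would be
# image/caption/clip embeddings indexed offline.
memory = [
    "a red double-decker bus on a London street",
    "a golden retriever catching a frisbee in a park",
    "an aerial photo of rice terraces at sunrise",
]
top = retrieve("dog catching a frisbee", memory, k=1)
print(build_prompt("What breed is the dog in the photo?", top))
```

The interesting design choices live in what populates `memory` (web pages, video transcripts, CAD models, document pages) and when retrieval fires (once per query, iteratively, or at test-time adaptation), which is exactly the axis along which the papers below differ.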
Surveys

| Year | Paper | Focused Areas | Main Context | GitHub |
|---|---|---|---|---|
| 2023 | Gao et al. | LLMs / NLP | RAG paradigms and components | - |
| 2024 | Fan et al. | LLMs / NLP | RA-LLMs' architectures, training, and applications | link |
| 2024 | Hu et al. | LLMs / NLP | RA-LMs' components, evaluation, and limitations | link |
| 2024 | Zhao et al. | LLMs / NLP | Challenges in data-augmented LLMs | - |
| 2024 | Gupta et al. | LLMs / NLP | Advancements and downstream tasks of RAG | - |
| 2024 | Zhao et al. | RAG in AIGC | RAG applications across modalities | link |
| 2024 | Yu et al. | LLMs / NLP | Unified evaluation process of RAG | link |
| 2024 | Procko et al. | Graph Learning | Knowledge graphs with LLM RAG | - |
| 2024 | Zhou et al. | Trustworthy AI | Six dimensions and benchmarks of trustworthy RAG | link |
| 2025 | Singh et al. | AI Agents | Principles and evaluation | link |
| 2025 | Ni et al. | Trustworthy AI | Roadmap and discussion | link |
| 2025 | Ours | Computer Vision | RAG for visual understanding and generation | link |
1.1 Visual Understanding

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| 🔥Test-Time Retrieval-Augmented Adaptation for VLMs | Fan et al. | ICCV 2025 | paper |
| 🔥Retrieval-Augmented VQA for Scientific Figures (RAVQA-VLM) | Li et al. | AAAI 2025 | paper |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | Sarwar | arXiv 2025 (Sep) | paper |
| mRAG: Elucidating the Design Space of Multi-modal RAG | Hu et al. | arXiv 2025 (Aug) | paper |
| Multimodal RAG Enhanced Visual Description | Jaiswal et al. | arXiv 2025 (Aug) | paper |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Wu et al. | arXiv 2024 (Dec) | paper |
| Retrieval-Augmented Open-Vocabulary Object Detection | Kim et al. | CVPR 2024 | paper |
| Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning | Li et al. | arXiv 2024 (Aug) | paper |
| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | Li et al. | NeurIPS 2024 | paper |
| Learning Customized Visual Models with Retrieval-Augmented Knowledge | Liu et al. | CVPR 2023 | paper |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Lin et al. | NeurIPS 2023 | paper |
| Retrieval-Augmented Classification for Long-Tail Visual Recognition | Long et al. | CVPR 2022 | (TODO: add paper link) |
1.2 (Long) Video Understanding
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | Zeng et al. | arXiv 2025 (Jun) | paper |
| Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | Mao et al. | arXiv 2025 (Jun) | paper |
| VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos | Gia et al. | CVPRW 2025 | paper |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Xiong et al. | ICLR 2025 | paper |
| Temporal Preference Optimization for Long-Form Video Understanding | Li et al. | arXiv 2025 (Jan) | paper |
| StreamingRAG: Real-time Contextual Retrieval and Generation Framework | Sankaradas et al. | arXiv 2025 (Jan) | paper |
| VideoAuteur: Towards Long Narrative Video Generation | Xiao et al. | arXiv 2025 (Jan) | paper |
| Generative Frame Sampler for Long Video Understanding | Yao et al. | ACL 2024 | paper |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on LVLMs | Fu et al. | arXiv 2024 (Dec) | paper |
| Vinci: A Real-time Embodied Smart Assistant based on Egocentric VLM | Huang et al. | arXiv 2024 (Dec) | paper |
| Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models | Yi et al. | arXiv 2024 (Dec) | paper |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Luo et al. | arXiv 2024 (Nov) | paper |
| Goldfish: Vision-Language Understanding of Arbitrarily Long Videos | Ataallah et al. | arXiv 2024 (Jul) | paper |
| ViTA: Efficient Video-to-Text with VLM for RAG-based Video Analysis | Arefeen et al. | CVPRW 2024 | paper |
| iRAG: Advancing RAG for Videos with an Incremental Approach | Arefeen et al. | CIKM 2024 | paper |
1.3 Visual Spatial Understanding
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| RAG-Guided LLMs for Visual Spatial Description with Adaptive Hallucination Corrector | Yu et al. | ACM MM 2024 | paper |
1.4 Document & Knowledge RAG

This section covers multimodal RAG methods that retrieve and reason over documents, knowledge bases/graphs, and enterprise data, along with evaluation and benchmarks.
1.4.1 Doc-RAG (Document-centric)
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation | Sun et al. | arXiv 2025 (Oct) | paper |
| UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG | Peng et al. | arXiv 2025 (Oct) | paper |
| CMRAG: Co-modality-based document retrieval and visual question answering | Chen et al. | arXiv 2025 (Sep) | paper |
| Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries | Wu et al. | arXiv 2025 (Aug) | paper |
| Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding | Mannam et al. | KDDW 2025 (Jun) | paper |
| DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers | Wasserman et al. | arXiv 2025 (May) | paper |
| A Multi-Granularity Retrieval Framework for Visually-Rich Documents | Xu et al. | arXiv 2025 (May) | paper |
| FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain | Zhao et al. | arXiv 2025 (May) | paper |
| VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents | Tanaka et al. | arXiv 2025 (Apr) | paper |
| SuperRAG: Beyond RAG with Layout-Aware Graph Modeling | Yang et al. | NAACL 2025 (Mar) | paper |
| MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Han et al. | arXiv 2025 (Mar) | paper |
| SiQA: A Large Multi-Modal Question Answering Model for Structured Images Based on RAG | Liu et al. | ICASSP 2025 (Mar) | paper |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Yang et al. | arXiv 2025 (Feb) | paper |
| ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents | Wang et al. | arXiv 2025 (Feb) | paper |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | Caffagni et al. | CVPRW 2024 | paper |
| M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding | Cho et al. | arXiv 2024 (Nov) | paper |
| VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents | Yu et al. | ICLR 2025 | paper |
1.4.2 Knowledge-RAG (Knowledge / Knowledge Graph / External KB)
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Knowledge Graph-Guided Retrieval-Augmented Generation | Zhang et al. | ACL 2025 | paper |
| Multimodal Iterative RAG for Knowledge Visual Question Answering | Choi et al. | arXiv 2025 (Sep) | paper |
| mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering | Yuan et al. | arXiv 2025 (Aug) | paper |
| VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation | Park et al. | arXiv 2025 (Jun) | paper |
| CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG | Tian et al. | ACL 2025 (Jun) | paper |
| MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework | Ling et al. | arXiv 2025 (Apr) | paper |
| CommGPT: A Graph and Retrieval-Augmented Multimodal Communication Foundation Model | Jiang et al. | arXiv 2025 (Feb) | paper |
| MuKA: Multimodal Knowledge Augmented Visual Information-Seeking | Deng et al. | COLING 2025 (Jan) | paper |
| mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA | Zhang et al. | arXiv 2024 (Nov) | paper |
1.4.3 Enterprise / Industrial
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| AUGUSTUS: An LLM-Driven Multimodal Agent System | Jain et al. | arXiv 2025 | paper |
| Beyond the Textual: Generating Coherent Visual Options for MCQs | Wang et al. | arXiv 2025 (Aug) | paper |
| MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering | Gondhalekar et al. | arXiv 2025 (Jun) | paper |
| Provenance Analysis of Archaeological Artifacts via Multimodal RAG Systems | Zhang et al. | arXiv 2025 (Sep) | paper |
| RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Wen et al. | arXiv 2025 (Apr) | paper |
1.4.4 Evaluation / Benchmark / Robustness
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples | Sun et al. | arXiv 2025 (Sep) | paper |
| FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation | Zhang et al. | arXiv 2025 (Jun) | paper |
| Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger | Yang et al. | arXiv 2025 (Jun) | paper |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Jia et al. | arXiv 2025 (May) | paper |
| Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey | Gan et al. | arXiv 2025 (Apr) | paper |
| MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation | Yu et al. | arXiv 2025 (Apr) | paper |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | Sarwar | arXiv 2025 (Feb) | paper |
| MRAG-Bench: Vision-centric Evaluation for Retrieval-Augmented Multimodal Models | Hu et al. | ICLR 2025 | paper |
| Re-ranking the Context for Multimodal Retrieval Augmented Generation | Mortaheb et al. | arXiv 2025 (Jan) | paper |
| A General Retrieval-Augmented Generation Framework for Multimodal Case-Based Reasoning Applications | Marom | arXiv 2025 (Jan) | paper |
| Visual RAG: Expanding MLLM Visual Knowledge without Fine-tuning | Bonomo et al. | arXiv 2025 (Jan) | paper |
| UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models | Sharifymoghaddam et al. | arXiv 2024 (Oct) | paper |
| RoRA-VLM: Robust Retrieval Augmentation for Vision Language Models | Qi et al. | arXiv 2024 (Oct) | paper |
| SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information | Sun et al. | EMNLP 2024 (Sep) | paper |
| SearchLVLMs | Li et al. | NeurIPS 2024 | paper |
| RAVEN: Multitask Retrieval Augmented Vision-Language Learning | Rao et al. | COLM 2024 (Jun) | paper |
| Retrieval Meets Reasoning | Tan et al. | arXiv 2024 (Apr) | paper |
| RAR | Liu et al. | arXiv 2024 (Mar) | paper |
| MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning | Cui et al. | ACL 2024 (Feb) | paper |
| Fine-grained Late-interaction Multi-modal Retrieval for RAG-VQA | Lin et al. | NeurIPS 2023 | paper |
| Retrieval-based Knowledge Augmented Vision Language Pre-training | Rao et al. | ACM MM 2023 (Apr) | paper |
| REVEAL | Hu et al. | CVPR 2023 (Apr) | paper |
| MuRAG | Chen et al. | EMNLP 2022 (Oct) | paper |
1.5 Medical

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| 🔥MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering | Shaaban et al. | MICCAI 2025 | paper |
| How to Make Medical AI Systems Safer? Simulating Vulnerabilities in Multimodal Medical RAG | Zuo et al. | arXiv 2025 (Aug) | paper |
| AlzheimerRAG: Multimodal RAG for Clinical Use Cases using PubMed | Lahiri et al. | arXiv 2025 (Aug) | paper |
| HeteroRAG: A Heterogeneous RAG Framework for Medical Vision-Language Tasks | Chen et al. | arXiv 2025 (Aug) | paper |
| REALM: RAG-Driven Enhancement of Multimodal EHR Analysis | Zhu et al. | arXiv 2025 (Feb) | paper |
| MMed-RAG: Versatile Multimodal RAG for Medical VLMs | Xia et al. | arXiv 2024 (Oct) | paper |
| RULE: Reliable Multimodal RAG for Factuality in Medical VLMs | Xia et al. | EMNLP 2024 | paper |
2.1 Image (Video) Generation
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation | Zhu et al. | arXiv 2025 (Sep) | paper |
| GarmentAligner | Zhang et al. | ECCV 2025 | paper |
| RealRAG | Lyu et al. | ICML 2025 | paper |
| FineRAG | Yuan et al. | COLING 2025 | paper |
| ImageRAG | Shalev-Arkushin et al. | arXiv 2025 | paper |
| BrainRAM | Xie et al. | ACM MM 2024 | paper |
| Instruct-Imagen | Hu et al. | CVPR 2024 | paper |
| Retrieval-Augmented Layout Transformer | Horita et al. | CVPR 2024 | paper |
| The Neglected Tails in Vision-Language Models | Parashar et al. | CVPR 2024 | paper |
| FairRAG: Fair Human Generation via Fair Retrieval Augmentation | Shrestha et al. | CVPR 2024 | paper |
| Grounding Language Models for Visual Entity Recognition | Xiao et al. | ECCV 2024 | paper |
| RealGen | Ding et al. | ECCV 2024 | paper |
| Factuality Tax of Diversity-Intervened Generation | Wan et al. | EMNLP 2024 | paper |
| Prompt Expansion for Adaptive Text-to-Image Generation | Datta et al. | ACL 2024 | paper |
| Label-Retrieval-Augmented Diffusion Models | Chen et al. | NeurIPS 2023 | paper |
| CPR: Retrieval-Augmented Generation for Copyright Protection | Golatkar et al. | CVPR 2023 | paper |
| ReMoDiffuse | Zhang et al. | ICCV 2023 | paper |
| Diffusion-Based Augmentation for Captioning and Retrieval | Cioni et al. | ICCVW 2023 | paper |
| Animate-A-Story | He et al. | arXiv 2023 | paper |
| Retrieval-Augmented Diffusion Models | Blattmann et al. | NeurIPS 2022 | paper |
| Re-Imagen | Chen et al. | arXiv 2022 | paper |
2.2 3D (Motion) Generation

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow | Gupta et al. | arXiv 2025 (Oct) | paper |
| MV-RAG: Retrieval-Augmented Multiview Diffusion | Dayani et al. | arXiv 2025 (Aug) | paper |
| VimoRAG: Video-based Retrieval-Augmented 3D Motion Generation | Xu et al. | arXiv 2025 (Aug) | paper |
| Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions | Wang et al. | arXiv 2024 (Sep) | paper |
| Retrieval-Augmented Score Distillation for Text-to-3D Generation | Seo et al. | ICML 2024 | paper |
| Diorama: Unleashing Zero-shot Single-view 3D Scene Modeling | Wu et al. | arXiv 2024 (Nov) | paper |
| Interaction-based Retrieval-Augmented Diffusion for Protein 3D Generation | Huang et al. | ICML 2024 | paper |
| ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model | Zhang et al. | ICCV 2023 | paper |
3 Embodied AI

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory | Wang et al. | arXiv 2026 (Jan) | paper |
| SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation | Ye et al. | arXiv 2025 (Jul) | paper |
| RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base | Wang et al. | IROS 2025 (Jul) | paper |
| RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving | Wang et al. | arXiv 2025 (Mar) | paper |
| RANa: Retrieval-Augmented Navigation | Monaci et al. | arXiv 2025 (Apr) | paper |
| P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task | Xu et al. | ACM MM 2024 | paper |
| RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios | Ding et al. | ECCV 2024 | paper |
| Retrieval-Augmented Embodied Agents | Zhu et al. | CVPR 2024 | paper |
| ENWAR: A RAG-empowered Multi-Modal LLM Framework for Wireless Environment Perception | Nazar et al. | arXiv 2024 (Oct) | paper |
| Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation | Xie et al. | arXiv 2024 (Oct) | paper |
| RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning | Yuan et al. | arXiv 2024 (May) | paper |