A curated collection of Retrieval-Augmented Generation (RAG) for Computer Vision papers, covering visual understanding, visual generation, video, documents, embodied AI, and more.

💡 Feel free to open a Pull Request to add your work on RAG for Vision!

Retrieval-Augmented Generation (RAG) integrates retrieval into generative models, enabling them to query external knowledge bases (or memory banks) at inference time.

In Computer Vision, RAG has been used for:

- Image captioning / VQA with external knowledge or retrieved exemplars
- Video QA and long-context understanding via retrieved transcripts or clips
- Visual generation with retrieved reference images, templates, or domain knowledge
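The retrieve-then-generate loop shared by most of the papers below can be summarized in a few lines. The sketch uses a toy bag-of-words embedding purely for illustration; real systems would substitute a vision-language encoder (e.g. a CLIP-style model) and an MLLM in place of the `embed` and prompt steps, and all names here (`embed`, `retrieve`, `build_prompt`) are illustrative assumptions, not APIs from any listed work.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real encoder: hash tokens into a normalized bag-of-words vector."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query: str, memory: list[str], k: int = 2) -> list[str]:
    """Rank the memory bank by cosine similarity and return the top-k entries."""
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in memory]
    order = np.argsort(scores)[::-1][:k]
    return [memory[i] for i in order]

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Prepend retrieved evidence to the question before calling the generator."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Memory bank of (here, textual) entries; in vision RAG these would be
# image/caption/clip embeddings indexed offline.
memory = [
    "a red double-decker bus on a London street",
    "a golden retriever catching a frisbee in a park",
    "an aerial photo of rice terraces at sunrise",
]
top = retrieve("dog catching a frisbee", memory, k=1)
print(build_prompt("What breed is the dog in the photo?", top))
```

The interesting design choices live in what populates `memory` (web pages, video transcripts, CAD models, document pages) and when retrieval fires (once per query, iteratively, or at test-time adaptation), which is exactly the axis along which the papers below differ.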
Surveys

| Year | Paper | Focused Areas | Main Context | GitHub |
|---|---|---|---|---|
| 2023 | Gao et al. | LLMs / NLP | RAG paradigms and components | - |
| 2024 | Fan et al. | LLMs / NLP | RA-LLMs' architectures, training, and applications | link |
| 2024 | Hu et al. | LLMs / NLP | RA-LMs' components, evaluation, and limitations | link |
| 2024 | Zhao et al. | LLMs / NLP | Challenges in data-augmented LLMs | - |
| 2024 | Gupta et al. | LLMs / NLP | Advancements and downstream tasks of RAG | - |
| 2024 | Zhao et al. | RAG in AIGC | RAG applications across modalities | link |
| 2024 | Yu et al. | LLMs / NLP | Unified evaluation process of RAG | link |
| 2024 | Procko et al. | Graph Learning | Knowledge graphs with LLM RAG | - |
| 2024 | Zhou et al. | Trustworthy AI | Six dimensions and benchmarks of trustworthy RAG | link |
| 2025 | Singh et al. | AI Agents | Principles and evaluation | link |
| 2025 | Ni et al. | Trustworthy AI | Roadmap and discussion | link |
| 2025 | Ours | Computer Vision | RAG for visual understanding and generation | link |
1.1 Visual Understanding

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| 🔥Test-Time Retrieval-Augmented Adaptation for VLMs | Fan et al. | ICCV 2025 | paper |
| 🔥Retrieval-Augmented VQA for Scientific Figures (RAVQA-VLM) | Li et al. | AAAI 2025 | paper |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | Sarwar | arXiv 2025 (Sep) | paper |
| mRAG: Elucidating the Design Space of Multi-modal RAG | Hu et al. | arXiv 2025 (Aug) | paper |
| Multimodal RAG Enhanced Visual Description | Jaiswal et al. | arXiv 2025 (Aug) | paper |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Wu et al. | arXiv 2024 (Dec) | paper |
| Retrieval-Augmented Open-Vocabulary Object Detection | Kim et al. | CVPR 2024 | paper |
| Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning | Li et al. | arXiv 2024 (Aug) | paper |
| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | Li et al. | NeurIPS 2024 | paper |
| Learning Customized Visual Models with Retrieval-Augmented Knowledge | Liu et al. | CVPR 2023 | paper |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Lin et al. | NeurIPS 2023 | paper |
| Retrieval-Augmented Classification for Long-Tail Visual Recognition | Long et al. | CVPR 2022 | (TODO: add paper link) |
1.2 (Long) Video Understanding
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | Zeng et al. | arXiv 2025 (Jun) | paper |
| Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | Mao et al. | arXiv 2025 (Jun) | paper |
| VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos | Gia et al. | CVPRW 2025 | paper |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Xiong et al. | ICLR 2025 | paper |
| Temporal Preference Optimization for Long-Form Video Understanding | Li et al. | arXiv 2025 (Jan) | paper |
| StreamingRAG: Real-time Contextual Retrieval and Generation Framework | Sankaradas et al. | arXiv 2025 (Jan) | paper |
| VideoAuteur: Towards Long Narrative Video Generation | Xiao et al. | arXiv 2025 (Jan) | paper |
| Generative Frame Sampler for Long Video Understanding | Yao et al. | ACL 2024 | paper |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on LVLMs | Fu et al. | arXiv 2024 (Dec) | paper |
| Vinci: A Real-time Embodied Smart Assistant based on Egocentric VLM | Huang et al. | arXiv 2024 (Dec) | paper |
| Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models | Yi et al. | arXiv 2024 (Dec) | paper |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Luo et al. | arXiv 2024 (Nov) | paper |
| Goldfish: Vision-Language Understanding of Arbitrarily Long Videos | Ataallah et al. | arXiv 2024 (Jul) | paper |
| ViTA: Efficient Video-to-Text with VLM for RAG-based Video Analysis | Arefeen et al. | CVPRW 2024 | paper |
| iRAG: Advancing RAG for Videos with an Incremental Approach | Arefeen et al. | CIKM 2024 | paper |
1.3 Visual Spatial Understanding
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| RAG-Guided LLMs for Visual Spatial Description with Adaptive Hallucination Corrector | Yu et al. | ACM MM 2024 | paper |
1.4 Document & Knowledge RAG

This section covers multimodal RAG methods that retrieve and reason over documents, knowledge bases/graphs, and enterprise data, along with evaluation and benchmarks.
1.4.1 Doc-RAG (Document-centric)
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation | Sun et al. | arXiv 2025 (Oct) | paper |
| UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG | Peng et al. | arXiv 2025 (Oct) | paper |
| CMRAG: Co-modality-based document retrieval and visual question answering | Chen et al. | arXiv 2025 (Sep) | paper |
| Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries | Wu et al. | arXiv 2025 (Aug) | paper |
| Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding | Mannam et al. | KDDW 2025 (Jun) | paper |
| DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers | Wasserman et al. | arXiv 2025 (May) | paper |
| A Multi-Granularity Retrieval Framework for Visually-Rich Documents | Xu et al. | arXiv 2025 (May) | paper |
| FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain | Zhao et al. | arXiv 2025 (May) | paper |
| VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents | Tanaka et al. | arXiv 2025 (Apr) | paper |
| SuperRAG: Beyond RAG with Layout-Aware Graph Modeling | Yang et al. | NAACL 2025 (Mar) | paper |
| MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Han et al. | arXiv 2025 (Mar) | paper |
| SiQA: A Large Multi-Modal Question Answering Model for Structured Images Based on RAG | Liu et al. | ICASSP 2025 (Mar) | paper |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Yang et al. | arXiv 2025 (Feb) | paper |
| ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents | Wang et al. | arXiv 2025 (Feb) | paper |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | Caffagni et al. | CVPRW 2024 | paper |
| M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding | Cho et al. | arXiv 2024 (Nov) | paper |
| VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents | Yu et al. | ICLR 2025 | paper |
1.4.2 Knowledge-RAG (Knowledge / Knowledge Graph / External KB)
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Knowledge Graph-Guided Retrieval-Augmented Generation | Zhang et al. | ACL 2025 | paper |
| Multimodal Iterative RAG for Knowledge Visual Question Answering | Choi et al. | arXiv 2025 (Sep) | paper |
| mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering | Yuan et al. | arXiv 2025 (Aug) | paper |
| VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation | Park et al. | arXiv 2025 (Jun) | paper |
| CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG | Tian et al. | ACL 2025 (Jun) | paper |
| MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework | Ling et al. | arXiv 2025 (Apr) | paper |
| CommGPT: A Graph and Retrieval-Augmented Multimodal Communication Foundation Model | Jiang et al. | arXiv 2025 (Feb) | paper |
| MuKA: Multimodal Knowledge Augmented Visual Information-Seeking | Deng et al. | COLING 2025 (Jan) | paper |
| mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA | Zhang et al. | arXiv 2024 (Nov) | paper |
1.4.3 Enterprise / Industrial
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| AUGUSTUS: An LLM-Driven Multimodal Agent System | Jain et al. | arXiv 2025 | paper |
| Beyond the Textual: Generating Coherent Visual Options for MCQs | Wang et al. | arXiv 2025 (Aug) | paper |
| MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering | Gondhalekar et al. | arXiv 2025 (Jun) | paper |
| Provenance Analysis of Archaeological Artifacts via Multimodal RAG Systems | Zhang et al. | arXiv 2025 (Sep) | paper |
| RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Wen et al. | arXiv 2025 (Apr) | paper |
1.4.4 Evaluation / Benchmark / Robustness
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples | Sun et al. | arXiv 2025 (Sep) | paper |
| FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation | Zhang et al. | arXiv 2025 (Jun) | paper |
| Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger | Yang et al. | arXiv 2025 (Jun) | paper |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Jia et al. | arXiv 2025 (May) | paper |
| Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey | Gan et al. | arXiv 2025 (Apr) | paper |
| MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation | Yu et al. | arXiv 2025 (Apr) | paper |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | Sarwar | arXiv 2025 (Feb) | paper |
| MRAG-Bench: Vision-centric Evaluation for Retrieval-Augmented Multimodal Models | Hu et al. | ICLR 2025 | paper |
| Re-ranking the Context for Multimodal Retrieval Augmented Generation | Mortaheb et al. | arXiv 2025 (Jan) | paper |
| A General Retrieval-Augmented Generation Framework for Multimodal Case-Based Reasoning Applications | Marom | arXiv 2025 (Jan) | paper |
| Visual RAG: Expanding MLLM Visual Knowledge without Fine-tuning | Bonomo et al. | arXiv 2025 (Jan) | paper |
| UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models | Sharifymoghaddam et al. | arXiv 2024 (Oct) | paper |
| RoRA-VLM: Robust Retrieval Augmentation for Vision Language Models | Qi et al. | arXiv 2024 (Oct) | paper |
| SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information | Sun et al. | EMNLP 2024 (Sep) | paper |
| SearchLVLMs | Li et al. | NeurIPS 2024 | paper |
| RAVEN: Multitask Retrieval Augmented Vision-Language Learning | Rao et al. | COLM 2024 (Jun) | paper |
| Retrieval Meets Reasoning | Tan et al. | arXiv 2024 (Apr) | paper |
| RAR | Liu et al. | arXiv 2024 (Mar) | paper |
| MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning | Cui et al. | ACL 2024 (Feb) | paper |
| Fine-grained Late-interaction Multi-modal Retrieval for RAG-VQA | Lin et al. | NeurIPS 2023 | paper |
| Retrieval-based Knowledge Augmented Vision Language Pre-training | Rao et al. | ACM MM 2023 (Apr) | paper |
| REVEAL | Hu et al. | CVPR 2023 (Apr) | paper |
| MuRAG | Chen et al. | EMNLP 2022 (Oct) | paper |
1.5 Medical

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| 🔥MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering | Shaaban et al. | MICCAI 2025 | paper |
| How to Make Medical AI Systems Safer? Simulating Vulnerabilities in Multimodal Medical RAG | Zuo et al. | arXiv 2025 (Aug) | paper |
| AlzheimerRAG: Multimodal RAG for Clinical Use Cases using PubMed | Lahiri et al. | arXiv 2025 (Aug) | paper |
| HeteroRAG: A Heterogeneous RAG Framework for Medical Vision-Language Tasks | Chen et al. | arXiv 2025 (Aug) | paper |
| REALM: RAG-Driven Enhancement of Multimodal EHR Analysis | Zhu et al. | arXiv 2025 (Feb) | paper |
| MMed-RAG: Versatile Multimodal RAG for Medical VLMs | Xia et al. | arXiv 2024 (Oct) | paper |
| RULE: Reliable Multimodal RAG for Factuality in Medical VLMs | Xia et al. | EMNLP 2024 | paper |
2.1 Image (Video) Generation
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation | Zhu et al. | arXiv 2025 (Sep) | paper |
| GarmentAligner | Zhang et al. | ECCV 2025 | paper |
| RealRAG | Lyu et al. | ICML 2025 | paper |
| FineRAG | Yuan et al. | COLING 2025 | paper |
| ImageRAG | Shalev-Arkushin et al. | arXiv 2025 | paper |
| BrainRAM | Xie et al. | ACM MM 2024 | paper |
| Instruct-Imagen | Hu et al. | CVPR 2024 | paper |
| Retrieval-Augmented Layout Transformer | Horita et al. | CVPR 2024 | paper |
| The Neglected Tails in Vision-Language Models | Parashar et al. | CVPR 2024 | paper |
| FairRAG: Fair Human Generation via Fair Retrieval Augmentation | Shrestha et al. | CVPR 2024 | paper |
| Grounding Language Models for Visual Entity Recognition | Xiao et al. | ECCV 2024 | paper |
| RealGen | Ding et al. | ECCV 2024 | paper |
| Factuality Tax of Diversity-Intervened Generation | Wan et al. | EMNLP 2024 | paper |
| Prompt Expansion for Adaptive Text-to-Image Generation | Datta et al. | ACL 2024 | paper |
| Label-Retrieval-Augmented Diffusion Models | Chen et al. | NeurIPS 2023 | paper |
| CPR: Retrieval-Augmented Generation for Copyright Protection | Golatkar et al. | CVPR 2023 | paper |
| ReMoDiffuse | Zhang et al. | ICCV 2023 | paper |
| Diffusion-Based Augmentation for Captioning and Retrieval | Cioni et al. | ICCVW 2023 | paper |
| Animate-A-Story | He et al. | arXiv 2023 | paper |
| Retrieval-Augmented Diffusion Models | Blattmann et al. | NeurIPS 2022 | paper |
| Re-Imagen | Chen et al. | arXiv 2022 | paper |
2.2 3D (Motion) Generation

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow | Gupta et al. | arXiv 2025 (Oct) | paper |
| MV-RAG: Retrieval-Augmented Multiview Diffusion | Dayani et al. | arXiv 2025 (Aug) | paper |
| VimoRAG: Video-based Retrieval-Augmented 3D Motion Generation | Xu et al. | arXiv 2025 (Aug) | paper |
| Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions | Wang et al. | arXiv 2024 (Sep) | paper |
| Retrieval-Augmented Score Distillation for Text-to-3D Generation | Seo et al. | ICML 2024 | paper |
| Diorama: Unleashing Zero-shot Single-view 3D Scene Modeling | Wu et al. | arXiv 2024 (Nov) | paper |
| Interaction-based Retrieval-Augmented Diffusion for Protein 3D Generation | Huang et al. | ICML 2024 | paper |
| ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model | Zhang et al. | ICCV 2023 | paper |
3 Embodied AI

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory | Wang et al. | arXiv 2026 (Jan) | paper |
| SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation | Ye et al. | arXiv 2025 (Jul) | paper |
| RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base | Wang et al. | IROS 2025 (Jul) | paper |
| RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving | Wang et al. | arXiv 2025 (Mar) | paper |
| RANa: Retrieval-Augmented Navigation | Monaci et al. | arXiv 2025 (Apr) | paper |
| P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task | Xu et al. | ACM MM 2024 | paper |
| RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios | Ding et al. | ECCV 2024 | paper |
| Retrieval-Augmented Embodied Agents | Zhu et al. | CVPR 2024 | paper |
| ENWAR: A RAG-empowered Multi-Modal LLM Framework for Wireless Environment Perception | Nazar et al. | arXiv 2024 (Oct) | paper |
| Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation | Xie et al. | arXiv 2024 (Oct) | paper |
| RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning | Yuan et al. | arXiv 2024 (May) | paper |