zhengxuJosh/Awesome-RAG-Vision

Awesome RAG in Computer Vision


A curated collection of papers on Retrieval-Augmented Generation (RAG) for Computer Vision, covering visual understanding, visual generation, video, documents, embodied AI, and more.

💡 Feel free to open a Pull Request to add your work on RAG for Vision!



Introduction

Retrieval-Augmented Generation (RAG) integrates retrieval into generative models, allowing them to query external knowledge bases (or memory banks) at inference time.

In Computer Vision, RAG has been used for:

  • Image captioning / VQA with external knowledge or retrieved exemplars
  • Video QA and long-context understanding via retrieved transcripts or clips
  • Visual generation with retrieved reference images, templates, or domain knowledge
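The exemplar-retrieval pattern behind the first bullet can be sketched in a few lines: embed the query image, find the nearest entries in a memory bank, and prepend their captions as in-context evidence for the generator. The memory bank, embeddings, and captions below are toy placeholders (a real system would embed images with a vision encoder such as CLIP and prompt an actual VLM):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical memory bank of (image embedding, caption) pairs.
MEMORY = [
    ([0.9, 0.1, 0.0], "a red double-decker bus on a city street"),
    ([0.1, 0.9, 0.0], "a golden retriever catching a frisbee"),
    ([0.8, 0.2, 0.1], "a fire truck parked outside a station"),
]

def retrieve(query_emb, k=2):
    """Return captions of the k exemplars most similar to the query."""
    ranked = sorted(MEMORY, key=lambda e: cosine(query_emb, e[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]

def build_prompt(query_emb):
    """Prepend retrieved exemplar captions as in-context evidence."""
    context = "\n".join(f"- {c}" for c in retrieve(query_emb))
    return f"Similar images were described as:\n{context}\nDescribe the new image."

# A query embedding close to the two vehicle exemplars.
prompt = build_prompt([0.85, 0.15, 0.05])
print(prompt)
```

The same retrieve-then-condition loop generalizes to the other bullets: for video QA the memory bank holds transcript chunks or clip features, and for generation it holds reference images or templates that condition the decoder.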

Resources

Workshops and Tutorials

RAG for Image

RAG for Video

RAG for Document

Other Related Resources

Papers

Survey and Benchmark

| Year | Paper | Focused Areas | Main Content | GitHub |
|---|---|---|---|---|
| 2023 | Gao et al. | LLMs / NLP | RAG paradigms and components | - |
| 2024 | Fan et al. | LLMs / NLP | RA-LLMs' architectures, training, and applications | link |
| 2024 | Hu et al. | LLMs / NLP | RA-LMs' components, evaluation, and limitations | link |
| 2024 | Zhao et al. | LLMs / NLP | Challenges in data-augmented LLMs | - |
| 2024 | Gupta et al. | LLMs / NLP | Advancements and downstream tasks of RAG | - |
| 2024 | Zhao et al. | RAG in AIGC | RAG applications across modalities | link |
| 2024 | Yu et al. | LLMs / NLP | Unified evaluation process of RAG | link |
| 2024 | Procko et al. | Graph Learning | Knowledge graphs with LLM RAG | - |
| 2024 | Zhou et al. | Trustworthy AI | Six dimensions and benchmarks of trustworthy RAG | link |
| 2025 | Singh et al. | AI Agents | Principles and evaluation | link |
| 2025 | Ni et al. | Trustworthy AI | Roadmap and discussion | link |
| 2025 | Ours | Computer Vision | RAG for visual understanding and generation | link |

RAG for Vision

1 Visual Understanding

1.1 Image Understanding

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| 🔥Test-Time Retrieval-Augmented Adaptation for VLMs | Fan et al. | ICCV 2025 | paper |
| 🔥Retrieval-Augmented VQA for Scientific Figures (RAVQA-VLM) | Li et al. | AAAI 2025 | paper |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | Sarwar | arXiv 2025 (Sep) | paper |
| mRAG: Elucidating the Design Space of Multi-modal RAG | Hu et al. | arXiv 2025 (Aug) | paper |
| Multimodal RAG Enhanced Visual Description | Jaiswal et al. | arXiv 2025 (Aug) | paper |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Wu et al. | arXiv 2024 (Dec) | paper |
| Retrieval-Augmented Open-Vocabulary Object Detection | Kim et al. | CVPR 2024 | paper |
| Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning | Li et al. | arXiv 2024 (Aug) | paper |
| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | Li et al. | NeurIPS 2024 | paper |
| Learning Customized Visual Models with Retrieval-Augmented Knowledge | Liu et al. | CVPR 2023 | paper |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Lin et al. | NeurIPS 2023 | paper |
| Retrieval-Augmented Classification for Long-Tail Visual Recognition | Long et al. | CVPR 2022 | (TODO: add paper link) |

1.2 (Long) Video Understanding

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | Zeng et al. | arXiv 2025 (Jun) | paper |
| Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | Mao et al. | arXiv 2025 (Jun) | paper |
| VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos | Gia et al. | CVPRW 2025 | paper |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Xiong et al. | ICLR 2025 | paper |
| Temporal Preference Optimization for Long-Form Video Understanding | Li et al. | arXiv 2025 (Jan) | paper |
| StreamingRAG: Real-time Contextual Retrieval and Generation Framework | Sankaradas et al. | arXiv 2025 (Jan) | paper |
| VideoAuteur: Towards Long Narrative Video Generation | Xiao et al. | arXiv 2025 (Jan) | paper |
| Generative Frame Sampler for Long Video Understanding | Yao et al. | ACL 2024 | paper |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on LVLMs | Fu et al. | arXiv 2024 (Dec) | paper |
| Vinci: A Real-time Embodied Smart Assistant based on Egocentric VLM | Huang et al. | arXiv 2024 (Dec) | paper |
| Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models | Yi et al. | arXiv 2024 (Dec) | paper |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Luo et al. | arXiv 2024 (Nov) | paper |
| Goldfish: Vision-Language Understanding of Arbitrarily Long Videos | Ataallah et al. | arXiv 2024 (Jul) | paper |
| ViTA: Efficient Video-to-Text with VLM for RAG-based Video Analysis | Arefeen et al. | CVPRW 2024 | paper |
| iRAG: Advancing RAG for Videos with an Incremental Approach | Arefeen et al. | CIKM 2024 | paper |

1.3 Visual Spatial Understanding

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| RAG-Guided LLMs for Visual Spatial Description with Adaptive Hallucination Corrector | Yu et al. | ACM MM 2024 | paper |

1.4 Multi-modal

This section covers multimodal RAG methods that retrieve and reason over documents, knowledge bases/graphs, and enterprise data, along with evaluation and benchmark work.


1.4.1 Doc-RAG (Document-centric)
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation | Sun et al. | arXiv 2025 (Oct) | paper |
| UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG | Peng et al. | arXiv 2025 (Oct) | paper |
| CMRAG: Co-modality-based document retrieval and visual question answering | Chen et al. | arXiv 2025 (Sep) | paper |
| Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries | Wu et al. | arXiv 2025 (Aug) | paper |
| Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding | Mannam et al. | KDDW 2025 (Jun) | paper |
| DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers | Wasserman et al. | arXiv 2025 (May) | paper |
| A Multi-Granularity Retrieval Framework for Visually-Rich Documents | Xu et al. | arXiv 2025 (May) | paper |
| FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain | Zhao et al. | arXiv 2025 (May) | paper |
| VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents | Tanaka et al. | arXiv 2025 (Apr) | paper |
| SuperRAG: Beyond RAG with Layout-Aware Graph Modeling | Yang et al. | NAACL 2025 (Mar) | paper |
| MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Han et al. | arXiv 2025 (Mar) | paper |
| SiQA: A Large Multi-Modal Question Answering Model for Structured Images Based on RAG | Liu et al. | ICASSP 2025 (Mar) | paper |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Yang et al. | arXiv 2025 (Feb) | paper |
| ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents | Wang et al. | arXiv 2025 (Feb) | paper |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | Caffagni et al. | CVPRW 2024 | paper |
| M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding | Cho et al. | arXiv 2024 (Nov) | paper |
| VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents | Yu et al. | ICLR 2025 | paper |

1.4.2 Knowledge-RAG (Knowledge / Knowledge Graph / External KB)
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Knowledge Graph-Guided Retrieval-Augmented Generation | Zhang et al. | ACL 2025 | paper |
| Multimodal Iterative RAG for Knowledge Visual Question Answering | Choi et al. | arXiv 2025 (Sep) | paper |
| mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering | Yuan et al. | arXiv 2025 (Aug) | paper |
| VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation | Park et al. | arXiv 2025 (Jun) | paper |
| CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG | Tian et al. | ACL 2025 (Jun) | paper |
| MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework | Ling et al. | arXiv 2025 (Apr) | paper |
| CommGPT: A Graph and Retrieval-Augmented Multimodal Communication Foundation Model | Jiang et al. | arXiv 2025 (Feb) | paper |
| MuKA: Multimodal Knowledge Augmented Visual Information-Seeking | Deng et al. | COLING 2025 (Jan) | paper |
| mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA | Zhang et al. | arXiv 2024 (Nov) | paper |

1.4.3 Enterprise / Industrial
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| AUGUSTUS: An LLM-Driven Multimodal Agent System | Jain et al. | arXiv 2025 | paper |
| Provenance Analysis of Archaeological Artifacts via Multimodal RAG Systems | Zhang et al. | arXiv 2025 (Sep) | paper |
| Beyond the Textual: Generating Coherent Visual Options for MCQs | Wang et al. | arXiv 2025 (Aug) | paper |
| MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering | Gondhalekar et al. | arXiv 2025 (Jun) | paper |
| RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Wen et al. | arXiv 2025 (Apr) | paper |

1.4.4 Evaluation / Benchmark / Robustness
| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples | Sun et al. | arXiv 2025 (Sep) | paper |
| FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation | Zhang et al. | arXiv 2025 (Jun) | paper |
| Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger | Yang et al. | arXiv 2025 (Jun) | paper |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Jia et al. | arXiv 2025 (May) | paper |
| Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey | Gan et al. | arXiv 2025 (Apr) | paper |
| MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation | Yu et al. | arXiv 2025 (Apr) | paper |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | S. M. Sarwar | arXiv 2025 (Feb) | paper |
| MRAG-Bench: Vision-centric Evaluation for Retrieval-Augmented Multimodal Models | Hu et al. | ICLR 2025 | paper |
| Re-ranking the Context for Multimodal Retrieval Augmented Generation | Mortaheb et al. | arXiv 2025 (Jan) | paper |
| A General Retrieval-Augmented Generation Framework for Multimodal Case-Based Reasoning Applications | Marom | arXiv 2025 (Jan) | paper |
| Visual RAG: Expanding MLLM Visual Knowledge without Fine-tuning | Bonomo et al. | arXiv 2025 (Jan) | paper |
| UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models | Sharifymoghaddam et al. | arXiv 2024 (Oct) | paper |
| RoRA-VLM: Robust Retrieval Augmentation for Vision Language Models | Qi et al. | arXiv 2024 (Oct) | paper |
| SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information | Sun et al. | EMNLP 2024 (Sep) | paper |
| SearchLVLMs | Li et al. | NeurIPS 2024 | paper |
| RAVEN: Multitask Retrieval Augmented Vision-Language Learning | Rao et al. | COLM 2024 (Jun) | paper |
| Retrieval Meets Reasoning | Tan et al. | arXiv 2024 (Apr) | paper |
| RAR | Liu et al. | arXiv 2024 (Mar) | paper |
| MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning | Cui et al. | ACL 2024 (Feb) | paper |
| Fine-grained Late-interaction Multi-modal Retrieval for RAG-VQA | Lin et al. | NeurIPS 2023 | paper |
| Retrieval-based Knowledge Augmented Vision Language Pre-training | Rao et al. | ACM MM 2023 (Apr) | paper |
| REVEAL | Hu et al. | CVPR 2023 (Apr) | paper |
| MuRAG | Chen et al. | EMNLP 2022 (Oct) | paper |

1.5 Medical Vision

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| 🔥MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering | Shaaban et al. | MICCAI 2025 | paper |
| How to Make Medical AI Systems Safer? Simulating Vulnerabilities in Multimodal Medical RAG | Zuo et al. | arXiv 2025 (Aug) | paper |
| AlzheimerRAG: Multimodal RAG for Clinical Use Cases using PubMed | Lahiri et al. | arXiv 2025 (Aug) | paper |
| HeteroRAG: A Heterogeneous RAG Framework for Medical Vision-Language Tasks | Chen et al. | arXiv 2025 (Aug) | paper |
| REALM: RAG-Driven Enhancement of Multimodal EHR Analysis | Zhu et al. | arXiv 2025 (Feb) | paper |
| MMed-RAG: Versatile Multimodal RAG for Medical VLMs | Xia et al. | arXiv 2024 (Oct) | paper |
| RULE: Reliable Multimodal RAG for Factuality in Medical VLMs | Xia et al. | EMNLP 2024 | paper |

2 Visual Generation

2.1 Image (Video) Generation

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation | Zhu et al. | arXiv 2025 (Sep) | paper |
| GarmentAligner | Zhang et al. | ECCV 2025 | paper |
| RealRAG | Lyu et al. | ICML 2025 | paper |
| FineRAG | Yuan et al. | COLING 2025 | paper |
| ImageRAG | Shalev-Arkushin et al. | arXiv 2025 | paper |
| BrainRAM | Xie et al. | ACM MM 2024 | paper |
| Instruct-Imagen | Hu et al. | CVPR 2024 | paper |
| Retrieval-Augmented Layout Transformer | Horita et al. | CVPR 2024 | paper |
| The Neglected Tails in Vision-Language Models | Parashar et al. | CVPR 2024 | paper |
| FairRAG: Fair Human Generation via Fair Retrieval Augmentation | Shrestha et al. | CVPR 2024 | paper |
| Grounding Language Models for Visual Entity Recognition | Xiao et al. | ECCV 2024 | paper |
| RealGen | Ding et al. | ECCV 2024 | paper |
| Factuality Tax of Diversity-Intervened Generation | Wan et al. | EMNLP 2024 | paper |
| Prompt Expansion for Adaptive Text-to-Image Generation | Datta et al. | ACL 2024 | paper |
| Label-Retrieval-Augmented Diffusion Models | Chen et al. | NeurIPS 2023 | paper |
| CPR: Retrieval-Augmented Generation for Copyright Protection | Golatkar et al. | CVPR 2023 | paper |
| ReMoDiffuse | Zhang et al. | ICCV 2023 | paper |
| Diffusion-Based Augmentation for Captioning and Retrieval | Cioni et al. | ICCVW 2023 | paper |
| Animate-A-Story | He et al. | arXiv 2023 | paper |
| Retrieval-Augmented Diffusion Models | Blattmann et al. | NeurIPS 2022 | paper |
| Re-Imagen | Chen et al. | arXiv 2022 | paper |

2.2 3D Generation

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow | Gupta et al. | arXiv 2025 (Oct) | paper |
| MV-RAG: Retrieval-Augmented Multiview Diffusion | Dayani et al. | arXiv 2025 (Aug) | paper |
| VimoRAG: Video-based Retrieval-Augmented 3D Motion Generation | Xu et al. | arXiv 2025 (Aug) | paper |
| Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions | Wang et al. | arXiv 2024 (Sep) | paper |
| Retrieval-Augmented Score Distillation for Text-to-3D Generation | Seo et al. | ICML 2024 | paper |
| Diorama: Unleashing Zero-shot Single-view 3D Scene Modeling | Wu et al. | arXiv 2024 (Nov) | paper |
| Interaction-based Retrieval-Augmented Diffusion for Protein 3D Generation | Huang et al. | ICML 2024 | paper |
| ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model | Zhang et al. | ICCV 2023 | paper |

3 Embodied AI

| Title | Authors | Venue/Date | Links |
|---|---|---|---|
| VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory | Wang et al. | arXiv 2026 (Jan) | paper |
| SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation | Ye et al. | arXiv 2025 (Jul) | paper |
| RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base | Wang et al. | IROS 2025 | paper |
| RANa: Retrieval-Augmented Navigation | Monaci et al. | arXiv 2025 (Apr) | paper |
| RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving | Wang et al. | arXiv 2025 (Mar) | paper |
| P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task | Xu et al. | ACM MM 2024 | paper |
| RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios | Ding et al. | ECCV 2024 | paper |
| Retrieval-Augmented Embodied Agents | Zhu et al. | CVPR 2024 | paper |
| ENWAR: A RAG-empowered Multi-Modal LLM Framework for Wireless Environment Perception | Nazar et al. | arXiv 2024 (Oct) | paper |
| Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation | Xie et al. | arXiv 2024 (Oct) | paper |
| RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning | Yuan et al. | arXiv 2024 (May) | paper |
