A Survey of LLM × DATA

A collection of papers and projects related to LLMs and corresponding data-centric methods.

If you find our survey useful, please cite the paper:

@article{LLMDATASurvey,
    title={A Survey of LLM × DATA},
    author={Xuanhe Zhou and Junxuan He and Wei Zhou and Haodong Chen and Zirui Tang and Haoyu Zhao and Xin Tong and Guoliang Li and Youmin Chen and Jun Zhou and Zhaojun Sun and Binyuan Hui and Shuo Wang and Conghui He and Zhiyuan Liu and Jingren Zhou and Fan Wu},
    year={2025},
    journal={arXiv preprint arXiv:2505.18458},
    url={https://arxiv.org/abs/2505.18458}
}

🌤 The IaaS Concept of DATA4LLM

The IaaS concept for LLM data (phonetically echoing Infrastructure as a Service) defines the characteristics of high-quality datasets along four key dimensions: (1) Inclusiveness ensures broad coverage across domains, tasks, sources, languages, styles, and modalities. (2) Abundance emphasizes sufficient and well-balanced data volume to support scaling, fine-tuning, and continual learning without overfitting. (3) Articulation requires clear, coherent, and instructive content with step-by-step reasoning to enhance model understanding and task performance. (4) Sanitization involves rigorous filtering to remove private, toxic, unethical, and misleading content, ensuring data safety, neutrality, and compliance.

Datasets

CommonCrawl: A massive web crawl dataset covering diverse languages and domains; widely used for LLM pretraining. [Source]
The Stack: A large-scale dataset of permissively licensed source code in multiple programming languages; used for code LLMs. [HuggingFace]
RedPajama: A replication of LLaMA’s training data recipe with open datasets; spans web, books, arXiv, and more. [Github]
SlimPajama-627B-DC: A deduplicated and filtered subset of RedPajama (627B tokens); optimized for clean and efficient training. [HuggingFace]
Alpaca-CoT: Instruction-following dataset enhanced with Chain-of-Thought (CoT) reasoning prompts; used for dialogue fine-tuning. [Github]
LLaVA-Pretrain: A multimodal dataset with image-text pairs for training visual language models like LLaVA. [HuggingFace]
Wikipedia: Structured and encyclopedic content; a foundational source for general-purpose language models. [HuggingFace]
C4: A cleaned version of CommonCrawl data, widely used in models like T5 for high-quality web text. [HuggingFace]
BookCorpus: Contains free fiction books; often used to teach models long-form language understanding. [HuggingFace]
Arxiv: Scientific paper corpus from arXiv, covering physics, math, CS, and more; useful for academic language modeling. [HuggingFace]
PubMed: Biomedical literature dataset from the PubMed database; key resource for medical domain models. [Source]
StackExchange: Community Q&A data covering domains like programming, math, philosophy, etc.; useful for QA and dialogue tasks. [Source]
OpenWebText2: A high-quality open-source web text dataset based on URLs commonly cited on Reddit; GPT-style training corpus. [Source]
OpenWebMath: A corpus of mathematical web text filtered from Common Crawl; designed to improve mathematical reasoning in LLMs. [HuggingFace]
Falcon-RefinedWeb: Filtered web data used in training Falcon models; emphasizes data quality through rigorous preprocessing. [HuggingFace]
CCI 3.0: A large-scale multi-domain Chinese web corpus, suitable for training high-quality Chinese LLMs. [HuggingFace]
OmniCorpus: A large-scale multimodal dataset of interleaved image-text documents, designed for general-purpose multimodal training. [Github]
WanJuan3.0: A diverse and large-scale Chinese dataset including news, fiction, QA, and more; released by OpenDataLab. [Source]

0 Data Characteristics across LLM Stages


Data for Pretraining

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, et al. NeurIPS 2023. [Paper]
Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
Yukun Zhu, Ryan Kiros, Richard Zemel, et al. ICCV 2015. [Paper]

Data for Continual Pre-training

MedicalGPT: Training Medical GPT Model
Ming Xu. [Github]
BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark
Dakuan Lu, Hengkui Wu, Jiaqing Liang, et al. arXiv 2023. [Paper]

Data for Supervised Fine-Tuning (SFT)

General Instruction Following

Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM
Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, Reynold Xin. [Source]

Specific Domain Usage

MedicalGPT: Training Medical GPT Model [Github]
DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services
Shengbin Yue, Wei Chen, Siyuan Wang, et al. arXiv 2023. [Paper]

Data for Reinforcement Learning (RL)

RLHF

MedicalGPT: Training Medical GPT Model [Github]
UltraFeedback: Boosting Language Models with Scaled AI Feedback
Ganqu Cui, Lifan Yuan, Ning Ding, et al. ICML 2024. [Paper]

RoRL (Reasoning-oriented RL)

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. arXiv 2025. [Paper]
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team. arXiv 2025. [Paper]

Data for Retrieval-Augmented Generation (RAG)

DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue
Feiyuan Zhang, Dezhi Zhu, James Ming, et al. arXiv 2025. [Paper]
Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation
Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Vicente Grau. arXiv 2024. [Paper]
ERAGent: Enhancing Retrieval-Augmented Language Models with Improved Accuracy, Efficiency, and Personalization
Yunxiao Shi, Xing Zi, Zijing Shi, Haimin Zhang, Qiang Wu, Min Xu. arXiv 2024. [Paper]
PersonaRAG: Enhancing Retrieval-Augmented Generation Systems with User-Centric Agents
Saber Zerhoudi, Michael Granitzer. arXiv 2024. [Paper]
DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services [Paper]

Data for LLM Evaluation

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, et al. CVPR 2024. [Paper]
LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models
Haitao Li, You Chen, Qingyao Ai, Yueyue Wu, Ruizhe Zhang, Yiqun Liu. NeurIPS 2024. [Paper]
What Disease Does This Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits. AAAI 2021. [Paper]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, et al. arXiv 2021. [Paper]

Data for LLM Agents

STeCa: Step-level Trajectory Calibration for LLM Agent Learning
Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li. arXiv 2025. [Paper]
Large Language Model-Based Agents for Software Engineering: A Survey
Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, Yiling Lou. arXiv 2024. [Paper]
Advancing LLM Reasoning Generalists with Preference Trees
Lifan Yuan, Ganqu Cui, Hanbin Wang, et al. arXiv 2024. [Paper]
Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents
Zhengliang Shi, Shen Gao, Lingyong Yan, et al. arXiv 2024. [Paper]
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Ning Ding, Yulin Chen, Bokai Xu, et al. EMNLP 2023. [Paper]


Table of Contents

Datasets

0 Data Characteristics across LLM Stages

Data for Pretraining

Data for Continual Pre-training

Data for Supervised Fine-Tuning (SFT)

General Instruction Following

Specific Domain Usage

Data for Reinforcement Learning (RL)

RLHF

RoRL

Data for Retrieval-Augmented Generation (RAG)

Data for LLM Evaluation

Data for LLM Agents

1 Data Processing for LLM

1.1 Data Acquisition

Data Sources

Public Data

Data Acquisition Methods

Website Crawling

Layout Analysis

1.2 Data Deduplication

Exact Substring Matching

Approximate Hashing-based Deduplication

Approximate Frequency-based Down-Weighting

Embedding-Based Clustering

Non-Text Data Deduplication

1.3 Data Filtering

Sample-level Filtering

(1) Statistical Evaluation

(2) Model Scoring

(3) Hybrid Methods

Content-level Filtering

1.4 Data Selection

Similarity-based Data Selection

Optimization-based Data Selection

Model-based Data Selection

1.5 Data Mixing

Heuristic Optimization

Bilevel Optimization

Distributionally Robust Optimization

Model-Based Optimization

1.6 Data Distillation and Synthesis

Knowledge Distillation

Pre-training Data Augmentation

SFT Data Augmentation

SFT Reasoning Data Augmentation

Reinforcement Learning

Retrieval-Augmented Generation

1.7 End-to-End Data Processing Pipelines

1.7.1 Typical data processing frameworks

1.7.2 Typical data pipelines

1.7.3 Orchestration of data pipelines

2 Data Storage for LLM

2.1 Data Formats

Training Data Format

Model Data Format

2.2 Data Distribution

Distributed Storage Systems

Heterogeneous Storage Systems

2.3 Data Organization

Vector-Based Organization

Graph-Based Organization

2.4 Data Movement

Caching Data

Data/Operator Offloading

Overlapping of storage and computing

2.5 Data Fault Tolerance

Checkpoints

Redundant Computations

2.6 KV Cache

Cache Space Management

KV Placement