This list focuses on understanding the internal mechanisms of large language models (LLMs). The papers in this list were accepted at top conferences (e.g., ICML, NeurIPS, ICLR, ACL, EMNLP, NAACL) or written by top research institutions.
Separate paper lists focus on SAEs and neurons.
To recommend a paper (accepted at a conference), please contact me.
-
- [EMNLP 2025] [2025.8] [multimodal] [model merging]
-
Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models
- [EMNLP 2025] [2025.8] [reasoning]
-
- [ACL 2025] [2025.5] [neuron]
-
Model Unlearning via Sparse Autoencoder Subspace Guided Projections
- [ICML 2025 workshop] [2025.5] [SAE]
-
- [ACL 2025] [2025.4] [multilinguality]
-
On the Biology of a Large Language Model
- [Anthropic] [2025.3]
-
Taming Knowledge Conflicts in Language Models
- [ICML 2025] [2025.3] [knowledge] [hallucination] [superposition]
-
Circuit Tracing: Revealing Computational Graphs in Language Models
- [Anthropic] [2025.3]
-
The Mirage of Model Editing: Revisiting Evaluation in the Wild
- [ACL 2025] [2025.2] [model editing]
-
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
- [ICML 2025] [2025.2] [circuit]
-
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- [ICML 2025] [2025.1] [SAE]
-
- [ACL 2025] [2024.12] [safety]
-
Disentangling Memory and Reasoning Ability in Large Language Models
- [ACL 2025] [2024.11] [reasoning]
-
Can Knowledge Editing Really Correct Hallucinations?
- [ICLR 2025] [2024.10] [knowledge] [model editing]
-
Arithmetic without algorithms: Language models solve math with a bag of heuristics
- [ICLR 2025] [2024.10] [arithmetic]
-
Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis
- [EMNLP 2024] [2024.9] [neuron] [arithmetic] [fine-tune]
-
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
- [ICLR 2025] [2024.7]
-
Scaling and evaluating sparse autoencoders
- [OpenAI] [2024.6] [SAE]
-
BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning
- [ACL 2025] [2024.6] [model editing]
-
- [EMNLP 2024] [2024.6] [in-context learning]
-
Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
- [EMNLP 2024] [2024.6] [knowledge] [reasoning]
-
Neuron-Level Knowledge Attribution in Large Language Models
- [EMNLP 2024] [2024.6] [neuron] [knowledge]
-
Knowledge Circuits in Pretrained Transformers
- [NeurIPS 2024] [2024.5] [circuit] [knowledge]
-
Not All Language Model Features Are One-Dimensionally Linear
- [ICLR 2025] [2024.5] [SAE]
-
Locating and Editing Factual Associations in Mamba
- [COLM 2024] [2024.4] [causal] [knowledge]
-
Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph
- [COLM 2024] [2024.4] [activation patching]
-
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
- [COLM 2024] [2024.3] [circuit]
-
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
- [ACL 2024] [2024.3] [logit lens] [multimodal]
-
Chain-of-Thought Reasoning Without Prompting
- [DeepMind] [2024.2] [chain-of-thought]
-
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
- [EMNLP 2024] [2024.2] [logit lens]
-
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
- [ICLR 2024] [2024.2] [fine-tune]
-
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
- [ACL 2024] [2024.2] [hallucination]
-
Understanding and Patching Compositional Reasoning in LLMs
- [ACL 2024] [2024.2] [reasoning]
-
Do Large Language Models Latently Perform Multi-Hop Reasoning?
- [ACL 2024] [2024.2] [knowledge] [reasoning]
-
Long-form evaluation of model editing
- [NAACL 2024] [2024.2] [model editing]
-
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- [ICML 2024] [2024.1] [toxicity] [fine-tune]
-
The Impact of Reasoning Step Length on Large Language Models
- [ACL 2024] [2024.1] [reasoning]
-
What does the Knowledge Neuron Thesis Have to do with Knowledge?
- [ICLR 2024] [2023.11] [knowledge] [neuron]
-
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
- [ICLR 2024] [2023.11] [fine-tune]
-
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- [Anthropic] [2024.5] [SAE]
-
Interpreting CLIP's Image Representation via Text-Based Decomposition
- [ICLR 2024] [2023.10] [multimodal]
-
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
- [ICLR 2024] [2023.10] [causal] [circuit]
-
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
- [DeepMind] [2023.12] [neuron]
-
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
- [ICLR 2024] [2023.12] [circuit]
-
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- [Anthropic] [2023.10] [SAE]
-
Impact of Co-occurrence on Factual Knowledge of Large Language Models
- [EMNLP 2023] [2023.10] [knowledge]
-
Function vectors in large language models
- [ICLR 2024] [2023.10] [in-context learning]
-
Neurons in Large Language Models: Dead, N-gram, Positional
- [ACL 2024] [2023.9] [neuron]
-
Sparse Autoencoders Find Highly Interpretable Features in Language Models
- [ICLR 2024] [2023.9] [SAE]
-
Can LLM-Generated Misinformation Be Detected?
- [ICLR 2024] [2023.9] [misinformation]
-
Do Machine Learning Models Memorize or Generalize?
- [2023.8] [grokking]
-
Overthinking the Truth: Understanding how Language Models Process False Demonstrations
- [TACL 2024] [2023.7] [circuit]
-
Evaluating the ripple effects of knowledge editing in language models
- [2023.7] [knowledge] [model editing]
-
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- [NeurIPS 2023] [2023.6] [hallucination]
-
VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers
- [EMNLP 2023] [2023.5] [logit lens]
-
Finding Neurons in a Haystack: Case Studies with Sparse Probing
- [TMLR 2024] [2023.5] [neuron]
-
Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
- [EMNLP 2023] [2023.5] [in-context learning]
-
- [ICLR 2024] [2023.5] [chain-of-thought]
-
What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning
- [ACL 2023] [2023.5] [in-context learning]
-
Language models can explain neurons in language models
- [OpenAI] [2023.5] [neuron]
-
- [EMNLP 2023] [2023.5] [causal] [arithmetic]
-
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
- [EMNLP 2023] [2023.4] [causal] [knowledge]
-
The Internal State of an LLM Knows When It's Lying
- [EMNLP 2023] [2023.4] [hallucination]
-
Are Emergent Abilities of Large Language Models a Mirage?
- [NeurIPS 2023] [2023.4] [grokking]
-
Towards automated circuit discovery for mechanistic interpretability
- [NeurIPS 2023] [2023.4] [circuit]
-
- [NeurIPS 2023] [2023.4] [circuit] [arithmetic]
-
Larger language models do in-context learning differently
- [Google Research] [2023.3] [in-context learning]
-
- [NeurIPS 2023] [2023.1] [knowledge] [model editing]
-
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
- [ACL 2023] [2022.12] [chain-of-thought]
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- [ICLR 2023] [2022.11] [circuit]
-
Inverse scaling can become U-shaped
- [EMNLP 2023] [2022.11] [grokking]
-
Mass-Editing Memory in a Transformer
- [ICLR 2023] [2022.10] [model editing]
-
Polysemanticity and Capacity in Neural Networks
- [2022.10] [neuron] [SAE]
-
Analyzing Transformers in Embedding Space
- [ACL 2023] [2022.9] [logit lens]
-
- [Anthropic] [2022.9] [neuron] [SAE]
-
Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango
- [Google Research] [2022.9] [chain-of-thought]
-
Emergent Abilities of Large Language Models
- [Google Research] [2022.6] [grokking]
-
Towards Tracing Factual Knowledge in Language Models Back to the Training Data
- [EMNLP 2022] [2022.5] [knowledge] [data]
-
Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations
- [EMNLP 2022] [2022.5] [in-context learning]
-
Large Language Models are Zero-Shot Reasoners
- [NeurIPS 2022] [2022.5] [chain-of-thought]
-
Scaling Laws and Interpretability of Learning from Repeated Data
- [Anthropic] [2022.5] [grokking] [data]
-
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
- [EMNLP 2022] [2022.3] [neuron] [logit lens]
-
In-context Learning and Induction Heads
- [Anthropic] [2022.3] [circuit] [in-context learning]
-
Locating and Editing Factual Associations in GPT
- [NeurIPS 2022] [2022.2] [causal] [knowledge]
-
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
- [EMNLP 2022] [2022.2] [in-context learning]
-
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- [OpenAI & Google] [2022.1] [grokking]
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- [NeurIPS 2022] [2022.1] [chain-of-thought]
-
A Mathematical Framework for Transformer Circuits
- [Anthropic] [2021.12] [circuit]
-
Towards a Unified View of Parameter-Efficient Transfer Learning
- [ICLR 2022] [2021.10] [fine-tune]
-
Deduplicating Training Data Makes Language Models Better
- [ACL 2022] [2021.7] [fine-tune] [data]
-
- [EMNLP 2021] [2021.7]
-
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
- [ACL 2022] [2021.4] [in-context learning]
-
Calibrate Before Use: Improving Few-Shot Performance of Language Models
- [ICML 2021] [2021.2] [in-context learning]
-
Transformer Feed-Forward Layers Are Key-Value Memories
- [EMNLP 2021] [2020.12] [neuron]
-
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
- [2025.3] [SAE]
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
- [2025.3] [LLM reasoning] [long COT]
-
Mechanistic Interpretability for AI Safety -- A Review
- [2024.8] [safety]
-
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
- [2024.7] [interpretability]
-
Internal Consistency and Self-Feedback in Large Language Models: A Survey
- [2024.7]
-
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
- [2024.7] [knowledge]
-
A Primer on the Inner Workings of Transformer-based Language Models
- [2024.5] [interpretability]
-
Usable XAI: 10 strategies towards exploiting explainability in the LLM era
- [2024.3] [interpretability]
-
A Comprehensive Overview of Large Language Models
- [2023.12] [LLM]
-
- [2023.11] [hallucination]
-
A Survey of Large Language Models
- [2023.11] [LLM]
-
Explainability for Large Language Models: A Survey
- [2023.11] [interpretability]
-
A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future
- [2023.10] [chain-of-thought]
-
Instruction tuning for large language models: A survey
- [2023.10] [instruction tuning]
-
- [2023.9] [instruction tuning]
-
Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models
- [2023.9] [hallucination]
-
Reasoning with language model prompting: A survey
- [2023.9] [reasoning]
-
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
- [2023.8] [interpretability]
-
A Survey on In-context Learning
- [2023.6] [in-context learning]
-
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
- [2023.3] [parameter-efficient fine-tuning]
-
https://github.com/ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models (interpretability)
-
https://github.com/cooperleong00/Awesome-LLM-Interpretability?tab=readme-ov-file (interpretability)
-
https://github.com/JShollaj/awesome-llm-interpretability (interpretability)
-
https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (attention)
-
https://github.com/zjunlp/KnowledgeEditingPapers (model editing)
-
https://github.com/Hannibal046/Awesome-LLM (LLM)
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP