This list focuses on sparse autoencoder (SAE) techniques in mechanistic interpretability. A companion list focuses on understanding the internal mechanisms of LLMs.
To recommend a paper, preprint, or blog post, please open an issue or contact me.
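As background for the entries below, here is a minimal sketch of the object most of these papers study: a single-layer sparse autoencoder trained to reconstruct a model's activations under a sparsity penalty. All names, sizes, and coefficients are illustrative assumptions, not taken from any particular paper in this list.

```python
# Minimal sparse autoencoder (SAE) sketch in PyTorch -- illustrative only.
# Sizes, the L1 coefficient, and all names here are assumptions, not taken
# from any specific paper in this list.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> overcomplete code
        self.decoder = nn.Linear(d_dict, d_model)  # code -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstructed activations
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Usage: x stands in for cached residual-stream activations, shape [batch, d_model].
sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)
x = torch.randn(32, 768)
x_hat, f = sae(x)
sae_loss(x, x_hat, f).backward()
```

Many entries below are variants of exactly this setup (TopK/BatchTopK encoders, JumpReLU activations, gated SAEs, transcoders, crosscoders) or evaluations and applications of the features it learns.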
- Probing the Vulnerability of Large Language Models to Polysemantic Interventions
  - [arxiv] [2025.5]
- SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
  - [arxiv] [2025.5]
- Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
  - [arxiv] [2025.5]
- On the Biology of a Large Language Model
  - [Anthropic] [2025.3]
- A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
  - [SAE survey] [2025.3]
- SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
  - [arxiv] [2025.3]
- Sparse Autoencoders Do Not Find Canonical Units of Analysis
  - [ICLR 2025] [2025.2]
- Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
  - [arxiv] [2025.2]
- Scaling Sparse Feature Circuits For Studying In-Context Learning
  - [openreview] [2025.1]
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
  - [ICML 2025] [2025.1]
- Enhancing Automated Interpretability with Output-Centric Feature Descriptions
  - [arxiv] [2025.1]
- [arxiv] [2024.12]
- [Anthropic] [2024.12]
- Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
  - [ICLR 2025] [2024.11]
- Improving Steering Vectors by Targeting Sparse Autoencoder Features
  - [arxiv] [2024.11]
- Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
  - [arxiv] [2024.11]
- Applying sparse autoencoders to unlearn knowledge in language models
  - [arxiv] [2024.10]
- Evaluating feature steering: A case study in mitigating social biases
  - [Anthropic] [2024.10]
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
  - [arxiv] [2024.10]
- Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
  - [arxiv] [2024.10]
- Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models
  - [ICLR 2025] [2024.10]
- Sparse Crosscoders for Cross-Layer Features and Model Diffing
  - [Anthropic] [2024.10]
- Scaling Automatic Neuron Description
  - [Transluce] [2024.10]
- Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
  - [arxiv] [2024.9]
- Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
  - [DeepMind] [2024.8]
- Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
  - [DeepMind] [2024.8]
- [LessWrong blog] [2024.8]
- SAEs (usually) Transfer Between Base and Chat Models
  - [AI Alignment Forum blog] [2024.7]
- BatchTopK: A Simple Improvement for TopK-SAEs
  - [LessWrong blog] [2024.7]
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
  - [ICML MI workshop] [2024.7]
- Interpreting Attention Layer Outputs with Sparse Autoencoders
  - [arxiv] [2024.6]
- Transcoders Find Interpretable LLM Feature Circuits
  - [arxiv] [2024.6]
- Scaling and evaluating sparse autoencoders
  - [OpenAI] [2024.6]
- Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
  - [NeurIPS 2024] [2024.5]
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
  - [arxiv] [2024.5]
- [arxiv] [2024.5]
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
  - [Anthropic] [2024.5]
- How to use and interpret activation patching
  - [arxiv] [2024.4]
- Improving Dictionary Learning with Gated Sparse Autoencoders
  - [DeepMind] [2024.4]
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
  - [arxiv] [2024.3]
- RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
  - [ACL 2024] [2024.2]
- Addressing Feature Suppression in SAEs
  - [LessWrong blog] [2024.2]
- Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders
  - [LessWrong blog] [2024.2]
- [arxiv] [2024.2]
- Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
  - [AI Alignment Forum blog] [2024.2]
- Steering Llama 2 via Contrastive Activation Addition
  - [ACL 2024] [2023.12]
- Codebook Features: Sparse and Discrete Interpretability for Neural Networks
  - [arxiv] [2023.10]
- Attribution patching outperforms automated circuit discovery
  - [BlackboxNLP 2024] [2023.10]
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
  - [Anthropic] [2023.10]
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
  - [ICLR 2024] [2023.9]
- Steering Language Models With Activation Engineering
  - [arxiv] [2023.8]
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
  - [NeurIPS 2023] [2023.6]
- Language models can explain neurons in language models
  - [OpenAI] [2023.5]
- Distributed Representations: Composition & Superposition
  - [Anthropic] [2023.5]
- Privileged Bases in the Transformer Residual Stream
  - [Anthropic] [2023.3]
- Attribution Patching: Activation Patching At Industrial Scale
  - [Neel Nanda blog] [2023.2]
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses
  - [AI Alignment Forum] [2022.12]
- Taking features out of superposition with sparse autoencoders
  - [AI Alignment Forum] [2022.12]
- Engineering Monosemanticity in Toy Models
  - [arxiv] [2022.11]
- Polysemanticity and Capacity in Neural Networks
  - [arxiv] [2022.9]
- Toy Models of Superposition
  - [Anthropic] [2022.9]