Awesome Papers for Sparse Auto-Encoder (SAE)

This list focuses on sparse auto-encoder (SAE) techniques in mechanistic interpretability. Another list focuses on understanding the internal mechanisms of LLMs.

Paper, preprint, and blog recommendations: please open an issue or contact me.

Papers

2025

Probing the Vulnerability of Large Language Models to Polysemantic Interventions
- [arxiv] [2025.5]
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
- [arxiv] [2025.5]
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
- [arxiv] [2025.5]
On the Biology of a Large Language Model
- [Anthropic] [2025.3]
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
- [SAE survey] [2025.3]
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
- [arxiv] [2025.3]
Sparse Autoencoders Do Not Find Canonical Units of Analysis
- [ICLR 2025] [2025.2]
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
- [arxiv] [2025.2]
Scaling Sparse Feature Circuits For Studying In-Context Learning
- [OpenReview] [2025.1]
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- [ICML 2025] [2025.1]
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
- [arxiv] [2025.1]

2024

BatchTopK Sparse Autoencoders
- [arxiv] [2024.12]
Stage-Wise Model Diffing
- [Anthropic] [2024.12]
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
- [ICLR 2025] [2024.11]
Improving Steering Vectors by Targeting Sparse Autoencoder Features
- [arxiv] [2024.11]
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
- [arxiv] [2024.11]
Applying sparse autoencoders to unlearn knowledge in language models
- [arxiv] [2024.10]
Evaluating feature steering: A case study in mitigating social biases
- [Anthropic] [2024.10]
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
- [arxiv] [2024.10]
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
- [arxiv] [2024.10]
Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models
- [ICLR 2025] [2024.10]
Sparse Crosscoders for Cross-Layer Features and Model Diffing
- [Anthropic] [2024.10]
Scaling Automatic Neuron Description
- [Transluce] [2024.10]
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
- [arxiv] [2024.9]
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
- [DeepMind] [2024.8]
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
- [DeepMind] [2024.8]
Self-explaining SAE features
- [LessWrong blog] [2024.8]
SAEs (usually) Transfer Between Base and Chat Models
- [AI Alignment Forum blog] [2024.7]
BatchTopK: A Simple Improvement for TopK-SAEs
- [LessWrong blog] [2024.7]
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
- [ICML MI workshop] [2024.7]
Interpreting Attention Layer Outputs with Sparse Autoencoders
- [arxiv] [2024.6]
Transcoders Find Interpretable LLM Feature Circuits
- [arxiv] [2024.6]
Scaling and evaluating sparse autoencoders
- [OpenAI] [2024.6]
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
- [NeurIPS 2024] [2024.5]
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- [arxiv] [2024.5]
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
- [arxiv] [2024.5]
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- [Anthropic] [2024.5]
How to use and interpret activation patching
- [arxiv] [2024.4]
Improving Dictionary Learning with Gated Sparse Autoencoders
- [DeepMind] [2024.4]
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- [arxiv] [2024.3]
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
- [ACL 2024] [2024.2]
Addressing Feature Suppression in SAEs
- [LessWrong blog] [2024.2]
Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders
- [LessWrong blog] [2024.2]
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
- [arxiv] [2024.2]
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
- [AI Alignment Forum blog] [2024.2]

2023

Steering Llama 2 via Contrastive Activation Addition
- [ACL 2024] [2023.12]
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
- [arxiv] [2023.10]
Attribution patching outperforms automated circuit discovery
- [BlackboxNLP 2024] [2023.10]
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- [Anthropic] [2023.10]
Sparse Autoencoders Find Highly Interpretable Features in Language Models
- [ICLR 2024] [2023.9]
Steering Language Models With Activation Engineering
- [arxiv] [2023.8]
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- [NeurIPS 2023] [2023.6]
Language models can explain neurons in language models
- [OpenAI] [2023.5]
Distributed Representations: Composition & Superposition
- [Anthropic] [2023.5]
Privileged Bases in the Transformer Residual Stream
- [Anthropic] [2023.3]
Attribution Patching: Activation Patching At Industrial Scale
- [Neel Nanda Blog] [2023.2]

2022

Causal Scrubbing: a method for rigorously testing interpretability hypotheses
- [AI Alignment Forum] [2022.12]
Taking features out of superposition with sparse autoencoders
- [AI Alignment Forum] [2022.12]
Engineering Monosemanticity in Toy Models
- [arxiv] [2022.11]
Polysemanticity and Capacity in Neural Networks
- [arxiv] [2022.9]
Toy Models of Superposition
- [Anthropic] [2022.9]