llava-interpret

The bulk of the work can be reproduced in the llava_sae notebook. The notebook:

Loads a hooked version of LLaVA so that we can inspect and intervene on model activations in the residual stream
Loads a pretrained SAE for Gemma-2B using SAELens
Runs the model on ImageNet images with the prompt "describe the image."
Uses the pretrained SAE to analyze the resulting activations
Provides a function that selects an interpretable feature and subtracts it from the residual stream of the original LLaVA model; the notebook includes some examples
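
The feature-subtraction step can be sketched roughly as follows. This is a minimal NumPy illustration and not the notebook's code: it assumes a standard ReLU SAE (encoder weights `W_enc`, decoder weights `W_dec`, biases `b_enc` and `b_dec`), and the function names, shapes, and random toy weights are all hypothetical. In the notebook, the equivalent operation would run on real residual-stream activations from the hooked LLaVA model and use the pretrained SAE weights.

```python
import numpy as np


def sae_encode(resid, W_enc, b_enc, b_dec):
    """Feature activations of a ReLU SAE (assumed form) for each token."""
    return np.maximum((resid - b_dec) @ W_enc + b_enc, 0.0)


def subtract_feature(resid, W_enc, b_enc, W_dec, b_dec, feature_idx):
    """Remove one feature's decoded contribution from the residual stream."""
    acts = sae_encode(resid, W_enc, b_enc, b_dec)  # (n_tokens, d_sae)
    # Each token's contribution is its activation times the decoder direction.
    contribution = np.outer(acts[:, feature_idx], W_dec[feature_idx])
    return resid - contribution  # (n_tokens, d_model)


# Toy stand-ins for real activations and pretrained SAE weights.
rng = np.random.default_rng(0)
n_tokens, d_model, d_sae = 5, 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)
resid = rng.normal(size=(n_tokens, d_model))

feature_idx = 3  # hypothetical index of an "interpretable" feature
patched = subtract_feature(resid, W_enc, b_enc, W_dec, b_dec, feature_idx)
```

Tokens on which the chosen feature is inactive are left unchanged, since their activation (and therefore their contribution) is zero.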
