Predicting cellular responses to perturbation across diverse contexts with State

Train State transition models or pretrain State embedding models. See the State paper.

Associated repositories

Model evaluation framework: cell-eval
Dataloaders and preprocessing: cell-load

Installation

Installation from PyPI

This package is distributed via uv.

uv tool install arc-state

Installation from Source

git clone [email protected]:ArcInstitute/state.git
cd state
uv run state

When making fundamental changes to State, install an editable version with the -e flag.

git clone [email protected]:ArcInstitute/state.git
cd state
uv tool install -e .

CLI Usage

You can access the CLI help menu with:

state --help

Output:

usage: state [-h] {emb,tx} ...

positional arguments:
  {emb,tx}

options:
  -h, --help  show this help message and exit

State Transition Model (ST)

To start an experiment, write a TOML file (see examples/zeroshot.toml or examples/fewshot.toml to start). The TOML file specifies the dataset paths (containing h5ad files) as well as the machine learning task.

Training an ST example below.

state tx train \
data.kwargs.toml_config_path="examples/fewshot.toml" \
data.kwargs.embed_key=X_hvg \
data.kwargs.num_workers=12 \
data.kwargs.batch_col=batch_var \
data.kwargs.pert_col=target_gene \
data.kwargs.cell_type_key=cell_type \
data.kwargs.control_pert=TARGET1 \
training.max_steps=40000 \
training.val_freq=100 \
training.ckpt_every_n_steps=100 \
training.batch_size=8 \
training.lr=1e-4 \
model.kwargs.cell_set_len=64 \
model.kwargs.hidden_dim=328 \
model=pertsets \
wandb.tags="[test]" \
output_dir="$HOME/state" \
name="test"

The cell lines and perturbations specified in the TOML should match the values appearing in the data.kwargs.cell_type_key and data.kwargs.pert_col used above. To evaluate STATE on the specified task, you can use the tx predict command:

state tx predict --output_dir $HOME/state/test/ --checkpoint final.ckpt

It will look in the output_dir above, for a checkpoints folder.

If you instead want to use a trained checkpoint for inference (e.g. on data not specified) in the TOML file:

state tx infer --output $HOME/state/test/ --output_dir /path/to/model/ --checkpoint /path/to/model/final.ckpt --adata /path/to/anndata/processed.h5 --pert_col gene --embed_key X_hvg

Here, /path/to/model/ is the folder downloaded from HuggingFace.

TOML Configuration Files

State experiments are configured using TOML files that define datasets, training splits, and evaluation scenarios. The configuration system supports both zeroshot (unseen cell types) and fewshot (limited perturbation examples) evaluation paradigms.

Configuration Structure

Required Sections

[datasets] - Maps dataset names to their file system paths

[datasets]
replogle = "/path/to/replogle/dataset/"
# YOU CAN ADD MORE

[training] - Specifies which datasets participate in training

[training]
replogle = "train"  # Include all replogle data in training (unless overridden below)

Optional Evaluation Sections

[zeroshot] - Reserves entire cell types for validation/testing

[zeroshot]
"replogle.jurkat" = "test"     # All jurkat cells go to test set
"replogle.k562" = "val"        # All k562 cells go to validation set

[fewshot] - Specifies perturbation-level splits within cell types

[fewshot]
[fewshot."replogle.rpe1"]      # Configure splits for rpe1 cell type
val = ["AARS", "TUFM"]         # These perturbations go to validation
test = ["NUP107", "RPUSD4"]    # These perturbations go to test
# Note: All other perturbations in rpe1 automatically go to training

Configuration Examples

Example 1: Pure Zeroshot Evaluation

# Evaluate generalization to completely unseen cell types
[datasets]
replogle = "/data/replogle/"

[training]
replogle = "train"

[zeroshot]
"replogle.jurkat" = "test"     # Hold out entire jurkat cell line
"replogle.rpe1" = "val"        # Hold out entire rpe1 cell line

[fewshot]
# Empty - no perturbation-level splits

Example 2: Pure Fewshot Evaluation

# Evaluate with limited examples of specific perturbations
[datasets]
replogle = "/data/replogle/"

[training]
replogle = "train"

[zeroshot]
# Empty - all cell types participate in training

[fewshot]
[fewshot."replogle.k562"]
val = ["AARS"]                 # Limited AARS examples for validation
test = ["NUP107", "RPUSD4"]    # Limited examples of these genes for testing

[fewshot."replogle.jurkat"]
val = ["TUFM"]
test = ["MYC", "TP53"]

Example 3: Mixed Evaluation Strategy

# Combine both zeroshot and fewshot evaluation
[datasets]
replogle = "/data/replogle/"

[training]
replogle = "train"

[zeroshot]
"replogle.jurkat" = "test"        # Zeroshot: unseen cell type

[fewshot]
[fewshot."replogle.k562"]      # Fewshot: limited perturbation examples
val = ["STAT1"]
test = ["MYC", "TP53"]

Important Notes

Automatic training assignment: Any cell type not mentioned in [zeroshot] automatically participates in training, with perturbations not listed in [fewshot] going to the training set
Overlapping splits: Perturbations can appear in both validation and test sets within fewshot configurations
Dataset naming: Use the format "dataset_name.cell_type" when specifying cell types in zeroshot and fewshot sections
Path requirements: Dataset paths should point to directories containing h5ad files
Control perturbations: Ensure your control condition (specified via control_pert parameter) is available across all splits

Validation

The configuration system will validate that:

All referenced datasets exist at the specified paths
Cell types mentioned in zeroshot/fewshot sections exist in the datasets
Perturbations listed in fewshot sections are present in the corresponding cell types
No conflicts exist between zeroshot and fewshot assignments for the same cell type

State Embedding Model (SE)

After following the same installation commands above:

state emb fit --conf ${CONFIG}

To run inference with a trained State checkpoint, e.g., the State trained to 4 epochs:

state emb transform \
  --checkpoint "/large_storage/ctc/userspace/aadduri/SE-600M" \
  --input "/large_storage/ctc/datasets/replogle/rpe1_raw_singlecell_01.h5ad" \
  --output "/home/aadduri/vci_pretrain/test_output.h5ad" \

Licenses

State code is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).

The model weights and output are licensed under the Arc Research Institute State Model Non-Commercial License and subject to the Arc Research Institute State Model Acceptable Use Policy.

Any publication that uses this source code or model parameters should cite the State paper.

Name		Name	Last commit message	Last commit date
Latest commit History 648 Commits
.github/workflows		.github/workflows
assets		assets
examples		examples
scripts		scripts
src/state		src/state
.gitignore		.gitignore
.gitmodules		.gitmodules
.python-version		.python-version
LICENSE		LICENSE
MODEL_ACCEPTABLE_USE_POLICY.md		MODEL_ACCEPTABLE_USE_POLICY.md
MODEL_LICENSE.md		MODEL_LICENSE.md
README.md		README.md
pyproject.toml		pyproject.toml
ruff.toml		ruff.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Predicting cellular responses to perturbation across diverse contexts with State

Associated repositories

Installation

Installation from PyPI

Installation from Source

CLI Usage

State Transition Model (ST)

TOML Configuration Files

Configuration Structure

Required Sections

Optional Evaluation Sections

Configuration Examples

Example 1: Pure Zeroshot Evaluation

Example 2: Pure Fewshot Evaluation

Example 3: Mixed Evaluation Strategy

Important Notes

Validation

State Embedding Model (SE)

Licenses

About

Uh oh!

Releases 8

Packages

Contributors 10

Uh oh!

Languages

License

ArcInstitute/state

Folders and files

Latest commit

History

Repository files navigation

Predicting cellular responses to perturbation across diverse contexts with State

Associated repositories

Installation

Installation from PyPI

Installation from Source

CLI Usage

State Transition Model (ST)

TOML Configuration Files

Configuration Structure

Required Sections

Optional Evaluation Sections

Configuration Examples

Example 1: Pure Zeroshot Evaluation

Example 2: Pure Fewshot Evaluation

Example 3: Mixed Evaluation Strategy

Important Notes

Validation

State Embedding Model (SE)

Licenses

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 8

Packages 0

Contributors 10

Uh oh!

Languages

Packages