Train State transition models or pretrain State embedding models. See the State paper.
This package is distributed via uv.
To install from PyPI:
uv tool install arc-state
To install from source instead:
git clone git@github.com:ArcInstitute/state.git
cd state
uv run state
When making fundamental changes to State, install an editable version with the -e flag.
git clone git@github.com:ArcInstitute/state.git
cd state
uv tool install -e .
You can access the CLI help menu with:
state --help
Output:
usage: state [-h] {emb,tx} ...
positional arguments:
{emb,tx}
options:
-h, --help show this help message and exit
To start an experiment, write a TOML file (see examples/zeroshot.toml or examples/fewshot.toml for a starting point). The TOML file specifies the dataset paths (directories containing h5ad files) as well as the machine learning task.
The example below trains an ST (State transition) model:
state tx train \
data.kwargs.toml_config_path="examples/fewshot.toml" \
data.kwargs.embed_key=X_hvg \
data.kwargs.num_workers=12 \
data.kwargs.batch_col=batch_var \
data.kwargs.pert_col=target_gene \
data.kwargs.cell_type_key=cell_type \
data.kwargs.control_pert=TARGET1 \
training.max_steps=40000 \
training.val_freq=100 \
training.ckpt_every_n_steps=100 \
training.batch_size=8 \
training.lr=1e-4 \
model.kwargs.cell_set_len=64 \
model.kwargs.hidden_dim=328 \
model=pertsets \
wandb.tags="[test]" \
output_dir="$HOME/state" \
name="test"
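When launching many runs, it can help to assemble the key=value overrides programmatically. The following is a minimal sketch, not part of the State package: build_train_command is a hypothetical helper, and the override keys are copied from the command above. It only builds and prints the command line; it does not start training.

```python
import os
import shlex


def build_train_command(overrides):
    """Return `state tx train` followed by one key=value token per override.

    Hypothetical helper for illustration; not part of the State package.
    """
    return ["state", "tx", "train"] + [f"{k}={v}" for k, v in overrides.items()]


# Override keys copied from the training command above.
overrides = {
    "data.kwargs.toml_config_path": "examples/fewshot.toml",
    "data.kwargs.embed_key": "X_hvg",
    "data.kwargs.pert_col": "target_gene",
    "data.kwargs.cell_type_key": "cell_type",
    "data.kwargs.control_pert": "TARGET1",
    "training.max_steps": 40000,
    "model": "pertsets",
    # $HOME is not expanded without a shell, so expand it here.
    "output_dir": os.path.join(os.path.expanduser("~"), "state"),
    "name": "test",
}

cmd = build_train_command(overrides)
print(shlex.join(cmd))
# To launch the run: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues when a value contains spaces or brackets.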
The cell lines and perturbations specified in the TOML should match the values that appear in the columns named by data.kwargs.cell_type_key and data.kwargs.pert_col above. To evaluate State on the specified task, use the tx predict command:
state tx predict --output_dir $HOME/state/test/ --checkpoint final.ckpt
It looks for a checkpoints folder inside the output_dir given above.
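Before running tx predict, it may be worth checking that the checkpoint is where the command will look. The sketch below assumes the layout is output_dir/checkpoints/checkpoint-name, which is inferred from the description above rather than confirmed against the State source; find_checkpoint is a hypothetical helper.

```python
from pathlib import Path


def find_checkpoint(output_dir, checkpoint="final.ckpt"):
    """Return the expected checkpoint path, or None if the file is missing.

    Assumes checkpoints live in <output_dir>/checkpoints/ (an assumption).
    """
    path = Path(output_dir).expanduser() / "checkpoints" / checkpoint
    return path if path.is_file() else None


ckpt = find_checkpoint("~/state/test", "final.ckpt")
print(ckpt if ckpt else "no checkpoint found; has training written one yet?")
```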
If you instead want to use a trained checkpoint for inference (e.g., on data not specified in the TOML file):
state tx infer --output $HOME/state/test/ --output_dir /path/to/model/ --checkpoint /path/to/model/final.ckpt --adata /path/to/anndata/processed.h5 --pert_col gene --embed_key X_hvg
Here, /path/to/model/ is the folder downloaded from HuggingFace.
State experiments are configured using TOML files that define datasets, training splits, and evaluation scenarios. The configuration system supports both zeroshot (unseen cell types) and fewshot (limited perturbation examples) evaluation paradigms.
[datasets] - Maps dataset names to their file system paths
[datasets]
replogle = "/path/to/replogle/dataset/"
# YOU CAN ADD MORE
[training] - Specifies which datasets participate in training
[training]
replogle = "train" # Include all replogle data in training (unless overridden below)
[zeroshot] - Reserves entire cell types for validation/testing
[zeroshot]
"replogle.jurkat" = "test" # All jurkat cells go to test set
"replogle.k562" = "val" # All k562 cells go to validation set
[fewshot] - Specifies perturbation-level splits within cell types
[fewshot]
[fewshot."replogle.rpe1"] # Configure splits for rpe1 cell type
val = ["AARS", "TUFM"] # These perturbations go to validation
test = ["NUP107", "RPUSD4"] # These perturbations go to test
# Note: All other perturbations in rpe1 automatically go to training
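To see how these four sections look once parsed, the sketch below loads the snippets above with Python's standard tomllib module and prints each section. It illustrates the file structure only and makes no claim about how State itself reads the file. Note that a quoted key such as "replogle.jurkat" is a single key containing a dot, not a nested table.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older Pythons

# The four sections shown above, combined into one document.
CONFIG_TEXT = '''
[datasets]
replogle = "/path/to/replogle/dataset/"

[training]
replogle = "train"

[zeroshot]
"replogle.jurkat" = "test"
"replogle.k562" = "val"

[fewshot]
[fewshot."replogle.rpe1"]
val = ["AARS", "TUFM"]
test = ["NUP107", "RPUSD4"]
'''

config = tomllib.loads(CONFIG_TEXT)

for name, path in config["datasets"].items():
    print(f"dataset {name}: {path} ({config['training'].get(name)})")

for key, split in config["zeroshot"].items():
    # Split "dataset_name.cell_type" on the first dot only.
    dataset, cell_type = key.split(".", 1)
    print(f"zeroshot {dataset}/{cell_type} -> {split}")

for key, splits in config["fewshot"].items():
    print(f"fewshot {key}: val={splits.get('val', [])} test={splits.get('test', [])}")
```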
# Evaluate generalization to completely unseen cell types
[datasets]
replogle = "/data/replogle/"
[training]
replogle = "train"
[zeroshot]
"replogle.jurkat" = "test" # Hold out entire jurkat cell line
"replogle.rpe1" = "val" # Hold out entire rpe1 cell line
[fewshot]
# Empty - no perturbation-level splits
# Evaluate with limited examples of specific perturbations
[datasets]
replogle = "/data/replogle/"
[training]
replogle = "train"
[zeroshot]
# Empty - all cell types participate in training
[fewshot]
[fewshot."replogle.k562"]
val = ["AARS"] # Limited AARS examples for validation
test = ["NUP107", "RPUSD4"] # Limited examples of these genes for testing
[fewshot."replogle.jurkat"]
val = ["TUFM"]
test = ["MYC", "TP53"]
# Combine both zeroshot and fewshot evaluation
[datasets]
replogle = "/data/replogle/"
[training]
replogle = "train"
[zeroshot]
"replogle.jurkat" = "test" # Zeroshot: unseen cell type
[fewshot]
[fewshot."replogle.k562"] # Fewshot: limited perturbation examples
val = ["STAT1"]
test = ["MYC", "TP53"]
- Automatic training assignment: Any cell type not mentioned in [zeroshot] automatically participates in training, with perturbations not listed in [fewshot] going to the training set
- Overlapping splits: Perturbations can appear in both validation and test sets within fewshot configurations
- Dataset naming: Use the format "dataset_name.cell_type" when specifying cell types in zeroshot and fewshot sections
- Path requirements: Dataset paths should point to directories containing h5ad files
- Control perturbations: Ensure your control condition (specified via control_pert parameter) is available across all splits
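The assignment rules above can be sketched as a small lookup. This is one reading of the rules as written, not the package's implementation: splits_for is a hypothetical function, the config is a plain dict mirroring the mixed example, and the handling of control perturbations and of the [training] section is deliberately left out.

```python
def splits_for(config, dataset, cell_type, perturbation):
    """Return the list of splits a (dataset, cell type, perturbation) falls into.

    Hypothetical illustration of the rules described above.
    """
    key = f"{dataset}.{cell_type}"

    # A cell type listed in [zeroshot] is held out entirely.
    zeroshot = config.get("zeroshot", {})
    if key in zeroshot:
        return [zeroshot[key]]

    # Otherwise, perturbations listed in [fewshot] go to val and/or test.
    fewshot = config.get("fewshot", {}).get(key, {})
    held_out = [s for s in ("val", "test") if perturbation in fewshot.get(s, [])]

    # Anything not mentioned falls through to training.
    return held_out or ["train"]


# Plain-dict version of the mixed example above.
config = {
    "datasets": {"replogle": "/data/replogle/"},
    "training": {"replogle": "train"},
    "zeroshot": {"replogle.jurkat": "test"},
    "fewshot": {"replogle.k562": {"val": ["STAT1"], "test": ["MYC", "TP53"]}},
}

for cell_type, pert in [("jurkat", "MYC"), ("k562", "STAT1"), ("k562", "AARS")]:
    print(cell_type, pert, splits_for(config, "replogle", cell_type, pert))
```

The function returns a list rather than a single label because a perturbation may be listed under both val and test.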
The configuration system will validate that:
- All referenced datasets exist at the specified paths
- Cell types mentioned in zeroshot/fewshot sections exist in the datasets
- Perturbations listed in fewshot sections are present in the corresponding cell types
- No conflicts exist between zeroshot and fewshot assignments for the same cell type
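For catching mistakes before launching a run, the last three checks can be approximated on a parsed config. The sketch below is not State's validator: validate_config and the available mapping (dataset name to cell type to set of perturbations) are hypothetical, and the first check, that dataset paths exist on disk, is omitted so the example runs without any data.

```python
def validate_config(config, available):
    """Return a list of human-readable problems found in a parsed config.

    `available` is hypothetical metadata: {dataset: {cell_type: set(perturbations)}}.
    """
    errors = []
    datasets = config.get("datasets", {})
    zeroshot = config.get("zeroshot", {})
    fewshot = config.get("fewshot", {})

    # Every "dataset.cell_type" key must name a known dataset and cell type.
    for key in sorted(set(zeroshot) | set(fewshot)):
        dataset, _, cell_type = key.partition(".")
        if dataset not in datasets:
            errors.append(f"{key}: unknown dataset '{dataset}'")
        elif cell_type not in available.get(dataset, {}):
            errors.append(f"{key}: cell type '{cell_type}' not found")

    # Every fewshot perturbation must exist in its cell type.
    for key, splits in fewshot.items():
        dataset, _, cell_type = key.partition(".")
        cell_types = available.get(dataset, {})
        if cell_type not in cell_types:
            continue  # already reported above
        for split, perts in splits.items():
            for pert in perts:
                if pert not in cell_types[cell_type]:
                    errors.append(f"{key}: perturbation '{pert}' ({split}) not found")

    # A cell type cannot be both fully held out and split by perturbation.
    for key in sorted(set(zeroshot) & set(fewshot)):
        errors.append(f"{key}: listed in both [zeroshot] and [fewshot]")

    return errors


available = {"replogle": {"jurkat": {"MYC"}, "k562": {"STAT1", "MYC", "TP53"}}}
config = {
    "datasets": {"replogle": "/data/replogle/"},
    "zeroshot": {"replogle.k562": "val"},
    "fewshot": {"replogle.k562": {"val": ["STAT1"], "test": ["FOO"]}},
}
for problem in validate_config(config, available):
    print(problem)
```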
After following the installation commands above, pretrain a State embedding model with:
state emb fit --conf ${CONFIG}
To run inference with a trained State checkpoint, e.g., the State model trained for 4 epochs:
state emb transform \
--checkpoint "/large_storage/ctc/userspace/aadduri/SE-600M" \
--input "/large_storage/ctc/datasets/replogle/rpe1_raw_singlecell_01.h5ad" \
--output "/home/aadduri/vci_pretrain/test_output.h5ad"
State code is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license (CC BY-NC-SA 4.0).
The model weights and output are licensed under the Arc Research Institute State Model Non-Commercial License and subject to the Arc Research Institute State Model Acceptable Use Policy.
Any publication that uses this source code or model parameters should cite the State paper.