CLM-LER is a package for preprocessing, training, and fine-tuning Hugging Face transformer models for tasks involving electronic health records (EHR) and laboratory data. It supports workflows for data preprocessing, model training, and evaluation, leveraging tools like PySpark, Hugging Face Transformers, and WandB for efficient and scalable operations.
- Overview
- Installation
- Pretraining Pipeline Description
- Testing with EHRSHOT Benchmarks
- UMLS for Mapping
- Acknowledgements
CLM-LER provides a modular framework for working with EHR data. It includes utilities for:
- Preprocessing raw EHR data into tokenized formats.
- Training CLM models on large-scale EHR datasets.
- Fine-tuning models for specific downstream tasks like classification.
- Handling unit conversions, percentile calculations, and UMLS-based translations.
- Python 3.10
- PySpark
- Hugging Face Transformers
- WandB (Weights and Biases)
Using a virtual environment prevents conflicting package installations. You can create one as follows:

```bash
conda create -n train-clm python=3.10 -y
conda activate train-clm
```
Specify the version of torch to install and the index for downloading it. Torch 2.0.1 (compiled for CUDA 11.7) was found to work well for this project. If you're using another version of CUDA, adjust the torch version accordingly.
This install is good for development work and training models interactively:

```bash
<install-torch> # e.g. pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117/
pip install -e .[dev,train,spark]
```
For all dependencies:

```bash
<install-torch> # e.g. pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117/
pip install -e .[full]
```
- `dev` -- dependencies to run unit tests and develop in this package.
- `train` -- dependencies to train a model.
- `spark` -- what you need for any data processing, i.e. a PySpark installation.
N.B. None of the extras above declare torch as an explicit dependency, but torch must be installed.
We used Amazon S3 for much of our I/O. This open-source release shows how AWS access keys with permissions to an S3 bucket can be used. We encourage anyone following along to perform a similar setup to reduce friction in getting started!
Given how many clinical language models we trained on the way to the CLM-LER architecture, we used W&B for experiment tracking and logging. We support online and offline options, but please raise an issue if you run into trouble!
To set this up:

```bash
wandb login
```
We used Spark clusters on our side to handle the tens of millions of patients in the EMR data we work with. If you have AWS EMR, this could work for you! We suggest the use of emrflow. We removed it from the repo to increase adoption and minimize dependencies in case you use another distributed compute tool (e.g. Databricks clusters).
```bash
export CLM_AWS_ACCESS_KEY_ID=XYZ
export CLM_AWS_SECRET_ACCESS_KEY=XYZ
export CLM_AWS_DEFAULT_REGION=<region> # e.g. eu-west-1
export WANDB_API_KEY=XYZ
export WANDB_USERNAME=XYZ
export WANDB_ENTITY=XYZ
export CLMENCODER_DEPS=train,spark
```
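Since a missing credential usually surfaces late as an opaque S3 or W&B error, it can help to fail fast at startup. The helper below is a hypothetical sketch (not part of clm_ler) that checks the variables listed above:

```python
import os

# Hypothetical helper (not part of clm_ler): report which of the
# environment variables exported above are unset or empty, so a job
# can fail fast instead of erroring mid-run.
REQUIRED_VARS = [
    "CLM_AWS_ACCESS_KEY_ID",
    "CLM_AWS_SECRET_ACCESS_KEY",
    "CLM_AWS_DEFAULT_REGION",
    "WANDB_API_KEY",
]

def missing_env_vars(required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]
```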
There are four key stages handled by this package: data processing for model training, model pre-training, model fine-tuning, and testing with EHRSHOT benchmarks. Note: Explainability has been split into a separate repository here.
The input datasets are defined in configuration files, such as `clm_ler/config/data_files_full.yaml`.
The first step in building the clinical language model is to split the clinical data into train/val/test sets. The following script shows how you may trigger a similar process:

```bash
bash scripts/create_global_data_split.sh
```
The data must be arranged in the CLM-LER data model. For an example of how to preprocess the data, refer to `scripts/preprocess_data.py`. This script demonstrates the steps to preprocess patient, diagnosis, prescription, procedure, and lab data into tokenized formats.
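To illustrate the idea of tokenized formats, here is a toy sketch of turning one patient's events into a chronological token sequence. The real preprocessing lives in `scripts/preprocess_data.py` and runs on PySpark; the domain prefixes and token naming below are illustrative assumptions, not the package's actual vocabulary:

```python
# Toy sketch: one patient's EHR events -> a chronological token sequence.
# Token naming (e.g. "dx:ICD10CM:E11.9") is an illustrative assumption.

def tokenize_patient(events):
    """events: list of (date, domain, code) tuples,
    e.g. ("2020-01-03", "dx", "ICD10CM:E11.9")."""
    tokens = []
    for date, domain, code in sorted(events):  # chronological order
        tokens.append(f"{domain}:{code}")
    return tokens

# Example patient with a diagnosis, a lab result (bucketed by percentile),
# and a prescription, given out of order:
events = [
    ("2020-02-10", "rx", "NDC:0002-8215"),
    ("2020-01-03", "dx", "ICD10CM:E11.9"),
    ("2020-01-03", "lab", "LOINC:4548-4|pctl_80"),
]
tokens = tokenize_patient(events)
```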
Once the data is preprocessed, follow these steps to pre-train a CLM model:
Generate a vocabulary file for the model. See `scripts/preprocess_data.py` for an example!
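Conceptually, vocabulary generation counts token frequencies over the training split and keeps tokens above a minimum frequency. The sketch below illustrates that idea; the special tokens and cutoff are assumptions, and the real implementation is in `scripts/preprocess_data.py`:

```python
from collections import Counter

# Minimal vocabulary-building sketch: count token frequencies and keep
# tokens seen at least min_freq times. Special tokens and the cutoff
# value are illustrative assumptions.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(token_sequences, min_freq=5):
    counts = Counter(tok for seq in token_sequences for tok in seq)
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + kept)}
```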
Train the CLM model using the preprocessed data and generated vocabulary.
Training a full CLM model typically takes about a week on an NVIDIA A10G GPU (e.g., a g5.xlarge EC2 instance) for an EHR dataset of over 40M US patients.
See `scripts/preprocess_data.py` for a usage example.
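Assuming a BERT-style masked-token objective for pre-training (an assumption; the exact objective is defined by the training scripts), preparing a batch amounts to hiding a fraction of tokens and asking the model to predict them. A minimal sketch:

```python
import random

# Sketch of masked-token batch preparation, assuming a BERT-style
# objective. The 15% mask rate and mask_id are illustrative assumptions.

def mask_tokens(input_ids, mask_id, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = position ignored by the loss
    for i in range(len(input_ids)):
        if rng.random() < mask_prob:
            labels[i] = input_ids[i]  # predict the original token here
            masked[i] = mask_id
    return masked, labels
```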
Fine-tune the pre-trained model for specific tasks, such as classification.
Ensure the dataset includes a label column for the target classification task.
See the `scripts/run_asthma_with_labs.sh` or `run_all_ehrshot_training.sh` scripts for examples of the fine-tuning call.
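Clinical classification labels are often heavily imbalanced, so weighting the loss by inverse class frequency is a common choice when fine-tuning. This is a generic sketch of that technique, not the package's documented behaviour:

```python
from collections import Counter

# Sketch: inverse-frequency class weights for an imbalanced label column.
# weight(c) = N / (num_classes * count(c)), a standard balancing scheme.

def class_weights(labels):
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}
```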
EHRSHOT is a benchmarking dataset for evaluating model performance on various EHR-related tasks. Learn more: EHRSHOT. It provides multiple tasks for which a model can be finetuned. Using this dataset involves a few steps.
Firstly, if you are using a pre-trained model, you will want to map the EHRSHOT dataset's tokens to those expected by your model's vocabulary. This is handled by the script `src/clm_ler/data_processing/process_ehrshot_data.py`, which takes a model and the raw data. Given a config file like `src/clm_ler/config/mapping_config_to_clm_ler.yaml`, the EHRSHOT data and the model's vocabulary are normalized into the source names expected by UMLS. When running the script, you will be notified of any code sources that could not be mapped to UMLS. For example, this could be because you did not map ICD9 to ICD9CM (the name of this source in UMLS).
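The source-name normalization described above can be pictured as a small renaming table plus a report of anything left unmapped. The mapping entries below are examples only; the real configuration is `src/clm_ler/config/mapping_config_to_clm_ler.yaml`:

```python
# Toy version of the source-name normalization: rename code sources to
# the vocabulary names UMLS expects, and report unmapped sources. The
# entries in SOURCE_MAP are illustrative examples, not the full config.
SOURCE_MAP = {
    "ICD9": "ICD9CM",    # UMLS names this source ICD9CM
    "ICD10": "ICD10CM",
    "RxNorm": "RXNORM",
}

def normalize_codes(codes):
    """codes: list of (source, code) pairs.
    Returns (normalized_pairs, sorted_unmapped_sources)."""
    normalized, unmapped = [], set()
    for source, code in codes:
        if source in SOURCE_MAP:
            normalized.append((SOURCE_MAP[source], code))
        else:
            unmapped.add(source)
    return normalized, sorted(unmapped)
```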
Secondly, this is a timeseries dataset with labelled events stored separately. Once you have created the dataset above, you need to join the labels into it, creating the dataset needed for inference and training. A config example is supplied: `src/clm_ler/config/config_add_labels_translated_data.yaml`.
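The label join can be sketched as follows: each label row carries a patient id, a prediction time, and a label, and is attached to that patient's event history up to the prediction time. Field names here are illustrative assumptions; the real join is driven by the config file above:

```python
# Sketch of joining separately-stored labels into the timeseries dataset.
# Each label row becomes one training/inference example containing the
# patient's tokens up to the prediction time. Field names are assumptions.

def join_labels(events_by_patient, label_rows):
    """events_by_patient: {patient_id: [(iso_time, token), ...]}
    label_rows: [(patient_id, prediction_time, label), ...]."""
    examples = []
    for patient_id, pred_time, label in label_rows:
        history = [tok for t, tok in events_by_patient.get(patient_id, [])
                   if t <= pred_time]  # keep only events before the label
        examples.append({"patient_id": patient_id,
                         "tokens": history,
                         "label": label})
    return examples
```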
To see an example of processing data for CLM-LER, see `scripts/run_clmler_ehrshot_preprocess.py`.
The Unified Medical Language System (UMLS) is used for mapping medical codes to standardized concepts. This ensures consistency across datasets and models.
- Mapping Configurations: See `clm_ler/config/mapping_config_to_clm_ler.yaml` for examples of UMLS mappings.
- Translation Utilities: The `clm_ler.data_processing.data_processing_utils` module provides functions for deriving UMLS translations.
For more details, refer to the UMLS documentation.
This project leverages several key resources and contributions:
- UMLS (Unified Medical Language System)
- The UMLS Metathesaurus is used for mapping medical codes to standardized concepts.
- Learn more: UMLS
- Ref: UMLS Knowledge Sources [dataset on the Internet]. Release 2024AA. Bethesda (MD): National Library of Medicine (US); 2024 May 6 [cited 2024 Jul 15]. Available from: http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
- EHRSHOT
- The EHRSHOT benchmark datasets are used for evaluating model performance on various EHR-related tasks.
- Learn more: EHRSHOT
- Ref: Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason A. Fries, and Nigam H. Shah. 2023. EHRSHOT: an EHR benchmark for few-shot evaluation of foundation models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 2933, 67125–67137.
- Inventors
  This project was developed by:
  - Lukas Adamek
- Jenny Du
- Maksim Kriukov
- Towsif Rahman
- Utkarsh Vashisth
- Brandon Rufino
Special thanks to the inventors for their contributions to the development of CLM-LER.