Assertion Extraction from Biomedical Abstracts

This repository contains the code, configuration, and modules for an experimental pipeline that extracts scientific assertions (subject–predicate–object triples) from biomedical literature. The main goal is to evaluate and compare different LLMs on multi-stage information extraction.

Project Overview

The pipeline is divided into three main stages:

Finding Detection – Identify which sentences in an abstract contain scientific findings.
Completion and Resolution – Convert finding sentences into standalone factual statements.
Assertion Extraction – Convert standalone statements into structured triples: (subject, predicate, object), with optional conditions.
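
To make the three stages concrete, here is a minimal sketch of the data a single sentence might carry through the pipeline. The class names (`Assertion`, `SentenceRecord`), their fields, and the "Drug X" sentence are illustrative assumptions, not the repository's actual schemas or data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Assertion:
    """One structured triple, with optional qualifying conditions."""
    subject: str
    predicate: str
    object: str
    conditions: List[str] = field(default_factory=list)


@dataclass
class SentenceRecord:
    """Hypothetical record tracking one abstract sentence across the stages."""
    text: str
    is_finding: bool = False            # set by stage 1 (finding detection)
    standalone: Optional[str] = None    # set by stage 2 (completion and resolution)
    assertions: List[Assertion] = field(default_factory=list)  # set by stage 3


# Invented example sentence; "It" is resolved to "Drug X" in stage 2.
record = SentenceRecord(
    text="It reduced systolic blood pressure in adults with hypertension."
)
record.is_finding = True
record.standalone = (
    "Drug X reduced systolic blood pressure in adults with hypertension."
)
record.assertions.append(
    Assertion(
        subject="Drug X",
        predicate="reduced",
        object="systolic blood pressure",
        conditions=["in adults with hypertension"],
    )
)
print(record.assertions[0])
```

Keeping the original sentence, the standalone rewrite, and the triples on one record makes it easier to trace an extracted assertion back to its source text when comparing models.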

Directory Structure

Folder / File	Description
`01.2_sentence_classification/`	Sentence-level classification experiments for finding detection.
`01.1_binary_classification/`	Earlier binary classifier experiments (e.g., SciBERT vs. logistic regression); deprecated.
`02.1_extract_results_CITATIONS/`	Tools to extract and prepare PubMed citation abstracts (CSV files, PMIDs).
`02.2_exmine_results_tag_API/`	Evaluation of OpenAI API tagging outputs (e.g., result/finding matching).
`03_extract_article_summary/`	Early experiments in summarization / simplification of article content.
`04_classification_model/`	Transformer-based sentence classification models (e.g., BioBERT); deprecated.
`05_three_single_pipeline/`	Main pipeline experiment. Includes finding → completion → assertion.
`.gitignore`	Specifies excluded file types (e.g. `.npy`, `.pt`).
`environment.yml`	Conda environment file with all required dependencies.

Current Main Experiment: `05_three_single_pipeline/`

This folder includes the three sequential LLM pipelines:

finding_pipeline.py – Run GPT/Claude/LLaMA to identify finding sentences.
completion_pipeline.py – Expand finding sentences into standalone factual statements.
assertion_pipeline.py – Extract subject–predicate–object triples.
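
The sequential flow of the three scripts can be sketched as follows. This is an assumption-laden illustration, not the repository's code: the function names, the prompt strings, the `LLM` callable interface, and the canned `fake_llm` are all hypothetical stand-ins for real GPT/Claude/LLaMA calls.

```python
import json
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, raw text out


def detect_findings(sentences: List[str], llm: LLM) -> List[str]:
    """Stage 1: keep the sentences the model labels as findings."""
    return [s for s in sentences
            if llm(f"FINDING? {s}").strip().lower() == "yes"]


def complete_sentence(sentence: str, abstract: str, llm: LLM) -> str:
    """Stage 2: rewrite a finding as a standalone statement."""
    return llm(f"COMPLETE: {sentence}\nCONTEXT: {abstract}").strip()


def extract_assertions(statement: str, llm: LLM) -> List[Dict]:
    """Stage 3: parse the model's JSON reply into triple dictionaries."""
    try:
        parsed = json.loads(llm(f"TRIPLES: {statement}"))
    except json.JSONDecodeError:
        return []  # a malformed reply yields no assertions
    return parsed if isinstance(parsed, list) else []


def run_pipeline(sentences: List[str], llm: LLM) -> List[Dict]:
    """Chain the three stages over the sentences of one abstract."""
    abstract = " ".join(sentences)
    results = []
    for finding in detect_findings(sentences, llm):
        statement = complete_sentence(finding, abstract, llm)
        results.append({"statement": statement,
                        "assertions": extract_assertions(statement, llm)})
    return results


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model call, for demonstration only."""
    if prompt.startswith("FINDING?"):
        return "yes" if "reduced" in prompt else "no"
    if prompt.startswith("COMPLETE:"):
        return "Drug X reduced systolic blood pressure."
    return json.dumps([{"subject": "Drug X", "predicate": "reduced",
                        "object": "systolic blood pressure"}])


sentences = ["We studied Drug X in adults.",
             "It reduced systolic blood pressure."]
output = run_pipeline(sentences, fake_llm)
print(output)
```

Passing the model as a plain callable keeps each stage independent of any particular provider, which is one way to swap models when comparing them.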

Additional utilities:

schemas.py, json_utils.py, prompts.py – shared dataclass schemas, JSON parsing, and prompt templates.
config.py – API keys and model configuration; in production, supply keys through environment variables instead of hard-coding them.
eval/ – evaluation outputs and summary statistics.
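
As an illustration of the environment-variable advice for config.py, the sketch below reads an API key and model name from the environment. The `ModelConfig` class, the `load_config` helper, and the variable naming scheme (`<PROVIDER>_API_KEY`, `<PROVIDER>_MODEL`) are assumptions made for this example and may not match the repository's config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for one provider's settings."""
    provider: str
    model_name: str
    api_key: str


def load_config(provider: str,
                env: Optional[Mapping[str, str]] = None) -> ModelConfig:
    """Build a config from environment variables (assumed naming scheme)."""
    env = os.environ if env is None else env
    prefix = provider.upper()
    key_var = f"{prefix}_API_KEY"
    api_key = env.get(key_var)
    if not api_key:
        # Fail early rather than sending requests with an empty key.
        raise RuntimeError(f"Set {key_var} before running the pipeline.")
    model_name = env.get(f"{prefix}_MODEL", "default-model")
    return ModelConfig(provider, model_name, api_key)


# A dictionary stands in for os.environ so the example runs anywhere.
demo_env = {"OPENAI_API_KEY": "sk-placeholder", "OPENAI_MODEL": "example-model"}
config = load_config("openai", env=demo_env)
print(config.provider, config.model_name)
```

Accepting an optional mapping instead of always reading `os.environ` keeps the loader easy to test without touching real credentials.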

Environment Setup

conda env create -f environment.yml
conda activate hf-hpc

