Ineichen-Group/StudyTypeTeller

PreclinicalAbstractClassification

Objectives: develop methods to distinguish different types of pre-clinical literature based on abstract text.

1. Set up the environment

Poetry

The project is built using Poetry for dependency management. Instructions on how to install Poetry can be found in its documentation.
To install the project's dependencies, make sure you have the pyproject.toml and poetry.lock files and run the install command.

poetry install

The pyproject.toml file lists all packages that need to be installed to run the project. The poetry.lock file pins the exact versions of the installed libraries so that environments are reproducible.

Conda

For the GPT Jupyter Notebooks, we used a conda environment that can be re-created as follows:

conda env create -f conda_environment.yml

conda activate studytype-teller

Please follow the documentation on using conda environments with Jupyter to make the environment accessible in the notebooks.

2. Data

Querying PubMed

We used the EDirect package, which includes several commands that use the E-utilities API to find and retrieve PubMed data. You can install it via the command:

sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

To obtain the initial set of relevant PMIDs, the database was queried using a generic search string related to CNS and Psychiatric conditions, as follows:

esearch -db pubmed -query 'Central nervous system diseases[MeSH] OR Mental Disorders Psychiatric illness[MeSH]' | efetch -format uid > ./pubmed/cns_psychiatric_diseases_mesh.txt

On 27/11/2023 this query returned 2,788,345 PMIDs, out of which we sampled 5,000 in Data_Preparation_and_Postprocessing.ipynb.
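The sampling step can be sketched as follows (a minimal illustration; the actual sampling lives in Data_Preparation_and_Postprocessing.ipynb, and the seed shown here is an assumption):

```python
import random

def sample_pmids(pmids, n=5000, seed=42):
    """Draw a reproducible random sample of n PMIDs from the full query result."""
    rng = random.Random(seed)
    return rng.sample(pmids, n)
```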

Given this list (see cns_psychiatric_diseases_mesh_5000_sample_pmids.txt):

  • Create a variable containing the list of PMIDs:

id_list=$(paste -sd, "./pubmed/cns_psychiatric_diseases_mesh_5000_sample_pmids.txt")

  • Fetch the relevant contents from PubMed based on those IDs:

efetch -db pubmed -id $id_list -format xml | xtract -pattern PubmedArticle -tab '^' -def "N/A" -element MedlineCitation/PMID PubDate/Year Journal/Title ArticleTitle AbstractText -block ArticleId -if ArticleId@IdType -equals doi -element ArticleId > "./pubmed/pmid_contents_mesh_query.txt"

  • With keywords

efetch -db pubmed -id $id_list -format xml | xtract -pattern PubmedArticle -tab '^' -def "N/A" -element MedlineCitation/PMID Journal/Title -block KeywordList -element Keyword > "./pubmed/enriched_data_keywords.txt"

The data is then cleaned and prepared for annotation with prodigy in Data_Preparation_and_Postprocessing.ipynb.
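The '^'-delimited output of the efetch/xtract commands above can be read back in Python. A minimal sketch (the field order follows the first xtract command; the key names are our own):

```python
def parse_xtract_line(line):
    """Split one '^'-delimited line produced by xtract.
    Field order follows the command above: PMID, year, journal title,
    article title, abstract, and (if present) DOI. Missing fields were
    filled with "N/A" via -def."""
    fields = line.rstrip("\n").split("^")
    keys = ["pmid", "year", "journal", "title", "abstract", "doi"]
    return dict(zip(keys, fields))
```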


Data Annotation with Prodigy

The data prepared for Prodigy is stored in input. A custom recipe was developed to use Prodigy for text classification with keyword highlighting; see recipe_textcat_patterns.py.

To perform the annotations in prodigy, follow the instructions in Prodigy_Annotator_Guide.pdf.

Data Splits

We split the full dataset into train, validation and test sets with a 0.6-0.2-0.2 ratio, resulting in subcorpora of 1851, 530 and 534 samples, respectively. To ensure that all classes are present in every part of the split, we employed a customized stratification strategy.
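The split logic can be sketched as follows (an illustration of class-wise stratification, not the repository's exact code):

```python
import random
from collections import defaultdict

def stratified_split(samples, label_fn, ratios=(0.6, 0.2, 0.2), seed=42):
    """Class-wise split so that every class appears in every part of the
    split (a sketch of the customized stratification described above;
    classes with fewer than 3 samples still leave some parts empty)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s in samples:
        by_class[label_fn(s)].append(s)
    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n = len(items)
        # At least one example per part whenever the class has >= 3 items.
        n_train = max(1, int(n * ratios[0]))
        n_val = max(1, int(n * ratios[1])) if n > 2 else n - n_train
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```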

3. Annotation with GPT

The annotation with GPT using different prompting strategies follows the steps:

  1. We prepare the prompts in a JSON file and assign a unique ID to each.
  2. We read the test set (6-2-2_all_classes_enriched_with_kw/test.csv).
  3. For each title, abstract, and, where available, keywords (kw) in the test set, and for each prompting strategy, we send an individual GPT query and retrieve the predicted class.
  4. We save the predictions from the classification in predictions.
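The steps above can be sketched as a loop over prompts and test-set rows. This is a minimal sketch: `query_gpt` is a hypothetical stand-in for the OpenAI call made in the notebooks, and the CSV column names are assumptions.

```python
import csv
import json

def annotate(test_csv, prompts_json, query_gpt):
    """One GPT query per (article, prompting strategy) pair.
    `query_gpt(prompt_text, article_text)` is a hypothetical callable that
    returns a JSON string such as '{"gpt_label": "..."}'."""
    with open(prompts_json) as f:
        prompts = json.load(f)
    with open(test_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    predictions = []
    for prompt in prompts:
        for row in rows:
            text = f"{row['title']} {row['abstract']}"
            reply = query_gpt(prompt["text"], text)
            predictions.append({
                "pmid": row["pmid"],
                "prompt_id": prompt["id"],
                "gpt_label": json.loads(reply)["gpt_label"],
            })
    return predictions
```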

Multi-label classification

In this setup we directly classify the abstracts into one of the 15 classes. Note, however, that in the final evaluation we only considered 14 of the classes, excluding Animal-systematic-review due to the very small amount of annotated data for it.

  1. The relevant prompts are in prompt_strategies.json. An example prompt is given below:
   {
      "id": "P2_1",
      "text": "Determine which of these labels fits the text best: Clinical-study-protocol, Human-systematic-review, Non-systematic-review, Human-RCT-non-drug-intervention, Human-RCT-drug-intervention, Human-RCT-non-intervention, Human-case-report, Human-non-RCT-non-drug-intervention, Human-non-RCT-drug-intervention, Animal-systematic-review, Animal-drug-intervention, Animal-non-drug-intervention, Animal-other, In-vitro-study, Remaining. The classfication applies to the journal name, title, and/or abstract of a study. Respond in json format with the key: gpt_label.",
      "strategy_type": "zero_shot_applies_to"
    }
  2. The annotation is done in Annotate_with_GPT_Prompts_multi-label.ipynb.

Binary classification

In this setup we want to classify each abstract either as ANIMAL or OTHER.

  1. The relevant prompts are in prompt_strategies_binary.json. An example prompt is given below:
  {
      "id": "P1",
      "text": "Classify this text, choosing one of these labels: 'ANIMAL' if the text is related to animal, and 'OTHER' for any other study type. Respond in json format with the key: gpt_label.",
      "strategy_type": "zero_shot"
    }
  2. The annotation is done in Annotate_with_GPT_Prompts_binary.ipynb.

Hierarchical classification

In this setup we first classify each abstract as either ANIMAL or OTHER, and then classify the abstracts within each of these two classes into one of the fine-grained categories of that class.

  1. The relevant prompts are in prompt_strategies_hierarchical.json. An example prompt is given below:
    {
      "id": "P1_HIERARCHY",
      "text_animal": "Classify this text, choosing one of these labels: Animal-systematic-review, Animal-drug-intervention, Animal-non-drug-intervention, Animal-other. Respond in json format with the key: gpt_label.",
      "text_other": "Classify this text, choosing one of these labels: Clinical-study-protocol, Human-systematic-review, Non-systematic-review, Human-RCT-non-drug-intervention, Human-RCT-drug-intervention, Human-RCT-non-intervention, Human-case-report, Human-non-RCT-non-drug-intervention, Human-non-RCT-drug-intervention, In-vitro-study, Remaining. Respond in json format with the key: gpt_label.",
      "strategy_type": "zero_shot"
    },
  2. The annotation is done in Annotate_with_GPT_Prompts_hierarchical.ipynb.

GPT Results Evaluation

The evaluation of GPT predictions follows the steps:

  1. We read the predictions obtained from GPT.
  2. We map the predictions to their numerical representation, as well as the target annotated columns. The mapping includes a fuzzy matching of the GPT outputs to our target labels to take into account the different spelling variations that GPT sometimes produces.
  3. We evaluate the prompts, including obtaining a classification report and a confusion matrix between predicted and target labels for each prompting strategy. The confidence interval calculations are from the local package in scripts/confidenceinterval.
  4. We map the resulting dataframes to LaTeX tables that we can directly report in our paper.
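The fuzzy matching in step 2 can be done with the standard library; a minimal sketch (the label subset and cutoff are illustrative):

```python
import difflib

LABELS = ["Human-case-report", "Animal-drug-intervention", "In-vitro-study",
          "Remaining"]  # subset of the target labels, for illustration

def normalize_gpt_label(raw, labels=LABELS, cutoff=0.6):
    """Map a (possibly misspelled) GPT output to the closest target label,
    mirroring step 2 above; returns None when nothing is close enough."""
    matches = difflib.get_close_matches(raw.strip(), labels, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```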

Evaluation Notebooks

  1. Multi-label: Evaluation_GPT_multi-label_CI_LaTextables.ipynb
  2. Binary: Evaluation_GPT_binary_with_CI.ipynb
  3. Hierarchical: Evaluation_GPT_hierarchical_with_CI.ipynb

4. Annotation with BERT

We experimented with the following models from the HuggingFace library:

    models_to_fine_tune = [
                            'bert-base-uncased',
                            'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext',
                            'microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract',
                            'allenai/scibert_scivocab_uncased',
                            'dmis-lab/biobert-v1.1',
                            'michiyasunaga/BioLinkBERT-base',
                            'emilyalsentzer/Bio_ClinicalBERT',
                            ]

Hyperparameter optimization

We used the Weights & Biases library and its Sweeps functionality to automate and visualize the hyperparameter search. The sweep configuration was as follows:

    SWEEP_CONFIG = {
        'method': 'bayes',
        'metric': {'name': 'val_loss', 'goal': 'minimize'},
        'parameters': {
            'learning_rate': {'min': 1e-5, 'max': 1e-3},
            'batch_size': {'values': [8, 16, 32]},
            'weight_decay': {'min': 1e-5, 'max': 1e-3},
        },
    }
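Sweeps draws hyperparameter candidates from this space. As a rough illustration of what one candidate looks like, here is plain uniform sampling over such a parameter spec (W&B's 'bayes' method instead picks candidates from a probabilistic model of val_loss):

```python
import random

def sample_config(parameters, seed=None):
    """Draw one hyperparameter candidate from a Sweeps-style parameter
    space (uniform sampling for illustration only)."""
    rng = random.Random(seed)
    config = {}
    for name, spec in parameters.items():
        if "values" in spec:
            config[name] = rng.choice(spec["values"])   # discrete choice
        else:
            config[name] = rng.uniform(spec["min"], spec["max"])  # continuous
    return config
```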

The code for that is in hyperparam_optimization_binary.py and hyperparam_optimization_milti.py.

Models fine-tuning

The best hyperparameters were then used to fine-tune the models. The fine-tuning code is in finetune.py.

The log outputs from that process were saved in models/transformers/checkpoints.

Models Evaluation

The evaluation scripts for BERT can be found in evaluation.

The notebook performance_w_CI.ipynb contains the code to evaluate all BERT models. It also produces the confusion matrix and comparison plots of the best-performing GPT and BERT model.
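As an illustration of the kind of confidence interval reported, here is a generic percentile-bootstrap CI for accuracy (a sketch; the repository's CIs come from the local package in scripts/confidenceinterval):

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for accuracy: resample the
    test set with replacement n_boot times and take the alpha/2 and
    1 - alpha/2 percentiles of the resulting accuracies."""
    rng = random.Random(seed)
    n = len(y_true)
    accs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        accs.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    accs.sort()
    lo = accs[int((alpha / 2) * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```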

5. Related Work, Baselines

MeSH terms

There was an issue with some PMIDs missing from the enriched studies of the smaller classes: instead of a PMID they had a Rayyan software ID. We were able to obtain their PMIDs from the DOIs of the papers. The code for this is in ./data/Data_Preparation_and_Postprocessing.ipynb.

We then fetched the MeSH terms for each PMID using the commands below.

id_list=$(paste -sd, "enriched_data_pmids.txt")

efetch -db pubmed -id $id_list -format xml | xtract -pattern PubmedArticle -tab '^' -def "N/A" \
    -element MedlineCitation/PMID \
    -block MedlineCitation/MeshHeadingList/MeshHeading -sep "|" -element DescriptorName \
    > "./pmid_mesh_terms.txt"

An evaluation of using these MeSH terms as a binary classifier into the Animal and Other classes can be found in ./models/baselines/MeSH_Baseline.ipynb.
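A sketch of such a MeSH-based binary baseline, working on the '^'-delimited output of the command above (the check for the 'Animals' descriptor is an assumption about the rule; see MeSH_Baseline.ipynb for the actual one):

```python
def parse_mesh_line(line):
    """Parse one line of pmid_mesh_terms.txt: PMID ^ descriptor1|descriptor2|..."""
    pmid, _, terms = line.rstrip("\n").partition("^")
    return pmid, terms.split("|") if terms else []

def mesh_binary_label(mesh_terms):
    """Predict ANIMAL vs OTHER from a study's MeSH descriptors."""
    return "ANIMAL" if "Animals" in mesh_terms else "OTHER"
```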

GoldHamster dataset

The replication of the work of Neves M, Klippert A, Knöspel F, et al. can be found in a separate GitHub repository: https://github.com/Ineichen-Group/Preclinical_GoldHamster_Replication.

The analysis of the resulting label predictions is in ./models/baselines/StudyTypeTeller_vs_Goldhamster.ipynb.

Multi-Tagger

Cohen, Aaron M., et al. have released Multi-Tagger predictions over the full PubMed database. We downloaded the files from https://arrowsmith.psych.uic.edu/evidence_based_medicine/mt_download.html, which included:

  1. A column reference that includes the Multi-Tagger model names as well as their optimal F1 thresholds, ./models/baselines/multi_tagger/MultiTagger_Scorefile_layout.csv.
  2. Three model score files over different time periods up to 2024. These files contain scores between 0 and 1 for each Publication Type and Study Design for each article. We loaded those files into a local PostgreSQL DB in order to more easily filter them for the PMIDs in our dataset. The resulting filtered studies with their Multi-Tagger predictions are in: ./models/baselines/multi_tagger/multitagger_filtered_data_table_1_20250206.csv, ./models/baselines/multi_tagger/multitagger_filtered_data_table_2_20250206.csv, ./models/baselines/multi_tagger/multitagger_filtered_data_table_3_20250206.csv.
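Turning the score files into label predictions reduces to a per-label comparison against the optimal-F1 thresholds; a sketch (the label names and the 0.5 fallback are illustrative):

```python
def apply_thresholds(scores, thresholds):
    """Keep every label whose Multi-Tagger score (0-1) meets its
    optimal-F1 threshold from the score-file layout; labels without a
    known threshold fall back to 0.5 (an assumption)."""
    return sorted(label for label, score in scores.items()
                  if score >= thresholds.get(label, 0.5))
```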

The analysis of the resulting label predictions is in ./models/baselines/StudyTypeTeller_vs_MultiTagger.ipynb.

About

Project to develop classifiers for PubMed abstract types. Collaboration on an MSc thesis, 2023-2024.
