DeepEST

This is the repository with the implementation of DeepEST, a multimodal method to perform protein function prediction for bacterial species. Given the characteristics of their genome, the functional characterization of bacteria cannot rely solely on protein sequences. In fact, it requires the use of data sources capable of capturing different dimensions of the protein functionality, i.e., protein structures and expression-location patterns.

The architecture of DeepEST, visualized in the following figure, comprises two main modules:

the fine-tuned structure module [1], and
the expression-location module.

Subsequently, the two modalities are integrated through a learnable linear combination.
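
As a rough illustration of this fusion step, the sketch below blends per-GO-term scores from the two modules with a single mixing weight. The function name `combine_scores`, the sigmoid parameterization of the weight, and the use of one scalar weight for all GO terms are assumptions made for illustration only; the actual parameterization is whatever the code under train_model/ defines.

```python
import math

def combine_scores(structure_scores, expr_loc_scores, raw_weight):
    """Blend two score vectors with a weight kept in (0, 1) by a sigmoid.

    raw_weight stands in for a parameter that training would update;
    here it is simply a number passed in by the caller (assumption).
    """
    if len(structure_scores) != len(expr_loc_scores):
        raise ValueError("both modules must score the same GO terms")
    alpha = 1.0 / (1.0 + math.exp(-raw_weight))  # mixing weight in (0, 1)
    return [alpha * s + (1.0 - alpha) * e
            for s, e in zip(structure_scores, expr_loc_scores)]

# raw_weight = 0 gives alpha = 0.5, i.e. a plain average of the two modules
combined = combine_scores([0.9, 0.2], [0.5, 0.4], raw_weight=0.0)
print(combined)
```

A larger `raw_weight` shifts the prediction towards the structure module, a smaller one towards the expression-location module.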

Repository organization

1) DeepEST implementation

The Python implementation of DeepEST is available under train_model/. The data to test the implementation is available with the release. Once the data folder has been placed into train_model/, you can use the following command to train the model:

python3 train_DeepEST.py \
--split 0 \
--splitdir splits/Staphylococcus_aureus_MSSA476/ \
--config configs/fold0/config_combined_Staphylococcus_aureus_MSSA476.yaml \
--expr-loc data/expr-loc/Staphylococcus_aureus_MSSA476_genomic_info.csv \
--structures data/all_proteins.pkl \
--label data/matrix_label/label_matrix_Staphylococcus_aureus_MSSA476_thr10_new.csv \
--conversion_dict data/conversion_dicts/conv_Staphylococcus_aureus_MSSA476.txt \
--outdir predictions/ \
--species Staphylococcus_aureus_MSSA476

Dependencies

The code only supports python3 and requires the following packages and submodules:

torch (tested on 2.0.0)
pytorch_lightning (tested on 1.9.4)
flash (tested on 0.8.1.post0)
sklearn (tested on 1.0.2)
numpy (tested on 1.21.5)
pandas (tested on 1.4.4)

2) DeepEST annotations on hypothetical proteins

The folder annotations_hypothetical_proteins/ contains the annotations generated by DeepEST on the nearly 7,000 unannotated proteins from the 25 bacteria.

Data

We study a selection of 25 bacterial species, namely:

In the following, we describe how to collect the data (gene expression and location data, protein structures, and GO terms annotations) we use to perform the study described in the manuscript.

Expression and location data

As expression-location data, we use the previously reported PATHOgenex dataset [2], which contains both the genomic location information and the gene expression levels of 105,088 genes in 32 clinically relevant human bacterial pathogens under 11 in vivo mimicking stress conditions and an unexposed control. Specifically, as input features for our model, we consider the log-fold change values derived from the differential expression analysis of these 11 stress conditions in comparison to the control.
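
To make the notion of a log-fold change concrete, the following sketch computes a log2 ratio of a stress condition against the control for one gene. It is only an illustration of the quantity: the values in the dataset come from a differential expression analysis, which involves statistical modelling not shown here. The function name, the pseudocount, the condition names, and the expression values are all hypothetical.

```python
import math

def log2_fold_change(stress_value, control_value, pseudocount=1.0):
    """log2 ratio of stress over control; the pseudocount avoids log(0)."""
    return math.log2((stress_value + pseudocount) / (control_value + pseudocount))

# Hypothetical normalized expression values for one gene
control = 99.0
stress = {"condition_A": 399.0, "condition_B": 24.0}  # placeholder condition names

# One feature per stress condition, each relative to the same control
features = {name: log2_fold_change(value, control) for name, value in stress.items()}
print(features)
```

Positive values indicate higher expression under stress than in the control, negative values lower expression.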

Protein structures

We use protein structures downloaded from the AlphaFold database [3].

GO terms annotations

GO annotations are retrieved from the UniProt database [4] (accessed on July 12, 2023) using the RefSeq protein identifier of every known protein and the taxonomic reference code of a given pathogen's strain. To retrieve a particular GO term's children or ancestors we use the GO ontology released on October 7, 2022.
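
Retrieving the ancestors of a GO term amounts to a traversal of the ontology graph. The sketch below shows one way to do this on a toy child-to-parents map; the function name `get_ancestors` and the identifiers in `toy_parents` are placeholders, not real GO terms, and a real run would first load the ontology release into such a map.

```python
def get_ancestors(term, parents):
    """Return all ancestors of `term` in a DAG given a child -> parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # Continue upwards through the parents of this ancestor
            stack.extend(parents.get(current, ()))
    return seen

# Toy ontology with placeholder identifiers (not real GO terms)
toy_parents = {
    "GO:C": ["GO:A", "GO:B"],
    "GO:B": ["GO:A"],
    "GO:A": ["GO:ROOT"],
}
print(sorted(get_ancestors("GO:C", toy_parents)))
```

Children can be obtained the same way by inverting the map before traversing it.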

DeepEST End-to-End Tutorial

PDB → Structure Embeddings → Genomic Features → GO-Term Prediction

This tutorial describes how to run DeepEST end-to-end on user-defined data, starting from protein structures (PDB files) and producing GO-term predictions. It also documents how to use DeepFRI and foldseek-based structure similarity for robust model training and evaluation.

Overview

DeepEST predicts Gene Ontology (GO) terms for bacterial proteins by combining three components:

Structure embeddings: Graph Convolutional Networks (GCN) from DeepFRI applied to PDB files
Genomic/Expression features: Position encodings derived from genomic and transcriptomic data
Supervised learning: Neural network mapping combined features to GO term annotations

The pipeline supports multiple evaluation strategies:

Standard random splits
Foldseek-based structure-aware splits (recommended for realistic evaluation)

Note: Large data files used to train DeepEST (structures, splits, BED files, etc.) are available on Zenodo.

Key Components

1. Structure Embeddings (DeepFRI-GCN)

Extracted from PDB files using pre-trained DeepFRI's Graph Convolutional Network layers
Produces vectorial feature representations of proteins
Intermediate representations (not final DeepFRI predictions)

2. Genomic/Expression-Location Features

Polar position encoding: Encodes gene/protein position relative to transcription start
Expression data: Log fold changes under various stress conditions
Critical functions:
- get_info(): Extracts genomic information from BED files
- polar(): Computes polar position encoding (essential for DeepEST)

3. Transfer Learning Strategy

Uses original DeepFRI weights as initialization
Fine-tunes final layers on your labeled dataset
Decoupled strategy (extract → predict) for computational efficiency

Required Inputs

For each species/strain (example: Achromobacter_xylosoxidans_SOLR10), prepare:

Input	Description	File Format	Required
PDB structures	Protein 3D structures	`.pdb` files	✓
GO label matrix	Protein × GO term annotations	`.csv`	✓
Splits directory	Train/validation/test folds	`train/`, `val/`, `test/` folders	✓
Conversion dictionary	Gene/protein ID mapping	`genes.txt`	✓
Genomic features	Expression/location information	`.csv`	✓
Protein ID mapping	Links genes to protein structures	Column in features file	✓

Recommended Resources

These archives contain pre-generated configurations and data:

Available Downloads

config_species.zip (see example S. aureus)
├── foldseek/                          # Structure-similarity-aware splits
│   ├── foldseek_fold0/
│   │   ├── config.yaml
│   │   ├── train_split.txt
│   │   └── ...
│   └── foldseek_fold1/
├── fold0/, fold1/, ...                # Standard random splits
│   ├── config.yaml
│   └── ...
└── default.yaml                       # Default configuration

scripts.zip
├── 6_position_encoding.py            # ⭐ CORE: Generate genomic features
├── 1_genes_with_expression.py        # Filter genes by expression availability
├── 2_label_matrix.py                 # Build label matrix with GO term annotations
├── 3_label_matrix_all_species.py     # Combine multi-species label matrices
├── 4_reference_split_generation.py   # Create train/val/test splits
├── 5_split_generation_all_species.py # Multi-species split generation
├── 7_create_X_all_species.py         # Create combined feature matrix
├── 8_create_expression_wo_IDs.py     # Clean expression data
├── 9_print_dimensions.py             # Verify dataset dimensions
├── 10_update_yaml.py                 # Generate config files per fold/split
└── 10_update_yaml_combined.py        # Config generation for combined models

genomic_features.zip
├── 6_position_encoding.py
├── genomic_data/
│   ├── chr1.bed
│   └── ...
└── metadata.json                     # Data mapping reference

scripts_str_and_clus.zip
├── 1_prepare_pdb_folder.py           # Consolidate PDB files
├── 2_create_foldseekDB.sh            # Build Foldseek database
├── 3_run_foldseek.sh                 # Run structure-based clustering
├── 4_check_clustering.py             # Generate structure-aware splits
├── compare_with_deepFRI_training_set.sh
├── options.txt                       # Foldseek parameter reference
└── info.txt                          # Environment setup

Installation & Setup

Prerequisites

# Python 3.7+
pip install pandas numpy scikit-learn pytorch-lightning pyyaml

# DeepFRI requirements (see deepfri_extract_features.py dependencies)

# Foldseek (for structure-based clustering, optional)
# Download from: https://github.com/steineggerlab/foldseek
export PATH=$(pwd)/foldseek/bin/:$PATH

Directory Structure

project_root/
├── gene_function/
│   └── 1st_step/
│       ├── deepfri_extract_features.py    # ⭐ Extract structure embeddings
│       ├── merge_extracted_structures.py  # Merge multi-species embeddings
│       └── deepfri_masked.py              # ⭐ Run DeepFRI with transfer learning
├── scripts/
│   ├── 6_position_encoding.py            # Generate genomic features
│   └── [other preprocessing scripts]
├── input/
│   ├── structures/Achromobacter_xylosoxidans_SOLR10/
│   │   ├── protein1.pdb
│   │   └── ...
│   ├── labels/
│   │   └── label_matrix_Achromobacter_xylosoxidans_SOLR10_thr10_new.csv
│   ├── splits/Achromobacter_xylosoxidans_SOLR10/
│   │   ├── genes.txt
│   │   ├── train/
│   │   ├── val/
│   │   └── test/
│   └── genome/
│       └── chromosomes.bed
├── output/
│   ├── [embeddings and predictions]
└── features/
    └── expr_loc_Achromobacter_xylosoxidans_SOLR10.csv

Complete Workflow

Step 1 — Extract Structure Embeddings (DeepFRI-GCN)

This step applies DeepFRI's pre-trained GCN layers to your PDB files.

Run Extraction

cd gene_function/1st_step/

# Extract embeddings for a single species
python3 deepfri_extract_features.py \
  --data_dir ../input/structures/Achromobacter_xylosoxidans_SOLR10/ \
  --out_dir ../output/

Output:

../output/
├── Achromobacter_xylosoxidans_SOLR10.pkl     # Species-specific embeddings
├── Borrelia_burgdorferi_B31.pkl
└── ...

What it does:

Reads all .pdb files from the input directory
Applies DeepFRI's trained GCN layers
Produces vectorial representations for each protein (intermediate features, NOT final predictions)
Saves as pickle file for efficient loading

Important: These extracted features are intermediate DeepFRI representations used by both DeepEST and downstream DeepFRI predictions.

Step 2 — (Optional) Merge Multiple Species Embeddings

Only required if you extract embeddings for multiple species and want a single unified file.

python3 merge_extracted_structures.py \
  --input_dir ../output/ \
  --output_file ../data/all_proteins.pkl

Output: ../data/all_proteins.pkl (combined embeddings for all species)

When to use:

Training on multi-species data
Simplifying file management
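
For orientation, merging can be pictured as combining several per-species dictionaries into one. The sketch below assumes each .pkl file holds a dict mapping protein IDs to embedding vectors; that layout, the function name `merge_embeddings`, and the demo file names are assumptions, and this is not the code of merge_extracted_structures.py.

```python
import pickle
import tempfile
from pathlib import Path

def merge_embeddings(input_dir, output_file):
    """Merge every *.pkl dict in input_dir into one dict and pickle it."""
    merged = {}
    for pkl_path in sorted(Path(input_dir).glob("*.pkl")):
        with open(pkl_path, "rb") as handle:
            embeddings = pickle.load(handle)
        # Refuse to silently overwrite a protein that appears twice
        duplicates = merged.keys() & embeddings.keys()
        if duplicates:
            raise ValueError(f"duplicate protein IDs in {pkl_path.name}: {sorted(duplicates)}")
        merged.update(embeddings)
    with open(output_file, "wb") as handle:
        pickle.dump(merged, handle)
    return merged

# Demo with two tiny placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    species_dir = tmp / "per_species"
    species_dir.mkdir()
    demo = [("species_a", {"prot1": [0.1, 0.2]}), ("species_b", {"prot2": [0.3, 0.4]})]
    for name, data in demo:
        with open(species_dir / f"{name}.pkl", "wb") as handle:
            pickle.dump(data, handle)
    merged = merge_embeddings(species_dir, tmp / "all_proteins.pkl")
    print(sorted(merged))
```

Writing the merged file outside the input directory keeps it from being picked up again on a later run.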

Step 3 — Generate Genomic Features

Genomic features encode gene position, expression levels, and genomic context.

Core Script: `6_position_encoding.py`

Key Functions:

def get_info(bed_path, species_genome):
    """
    Extract genomic information from BED files
    
    Inputs:
        bed_path: Path to BED file with gene annotations
        species_genome: Genome size information
    
    Returns:
        Dictionary with:
        - Strand information
        - Gene start/end positions
        - Chromosome assignments
    """
    pass

def polar(protein_id, gene_info, genome_size):
    """
    Compute polar position encoding
    
    Encodes gene/protein position relative to transcription start
    using circular coordinates (cos/sin of angle).
    
    Critical feature for DeepEST performance.
    
    Inputs:
        protein_id: Protein identifier
        gene_info: Genomic information for gene
        genome_size: Reference genome size
    
    Returns:
        Position encoding features:
        - cos-start, sin-start: Transcription start position
        - cos-end, sin-end: Gene end position
        - strand: DNA strand (+1 or -1)
        - genome: Chromosome (0) vs Plasmid (1)
    """
    pass
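
The two stubs above can be filled in along the following lines. This is a minimal sketch, assuming standard six-column BED lines and a circular replicon whose coordinates map linearly to an angle; the function names `parse_bed_line` and `polar_encoding`, the `is_plasmid` argument, and the example gene are hypothetical, and the sketch does not swap start and end for genes on the minus strand. The output keys follow the docstring of polar().

```python
import math

def parse_bed_line(line):
    """Parse the first six standard BED columns of one line."""
    chrom, start, end, name, _score, strand = line.rstrip("\n").split("\t")[:6]
    return {"chrom": chrom, "start": int(start), "end": int(end),
            "name": name, "strand": strand}

def polar_encoding(gene, genome_size, is_plasmid=False):
    """Map start/end coordinates onto the unit circle of a circular genome."""
    start_angle = 2.0 * math.pi * gene["start"] / genome_size
    end_angle = 2.0 * math.pi * gene["end"] / genome_size
    return {
        "cos-start": math.cos(start_angle),
        "sin-start": math.sin(start_angle),
        "cos-end": math.cos(end_angle),
        "sin-end": math.sin(end_angle),
        "strand": 1 if gene["strand"] == "+" else -1,
        "genome": 1 if is_plasmid else 0,  # chromosome (0) vs plasmid (1)
    }

# Placeholder gene on a hypothetical 4,000 bp replicon
gene = parse_bed_line("chr1\t1000\t2000\tgeneX\t0\t+")
encoding = polar_encoding(gene, genome_size=4000)
print(encoding)
```

Using cosine and sine together keeps positions near the origin of a circular genome close to each other, which a raw linear coordinate would not.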

Using Pre-Generated Features

If using genomic_features.zip:

unzip genomic_features.zip

# Now you have:
# - 6_position_encoding.py
# - genomic_data/chr*.bed
# - metadata.json (ID mapping)

Run Feature Generation

python3 6_position_encoding.py \
  --inraw ../data/raw_expression_files/ \
  --inexp ../data/processed_files/features/

Output:

../data/processed_files/features/
├── Achromobacter_xylosoxidans_SOLR10_genomic_info.csv
├── Borrelia_burgdorferi_B31_genomic_info.csv
└── ...

Important: Ensure all identifiers (gene IDs, protein IDs) match across:

Feature matrix
Label matrix
Conversion dictionary (genes.txt)
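
A quick way to check this is to compare the ID sets from the three sources. The sketch below assumes the IDs have already been read into Python lists; the function name `report_id_mismatches` and the example IDs are hypothetical.

```python
def report_id_mismatches(feature_ids, label_ids, conversion_ids):
    """Return, per source, the IDs that are not shared by all three sources."""
    sources = {
        "features": set(feature_ids),
        "labels": set(label_ids),
        "conversion_dict": set(conversion_ids),
    }
    shared = set.intersection(*sources.values())
    return {name: sorted(ids - shared) for name, ids in sources.items()}

mismatches = report_id_mismatches(
    feature_ids=["g1", "g2", "g3"],
    label_ids=["g1", "g2"],
    conversion_ids=["g1", "g2", "g4"],
)
print(mismatches)
```

An empty list for every source means the three files agree on the set of identifiers.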

Step 4 — (Optional) Run DeepFRI Baseline Predictions

Map extracted embeddings through DeepFRI's final layers WITHOUT transfer learning.

python3 deepfri_masked.py \
  --split 0 \
  --outdir ../output/predictions/ \
  --protein_path ../output/Achromobacter_xylosoxidans_SOLR10.pkl \
  --matrix_label_path ../input/labels/label_matrix_Achromobacter_xylosoxidans_SOLR10_thr10_new.csv \
  --split_dir ../input/splits/Achromobacter_xylosoxidans_SOLR10/ \
  --species Achromobacter_xylosoxidans_SOLR10 \
  --conversion_dict ../input/splits/Achromobacter_xylosoxidans_SOLR10/genes.txt

What it does:

Takes pre-extracted embeddings from Step 1
Applies original DeepFRI final layers (unfinetuned)
Produces baseline GO-term predictions
Useful for comparison with transfer learning approach

Step 5 — (Optional) Run DeepFRI with Transfer Learning

Fine-tune DeepFRI's final layers on your labeled data.

python3 deepfri_masked.py \
  --split 0 \
  --outdir ../output/predictions_transfer_learning/ \
  --protein_path ../output/Achromobacter_xylosoxidans_SOLR10.pkl \
  --matrix_label_path ../input/labels/label_matrix_Achromobacter_xylosoxidans_SOLR10_thr10_new.csv \
  --split_dir ../input/splits/Achromobacter_xylosoxidans_SOLR10/ \
  --species Achromobacter_xylosoxidans_SOLR10 \
  --conversion_dict ../input/splits/Achromobacter_xylosoxidans_SOLR10/genes.txt \
  --transfer_learning  # Enable fine-tuning (not supported in some versions)

Transfer Learning Process:

Initialize final layers with original DeepFRI weights
Fine-tune on training set using trainer.fit(model, datamodule=data)
Monitor validation set with early stopping
Evaluate on held-out test set

Output:

../output/predictions_transfer_learning/
├── deepfri_predictions_fold0_split0.csv
├── training_metrics_fold0_split0.txt
└── ...

Decoupled Strategy Note: We extract embeddings first (Step 1), then predict (Steps 4-5) for computational efficiency. You can couple these steps if needed, with some code refactoring.

Step 6 — Train DeepEST (Main Model)

Combine structure embeddings and genomic features to predict GO terms.

Training Command

python3 train_DeepEST.py \
  --split 0 \
  --splitdir ../input/splits/Achromobacter_xylosoxidans_SOLR10/ \
  --config ../config_species/fold0/config.yaml \
  --expr-loc ../features/expr_loc_Achromobacter_xylosoxidans_SOLR10.csv \
  --structures ../output/Achromobacter_xylosoxidans_SOLR10.pkl \
  --label ../input/labels/label_matrix_Achromobacter_xylosoxidans_SOLR10_thr10_new.csv \
  --conversion_dict ../input/splits/Achromobacter_xylosoxidans_SOLR10/genes.txt \
  --outdir ../output/deepest_predictions/ \
  --species Achromobacter_xylosoxidans_SOLR10
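
When training several species or splits, it can help to assemble this command programmatically. The sketch below builds the argument list using only the flags shown above; the helper name `build_train_command`, the `root` argument, and the assumption that the config folder index equals the split index are hypothetical.

```python
import shlex

def build_train_command(species, split, root=".."):
    """Assemble the train_DeepEST.py argument list for one species and split."""
    args = {
        "--split": str(split),
        "--splitdir": f"{root}/input/splits/{species}/",
        # Assumption: fold<k>/config.yaml is the config for split k
        "--config": f"{root}/config_species/fold{split}/config.yaml",
        "--expr-loc": f"{root}/features/expr_loc_{species}.csv",
        "--structures": f"{root}/output/{species}.pkl",
        "--label": f"{root}/input/labels/label_matrix_{species}_thr10_new.csv",
        "--conversion_dict": f"{root}/input/splits/{species}/genes.txt",
        "--outdir": f"{root}/output/deepest_predictions/",
        "--species": species,
    }
    command = ["python3", "train_DeepEST.py"]
    for flag, value in args.items():
        command += [flag, value]
    return command

command = build_train_command("Achromobacter_xylosoxidans_SOLR10", split=0)
print(shlex.join(command))
# To launch the run: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.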

Required Files

✓ Structure embeddings (from Step 1)
✓ Genomic features (from Step 3)
✓ GO label matrix
✓ Train/val/test split indices
✓ Conversion dictionary (protein ID → gene ID mapping)
✓ Config file (from config_species.zip)

Expected Output

../output/deepest_predictions/
├── deepest_predictions_fold0_split0.csv      # Main predictions
├── deepest_predictions_fold0_split1.csv      # [if multiple splits]
├── training_log_fold0_split0.txt
├── validation_metrics_fold0_split0.csv
└── config_fold0_split0.yaml                  # Used configuration

Model Architecture

The configuration file controls:

model_params:
  input_dim: 449                    # Total feature dimension
  hidden_sizes: [512, 16, 512]     # Hidden layer sizes
  lr: 0.0001                        # Learning rate
  p: 0.0                            # Dropout rate
  weight_decay: 0.0                 # L2 regularization

data_params:
  batch_size: 32

training_params:
  epochs: 100
  early_stopping_patience: 10
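
To get a feel for the model size these settings imply, the sketch below counts the parameters of a plain fully connected network with the `input_dim` and `hidden_sizes` from the example above. It assumes dense layers with biases and an output layer with one unit per GO term; `output_dim=100` is a placeholder, and the real DeepEST architecture may differ from this simplification.

```python
def count_mlp_parameters(input_dim, hidden_sizes, output_dim):
    """Count weights and biases of a fully connected network."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    # Each layer has n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# input_dim and hidden_sizes from the example config; output_dim is a placeholder
n_params = count_mlp_parameters(449, [512, 16, 512], output_dim=100)
print(n_params)
```

The narrow middle layer (16 units) acts as a bottleneck, so most parameters sit in the first and last layers.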

Step 7 — Structure-Aware Evaluation (Foldseek-Based Splits)

Use structure similarity to create realistic train/test splits, preventing data leakage from structurally similar proteins.

Why Use Foldseek Splits?

Standard random splits may include structurally very similar proteins in both training and test sets, leading to overly optimistic performance estimates. Foldseek-based splits account for this.

Workflow

1. Prepare PDB Files (Consolidate All Structures)

cd scripts_str_and_clus/

python3 1_prepare_pdb_folder.py \
  --input ../data/processed_files/structure/species_folders/ \
  --output ../data/processed_files/structure/all_species/

2. Create Foldseek Database

bash 2_create_foldseekDB.sh

3. Run Foldseek Clustering

bash 3_run_foldseek.sh
# Uses: lddt-threshold 0.7 (structures > 70% similar are clustered)

Output:

../data/processed_files/clustering_foldseek/
├── cluster_all_species_cluster.tsv         # Cluster assignments
└── ...

4. Generate Structure-Aware Splits

python3 4_check_clustering.py \
  --genes ../data/processed_files/splits/all_species/ \
  --clusters ../data/processed_files/clustering_foldseek/ \
  --outdir ../data/processed_files/splits_foldseek/ \
  --species all_species \
  --folds 5

Output:

../data/processed_files/splits_foldseek/all_species/
├── train_index_0.npy
├── val_index_0.npy
├── test_index_0.npy
└── [indices 1-4]
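
The idea behind structure-aware splits is that all members of a cluster end up in the same fold. The sketch below shows one simple way to do that from two-column cluster assignments (representative, member), the layout Foldseek writes to its cluster TSV. The greedy balancing strategy and the function names are assumptions for illustration; 4_check_clustering.py may proceed differently.

```python
from collections import defaultdict

def read_clusters(lines):
    """Group members by representative from 'representative<TAB>member' lines."""
    clusters = defaultdict(list)
    for line in lines:
        representative, member = line.rstrip("\n").split("\t")
        clusters[representative].append(member)
    return clusters

def assign_folds(clusters, n_folds):
    """Greedily place whole clusters, largest first, into the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, members in ordered:
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members)
    return folds

# Toy cluster assignments with placeholder protein IDs
toy_tsv = ["p1\tp1", "p1\tp2", "p1\tp3", "p4\tp4", "p4\tp5", "p6\tp6"]
folds = assign_folds(read_clusters(toy_tsv), n_folds=2)
print(folds)
```

Because clusters are never divided, structurally similar proteins cannot appear on both sides of a train/test boundary.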

Use Foldseek Splits in Training

python3 train_DeepEST.py \
  --split 0 \
  --splitdir ../data/processed_files/splits_foldseek/all_species/ \
  --config ../config_species/foldseek/foldseek_fold0/config.yaml \
  --expr-loc ../features/expr_loc.csv \
  --structures ../output/all_proteins.pkl \
  --label ../input/labels/label_matrix.csv \
  --conversion_dict ../data/processed_files/splits_foldseek/all_species/genes.txt \
  --outdir ../output/foldseek_predictions/ \
  --species all_species

Available Foldseek Configs

Pre-computed optimal hyperparameters are in:

config_species/foldseek/
├── foldseek_fold0/config.yaml
├── foldseek_fold1/config.yaml
└── ...

These configs use structure-aware hyperparameters optimized via grid search.

Complete End-to-End Pipeline (Bash Script example)

#!/bin/bash
# Run the full DeepEST workflow from PDB files to predictions.
# This example covers only some steps; its step numbers do not necessarily match the tutorial's.

set -e  # Exit on error

echo "Starting DeepEST end-to-end workflow..."

SPECIES="Achromobacter_xylosoxidans_SOLR10"
INPUT_STRUCTURES="../input/structures/$SPECIES/"
INPUT_LABELS="../input/labels/"
INPUT_SPLITS="../input/splits/$SPECIES/"
INPUT_GENOME="../input/genome/"
OUTPUT_DIR="../output/"
FEATURES_DIR="../features/"

# Create output directories
mkdir -p "$OUTPUT_DIR" "$FEATURES_DIR"

# ============================================================================
# STEP 1: Extract DeepFRI-GCN embeddings
# ============================================================================
echo "Step 1: Extracting structure embeddings..."
cd gene_function/1st_step/

python3 deepfri_extract_features.py \
  --data_dir $INPUT_STRUCTURES \
  --out_dir $OUTPUT_DIR

cd ../..

# ============================================================================
# STEP 2: Generate genomic features
# ============================================================================
echo "Step 2: Generating genomic features..."

python3 scripts/6_position_encoding.py \
  --inraw ../data/raw_gene_annotations/ \
  --inexp $FEATURES_DIR

# ============================================================================
# STEP 3: (Optional) Run DeepFRI baseline
# ============================================================================
echo "Step 3: Running DeepFRI baseline predictions (optional)..."

cd gene_function/1st_step/

python3 deepfri_masked.py \
  --split 0 \
  --outdir ${OUTPUT_DIR}deepfri_baseline/ \
  --protein_path ${OUTPUT_DIR}${SPECIES}.pkl \
  --matrix_label_path ${INPUT_LABELS}label_matrix_${SPECIES}_thr10_new.csv \
  --split_dir $INPUT_SPLITS \
  --species $SPECIES \
  --conversion_dict ${INPUT_SPLITS}genes.txt

cd ../..

# ============================================================================
# STEP 4: Train DeepEST
# ============================================================================
echo "Step 4: Training DeepEST model..."

python3 train_DeepEST.py \
  --split 0 \
  --splitdir $INPUT_SPLITS \
  --config ../config_species/fold0/config.yaml \
  --expr-loc ${FEATURES_DIR}expr_loc_${SPECIES}.csv \
  --structures ${OUTPUT_DIR}${SPECIES}.pkl \
  --label ${INPUT_LABELS}label_matrix_${SPECIES}_thr10_new.csv \
  --conversion_dict ${INPUT_SPLITS}genes.txt \
  --outdir ${OUTPUT_DIR}deepest_predictions/ \
  --species $SPECIES

# ============================================================================
# STEP 5: (Optional) Train with foldseek-based splits
# ============================================================================
echo "Step 5: Training with structure-aware splits (optional)..."

python3 train_DeepEST.py \
  --split 0 \
  --splitdir ../data/processed_files/splits_foldseek/all_species/ \
  --config ../config_species/foldseek/foldseek_fold0/config.yaml \
  --expr-loc ${FEATURES_DIR}expr_loc_${SPECIES}.csv \
  --structures ${OUTPUT_DIR}all_proteins.pkl \
  --label ${INPUT_LABELS}label_matrix_all_species_thr10.csv \
  --conversion_dict ../data/processed_files/splits_foldseek/all_species/genes.txt \
  --outdir ${OUTPUT_DIR}foldseek_predictions/ \
  --species $SPECIES

echo "DeepEST workflow completed successfully!"
echo ""
echo "Results:"
echo "  - DeepEST predictions: ${OUTPUT_DIR}deepest_predictions/"
echo "  - Foldseek predictions: ${OUTPUT_DIR}foldseek_predictions/"
echo "  - DeepFRI baseline: ${OUTPUT_DIR}deepfri_baseline/"

Understanding the Pipeline Architecture

Data Flow Diagram

┌─────────────────┐
│   PDB Files     │
└────────┬────────┘
         │
         ▼
┌──────────────────────────────┐
│ Step 1: DeepFRI-GCN Extract  │
│ (deepfri_extract_features.py)│
└────────┬─────────────────────┘
         │
         ▼
┌──────────────────────────────┐
│   Structure Embeddings       │
│   (*.pkl files)              │
└────────┬─────────────────────┘
         │
         ├─────────────────┬────────────────────┐
         │                 │                    │
         ▼                 ▼                    ▼
    ┌─────────┐     ┌────────────┐       ┌────────────┐
    │ DeepFRI │     │ Merge      │       │ DeepEST    │
    │ Baseline│     │ Species    │       │ (Combine   │
    │         │     │ Embeddings │       │  Features) │
    └────┬────┘     └──────┬─────┘       └──────┬─────┘
         │                 │                    │
         ▼                 ▼                    ▼
┌──────────────────────────────────────────────────────┐
│             Genomic Features                         │
│  (6_position_encoding.py)                            │
│  - Expression data                                   │
│  - Polar position encoding (cos/sin start/end)       │
│  - Strand & genome information                       │
└──────────────────────┬───────────────────────────────┘
                       │
                       ▼
         ┌─────────────────────────┐
         │   GO Label Matrix       │
         │   (protein × GO terms)  │
         └────────┬────────────────┘
                  │
                  ▼
         ┌─────────────────────────┐
         │  Train/Val/Test Splits  │
         │                         │
         │  Standard: random split │
         │  Foldseek: structure-   │
         │           aware split   │
         └────────┬────────────────┘
                  │
                  ▼
    ┌─────────────────────────────────────┐
    │  Train DeepEST / DeepFRI (Transfer) │
    │  - Early stopping on validation set │
    │  - Fine-tune final layers           │
    └─────────────────────────────────────┘

Key Differences: Standard vs. Foldseek Splits

Aspect	Standard Splits	Foldseek Splits
Strategy	Random stratified split	Structure-similarity aware clustering
Prevents Leakage	By GO term distribution	By structural similarity
Realistic?	May be optimistic	Realistic (accounts for homologs)
Use Case	Quick prototyping	Publication-quality evaluation
Configuration	`fold*/config.yaml`	`foldseek/foldseek_fold*/config.yaml`

Overfitting Prevention & Evaluation

Standard Evaluation Protocol

┌───────────────────────────────────────────────────────────────────┐
│ Split into Train / Validation / Test                              │
├───────────────────────────────────────────────────────────────────┤
│ Training Set          Validation Set          Test Set            │
│ (learn weights)       (tune hyperparams)      (final eval)        │
│ • Backprop enabled    • No backprop           • No backprop       │
│ • Update weights      • Early stop            • Report metrics    │
│ • Use training loss   • Use val loss (stop    • Final scores      │
│                         if no improvement)    • Report only!      │
└───────────────────────────────────────────────────────────────────┘

Strategies If Overfitting Occurs

Strategy	Description	When to Use
Stronger regularization	Increase dropout rate, weight decay, L2 penalty	Small datasets (<500 proteins)
More training data	Data augmentation, add unlabeled proteins	Limited labeled annotations
Revisit split strategy	Switch to foldseek-based splits	Overlapping/homologous proteins
Early stopping	Increase patience, monitor validation loss	Standard approach (always use)
Model architecture	Reduce hidden layer sizes	Overfitting on small datasets
Feature selection	Use top features by importance	High-dimensional feature spaces

Monitoring Training

Metrics to Track:

# In config file or logs:
- Training loss (should decrease)
- Validation loss (should decrease, then plateau)
- Test loss (final evaluation only)

# Early stopping criteria:
- If val_loss doesn't improve for N epochs → stop
- Restore best model from validation phase
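
The early stopping rule described above can be sketched in a few lines. The function name `best_epoch_with_early_stopping` and the example losses are hypothetical; in practice the training framework's own early stopping callback would handle this.

```python
def best_epoch_with_early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stop_epoch = len(val_losses) - 1  # default: training ran to the end
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch

losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
print(best_epoch_with_early_stopping(losses, patience=3))
```

With a patience of 3, training stops at epoch 5 and the model from epoch 2 is restored; the later improvement at epoch 6 is never seen, which is the trade-off of a short patience.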

Troubleshooting

Common Issues

Issue	Symptom	Solution
Embedding file not found	`FileNotFoundError: *.pkl`	Check that Step 1 completed; verify paths in config
Label matrix dimension mismatch	`Shape mismatch: features vs labels`	Verify protein IDs match in all files; check filtering steps
Memory errors during training	`CUDA out of memory` or `RAM exceeded`	Reduce `batch_size` in config (e.g., 32→16); use gradient accumulation
Poor prediction performance	AUROC < 0.6	Review data quality; check for data leakage; try foldseek splits
NaN values in loss	`Loss is NaN`	Check for invalid entries in features/labels; normalize features
Protein ID mismatches	Keys missing across files	Run `9_print_dimensions.py` to verify; use `4_reference_split_generation.py`
Foldseek not found	`Command not found: foldseek`	Add to PATH: `export PATH=$(pwd)/foldseek/bin/:$PATH`
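
When diagnosing poor performance, it can be useful to recompute AUROC for a single GO term by hand. The sketch below uses the pairwise definition (the probability that a positive example is scored above a negative one); the function name `auroc` and the example labels and scores are hypothetical, and the quadratic loop is only suitable for small checks.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

print(auroc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

A value near 0.5 means the scores carry no ranking information for that term; GO terms with no positives or no negatives in the test set cannot be evaluated this way.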

Validation Checklist

Before training, verify:

All .pdb files are readable and in correct directory
GO label matrix shape: (num_proteins, num_GO_terms)
Protein IDs match across: features, labels, splits, conversion dict
No NaN values in feature matrices
Train/val/test indices don't overlap
Config file input_dim matches feature matrix columns
Config file output dimension matches GO term count
Conversion dictionary has entries for all proteins
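
Two of these checks, non-overlapping split indices and the absence of NaN values, are easy to automate. The sketch below works on plain Python lists; the function names and the example data are hypothetical, and real index files or feature matrices would need to be loaded first.

```python
import math

def check_splits_disjoint(train, val, test):
    """Return the indices shared by any two of the three splits."""
    train, val, test = set(train), set(val), set(test)
    return sorted((train & val) | (train & test) | (val & test))

def rows_with_nan(matrix):
    """Return the row indices of a list-of-lists matrix that contain NaN."""
    return [i for i, row in enumerate(matrix) if any(math.isnan(x) for x in row)]

overlap = check_splits_disjoint(train=[0, 1, 2, 3], val=[4, 5], test=[5, 6])
bad_rows = rows_with_nan([[0.1, 0.2], [float("nan"), 0.3]])
print(overlap, bad_rows)
```

Both results should be empty lists before training starts.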

Scripts Reference

Preprocessing Scripts (from `scripts.zip`)

Use these in sequence to prepare data from raw files:

# 1. Filter genes by expression availability
python3 1_genes_with_expression.py

# 2. Create GO label matrix (per species)
python3 2_label_matrix.py

# 3. Combine multi-species label matrices
python3 3_label_matrix_all_species.py

# 4. Create initial train/val/test splits
python3 4_reference_split_generation.py

# 5. Generate multi-species splits (stratified)
python3 5_split_generation_all_species.py

# 6. Generate genomic features ⭐ CRITICAL
python3 6_position_encoding.py

# 7. Combine all species feature matrices
python3 7_create_X_all_species.py

# 8. Clean expression data (remove NaNs)
python3 8_create_expression_wo_IDs.py
python3 8_create_expression_wo_IDs_all_species.py

# 9. Print and verify dataset dimensions
python3 9_print_dimensions.py --X <features.csv> --label <labels.csv>

# 10. Generate config files per fold/split
python3 10_update_yaml.py
python3 10_update_yaml_combined.py  # For combined models

Structure-Based Clustering Scripts (from `scripts_str_and_clus.zip`)

Use for foldseek-based evaluation splits:

# 1. Consolidate all PDB files into single directory
python3 1_prepare_pdb_folder.py

# 2. Create Foldseek database
bash 2_create_foldseekDB.sh

# 3. Run structure-based clustering
bash 3_run_foldseek.sh

# 4. Generate splits from clusters
python3 4_check_clustering.py --species all_species

Advanced Topics

Multi-Species Training

For training on multiple bacterial species simultaneously:

python3 train_DeepEST.py \
  --split 0 \
  --splitdir ../data/processed_files/splits/all_species/ \
  --config ../config_species/fold0/config.yaml \
  --expr-loc ../features/expr_loc_all_species.csv \
  --structures ../output/all_proteins.pkl \
  --label ../input/labels/label_matrix_all_species_thr10.csv \
  --conversion_dict ../data/processed_files/splits/all_species/genes.txt \
  --outdir ../output/multi_species_predictions/ \
  --species all_species

Transfer Learning from DeepFRI

The pipeline supports initializing from DeepFRI weights:

# Original DeepFRI weights are loaded from:
# data/deepfri_model.hdf5

# Fine-tuning occurs in:
trainer.fit(model, datamodule=data)

# Results in improved performance on your specific labels

Using Pre-Extracted Embeddings

If you already have DeepFRI embeddings from another source:

# Place *.pkl files in output directory
# Then skip Step 1 and proceed to Step 3 (Generate Genomic Features)
python3 train_DeepEST.py --structures <your_embeddings.pkl> ...

Foldseek Reference

Foldseek: van Kempen, M., et al. Fast and accurate protein structure search with Foldseek. Nature Biotechnology (2023).
- Paper: https://doi.org/10.1038/s41587-023-01773-0
- Code: https://github.com/steineggerlab/foldseek

Citation

If you use DeepEST in your research, please cite:

@article{muzio2024bacterial,
  title={Bacterial protein function prediction via multimodal deep learning},
  author={Muzio, Giulia and Adamer, Michael and Fernandez, Leyden and Borgwardt, Karsten and Avican, Kemal},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

Support & Issues

For questions or issues:

Check the Troubleshooting section
Review preprocessing script documentation
Verify dataset dimensions with 9_print_dimensions.py
Open an issue on GitHub with:
- Dataset dimensions (proteins × features, proteins × GO terms)
- Error message (full traceback)
- Steps to reproduce

References

[1] Gligorijević, V., et al.: Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021)

[2] Avican, K., et al.: RNA atlas of human bacterial pathogens uncovers stress dynamics linked to infection. Nat. Commun. 12, 3282 (2021)

[3] Varadi, M., et al.: AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50(D1), D439–D444 (2021)

[4] The UniProt Consortium: UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51(D1), D523–D531 (2022)

Contacts

For queries on the implementation and data, please contact:

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement (No 813533). Additional funding was provided by the Swedish Research Council (No. 2021-02466), Kempestiftelserna (JCK22-0017), Insamlingsstiftelsen, and the Medical Faculty at Umeå University to K Avican, and by an Icelab Multidisciplinary Postdoctoral Fellowship to L Fernandez.
