GO-GPT: Gene Ontology Prediction from Protein Sequences

GO-GPT is a decoder-only transformer model for predicting Gene Ontology (GO) terms from protein sequences. It combines ESM2 protein language model embeddings with an autoregressive decoder to generate GO term annotations across all three ontology aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).

Quick Start

  1. Clone the repository:
git clone https://github.com/YOUR_ORG/gogpt
cd gogpt
  2. Run the inference notebook or use Python directly:
import sys
sys.path.insert(0, "src")

from gogpt import GOGPTPredictor

# Load from HuggingFace (downloads ~4GB on first run)
predictor = GOGPTPredictor.from_pretrained("armansa1/gogpt-dev")

# Predict GO terms
predictions = predictor.predict(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH",
    organism="Homo sapiens"
)

print(predictions)
# {'MF': ['GO:0003674', 'GO:0005488', ...],
#  'BP': ['GO:0008150', 'GO:0008152', ...],
#  'CC': ['GO:0005575', 'GO:0110165', ...]}
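
The result is keyed by ontology aspect, so it can be flattened into rows for downstream tools. The following is a minimal sketch that assumes the return shape shown in the comment above (aspect codes mapping to lists of GO IDs). The helpers `predictions_to_rows` and `rows_to_tsv` and the `protein_id` label are hypothetical and not part of the gogpt package.

```python
import csv
import io


def predictions_to_rows(predictions, protein_id="query"):
    """Flatten {'MF': [...], 'BP': [...], 'CC': [...]} into (protein_id, aspect, go_id) tuples."""
    rows = []
    # Fixed aspect order keeps the output deterministic.
    for aspect in ("MF", "BP", "CC"):
        for go_id in predictions.get(aspect, []):
            rows.append((protein_id, aspect, go_id))
    return rows


def rows_to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["protein_id", "aspect", "go_id"])
    writer.writerows(rows)
    return buf.getvalue()


# Stand-in for the dictionary returned by predictor.predict(...)
example = {
    "MF": ["GO:0003674", "GO:0005488"],
    "BP": ["GO:0008150"],
    "CC": ["GO:0005575"],
}
rows = predictions_to_rows(example, protein_id="my_protein")
print(rows_to_tsv(rows))
```

One row per term makes it straightforward to concatenate results for many proteins into a single file.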

Model Architecture

| Component | Description |
|---|---|
| Protein Encoder | ESM2-3B (facebook/esm2_t36_3B_UR50D) |
| Decoder | 12-layer GPT with prefix causal attention |
| Embedding Dim | 900 |
| Attention Heads | 12 |
| Total Parameters | ~3.2B (3B ESM2 + 200M decoder) |
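
This README does not define "prefix causal attention". Under the usual prefix-LM convention, which is an assumption here, the prefix positions (for example the protein embeddings) attend to one another in both directions, while each generated GO token attends to the whole prefix and to earlier generated tokens only. The sketch below builds such a mask in plain Python. The function name and the toy sizes are illustrative, and the actual GO-GPT implementation may differ.

```python
def prefix_causal_mask(prefix_len, total_len):
    """Return mask[i][j] == True when position i may attend to position j.

    Positions 0..prefix_len-1 form the bidirectional prefix;
    the remaining positions are generated autoregressively.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            visible_prefix = j < prefix_len  # every position sees the full prefix
            causal = j <= i                  # otherwise only self and earlier positions
            row.append(visible_prefix or causal)
        mask.append(row)
    return mask


# Toy example: 2 prefix positions followed by 2 generated tokens.
mask = prefix_causal_mask(prefix_len=2, total_len=4)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

With `prefix_len=0` this reduces to an ordinary causal (lower-triangular) mask.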

Supported Organisms

GO-GPT supports organism-conditioned prediction for 200 organisms plus an <UNKNOWN> category (201 total). See organism_list.txt for the full list.

Common organisms include:

  • Homo sapiens
  • Mus musculus
  • Escherichia coli (various strains)
  • Saccharomyces cerevisiae
  • Arabidopsis thaliana
  • Drosophila melanogaster

For organisms not in the training set, predictions will use the <UNKNOWN> embedding.
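
The fallback can be pictured as a dictionary lookup with a default. This is a minimal sketch under stated assumptions: the mapping below is a toy stand-in, the IDs are invented for illustration, and the real schema of organism_mapper.json is not documented here. `resolve_organism_id` is a hypothetical helper, not a gogpt function.

```python
UNKNOWN_TOKEN = "<UNKNOWN>"

# Toy mapping for illustration only; the real name-to-ID table ships in organism_mapper.json.
toy_organism_to_id = {
    UNKNOWN_TOKEN: 0,
    "Homo sapiens": 1,
    "Mus musculus": 2,
}


def resolve_organism_id(name, organism_to_id):
    """Return the ID for a known organism, or the <UNKNOWN> ID otherwise."""
    return organism_to_id.get(name, organism_to_id[UNKNOWN_TOKEN])


print(resolve_organism_id("Homo sapiens", toy_organism_to_id))
print(resolve_organism_id("Danio rerio", toy_organism_to_id))
```

Matching in this sketch is exact, so a misspelled or differently capitalized name would also fall back to `<UNKNOWN>`. Checking names against organism_list.txt before predicting avoids that silently happening.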

Files in This Repository

| File | Description |
|---|---|
| model.ckpt | Model weights (PyTorch Lightning checkpoint) |
| config.yaml | Model architecture configuration |
| tokenizer_info.json | Token vocabulary metadata |
| go_tokenizer.json | GO term to token ID mapping |
| organism_mapper.json | Organism name to ID mapping |
| organism_list.txt | Human-readable list of 201 supported organisms |