# Joint NT-ESM2 DNA-Protein Models

This repository contains jointly trained Nucleotide Transformer (NT) and ESM2 models for DNA-protein sequence analysis.

## Model Components

### DNA Model (`dna/`)
- **Type**: Nucleotide Transformer for DNA sequences
- **Context**: 4096 tokens
- **Training**: Transcript-specific coding sequences

### Protein Model (`protein/`)
- **Type**: ESM2 for protein sequences
- **Variant**: Large model
- **Training**: Corresponding protein sequences

## Usage

**IMPORTANT**: Both models are masked language models. The DNA model uses the Nucleotide Transformer architecture which requires `trust_remote_code=True`.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load DNA model - requires trust_remote_code for custom NT architecture
dna_model = AutoModelForMaskedLM.from_pretrained(
    "vsubasri/joint-nt-esm2-transcript-coding",
    subfolder="dna",
    trust_remote_code=True,
)
dna_tokenizer = AutoTokenizer.from_pretrained(
    "vsubasri/joint-nt-esm2-transcript-coding",
    subfolder="dna",
    trust_remote_code=True,
)

# Load protein model
protein_model = AutoModelForMaskedLM.from_pretrained(
    "vsubasri/joint-nt-esm2-transcript-coding",
    subfolder="protein",
)
protein_tokenizer = AutoTokenizer.from_pretrained(
    "vsubasri/joint-nt-esm2-transcript-coding",
    subfolder="protein",
)

# Example joint usage
dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
protein_seq = "MKRISLHHHHHHHQVTVRWD"

dna_inputs = dna_tokenizer(dna_seq, return_tensors="pt")
protein_inputs = protein_tokenizer(protein_seq, return_tensors="pt")

dna_outputs = dna_model(**dna_inputs)
protein_outputs = protein_model(**protein_inputs)
```

## Training Details

- **Joint Training**: Models trained together for cross-modal understanding
- **Batch Size**: 8
- **Data**: Transcript-specific coding sequences with corresponding proteins
- **Architecture**: Maintained original NT and ESM2 architectures

## Repository Structure

```
├── dna/                        # NT DNA model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── protein/                    # ESM2 protein model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── joint_config.json           # Joint model configuration
```

## Citation

If you use these models, please cite the original NT and ESM2 papers along with your joint training methodology.
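
## Example: Deriving a Paired Protein Sequence

The training data pairs each transcript coding sequence with its corresponding protein. When only the coding sequence is at hand, one way to obtain a matching protein input is to translate it with the standard genetic code. The following is a minimal sketch under that assumption; the helper name `translate_cds` and the choice to map unknown codons to `X` are illustrative and not part of this repository, and the actual preprocessing used for training may differ.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna_seq: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Hypothetical helper: codons containing non-ACGT characters become 'X'.
    """
    seq = dna_seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    residues = []
    for i in range(0, len(seq), 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":  # stop codon ends the protein
            break
        residues.append(aa)
    return "".join(residues)


if __name__ == "__main__":
    print(translate_cds("ATGAAACGCTGA"))  # MKR
```

The returned string can be passed to `protein_tokenizer` in the same way as `protein_seq` in the Usage section, so that the DNA and protein inputs describe the same transcript.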