| # Joint NT-ESM2 DNA-Protein Models | |
| This repository contains jointly trained Nucleotide Transformer (NT) and ESM2 models for DNA-protein sequence analysis. | |
| ## Model Components | |
| ### DNA Model (`dna/`) | |
| - **Type**: Nucleotide Transformer for DNA sequences | |
| - **Context**: 4096 tokens | |
| - **Training**: Transcript-specific coding sequences | |
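Coding sequences longer than the context window have to be split before they are passed to the DNA model. The sketch below is one way to do that and rests on assumptions not stated in this README: that the tokenizer uses non-overlapping 6-mers, prepends one special token, and tokenizes leftover bases one at a time. The helper names `estimate_token_count` and `split_into_windows` are hypothetical.

```python
import math


def estimate_token_count(seq: str, k: int = 6) -> int:
    """Rough token count: one special token, full k-mers, then single leftover bases (assumed)."""
    full, leftover = divmod(len(seq), k)
    return 1 + full + leftover


def split_into_windows(seq: str, max_tokens: int = 4096, k: int = 6) -> list[str]:
    """Split a coding sequence into codon-aligned windows that fit the token budget."""
    step = math.lcm(3, k)  # keep window boundaries on both codon and k-mer boundaries
    # Reserve 1 token for the special token and k - 1 for possible leftover bases.
    max_nt = ((max_tokens - 1 - (k - 1)) * k // step) * step
    if max_nt <= 0:
        raise ValueError("max_tokens is too small for this k")
    return [seq[i:i + max_nt] for i in range(0, len(seq), max_nt)]


windows = split_into_windows("ATG" * 10000)  # a 30,000 nt toy sequence
print([len(w) for w in windows])
print([estimate_token_count(w) for w in windows])
```

The estimate is only a guide; the actual count for a window is `len(dna_tokenizer(window)["input_ids"])`, which is worth checking before relying on the window size.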
| ### Protein Model (`protein/`) | |
| - **Type**: ESM2 for protein sequences | |
| - **Variant**: Large model | |
| - **Training**: Corresponding protein sequences | |
| ## Usage | |
| **IMPORTANT**: Both models are masked language models. The DNA model uses the Nucleotide Transformer architecture, which is loaded through custom model code and therefore requires `trust_remote_code=True`. | |
| ```python | |
| from transformers import AutoModelForMaskedLM, AutoTokenizer | |
| # Load DNA model - requires trust_remote_code for custom NT architecture | |
| dna_model = AutoModelForMaskedLM.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna", trust_remote_code=True) | |
| dna_tokenizer = AutoTokenizer.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna", trust_remote_code=True) | |
| # Load protein model | |
| protein_model = AutoModelForMaskedLM.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein") | |
| protein_tokenizer = AutoTokenizer.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein") | |
| # Example joint usage: a coding sequence and its translated protein (stop codon dropped) | |
| dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA" | |
| protein_seq = "MKRISTTITTTITITTGNGAG" | |
| dna_inputs = dna_tokenizer(dna_seq, return_tensors="pt") | |
| protein_inputs = protein_tokenizer(protein_seq, return_tensors="pt") | |
| dna_outputs = dna_model(**dna_inputs) | |
| protein_outputs = protein_model(**protein_inputs) | |
| ``` | |
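The forward passes above return masked-language-model logits. If a fixed-size embedding per sequence is wanted instead, one common option is to mean-pool the last hidden layer over non-padding tokens. The sketch below shows only that pooling step. `mean_pool` is a hypothetical helper, and the commented usage assumes both models accept `output_hidden_states=True`, which is standard in `transformers` but not confirmed here for the custom NT code.

```python
import torch


def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
    return summed / counts


# Assumed usage with the objects loaded above (not executed here):
# with torch.no_grad():
#     out = dna_model(**dna_inputs, output_hidden_states=True)
# dna_embedding = mean_pool(out.hidden_states[-1], dna_inputs["attention_mask"])

# Toy check with a dummy batch: the third position is padding and is ignored.
toy_hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])
print(mean_pool(toy_hidden, toy_mask))
```

The DNA and protein embeddings may have different dimensions, so comparing them directly would need a projection that this repository does not describe.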
| ## Training Details | |
| - **Joint Training**: Models trained together for cross-modal understanding | |
| - **Batch Size**: 8 | |
| - **Data**: Transcript-specific coding sequences with corresponding proteins | |
| - **Architecture**: Maintained original NT and ESM2 architectures | |
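Because the training data pairs each coding sequence with its protein, inputs at inference time are most meaningful when the two sequences correspond. A minimal sketch of deriving the protein from a coding sequence with the standard genetic code follows. `translate_cds` is a hypothetical helper and not part of this repository, and the actual preprocessing used for training is not documented here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # unknown codons (e.g. with N) become X
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
print(translate_cds(dna_seq))  # MKRISTTITTTITITTGNGAG
```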
| ## Repository Structure | |
| ``` | |
| ├── dna/                        # NT DNA model | |
| │   ├── config.json | |
| │   ├── model.safetensors | |
| │   ├── tokenizer_config.json | |
| │   ├── vocab.txt | |
| │   └── special_tokens_map.json | |
| ├── protein/                    # ESM2 protein model | |
| │   ├── config.json | |
| │   ├── model.safetensors | |
| │   ├── tokenizer_config.json | |
| │   ├── vocab.txt | |
| │   └── special_tokens_map.json | |
| └── joint_config.json           # Joint model configuration | |
| ``` | |
| ## Citation | |
| If you use these models, please cite the original Nucleotide Transformer and ESM2 papers, along with this repository for the joint training methodology. | |