# Joint NT-ESM2 DNA-Protein Models
This repository contains jointly trained Nucleotide Transformer (NT) and ESM2 models for DNA-protein sequence analysis.
## Model Components
### DNA Model (`dna/`)
- **Type**: Nucleotide Transformer for DNA sequences
- **Context**: 4096 tokens
- **Training**: Transcript-specific coding sequences
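The 4096-token context limits how much DNA fits in a single forward pass. Below is a rough sketch for splitting longer sequences. It assumes the tokenizer emits non-overlapping 6-mers plus one special token per sequence, as the published Nucleotide Transformer tokenizers do. This card does not state the k-mer size, so check the tokenizer in `dna/` before relying on it. `chunk_dna` is a hypothetical helper, not part of this repository.

```python
def chunk_dna(seq, max_tokens=4096, kmer=6, reserved_tokens=1, codon_aligned=True):
    """Split a DNA string into pieces that should fit the model context.

    Assumptions (not confirmed by the model card): one token covers `kmer`
    nucleotides and `reserved_tokens` special tokens are added per sequence.
    """
    max_nt = (max_tokens - reserved_tokens) * kmer
    if codon_aligned:
        # Keep chunk boundaries on codon boundaries for coding sequences
        max_nt -= max_nt % 3
    if max_nt <= 0:
        raise ValueError("max_tokens is too small for the given settings")
    return [seq[i:i + max_nt] for i in range(0, len(seq), max_nt)]


# With the assumed defaults, each chunk holds up to 24570 nucleotides
chunks = chunk_dna("ATG" * 10000)
print(len(chunks), [len(c) for c in chunks])
```

Sequences containing ambiguous bases such as `N` may tokenize into more tokens than this estimate, so it is worth confirming the length of `input_ids` after tokenization.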
### Protein Model (`protein/`)
- **Type**: ESM2 for protein sequences
- **Variant**: Large model
- **Training**: Corresponding protein sequences
## Usage
**IMPORTANT**: Both models are masked language models, so load them with `AutoModelForMaskedLM`. The DNA model uses the custom Nucleotide Transformer architecture, which requires `trust_remote_code=True`.
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "vsubasri/joint-nt-esm2-transcript-coding"

# Load DNA model - requires trust_remote_code for custom NT architecture
dna_model = AutoModelForMaskedLM.from_pretrained(repo_id, subfolder="dna", trust_remote_code=True)
dna_tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="dna", trust_remote_code=True)

# Load protein model
protein_model = AutoModelForMaskedLM.from_pretrained(repo_id, subfolder="protein")
protein_tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="protein")

# Example joint usage
dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
protein_seq = "MKRISLHHHHHHHQVTVRWD"

dna_inputs = dna_tokenizer(dna_seq, return_tensors="pt")
protein_inputs = protein_tokenizer(protein_seq, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    dna_outputs = dna_model(**dna_inputs)
    protein_outputs = protein_model(**protein_inputs)
```
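The models were trained on coding sequences paired with their corresponding proteins, so matched pairs are presumably the intended input. Below is a minimal sketch for deriving the protein from a DNA sequence. It assumes the standard genetic code and an in-frame coding sequence. The `translate` helper is illustrative and not part of this repository.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    # Ignore any trailing partial codon
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
print(translate(dna_seq))  # MKRISTTITTTITITTGNGAG
```

The translated string can be passed to `protein_tokenizer` when a matched DNA-protein pair is wanted.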
## Training Details
- **Joint Training**: Models trained together for cross-modal understanding
- **Batch Size**: 8
- **Data**: Transcript-specific coding sequences with corresponding proteins
- **Architecture**: Maintained original NT and ESM2 architectures
## Repository Structure
```
├── dna/                          # NT DNA model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── protein/                      # ESM2 protein model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── joint_config.json             # Joint model configuration
```
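Below is a minimal sketch for checking a local copy of the repository (for example, a cloned or downloaded directory) against the layout above. `missing_files` and `expected_files` are hypothetical helpers, not part of this repository.

```python
from pathlib import Path

# File names taken from the repository structure listed above
MODEL_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
]


def expected_files():
    """Relative paths of every file shown in the repository structure."""
    paths = [f"{sub}/{name}" for sub in ("dna", "protein") for name in MODEL_FILES]
    return paths + ["joint_config.json"]


def missing_files(root):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [p for p in expected_files() if not (root / p).is_file()]


# Example: report anything missing from the current directory
print(missing_files("."))
```

An empty list means both subfolders and `joint_config.json` are present, so loading with `subfolder="dna"` and `subfolder="protein"` should find the files it needs.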
## Citation
If you use these models, please cite the original Nucleotide Transformer (NT) and ESM2 papers, and reference this repository for the joint training methodology.