
	# Joint NT-ESM2 DNA-Protein Models

	This repository contains jointly trained Nucleotide Transformer (NT) and ESM2 models for DNA-protein sequence analysis.

	## Model Components

	### DNA Model (`dna/`)
	- Type: Nucleotide Transformer for DNA sequences
	- Context: 4096 tokens
	- Training: Transcript-specific coding sequences
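
To get a feel for what a 4096-token context means in nucleotides, the sketch below estimates token counts. It assumes the tokenizer follows the original Nucleotide Transformer scheme (non-overlapping 6-mers, leftover bases as single-nucleotide tokens, one prepended CLS token). That is not confirmed for this checkpoint, and `estimate_nt_tokens` and `fits_context` are hypothetical helpers, so use the real tokenizer for exact counts.

```python
K = 6              # assumed k-mer size (original Nucleotide Transformer tokenization)
MAX_TOKENS = 4096  # context length stated above


def estimate_nt_tokens(seq: str, k: int = K) -> int:
    """Rough token count: full k-mers + leftover single bases + 1 CLS token.

    Assumes the sequence contains only A/C/G/T; ambiguous bases such as N
    are tokenized differently and would raise the real count.
    """
    n = len(seq)
    return n // k + n % k + 1


def fits_context(seq: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the estimated token count is within the context window."""
    return estimate_nt_tokens(seq) <= max_tokens


# 66 nt -> 11 six-mers, no leftover bases, plus CLS
print(estimate_nt_tokens("ATG" * 22), fits_context("ATG" * 22))
```

Under these assumptions, a 4096-token window covers roughly 24 kb of unambiguous sequence, far more than most individual coding sequences.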

	### Protein Model (`protein/`)
	- Type: ESM2 for protein sequences
	- Variant: Large model
	- Training: Corresponding protein sequences

	## Usage

	IMPORTANT: Both models are masked language models, so load them with `AutoModelForMaskedLM`. The DNA model uses the custom Nucleotide Transformer architecture, which requires `trust_remote_code=True`.

	```python
	import torch
	from transformers import AutoModelForMaskedLM, AutoTokenizer

	REPO_ID = "vsubasri/joint-nt-esm2-transcript-coding"

	# Load DNA model - requires trust_remote_code for custom NT architecture
	dna_model = AutoModelForMaskedLM.from_pretrained(REPO_ID, subfolder="dna", trust_remote_code=True)
	dna_tokenizer = AutoTokenizer.from_pretrained(REPO_ID, subfolder="dna", trust_remote_code=True)

	# Load protein model (standard ESM2, no remote code needed)
	protein_model = AutoModelForMaskedLM.from_pretrained(REPO_ID, subfolder="protein")
	protein_tokenizer = AutoTokenizer.from_pretrained(REPO_ID, subfolder="protein")

	# Example joint usage: a coding sequence and its translation (stop codon omitted)
	dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
	protein_seq = "MKRISTTITTTITITTGNGAG"

	dna_inputs = dna_tokenizer(dna_seq, return_tensors="pt")
	protein_inputs = protein_tokenizer(protein_seq, return_tensors="pt")

	# Inference only, so skip gradient tracking
	with torch.no_grad():
	    dna_outputs = dna_model(**dna_inputs)
	    protein_outputs = protein_model(**protein_inputs)

	# Masked-LM logits: (batch, sequence_length, vocab_size)
	print(dna_outputs.logits.shape, protein_outputs.logits.shape)
	```
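
Because the models were trained on coding sequences paired with their proteins, it can help to confirm that a DNA/protein pair actually corresponds before running both models. The following is a minimal sketch, assuming the standard genetic code and a sequence that is in frame from its first base. `translate` is a hypothetical helper, not part of this repository.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Codons containing anything other than A/C/G/T are emitted as "X".
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
print(translate(dna_seq))  # MKRISTTITTTITITTGNGAG
```

Comparing `translate(dna_seq)` with the protein string passed to the protein model is a cheap way to catch mismatched pairs.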

	## Training Details

	- Joint Training: DNA and protein models trained together for cross-modal understanding
	- Batch Size: 8
	- Data: Transcript-specific coding sequences paired with their corresponding protein sequences
	- Architecture: Original NT and ESM2 architectures retained

	## Repository Structure

	```
	├── dna/                          # NT DNA model
	│   ├── config.json
	│   ├── model.safetensors
	│   ├── tokenizer_config.json
	│   ├── vocab.txt
	│   └── special_tokens_map.json
	├── protein/                      # ESM2 protein model
	│   ├── config.json
	│   ├── model.safetensors
	│   ├── tokenizer_config.json
	│   ├── vocab.txt
	│   └── special_tokens_map.json
	└── joint_config.json             # Joint model configuration
	```
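
After downloading the repository to a local directory, the layout above can be checked before loading. This is a minimal sketch using only the standard library. `missing_files` is a hypothetical helper, and the expected file list is copied from the tree above.

```python
from pathlib import Path
from typing import List

# Files expected inside each model subfolder, per the tree above.
MODEL_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
]
EXPECTED = [f"{sub}/{name}" for sub in ("dna", "protein") for name in MODEL_FILES]
EXPECTED.append("joint_config.json")


def missing_files(root: str) -> List[str]:
    """Return the expected relative paths that are not present under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]
```

Calling `missing_files("path/to/local/copy")` (the path is a placeholder) returns an empty list when every file in the tree is present.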

	## Citation

	If you use these models, please cite the original Nucleotide Transformer (NT) and ESM2 papers, along with this repository for the joint training methodology.