AZeroS

AZeroS (Auden Zero-instruction-tuned Speech-LLM) extends a text LLM to speech via Self-Generated Instruction-Free Tuning (SIFT). Both the LLM and the audio encoders stay frozen; only lightweight projection modules are trained on speech–text pairs. This yields strong semantic and paralinguistic performance at modest training cost and generalizes well to unseen instructions.

πŸ”— Paper: https://arxiv.org/pdf/2601.06086
πŸ”— Code: https://github.com/AudenAI/Auden/tree/main/examples/azeros
πŸ”— Model: https://huggingface.co/AudenAI/azeros
πŸ”— Auden Repo: https://github.com/AudenAI/Auden

πŸ” What Can This Model Do?

  • πŸŽ™οΈ Speech understanding (semantic content understanding and dialog)
  • 😊 Paralinguistic analysis (emotion, age, gender, etc.)

Quick Start

import torch
from model import AZerosModel  # provided in Auden/examples/azeros (see setup below)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AZerosModel.from_pretrained("AudenAI/azeros").to(device)

# One chat-style message list per wav file; model.audio_token_wrapped marks
# where the encoded audio is spliced into the prompt.
wav_files = ["speech1.wav", "speech2.wav"]
messages = [
    [
        {
            "role": "user",
            "content": f"{model.audio_token_wrapped} Please analyze speech content and paralinguistic information.",
        }
    ]
    for _ in wav_files
]

# Deterministic greedy decoding (do_sample=False), so the sampling
# parameters top_p/top_k/temperature are explicitly unset.
generate_config = {
    "max_new_tokens": 200,
    "num_beams": 1,
    "do_sample": False,
    "min_length": 1,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
    "top_p": None,
    "top_k": None,
    "temperature": None,
}

outputs = model.generate(wav_files, messages, **generate_config)
print(outputs)
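
The model expects 16 kHz audio (see Model Characteristics below). If your recordings use another sample rate, a resampling pass along these lines should work; this is a sketch using torchaudio, and the to_16k_mono helper and output naming are ours, not part of the Auden API.

import torchaudio

TARGET_SR = 16_000

def to_16k_mono(in_path: str, out_path: str) -> str:
    """Resample an audio file to 16 kHz mono before passing it to AZeroS."""
    waveform, sr = torchaudio.load(in_path)  # (channels, samples)
    if waveform.size(0) > 1:                 # downmix multi-channel to mono
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    torchaudio.save(out_path, waveform, TARGET_SR)
    return out_path

wav_files = [to_16k_mono(p, p.replace(".wav", ".16k.wav"))
             for p in ["speech1.wav", "speech2.wav"]]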

Auden Setup (Required)

This model relies on the Auden codebase for loading and inference:

git clone https://github.com/AudenAI/Auden.git
cd Auden
pip install -e .
cd examples/azeros

πŸ“Œ Model Characteristics

  • Input: Raw audio waveform (16 kHz) or text
  • Output: Text responses regarding to the input
  • Backend LLM: Qwen2.5-7B-Instruct
  • Encoders: TTA and Auden-Voice
  • Architecture: Frozen LLM + frozen audio encoders + lightweight projection modules
  • Training paradigm: Self-Generated Instruction-Free Tuning (SIFT)

πŸ“Š Evaluation

VoiceBench

| Model | AlpacaEval | CommonEval | WildVoice | SD-QA | BBH | AdvBench | IFEval | OBQA | MMSU | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| *Text-only model* | | | | | | | | | | |
| Qwen2.5 | 4.66 | 4.55 | 4.62 | 62.03 | 80.00 | 99.04 | 70.14 | 84.84 | 71.57 | 82.69 |
| Qwen2.5 (TN) | 4.61 | 4.53 | 4.56 | 63.84 | 56.30 | 98.85 | 66.11 | 74.07 | 64.51 | 77.52 |
| *Cascaded system* | | | | | | | | | | |
| Whisper+GPT-4o | 4.80 | 4.47 | 4.62 | 75.77 | 87.20 | 98.27 | 76.51 | 92.97 | 81.69 | 87.80 |
| Whisper+Qwen2.5 | 4.64 | 4.33 | 4.21 | 58.50 | 52.85 | 98.27 | 63.99 | 78.24 | 69.00 | 76.05 |
| *End-to-end Speech-LLM* | | | | | | | | | | |
| GPT-4o | 4.78 | 4.49 | 4.58 | 75.50 | 84.10 | 98.65 | 76.02 | 89.23 | 80.25 | 86.75 |
| Moshi | 2.01 | 1.60 | 1.30 | 15.64 | 47.40 | 44.23 | 10.12 | 25.93 | 24.04 | 29.51 |
| Phi-4-multimodal | 3.81 | 3.82 | 3.56 | 39.78 | 61.80 | 100.00 | 45.35 | 65.93 | 42.19 | 64.32 |
| GLM-4-Voice | 3.97 | 3.42 | 3.18 | 36.98 | 52.80 | 88.08 | 25.92 | 53.41 | 39.75 | 56.48 |
| Qwen2-Audio | 3.42 | 3.29 | 2.76 | 31.65 | 53.00 | 99.04 | 26.35 | 48.35 | 36.14 | 53.77 |
| DeSTA2.5 | 3.73 | 2.52 | 3.30 | 46.47 | 62.40 | 97.69 | 65.47 | 72.75 | 58.56 | 66.04 |
| Qwen2.5-Omni | 3.88 | 3.77 | 3.52 | 46.75 | 63.70 | 97.31 | 40.19 | 81.54 | 61.45 | 68.26 |
| Qwen3-Omni-30B | 4.74 | 4.54 | 4.58 | 76.90 | 80.40 | 99.30 | 77.80 | 89.70 | 68.10 | 85.49 |
| AZeroS (ours) | 4.44 | 4.18 | 3.91 | 60.22 | 56.30 | 98.65 | 61.29 | 72.09 | 59.01 | 73.13 |

AIRBench

| Model | Gender | Emotion | Age | LID | Entity | Intent | Avg | Chat |
|---|---|---|---|---|---|---|---|---|
| *Cascaded system* | | | | | | | | |
| Whisper+GPT-4o | 21.90 | 59.50 | 41.10 | 96.80 | 69.80 | 87.70 | 62.80 | 7.54 |
| Whisper+Qwen2.5 | 28.36 | 50.80 | 36.40 | 88.00 | 73.60 | 82.70 | 59.98 | 7.34 |
| *End-to-end Speech-LLM* | | | | | | | | |
| GPT-4o | * | 49.10 | * | 76.00 | 61.60 | 85.80 | * | 7.53 |
| Gemini2.5-pro | 90.70 | 60.70 | 34.10 | 99.10 | 68.50 | 92.20 | 74.22 | 8.52 |
| SALMONN | 35.50 | 29.90 | 48.70 | 28.10 | 51.70 | 36.70 | 38.43 | 6.16 |
| GLM-4-Voice | 23.91 | 22.95 | 18.70 | 25.40 | 27.90 | 21.10 | 23.33 | 5.53 |
| Qwen2-Audio | 64.71 | 48.15 | 23.10 | 77.80 | 87.00 | 84.70 | 64.24 | 7.20 |
| DeSTA2.5 | 84.24 | 64.30 | 65.60 | 97.30 | 65.20 | 83.70 | 76.72 | 7.57 |
| Qwen2.5-Omni | 89.76 | 54.85 | 44.80 | 89.70 | 79.70 | 88.60 | 74.57 | 6.97 |
| Qwen3-Omni-30B | 91.11 | 62.20 | 36.90 | 97.70 | 80.40 | 90.70 | 76.50 | 7.85 |
| AZeroS (ours) | 86.75 | 71.45 | 61.30 | 84.80 | 73.60 | 85.60 | 77.25 | 8.28 |

Entries marked with * have no reported score.

For multiple-choice tasks, an additional prompt is appended so that the model reliably outputs only its choice: β€œPlease make your choice among A/B/C/D and do not output other texts.”
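
Concretely, reusing the model and message format from Quick Start, the constraint is appended to each user turn (the mcq_questions list here is a hypothetical illustration):

CHOICE_PROMPT = "Please make your choice among A/B/C/D and do not output other texts."

mcq_questions = [
    "Which emotion does the speaker convey? A. happy  B. sad  C. angry  D. neutral",
]
messages = [
    [{
        "role": "user",
        "content": f"{model.audio_token_wrapped} {q} {CHOICE_PROMPT}",
    }]
    for q in mcq_questions
]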

⚠️ Limitations

  • Trained on public datasets; performance may degrade on out-of-domain audio.
  • Not designed for safety-critical applications.

Citation

If you use AZeroS in your research, please cite:

@article{shao2026azeros,
  title={AZEROS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning},
  author={Shao, Yiwen and Liu, Wei and Li, Jiahong and Wang, Tianzi and Wei, Kun and Yu, Meng and Yu, Dong},
  journal={arXiv preprint arXiv:2601.06086},
  year={2026}
}