AFM-4.5B-Base-KDA-Only

A research variant of AFM-4.5B-Base where all attention layers have been replaced with Kimi Delta Attention (KDA) through knowledge distillation. This model contains no full-attention layers.

⚠️ Research Model: This is an experimental model released for research purposes. For production use, see AFM-4.5B.

More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

Overview

This model explores whether full attention can be completely replaced with linear attention mechanisms. Using DistillKit, we distilled the original AFM-4.5B-Base (teacher) into a pure-KDA architecture (student).
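
In practice the distillation is handled by DistillKit; the snippet below is only a generic sketch of logit-level distillation, included to make the teacher/student setup concrete. The temperature, loss weighting, and the absence of any hidden-state matching are illustrative assumptions, not the DistillKit recipe.

```python
# Generic sketch of logit-level knowledge distillation (NOT the DistillKit
# implementation): a frozen full-attention teacher supervises the KDA student
# through a temperature-scaled KL term blended with ordinary cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    vocab = student_logits.size(-1)

    # Soft targets: KL(teacher || student) at temperature T, scaled by T^2.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: standard next-token cross-entropy on the training tokens.
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1), ignore_index=-100)

    return alpha * kd + (1.0 - alpha) * ce
```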

Key characteristics:

  • All 24 layers use KDA instead of full attention (see the recurrence sketch after this list)
  • Trained up to 32k sequence length
  • Linear memory scaling with sequence length
  • Smoother long-context degradation compared to hybrid architectures
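
Because no layer uses softmax attention, there is no KV cache that grows with the prompt: each KDA head carries a fixed-size matrix state that is decayed and corrected with a delta-rule write at every token. The sketch below is a deliberately simplified, per-head, token-by-token illustration of that family of recurrences; the real KDA layer uses fine-grained gating and a chunkwise parallel kernel, so treat this as intuition rather than the layer's implementation.

```python
# Simplified per-head sketch of a gated delta-rule recurrence of the kind KDA
# builds on (illustrative only; scalar gates, no chunked kernel).
import torch

def gated_delta_recurrence(q, k, v, beta, g):
    """q, k: (T, d_k); v: (T, d_v); beta, g: (T,) write-strength and decay gates."""
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)                              # fixed-size state: memory does not grow with T
    out = []
    for t in range(T):
        S = g[t] * S                                       # forget/decay the old state
        pred = k[t] @ S                                    # value currently stored under key k_t
        S = S + beta[t] * torch.outer(k[t], v[t] - pred)   # delta-rule correction
        out.append(q[t] @ S)                               # read out with the query
    return torch.stack(out)                                # (T, d_v)

# Tiny smoke test with random inputs.
T, d_k, d_v = 8, 4, 4
o = gated_delta_recurrence(torch.randn(T, d_k), torch.randn(T, d_k),
                           torch.randn(T, d_v), torch.rand(T), torch.rand(T))
print(o.shape)  # torch.Size([8, 4])
```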

Architecture

| Component | Details |
|---|---|
| Parameters | 4.5B |
| Attention Type | Kimi Delta Attention (all layers) |
| Positional Encoding | None (inherent to KDA) |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |

Benchmark Results

Performance compared to the teacher model and hybrid configurations:

| Benchmark | Teacher (Full Attn) | KDA-Only |
|---|---|---|
| MMLU (Avg) | 63.1% | 55.8% |
| ARC-Challenge | 55.6% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% |
| GSM8K (Math) | 52.1% | 26.8% |
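
The card does not spell out the evaluation harness behind these numbers. If you want to run roughly comparable evaluations yourself, one common option is EleutherAI's lm-evaluation-harness; the task names, default few-shot settings, and batch size below are assumptions, not the exact configuration used for the table above.

```python
# One way to get roughly comparable numbers with the EleutherAI
# lm-evaluation-harness (pip install lm-eval). Task names, few-shot counts,
# and batch size are assumptions, not the setup behind the table above.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=arcee-ai/AFM-4.5B-Base-KDA-Only,dtype=bfloat16",
    tasks=["mmlu", "arc_challenge", "hellaswag", "gsm8k"],
    batch_size=8,
)
print(json.dumps(results["results"], indent=2, default=str))
```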

Key Findings

  • Knowledge benchmarks: KDA-Only performs within the statistical noise of hybrid approaches on MMLU, ARC, and HellaSwag
  • Math performance: Larger drop on GSM8K compared to hybrid, though this may recover with longer training
  • Long-context behavior: Degrades more smoothly than hybrid models beyond training length—no cliff at 32k, just gradual falloff

Long-Context Performance (NIAH)

The pure-KDA model shows interesting long-context characteristics:

  • 100% single-needle retrieval up to 65k (beyond training length!)
  • Multi-key retrieval begins to degrade around 4k, but the falloff is smooth
  • No sharp "cliff" like hybrid models exhibit past 32k

This behavior aligns with expectations for state-space-like architectures: fixed hidden state size creates inherent tension with growing context, but degradation is graceful.
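
These observations come from needle-in-a-haystack (NIAH) style runs. The snippet below is only a minimal single-needle probe you can adapt to sanity-check retrieval at a chosen context length; the filler text, needle wording, and passphrase are arbitrary illustrative choices, not the benchmark used for the reported results.

```python
# Minimal single-needle retrieval probe (illustrative; not the NIAH harness used
# for the results above). A "needle" fact is buried in filler text and the base
# model is asked to complete a prompt that requires retrieving it.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

needle = "The secret passphrase is 7319-delta."
filler = "The sky was clear and the market was quiet that morning. " * 400  # ~23k characters, a few thousand tokens
position = len(filler) // 2
haystack = filler[:position] + " " + needle + " " + filler[position:]

prompt = haystack + "\nQuestion: What is the secret passphrase?\nAnswer: The secret passphrase is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(f"context length: {inputs.input_ids.shape[1]} tokens")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=12, do_sample=False)
completion = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(completion)  # success if "7319-delta" appears in the completion
```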

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"

# Load the tokenizer and the pure-KDA checkpoint in bfloat16, sharding across
# available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Sample a continuation from the base (non-instruct) model.
prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Training Details

  • Method: Knowledge distillation from AFM-4.5B-Base using DistillKit
  • Teacher: AFM-4.5B-Base (full attention)
  • Student Architecture: All layers converted to KDA
  • Training Length: 32k sequence length

Intended Use

This model is intended for:

  • Research into linear attention mechanisms
  • Studying attention distillation techniques
  • Exploring pure state-space-like architectures for language modeling
  • Benchmarking KDA vs full attention tradeoffs

Limitations

  • Lower math/reasoning performance compared to full attention
  • Not instruction-tuned
  • Research checkpoint—not optimized for production

License

This model, like AFM-4.5B, is released under the Apache-2.0 license.
