AFM-4.5B-Base-KDA-Only
A research variant of AFM-4.5B-Base where all attention layers have been replaced with Kimi Delta Attention (KDA) through knowledge distillation. This model contains no full-attention layers.
⚠️ Research Model: This is an experimental model released for research purposes. For production use, see AFM-4.5B.
More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it
Overview
This model explores whether full attention can be completely replaced with linear attention mechanisms. Using DistillKit, we distilled the original AFM-4.5B-Base (teacher) into a pure-KDA architecture (student).
Key characteristics:
- All 24 layers use KDA instead of full attention
- Trained up to 32k sequence length
- Linear memory scaling with sequence length
- Smoother long-context degradation compared to hybrid architectures
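KDA is a gated, delta-rule variant of linear attention: instead of a growing KV cache, each head keeps a fixed-size associative state that is decayed and updated once per token. The PyTorch loop below is a naive single-head reference recurrence under that description; the exact KDA gating, normalization, and chunked kernel are not reproduced here, so treat the details as an illustrative assumption rather than the model's implementation.

```python
import torch

def delta_rule_recurrence(q, k, v, beta, alpha):
    """Naive reference loop for a gated delta-rule linear attention head.

    q, k, v : (seq_len, head_dim) per-token projections (keys ideally unit-norm)
    beta    : (seq_len,)          write strength in [0, 1]
    alpha   : (seq_len, head_dim) per-channel decay gate in [0, 1]

    The state S is a fixed (head_dim, head_dim) matrix, so per-step memory
    stays constant no matter how long the sequence grows.
    """
    seq_len, d = q.shape
    S = torch.zeros(d, d, dtype=q.dtype)      # fixed-size associative memory
    outputs = []
    for t in range(seq_len):
        S = alpha[t].unsqueeze(-1) * S        # per-channel forgetting
        retrieved = k[t] @ S                  # value currently stored for k_t
        S = S + beta[t] * torch.outer(k[t], v[t] - retrieved)  # delta-rule correction
        outputs.append(q[t] @ S)              # read out with the query
    return torch.stack(outputs)
```

Production kernels compute the same recurrence in chunked, parallel form on GPU; the loop above only exists to show why memory stays flat as context grows.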
Architecture
| Component | Details |
|---|---|
| Parameters | 4.5B |
| Attention Type | Kimi Delta Attention (All layers) |
| Positional Encoding | None (position is handled implicitly by KDA's recurrence) |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |
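The table can be cross-checked against the released checkpoint by inspecting its configuration with the same transformers API used in the Usage section below:

```python
from transformers import AutoConfig

# Add trust_remote_code=True if the checkpoint ships custom modeling code
# for its KDA layers (an assumption; check the repository files first).
config = AutoConfig.from_pretrained("arcee-ai/AFM-4.5B-Base-KDA-Only")
print(config)  # hidden size, layer count, and attention-related fields
```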
Benchmark Results
Performance compared to the teacher model and hybrid configurations:
| Benchmark | Teacher (Full Attn) | KDA-Only |
|---|---|---|
| MMLU (Avg) | 63.1% | 55.8% |
| ARC-Challenge | 55.6% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% |
| GSM8K (Math) | 52.1% | 26.8% |
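The harness behind these numbers is not stated on this card. As an assumption, a reproduction attempt with EleutherAI's lm-evaluation-harness might look like the sketch below; the task names, default few-shot settings, and batch size are illustrative and may not match the original evaluation setup.

```python
# pip install lm-eval   (EleutherAI lm-evaluation-harness, v0.4+)
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

lm = HFLM(
    pretrained="arcee-ai/AFM-4.5B-Base-KDA-Only",
    dtype="bfloat16",
    batch_size=8,
)

results = simple_evaluate(
    model=lm,
    tasks=["mmlu", "arc_challenge", "hellaswag", "gsm8k"],  # assumed task set
)
print(results["results"])
```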
Key Findings
- Knowledge benchmarks: KDA-Only performs within the statistical margin of the hybrid configurations on MMLU, ARC-Challenge, and HellaSwag
- Math performance: The drop on GSM8K is larger than with the hybrid configurations, though it may recover with longer training
- Long-context behavior: Degradation beyond the training length is smoother than in hybrid models: no cliff at 32k, just a gradual falloff
Long-Context Performance (NIAH)
The pure-KDA model shows interesting long-context characteristics:
- 100% single-needle retrieval up to 65k tokens, beyond the 32k training length
- Multi-key retrieval starts degrading around 4k, but the decline is smooth
- No sharp "cliff" past 32k of the kind hybrid models exhibit
This behavior aligns with expectations for state-space-like architectures: fixed hidden state size creates inherent tension with growing context, but degradation is graceful.
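A rough way to quantify that tension: during generation, full attention keeps a KV cache that grows with every token, while a pure-KDA stack keeps one fixed-size matrix state per head per layer. The sketch below uses the 24 layers noted above but assumed placeholder values for head count, head dimension, and bf16 storage, so read it as an order-of-magnitude illustration only.

```python
def kv_cache_bytes(seq_len, n_layers=24, n_kv_heads=8, head_dim=128, bytes_per=2):
    # Full attention: a key and a value vector per past token, per layer.
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per

def kda_state_bytes(n_layers=24, n_heads=8, head_dim=128, bytes_per=2):
    # KDA-style linear attention: one (head_dim x head_dim) state matrix per
    # head, per layer, regardless of how many tokens have been processed.
    return n_layers * n_heads * head_dim * head_dim * bytes_per

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens | KV cache ~ {kv_cache_bytes(ctx) / 2**20:7.0f} MiB"
          f" | KDA state ~ {kda_state_bytes() / 2**20:7.0f} MiB")
```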
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load in bf16 and shard across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Sample a 100-token continuation of the prompt.
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
- Method: Knowledge distillation from AFM-4.5B-Base using DistillKit
- Teacher: AFM-4.5B-Base (full attention)
- Student Architecture: All layers converted to KDA
- Training Length: 32k sequence length
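DistillKit's exact objective is not reproduced on this card. As a generic illustration of logit-level knowledge distillation (the temperature, mixing weight, and the combination of KL with cross-entropy are assumptions, not the documented recipe), the core loss typically looks like:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic KD loss: KL(teacher || student) on softened logits + CE on labels.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids, with -100 marking ignored positions.
    T and alpha are illustrative hyperparameters, not DistillKit's settings.
    """
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kd + (1 - alpha) * ce
```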
Intended Use
This model is intended for:
- Research into linear attention mechanisms
- Studying attention distillation techniques
- Exploring pure state-space-like architectures for language modeling
- Benchmarking KDA vs full attention tradeoffs
Limitations
- Lower math/reasoning performance compared to full attention
- Not instruction-tuned
- Research checkpoint—not optimized for production
License
AFM-4.5B-Base-KDA-Only, like the rest of the AFM-4.5B family, is released under the Apache-2.0 license.