tiny-tiny-stories

Oh, you think TinyStories is small? Welcome to TinyStories on steroids: a 1M-parameter model at 1.58-bit quantization!

A 1-bit (ternary {-1, 0, +1}) transformer language model trained on TinyStories.

Specs

  • Parameters: 998,784 (< 1M)
  • Weight precision: 1.58-bit ternary (BitNet b1.58)
  • Tokenizer: SentencePiece unigram, 192-token vocab
  • Context length: 512 tokens
  • Best val loss: 1.2087 (perplexity 3.35)
  • Training: 100K steps on the ~2.1M TinyStories stories
  • Checkpoint size: 3.9 MB (FP32 latent weights), ~350 KB quantized
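
The perplexity figure is simply the exponential of the validation cross-entropy loss (in nats):

```python
import math

val_loss = 1.2087
perplexity = math.exp(val_loss)
print(f"{perplexity:.2f}")  # 3.35
```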

Architecture

  • d_model: 128
  • Heads: 4 (head_dim=32)
  • Layers: 5
  • FFN: SwiGLU (d_ff=336)
  • Position encoding: RoPE (no learned positional embeddings)
  • Normalization: RMSNorm
  • Embeddings: Tied input/output, full precision
  • All linear layers: BitLinear with ternary quantization + straight-through estimator
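
As a sanity check, the headline parameter count follows from the dimensions above. This is a rough accounting that assumes no biases, two RMSNorms per block, and one final norm; the authoritative breakdown is the BitLM class in train.py:

```python
d_model, n_layers, d_ff, vocab = 128, 5, 336, 192

embed = vocab * d_model        # tied input/output embedding (kept full precision)
attn = 4 * d_model * d_model   # Q, K, V, O projections
ffn = 3 * d_model * d_ff       # SwiGLU: gate, up, and down matrices
norms = 2 * d_model            # two RMSNorms per block
final_norm = d_model           # final RMSNorm before the output head

total = embed + n_layers * (attn + ffn + norms) + final_norm
print(total)  # 998784

# Rough quantized size: ternary linear weights at 2 bits each,
# embeddings and norm gains kept at FP32 (4 bytes each)
ternary = n_layers * (attn + ffn)
size_bytes = ternary * 2 // 8 + (embed + n_layers * norms + final_norm) * 4
print(f"~{size_bytes / 1e3:.0f} KB")  # ~347 KB, in line with the ~350 KB figure
```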

How it works

All Q/K/V/O attention projections and SwiGLU FFN matrices use BitLinear: weights are quantized to {-1, 0, +1} during the forward pass via round(W / mean(|W|)), with gradients flowing through a straight-through estimator to full-precision latent weights during training.
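
A minimal sketch of that forward pass is below. The detach trick implements the straight-through estimator: the forward computation uses the ternarized weights, while gradients flow unchanged into the full-precision latent weights. The real BitLinear in train.py may differ in details such as activation quantization or how the scale is applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer with ternary {-1, 0, +1} weights and a straight-through
    estimator (a sketch of the scheme described above, not the exact train.py code)."""

    def forward(self, x):
        # Per-tensor scale: mean absolute value of the latent weights
        scale = self.weight.abs().mean().clamp(min=1e-5)
        # Ternarize: round(W / mean|W|), clipped to {-1, 0, +1}, then rescale
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        # Straight-through estimator: forward sees w_q, backward sees self.weight
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w, self.bias)
```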

Usage

import torch
import sentencepiece as spm

# Load tokenizer and model
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

# Load model (see train.py for BitLM class definition)
from train import BitLM, Config
cfg = Config()
cfg.vocab_size = 192
model = BitLM(cfg)
ckpt = torch.load('best.pt', map_location='cpu', weights_only=True)
state = ckpt['model']
if any(k.startswith('_orig_mod.') for k in state):
    state = {k.replace('_orig_mod.', ''): v for k, v in state.items()}
model.load_state_dict(state)
model.eval()

# Generate
ids = [sp.bos_id()] + sp.encode("Once upon a time")
idx = torch.tensor([ids])
out = model.generate(idx, max_new=200, temp=0.8, top_k=40, eos_id=sp.eos_id())
print(sp.decode(out[0].tolist()))

Sample output

Once upon a time, there was a squirrel. He was very curious and loved to play in the park. One day, he noticed a big tree in the sky. He was already laughing, but he was stronger under his houses. The squirrel was glue of all the trees, exploring the walls...

Training

Trained on 2x RTX 2080 Ti using mixed-precision (FP16) with AdamW optimizer, cosine LR schedule (1.5e-3 peak, 1000 step warmup), and gradient accumulation (effective batch size 384).

python train.py --exp-dir ./output --device cuda:0 --compile
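
The learning-rate schedule can be sketched as linear warmup to the peak followed by cosine decay. The minimum LR and the exact decay horizon here are assumptions; train.py defines the real schedule:

```python
import math

def lr_at(step, peak_lr=1.5e-3, warmup=1000, total=100_000, min_lr=0.0):
    """Linear warmup to peak_lr over `warmup` steps, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    t = (step - warmup) / (total - warmup)  # progress through the decay phase
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))

print(lr_at(500))      # halfway through warmup: 7.5e-4
print(lr_at(1000))     # peak: 1.5e-3
print(lr_at(100_000))  # end of schedule: 0.0
```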
