Growing Transformers: Layer-wise Expansion Comparative Study
Paper: 2507.07129, 'Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate' (Sections 4.2.2 and 5.2, Results)
Bochkov/growing-transformers-model-16-bit-1-9-181m
Note: Constructively grown 9-layer decoder-only Transformer with a frozen 16-dimensional (16-bit) binary token embedding (n_embed=16) for vocab_size=65,536, so every token ID fits in 16 bits. Ablation model for comparison with Bochkov/growing-transformers-model-unicode-1-9-247m on the same Transformer architecture; the total parameter count is smaller because the embedding matrix is much smaller.
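A minimal sketch of how such a frozen 16-bit substrate can be built, assuming PyTorch; the helper name is hypothetical, and mapping the bits to ±1 is an assumed detail rather than something stated on the card. Each token's embedding is simply the bit pattern of its 16-bit ID and is never updated:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 65_536   # every token ID fits in 16 bits
N_EMBED = 16          # one embedding dimension per bit

def build_binary_embedding() -> nn.Embedding:
    # Hypothetical helper: each row is the bit pattern of the token ID, mapped to {-1, +1}.
    ids = torch.arange(VOCAB_SIZE).unsqueeze(1)      # (V, 1)
    bits = (ids >> torch.arange(N_EMBED)) & 1        # (V, 16), values in {0, 1}
    weight = bits.float() * 2.0 - 1.0                # +/-1 levels are an assumed detail
    return nn.Embedding.from_pretrained(weight, freeze=True)  # frozen substrate, never trained
```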
Bochkov/growing-transformers-model-unicode-1-9-247m
Note: Constructively grown 9-layer decoder-only Transformer with fully frozen visual Unicode embeddings (247.6M parameters). Main comparison model for the 16-bit constructive ablation (same Transformer stack; its larger embedding matrix explains the parameter-count difference).
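A hedged sketch of the frozen-embedding setup, assuming PyTorch; the file name and shape are placeholders, and how the visual glyph features are actually produced is described in paper 2507.04886, not here:

```python
import torch
import torch.nn as nn

# Placeholder file name: a precomputed (65_536, 1024) matrix of visual glyph features.
weight = torch.load("unicode_visual_embeddings.pt")                  # assumed shape (V, d_model)
token_embedding = nn.Embedding.from_pretrained(weight, freeze=True)  # stays frozen throughout training
```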
Bochkov/growing-transformers-model-unfrozen-1-9-247m
Note: Constructively grown 9-layer GPT-like model (layers 1–3 → freeze → 4–6 → freeze → 7–9) with standard trainable embeddings in stage 1 (frozen afterwards). Main comparison baseline against the 16-bit constructive model; the larger parameter count is due to the full-size embedding matrix.
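A minimal sketch of the layer-wise growth schedule, assuming a PyTorch model whose decoder blocks live in an nn.ModuleList; make_block and train_stage are hypothetical placeholders, not the authors' training code:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad_(False)

def grow_and_train(model, make_block, train_stage, d_model=1024, n_head=32):
    # Three stages: train layers 1-3, freeze; add and train 4-6, freeze; add and train 7-9, freeze.
    for stage, new_layers in enumerate([(1, 2, 3), (4, 5, 6), (7, 8, 9)]):
        for _ in new_layers:
            model.blocks.append(make_block(d_model, n_head))  # new blocks start out trainable
        train_stage(model, stage)                             # optimizer updates only unfrozen params
        for block in model.blocks:
            freeze(block)                                     # freeze everything grown so far
```

In the unfrozen-embedding variant described above, the token embeddings follow the same pattern: trainable during stage 1, then frozen.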
Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m
Note: Monolithic baseline (181M): same 9-layer architecture as growing-transformers-model-16-bit-1-9-181m, but trained end-to-end (no layer-wise growth) with the 16-bit token embeddings kept frozen.
Bochkov/growing-transformers-model-frozen-unicode-baseline-monolyth-247m
Note: Monolithic 9-layer decoder-only Transformer (d_model=1024, n_head=32) trained end-to-end with a frozen visual Unicode embedding matrix (65,536 × 1024). Baseline for comparison against the 16-bit and constructive-growth models; it has more parameters than the 16-bit variant because of the full-size embedding matrix.
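For concreteness, an illustrative config dataclass mirroring the hyperparameters stated on this card; the class and field names are assumptions, not the repository's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class MonolithicBaselineConfig:
    # Illustrative config mirroring the hyperparameters stated on this card.
    vocab_size: int = 65_536
    d_model: int = 1024
    n_head: int = 32
    n_layer: int = 9
    freeze_embeddings: bool = True   # frozen visual Unicode embedding matrix (vocab_size x d_model)
```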
Bochkov/growing-transformers-model-unfrozen-baseline-monolyth-247m
Note: Classic GPT-like baseline: 9-layer decoder-only Transformer trained monolithically end-to-end with fully trainable token embeddings (no frozen embeddings, no layer-wise growth). It has more parameters than the 16-bit models because it uses a full 65,536 × 1024 embedding matrix.
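The parameter-count gap quoted across these cards follows directly from the embedding-matrix sizes; a quick back-of-the-envelope check using only the numbers stated above:

```python
V, d_model, n_embed = 65_536, 1024, 16

full_embedding_params   = V * d_model  # 67,108,864 (~67.1M): Unicode / trainable-embedding models
binary_embedding_params = V * n_embed  #  1,048,576 (~1.0M):  16-bit models

# The ~66M embedding-parameter gap accounts for essentially all of the difference
# between the ~247.6M and ~181M totals listed above.
print(full_embedding_params - binary_embedding_params)  # 66,060,288
```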
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Paper: 2507.07129
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
Paper: 2507.04886