Growing Transformers: Layer-wise Expansion Comparative Study
Paper: 2507.07129, 'Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate' (Sections 4.2.2 and 5.2, Results)
Bochkov/growing-transformers-model-16-bit-1-9-181m
Note: Constructively grown 9-layer decoder-only Transformer with a frozen 16-dimensional (16-bit) binary token embedding (n_embed=16) for vocab_size=65,536, so every token ID fits in 16 bits. Ablation model for comparison with Bochkov/growing-transformers-model-unicode-1-9-247m on the same Transformer architecture; the total parameter count is smaller because the embedding matrix is much smaller.
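A minimal sketch of how such a frozen 16-bit substrate can be built, assuming PyTorch; the helper name is hypothetical, and mapping the bits to ±1 is an assumed detail rather than something stated on the card. Each token's embedding is simply the bit pattern of its 16-bit ID and is never updated:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 65_536   # every token ID fits in 16 bits
N_EMBED = 16          # one embedding dimension per bit

def build_binary_embedding() -> nn.Embedding:
    # Hypothetical helper: each row is the bit pattern of the token ID, mapped to {-1, +1}.
    ids = torch.arange(VOCAB_SIZE).unsqueeze(1)      # (V, 1)
    bits = (ids >> torch.arange(N_EMBED)) & 1        # (V, 16), values in {0, 1}
    weight = bits.float() * 2.0 - 1.0                # +/-1 levels are an assumed detail
    return nn.Embedding.from_pretrained(weight, freeze=True)  # frozen substrate, never trained
```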
Bochkov/growing-transformers-model-unicode-1-9-247m
Note: Constructively grown 9-layer decoder-only Transformer with fully frozen visual Unicode embeddings (247.6M parameters). Main comparison model for the 16-bit constructive ablation (same Transformer stack; its larger embedding matrix explains the parameter-count difference).
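A hedged sketch of the frozen-embedding setup, assuming PyTorch; the file name and shape are placeholders, and how the visual glyph features are actually produced is described in paper 2507.04886, not here:

```python
import torch
import torch.nn as nn

# Placeholder file name: a precomputed (65_536, 1024) matrix of visual glyph features.
weight = torch.load("unicode_visual_embeddings.pt")                  # assumed shape (V, d_model)
token_embedding = nn.Embedding.from_pretrained(weight, freeze=True)  # stays frozen throughout training
```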
Bochkov/growing-transformers-model-unfrozen-1-9-247m
Note: Constructively grown 9-layer GPT-like model (layers 1–3 → freeze → 4–6 → freeze → 7–9) with standard trainable embeddings in stage 1 (frozen afterwards). Main comparison baseline against the 16-bit constructive model; the larger parameter count is due to the full-size embedding matrix.
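A minimal sketch of the layer-wise growth schedule, assuming a PyTorch model whose decoder blocks live in an nn.ModuleList; make_block and train_stage are hypothetical placeholders, not the authors' training code:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad_(False)

def grow_and_train(model, make_block, train_stage, d_model=1024, n_head=32):
    # Three stages: train layers 1-3, freeze; add and train 4-6, freeze; add and train 7-9, freeze.
    for stage, new_layers in enumerate([(1, 2, 3), (4, 5, 6), (7, 8, 9)]):
        for _ in new_layers:
            model.blocks.append(make_block(d_model, n_head))  # new blocks start out trainable
        train_stage(model, stage)                             # optimizer updates only unfrozen params
        for block in model.blocks:
            freeze(block)                                     # freeze everything grown so far
```

In the unfrozen-embedding variant described above, the token embeddings follow the same pattern: trainable during stage 1, then frozen.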
Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m
Note: Monolithic baseline (181M): same 9-layer architecture as growing-transformers-model-16-bit-1-9-181m, but trained end-to-end (no layer-wise growth) with the 16-bit token embeddings kept frozen.
Bochkov/growing-transformers-model-frozen-unicode-baseline-monolyth-247m
Note: Monolithic 9-layer decoder-only Transformer (d_model=1024, n_head=32) trained end-to-end with a frozen visual Unicode embedding matrix (65,536 × 1024). Baseline for comparison against the 16-bit and constructive-growth models; it has more parameters than the 16-bit variant because of the full-size embedding matrix.
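For concreteness, an illustrative config dataclass mirroring the hyperparameters stated on this card; the class and field names are assumptions, not the repository's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class MonolithicBaselineConfig:
    # Illustrative config mirroring the hyperparameters stated on this card.
    vocab_size: int = 65_536
    d_model: int = 1024
    n_head: int = 32
    n_layer: int = 9
    freeze_embeddings: bool = True   # frozen visual Unicode embedding matrix (vocab_size x d_model)
```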
Bochkov/growing-transformers-model-unfrozen-baseline-monolyth-247m
Note: Classic GPT-like baseline: 9-layer decoder-only Transformer trained monolithically end-to-end with fully trainable token embeddings (no frozen embeddings, no layer-wise growth). It has more parameters than the 16-bit models because it uses a full 65,536 × 1024 embedding matrix.
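The parameter-count gap quoted across these cards follows directly from the embedding-matrix sizes; a quick back-of-the-envelope check using only the numbers stated above:

```python
V, d_model, n_embed = 65_536, 1024, 16

full_embedding_params   = V * d_model  # 67,108,864 (~67.1M): Unicode / trainable-embedding models
binary_embedding_params = V * n_embed  #  1,048,576 (~1.0M):  16-bit models

# The ~66M embedding-parameter gap accounts for essentially all of the difference
# between the ~247.6M and ~181M totals listed above.
print(full_embedding_params - binary_embedding_params)  # 66,060,288
```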
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Paper: 2507.07129
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
Paper: 2507.04886