Post
22
𧬠Midicoth: diffusion-based lossless compression ā no neural net, no GPU, no training data
What if reverse diffusion could compress text ā without a neural network?
Midicoth brings score-based denoising into classical compression. It treats prior smoothing as forward noise and reverses it with Tweedie's formula on a binary tree ā 3 denoising steps, James-Stein shrinkage, applied after all model blending. ~2,000 lines of C, single CPU core.
Beats every dictionary compressor we tested:
enwik8 (100 MB) ā 1.753 bpb (ā11.9% vs xz, ā15% vs Brotli, ā24.5% vs bzip2)
alice29.txt ā 2.119 bpb (ā16.9% vs xz)
Outperforms xz, zstd, Brotli, bzip2, gzip on all inputs
PAQ/CMIX still win with hundreds of models + LSTMs. LLM compressors win with pre-trained knowledge. Midicoth closes the gap with pure statistics ā no mixer, no gradient descent, just counting.
The Tweedie denoising layer adds 2.3ā2.7% on every file tested ā the most consistent component in the ablation. Adding SSE or logistic mixers made things worse. In the online setting, count-based beats gradient-based.
No external dependencies. Fully deterministic. Bit-exact encode/decode. ~60 KB/s throughput.
š» Code: https://github.com/robtacconelli/midicoth
š Paper: Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation (2603.08771)
ā Space: robtacconelli/midicoth
If you ever wondered whether diffusion ideas belong in data compression ā here's proof they do. ā appreciated!
What if reverse diffusion could compress text ā without a neural network?
Midicoth brings score-based denoising into classical compression. It treats prior smoothing as forward noise and reverses it with Tweedie's formula on a binary tree ā 3 denoising steps, James-Stein shrinkage, applied after all model blending. ~2,000 lines of C, single CPU core.
Beats every dictionary compressor we tested:
enwik8 (100 MB) ā 1.753 bpb (ā11.9% vs xz, ā15% vs Brotli, ā24.5% vs bzip2)
alice29.txt ā 2.119 bpb (ā16.9% vs xz)
Outperforms xz, zstd, Brotli, bzip2, gzip on all inputs
PAQ/CMIX still win with hundreds of models + LSTMs. LLM compressors win with pre-trained knowledge. Midicoth closes the gap with pure statistics ā no mixer, no gradient descent, just counting.
The Tweedie denoising layer adds 2.3ā2.7% on every file tested ā the most consistent component in the ablation. Adding SSE or logistic mixers made things worse. In the online setting, count-based beats gradient-based.
No external dependencies. Fully deterministic. Bit-exact encode/decode. ~60 KB/s throughput.
š» Code: https://github.com/robtacconelli/midicoth
š Paper: Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation (2603.08771)
ā Space: robtacconelli/midicoth
If you ever wondered whether diffusion ideas belong in data compression ā here's proof they do. ā appreciated!