Diffusion Engram IME Demo


Introduction

This project explores an implementation approach for an Input Method Editor (IME) based on diffusion language models. It builds on the LLaDA implementation and incorporates the Engram module, aiming to leverage the strong contextual understanding of language models to improve the accuracy and coherence of long-sentence input.

The model is trained on the Chinese Fineweb Edu Dataset V2.1 and uses the Tiger Code (Huma) input scheme as the encoding standard.

Usage

This project provides a simple interactive script, example.py, that demonstrates the core functionality.

Input Format

Enter a string at the prompt.

  • Chinese Characters: Enter the first two keys of the character's Tiger Code encoding.
  • Punctuation/Uppercase Letters/Special Symbols: Enter them directly as-is. Symbols may be typed in their half-width forms.
  • Lowercase Letters: Enter the letter followed by a space.
  • Mixed Input: Encodings and literal text can be mixed freely.

See comments at the end of example.py for some examples.
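As a rough illustration, the rules above can be expressed as a small parser. The function name and token representation here are hypothetical; the actual parsing in example.py may differ:

```python
def parse_ime_input(s: str):
    """Split a mixed input string into tokens, following the rules above.

    Returns ("code", xy) for a two-key Tiger Code prefix and
    ("literal", ch) for characters passed through verbatim.
    (Illustrative sketch; not the project's actual parser.)
    """
    tokens = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch.isascii() and ch.islower():
            # "letter + space" means a literal lowercase letter ...
            if i + 1 < len(s) and s[i + 1] == " ":
                tokens.append(("literal", ch))
                i += 2
            # ... otherwise two lowercase letters form a Tiger Code prefix.
            elif i + 1 < len(s) and s[i + 1].isascii() and s[i + 1].islower():
                tokens.append(("code", s[i:i + 2]))
                i += 2
            else:
                raise ValueError(f"dangling code key at position {i}")
        else:
            # Punctuation, uppercase letters, and symbols pass through as-is.
            tokens.append(("literal", ch))
            i += 1
    return tokens
```

For example, `parse_ime_input("ab,A")` yields one Tiger Code prefix followed by two literal characters.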

Limitations

⚠️ This project may contain ⚠️:

  • Training corpora that have not been carefully processed or selected
  • Model architecture and hyperparameters conceived on a whim
  • Inefficient model implementation and inference code

Ramblings

Actually, long before ChatGPT took the world by storm and before I knew anything about transformers, I wondered whether deep learning models could power a more capable IME. At the time I tried learning a shape-based input method (though I didn't stick with it) and often daydreamed about other ways to improve input efficiency. I imagined an IME that was more familiar with the syntax and semantics a language should have, and could compose plausible sentences with higher probability. Its user interface might be vastly different from current IMEs, centered on entering whole sentences or even paragraphs so that context could be put to use. (Of course, plenty of people have probably pursued this.) But without "vibe coding" to help me back then, I lacked the drive to turn these ideas into reality.

With the development of LLMs, I increasingly feel that input efficiency is a major bottleneck in human-computer interaction. When I learned about diffusion language models, the idea seemed very well suited to the IME scenario: the model predicts each character from the complete context (whereas contrastive decoding on an autoregressive model only sees the preceding text), it can infer the original text progressively from easy to hard, and it can even seamlessly handle cases where some words have already been selected manually by the user. I chose a shape-based code primarily because it avoids the ambiguity of polyphonic characters, which simplifies data processing, and because the number of characters per code is distributed more evenly. The initial training results, however, were not ideal.
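The easy-to-hard decoding loop can be sketched as follows. Here `logits_fn`, the step schedule, and the confidence rule are illustrative stand-ins for the real model and sampler, not LLaDA's actual remasking strategy; note how user-selected tokens (`fixed`) are simply committed up front and never revisited:

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def diffusion_decode(logits_fn, length, fixed=None, steps=4):
    """Easy-to-hard decoding sketch: each step, the model scores all
    masked positions from the full bidirectional context, and the most
    confident predictions are committed. `fixed` maps positions to
    user-selected token ids that are never overwritten."""
    seq = np.full(length, MASK, dtype=np.int64)
    for pos, tok in (fixed or {}).items():
        seq[pos] = tok                      # user choices committed up front
    per_step = max(1, int(np.ceil((seq == MASK).sum() / steps)))
    while (seq == MASK).any():
        probs = logits_fn(seq)              # (length, vocab) probabilities
        conf = probs.max(axis=-1)
        conf[seq != MASK] = -np.inf         # only masked slots compete
        k = min(per_step, int((seq == MASK).sum()))
        for pos in np.argsort(-conf)[:k]:   # commit the k most confident
            seq[pos] = int(probs[pos].argmax())
    return seq
```

With a toy `logits_fn` that always prefers token `i % 5` at position `i`, the loop fills in all unfixed positions while leaving fixed ones untouched.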

When I saw Engram, I immediately revived this project. Engram's n-gram lookup almost perfectly fills the role of a "lexicon", so the Engram module should substantially relieve the backbone of the burden of memorizing one. After retraining, the results were indeed much better than before.
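A minimal sketch of the idea, assuming a hashed n-gram table: recent n-grams of token ids are hashed into a fixed table of learned vectors whose sum is added to the backbone's hidden states. The table size, hash function, and n-gram orders below are made up for illustration; the real Engram design differs in detail (e.g. in how n-grams are formed, multi-head hashing, and gating):

```python
import numpy as np

class NgramMemory:
    """Engram-style lookup sketch: n-grams ending at each position are
    hashed into a fixed table of vectors, giving the backbone a cheap
    'lexicon'. Sizes and hash are illustrative, not the published config."""

    def __init__(self, table_size=1 << 16, dim=64, orders=(2, 3), seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 0.02, (table_size, dim))
        self.table_size = table_size
        self.orders = orders

    def _bucket(self, ngram):
        h = 1469598103934665603                 # FNV-1a over token ids
        for t in ngram:
            h = ((h ^ int(t)) * 1099511628211) % (1 << 64)
        return h % self.table_size

    def __call__(self, ids):
        out = np.zeros((len(ids), self.table.shape[1]))
        for i in range(len(ids)):
            for n in self.orders:               # sum the vectors of the
                if i + 1 >= n:                  # n-grams ending at i
                    out[i] += self.table[self._bucket(ids[i + 1 - n:i + 1])]
        return out
```

Because the lookup is keyed only on local token windows, repeated n-grams map to identical vectors, which is exactly the table-like behavior a lexicon needs.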

This project is still far from a practically usable IME. The most obvious gap is the lack of a suitable user interface, one that lets users edit candidate results and tolerates latencies of a few tenths of a second. Beyond that, the training data is narrow in genre, especially lacking colloquial and literary content; inference is barely optimized; and no serious experiments have been run on model size or hyperparameter choices. Letting the model use already-committed text as context, and adapting Engram (and the other modules) specifically for the IME setting, are also potential directions for improvement.
