| --- |
| license: other |
| license_name: raml-v1.0 |
| pipeline_tag: text-generation |
| tags: |
| - model_hub_mixin |
| - pytorch_model_hub_mixin |
| - RxNN |
| - RxLM |
| - ReactiveTransformer |
| - Event-Driven |
| - MemorySystem |
| - ShortTermMemory |
| - Real-Time |
| - ReactiveLanguageModel |
| - RealTimeLanguageModel |
| language: |
| - en |
| datasets: |
| - HuggingFaceFW/fineweb-edu |
| - wikimedia/wikipedia |
| - HuggingFaceFW/clean-wikipedia |
| - ReactiveAI/smol-smoltalk-Interaction-SFT |
| - ReactiveAI/cosmopedia-100k-Interaction-SFT |
| - ReactiveAI/Real-Chat-SMAT |
| - ReactiveAI/Real-Chat-No-System-SMAT |
| library_name: RxLM |
| gated: true |
| extra_gated_prompt: >- |
| Accept [Reactive AI Model & Architecture License (RAML) |
| v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to |
| access the repository and use model. Reactive Transformer (pending patent |
| #P.453260) is available for free for non-commercial usage. For commercial |
| usage please contact Reactive AI at licensing@rxai.dev |
| extra_gated_fields: |
| Company: text |
| Country: country |
| I want to use this model for: |
| type: select |
| options: |
| - Research |
| - Education |
| - label: Other |
| value: other |
| I agree to use this model for non-commercial use ONLY: checkbox |
| extra_gated_heading: >- |
| You need to agree to use this model only for research or education purposes |
| under Reactive AI Model & Architecture License (RAML) v1.0 |
| extra_gated_description: The repository will be available instantly after accepting license terms |
| extra_gated_button_content: Accept license terms |
| base_model: |
| - ReactiveAI/RxT-Beta-Micro-Supervised |
| --- |
| |
# RxT-Beta-Micro-Supervised-AI 270M
World's first experimental real-time **Reactive Language Model (RxLM)**, trained on limited real-world data (following the synthetic
RxT-Alpha generation). It is based on the revolutionary **Reactive Transformer** architecture - processing only single
interactions/messages, with all conversational context moved to **Short-Term Memory**, managed by an **Attention-Based Memory System**.
|
|
This model is a fine-tuned version of [RxT-Beta-Micro-Supervised](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised),
specialized in AI/Data Science knowledge-based chats and serving as interactive [Reactive AI](https://huggingface.co/ReactiveAI) documentation.
|
|
| > Docs in progress |
|
|
| ## Model Details |
|
|
| ### Model Description |
The first **Reactive Language Model (RxLM)** trained on limited real-world datasets, based on the **Reactive Transformer (RxT)** architecture.
|
|
**RxLMs** have linear computational/inference cost scaling (`O(NT)`), compared to the quadratic growth of **LLMs** (`O(N²T)`),
where `N` is the number of messages in the conversation and `T` is the number of tokens in a single interaction. Thanks to that
scaling, they are roughly `N` times faster and cheaper than **LLMs** on long conversations.
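As a rough illustration (plain Python, with an assumed conversation length and interaction size), the sketch below counts how many tokens each paradigm has to process over a whole conversation:

```python
# Rough illustration of prompt-processing cost growth (token counts only,
# ignoring constant factors and generation cost).
def llm_tokens_processed(n_messages: int, tokens_per_interaction: int) -> int:
    # A stateless LLM re-reads the whole history on every turn: 1T + 2T + ... + NT
    return sum(turn * tokens_per_interaction for turn in range(1, n_messages + 1))

def rxlm_tokens_processed(n_messages: int, tokens_per_interaction: int) -> int:
    # A Reactive Language Model processes only the current interaction: N * T
    return n_messages * tokens_per_interaction

N, T = 100, 1024  # assumed conversation length and interaction size
print(llm_tokens_processed(N, T))   # 5,171,200 tokens -> O(N^2 * T)
print(rxlm_tokens_processed(N, T))  # 102,400 tokens   -> O(N * T)
```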
|
|
That is not the only advantage - event-driven, real-time processing with memory is far more natural and human-like
than the data-driven approach of LLMs (reprocessing the full conversation history every time). It is a crucial milestone in the development
of AGI and awareness models.
|
|
> This is the _Supervised_ version of the model with a "weak" memory system - the result of Supervised Memory System Training (SMST). It is
> able to remember information between interactions (without passing it explicitly in the prompt/chat template), but it
> has to be refined in the next stage, Memory Reinforcement Learning (MRL), for full functionality.
|
|
After successful experiments with simple synthetic datasets, we moved to real-world data, but this model was still pre-trained on a limited
amount of English-only data - only 10B tokens from Wikipedia and FineWeb-Edu (plus 2B tokens in later stages).
Its general knowledge is therefore limited, so we fine-tuned it for chats focused on AI/Data Science knowledge.
|
|
|
|
| ### Reactive Transformer Architecture |
An experimental research model made to test our Reactive Transformer architecture and Attention-Based Memory System.
|
|
The Reactive Transformer has additional Short-Term Memory layers, connected to the model with Memory Cross-Attention and updated by the Memory Encoder and Memory Attention.
The Short-Term Memory state is kept between interactions/events (single messages), not between tokens in a sequence - that is the key difference between RxNNs and RNNs.
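A minimal conceptual sketch of one interaction cycle follows; all names are illustrative placeholders, not the actual RxLM interfaces:

```python
# Conceptual sketch of a single Reactive Transformer interaction cycle.
# All names below are illustrative placeholders, not the RxLM API.
def handle_interaction(query, stm, decoder, encoder, memory_attention):
    # 1. The decoder answers the current query only, reading past context
    #    from Short-Term Memory through memory cross-attention.
    answer = decoder(query, memory=stm)
    # 2. After the answer is returned, the Memory Encoder encodes the full
    #    interaction (query + answer)...
    encoded_interaction = encoder(query, answer)
    # 3. ...and Memory Attention merges it into the per-layer STM state,
    #    making it available for the next interaction.
    new_stm = memory_attention(stm, encoded_interaction)
    return answer, new_stm
```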
|
|
The goal of the architecture is to process only single messages and keep the conversation history in Short-Term Memory - we believe this is a key requirement
for awareness and AGI. Processing the whole chat history on every interaction is not natural, and it is not how human awareness works. The Reactive Transformer
architecture is therefore a first step in the transition from language models to awareness models.
|
|
To balance the number of parameters, the decoder is based on a Mixture-of-Experts architecture, while the encoder uses regular
dense feed-forward layers. This model uses the gated self/interlayer variant of the memory attention network with sigmoid residual gates.
|
|
| <img src="https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised/resolve/main/reactive-transformer-self-interlayer.png" width="800" /> |
|
|
| #### Architecture details: |
| - dim: 256 |
| - layers: 14 |
| - heads (for split): 16 |
| - **Decoder:** |
| - self-attention: Sparse Query Attention |
| - query heads: 8/16 |
| - key/value heads: 4/16 |
| - memory cross-attention: Sparse Query Attention |
| - query heads: 8/16 |
| - key/value heads: 4/16 |
| - Mixture-of-Experts Feed Forward |
| - experts: 42 |
| - active experts: 4 |
| - SwiGLU feed forward with 512 dim |
| - size: \~251M (~41M Activated) |
| - **Encoder:** |
| - self-attention: symmetric Sparse Query Attention |
| - query/key/value heads: 8/16 |
| - SwiGLU feed forward with 768 dim |
| - size: ~18.3M |
| - **Memory Attention:** |
| - variant: **Gated Self/Interlayer Memory Attention** |
| - attention layers: symmetric Sparse Query Attention |
| - query/key/value heads: 8/16 |
| - residual gate: elementwise with sigmoid activation (per STM slot) |
| - size: ~3.73M |
| - RoPE for self-attention, memory cross-attention (query only) and memory attention (key only) |
| - RMS Norm for all normalization layers |
| - vocab: 32k (english only) |
| - interaction (query + answer) length: 1024 tokens |
| - STM size: 14 layers * 1024 slots (* 256 dim) |
| - context/messages: **Infinite** |
| - size: ~270M |
| - Library: RxLM |
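As a quick sanity check of the memory footprint implied by these numbers (plain Python; the bf16 byte size is an assumption):

```python
# Short-Term Memory footprint implied by the configuration above.
layers, slots, dim = 14, 1024, 256
stm_values = layers * slots * dim   # 3,670,016 values in total
stm_bytes = stm_values * 2          # 2 bytes per value in bf16
print(f"{stm_values:,} STM values, ~{stm_bytes / 1e6:.1f} MB in bf16")  # ~7.3 MB per conversation
```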
| --- |
| - **Developed by:** [Adam Filipek](https://huggingface.co/AdamF92) & [Reactive AI](https://huggingface.co/ReactiveAI) |
| - **Funded by:** [Reactive AI](https://huggingface.co/ReactiveAI) |
| - **Model type:** **Reactive Language Model (RxLM)** |
| - **Language(s) (NLP):** English |
| - **License:** [Reactive AI Model & Architecture License (RAML) v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) |
| - **Finetuned from model:** [RxT-Beta-Micro-Supervised](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised) |
|
|
| ### Model Sources |
|
|
|
|
| - **Repository:** [RxLM Framework](https://github.com/RxAI-dev/rxlm) |
| - **Paper:** [Reactive Transformer (RxT) - Stateful Real-Time Processing for Event-Driven Reactive Language Models](https://arxiv.org/abs/2510.03561) |
| - **Demo:** In progress |
|
|
| ## Uses |
This model is a fine-tuned version of [RxT-Beta-Micro-Supervised](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised), trained on conversations based on AI/Data Science knowledge
and Reactive AI documentation. It is made to serve as interactive documentation of our technologies.
|
|
The base model is still experimental and was pre-trained on a limited corpus of only 10B tokens, so its general knowledge is also limited, but it should
work correctly for AI/Data Science oriented topics.
|
|
**Supervised** RxT models are partially functional, intermediate-stage models - it is recommended to refine them with Memory Reinforcement Learning (MRL) and Reactive
Reinforcement Learning from Human Feedback (RxRLHF) to reach the final stage.
|
|
| ### Direct Use |
It is recommended to refine the model in the reinforcement learning stages for full functionality (in progress).
|
|
| **Reactive Transformer** models are made for conversational tasks, especially chatbots or as a stateful base for agentic systems. |
|
|
This model is made to act as interactive documentation of Reactive AI technologies and as an AI/Data Science knowledge agent.
|
|
| ### Out-of-Scope Use |
**Reactive Transformer** models are natively conversational and made for multi-step tasks. They aren't typical GenAI models and aren't made
for single-step generative tasks (like summarization, dataset generation, etc.) - they will work in those scenarios, but it would be a waste
of computational resources (initializing/processing memory when it is not needed). For such cases it is better to use a stateless LLM.
|
|
| ## Bias, Risks, and Limitations |
The model is still experimental, made to test the **Reactive Transformer** architecture on real-world data after successful experiments with simple synthetic data.
It was pre-trained on only 10B tokens (plus an additional 2B in later stages), so its general knowledge is limited and responses could be inaccurate.
|
|
Conversation context is theoretically infinite (the 1024-token limit applies only to a single interaction), but after some number of messages the model will slowly forget
outdated information - that is why it is called **Short-Term Memory**. It will be extended in upcoming generations with **Long-Term Memory** for truly infinite context.
|
|
The AI/Data Science knowledge and Reactive AI documentation datasets for the fine-tuned model were created _"semi-synthetically"_ with LLMs (GPT-OSS and Qwen3) - the
conversation examples were generated by an LLM based on the provided documentation. It is therefore possible that they include some hallucinations and incorrect facts, but
this should be rather rare.
|
|
| ### Recommendations |
As mentioned before, supervised models are at an intermediate stage and it is recommended to continue training with the reinforcement learning stages.
|
|
| ## How to Get Started with the Model |
The model can be loaded and used with our RxLM framework (https://github.com/RxAI-dev/RxLM):
|
|
| ```python |
| import torch |
| from rxlm.rxt.models import RxTBeta |
| from rxlm.training.tokenizer import load_tokenizer_from_hf_hub |
| |
| tokenizer = load_tokenizer_from_hf_hub('ReactiveAI/RxT-Beta-Micro') |
| |
| model = RxTBeta.from_pretrained('ReactiveAI/RxT-Beta-Micro-Supervised-AI', tokenizer=tokenizer) |
| model.share_components() # currently required to connect embeddings/STM |
| |
| device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
| model.to(device) |
| |
| seq_len = 1024 |
| |
| # Memory init - could be used as "system prompt" in LLMs (not recommended in this model, as it wasn't trained with system prompts) |
| stm_init_state = model.tokenize_full_interaction('System prompt like', 'Initial memory for the model', max_seq_len=seq_len, device=device) |
| model.init_stm_state(**stm_init_state) |
| |
| # Helper function |
def interaction(query: str):
    tokenized_query = model.tokenize_query(query, max_seq_len=seq_len, device=device)
    for token_id in model.interact(**tokenized_query, max_seq_len=seq_len, temperature=1.0):
        if token_id == -1:
            print('\n', '[Start memory update...]')
        elif token_id == -2:
            print('[Memory updated]')
        else:
            txt_token = model.stringify_token(token_id)
            print(txt_token, end='')
| |
| # Process first interaction |
| interaction('Hello! Who are you?') |
| # Process follow-up interaction |
| interaction('Follow-up question?') |
| |
| ``` |
|
|
| ## Training Details |
The stateful, real-time nature of the **Reactive Transformer** architecture, especially the asynchronous memory update, requires an advanced training pipeline with multiple
supervised and reinforcement learning stages:
| - Supervised: |
| - Joint Language Models Pre-Training | raw large text corpora |
| - Interaction Supervised Fine-Tuning | single, not connected interactions (query + answer) |
| - Self-Supervised Memory Attention Pre-Training | multi-step conversations (SMAT datasets) |
| - Supervised Memory-Aware Training (SMAT) | multi-step conversations |
| - Reinforcement: |
| - Memory Reinforcement Learning (MRL) | multi-step conversations |
| - Reactive Reinforcement Learning from Human Feedback (RxRLHF) | multi-step conversations |
|
|
Fine-tuning for the narrow specialization was performed as additional epochs of **Supervised Memory-Aware Training (SMAT)**.
|
|
| ### Training Data |
| We used public open-source datasets for pre-training and our custom datasets (converted from public datasets) for other stages: |
| - Joint Language Models Pre-Training |
| - 'sample-10BT' subset from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) |
| - '20231101.en' subset from [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) |
| - Interaction SFT |
| - [ReactiveAI/smol-smoltalk-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT) |
| - [ReactiveAI/cosmopedia-100k-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/cosmopedia-100k-Interaction-SFT) |
| - Self-Supervised Memory Attention Pre-Training |
| - 30% of [ReactiveAI/Real-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) |
| - Supervised Memory-Aware Training (SMAT) |
| - [ReactiveAI/Real-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) |
| - [ReactiveAI/Real-Chat-No-System-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-No-System-SMAT) |
| - Specialization SMAT |
| - [ReactiveAI/AI-Knowledge-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/AI-Knowledge-Chat-SMAT) |
| - [ReactiveAI/ReactiveAI-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/ReactiveAI-Chat-SMAT) |
|
|
| ### Training Procedure |
Supervised Memory System Training includes four steps before proceeding to the Reinforcement Learning stages.
|
|
| #### Joint Language Models Pre-Training |
The decoder was trained together with the encoder and an additional MLM head, using Joint LM Training (with MLM and autoregressive losses),
on the [**HuggingFaceFW/fineweb-edu**](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [**wikimedia/wikipedia**](https://huggingface.co/datasets/wikimedia/wikipedia) datasets.
Both encoder and decoder use a shared embedding layer.
|
|
| #### Supervised Fine-Tuning |
The **RxT-Beta Micro** model was fine-tuned to the real-time interaction (sequence) format on our datasets, derived from public HuggingFace ones:
| - [**ReactiveAI/smol-smoltalk-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT) |
- [**ReactiveAI/cosmopedia-100k-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/cosmopedia-100k-Interaction-SFT).
|
|
Models were fine-tuned using the Joint LM Training mode (to pre-train memory cross-attention):
- encode the data with the encoder and calculate the MLM loss
- save the encoder layers' outputs as Short-Term Memory (available to the decoder through memory cross-attention)
- process the data with the decoder and calculate the autoregressive loss
|
|
That training results in a decoder with ~95% accuracy, because it has access to information about all upcoming tokens through memory cross-attention. In the next training
stages, those layers will instead access data from previous interactions.
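A minimal sketch of a single Joint LM Training step as described above; component names, tensor layouts, and batch keys are illustrative assumptions, not the RxLM trainer API:

```python
# Illustrative Joint LM Training step (placeholder names, not the RxLM API).
def joint_lm_step(batch, encoder, decoder, mlm_head, mlm_loss_fn, ar_loss_fn):
    # 1. Encode the masked interaction and compute the MLM loss.
    encoder_layers = encoder(batch["masked_input_ids"])  # per-layer encoder outputs
    mlm_loss = mlm_loss_fn(mlm_head(encoder_layers[-1]), batch["mlm_labels"])
    # 2. Use the per-layer encoder outputs as Short-Term Memory, which the
    #    decoder reads through its memory cross-attention layers.
    stm = encoder_layers
    # 3. Decode the same (unmasked) interaction autoregressively against that STM.
    logits = decoder(batch["input_ids"], memory=stm)
    ar_loss = ar_loss_fn(logits[:, :-1], batch["input_ids"][:, 1:])
    return mlm_loss + ar_loss
```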
|
|
| #### Self-Supervised Memory Attention Pre-Training |
Memory Attention was pre-trained to combine accumulated Short-Term Memory states with the next interaction's data processed by the
encoder, using a weighted mean (with randomized, arbitrary weights) as labels and negative cosine similarity as the loss. The label weights
depend on the inner step:
- the first step, when the STM is in its initial random normal state, uses 90% of the newly encoded data
- follow-up steps use `50% - step * 5%` of the newly encoded data
- each step can have 0-15% random differences in the weights
|
|
| Additionally, random noise is added to both inputs and labels. |
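A sketch of this label construction and loss is shown below; the weight schedule follows the description above, while the noise scale, blending details, and function names are illustrative assumptions:

```python
import random
import torch
import torch.nn.functional as F

def new_data_weight(step: int) -> float:
    # First step blends 90% of the newly encoded data into the random initial STM;
    # follow-up steps use 50% - step * 5%, jittered by up to ~15% (assumed jitter form).
    base = 0.9 if step == 0 else 0.5 - step * 0.05
    return base * (1.0 + random.uniform(-0.15, 0.15))

def memory_attention_pretrain_loss(stm, encoded, memory_attention, step):
    w = new_data_weight(step)
    label = (1.0 - w) * stm + w * encoded            # weighted-mean pseudo-label
    label = label + 0.01 * torch.randn_like(label)   # small random noise (assumed scale)
    pred = memory_attention(stm, encoded)
    # Negative cosine similarity between the updated STM and the pseudo-label.
    return -F.cosine_similarity(pred.flatten(1), label.flatten(1), dim=-1).mean()
```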
|
|
This model was trained on six arbitrarily selected steps, using a single epoch on 30% of the [**ReactiveAI/Real-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) dataset.
|
|
| #### Supervised Memory-Aware Training |
Finally, with pre-trained/fine-tuned components, in the last supervised stage the model is trained to use previous/accumulated STM
states as the memory cross-attention input, instead of the same sequences as the decoder's input (sketched after the list below):
- the previous (or first) interaction is processed by the encoder and used to update memory
- the next interaction is processed by the decoder, using related information from the STM
- the loss is calculated from the decoder's logits, and gradients propagate through memory attention to the encoder
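A minimal sketch of this memory-aware step over one multi-turn conversation, with illustrative placeholder names (not the RxLM trainer API):

```python
# Illustrative Supervised Memory-Aware Training step (placeholder names, not RxLM APIs).
def smat_step(conversation, initial_stm, encoder, decoder, memory_attention, ar_loss_fn):
    stm = initial_stm
    total_loss = 0.0
    previous = conversation[0]
    for current in conversation[1:]:
        # The previous interaction is encoded and merged into the STM...
        stm = memory_attention(stm, encoder(previous["input_ids"]))
        # ...then the next interaction is decoded against that accumulated STM.
        logits = decoder(current["input_ids"], memory=stm)
        # The decoder loss backpropagates through memory attention to the encoder.
        total_loss = total_loss + ar_loss_fn(logits, current["labels"])
        previous = current
    return total_loss
```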
|
|
We used staged memory-aware training with different datasets:
- starting with 2 epochs on 80k raw examples (with 7 interactions each) - [**ReactiveAI/Real-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT)
- then 5 epochs on 27k filtered, higher-quality examples - [**ReactiveAI/Real-Chat-No-System-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-No-System-SMAT)
|
|
| #### Specialization |
After the stages described above, the general-purpose model was saved as [RxT-Beta-Micro-Supervised](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised) and we moved on to the AI/Data Science specialization.
|
|
It is the same training procedure as the previous stage - Supervised Memory-Aware Training:
- we used 21.5k synthetically generated AI/Data Science knowledge chat examples from [**ReactiveAI/AI-Knowledge-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/AI-Knowledge-Chat-SMAT), combined with 6.5k examples from the filtered general dataset
- finally, we used 50% of the dataset from the previous step together with the new [**ReactiveAI/ReactiveAI-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/ReactiveAI-Chat-SMAT), which contains information about our own technologies and the model's identity
|
|
| #### Preprocessing |
Pre-training is done on raw text corpora and requires only tokenization. In the next stages, the model processes sequences in a simple **Interaction format**, used
instead of complex chat templates - `[Q] User's query... [A] Model's answer`. For upcoming reasoning models, it will be extended to `[Q] User's query... [T] Reasoning... [A] Model's answer`.
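For example, a two-turn conversation in this format decomposes into two independent sequences (illustrative text); the context between them comes from the Short-Term Memory, not from concatenation:

```python
# Each interaction is a standalone sequence - no chat template, no history concatenation.
turns = [
    "[Q] What is a Reactive Language Model? [A] A stateful model that processes one interaction at a time.",
    "[Q] How does it keep context? [A] Through an attention-based Short-Term Memory updated after each interaction.",
]
# A stateless LLM would instead prepend all previous turns to every new prompt.
```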
|
|
| #### Training Hyperparameters |
| - **Training regime:** bf16 mixed precision (AMP autocast) |
| - **Optimizer**: AdamW |
| - **Scheduler**: Cosine annealing |
|
|
| ## Evaluation |
| Evaluation is in progress - more details soon! |
|
|
| ### Testing Data, Factors & Metrics |
|
|
| #### Testing Data |
|
|
|
|
| [More Information Needed] |
|
|
| #### Factors |
|
|
|
|
| [More Information Needed] |
|
|
| #### Metrics |
| In progress |
|
|
| ##### Supervised Memory-Aware Training Validation Metrics |
| - **Loss:** 0.5360 |
| - **Perplexity**: 1.7091 |
| - **Accuracy**: 88.97% |
|
|
|
|
| ### Results |
|
|
| [More Information Needed] |
|
|
| #### Summary |
|
|
|
|
| ## Environmental Impact |
| - Base model |
| - **Hardware Type:** 4x NVIDIA A100 40GB |
| - **Hours used:** 150 |
| - Specialization |
| - **Hardware Type:** 1x NVIDIA A100 40GB |
| - **Hours used:** 30 |
|
|
| ## Model Card Contact |
| [Adam Filipek](https://huggingface.co/AdamF92) - adamfilipek@rxai.dev |
|
|
Licensing - licensing@rxai.dev