Paper: *Language-based Trial and Error Falls Behind in the Era of Experience* (arXiv:2601.21754)
This repository contains the final checkpoint produced by the SCOUT (sequential RL) framework. The model is based on Qwen2.5-3B-Instruct and was trained sequentially across multiple environments, ending with the Sudoku task.
The SCOUT framework enables large language models to acquire new skills sequentially while maintaining performance on previously learned tasks. This checkpoint represents the culmination of the training pipeline, achieving state-of-the-art multi-task performance on the SCOUT benchmark.
The following results demonstrate the model's performance across all environments after completing the full sequential training curriculum:
| Task Group | Environment / Setting | Score |
|---|---|---|
| Bandit | General | 1.0 |
| FrozenLake | Static / Slippery | 0.89 / 0.88 |
| Sokoban | Box1 / Box2 | 0.95 / 0.59 |
| Rubik's Cube | Rotation 1 / 2 / 3 | 1.0 / 1.0 / 0.89 |
| Sudoku | General | 0.98 |
| Average | Overall Multi-task | 0.91 |
Data source: Internal evaluation metrics for the SCOUT sequential RL pipeline.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Harryis/SCOUT_multitask"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Example: Prompt the model for a Sudoku move or Sokoban action
```
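A minimal sketch of how a Sudoku state might be rendered into a chat prompt for this checkpoint. Note that the grid size, move syntax, and prompt wording here are illustrative assumptions, not the SCOUT environments' documented format:

```python
# Illustrative only: the exact observation/prompt format used by the SCOUT
# environments is not documented here; this shows one plausible way to turn
# a Sudoku state into a chat prompt for the model loaded above.

def render_sudoku(grid):
    """Render a Sudoku grid as text, with 0 marking empty cells."""
    return "\n".join(
        " ".join("." if cell == 0 else str(cell) for cell in row)
        for row in grid
    )

grid = [
    [1, 0, 3, 0],
    [0, 4, 0, 2],
    [2, 0, 4, 0],
    [0, 3, 0, 1],
]

messages = [
    {"role": "system", "content": "You are solving a 4x4 Sudoku puzzle."},
    {"role": "user", "content": (
        "Current board ('.' = empty):\n"
        + render_sudoku(grid)
        + "\nGive your next move as 'row col value'."
    )},
]

# With the tokenizer and model loaded above, one would then apply the chat
# template and generate, e.g.:
# inputs = tokenizer.apply_chat_template(
#     messages, add_generation_prompt=True, return_tensors="pt"
# ).to(model.device)
# outputs = model.generate(inputs, max_new_tokens=64)
```

Since Qwen2.5-Instruct models are chat-tuned, `apply_chat_template` is the natural entry point; the same pattern applies to the other environments (Sokoban, FrozenLake, etc.) with their own state renderings.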