Paper: *Language-based Trial and Error Falls Behind in the Era of Experience* (arXiv:2601.21754)
This repository contains the final checkpoint produced by the SCOUT (sequential RL) framework. The model is based on Qwen2.5-3B-Instruct and was trained sequentially across multiple environments, ending with the Sudoku task.
The SCOUT framework enables large language models to acquire new skills sequentially while maintaining performance on previously learned tasks. This checkpoint represents the culmination of the training pipeline, achieving state-of-the-art multi-task performance on the SCOUT benchmark.
The following results demonstrate the model's performance across all environments after completing the full sequential training curriculum:
| Task Group | Environment / Setting | Score |
|---|---|---|
| Bandit | General | 1.0 |
| FrozenLake | Static / Slippery | 0.89 / 0.88 |
| Sokoban | Box1 / Box2 | 0.95 / 0.59 |
| Rubik's Cube | Rotation 1 / 2 / 3 | 1.0 / 1.0 / 0.89 |
| Sudoku | General | 0.98 |
| Average | Overall Multi-task | 0.91 |
Data source: Internal evaluation metrics for the SCOUT sequential RL pipeline.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Harryis/SCOUT_multitask"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Example: Prompt the model for a Sudoku move or Sokoban action
```
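A minimal sketch of how a Sudoku state might be rendered into a chat prompt for this checkpoint. Note that the grid size, move syntax, and prompt wording here are illustrative assumptions, not the SCOUT environments' documented format:

```python
# Illustrative only: the exact observation/prompt format used by the SCOUT
# environments is not documented here; this shows one plausible way to turn
# a Sudoku state into a chat prompt for the model loaded above.

def render_sudoku(grid):
    """Render a Sudoku grid as text, with 0 marking empty cells."""
    return "\n".join(
        " ".join("." if cell == 0 else str(cell) for cell in row)
        for row in grid
    )

grid = [
    [1, 0, 3, 0],
    [0, 4, 0, 2],
    [2, 0, 4, 0],
    [0, 3, 0, 1],
]

messages = [
    {"role": "system", "content": "You are solving a 4x4 Sudoku puzzle."},
    {"role": "user", "content": (
        "Current board ('.' = empty):\n"
        + render_sudoku(grid)
        + "\nGive your next move as 'row col value'."
    )},
]

# With the tokenizer and model loaded above, one would then apply the chat
# template and generate, e.g.:
# inputs = tokenizer.apply_chat_template(
#     messages, add_generation_prompt=True, return_tensors="pt"
# ).to(model.device)
# outputs = model.generate(inputs, max_new_tokens=64)
```

Since Qwen2.5-Instruct models are chat-tuned, `apply_chat_template` is the natural entry point; the same pattern applies to the other environments (Sokoban, FrozenLake, etc.) with their own state renderings.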