This repository contains the code for our paper "SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution"
This paper introduces Stepwise Progress Attribution (SPA), a novel reward redistribution framework that provides fine-grained intermediate rewards by decomposing the delayed final reward into incremental, per-step contributions, enabling more effective reinforcement learning on complex, multi-step agent tasks (WebShop, ALFWorld, and VirtualHome).
- Fine-grained Reward Redistribution: Effectively decomposes delayed rewards into intermediate rewards, reflecting incremental progress toward task completion.
- Effective Multi-turn RL Training: Feeds these intermediate rewards into PPO, achieving strong performance on complex long-horizon tasks.
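For intuition, here is a minimal, hypothetical sketch of the redistribution step (the function name and shapes are ours, not the repo's): per-step progress scores from the estimator are rescaled so they sum to the episode's delayed final reward.

```python
from typing import List

def redistribute_reward(step_progress: List[float], final_reward: float) -> List[float]:
    """Toy sketch: split one delayed episode reward into per-step rewards.

    `step_progress` holds the progress estimator's score for each step;
    scores are rescaled so the per-step rewards sum to `final_reward`.
    """
    total = sum(step_progress)
    if total == 0:
        # No estimated progress anywhere: fall back to a uniform split.
        return [final_reward / len(step_progress)] * len(step_progress)
    return [final_reward * p / total for p in step_progress]

# A 4-step trajectory whose episode ends with reward 1.0:
print(redistribute_reward([0.1, 0.4, 0.2, 0.3], 1.0))  # -> [0.1, 0.4, 0.2, 0.3]
```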
- [2025.05.28] 🔥 Released our paper on arXiv. See here.
- [2025.05.26] 🚀 SPA-RL-Agent Repo launched!
# Create virtual environment for agentic evaluation
conda create -n SPA python=3.9
conda activate SPA
pip install -r requirements.txt
# Set up the WebShop environment
cd envs/webshop
pip install -e .
python -m spacy download en_core_web_lg
conda install -y -c conda-forge openjdk=11
# Download data for WebShop environment
gdown https://drive.google.com/uc?id=1G_0ccLWn5kZE5rpeyAdh_YuoNzvBUjT9
gdown https://drive.google.com/uc?id=11zOUDkJSgGhYin9NxQtG8PVpDsika86y
unzip data.zip
mkdir search_index
unzip indexes.zip -d search_index/
# Download data for ALFWorld environment
cd ../..
cd eval_agent/data/alfworld
gdown https://drive.google.com/uc?id=1y7Vqeo0_xm9d3I07vZaP6qbPFtyuJ6kI
unzip alfworld_data.zip
# Download data for VirtualHome environment
cd ../..
gdown https://drive.google.com/uc?id=1kZKWkWhtJ-DneqfS1Nb_FybR1RBxPeef
unzip virtualhome_master.zip
# Download expert trajectories for the WebShop, ALFWorld, and VirtualHome environments
cd ../..
gdown https://drive.google.com/uc?id=1_tBMDixZcIjKuv-LExNllha-YIRxhKIq
unzip data.zip
Create another virtual environment for RL training due to package conflicts:
conda create -n RL_train python=3.10
conda activate RL_train
pip install -r ppo/requirements.txt
conda activate SPA
cd sft
# For ALFWorld environment
bash alfworld_llama3b.sh
# For WebShop environment
bash webshop_llama3b.sh
# For VirtualHome environment
bash virtualhome_llama3b.sh
cd ..
# For ALFWorld environment
bash exploration/alfworld/my_generate_response.sh
# For WebShop environment
bash exploration/webshop/my_generate_response_webshop.sh
To organize the exploration data for progress estimator training, please run the following script first.
python prm/data_org.py
Then run the following script to train the progress estimator.
deepspeed --include=localhost:0,1,2,3 prm/train_our_progress_model.py
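For reference, the core idea behind the estimator's objective can be sketched as follows (a simplified PyTorch sketch under our own assumptions, not the exact code in prm/train_our_progress_model.py): a scalar head predicts a progress increment for each step, and the increments are trained so their sum matches the trajectory's final reward.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 3072  # assumed hidden size; depends on the backbone model
progress_head = nn.Linear(HIDDEN_SIZE, 1)  # scalar progress increment per step

def progress_loss(step_hidden: torch.Tensor, final_reward: torch.Tensor) -> torch.Tensor:
    """step_hidden: (num_steps, HIDDEN_SIZE); final_reward: scalar tensor.

    The predicted per-step increments should add up to the delayed episode
    reward, so their sum is regressed onto `final_reward` with an MSE loss.
    """
    increments = progress_head(step_hidden).squeeze(-1)  # (num_steps,)
    return (increments.sum() - final_reward).pow(2)
```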
After training, run the progress estimator's inference script to assign intermediate rewards to the explored trajectories:
python prm/inference_prm.py
To organize the inference data for RL training, please run the following script first.
python prm/rl_data_org.py
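The exact schema is defined by the scripts above; as a purely illustrative example, each step of a reward-annotated trajectory might look roughly like this (hypothetical field names):

```python
# Hypothetical shape of one reward-annotated trajectory (illustrative only):
trajectory = {
    "task": "put a clean mug on the desk",
    "steps": [
        {"observation": "...", "action": "take mug 1 from shelf 2", "reward": 0.18},
        {"observation": "...", "action": "clean mug 1 with sinkbasin 1", "reward": 0.37},
    ],
    "final_reward": 1.0,  # per-step rewards sum to the delayed episode reward
}
```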
Next, run the following script to perform PPO training with LoRA:
conda activate RL_train
bash ppo/train_ppo.sh
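Conceptually, this stage is standard PPO; the difference is that rewards now arrive at every step instead of only at the end, e.g. when computing discounted returns (a generic sketch, not the repo's code):

```python
from typing import List

def discounted_returns(step_rewards: List[float], gamma: float = 0.99) -> List[float]:
    """With SPA, every step carries a reward, not just the terminal one."""
    returns, running = [], 0.0
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

print(discounted_returns([0.1, 0.4, 0.2, 0.3]))
```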
Before evaluation, we need to merge the LoRA weights with the original LLM weights to obtain the final model:
python ppo/merge.py
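Under the hood, merging LoRA adapters typically looks like the following (a sketch using Hugging Face peft; the model id and checkpoint paths are placeholders, see ppo/merge.py for the actual script):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = PeftModel.from_pretrained(base, "path/to/ppo_lora_checkpoint")
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("path/to/merged_model")
```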
Then run the evaluation scripts:
conda activate SPA
# For ALFWorld environment
bash eval/llama3_2_3b_eval_alfworld.sh
# For WebShop environment
bash eval/llama3_2_3b_eval_webshop.sh
# For VirtualHome environment
bash eval/llama3_2_3b_eval_virtualhome.sh
TODO
Our code implementation is based on ETO and steptool. We thank them for their great work.
Many thanks also to my wonderful co-authors: Chak Tou Leong, Jiashuo Wang, Jian Wang, and Wenjie Li.
If you find our work useful in your research, please consider citing our paper:
@article{wang2025spa,
  title={SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution},
  author={Wang, Hanlin and Leong, Chak Tou and Wang, Jiashuo and Wang, Jian and Li, Wenjie},
  journal={arXiv preprint arXiv:2505.20732},
  year={2025}
}