TFM: Reward Machines on RL

Apr 2026

In progress

Master's thesis inspired by 'Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning' (arXiv:2010.03950) and Rodrigo Toro Icarte's repository. It implements tabular Q-Learning in which the Q-table is indexed by the pair (RM_state, env_state) rather than by env_state alone. The Reward Machine is a deterministic finite automaton loaded from .txt files that define its transitions through propositional conditions (e.g., 'p' = passenger picked up, 'd' = passenger delivered, with negations such as '!p'). It supports CRM (Counterfactual Experiences for Reward Machines): when the agent violates a proposition, counterfactual experiences are generated to accelerate learning. Environments: Taxi-v3 (Gymnasium, 500 states), a custom MultiTaxiEnv with 2 passengers and a dynamic state space (up to ~10000 states), and MiniGrid-DoorKey-5x5. The Q-table is stored dynamically, keeping only the (RM_state, env_state) pairs that have actually been visited. Five variants are benchmarked on convergence speed and mean reward, and the learned policies are recorded as GIF animations.
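To make the description concrete, here is a minimal sketch of a Q-table keyed by (RM_state, env_state), a small Reward Machine with propositional conditions, and a CRM-style update. It is not the thesis code. The names `holds`, `RewardMachine`, `q_row`, `q_update` and `crm_update`, the in-memory transition layout, and the three-state taxi machine are all assumptions made for illustration, and the .txt file format is not reproduced. The CRM step follows the formulation in the cited paper (each environment transition is replayed from every non-terminal RM state), which may differ from how the thesis triggers counterfactual experiences.

```python
N_ACTIONS = 6  # Taxi-v3 exposes six discrete actions


def holds(condition, true_props):
    """Check a conjunction such as 'p&!d' against the set of true propositions."""
    for literal in condition.split("&"):
        literal = literal.strip()
        if literal.startswith("!"):
            if literal[1:] in true_props:
                return False
        elif literal not in true_props:
            return False
    return True


class RewardMachine:
    """Hypothetical in-memory RM: transitions[u] = [(condition, next_u, reward), ...]."""

    def __init__(self, initial, terminal, transitions):
        self.initial = initial
        self.terminal = set(terminal)
        self.transitions = transitions
        targets = {v for edges in transitions.values() for _, v, _ in edges}
        self.states = sorted({initial} | self.terminal | set(transitions) | targets)

    def step(self, u, true_props):
        """Follow the first edge whose condition holds; otherwise stay in u with reward 0."""
        for condition, next_u, reward in self.transitions.get(u, []):
            if holds(condition, true_props):
                return next_u, reward
        return u, 0.0


def q_row(q, u, s):
    """Read-only lookup: unseen (RM_state, env_state) pairs are not stored."""
    return q.get((u, s), [0.0] * N_ACTIONS)


def q_update(q, u, s, a, r, u2, s2, done, alpha=0.1, gamma=0.99):
    """Standard Q-Learning update on the product state (u, s)."""
    target = r if done else r + gamma * max(q_row(q, u2, s2))
    row = q.setdefault((u, s), [0.0] * N_ACTIONS)  # entry is created on first write
    row[a] += alpha * (target - row[a])


def crm_update(q, rm, s, a, s2, true_props, env_done, alpha=0.1, gamma=0.99):
    """Replay one environment transition from every non-terminal RM state."""
    for u in rm.states:
        if u in rm.terminal:
            continue
        u2, r = rm.step(u, true_props)
        done = env_done or u2 in rm.terminal
        q_update(q, u, s, a, r, u2, s2, done, alpha, gamma)


# Assumed taxi machine: 0 = no passenger, 1 = passenger on board, 2 = delivered.
taxi_rm = RewardMachine(
    initial=0,
    terminal=[2],
    transitions={0: [("p", 1, 0.0)], 1: [("d", 2, 1.0)]},
)
q_table = {}
# Environment state 10, action 5 and next state 11 are arbitrary placeholders.
crm_update(q_table, taxi_rm, s=10, a=5, s2=11, true_props={"d"}, env_done=False, alpha=0.5)
print(q_table[(1, 10)][5])  # 0.5
```

In this run the event 'd' is observed once, yet the entry for RM state 1 is updated even if the agent was actually in RM state 0. That sharing of experience across RM states is what CRM relies on to speed up learning.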

Technologies
AI

Jupyter

Matplotlib

NumPy

Pygame

Python

Reinforcement Learning

https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/base_description_Taxi-v3.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/farming.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_doorkey.txt_qtable_video.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi.txt_2-p2-normal_qtable_video.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi.txt_2-p2_qtable_video.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi.txt_2_qtable_video.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi.txt_p2-normal_qtable_video.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi.txt_qtable_video.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-0.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-1.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-2-0.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-2-1.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-2-2.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-2-3.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-2-4.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-2.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-3.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-4.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-p-0.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-p-1.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-p-2.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-p-3.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_2p.txt-p-4.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_v2.txt_qtable_video.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/rm_taxi_v2.txt_qtable_video_static.gif
https://raw.githubusercontent.com/MiquelGomezCorral/TFM-Reinforcement-Learning-Reward-Machines/main/videos/taxi_qtable_video.gif