ExploRLLM: Guiding Exploration in Reinforcement Learning with Language Models

Delft University of Technology


Reinforcement learning struggles with low sample efficiency, slow training, and uncertain convergence in the large observation and action spaces of image-based robot manipulation tasks. As an alternative, large pre-trained foundation models have shown promise in robotic manipulation, particularly in zero-shot and few-shot settings. However, using these models directly is unreliable due to their limited reasoning capabilities and difficulty understanding physical and spatial contexts. Despite these limitations, the inferential capabilities of foundation models can serve as "experts" to inform and guide the reinforcement learning process. This paper proposes ExploRLLM, a novel approach that combines the strengths of reinforcement learning with knowledge from foundation models. We use actions suggested by large language models to direct the exploration process, significantly enhancing the efficiency of reinforcement learning and enabling robots to perform better than when relying solely on foundation models. Our experiments demonstrate that this guided exploration yields much faster convergence than training without the guidance of foundation models. Additionally, we validate that integrating reinforcement learning with foundation models achieves higher success rates and better task performance than foundation models alone.


ExploRLLM (Guiding Exploration in Reinforcement Learning with Language Models) is a novel methodology that integrates the advantages of reinforcement learning with knowledge from foundation models. Our approach uses a reinforcement learning agent equipped with a residual action space and an observation space derived from affordances recognized by foundation models. We leverage actions recommended by large language models to guide the exploration process, enhancing the learning strategy.
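The exploration guidance described above can be sketched as an epsilon-style switch between the agent's own policy and the expert suggestion from the language model. This is a minimal illustration, not the paper's implementation; the function name and the `p_explore` parameter are assumptions for the sketch.

```python
import random

def select_action(policy_action, expert_action, p_explore=0.5, rng=random):
    """Pick the next action for the RL agent.

    With probability `p_explore`, follow the action suggested by the
    LLM-based expert (foundation-model-guided exploration); otherwise
    use the action sampled from the RL policy. All names here are
    illustrative stand-ins, not identifiers from the ExploRLLM code.
    """
    if rng.random() < p_explore:
        return expert_action   # exploration guided by the foundation model
    return policy_action       # exploitation via the learned RL policy
```

In practice `p_explore` could be annealed over training so the agent relies on the LLM expert early on and on its own learned policy later.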

To create plans for robotic manipulation tasks, prior research often prompts LLMs at every step. However, invoking the LLM this frequently during training is highly resource-intensive, incurring significant time and financial costs over the many iterations required to train a single RL agent. Drawing inspiration from Code as Policies, our methodology instead employs the LLM to hierarchically generate language model programs, which are then executed iteratively during the training phase as exploratory actions, improving efficiency and resource utilization.
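The cost saving comes from querying the LLM once per task to produce an executable program, then reusing that program across training episodes. A minimal caching sketch, assuming a hypothetical `generate_program` callable that wraps the LLM call:

```python
class CachedLLMPlanner:
    """Generate a language model program once per task and reuse it.

    `generate_program` is a hypothetical stand-in for an expensive LLM
    call that returns an executable plan (e.g. a callable mapping an
    observation to an exploratory action). Subsequent requests for the
    same task hit the cache instead of re-prompting the LLM.
    """

    def __init__(self, generate_program):
        self._generate = generate_program
        self._cache = {}

    def get_plan(self, task_description):
        # Only the first request for a task triggers an LLM call.
        if task_description not in self._cache:
            self._cache[task_description] = self._generate(task_description)
        return self._cache[task_description]
```

During RL training, each episode would fetch the cached plan and execute it as the exploration policy, so the number of LLM calls is bounded by the number of distinct tasks rather than the number of training steps.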



Experiments demonstrate that our exploration method significantly shortens RL convergence time. We also show that ExploRLLM outperforms policies derived solely from the LLM and VLM.

Because the VLM has already extracted the observation space, the reinforcement learning agent trained in simulation encounters fewer distractions from real-world noise. As a result, the ExploRLLM approach also yields promising results in zero-shot applications with foundation models.