ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models

¹Delft University of Technology  ²RWTH Aachen University

Abstract

In robot manipulation tasks with large observation and action spaces, reinforcement learning (RL) often suffers from low sample efficiency and uncertain convergence. As an alternative, foundation models have shown promise in zero-shot and few-shot applications. However, these models can be unreliable due to their limited reasoning and challenges in understanding physical and spatial contexts. This paper introduces ExploRLLM, a method that combines the commonsense reasoning of foundation models with the experiential learning capabilities of RL. We leverage the strengths of both paradigms by using foundation models to obtain a base policy, an efficient representation, and an exploration policy. A residual RL agent learns when and how to deviate from the base policy while its exploration is guided by the exploration policy. In table-top manipulation experiments, we demonstrate that ExploRLLM outperforms both baseline foundation model policies and baseline RL policies. Additionally, we show that this policy can be transferred to the real world without further training.

ExploRLLM

ExploRLLM is a novel method that integrates the strengths of reinforcement learning with knowledge from foundation models. Our approach uses a reinforcement learning agent with a residual action space and an observation space derived from affordances recognized by foundation models. Actions recommended by large language models guide the exploration process, increasing the likelihood of visiting meaningful states.
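As a rough illustration of this action-selection scheme, the sketch below shows one way a single guided step could look in Python. The names `guided_step`, `base_policy`, `residual_agent`, and the fixed epsilon value are illustrative placeholders, not the ExploRLLM implementation: with some probability the agent executes the foundation-model suggestion as an exploratory action, and otherwise applies the learned residual on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)

def guided_step(obs, base_policy, residual_agent, epsilon=0.5):
    """One action-selection step of a guided-exploration scheme (illustrative only).

    base_policy(obs)            -> base action suggested by the foundation models
    residual_agent.predict(obs) -> learned residual correction from the RL agent
    epsilon                     -> probability of following the suggestion as-is
    """
    base_action = np.asarray(base_policy(obs))
    if rng.random() < epsilon:
        # Exploration guided by the foundation model: execute its suggestion
        # directly, biasing rollouts toward meaningful states.
        return base_action
    # Otherwise the RL agent decides when and how to deviate via a residual.
    return base_action + np.asarray(residual_agent.predict(obs))


# Toy usage with stand-in policies (a 2-D pick position as the action):
class ZeroResidual:
    def predict(self, obs):
        return np.zeros(2)

action = guided_step(obs=None,
                     base_policy=lambda o: np.array([0.3, 0.1]),
                     residual_agent=ZeroResidual())
```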


To generate plans for robotic manipulation tasks, prior work often prompts LLMs at every step. Invoking the LLM this frequently during training is highly resource-intensive, however, incurring significant time and financial costs over the many iterations required to train a single RL agent. Drawing inspiration from Code as Policies, our method instead employs the LLM to hierarchically generate language model programs, which are then executed repeatedly during training as exploratory actions, improving efficiency and resource utilization, as sketched below.
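The following sketch illustrates this idea under stated assumptions; `query_llm`, `EXPLORATION_PROMPT`, and `build_exploration_policy` are hypothetical names rather than the paper's code. The LLM is queried once to produce a program, which is compiled into a callable and reused throughout training, so no per-step LLM calls are needed.

```python
# Minimal sketch, not the paper's implementation: generate the language model
# program once, then reuse it at every training step instead of re-prompting.
# `query_llm` is a hypothetical wrapper around any code-generating chat API.

EXPLORATION_PROMPT = (
    "Write a Python function explore(obs) that returns a pick-and-place action "
    "for the current table-top scene described by `obs`."
)

def build_exploration_policy(query_llm):
    """Query the LLM a single time and compile its code into a reusable callable."""
    program_source = query_llm(EXPLORATION_PROMPT)  # one (hierarchical) generation call
    namespace = {}
    exec(program_source, namespace)                 # materialize the language model program
    return namespace["explore"]                     # callable exploration policy

# During RL training the compiled program is called directly, with no per-step LLM cost:
#     explore = build_exploration_policy(query_llm)
#     for step in range(num_training_steps):
#         action = explore(obs)
```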


Results

Experiments show that the guided exploration significantly reduces RL convergence time. Additionally, ExploRLLM outperforms policies based solely on the LLM and VLM.

Because the VLM extracts a compact observation space, the RL agent trained in simulation is less affected by real-world noise. ExploRLLM also shows promising results when the policy is transferred zero-shot to the real world.