RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World

1Waseda University, 2Beijing Institute of Technology,
3Megvii Technology, 4The Chinese University of Hong Kong

*Equal contribution

Abstract

Existing policy learning methods predominantly adopt a task-centric paradigm, which requires collecting data for each task in an end-to-end manner. Consequently, the learned policy tends to fail on novel tasks, and errors in a complex multi-stage task are hard to localize because of end-to-end learning. To address these challenges, we propose RoboMatrix, a skill-centric and hierarchical framework for scalable task planning and execution. We first introduce a novel skill-centric paradigm that extracts common meta-skills from different complex tasks. This allows embodied demonstrations to be collected in a skill-centric manner, so that open-world tasks can be completed by combining learned meta-skills. To fully leverage meta-skills, we further develop a hierarchical framework that decouples complex robot tasks into three interconnected layers: (1) a high-level modular scheduling layer; (2) a middle-level skill layer; and (3) a low-level hardware layer. Experimental results show that our skill-centric and hierarchical framework achieves remarkable generalization across novel objects, scenes, tasks, and embodiments, offering a new solution for robot task planning and execution in open-world scenarios. Our software and hardware will be made available as open-source resources.

Skill-Centric

Inspiration for the skill-centric method. Robots with different modalities can perform different tasks, and robots with the same modality can be used in various scenarios. We extract the elements shared across this multitude of diverse robotic tasks, define them as meta-skills, and store them in a skill list. These skills are then used to train a Vision-Language-Action (VLA) model or to construct traditional models, eventually yielding a skill model capable of adapting to new tasks.
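As an illustration, a skill list of reusable meta-skills and a task composed from it could be represented as follows. This is a minimal sketch; all names (`MetaSkill`, `SKILL_LIST`, `compose_task`) are hypothetical and do not reflect RoboMatrix's actual API.

```python
from dataclasses import dataclass

@dataclass
class MetaSkill:
    name: str    # e.g. "move_to", "grasp", "place"
    model: str   # backend that executes this skill: "vla" or "hybrid"

# A skill list shared by many tasks; each task is just an ordered
# composition of entries drawn from this list.
SKILL_LIST = {
    "move_to": MetaSkill("move_to", model="vla"),
    "grasp":   MetaSkill("grasp",   model="vla"),
    "place":   MetaSkill("place",   model="hybrid"),
}

def compose_task(skill_names):
    """Build a task as an ordered sequence of meta-skills from the list."""
    return [SKILL_LIST[n] for n in skill_names]

# A pick-and-place task assembled entirely from learned meta-skills.
pick_and_place = compose_task(["move_to", "grasp", "move_to", "place"])
```

The point of the sketch is that a new task requires no new end-to-end data collection, only a new composition of existing skill-list entries.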


RoboMatrix

The system accepts a task description in either text or audio form; text can be entered manually, while audio is converted to text by an audio-to-text module. The Modular Scheduling Layer serves as the system's high-level planner: the agent decomposes a complex task into an ordered sequence of subtasks based on the robot's skill list and adds them to the execution queue in order. Before a subtask is executed, the execution checker verifies its executability by determining, from the robot's environment observations, whether the object to be manipulated or grasped is present in the scene. The Skill Layer maps each subtask description to robot actions using either a hybrid model or a VLA model, with the action including a stop signal that indicates whether the current subtask is complete. The Hardware Layer manages the robot's controller and stage observer: the controller converts actions into control signals, and the stage observer continuously updates the robot's state and camera images in real time.
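The control flow across the three layers can be sketched as below. This is an illustrative skeleton under simplifying assumptions, not the RoboMatrix implementation: the scheduling agent is replaced by a pre-decomposed subtask queue, and the skill model's stop signal is stood in for by a fixed step budget.

```python
from collections import deque

class FakeRobot:
    """Stand-in for the Hardware Layer: a controller plus a stage observer."""
    def __init__(self, visible_objects):
        self.visible_objects = visible_objects
        self.log = []

    def observe(self):
        # Stage observer: report the current scene observation.
        return {"visible_objects": self.visible_objects}

    def execute(self, action):
        # Controller: convert the action into a (here, logged) control signal.
        self.log.append(action["cmd"])

def run(subtasks, robot):
    """Modular Scheduling Layer: drain an ordered queue of subtasks."""
    queue = deque(subtasks)
    while queue:
        subtask = queue.popleft()
        obs = robot.observe()
        # Execution checker: skip a subtask whose target object is absent.
        if subtask["object"] not in obs["visible_objects"]:
            continue
        # Skill Layer: emit actions until the stop signal fires; a fixed
        # step budget stands in for the skill model's learned stop signal.
        steps, stop = 0, False
        while not stop:
            robot.execute({"cmd": subtask["skill"]})
            steps += 1
            stop = steps >= subtask["budget"]
```

A subtask whose object is missing from the observation never reaches the skill layer, which mirrors how the execution checker gates the queue.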


Generalization

Vision-Language-Action model generalization to novel and challenging scenarios.

"Place the pink cube into black toolbox."

"Crossing the obstacles at the front."

Dynamic Adversarial Interaction

A human dynamically rearranges obstacles in the robot's path. The VLA model processes visual input in real time and generates actions that allow the robot to avoid these obstacles.

"Crossing the obstacles at the front."

"Place the blue cup into the brown basket."

Super Long-horizon Tasks

RoboMatrix takes a composite task prompt as input, decomposes it into a skill list of 15 sequential steps, and leverages the VLA skill model to successfully execute this super long-horizon task.

BibTeX

@article{mao2024robomatrix,
      title={RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World},
      author={Mao, Weixin and Zhong, Weiheng and Jiang, Zhou and Fang, Dong and Zhang, Zhongyue and Lan, Zihan and Jia, Fan and Wang, Tiancai and Fan, Haoqiang and Yoshie, Osamu},
      journal={arXiv preprint arXiv:2412.00171},
      year={2024}
    }