Embodied artificial intelligence has developed rapidly under the impetus of multimodal learning, robotics, and cognitive science, demonstrating great potential in fields such as navigation and manipulation. However, building embodied agents that operate robustly in diverse and dynamic environments remains challenging, particularly in handling partial observability and adapting to environmental change. Multimodal large language models (MLLMs) are central to embodied intelligence because of their ability to process multimodal information, yet they struggle to understand spatial environments, make dynamic decisions, and evolve over time. Inspired by the functional specialization of the left and right hemispheres of the human brain, this paper proposes a learning and evolution paradigm for embodied agents. The method designs an embodied context-augmented MLLM that simulates the language processing and logical analysis capabilities of the left hemisphere and is responsible for understanding instructions and visual scenes. In parallel, it constructs a perceptual context-guided world model based on a recurrent state space model that simulates the spatial perception and holistic thinking functions of the right hemisphere, capturing environmental dynamics and predicting future states. To simulate the communication function of the corpus callosum, we propose dynamic communication slots for efficient information exchange between the MLLM and the world model, which also allow the agent to adapt quickly to dynamic environments without extensive computational resources. Experiments show that the proposed paradigm significantly improves the performance of embodied agents across a range of tasks and enhances their zero-shot generalization through embodied exploration experience and online evolution.
Our framework comprises three bio-inspired modules (a minimal sketch follows the list):
(1) EC-MLLM🗣️ (left hemisphere) processes language and visual inputs for task understanding;
(2) PC-WM🌍 (right hemisphere) models environment dynamics through a recurrent state space model;
(3) DCS🔄 (corpus callosum) enables inter-module communication via bidirectional message passing.
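To make the division of labor concrete, below is a minimal, hypothetical PyTorch sketch of the three modules. All class names, dimensions, and interfaces are illustrative assumptions rather than the released implementation: EC-MLLM is stubbed with a small transformer encoder, PC-WM follows a generic recurrent state space model (RSSM) design, and the dynamic communication slots are approximated by a fixed set of learnable slot vectors exchanged via cross-attention.

```python
# Hypothetical sketch only: names, sizes, and interfaces are illustrative assumptions,
# not the paper's released code.
import torch
import torch.nn as nn


class ECMLLMStub(nn.Module):
    """Stand-in for EC-MLLM (left hemisphere): fuses language and vision tokens."""

    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, lang_tokens, vis_tokens):
        # Concatenate modalities and return contextualized task features.
        return self.fuse(torch.cat([lang_tokens, vis_tokens], dim=1))


class PCWM(nn.Module):
    """Stand-in for PC-WM (right hemisphere): a generic RSSM-style dynamics model."""

    def __init__(self, obs_dim=256, state_dim=128, action_dim=8):
        super().__init__()
        self.rnn = nn.GRUCell(state_dim + action_dim, state_dim)
        self.posterior = nn.Linear(state_dim + obs_dim, 2 * state_dim)  # q(s_t | h_t, o_t)
        self.prior = nn.Linear(state_dim, 2 * state_dim)                # p(s_t | h_t)

    def step(self, h, s, a, obs_embed=None):
        # Deterministic recurrence, then a stochastic state from prior or posterior.
        h = self.rnn(torch.cat([s, a], dim=-1), h)
        stats = self.posterior(torch.cat([h, obs_embed], dim=-1)) if obs_embed is not None else self.prior(h)
        mean, log_std = stats.chunk(2, dim=-1)
        s = mean + log_std.exp() * torch.randn_like(mean)
        return h, s


class DCS(nn.Module):
    """Stand-in for the dynamic communication slots (corpus callosum):
    slot vectors read task context from the MLLM and write it into the world-model stream."""

    def __init__(self, num_slots=8, dim=256):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, mllm_feats, wm_feats):
        slots = self.slots.expand(mllm_feats.size(0), -1, -1)
        slots, _ = self.read(slots, mllm_feats, mllm_feats)  # gather task context into the slots
        wm_feats, _ = self.write(wm_feats, slots, slots)     # inject it into the world-model features
        return wm_feats


if __name__ == "__main__":
    B, dim, state_dim, action_dim = 2, 256, 128, 8
    mllm, wm, dcs = ECMLLMStub(dim), PCWM(dim, state_dim, action_dim), DCS(dim=dim)
    proj = nn.Linear(state_dim, dim)              # project world-model state into the shared space

    lang, vis = torch.randn(B, 12, dim), torch.randn(B, 49, dim)
    task_feats = mllm(lang, vis)                  # left hemisphere: task understanding

    h = torch.zeros(B, state_dim)
    s = torch.zeros(B, state_dim)
    a = torch.zeros(B, action_dim)
    obs_embed = torch.randn(B, dim)               # placeholder for an observation encoder
    h, s = wm.step(h, s, a, obs_embed)            # right hemisphere: one dynamics step

    wm_feats = proj(s).unsqueeze(1)
    fused = dcs(task_feats, wm_feats)             # corpus callosum: exchange information
    print(task_feats.shape, h.shape, s.shape, fused.shape)
```

In the actual system the MLLM backbone and world model are far larger; the sketch is only meant to show how slot-based message passing can bridge the language-reasoning stream and the dynamics-prediction stream.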
In this section, we evaluate the embodied execution, generalization, and evolution capabilities of the proposed brain-inspired embodied evolutionary agent (BEEA). We first train on basic embodied tasks such as navigation to strengthen the foundation model. We then evaluate zero-shot generalization across diverse embodied tasks, validating the proposed paradigm's contribution to embodied execution and spatial intelligence.
For detailed experimental settings, comparative results, and additional qualitative analysis, please refer to the Supplementary Materials.
@InProceedings{Gao_2025_ACMMM,
author = {Junyu Gao and Xuan Yao and Yong Rui and Changsheng Xu},
title = {Building Embodied EvoAgent: A Brain-inspired Paradigm for Bridging Multimodal Large Models and World Models},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM)},
year = {2025},
}