Embodied artificial intelligence has developed rapidly under the impetus of multimodal learning, robotics, and cognitive science, demonstrating great potential in fields such as navigation and manipulation. However, building embodied agents that operate robustly in diverse and dynamic environments remains challenging, particularly in handling partial observability and adapting to environmental change. Multimodal large language models (MLLMs) are central to embodied intelligence because of their ability to process multimodal information, yet they struggle to understand spatial environments, make dynamic decisions, and evolve over time. Inspired by the functional specialization of the left and right hemispheres of the human brain, this paper proposes a learning and evolution paradigm for embodied agents. The method designs an embodied context-augmented MLLM that simulates the language processing and logical analysis capabilities of the left hemisphere and is responsible for understanding instructions and visual scenes. In parallel, it constructs a perceptual context-guided world model based on a recurrent state space model that simulates the spatial perception and holistic thinking functions of the right hemisphere, capturing environmental dynamics and predicting future states. To simulate the communication function of the corpus callosum, we propose dynamic communication slots for efficient information exchange between the MLLM and the world model, which also allow the agent to adapt quickly to dynamic environments without extensive computational resources. Experiments show that the proposed paradigm significantly improves the performance of embodied agents across a range of tasks and enhances their zero-shot generalization through embodied exploration experience and online evolution.
Our framework comprises three bio-inspired modules (a minimal sketch follows the list):
(1) EC-MLLM🗣️ (left hemisphere) processes language and visual inputs for task understanding;
(2) PC-WM🌍 (right hemisphere) models environment dynamics through a recurrent state space model;
(3) DCS🔄 (corpus callosum) enables inter-module communication via bidirectional message passing.
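To make the division of labor concrete, below is a minimal, hypothetical PyTorch sketch of the three modules. All class names, dimensions, and interfaces are illustrative assumptions rather than the released implementation: EC-MLLM is stubbed with a small transformer encoder, PC-WM follows a generic recurrent state space model (RSSM) design, and the dynamic communication slots are approximated by a fixed set of learnable slot vectors exchanged via cross-attention.

```python
# Hypothetical sketch only: names, sizes, and interfaces are illustrative assumptions,
# not the paper's released code.
import torch
import torch.nn as nn


class ECMLLMStub(nn.Module):
    """Stand-in for EC-MLLM (left hemisphere): fuses language and vision tokens."""

    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, lang_tokens, vis_tokens):
        # Concatenate modalities and return contextualized task features.
        return self.fuse(torch.cat([lang_tokens, vis_tokens], dim=1))


class PCWM(nn.Module):
    """Stand-in for PC-WM (right hemisphere): a generic RSSM-style dynamics model."""

    def __init__(self, obs_dim=256, state_dim=128, action_dim=8):
        super().__init__()
        self.rnn = nn.GRUCell(state_dim + action_dim, state_dim)
        self.posterior = nn.Linear(state_dim + obs_dim, 2 * state_dim)  # q(s_t | h_t, o_t)
        self.prior = nn.Linear(state_dim, 2 * state_dim)                # p(s_t | h_t)

    def step(self, h, s, a, obs_embed=None):
        # Deterministic recurrence, then a stochastic state from prior or posterior.
        h = self.rnn(torch.cat([s, a], dim=-1), h)
        stats = self.posterior(torch.cat([h, obs_embed], dim=-1)) if obs_embed is not None else self.prior(h)
        mean, log_std = stats.chunk(2, dim=-1)
        s = mean + log_std.exp() * torch.randn_like(mean)
        return h, s


class DCS(nn.Module):
    """Stand-in for the dynamic communication slots (corpus callosum):
    slot vectors read task context from the MLLM and write it into the world-model stream."""

    def __init__(self, num_slots=8, dim=256):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, mllm_feats, wm_feats):
        slots = self.slots.expand(mllm_feats.size(0), -1, -1)
        slots, _ = self.read(slots, mllm_feats, mllm_feats)  # gather task context into the slots
        wm_feats, _ = self.write(wm_feats, slots, slots)     # inject it into the world-model features
        return wm_feats


if __name__ == "__main__":
    B, dim, state_dim, action_dim = 2, 256, 128, 8
    mllm, wm, dcs = ECMLLMStub(dim), PCWM(dim, state_dim, action_dim), DCS(dim=dim)
    proj = nn.Linear(state_dim, dim)              # project world-model state into the shared space

    lang, vis = torch.randn(B, 12, dim), torch.randn(B, 49, dim)
    task_feats = mllm(lang, vis)                  # left hemisphere: task understanding

    h = torch.zeros(B, state_dim)
    s = torch.zeros(B, state_dim)
    a = torch.zeros(B, action_dim)
    obs_embed = torch.randn(B, dim)               # placeholder for an observation encoder
    h, s = wm.step(h, s, a, obs_embed)            # right hemisphere: one dynamics step

    wm_feats = proj(s).unsqueeze(1)
    fused = dcs(task_feats, wm_feats)             # corpus callosum: exchange information
    print(task_feats.shape, h.shape, s.shape, fused.shape)
```

In the actual system the MLLM backbone and world model are far larger; the sketch is only meant to show how slot-based message passing can bridge the language-reasoning stream and the dynamics-prediction stream.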
In this section, we evaluate the embodied execution, generalization, and evolution capabilities of the proposed brain-inspired embodied evolutionary agent (BEEA). We first train on basic embodied tasks such as navigation to strengthen the foundation model. We then evaluate zero-shot generalization across diverse embodied tasks, validating the proposed paradigm's contribution to embodied execution and spatial intelligence.
For detailed experimental settings, comparative results, and additional qualitative analysis, please refer to the Supplementary Materials.
@InProceedings{Gao_2025_ACMMM,
author = {Junyu Gao and Xuan Yao and Yong Rui and Changsheng Xu},
title = {Building Embodied EvoAgent: A Brain-inspired Paradigm for Bridging Multimodal Large Models and World Models},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM)},
year = {2025},
}