🤖AssemBrain

Towards Embodied Intelligence Paradigms for Robot Collaboration in Spatial Assembly

Shanghai Jiao Tong University
Figure 1: Overview of the AssemBrain framework.

Abstract

Industry 5.0 advocates human-centric, intelligent, and flexible manufacturing, with growing attention to unstructured assembly tasks. Multimodal large language models (MLLMs) offer a promising pathway by enabling cross-modal understanding and contextual awareness. However, their integration with embodied robots, especially in physically interactive industrial settings, remains limited. To address this, we propose AssemBrain, a brain-inspired hierarchical framework that integrates MLLMs and embodied control to achieve generative AI-based cognition, planning, and execution across three levels: task understanding and planning, vision-language-guided planar manipulation, and reinforcement learning-based contact-rich assembly primitives. The framework is evaluated on simulated tasks of varying complexity and degrees of freedom, covering diverse object and texture types, and the results demonstrate strong execution efficiency. This study establishes a baseline for embodied intelligence-integrated manufacturing in this emerging domain. We hope this work advances industrial robotics toward human-centered collaboration and provides insights for intelligent assembly systems.

Method

This paper explores three paradigms for integrating MLLMs with embodied intelligence in smart manufacturing. Inspired by the hierarchical architecture of the brain, AssemBrain is conceptualized as a multi-level cognitive system for intelligent spatial assembly.

1. LLM for Unstructured Task Planning

Level 1 (L1): "Cognitive Cortex"

L1 corresponds to the brain's higher-order cognitive regions, focusing on understanding language instructions and long-term planning. This is fulfilled by MLLMs, which support zero-shot task interpretation, code generation, and task decomposition for unstructured tasks.
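As a concrete illustration, the sketch below shows how an MLLM could be prompted to decompose an unstructured instruction into a sequence of executable subtasks. The client setup, model id, prompt, and skill vocabulary are illustrative assumptions, not the implementation used in this work.

# Minimal sketch of L1-style task decomposition (illustrative; the endpoint, model id,
# prompt, and skill vocabulary are assumptions rather than this paper's implementation).
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key in the environment

SYSTEM_PROMPT = (
    "You are an assembly planner. Decompose the user's instruction into an ordered "
    "plan. Return a JSON object with a 'subtasks' array; each subtask has a 'skill' "
    "(one of: pick_place, peg_insert, gear_mesh, nut_thread) and a 'target' object."
)

def plan_assembly(instruction: str) -> list:
    """Ask the MLLM to decompose an unstructured instruction into executable subtasks."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model id
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["subtasks"]

if __name__ == "__main__":
    for step in plan_assembly("Mount the gear on the shaft, then fasten it with the nut."):
        print(step["skill"], "->", step["target"])

Each returned subtask can then be dispatched to the lower levels (L2 for planar pick-and-place, L3 for contact-rich insertion), which is what makes the zero-shot decomposition useful beyond pure text planning.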


2. VLA for Visuo-Lingo Motor Policy

Level 2 (L2): "Sensorimotor Cortex"

L2 mirrors the sensorimotor regions that integrate visual and linguistic inputs to guide physical actions. This is embodied by Vision-Language-Action (VLA) models, which excel in SE(2) planar tasks such as pick-and-place.
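The sketch below fixes the input/output contract such a policy exposes: an RGB observation and a language instruction map to a planar SE(2) action. The wrapper class, checkpoint path, and action layout are assumptions for illustration only, not a specific released VLA model.

# Sketch of an L2-style VLA query for SE(2) pick-and-place (interface is illustrative).
from dataclasses import dataclass
import numpy as np

@dataclass
class SE2Action:
    x: float        # planar target position (m)
    y: float
    yaw: float      # rotation about the table normal (rad)
    gripper: float  # 0 = open, 1 = closed

class VLAPolicy:
    """Thin wrapper fixing the I/O contract: (RGB image, instruction) -> SE(2) action."""

    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # hypothetical fine-tuned VLA checkpoint

    def act(self, rgb: np.ndarray, instruction: str) -> SE2Action:
        # A real model would encode the image, tokenize the instruction, and decode
        # a discretized action; a fixed placeholder keeps the sketch runnable.
        return SE2Action(x=0.45, y=-0.10, yaw=np.pi / 2, gripper=1.0)

policy = VLAPolicy("vla_assembly.ckpt")                 # hypothetical checkpoint path
frame = np.zeros((224, 224, 3), dtype=np.uint8)          # stand-in for a camera frame
print(policy.act(frame, "pick up the hammer and place it in the tray"))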


3. Primitives for Contact-Rich Assembly

Level 3 (L3): "Cerebellum"

L3 corresponds to the cerebellum, which refines motor commands for precision and stability. This paradigm leverages action primitives and reinforcement learning to execute complex SE(3) manipulation in contact-rich assembly, handling the nuances of force control and adaptation.
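As an illustration of such a primitive, the sketch below implements a spiral-search insertion whose parameters (angular step, radius growth, push force) are exactly the kind of quantities an RL policy can tune. The compliant-controller interface it calls is hypothetical and stands in for whatever task-space controller the robot exposes.

# Sketch of an L3-style contact-rich primitive: spiral-search insertion with RL-tunable
# parameters. The `controller` interface is a hypothetical compliant task-space controller.
import numpy as np

def spiral_search_insertion(controller, depth_target: float,
                            angle_step: float = 0.1, radius_step: float = 5e-5,
                            push_force: float = 5.0, max_steps: int = 500) -> bool:
    """Spiral over the hole mouth while pressing down until the peg seats."""
    theta, radius = 0.0, 0.0
    for _ in range(max_steps):
        dx, dy = radius * np.cos(theta), radius * np.sin(theta)  # planar search offset
        controller.set_planar_offset(dx, dy)    # hypothetical: shift the target pose
        controller.push_down(push_force)        # hypothetical: compliant push along -z
        if controller.insertion_depth() >= depth_target:
            return True                         # peg dropped in: primitive succeeded
        theta += angle_step
        radius += radius_step                   # slowly widen the spiral
    return False                                # report failure so L1 can re-plan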


Planning Experiments

Assembly tools and textures
Task Level            | Task Type       | Objects and Success Rates
Object-Generalization | Pick-and-Place  | Hammer (85.00%), Screwdriver (65.00%), Gear (95.00%), Nut (56.00%), Bolt (79.00%); Average: 76.00%
Scene-Understanding   | Contextual Pick | Gold (86.00%), Copper (89.00%), Steel (84.04%), Aluminum (83.00%), Corroded Metal (93.00%), Bronze (82.00%), Wooden (85.00%), Plastic (90.00%); Average: 86.50%
Long-Horizon          | Rearrangement   | Hammer, Screw, Gear; Success Rate: 41.00%

Figure: Example rollouts for the Object-Generalization, Scene-Understanding, and Long-Horizon tasks.

Assembly Experiments

Policy Comparison:

  • A2C (Advantage Actor-Critic): Baseline method using standard policy gradients with single-step updates.
  • PPO (Proximal Policy Optimization): Enhanced method whose clipped surrogate objective stabilizes training with mini-batch updates (see the sketch after this list).
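The following minimal sketch shows the clipped surrogate objective that distinguishes PPO from plain policy-gradient updates. It is a NumPy illustration of the standard formulation, not the training code used in this work.

# Minimal sketch of PPO's clipped surrogate objective (NumPy, for illustration only).
import numpy as np

def ppo_clip_loss(ratio: np.ndarray, advantage: np.ndarray, eps: float = 0.2) -> float:
    """ratio = pi_new(a|s) / pi_old(a|s); advantage = estimated A(s, a)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))  # minimized by gradient descent

# Example: a ratio far above 1 + eps no longer increases the objective.
print(ppo_clip_loss(np.array([1.5]), np.array([2.0])))  # clipped: uses 1.2 * 2.0
print(ppo_clip_loss(np.array([1.1]), np.array([2.0])))  # unclipped: uses 1.1 * 2.0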

Control Strategy Comparison:

  • OSC (Operational Space Control): Computes end-effector force/torque directly in task space, making it suitable for dynamic interaction and contact-rich tasks (a simplified task-space control law is sketched after this list).
  • IK (Inverse Kinematics): Maps the pose error to joint positions; simple, but less robust under contact.
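The simplified sketch below conveys the contrast: a Jacobian-transpose task-space PD law in the spirit of OSC (full OSC additionally weights by the task-space inertia matrix and handles gravity and null-space terms), versus IK's pose-to-joint-position mapping noted in the closing comment. Gains and shapes are illustrative assumptions.

# Simplified task-space PD law behind OSC-style control (illustrative only).
import numpy as np

def task_space_torque(J: np.ndarray, pose_err: np.ndarray, vel: np.ndarray,
                      kp: float = 150.0, kd: float = 25.0) -> np.ndarray:
    """Map a 6D end-effector pose error to joint torques via the Jacobian transpose.

    J:        6 x n geometric Jacobian
    pose_err: 6D task-space error (position + orientation)
    vel:      6D end-effector velocity
    """
    wrench = kp * pose_err - kd * vel  # desired task-space force/torque
    return J.T @ wrench                # joint torques commanded to the arm

# IK, by contrast, would solve dq = pinv(J) @ pose_err and command joint positions,
# which is why it is less compliant under unexpected contact.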

All experiments are conducted in the Isaac Lab environment. The comparative study evaluates both policy learning algorithms and control strategies across different assembly tasks.
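The metrics reported in the table below (average reward, average steps, success rate) can be gathered with a standard evaluation loop such as the sketch below; it uses a gymnasium-style API with a placeholder environment id and success flag, and is not tied to the Isaac Lab task definitions.

# Sketch of collecting avg-reward / avg-step / success metrics over evaluation rollouts
# (gymnasium-style API; env id and the "success" info key are assumptions).
import gymnasium as gym
import numpy as np

def evaluate(env_id: str, policy, episodes: int = 128):
    env = gym.make(env_id)
    rewards, steps, successes = [], [], []
    for _ in range(episodes):
        obs, _ = env.reset()
        total, t, done = 0.0, 0, False
        while not done:
            obs, r, terminated, truncated, info = env.step(policy(obs))
            total, t, done = total + r, t + 1, terminated or truncated
        rewards.append(total)
        steps.append(t)
        successes.append(float(info.get("success", False)))
    return np.mean(rewards), np.mean(steps), np.mean(successes)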


Method   | PegInsert (avg reward / avg steps / success) | GearMesh (avg reward / avg steps / success) | NutThread (avg reward / avg steps / success)
PPO+IK   | 12.50 / 449.0 / 0%                           | 12.20 / 449.0 / 0%                          | 12.90 / 449.0 / 0%
A2C+OSC  | 65.24 / 149.0 / 15.15%                       | 66.25 / 299.0 / 12.50%                      | 53.29 / 449.0 / 50.00%
PPO+OSC  | 362.19 / 149.0 / 97.66%                      | 715.28 / 299.0 / 98.44%                     | 888.85 / 449.0 / 99.22%

Figure: Peg Insertion, Gear Mesh, and Nut Thread executions before and after training.


Discussions

Q1: Which paradigm to choose?
  • L1 (LLM for Unstructured Task Planning): Suitable for planning in complex, long-horizon tasks, such as unstructured production.
  • L2 (VLA for Visuo-Lingo Motor Policy): Ideal for visual-language-guided operations, such as multimodal HRC with AR.
  • L3 (Primitives for Contact-Rich Assembly): Tailored for spatial tasks with dense physical contact, such as nut and gear assembly.

Q2: What are the strengths?
  • L1: Enables zero-shot task interpretation, code generation, and task decomposition, reducing ambiguity.
  • L2: Enables direct action generation from vision-language inputs, excelling in SE(2) planar tasks.
  • L3: Enables action primitives as a foundation for fine-grained SE(3) manipulation via reinforcement learning.

Q3: Where are the challenges?
  • L1: Reliable code generation remains challenging for dynamic interactions.
  • L2: Lacks proprioceptive sensing, limiting performance in contact-sensitive tasks.
  • L3: Reward design remains a bottleneck, requiring domain expertise and tuning.

BibTeX

If you find it helpful, please consider citing our work:

@article{wu2024assembrain,
  title={AssemBrain: Towards Embodied Intelligence Paradigms for Robot Collaboration in Spatial Assembly},
  author={Wu, Duidi and Zhao, Qianyou and Zhang, Shuo and Jin, Qi and Ma, Jin and Zhu, Guo-Niu and Hu, Jie},
  journal={preprint},
  year={2024}
}