🤖AssemBrain

Towards Embodied Intelligence Paradigms for Robot Collaboration in Spatial Assembly

Shanghai Jiao Tong University
Figure 1: Overview of the AssemBrain framework.

Abstract

Industry 5.0 advocates human-centric, intelligent, and flexible manufacturing, with growing attention to unstructured assembly tasks. Multimodal large language models (MLLMs) offer a promising pathway by enabling cross-modal understanding and contextual awareness. However, their integration with embodied robots, especially in physically interactive industrial settings, remains limited. To address this, we propose AssemBrain, a brain-inspired hierarchical framework that integrates MLLMs and embodied control to achieve generative AI-based cognition, planning, and execution across three levels: task understanding and planning, vision-language-guided planar manipulation, and reinforcement learning-based contact-rich assembly primitives. The framework is evaluated on simulated tasks of varying complexity and degrees of freedom, covering diverse object and texture types, and the results demonstrate strong execution efficiency. This study establishes a baseline for embodied intelligence-integrated manufacturing in this emerging domain. We hope this work advances industrial robotics toward human-centered collaboration and provides insights for intelligent assembly systems.

Method

This paper explores three paradigms for integrating MLLMs with embodied intelligence in smart manufacturing. Inspired by the hierarchical architecture of the brain, AssemBrain is conceptualized as a multi-level cognitive system for intelligent spatial assembly.

1. LLM for Unstructured Task Planning

Level 1 (L1): "Cognitive Cortex"

L1 corresponds to the brain's higher-order cognitive regions, focusing on understanding language instructions and long-term planning. This is fulfilled by MLLMs, which support zero-shot task interpretation, code generation, and task decomposition for unstructured tasks.
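As a concrete illustration, the sketch below shows how an MLLM could be prompted to decompose an unstructured instruction into a sequence of executable subtasks. The client setup, model id, prompt, and skill vocabulary are illustrative assumptions, not the implementation used in this work.

# Minimal sketch of L1-style task decomposition (illustrative; the endpoint, model id,
# prompt, and skill vocabulary are assumptions rather than this paper's implementation).
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key in the environment

SYSTEM_PROMPT = (
    "You are an assembly planner. Decompose the user's instruction into an ordered "
    "plan. Return a JSON object with a 'subtasks' array; each subtask has a 'skill' "
    "(one of: pick_place, peg_insert, gear_mesh, nut_thread) and a 'target' object."
)

def plan_assembly(instruction: str) -> list:
    """Ask the MLLM to decompose an unstructured instruction into executable subtasks."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model id
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["subtasks"]

if __name__ == "__main__":
    for step in plan_assembly("Mount the gear on the shaft, then fasten it with the nut."):
        print(step["skill"], "->", step["target"])

Each returned subtask can then be dispatched to the lower levels (L2 for planar pick-and-place, L3 for contact-rich insertion), which is what makes the zero-shot decomposition useful beyond pure text planning.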


2. VLA for Visuo-Lingo Motor Policy

Level 2 (L2): "Sensorimotor Cortex"

L2 mirrors the sensorimotor regions that integrate visual and linguistic inputs to guide physical actions. This is embodied by Vision-Language-Action (VLA) models, which excel in SE(2) planar tasks such as pick-and-place.
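The sketch below fixes the input/output contract such a policy exposes: an RGB observation and a language instruction map to a planar SE(2) action. The wrapper class, checkpoint path, and action layout are assumptions for illustration only, not a specific released VLA model.

# Sketch of an L2-style VLA query for SE(2) pick-and-place (interface is illustrative).
from dataclasses import dataclass
import numpy as np

@dataclass
class SE2Action:
    x: float        # planar target position (m)
    y: float
    yaw: float      # rotation about the table normal (rad)
    gripper: float  # 0 = open, 1 = closed

class VLAPolicy:
    """Thin wrapper fixing the I/O contract: (RGB image, instruction) -> SE(2) action."""

    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # hypothetical fine-tuned VLA checkpoint

    def act(self, rgb: np.ndarray, instruction: str) -> SE2Action:
        # A real model would encode the image, tokenize the instruction, and decode
        # a discretized action; a fixed placeholder keeps the sketch runnable.
        return SE2Action(x=0.45, y=-0.10, yaw=np.pi / 2, gripper=1.0)

policy = VLAPolicy("vla_assembly.ckpt")                 # hypothetical checkpoint path
frame = np.zeros((224, 224, 3), dtype=np.uint8)          # stand-in for a camera frame
print(policy.act(frame, "pick up the hammer and place it in the tray"))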


3. Primitives for Contact-Rich Assembly

Level 3 (L3): "Cerebellum"

L3 corresponds to the cerebellum, which refines motor commands for precision and stability. This paradigm leverages action primitives and reinforcement learning to execute complex SE(3) manipulation in contact-rich assembly, handling the nuances of force control and adaptation.
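As an illustration of such a primitive, the sketch below implements a spiral-search insertion whose parameters (angular step, radius growth, push force) are exactly the kind of quantities an RL policy can tune. The compliant-controller interface it calls is hypothetical and stands in for whatever task-space controller the robot exposes.

# Sketch of an L3-style contact-rich primitive: spiral-search insertion with RL-tunable
# parameters. The `controller` interface is a hypothetical compliant task-space controller.
import numpy as np

def spiral_search_insertion(controller, depth_target: float,
                            angle_step: float = 0.1, radius_step: float = 5e-5,
                            push_force: float = 5.0, max_steps: int = 500) -> bool:
    """Spiral over the hole mouth while pressing down until the peg seats."""
    theta, radius = 0.0, 0.0
    for _ in range(max_steps):
        dx, dy = radius * np.cos(theta), radius * np.sin(theta)  # planar search offset
        controller.set_planar_offset(dx, dy)    # hypothetical: shift the target pose
        controller.push_down(push_force)        # hypothetical: compliant push along -z
        if controller.insertion_depth() >= depth_target:
            return True                         # peg dropped in: primitive succeeded
        theta += angle_step
        radius += radius_step                   # slowly widen the spiral
    return False                                # report failure so L1 can re-plan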


Planning Experiments

Assembly tools and textures
Task Level            | Task Type       | Objects and Success Rates
Object-Generalization | Pick-and-Place  | Hammer (85.00%), Screwdriver (65.00%), Gear (95.00%), Nut (56.00%), Bolt (79.00%); Average: 76.00%
Scene-Understanding   | Contextual Pick | Gold (86.00%), Copper (89.00%), Steel (84.04%), Aluminum (83.00%), Corroded Metal (93.00%), Bronze (82.00%), Wooden (85.00%), Plastic (90.00%); Average: 86.50%
Long-Horizon          | Rearrangement   | Hammer, Screw, Gear; Success Rate: 41.00%

Figure: Example rollouts for the Object-Generalization, Scene-Understanding, and Long-Horizon tasks.

Assembly Experiments

Policy Comparison:

  • A2C (Advantage Actor-Critic): Baseline method using standard policy gradients with single-step updates.
  • PPO (Proximal Policy Optimization): Enhanced method whose clipped surrogate objective stabilizes training with mini-batch updates (see the sketch after this list).
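The following minimal sketch shows the clipped surrogate objective that distinguishes PPO from plain policy-gradient updates. It is a NumPy illustration of the standard formulation, not the training code used in this work.

# Minimal sketch of PPO's clipped surrogate objective (NumPy, for illustration only).
import numpy as np

def ppo_clip_loss(ratio: np.ndarray, advantage: np.ndarray, eps: float = 0.2) -> float:
    """ratio = pi_new(a|s) / pi_old(a|s); advantage = estimated A(s, a)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))  # minimized by gradient descent

# Example: a ratio far above 1 + eps no longer increases the objective.
print(ppo_clip_loss(np.array([1.5]), np.array([2.0])))  # clipped: uses 1.2 * 2.0
print(ppo_clip_loss(np.array([1.1]), np.array([2.0])))  # unclipped: uses 1.1 * 2.0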

Control Strategy Comparison:

  • OSC (Operational Space Control): Computes end-effector force/torque directly in task space, making it suitable for dynamic interaction and contact-rich tasks (a simplified task-space control law is sketched after this list).
  • IK (Inverse Kinematics): Maps the pose error to joint positions; simple, but less robust under contact.
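The simplified sketch below conveys the contrast: a Jacobian-transpose task-space PD law in the spirit of OSC (full OSC additionally weights by the task-space inertia matrix and handles gravity and null-space terms), versus IK's pose-to-joint-position mapping noted in the closing comment. Gains and shapes are illustrative assumptions.

# Simplified task-space PD law behind OSC-style control (illustrative only).
import numpy as np

def task_space_torque(J: np.ndarray, pose_err: np.ndarray, vel: np.ndarray,
                      kp: float = 150.0, kd: float = 25.0) -> np.ndarray:
    """Map a 6D end-effector pose error to joint torques via the Jacobian transpose.

    J:        6 x n geometric Jacobian
    pose_err: 6D task-space error (position + orientation)
    vel:      6D end-effector velocity
    """
    wrench = kp * pose_err - kd * vel  # desired task-space force/torque
    return J.T @ wrench                # joint torques commanded to the arm

# IK, by contrast, would solve dq = pinv(J) @ pose_err and command joint positions,
# which is why it is less compliant under unexpected contact.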

All experiments are conducted in the Isaac Lab environment. The comparative study evaluates both policy learning algorithms and control strategies across different assembly tasks.
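The metrics reported in the table below (average reward, average steps, success rate) can be gathered with a standard evaluation loop such as the sketch below; it uses a gymnasium-style API with a placeholder environment id and success flag, and is not tied to the Isaac Lab task definitions.

# Sketch of collecting avg-reward / avg-step / success metrics over evaluation rollouts
# (gymnasium-style API; env id and the "success" info key are assumptions).
import gymnasium as gym
import numpy as np

def evaluate(env_id: str, policy, episodes: int = 128):
    env = gym.make(env_id)
    rewards, steps, successes = [], [], []
    for _ in range(episodes):
        obs, _ = env.reset()
        total, t, done = 0.0, 0, False
        while not done:
            obs, r, terminated, truncated, info = env.step(policy(obs))
            total, t, done = total + r, t + 1, terminated or truncated
        rewards.append(total)
        steps.append(t)
        successes.append(float(info.get("success", False)))
    return np.mean(rewards), np.mean(steps), np.mean(successes)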


Method   | PegInsert (avg reward / avg steps / success) | GearMesh (avg reward / avg steps / success) | NutThread (avg reward / avg steps / success)
PPO+IK   | 12.50 / 449.0 / 0%                           | 12.20 / 449.0 / 0%                          | 12.90 / 449.0 / 0%
A2C+OSC  | 65.24 / 149.0 / 15.15%                       | 66.25 / 299.0 / 12.50%                      | 53.29 / 449.0 / 50.00%
PPO+OSC  | 362.19 / 149.0 / 97.66%                      | 715.28 / 299.0 / 98.44%                     | 888.85 / 449.0 / 99.22%

Figure: Peg Insertion, Gear Mesh, and Nut Thread executions before and after training.


Discussions

Q1: Which paradigm to choose?
  • L1 (LLM for Unstructured Task Planning): Suitable for planning in complex, long-horizon tasks, such as unstructured production.
  • L2 (VLA for Visuo-Lingo Motor Policy): Ideal for visual-language-guided operations, such as multimodal HRC with AR.
  • L3 (Primitives for Contact-Rich Assembly): Tailored for spatial tasks with dense physical contact, such as nut and gear assembly.

Q2: What are the strengths?
  • L1: Enables zero-shot task interpretation, code generation, and task decomposition, reducing ambiguity.
  • L2: Enables direct action generation from vision-language inputs, excelling in SE(2) planar tasks.
  • L3: Enables action primitives as a foundation for fine-grained SE(3) manipulation via reinforcement learning.

Q3: Where are the challenges?
  • L1: Reliable code generation remains challenging for dynamic interactions.
  • L2: Lacks proprioceptive sensing, limiting performance in contact-sensitive tasks.
  • L3: Reward design remains a bottleneck, requiring domain expertise and tuning.

BibTeX

If you find it helpful, please consider citing our work:

@article{wu2024assembrain,
  title={AssemBrain: Towards Embodied Intelligence Paradigms for Robot Collaboration in Spatial Assembly},
  author={Wu, Duidi and Zhao, Qianyou and Zhang, Shuo and Jin, Qi and Ma, Jin and Zhu, Guo-Niu and Hu, Jie},
  journal={preprint},
  year={2024}
}