RoboBERT: An End-to-end Multimodal Robotic Manipulation Model

Abstract

Embodied intelligence seamlessly integrates vision, language, and action. However, most multimodal robotic models rely on massive pre-training or extra datasets, incurring high time and hardware costs. To address this, we introduce RoboBERT, an end-to-end multimodal manipulation model built around a novel two-stage training paradigm.

In the first stage, we freeze most of the vision encoder and train with a single “standard” instruction phrasing, allowing the model to focus on stable policy learning via a CNN-based diffusion policy. In the second stage, we unfreeze all modules and inject diverse natural language variants, rapidly aligning varied instructions to the already-learned policy without destabilizing performance. We further employ systematic data augmentations—salt-and-pepper noise, affine transformations, color jitter, and robotic mixup—to enhance robustness against visual perturbations.

Without relying on auxiliary datasets, RoboBERT achieves new state-of-the-art (SOTA) mean episode lengths of 4.52 on the CALVIN ABCD→D benchmark and 3.79 on the ABC→D benchmark using only language-labeled expert demonstrations and a comparatively lightweight architecture. Real-robot trials on a 6-DOF manipulator confirm higher success rates than comparable methods trained on identical data. These results demonstrate that our data-augmentation-enhanced two-stage training paradigm delivers efficient, scalable, and broadly applicable performance for lightweight multimodal robotic systems.

RoboBERT Model

The RoboBERT architecture consists of language connectors, a modality fusion layer, and a diffusion head, responsible for sentence understanding, modality integration, and action generation, respectively. The last layer of the ViT is unfrozen during training so the vision encoder can adapt to the task. The policy operates in a receding-horizon loop: it takes the last one to two observation frames, predicts an action chunk spanning several future steps, executes the near-future portion of that chunk, then takes new observations and repeats the cycle. A minimal sketch of this structure is given after this paragraph.
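The sketch below illustrates the described components and the receding-horizon rollout; class, attribute, and method names (e.g. `diffusion_head.sample`, `env.step`, `tokenizer`) are illustrative assumptions rather than the actual RoboBERT API.

```python
import torch
import torch.nn as nn

class RoboBERTPolicy(nn.Module):
    """Hypothetical skeleton of the described architecture (names are illustrative)."""
    def __init__(self, vision_encoder, language_connector, fusion_layer, diffusion_head):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a CLIP ViT, mostly frozen in stage 1
        self.language_connector = language_connector  # maps instruction tokens into the policy space
        self.fusion_layer = fusion_layer            # merges vision and language features
        self.diffusion_head = diffusion_head        # CNN-based diffusion policy over action chunks

    def forward(self, images, instruction_tokens):
        # images: (B, T_obs, C, H, W) -- the last one to two observation frames
        B, T_obs = images.shape[:2]
        vis = self.vision_encoder(images.flatten(0, 1)).reshape(B, T_obs, -1)
        lang = self.language_connector(instruction_tokens)
        cond = self.fusion_layer(vis, lang)
        # Denoise a chunk of future actions conditioned on the fused features
        return self.diffusion_head.sample(cond)     # (B, T_act, action_dim)

def control_loop(policy, env, tokenizer, instruction, steps=360, exec_horizon=4):
    """Receding-horizon rollout: predict a chunk, execute the near-future part, re-observe."""
    obs = env.reset()
    tokens = tokenizer(instruction)
    for _ in range(steps // exec_horizon):
        actions = policy(obs["rgb_frames"], tokens)   # predict a multi-frame action chunk
        for a in actions[0, :exec_horizon]:           # execute only the near-future actions
            obs, *_ = env.step(a)
```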

Two-stage Training Method

To address the challenge of training with varied natural language inputs, we divide training into two stages. In the first stage, we use a single, consistent "standard language" instruction per task to stabilize the label values, allowing the model to focus on learning the manipulation policy.

To accelerate training and preserve the pretrained weights of the CLIP encoder during the first stage, the vision encoder—typically resource-intensive—is frozen, with the exception of the final layer.
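This freezing scheme takes only a few lines of PyTorch. The sketch below assumes a ViT whose transformer blocks are stored in a `blocks` list (as in timm/CLIP-style implementations); the attribute names may differ from the actual codebase.

```python
def freeze_vision_encoder_except_last(vit):
    # Freeze every parameter of the pretrained (CLIP) vision encoder...
    for p in vit.parameters():
        p.requires_grad = False
    # ...then re-enable only the final transformer block so it can adapt to the task.
    for p in vit.blocks[-1].parameters():
        p.requires_grad = True
```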

After completing the first stage, the model has learned to align the fixed instructions with their corresponding actions. In the second stage, we inject natural, diverse language labels to fine-tune the model, quickly and accurately aligning the varied phrasings with the previously learned policy. Because the instruction-to-action mapping is already established, the second-stage loss remains small, reducing the risk of damaging the pretrained model, and all parameters, including the vision encoder, are unfrozen for further fine-tuning.
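A hedged sketch of one second-stage update is shown below; `compute_diffusion_loss`, the `instruction_variants` lookup, and the batch fields are hypothetical names used only for illustration.

```python
import random

def enter_stage_two(model):
    # Unfreeze every module, including the vision encoder, for full fine-tuning.
    for p in model.parameters():
        p.requires_grad = True

def stage_two_step(model, optimizer, batch, instruction_variants):
    # Replace the single "standard" instruction with a randomly sampled natural variant.
    texts = [random.choice(instruction_variants[t]) for t in batch["task_id"]]
    loss = model.compute_diffusion_loss(batch["obs"], texts, batch["actions"])  # hypothetical API
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```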

Data Augmentation

Data augmentation is essential when training end-to-end with limited data. The techniques we employ are as follows: (a) common computer-vision augmentations (salt-and-pepper noise, affine transformations, and color jitter), and (b) a mixup adapted for robotic tasks. It is worth noting that not all augmentations benefit the model; translation, for example, does not. In the latest experiments, mixup is applied only when training on the ABC→D split, because it yields no evident improvement on ABCD→D but improves ABC→D substantially. Since mixup also requires more epochs to reach its best results, we do not use it for ABCD→D.
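The sketch below illustrates these augmentations using torchvision transforms plus a hand-rolled salt-and-pepper noise. The robotic mixup shown blends both observations and action targets with a Beta-sampled coefficient, which is one plausible reading of the technique rather than the exact formulation used in the paper.

```python
import torch
from torchvision import transforms

# Common CV augmentations: small affine perturbations and color jitter.
cv_augment = transforms.Compose([
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
])

def salt_and_pepper(img, prob=0.01):
    """Set a small fraction of pixels to 0 (pepper) or 1 (salt); assumes float images in [0, 1]."""
    noise = torch.rand_like(img[..., :1, :, :])   # one mask shared across channels
    img = img.clone()
    img[noise.expand_as(img) < prob / 2] = 0.0
    img[noise.expand_as(img) > 1 - prob / 2] = 1.0
    return img

def robotic_mixup(obs_a, act_a, obs_b, act_b, alpha=0.2):
    """Blend two (observation, action-chunk) pairs with a Beta-sampled coefficient (hedged sketch)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * obs_a + (1 - lam) * obs_b, lam * act_a + (1 - lam) * act_b
```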