This is the second video in the Trelis series on robotics!
I go through the main approaches to AI models for robots:
The Action Chunking Transformer (ACT)
GR00T N1 from @nvidia
pi0
I also talk about some of the key challenges with robots versus LLMs.
Cheers, Ronan
Timestamps:
0:00 Introduction to Robotics AI Models
0:30 Video Overview: Modelling Challenges, ACT, GR00T-N1, pi0
0:59 Robotics AI Models - quick review
4:23 Challenges in modelling robots: Realtime, Delays, Styles
7:53 The style / multi-modality problem in robotics
10:42 Action Chunking Transformer (ACT Model)
13:03 Variations of ACT, Bi-ACT (uses torques), Multi-ACT (multi-task)
14:50 Advanced notes on ACT
17:11 GR00T N1
19:54 How GR00T trains on unlabelled data
22:30 pi0
24:33 GR00T paper
26:04 pi0 paper
27:41 What robotics model to use?
28:10 Beyond ACT and GR00T
29:25 Scripts and future videos: Trelis.com/ADVANCED-robotics
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Understanding Modern Robot Control Models: ACT, GR00T, and pi0
Neural networks for robot control take several key inputs: text commands, wrist camera images, overhead/front camera views, and current motor positions. From these, the model predicts future motor positions, typically as a chunk of 10 timesteps spaced 20ms apart.
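To make that input/output contract concrete, here's a minimal Python sketch. The field names, image sizes, and joint count are illustrative assumptions, not taken from any particular model or library.

```python
import numpy as np

CHUNK_LEN = 10   # timesteps predicted per chunk
DT_MS = 20       # spacing between timesteps, in milliseconds
N_JOINTS = 6     # e.g. a 6-DoF arm (an assumption for illustration)

# One observation, matching the inputs listed above.
observation = {
    "text": "pick up the red block",                  # task command
    "wrist_cam": np.zeros((480, 640, 3), np.uint8),   # wrist camera image
    "front_cam": np.zeros((480, 640, 3), np.uint8),   # overhead/front view
    "joint_pos": np.zeros(N_JOINTS, np.float32),      # current motor positions
}

def predict_chunk(obs):
    """Stand-in for a trained policy: returns future motor positions as a
    (CHUNK_LEN, N_JOINTS) array, one row per DT_MS step."""
    return np.zeros((CHUNK_LEN, N_JOINTS), np.float32)

actions = predict_chunk(observation)
assert actions.shape == (CHUNK_LEN, N_JOINTS)
```

With that shared input/output contract in mind, here's how the main approaches differ: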
The ACT Model Architecture
ACT (Action Chunking Transformer) uses a conditional variational autoencoder with:
Small size: ~200MB, a few hundred million parameters
Direct motor position prediction
Cross-attention between sinusoidal embeddings and encoded inputs
No self-attention between predicted timesteps
Single task focus due to mode collapse issues
Key limitation: When trained on demonstrations that mix movement styles (e.g. avoiding an obstacle on the left vs. the right), ACT averages the predictions, and the averaged trajectory can drive straight into the obstacle.
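Here's a minimal PyTorch sketch of the decoder pattern described above (dimensions and joint count are illustrative assumptions): fixed sinusoidal queries, one per future timestep, cross-attend to the encoded inputs, and there is deliberately no self-attention between the timestep queries.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embeddings(length, dim):
    """Fixed sinusoidal position embeddings of shape (length, dim)."""
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    emb = torch.zeros(length, dim)
    emb[:, 0::2] = torch.sin(pos * div)
    emb[:, 1::2] = torch.cos(pos * div)
    return emb

class ChunkDecoder(nn.Module):
    """Simplified ACT-style decoder: each timestep query cross-attends to the
    encoded observation; there is no self-attention between predicted
    timesteps, so the whole chunk is decoded in one parallel pass."""
    def __init__(self, dim=256, n_joints=6, chunk_len=10):
        super().__init__()
        self.register_buffer("queries", sinusoidal_embeddings(chunk_len, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_actions = nn.Linear(dim, n_joints)

    def forward(self, encoded_inputs):  # (batch, n_tokens, dim)
        q = self.queries.unsqueeze(0).expand(encoded_inputs.shape[0], -1, -1)
        out, _ = self.cross_attn(q, encoded_inputs, encoded_inputs)
        return self.to_actions(out)     # (batch, chunk_len, n_joints)

decoder = ChunkDecoder()
actions = decoder(torch.randn(1, 50, 256))  # 50 encoded observation tokens
print(actions.shape)                        # torch.Size([1, 10, 6])
```

Decoding the chunk in a single parallel pass, rather than autoregressively, is part of what keeps ACT inference fast.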
The GR00T Approach
GR00T introduces several advances:
Diffusion-based prediction to handle multiple movement styles
Training on unlabelled video data using learned action latents
Integration with Eagle-2 vision-language model
Cross-attention at multiple layers
~3B parameters total
The diffusion process starts with noise and iteratively refines it into motor predictions, avoiding the averaging problem of direct prediction.
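As a rough illustration of that loop, here's a generic denoising sketch; `model(noisy, t, obs)` is a hypothetical denoiser that predicts clean actions, and the update rule is a simplified stand-in, not GR00T's actual sampler or noise schedule.

```python
import torch

def denoise_actions(model, obs_tokens, chunk_len=10, n_joints=6, n_steps=10):
    """Start from Gaussian noise and iteratively refine toward a clean
    action chunk of shape (1, chunk_len, n_joints)."""
    actions = torch.randn(1, chunk_len, n_joints)     # pure noise
    for step in range(n_steps):
        t = 1.0 - step / n_steps                      # noise level: 1 -> 0
        pred_clean = model(actions, t, obs_tokens)    # denoiser's guess
        # Move part of the way toward the prediction each step; the final
        # step lands exactly on the model's last prediction.
        actions = actions + (pred_clean - actions) / (n_steps - step)
    return actions

# Dummy denoiser just to show the loop runs; a real model conditions on obs.
dummy = lambda noisy, t, obs: torch.zeros_like(noisy)
print(denoise_actions(dummy, obs_tokens=None).shape)  # torch.Size([1, 10, 6])
```

Because each sample can commit to a single mode, different noise seeds can yield different movement styles instead of their average.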
pi0's Innovations
pi0 builds on similar foundations with:
Layer-wise attention instead of repeated cross-attention
Bidirectional self-attention between predicted timesteps
Separate weights for action prediction vs vision-language processing
Training on labelled data
~2.6B parameters from a Gemma base model, plus ~300M for the motor position decoder
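The attention layout this describes can be sketched as a block mask; the token ordering here (vision-language "prefix" first, action tokens after) is an illustrative assumption, not pi0's exact implementation.

```python
import torch

def block_attention_mask(n_prefix, n_action):
    """True = attention allowed. Prefix tokens attend only to each other;
    action tokens attend to the full prefix and bidirectionally to every
    other action token."""
    n = n_prefix + n_action
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_prefix, :n_prefix] = True   # prefix <-> prefix
    mask[n_prefix:, :n_prefix] = True   # actions -> prefix
    mask[n_prefix:, n_prefix:] = True   # actions <-> actions (bidirectional)
    return mask

print(block_attention_mask(n_prefix=4, n_action=3).int())
```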
Practical Considerations
When choosing between models:
ACT advantages:
Fast inference
Small footprint
Edge device compatible
ACT limitations:
Single task only
Mode collapse issues
No style handling
GR00T/pi0 advantages:
Multi-task capable
Better style handling
More general purpose
GR00T/pi0 limitations:
Slower inference
Larger compute requirements
More complex training
Future improvements could include:
Adding torque predictions for better dexterity
Including short motor position history
Optimizing overlap between successive predicted chunks (see the ensembling sketch below)
Balancing speed vs capability trade-offs
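On the chunk-overlap point in the list above, here's a sketch of temporal ensembling in the spirit of the ACT paper: successive inference calls produce overlapping chunks, and every chunk that covers the current timestep votes on the action. The exponential weighting below favours more recent predictions; the exact scheme (including its direction) is a tunable design choice.

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.1):
    """`chunks` is a list of (start_time, actions) pairs from past inference
    calls, each `actions` of shape (chunk_len, n_joints). Assumes at least
    one chunk covers timestep `t`."""
    preds, weights = [], []
    for start, actions in chunks:
        offset = t - start
        if 0 <= offset < len(actions):            # chunk covers timestep t
            preds.append(actions[offset])
            weights.append(np.exp(-m * offset))   # older chunks weigh less
    w = np.array(weights) / np.sum(weights)
    return np.sum(np.stack(preds) * w[:, None], axis=0)

# Two overlapping chunks (started at t=0 and t=5) vote on the action at t=7.
chunks = [(0, np.zeros((10, 6))), (5, np.ones((10, 6)))]
print(temporal_ensemble(chunks, t=7))
```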