This is the second video in the Trelis series on robotics!
I go through the main approaches to AI models for robots:
The Action Chunking Transformer (ACT)
GR00T N1 from @nvidia
pi0
I also talk about some of the key challenges with robots versus LLMs.
Cheers, Ronan
Timestamps:
0:00 Introduction to Robotics AI Models
0:30 Video Overview: Modelling Challenges, ACT, GR00T-N1, pi0
0:59 Robotics AI Models - quick review
4:23 Challenges in modelling robots: Realtime, Delays, Styles
7:53 The style / multi-modality problem in robotics
10:42 Action Chunking Transformer (ACT Model)
13:03 Variations of ACT, Bi-ACT (uses torques), Multi-ACT (multi-task)
14:50 Advanced notes on ACT
17:11 GR00T N1
19:54 How GR00T trains on unlabelled data
22:30 pi0
24:33 GR00T paper
26:04 pi0 paper
27:41 What robotics model to use?
28:10 Beyond ACT and GR00T
29:25 Scripts and future videos: Trelis.com/ADVANCED-robotics
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Understanding Modern Robot Control Models: ACT, GR00T, and pi0
Neural networks for robot control take several key inputs: text commands, wrist camera images, overhead/front camera views, and current motor positions. From these, the model predicts future motor positions, typically as a chunk of 10 timesteps spaced 20ms apart.
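To make that input/output contract concrete, here's a minimal Python sketch. The field names, image sizes, and joint count are illustrative assumptions, not taken from any particular model or library.

```python
import numpy as np

CHUNK_LEN = 10   # timesteps predicted per chunk
DT_MS = 20       # spacing between timesteps, in milliseconds
N_JOINTS = 6     # e.g. a 6-DoF arm (an assumption for illustration)

# One observation, matching the inputs listed above.
observation = {
    "text": "pick up the red block",                  # task command
    "wrist_cam": np.zeros((480, 640, 3), np.uint8),   # wrist camera image
    "front_cam": np.zeros((480, 640, 3), np.uint8),   # overhead/front view
    "joint_pos": np.zeros(N_JOINTS, np.float32),      # current motor positions
}

def predict_chunk(obs):
    """Stand-in for a trained policy: returns future motor positions as a
    (CHUNK_LEN, N_JOINTS) array, one row per DT_MS step."""
    return np.zeros((CHUNK_LEN, N_JOINTS), np.float32)

actions = predict_chunk(observation)
assert actions.shape == (CHUNK_LEN, N_JOINTS)
```

With that shared input/output contract in mind, here's how the main approaches differ: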
The ACT Model Architecture
ACT (Action Chunking Transformer) uses a conditional variational autoencoder with:
Small size: ~200MB, a few hundred million parameters
Direct motor position prediction
Cross-attention between sinusoidal embeddings and encoded inputs
No self-attention between predicted timesteps
Single task focus due to mode collapse issues
Key limitation: When trained on demonstrations that mix movement styles (e.g. avoiding an obstacle on the left vs. the right), ACT averages the predictions, and the averaged trajectory can drive straight into the obstacle.
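Here's a minimal PyTorch sketch of the decoder pattern described above (dimensions and joint count are illustrative assumptions): fixed sinusoidal queries, one per future timestep, cross-attend to the encoded inputs, and there is deliberately no self-attention between the timestep queries.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embeddings(length, dim):
    """Fixed sinusoidal position embeddings of shape (length, dim)."""
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    emb = torch.zeros(length, dim)
    emb[:, 0::2] = torch.sin(pos * div)
    emb[:, 1::2] = torch.cos(pos * div)
    return emb

class ChunkDecoder(nn.Module):
    """Simplified ACT-style decoder: each timestep query cross-attends to the
    encoded observation; there is no self-attention between predicted
    timesteps, so the whole chunk is decoded in one parallel pass."""
    def __init__(self, dim=256, n_joints=6, chunk_len=10):
        super().__init__()
        self.register_buffer("queries", sinusoidal_embeddings(chunk_len, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_actions = nn.Linear(dim, n_joints)

    def forward(self, encoded_inputs):  # (batch, n_tokens, dim)
        q = self.queries.unsqueeze(0).expand(encoded_inputs.shape[0], -1, -1)
        out, _ = self.cross_attn(q, encoded_inputs, encoded_inputs)
        return self.to_actions(out)     # (batch, chunk_len, n_joints)

decoder = ChunkDecoder()
actions = decoder(torch.randn(1, 50, 256))  # 50 encoded observation tokens
print(actions.shape)                        # torch.Size([1, 10, 6])
```

Decoding the chunk in a single parallel pass, rather than autoregressively, is part of what keeps ACT inference fast.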
The GR00T Approach
GR00T introduces several advances:
Diffusion-based prediction to handle multiple movement styles
Training on unlabelled video data using learned action latents
Integration with Eagle-2 vision-language model
Cross-attention at multiple layers
~3B parameters total
The diffusion process starts with noise and iteratively refines it into motor predictions, avoiding the averaging problem of direct prediction.
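As a rough illustration of that loop, here's a generic denoising sketch; `model(noisy, t, obs)` is a hypothetical denoiser that predicts clean actions, and the update rule is a simplified stand-in, not GR00T's actual sampler or noise schedule.

```python
import torch

def denoise_actions(model, obs_tokens, chunk_len=10, n_joints=6, n_steps=10):
    """Start from Gaussian noise and iteratively refine toward a clean
    action chunk of shape (1, chunk_len, n_joints)."""
    actions = torch.randn(1, chunk_len, n_joints)     # pure noise
    for step in range(n_steps):
        t = 1.0 - step / n_steps                      # noise level: 1 -> 0
        pred_clean = model(actions, t, obs_tokens)    # denoiser's guess
        # Move part of the way toward the prediction each step; the final
        # step lands exactly on the model's last prediction.
        actions = actions + (pred_clean - actions) / (n_steps - step)
    return actions

# Dummy denoiser just to show the loop runs; a real model conditions on obs.
dummy = lambda noisy, t, obs: torch.zeros_like(noisy)
print(denoise_actions(dummy, obs_tokens=None).shape)  # torch.Size([1, 10, 6])
```

Because each sample can commit to a single mode, different noise seeds can yield different movement styles instead of their average.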
pi0's Innovations
pi0 builds on similar foundations with:
Layer-wise attention instead of repeated cross-attention
Bidirectional self-attention between predicted timesteps
Separate weights for action prediction vs vision-language processing
Training on labelled data
~2.6B parameters from a Gemma base model, plus ~300M for the motor position decoder
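The attention layout this describes can be sketched as a block mask; the token ordering here (vision-language "prefix" first, action tokens after) is an illustrative assumption, not pi0's exact implementation.

```python
import torch

def block_attention_mask(n_prefix, n_action):
    """True = attention allowed. Prefix tokens attend only to each other;
    action tokens attend to the full prefix and bidirectionally to every
    other action token."""
    n = n_prefix + n_action
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_prefix, :n_prefix] = True   # prefix <-> prefix
    mask[n_prefix:, :n_prefix] = True   # actions -> prefix
    mask[n_prefix:, n_prefix:] = True   # actions <-> actions (bidirectional)
    return mask

print(block_attention_mask(n_prefix=4, n_action=3).int())
```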
Practical Considerations
When choosing between models:
ACT advantages:
Fast inference
Small footprint
Edge device compatible
ACT limitations:
Single task only
Mode collapse issues
No style handling
GR00T/pi0 advantages:
Multi-task capable
Better style handling
More general purpose
GR00T/pi0 limitations:
Slower inference
Larger compute requirements
More complex training
Future improvements could include:
Adding torque predictions for better dexterity
Including short motor position history
Optimizing overlap between successive predicted chunks (see the ensembling sketch below)
Balancing speed vs capability trade-offs
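On the chunk-overlap point in the list above, here's a sketch of temporal ensembling in the spirit of the ACT paper: successive inference calls produce overlapping chunks, and every chunk that covers the current timestep votes on the action. The exponential weighting below favours more recent predictions; the exact scheme (including its direction) is a tunable design choice.

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.1):
    """`chunks` is a list of (start_time, actions) pairs from past inference
    calls, each `actions` of shape (chunk_len, n_joints). Assumes at least
    one chunk covers timestep `t`."""
    preds, weights = [], []
    for start, actions in chunks:
        offset = t - start
        if 0 <= offset < len(actions):            # chunk covers timestep t
            preds.append(actions[offset])
            weights.append(np.exp(-m * offset))   # older chunks weigh less
    w = np.array(weights) / np.sum(weights)
    return np.sum(np.stack(preds) * w[:, None], axis=0)

# Two overlapping chunks (started at t=0 and t=5) vote on the action at t=7.
chunks = [(0, np.zeros((10, 6))), (5, np.ones((10, 6)))]
print(temporal_ensemble(chunks, t=7))
```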