Lecture 2: Vision-Language-Action Models — Modern Robot Learning V2

📓 Practical Hands-on Notebooks

Build every component of a Vision-Language-Action model from scratch — from language encoding to a complete MiniVLA.

📗

Language Encoding Levels

From bag-of-words to Word2Vec to GRU — understand how text representations evolve for robot instruction following.

Open in Colab
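As a minimal sketch of the first encoding level (not the notebook's own code), bag-of-words can be written in a few lines of plain Python. The helper names `build_vocab` and `bag_of_words` and the two example instructions are hypothetical, chosen only for illustration.

```python
from collections import Counter

def build_vocab(sentences):
    """Map each distinct lowercase word to a column index, in sorted order."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(sentence, vocab):
    """Return a count vector over the vocabulary; word order is discarded."""
    counts = Counter(sentence.lower().split())
    return [counts.get(w, 0) for w in vocab]

# Hypothetical robot instructions used only for this example.
instructions = ["pick up the red block", "put the block in the bowl"]
vocab = build_vocab(instructions)
vectors = [bag_of_words(s, vocab) for s in instructions]

# Swapping "block" and "bowl" changes the meaning but not the vector.
swapped = bag_of_words("put the bowl in the block", vocab)
print(swapped == vectors[1])
```

The final comparison shows why order-aware encoders such as a GRU are the next step: two instructions with opposite meanings can share one bag-of-words vector.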

📘

Transformer From Scratch

Build a full transformer block — multi-head attention, QKV projections, feed-forward networks, and layer norm.

Open in Colab
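The attention part of such a block can be sketched in NumPy as follows. This is an illustrative version under assumed shapes (5 tokens, model width 16, 4 heads) with random weights; it omits the feed-forward network, residual connections, and layer norm, and is not taken from the notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # QKV projections, then one attention computation per head.
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

# Assumed toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1, so every token's output is a weighted average of the value vectors within its head.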

📙

Vision Transformer (ViT)

Patch embedding, positional encoding, and self-attention on images — build SigLIP's visual encoder from first principles.

Open in Colab
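Patch embedding with a positional term can be sketched as below. The image size, patch size, and embedding width are assumptions picked to keep the example small; they are not SigLIP's actual configuration, and the random matrices stand in for learned parameters.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

# Assumed toy sizes, for illustration only.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patch, d_model = 8, 64

patches = patchify(image, patch)               # (16, 192)
w_embed = rng.normal(size=(patches.shape[1], d_model)) * 0.02    # stand-in for a learned projection
pos_embed = rng.normal(size=(patches.shape[0], d_model)) * 0.02  # stand-in for learned positions
tokens = patches @ w_embed + pos_embed         # (16, 64) visual tokens
print(patches.shape, tokens.shape)
```

The resulting `tokens` array is what a stack of self-attention layers would then process, one row per image patch.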

📕

Vision-Language Model

Fuse visual tokens with text tokens into a unified VLM — the PaliGemma backbone that powers pi0.

Open in Colab

🤖

Build MiniVLA

Assemble the complete VLA — visual encoder, language model, and action expert with flow matching into one policy.

Open in Colab

🎬 Component Animations

Manim-animated deep-dives into key VLA components — visual explanations embedded directly in the lecture slides.

📝

MiniVLA Pipeline

🎲 Interactive Visualizers

Explore the pi0 VLA architecture through animated 3D walkthroughs and interactive 2D visualizers.

Hub · All Steps

🏠

pi0 Viz Hub

Central hub linking all four pi0 architecture steps — visual encoding, VLM backbone, action expert, and input assembly.

Open Hub Three.js · 3D

👁️

Step 1: SigLIP Visual Encoding

Image patches flow through the ViT encoder to produce visual tokens — the first stage of pi0's perception.

Launch 3D Three.js · 3D

🧠

Step 2: VLM Backbone

Visual tokens merge with text tokens inside PaliGemma — producing fused multimodal embeddings.

Launch 3D Three.js · 3D

🎯

Step 3: Action Expert

Flow matching turns random noise into precise robot joint commands by integrating a learned velocity field over the action tokens.

Launch 3D Three.js · 3D
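The sampling loop behind this step can be sketched with plain Euler integration. Everything here is a toy under stated assumptions: `toy_velocity` is a hypothetical analytic stand-in for the learned network (a real policy predicts the velocity from the observation and never sees the target), the convention that t = 0 is noise and t = 1 is the action is assumed, and the horizon, action dimension, and step count are illustrative.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Stand-in for the learned network: the straight-line velocity that
    carries x to `target` at t = 1."""
    return (target - x) / (1.0 - t)

def sample_actions(noise, target, num_steps=10):
    """Integrate the velocity field from t = 0 (noise) to t = 1 (actions)."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * toy_velocity(x, t, target)  # one Euler step
    return x

# Assumed toy sizes: a chunk of 4 actions with 7 joint values each.
rng = np.random.default_rng(0)
horizon, action_dim = 4, 7
target = rng.uniform(-1, 1, size=(horizon, action_dim))
noise = rng.normal(size=(horizon, action_dim))
actions = sample_actions(noise, target)
print(np.allclose(actions, target))
```

Because the toy field is exactly the straight-line velocity, Euler integration lands on the target; with a learned field the result is only as good as the network's velocity predictions.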

🧩

Step 4: Input Assembly

Three token blocks are concatenated into one sequence and passed through the shared transformer under a block-wise attention mask.

Launch 3D Interactive · 2D
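One way to build such a mask is sketched below. The masking rule is an assumption made for illustration (tokens attend freely within their own block and to all earlier blocks, but not to later ones), as are the block sizes; the helper name `blockwise_mask` is hypothetical.

```python
import numpy as np

def blockwise_mask(block_sizes):
    """Boolean mask where entry [i, j] is True if query token i may attend
    to key token j: same block or any earlier block (assumed rule)."""
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_ids[:, None] >= block_ids[None, :]

# Assumed toy layout: three blocks of 3, 1, and 2 tokens.
mask = blockwise_mask([3, 1, 2])
print(mask.astype(int))
```

In an attention layer, positions where the mask is False would have their scores set to a large negative value before the softmax, so they receive zero weight.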

🔄

Token Attention Explorer

Interactive token-level attention visualization — see how visual, text, and action tokens attend to each other.

Open LIBERO · Attention

🔬

Pi0.5 Attention Visualized

What does pi0.5 actually see? Attention heatmaps, denoising steps, cross-modal grounding, and the arm paradox on a LIBERO task.

Open

🎬 Component Animations

Bag of Words

Word2Vec Space

GRU Processing

Attention Scores

Multi-Head Attn

Next-Token Pred

VLM Processing

MiniVLA Pipeline
