Lecture 2 · March 28, 2026

Vision-Language-Action Models

From duct tape to transformers — trace how images, language, and actions fuse inside a VLA. Build every component from scratch: ViT, transformer, VLM, and a complete MiniVLA.

SigLIP / ViT · Transformer · VLM · Flow Matching · pi0
Open Lecture Slides

📓 Practical Hands-on Notebooks

Build every component of a Vision-Language-Action model from scratch — from language encoding to a complete MiniVLA.

📗

Language Encoding Levels

From bag-of-words to Word2Vec to GRU — understand how text representations evolve for robot instruction following.

Open in Colab
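
As a rough illustration of the first level, here is a minimal bag-of-words sketch. The vocabulary and the instruction strings are made up for this example and are not taken from the notebook. It shows why later levels are needed: two instructions with the same words in a different order get the same vector.

```python
from collections import Counter

def bag_of_words(instruction, vocab):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(instruction.lower().split())
    return [counts[word] for word in vocab]

# Hypothetical toy vocabulary for robot instructions.
vocab = ["pick", "place", "red", "blue", "block", "the", "up"]

a = bag_of_words("pick up the red block", vocab)
b = bag_of_words("the red block pick up", vocab)  # same words, scrambled order

print(a)       # [1, 0, 1, 0, 1, 1, 1]
print(a == b)  # True: the representation cannot tell the two apart
```

Sequence models such as a GRU process the words in order, so they can separate instructions that a bag of words collapses together.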
📘

Transformer From Scratch

Build a full transformer block — multi-head attention, QKV projections, feed-forward networks, and layer norm.

Open in Colab
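
The core operation inside each attention head is scaled dot-product attention. The sketch below is a single head written with plain Python lists so that every step is visible. It assumes Q, K, and V have already been produced by the QKV projections, and the function names are illustrative rather than the notebook's own.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Q, K, V are lists of row vectors. Returns (outputs, weights), where
    weights[i][j] is how much query i attends to key j.
    """
    d = len(K[0])  # key dimension, used for the 1/sqrt(d) scaling
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = attention(Q, K, V)
```

Each row of `attn` sums to 1, and each query puts the most weight on the key it matches. Multi-head attention runs several such heads in parallel on different projections and concatenates the results.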
📙

Vision Transformer (ViT)

Patch embedding, positional encoding, and self-attention on images — build SigLIP's visual encoder from first principles.

Open in Colab
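
The first step of a ViT, splitting an image into patches, can be sketched as follows. This is a simplified single-channel version on a tiny 4 x 4 grid; the `patchify` name and the sizes are assumptions made for the example. A real encoder would follow this with a learned linear projection and added positional embeddings, both omitted here.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tiles.

    Tiles are returned in row-major order: left to right, then top to bottom.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches

# A 4 x 4 "image" whose pixel values are just their row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)

print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]
```

Each flattened tile becomes one visual token, so a larger image or a smaller patch size yields a longer token sequence.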
📕

Vision-Language Model

Fuse visual tokens with text tokens into a unified VLM — the PaliGemma backbone that powers pi0.

Open in Colab
🤖

Build MiniVLA

Assemble the complete VLA: combine a visual encoder, a language model, and a flow-matching action expert into one policy.

Open in Colab

🎬 Component Animations

Manim-animated deep-dives into key VLA components — visual explanations embedded directly in the lecture slides.

📝

Bag of Words

🗺️

Word2Vec Space

🔁

GRU Processing

🎯

Attention Scores

👁️

Multi-Head Attn

🔮

Next-Token Pred

🧠

VLM Processing

🤖

MiniVLA Pipeline


🎲 Interactive Visualizers

Explore the pi0 VLA architecture through animated 3D walkthroughs and interactive 2D visualizers.

Hub · All Steps
🏠

pi0 Viz Hub

Central hub linking all four pi0 architecture steps — visual encoding, VLM backbone, action expert, and input assembly.

Open Hub
Three.js · 3D
👁️

Step 1: SigLIP Visual Encoding

Image patches flow through the ViT encoder to produce visual tokens — the first stage of pi0's perception.

Launch 3D
Three.js · 3D
🧠

Step 2: VLM Backbone

Visual tokens merge with text tokens inside PaliGemma — producing fused multimodal embeddings.

Launch 3D
Three.js · 3D
🎯

Step 3: Action Expert

Flow matching denoises action tokens — from random noise to precise robot joint commands via velocity fields.

Launch 3D
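
The idea of following a velocity field from noise to an action can be sketched with a toy example. Everything here is an assumption made for illustration: the hand-written `toy_velocity` stands in for a learned network, the target is a made-up joint command, and the number of Euler steps is arbitrary. The sketch only shows the sampling loop, not training.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for a learned velocity field (hypothetical).

    For a straight-line path from noise (t=0) to the target (t=1), the
    velocity that reaches the target on time is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_action(target, steps=10, seed=0):
    """Integrate the velocity field with Euler steps, starting from noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler update
    return x

target = [0.5, -0.2, 1.0]  # made-up joint command
action = sample_action(target)
```

With this idealized field the sample lands on the target up to floating-point error. A trained model only approximates such a field from data, conditioned on the visual and language tokens.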
Three.js · 3D
🧩

Step 4: Input Assembly

Three token blocks assemble into one sequence with block-wise attention masking through the shared transformer.

Launch 3D
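
One way to picture block-wise masking is the sketch below. It assumes a pattern in which each token may attend to every token in its own block and in earlier blocks, but not in later ones. That pattern, the `blockwise_mask` name, and the block sizes are illustrative assumptions; the visualizer is the reference for the actual layout.

```python
def blockwise_mask(block_sizes):
    """Build a boolean attention mask from a list of block sizes.

    mask[q][k] is True when query token q may attend to key token k.
    Assumed rule: attention is full within a block and allowed toward
    earlier blocks, but blocked toward later blocks.
    """
    block_id = [b for b, n in enumerate(block_sizes) for _ in range(n)]
    return [[block_id[k] <= block_id[q] for k in range(len(block_id))]
            for q in range(len(block_id))]

# Hypothetical sizes: 2 tokens in block 0, 1 in block 1, 2 in block 2.
mask = blockwise_mask([2, 1, 2])

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# xxx..
# xxxxx
# xxxxx
```

Unlike a token-level causal mask, tokens inside the same block see each other in both directions, while earlier blocks never see later ones.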
Interactive · 2D
🔄

Token Attention Explorer

Interactive token-level attention visualization — see how visual, text, and action tokens attend to each other.

Open
LIBERO · Attention
🔬

Pi0.5 Attention Visualized

What does pi0.5 actually see? Attention heatmaps, denoising steps, cross-modal grounding, and the arm paradox on a LIBERO task.

Open