From duct tape to transformers — trace how images, language, and actions fuse inside a VLA. Build every component from scratch: ViT, transformer, VLM, and a complete MiniVLA.
Build every component of a Vision-Language-Action model from scratch, from language encoding to a complete MiniVLA.
From bag-of-words to Word2Vec to GRU — understand how text representations evolve for robot instruction following.
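As a rough illustration of the starting point of that progression, the sketch below builds bag-of-words vectors for two robot instructions. The example sentences and the helper names `build_vocab` and `bag_of_words` are made up for illustration and are not taken from the course notebook.

```python
from collections import Counter

def build_vocab(instructions):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in instructions for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bag_of_words(text, vocab):
    """Count occurrences of each vocabulary word; word order is discarded."""
    counts = Counter(text.lower().split())
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # out-of-vocabulary words are silently dropped
            vec[vocab[tok]] = n
    return vec

# Hypothetical instructions, chosen only to show the encoding.
instructions = ["pick up the red block", "put the block in the bowl"]
vocab = build_vocab(instructions)
vectors = [bag_of_words(text, vocab) for text in instructions]

# Two instructions with opposite meanings get the same vector.
same = bag_of_words("put the block in the bowl", vocab) == \
       bag_of_words("put the bowl in the block", vocab)
```

Because `same` is true here, the encoding cannot tell which object goes into which. Order-aware models such as a GRU address that limitation.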
Build a full transformer block: multi-head attention, QKV projections, feed-forward networks, and layer norm.
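The core of the attention step can be sketched in plain Python. This is a single-head version that omits the learned QKV projections, the multi-head split, the feed-forward network, and layer norm, so it is a simplification for illustration and not the notebook's implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs: each query matches exactly one key strongly.
Q = [[10.0, 0.0], [0.0, 10.0]]
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out, attn = attention(Q, K, V)
```

Each row of `attn` sums to 1. With these toy inputs, each query puts nearly all of its weight on the matching key.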
Patch embedding, positional encoding, and self-attention on images: build SigLIP's visual encoder from first principles.
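The first step of patch embedding, cutting an image into fixed-size tiles and flattening each one, might look like the sketch below. It assumes a single-channel image stored as a list of rows, and `patchify` is a hypothetical helper name. A real ViT then applies a learned linear projection and adds positional encodings, which are left out here.

```python
def patchify(image, patch):
    """Split an H x W grid into flattened patch x patch tiles, in row-major order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one tile into a single vector (one future visual token).
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

# A 4x4 toy "image" whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)  # four tiles, each of length 4
```

Each entry of `patches` corresponds to one token position in the sequence the encoder attends over.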
Fuse visual tokens with text tokens into a unified VLM: the PaliGemma backbone that powers pi0.
Assemble the complete VLA (visual encoder, language model, and action expert with flow matching) into one policy.
Manim-animated deep-dives into key VLA components: visual explanations embedded directly in the lecture slides.
Explore the pi0 VLA architecture through animated 3D walkthroughs and interactive 2D visualizers.
Central hub linking all four pi0 architecture steps — visual encoding, VLM backbone, action expert, and input assembly.
Three.js · 3D: Image patches flow through the ViT encoder to produce visual tokens, the first stage of pi0's perception.
Three.js · 3D: Visual tokens merge with text tokens inside PaliGemma, producing fused multimodal embeddings.
Three.js · 3D: Flow matching denoises action tokens, moving from random noise to precise robot joint commands via velocity fields.
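The denoising loop can be sketched as Euler integration of a velocity field from t = 0 (noise) to t = 1 (action). In a trained policy the velocity comes from the action expert network. Here it is replaced by a closed-form toy field that points at a single known target, so `toy_velocity`, the target values, and the step count are all assumptions made for illustration.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for the learned network: closed-form field toward one target.

    For a straight-line path from noise to a single target, the velocity at
    point x and time t is (target - x) / (1 - t).
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]

def sample_action(target, steps=10, seed=0):
    """Integrate the velocity field with fixed-step Euler updates."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise at t = 0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # the last step uses t = 1 - dt, so 1 - t never reaches zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 3-joint command used only as a target for the toy field.
target = [0.5, -1.2, 0.3]
action = sample_action(target)
```

With this toy field the final Euler step lands on the target up to floating-point error. A learned field only approximates this behaviour, and its output depends on the observation and instruction.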
Three.js · 3D: Three token blocks assemble into one sequence with block-wise attention masking through the shared transformer.
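One common way to build such a mask is block-causal: a token may attend to every token in its own block and in earlier blocks, but not to later blocks. The sketch below assumes that rule and uses hypothetical block sizes. The exact block contents and masking pattern in the visualizer may differ.

```python
def blockwise_mask(block_sizes):
    """Return mask[i][j] == True when token i is allowed to attend to token j.

    Tokens attend bidirectionally within their own block and to all earlier
    blocks, never to later ones (block-causal masking).
    """
    # Label every token position with the index of the block it belongs to.
    block_id = [b for b, n in enumerate(block_sizes) for _ in range(n)]
    return [[bj <= bi for bj in block_id] for bi in block_id]

# Hypothetical sizes: 3 tokens in the first block, 1 in the second, 2 in the third.
mask = blockwise_mask([3, 1, 2])

# Print the mask as a grid: '#' = may attend, '.' = masked out.
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

The first block never sees later blocks, so its keys and values do not change as later tokens are updated. The last block can read everything before it.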
Interactive · 2D: Interactive token-level attention visualization. See how visual, text, and action tokens attend to each other.
LIBERO · Attention: What does pi0.5 actually see? Attention heatmaps, denoising steps, cross-modal grounding, and the arm paradox on a LIBERO task.