Lecture 3 (April 4, 2026)

From VLAs to Real Robots

SmolVLA, Flow Matching & SO-101 Deployment. Go from architecture to hardware: deploy a vision-language-action (VLA) model on a real robot arm, debug the 10 critical bugs, and run bimanual inference.

Topics: SmolVLA, OpenVLA, Flow Matching, SO-101, Deployment, Bimanual

Practical Hands-on Notebooks

From flow matching math to real-robot deployment — build, debug, and deploy a VLA end-to-end.

📗

Flow Matching From Scratch

Build flow matching from first principles — optimal transport paths, velocity fields, and the connection to diffusion.
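
As a rough illustration of the ideas named above, here is a minimal NumPy sketch of the straight-line (optimal-transport-style) path, its velocity target, and Euler sampling. It is not the notebook's code. The function names are hypothetical, and the convention that t=0 is noise and t=1 is data is an assumption; some implementations use the reverse.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Time derivative of the straight-line path; it is constant in t."""
    return x1 - x0

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy check: if every data sample equals `target`, the exact velocity field is
# (target - x) / (1 - t), so Euler integration should carry noise onto `target`.
target = np.array([0.5, -1.0])
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 2))
samples = euler_sample(lambda x, t: (target - x) / (1.0 - t), noise)
```

In a trained model, `velocity_fn` would be a neural network fitted by minimizing `flow_matching_loss` at randomly drawn times t. The closed-form field here only stands in for it so the sketch can be checked by hand.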

📘

Cross-Attention & VLA Architectures

Understand cross-attention conditioning and compare VLA architecture choices — from self-attention to cross-attention fusion.
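
To make cross-attention conditioning concrete, the following sketch shows single-head scaled dot-product cross-attention in NumPy. Queries come from one token stream (here called `action_tokens`) and keys and values come from another (`vlm_features`). The names, shapes, and random weights are illustrative assumptions, not SmolVLA's actual layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend over a separate context."""
    q = queries @ w_q                      # (n_queries, d_head)
    k = context @ w_k                      # (n_context, d_head)
    v = context @ w_v                      # (n_context, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each query's weights sum to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
action_tokens = rng.standard_normal((3, d_model))   # hypothetical action-side tokens
vlm_features = rng.standard_normal((5, d_model))    # hypothetical vision-language features
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out, weights = cross_attention(action_tokens, vlm_features, w_q, w_k, w_v)
```

Self-attention is the special case where `queries` and `context` are the same array. With cross-attention, the query sequence stays short while still reading from a longer conditioning sequence.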

📙

SmolVLA Efficiency Tricks

PixelShuffle token compression, layer skipping, and async inference — the tricks that make SmolVLA fast enough for real-time.
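
The token-compression idea can be sketched as a space-to-depth rearrangement: each r×r block of neighboring visual tokens is merged into one token with r² times as many channels, so the sequence becomes r² times shorter. This is a minimal NumPy sketch under that assumption. The function name and the row-major grid layout are hypothetical, and the model's actual implementation may order channels differently.

```python
import numpy as np

def pixel_shuffle_tokens(tokens, grid_h, grid_w, r):
    """Merge each r x r block of a (grid_h x grid_w) token grid into one token.

    tokens: array of shape (grid_h * grid_w, channels), row-major over the grid.
    Returns an array of shape (grid_h * grid_w / r**2, channels * r**2).
    """
    n, c = tokens.shape
    assert n == grid_h * grid_w, "token count must match the grid size"
    assert grid_h % r == 0 and grid_w % r == 0, "grid must be divisible by r"
    x = tokens.reshape(grid_h // r, r, grid_w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/r, W/r, r, r, C): group each block together
    return x.reshape((grid_h // r) * (grid_w // r), r * r * c)

# A 4x4 grid of 2-channel tokens becomes a 2x2 grid of 8-channel tokens.
tokens = np.arange(16 * 2).reshape(16, 2)
compressed = pixel_shuffle_tokens(tokens, grid_h=4, grid_w=4, r=2)
```

No information is discarded by the rearrangement itself. The saving comes from the language model processing fewer, wider tokens, typically after a projection back to the model width.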

📕

SmolVLA Deployment Simulation

Simulate the full deployment pipeline — from recording demonstrations to training, evaluating, and running inference on SO-101.

🐛

The 10 Bugs: Interactive Debugging

Walk through the 10 critical bugs we hit deploying SmolVLA on SO-101 — with interactive exercises to find and fix each one.


Lecture Outline

Seven parts covering the full journey from architecture to deployment — jump to any section in the slides.

🏗️ Open-Source VLAs
🌊 Flow Matching
🤖 Meet SO-101
🚀 Deploy a VLA
🐛 10 Critical Bugs
🎬 Live Demo
🦾 Bimanual Inference

Demo Videos

Real SO-101 bimanual demos from our lab — SmolVLA controlling two arms for box-pass tasks.

🦾

Bimanual Demo 1

Two SO-101 arms performing a coordinated box-pass task with SmolVLA inference.

🦾

Bimanual Demo 2

Handover variation with different object placement and camera angles.

📦

Bimanual Demo 3

Extended sequence showing recovery behavior and multi-step manipulation.

🎯

Bimanual Demo 4

Language-conditioned pick-and-place with voice control for object selection.