Feb 16, 2025

Why Robots Need Billions of Videos for World Models


Kevin YENA

In natural language processing, pretraining on the internet has unlocked the era of large language models (LLMs). But robotics doesn’t have the web of motion. You can’t scrape “how to tie shoelaces” or “how to pick a grape without crushing it” from HTML pages. To endow robots with safe, versatile motion intelligence, we need billions of egocentric videos capturing real-world human activity.

The Web Isn’t Enough

- LLMs scale on text scraped from billions of pages.

- Robots can’t learn motion from YouTube alone: the footage is too noisy, too sparse, unlabeled, and usually missing the low-level physics signals (joint angles, forces, speeds).

- Robots don’t just need to know what to do (semantics) but how to do it (control).


World Models as a Bridge

World models (e.g. Dreamer, PlaNet, Gato, Gemini’s VLA branch) compress video, proprioception, and actions into a single predictive model of how the world evolves.

- Pretraining on billions of videos lets these models learn physics priors: gravity, inertia, contact dynamics.

- Without this massive base, robot policies overfit: they can fold towels in the lab but fail in a messy kitchen.
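The predictive loop behind these models can be sketched in a few lines. This is an illustrative toy, not any of the named systems: the "learned" transition and decoder below are stand-in random matrices, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, ACTION, OBS = 8, 2, 16

# Hypothetical "learned" parameters (here: fixed random matrices).
W_dyn = rng.normal(0, 0.1, (LATENT, LATENT + ACTION))  # transition model
W_dec = rng.normal(0, 0.1, (OBS, LATENT))              # observation decoder

def predict_next_latent(z, a):
    """Roll the latent state forward given an action: z_{t+1} = f(z_t, a_t)."""
    return np.tanh(W_dyn @ np.concatenate([z, a]))

def decode(z):
    """Reconstruct an expected observation (e.g. image features) from latent z."""
    return W_dec @ z

# A short rollout entirely in latent space -- the key trick that lets a
# world model evaluate candidate actions without touching the real world.
z = np.zeros(LATENT)
for t in range(5):
    a = rng.normal(size=ACTION)   # candidate action from some policy
    z = predict_next_latent(z, a)
    obs_hat = decode(z)           # predicted future observation
```

The physics priors the bullets mention live inside the transition model: with enough video pretraining, it learns that dropped objects fall and that contacts stop motion, instead of having to rediscover this per task.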


Egocentric Data: Cameras on Humans

Human Wave’s approach: contributors wear lightweight head-mounted cameras and capture daily motions.

- Provides the first-person perspective a robot’s own cameras would see.

- Captures grasping subtleties (grip force, wrist angle).

- Orders of magnitude more scalable than lab demonstrations: a single contributor can generate thousands of labeled clips per day.
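A labeled egocentric clip would bundle footage with per-frame motion metadata. The schema below is purely hypothetical: field names like `wrist_angles_deg` and `grip_force_n` are illustrative, not Human Wave's actual format.

```python
from dataclasses import dataclass

@dataclass
class EgoClip:
    """One labeled egocentric recording (hypothetical schema)."""
    video_path: str               # head-mounted camera footage
    fps: int                      # capture rate
    task_label: str               # e.g. "pour water into cup"
    wrist_angles_deg: list[float] # per-frame wrist orientation, if tracked
    grip_force_n: list[float]     # per-frame grip force estimate (newtons)

clip = EgoClip(
    video_path="clips/0001.mp4",
    fps=30,
    task_label="pick grape without crushing",
    wrist_angles_deg=[12.5, 13.0, 13.4],
    grip_force_n=[0.4, 0.5, 0.5],
)
```

The point of pairing pixels with signals like grip force is exactly the semantics-versus-control gap above: video alone shows what happened, the metadata shows how.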


Numbers That Matter

- State-of-the-art LLMs pretrain on trillions of tokens.

- The equivalent scale in robotics: billions of motion frames (video + action + metadata).

- A single robot may need >10M task-specific samples before safe deployment.
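A back-of-envelope calculation shows how wearable capture reaches this scale. Every input here is an assumption chosen for illustration, not a Human Wave figure.

```python
# Illustrative assumptions, not measured numbers:
contributors = 10_000      # people wearing head-mounted cameras
clips_per_day = 1_000      # "thousands of labeled clips per day"
frames_per_clip = 150      # ~5 seconds at 30 fps
days = 30                  # one month of collection

total_frames = contributors * clips_per_day * frames_per_clip * days
print(f"{total_frames:.2e} frames")  # 4.50e+10 -> tens of billions
```

Even with modest per-clip lengths, a month of distributed capture lands in the tens of billions of frames, which is why crowdsourced wearables outpace lab teleoperation rigs.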


Robotics won’t scale like software unless the data problem is solved. Human Wave’s billions of videos represent the equivalent of the web crawl for movement. It’s the missing dataset layer for robotic world models.


Robots need real-world data, people need jobs.

© 2025 HUMAN WAVE. ALL RIGHTS RESERVED.