
From Sora to Cosmos: The Hidden Infrastructure Behind Physical AI Training Data

The Narrative vs. The Reality

When OpenAI unveiled Sora in February 2024, the demos were mesmerizing. A woman walking down a Tokyo street. A hamster on an airplane. Two golden retrievers podcasting on a mountain. The internet lost its mind over the generative capabilities, the temporal coherence, the physics simulation.

When NVIDIA announced Cosmos in January 2025, the narrative shifted. This wasn’t just about generating pretty videos. This was about physical AI. Robots that understand how to pick up objects. Autonomous vehicles that predict pedestrian behavior. Industrial systems that optimize manufacturing flows. The demos showed AI reasoning about gravity, friction, object permanence, and cause and effect.

The public conversation focused on model architecture. Diffusion transformers. Video tokenizers. Causal reasoning heads. World foundation models.

But behind every announcement, there’s a number that barely gets mentioned. NVIDIA trained Cosmos on 20 million hours of video. OpenAI reportedly used hundreds of millions of video clips for Sora. Google DeepMind’s Genie 2 scraped gameplay from thousands of sources.

The real story isn’t the model. It’s the data pipeline that collected, filtered, processed, and fed that data into the model. And that pipeline has a problem that no transformer architecture can solve.

The Physics of Data Collection

Let’s talk about what it actually takes to build a world model training dataset.

A single hour of 720p video at 30fps is approximately 1GB of storage. That’s after compression. Raw sensor data from autonomous vehicles or robots is 10-100x larger, but most world models today train on internet video because it’s the only source at sufficient scale.

At roughly 1GB per hour, twenty million hours, the size of the Cosmos training set, comes to about 20 petabytes of compressed video. And that is only what survives filtering out corrupted files, static scenes, text overlays, and copyrighted content you can’t use. The actual pipeline probably processed 50-100 petabytes to produce those 20 petabytes of training data.
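The arithmetic behind these figures is easy to check. The sketch below uses the article's own estimates (1GB per hour, 20 million hours) and decimal units; the 20-40% keep rate is not a published number, only what the 50-100 petabyte guess implies.

```python
GB_PER_HOUR = 1.0       # article's estimate for compressed 720p at 30fps
HOURS = 20_000_000      # Cosmos training set size cited in the article
GB_PER_PB = 1_000_000   # decimal units: 1 PB = 10^6 GB

def dataset_petabytes(hours, gb_per_hour=GB_PER_HOUR):
    """Storage needed for the given number of video hours, in petabytes."""
    return hours * gb_per_hour / GB_PER_PB

def ingest_range_pb(kept_pb, keep_low, keep_high):
    """Petabytes to ingest if only a fraction of the input survives filtering."""
    return kept_pb / keep_high, kept_pb / keep_low

kept = dataset_petabytes(HOURS)
# Assumed keep rates of 20-40%, implied by the 50-100 PB estimate.
low, high = ingest_range_pb(kept, keep_low=0.2, keep_high=0.4)
print(f"kept: {kept:.0f} PB, ingested: {low:.0f}-{high:.0f} PB")
```

Changing `GB_PER_HOUR` to reflect 1080p footage or raw sensor data shows how quickly the storage budget moves.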

Where does this data come from? Not from curated academic datasets. Ego4D has roughly 3,670 hours. Kinetics has 650,000 video clips, but they’re short and narrow in scope. Something-Something V2 has 220,000 clips focused on human-object interactions.

The only source at the required scale is the open web. YouTube alone has over 500 million hours of video content. TikTok adds hundreds of millions more. Instagram, Twitter/X, Twitch, Bilibili, Reddit, and dozens of regional platforms add billions more hours.

But here’s the problem: every single one of these platforms is designed to prevent exactly the kind of bulk collection that world model training requires.

The Arms Race Nobody Sees

YouTube’s anti-bot system is called “Mainline” internally. It’s a machine learning system that processes hundreds of signals per request: IP reputation, TLS fingerprint, browser signature, request timing, JavaScript execution, mouse movement patterns, scroll depth, session history, and behavioral biometrics.

TikTok’s system is even more aggressive. It fingerprints devices at the kernel level, analyzing hardware identifiers, sensor calibration data, and network timing characteristics.

Twitter/X has rate limits so strict that even legitimate API users struggle. Instagram requires authenticated sessions with real engagement history. Twitch uses HLS stream protection that changes keys dynamically.

These aren’t obstacles you overcome with clever code. They’re industrial-scale defense systems built by teams of engineers whose entire job is to distinguish humans from bots.

And your world model training pipeline? It’s a bot. By definition. It’s making millions of requests, downloading terabytes of data, following patterns that no human would exhibit. The platforms will detect it. They will block it. They will do so faster than you can adapt.

Unless you look like a human. Millions of humans. From everywhere. All the time.

The Residential Proxy Imperative

This is where the conversation shifts from machine learning to infrastructure. From model architecture to network architecture.

A residential proxy doesn’t hide you. It transforms you. Your requests originate from IP addresses assigned to actual households by actual ISPs. Comcast in Philadelphia. Orange in Paris. NTT in Osaka. Telstra in Sydney.

To YouTube’s Mainline system, a request from a residential IP in São Paulo looks like a Brazilian user watching soccer highlights. Because that’s exactly what the IP is doing when it’s not proxying your requests. It’s a real household, with real browsing history, real Netflix sessions, real WhatsApp calls.

The technical capabilities that matter for world model data collection:

Scale without signature. 50 million IPs means you can make a million requests and never repeat an address. No pattern to cluster. No fingerprint to build. The platform’s ML system sees noise, not signal.

Geographic authenticity. A world model needs to understand driving in Mumbai, walking in Tokyo, cooking in Mexico City, construction in Lagos. You can’t get that from a datacenter in Virginia. You need IPs from those actual locations, accessing content as local users see it.

Session continuity. Some collection tasks require persistence. Logging into a platform, navigating through pages, maintaining cookies across requests. Sticky sessions let you keep the same IP for 10-30 minutes, completing a logical task before rotating.

Temporal distribution. Real humans don’t download videos at 3 AM in perfectly regular intervals. Residential proxies distribute your requests across real usage patterns, time zones, and behavioral rhythms.

ThorData’s residential proxy infrastructure provides this at the scale world model training requires. Fifty million IPs across 195 countries. City-level targeting. Per-request rotation or sticky sessions. Sub-second latency for pipeline integration.

What the Infrastructure Actually Looks Like

Let me describe a real production pipeline, not a toy example. This is what a team training a world model at the scale of Cosmos or Sora actually builds.

The discovery layer runs hundreds of concurrent workers. Each worker queries platform APIs, search engines, and social media trends to find candidate videos. A query like “first person driving rain” might execute simultaneously from IPs in Seattle, London, Mumbai, and São Paulo, because the search results and recommendations differ by geography.
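As a rough sketch of that fan-out, the snippet below runs one query across several regions in parallel and merges the results by video ID. `search_from_region` is a hypothetical stand-in that returns canned data; a real worker would call whatever search interface the team is permitted to use.

```python
from concurrent.futures import ThreadPoolExecutor

def search_from_region(query, region):
    """Hypothetical stand-in for a search issued from one region (canned data)."""
    canned = {
        "seattle": ["vid_a", "vid_b"],
        "london": ["vid_b", "vid_c"],
        "mumbai": ["vid_d"],
    }
    return [(video_id, region) for video_id in canned.get(region, [])]

def discover(query, regions, max_workers=4):
    """Run the same query from every region and merge results by video ID."""
    candidates = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_region = pool.map(lambda r: search_from_region(query, r), regions)
        for results in per_region:
            for video_id, region in results:
                # Remember which regions surfaced each candidate video.
                candidates.setdefault(video_id, set()).add(region)
    return candidates

found = discover("first person driving rain",
                 ["seattle", "london", "mumbai", "sao_paulo"])
```

Keeping the set of regions per video makes it possible to see later which candidates only appear in one geography.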

The metadata layer stores video IDs, titles, descriptions, durations, view counts, upload dates, channel information, and geographic origin. This feeds into a quality scoring model that predicts training value. A dashcam video from a German autobahn gets high scores for driving physics. A static cooking tutorial gets low scores for motion diversity. A GoPro mountain bike descent gets high scores for egocentric perspective and dynamic camera movement.
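A minimal illustration of the scoring idea follows. It swaps the predictive model described above for a hand-weighted heuristic; the field names, weights, and the upstream `motion_score` are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class VideoMeta:
    video_id: str
    duration_s: int
    perspective: str     # e.g. "egocentric", "dashcam", "static"
    motion_score: float  # 0..1, assumed to come from an upstream estimator

# Hypothetical weights: moving, first-person cameras are valued most.
PERSPECTIVE_WEIGHT = {"egocentric": 1.0, "dashcam": 0.9, "static": 0.3}

def training_value(meta):
    """Heuristic score in 0..1; higher means more useful for training."""
    perspective = PERSPECTIVE_WEIGHT.get(meta.perspective, 0.5)
    # Halve the score for clips that are very short or very long.
    length_factor = 1.0 if 10 <= meta.duration_s <= 3600 else 0.5
    score = 0.6 * meta.motion_score + 0.4 * perspective
    return round(score * length_factor, 3)

bike = VideoMeta("gopro_descent", 600, "egocentric", 0.9)
cooking = VideoMeta("static_cooking", 600, "static", 0.1)
print(training_value(bike), training_value(cooking))
```

A learned model would replace `training_value`, but the surrounding record and the ranking step stay the same.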

The download layer pulls actual video files. This is where sticky sessions matter. A 10-minute 1080p video might take 5 minutes to download. If your IP rotates mid-download, the connection breaks and you start over. Sticky sessions maintain the same IP for the duration, then rotate for the next download.
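The hold-then-rotate behavior can be modeled as a lease with a time-to-live. This is a generic sketch, not any provider's API: the class name and the 600-second default are assumptions, and how a session ID maps to an exit IP is provider-specific and not shown.

```python
import time
import uuid

class StickySession:
    """Hands out the same session ID until its lease expires, then a new one."""

    def __init__(self, ttl_s=600, clock=time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock   # injectable so the lease can be tested
        self._id = None
        self._started = 0.0

    def current(self):
        now = self._clock()
        expired = self._id is not None and now - self._started >= self.ttl_s
        if self._id is None or expired:
            self._id = uuid.uuid4().hex
            self._started = now
        return self._id

    def rotate(self):
        """Force a new session on the next call, e.g. after a download ends."""
        self._id = None
```

A download worker would call `current()` for every chunk of one file and `rotate()` once the file is complete, so a long transfer is never split across sessions.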

The preprocessing layer extracts frames, computes optical flow, separates audio, transcribes speech, detects scenes, and filters out corrupted or low-quality content. This runs on GPU clusters separate from the collection infrastructure.
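One of the simplest filters in that list is dropping static scenes. The sketch below uses mean absolute frame difference as a cheap stand-in for optical flow; frames are assumed to be flat lists of grayscale intensities, and the threshold is an arbitrary illustrative value.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average per-pixel intensity change between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def is_static_clip(frames, threshold=2.0):
    """True if consecutive frames barely change on average."""
    if len(frames) < 2:
        return True  # nothing to compare, treat as static
    diffs = [mean_abs_diff(a, b) for a, b in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs) < threshold

static_clip = [[10, 10, 10, 10]] * 5
moving_clip = [[0, 0, 0, 0], [50, 50, 50, 50], [0, 0, 0, 0]]
print(is_static_clip(static_clip), is_static_clip(moving_clip))
```

A production pipeline would run this on decoded frames in batches on the GPU, but the decision rule has the same shape.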

The training store holds the final dataset, organized by scenario type, geographic origin, camera perspective, and temporal characteristics. This feeds into the actual model training runs.

Every layer above the proxy layer is your code. The proxy layer is the infrastructure that makes all of it possible.

The Geographic Diversity Imperative

Here’s a concrete example of why this matters beyond just “more data.”

NVIDIA’s Cosmos paper emphasizes “physical AI” for robotics and autonomous systems. But consider what “physical” means in different contexts:

A robot trained primarily on US warehouse footage will struggle in a Japanese factory where workers use different tools, different spatial layouts, and different safety protocols. An autonomous vehicle trained on California highways will fail in Indian traffic where lane markings are optional, horn usage is communication, and pedestrians share the road with cows. A home assistant robot trained on American suburban homes will be confused by European apartments with different room configurations, appliance designs, and storage systems.

The physical world is not uniform. Your training data cannot be uniform either.

Residential proxies with geographic targeting let you collect from the actual environments your model needs to understand. Not just country-level, but city-level. A model that needs to understand Mumbai traffic should collect videos from Mumbai, not from a generic “India” IP pool that might land in Bangalore or Delhi.

ThorData’s city-level targeting enables this precision. Need first-person footage from Tokyo’s Shibuya crossing specifically? Targetable. Need construction site videos from Dubai? Targetable. Need rural driving footage from the Australian outback? Targetable.
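To make the coverage argument concrete, a team can compare what it has collected against what it wants per city and scenario. The sketch below is a plain count-based gap ranking; the bucket names and targets are invented for illustration.

```python
def collection_priorities(current_hours, target_hours):
    """Rank (city, scenario) buckets by how far they fall short of target."""
    gaps = {}
    for bucket, target in target_hours.items():
        have = current_hours.get(bucket, 0)
        if have < target:
            gaps[bucket] = (target - have) / target  # fraction still missing
    # Largest shortfall first; ties broken by bucket name for stable output.
    return sorted(gaps, key=lambda b: (-gaps[b], b))

# Hypothetical targets and current holdings, in hours of footage.
target = {("mumbai", "traffic"): 1000,
          ("tokyo", "pedestrian"): 1000,
          ("california", "highway"): 1000}
current = {("mumbai", "traffic"): 100,
           ("california", "highway"): 1000}

priorities = collection_priorities(current, target)
```

Buckets that already meet their target drop out of the list, so the output doubles as a to-do list for the discovery layer.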

The Economics of Scale

World model training is not cheap. The compute alone runs into millions of dollars for serious models. But the data infrastructure is often underestimated.

Consider the cost structure of a production world model training pipeline:

| Component | Percentage of Total Cost | Notes |
|---|---|---|
| Compute (GPUs, TPUs) | 40-50% | The visible cost everyone talks about |
| Storage (hot, warm, cold tiers) | 20-25% | Petabytes add up fast |
| Data collection infrastructure | 15-20% | Includes proxy costs, API fees, engineering |
| Preprocessing and filtering | 10-15% | GPU clusters for video processing |
| Compliance and legal | 5-10% | Licensing, attribution, content review |

Proxy costs are typically 5-8% of total infrastructure spend. The alternative is far more expensive: blocked pipelines, incomplete datasets, biased models that fail in real-world deployment, and months of retraining because your initial data was too narrow.
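Applied to a budget, the percentages above translate into dollar ranges as follows. The shares are the article's; the $10 million total is a hypothetical figure chosen only to make the numbers readable.

```python
# Percentage ranges from the cost breakdown in this section.
COST_SHARES_PCT = {
    "compute": (40, 50),
    "storage": (20, 25),
    "data_collection": (15, 20),
    "preprocessing": (10, 15),
    "compliance": (5, 10),
}
PROXY_SHARE_PCT = (5, 8)  # the proxy share of total infrastructure spend

def usd_range(total_usd, pct_range):
    """Low and high dollar amounts for a percentage range of the total."""
    low_pct, high_pct = pct_range
    return total_usd * low_pct // 100, total_usd * high_pct // 100

def budget_ranges(total_usd):
    return {name: usd_range(total_usd, pct)
            for name, pct in COST_SHARES_PCT.items()}

total = 10_000_000  # hypothetical total budget in USD
ranges = budget_ranges(total)
proxy_low, proxy_high = usd_range(total, PROXY_SHARE_PCT)

# The low ends sum to 90% and the high ends to 120%, so the components
# cannot all sit at the same end of their range at once.
low_sum = sum(low for low, _ in COST_SHARES_PCT.values())
high_sum = sum(high for _, high in COST_SHARES_PCT.values())
```

On this hypothetical budget, proxies account for $500,000 to $800,000, against $4 million to $5 million for compute.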

The companies winning the world model race are not the ones with the most GPUs. They’re the ones with the most comprehensive, diverse, and reliably collected training data. The GPUs are a commodity. The data pipeline is a competitive advantage.

The Hidden Risk: Compliance and Ethics

There’s a darker side to world model data collection that responsible teams must address.

Copyright is the obvious concern. Training on copyrighted video without permission exists in legal gray areas that vary by jurisdiction. The EU AI Act has specific requirements for training data documentation. The US has ongoing litigation around fair use for AI training.

Privacy is equally critical. World models trained on internet video will inevitably capture identifiable individuals, license plates, private property, and sensitive locations. Responsible teams implement face blurring, license plate redaction, and geographic exclusion zones around sensitive facilities.

Consent is the hardest problem. The people in these videos did not consent to having their likenesses used to train AI systems. Some teams are exploring synthetic data generation to reduce reliance on real human footage, but we’re years away from synthetic data replacing real-world diversity.

Residential proxies don’t solve these ethical and legal challenges. But they do enable responsible collection practices. Distributed requests from real IPs look less like aggressive scraping and more like normal platform usage. Geographic targeting lets you avoid jurisdictions with stricter regulations. Session management lets you respect platform terms of service rather than hammering APIs with brute force.

The teams that build sustainable world model businesses will be the ones that solve both the technical and ethical infrastructure challenges.

The Competitive Landscape

The world model space is crowded and moving fast:

| Company | Model | Known Data Scale | Focus Area |
|---|---|---|---|
| OpenAI | Sora | Hundreds of millions of clips | Video generation |
| NVIDIA | Cosmos | 20 million hours | Physical AI, robotics, autonomous |
| Google DeepMind | Genie 2 | Thousands of game sources | Interactive environments |
| World Labs | LWM | Undisclosed | Spatial intelligence |
| Physical Intelligence | π0 | Undisclosed | Robot manipulation |
| Figure AI | Helix | Undisclosed | Humanoid robotics |

What unites them all? None of them are talking about their data infrastructure. The architecture papers are public. The training compute is known. The data pipeline is the trade secret.

This is your opportunity. The companies that build superior data collection infrastructure today will train superior world models tomorrow. The gap between good and great in this space is not model architecture. It’s the diversity and scale of the physical world the model has been exposed to.

Building Your Data Moat

If you’re serious about world models, here’s your 18-month roadmap:

Months 1-3: Foundation. Audit your current data sources. Identify geographic and scenario gaps. Build a proxy-integrated discovery pipeline. Start collecting from 5-10 platforms with ThorData residential proxies.

Months 4-6: Scale. Expand to 50+ geographic regions. Implement quality scoring and automated filtering. Build preprocessing pipelines for frame extraction, optical flow, and audio separation.

Months 7-12: Diversity. Target underrepresented scenarios. Rural environments. Non-Western indoor spaces. Extreme weather conditions. Industrial processes. Add 10+ new platforms including regional social media.

Months 13-18: Optimization. Implement active learning. Use your current model to identify data gaps and prioritize collection. Build synthetic data pipelines to augment real-world footage. Document compliance and provenance for regulatory requirements.

Ongoing: Monitor and adapt. Platform anti-bot systems evolve continuously. Your proxy strategy must adapt. Monitor block rates, success rates, and geographic coverage. Rotate providers if needed. Maintain relationships with multiple proxy networks for redundancy.
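The monitoring step in that roadmap can start very small. The sketch below computes a block rate per region from request logs and flags regions above a threshold; the event format, status labels, and the 20% threshold are assumptions for illustration.

```python
from collections import defaultdict

def block_rates(events):
    """events: iterable of (region, status), status in {"ok", "blocked", "error"}."""
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for region, status in events:
        totals[region] += 1
        if status == "blocked":
            blocked[region] += 1
    return {region: blocked[region] / totals[region] for region in totals}

def regions_over_threshold(events, max_block_rate=0.2, min_requests=5):
    """Regions with enough traffic whose block rate exceeds the threshold."""
    events = list(events)
    counts = defaultdict(int)
    for region, _ in events:
        counts[region] += 1
    rates = block_rates(events)
    return sorted(region for region, rate in rates.items()
                  if counts[region] >= min_requests and rate > max_block_rate)

# Hypothetical log sample.
log = ([("mumbai", "ok")] * 8 + [("mumbai", "blocked")] * 2
       + [("tokyo", "ok")] * 3 + [("tokyo", "blocked")] * 3)
flagged = regions_over_threshold(log)
```

The `min_requests` guard keeps a handful of failures in a low-traffic region from triggering a false alarm.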

Conclusion

The world model race is not won in the model architecture papers. It’s won in the data centers, the proxy networks, the storage systems, and the preprocessing pipelines that nobody sees.

Sora was impressive because of what it generated. Cosmos is impressive because of what it understands. But both are impressive because of the world they have seen. The scale of that seeing is measured in petabytes, in millions of hours of video from every corner of the planet, collected through infrastructure that survives the platforms’ best efforts to prevent exactly that collection.

Your world model will only be as capable as the world it has experienced. Build the infrastructure to show it the whole world.

Start building your world model data infrastructure with ThorData residential proxies