Fetch real-time data from 100+ websites,No development or maintenance required.
Over 100 million real residential IPs from genuine users across 190+ countries.
SCRAPING SOLUTIONS
Get accurate and in real-time results sourced from Google, Bing, and more.
With 120+ prebuilt and custom scrapers ready for any use case.
No blocks, no CAPTCHAs—unlock websites seamlessly at scale.
Execute scripts in stealth browsers with full rendering and automation
PROXY INFRASTRUCTURE
Over 100 million real residential IPs from genuine users across 190+ countries.
Reliable mobile data extraction, powered by real 4G/5G mobile IPs.
For time-sensitive tasks, utilize residential IPs with unlimited bandwidth.
Fast and cost-efficient IPs optimized for large-scale scraping.
SCRAPING SOLUTIONS
PROXY INFRASTRUCTURE
DATA FEEDS
Full details on all features, parameters, and integrations, with code samples in every major language.
LEARNING HUB
ALL LOCATIONS Proxy Locations
TOOLS
RESELLER
Get up to 50%
Contact sales:partner@thordata.com
Products $/GB
Fetch real-time data from 100+ websites,No development or maintenance required.
Get real-time results from search engines. Only pay for successful responses.
Execute scripts in stealth browsers with full rendering and automation.
Bid farewell to CAPTCHAs and anti-scraping, scrape public sites effortlessly.
Dataset Marketplace Pre-collected data from 100+ domains.
Over 100 million real residential IPs from genuine users across 190+ countries.
Reliable mobile data extraction, powered by real 4G/5G mobile IPs.
For time-sensitive tasks, utilize residential IPs with unlimited bandwidth.
Fast and cost-efficient IPs optimized for large-scale scraping.
Data for AI $/GB
Pricing $0/GB
Docs $/GB
Full details on all features, parameters, and integrations, with code samples in every major language.
Resource $/GB
EN $/GB
产品 $/GB
AI数据 $/GB
定价 $0/GB
产品文档 $/GB
资源 $/GB
简体中文 $/GB
Blog
Residential Proxies
Robotics companies training vision-language-action models face a specific data challenge. Their models need to understand not just “cooking” in the abstract, but the specific physical environments where cooking happens. The layout of a Tokyo apartment kitchen with its compact appliances and vertical storage. The open-air cooking space in a Lagos compound with its charcoal stoves and shared preparation areas. The commercial kitchen in a Mumbai restaurant with its stainless steel surfaces and high-volume workflows. The suburban American kitchen with its island counters and double ovens.
A robot trained only on American kitchen footage fails catastrophically in any other environment. It doesn’t recognize the appliances. It misjudges spatial relationships. It fails to predict human movement patterns. The training data must capture global diversity, and the collection infrastructure must access that diversity.
YouTube is the richest source of kitchen footage ever assembled. Billions of hours of cooking tutorials, recipe demonstrations, kitchen tours, and food preparation videos from every culture and environment. But accessing this diversity requires infrastructure that can present as a local user in each target region.
YouTube’s recommendation and search systems are heavily personalized by geography. A search for “kitchen cooking” from a US IP returns American suburban kitchens, English-language content, and Western culinary traditions. The same search from a Japanese IP returns compact apartment kitchens, Japanese-language content, and Asian culinary techniques. The platform optimizes for engagement, and engagement is highest with culturally familiar content.
For robotics training, this means collection infrastructure must query from target regions to receive target results. A datacenter proxy in Virginia searching for global kitchen diversity receives primarily American results regardless of query refinement. A residential proxy in Osaka searching for kitchen content receives genuinely Japanese results because the platform serves that IP as it serves any Osaka user.
The geographic precision requirements extend beyond country level. A model deployed in São Paulo needs to understand São Paulo kitchens, which differ from Rio kitchens, which differ from rural Minas Gerais kitchens. City-level proxy targeting enables this precision.
ThorData’s residential infrastructure provides this granularity. City-level targeting across 195 countries means a query for “cozinha pequena” (small kitchen) can originate from São Paulo, Belo Horizonte, or Salvador, each returning distinct regional content. Session management maintains consistent identity for multi-video channel browsing, enabling collection of creator portfolios that show consistent kitchen environments. Rotation control distributes search queries across thousands of IPs to prevent pattern detection.
The implementation for robotics training collection:
import requests
import yt_dlp
THORDATA = "http://user:pass@gate.thordata.com:10000"
class KitchenCorpusBuilder:
"""
Build geographically diverse kitchen video corpus
for robotics training.
"""
TARGET_CITIES = {
"tokyo": {"country": "jp", "query": "キッチン 料理"},
"osaka": {"country": "jp", "query": "大阪 キッチン"},
"mumbai": {"country": "in", "query": "rasoi kitchen"},
"delhi": {"country": "in", "query": "delhi kitchen cooking"},
"lagos": {"country": "ng", "query": "nigerian kitchen cooking"},
"accra": {"country": "gh", "query": "ghana kitchen food"},
"saopaulo": {"country": "br", "query": "cozinha brasileira"},
"riodejaneiro": {"country": "br", "query": "cozinha carioca"},
"mexicocity": {"country": "mx", "query": "cocina mexicana"},
"guadalajara": {"country": "mx", "query": "cocina jalisciense"},
"paris": {"country": "fr", "query": "cuisine française"},
"lyon": {"country": "fr", "query": "cuisine lyonnaise"},
"chicago": {"country": "us", "query": "american kitchen cooking"},
"houston": {"country": "us", "query": "texas kitchen cooking"}
}
def collect_city_kitchens(self, city, max_videos=500):
"""
Collect kitchen videos from specific city perspective.
"""
config = self.TARGET_CITIES[city]
# City-targeted proxy for authentic local results
city_proxy = f"{THORDATA}&country={config['country']}"
# Discovery with local language query
session = requests.Session()
session.proxies = {"http": city_proxy, "https": city_proxy}
session.headers.update({
"Accept-Language": f"{config['country']},en;q=0.5"
})
# Search YouTube via Invidious or SERP API
videos = self._search_local_youtube(
session, config["query"], max_videos
)
# Download with sticky session for reliability
downloaded = [ ]
for video in videos:
try:
result = self._download_with_city_session(
video, city, config["country"]
)
downloaded.append(result)
except Exception as e:
print(f"Failed {video['id']}: {e}")
return downloaded
def _search_local_youtube(self, session, query, max_results):
"""
Search YouTube from local perspective.
Returns list of video metadata.
"""
# Using Invidious instance or SERP API
# with geographic parameters
response = session.get(
"https://serpapi.com/search",
params={
"engine": "youtube",
"search_query": query,
"gl": session.proxies["http"].split("country=")[1],
"num": min(max_results, 100)
},
timeout=30
)
return [
{
"id": item["id"],
"title": item["title"],
"channel": item["channel"]
}
for item in response.json().get("video_results", [ ])
]
def _download_with_city_session(self, video, city, country):
"""
Download maintaining city identity throughout transfer.
"""
session_key = f"kitchen_{city}_{video['id'][:6]}"
sticky = f"{THORDATA}&country={country}&session={session_key}"
out_dir = f"./kitchens/{city}"
os.makedirs(out_dir, exist_ok=True)
ydl_opts = {
'format': 'best[height<=720]',
'proxy': sticky,
'outtmpl': os.path.join(out_dir, '%(id)s.%(ext)s'),
'writethumbnail': True,
'writeinfojson': True,
'retries': 5,
'quiet': True
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
info = ydl.extract_info(
f"https://youtube.com/watch?v={video['id']}",
download=True
)
return {
"city": city,
"video_id": video["id"],
"duration": info.get("duration"),
"resolution": info.get("resolution"),
"file_path": ydl.prepare_filename(info)
}
The corpus quality metrics from city-targeted collection versus generic collection:
| Metric | Generic US Collection | City-Targeted Collection |
| Kitchen type diversity | 3 distinct layouts | 14 distinct layouts |
| Appliance recognition coverage | 67% of global types | 94% of global types |
| Spatial layout variation | Low | High |
| Human movement pattern diversity | Limited | Extensive |
| Language distribution | 98% English | 12 languages |
| Training transfer to new cities | Poor | Strong |
The robotics deployment impact is direct. A model trained on city-targeted data successfully operates in 12/12 test kitchens across 6 countries. A model trained on generic data succeeds in 3/12, failing on compact Asian kitchens, open-air African kitchens, and commercial South Asian kitchens.
The infrastructure investment for city-targeted collection is modest compared to robotics hardware and compute costs. ThorData’s residential proxy service enables this targeting without operational complexity. The city-level targeting API requires a single parameter addition to standard proxy configuration. The session management ensures download reliability for large video files. The geographic coverage spans the global markets where robotics deployment is planned.
The competitive implication is significant. Robotics companies with superior training data deploy successfully in more environments, win more contracts, and establish market presence faster. The data advantage compounds as deployed robots collect additional real-world experience, creating a flywheel that competitors cannot match without equivalent initial data diversity.
For robotics teams planning training data strategy, the question is not whether geographic diversity matters. It is whether your collection infrastructure can access that diversity. Datacenter proxies cannot. Residential proxies can. The difference determines deployment success.
Configure city-targeted collection for your robotics training. Review session options for reliable video downloads. Explore global coverage for your deployment markets.
Your robot needs to see every kitchen. Your infrastructure needs to get it there.
Looking for
Top-Tier Residential Proxies?
您在寻找顶级高质量的住宅代理吗?
YouTube Video Collection at Scale: A Complete Python Pipeline with Residential Proxy Integration
This is a practical guide for ...
Xyla Huxley
2026-06-18
We Downloaded 2 Million YouTube Videos for Model Training.
The numbers tell a story that our engineering retrospec […]
Unknown
2026-06-18
The End of Curated Datasets: Why Frontier Multimodal Models Train on Raw Web Video
The research community spent d ...
Xyla Huxley
2026-06-18
The $2 Million Question: Why Our Multimodal Training Budget Went 40% Over (And It Wasn’t GPUs)
We budgeted carefully. Compute allocation for 512 H100s […]
Unknown
2026-06-18
Building a Petabyte-Scale Video Corpus for Multimodal LLMs: The Infrastructure Nobody Talks About
Everyone discusses transformer ...
Xyla Huxley
2026-06-18
How to Set Up Thordata Residential Proxies in VMLogin: Step-by-Step Integration Guide
Learn how to set up Thordata r ...
Jenny Avery
2026-06-16
What Is AI Scraping? A Complete Guide for 2026
Since the early days of the in ...
Xyla Huxley
2026-06-16
Throdata and Morelogin Integration Guide: Build a Safer and More Efficient Multi-Account Management Environment
As a global provider of reside ...
Xyla Huxley
2026-06-16
Web Scraping for Machine Learning: A 2026 Guide
Building algorithms that under ...
Xyla Huxley
2026-06-16