The numbers tell a story that our engineering retrospectives still reference. Eight months to collect 200,000 videos. Then six weeks to collect the remaining 1.8 million. Same team. Same budget. Same target quality. The difference was infrastructure, specifically the proxy layer we treated as an afterthought.
Our initial architecture was standard for 2023. Python workers running yt-dlp. A rotating pool of datacenter proxies from a budget provider. Random delays between requests. Header rotation. We considered this sophisticated. It worked for the proof of concept. It worked for the first ten thousand videos. Then the problems started.
YouTube’s detection systems don’t block immediately. They observe. They build confidence scores. Our initial requests passed because our volume was low and our patterns were irregular enough to avoid immediate flags. As we scaled to 5,000 daily downloads, the confidence score accumulated. IP reputation from datacenter ranges. TLS fingerprint consistent with yt-dlp. Request timing that showed statistical regularity despite randomization. No human-like navigation patterns.
The degradation was gradual. Block rates increased from 5% to 15% to 35% over six weeks. We responded with engineering fixes. Longer delays. More proxy rotation. Headless browser configuration. Each fix bought two weeks of improved performance. Then detection adapted. The cycle repeated.
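Watching a block rate drift from 5% to 35% requires a rolling measurement rather than a lifetime average. The following is a minimal sketch of one way to track it, not the monitoring the team actually ran. The `BlockRateMonitor` class is hypothetical, and treating HTTP 403 and 429 responses as blocks is an assumption that should be adjusted to whatever signals your target returns.

```python
from collections import deque


class BlockRateMonitor:
    """Rolling block rate over the most recent `window` requests (hypothetical helper)."""

    # Assumption: these status codes indicate a block; adjust for your target.
    BLOCK_STATUSES = {403, 429}

    def __init__(self, window=1000):
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)

    def record(self, status_code):
        self.outcomes.append(status_code in self.BLOCK_STATUSES)

    def block_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)


monitor = BlockRateMonitor(window=4)
for status in (200, 403, 200, 429):
    monitor.record(status)
print(f"{monitor.block_rate():.0%}")  # 50%
```

A short window reacts quickly to a new detection rule, while a long one smooths out noise; logging both makes a gradual climb easier to spot.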
The breaking point came at 200,000 videos. Our block rate hit 70%. Our effective download rate dropped below 500 videos daily. At that velocity, our 2 million target would take three years. Our training schedule assumed completion in fourteen months. The gap was unbridgeable with our current approach.
We evaluated three solutions. Building our own proxy infrastructure was rejected due to operational complexity and the legal risk of sourcing residential IPs. Rotating across multiple datacenter providers was attempted but failed because platforms detect and blacklist entire ASN ranges regardless of provider. Residential proxy services were selected after evaluating pool size, geographic distribution, session management, and API reliability.
ThorData was selected based on specific capabilities relevant to our use case. The 50 million IP pool provided sufficient distribution for sustained high-volume collection without pattern detection. The 195-country coverage met the geographic diversity requirements of our model. The sticky session feature kept the same IP throughout multi-minute high-resolution downloads. The sub-second latency minimized worker idle time.
The migration required two weeks of engineering effort. The configuration was straightforward. The impact was transformative.
import os
import re

import requests
import yt_dlp

# Placeholder credentials and gateway. Encoding options as "-key-value"
# suffixes on the username is an assumed format; confirm the exact syntax
# in your provider's documentation.
PROXY_USER = "user"
PROXY_PASS = "pass"
PROXY_GATEWAY = "gate.thordata.com:10000"


def build_proxy(**options):
    """Return a proxy URL with options (session, country) in the username."""
    suffix = "".join(f"-{key}-{value}" for key, value in options.items())
    return f"http://{PROXY_USER}{suffix}:{PROXY_PASS}@{PROXY_GATEWAY}"


def download_training_video(video_id, scenario_type, target_quality="720"):
    """
    Download with a sticky session for reliability.
    Each video gets its own session key, so the IP is held for one
    download and changes between videos.
    """
    # Sticky session maintains IP for this download; keep the key
    # alphanumeric so it cannot clash with the username delimiters
    raw_key = f"train{scenario_type}{video_id[:8]}"
    session_key = re.sub(r"[^A-Za-z0-9]", "", raw_key)
    sticky_proxy = build_proxy(session=session_key)
    ydl_opts = {
        'format': f'best[height<={target_quality}]',
        'proxy': sticky_proxy,
        'outtmpl': f'./corpus/{scenario_type}/%(id)s.%(ext)s',
        'writethumbnail': True,
        'writeinfojson': True,
        'retries': 5,
        'fragment_retries': 5,
        'skip_unavailable_fragments': True,
        'quiet': True
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(
            f"https://youtube.com/watch?v={video_id}",
            download=True
        )
        return {
            "file_path": ydl.prepare_filename(info),
            "duration": info.get("duration"),
            "resolution": info.get("resolution"),
            "fps": info.get("fps")
        }

The discovery layer used per-request rotation for search and metadata operations:

def discover_by_scenario(scenario, regions, max_results=5000):
    """
    Search for training videos across geographic regions.
    Per-request rotation prevents search pattern detection.
    """
    videos = []
    for region in regions:
        # No session key, so each search request uses a different IP
        proxy = build_proxy(country=region)
        session = requests.Session()
        session.proxies = {"http": proxy, "https": proxy}
        # Platform API or SERP query
        response = session.get(
            "https://serpapi.com/search",
            params={
                "engine": "google",
                "q": f"{scenario} video site:youtube.com",
                "tbm": "vid",
                "num": 100,
                "gl": region,
                # SerpApi requires an API key; the variable name is our choice
                "api_key": os.environ["SERPAPI_KEY"]
            },
            timeout=30
        )
        response.raise_for_status()
        for item in response.json().get("video_results", []):
            videos.append({
                "url": item["link"],
                "title": item["title"],
                "region": region,
                "scenario": scenario
            })
            # Stop once the requested number of results is reached
            if len(videos) >= max_results:
                return videos
    return videos
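Discovery returns URLs, while the download step expects video IDs, and the same video can surface in several regions. The sketch below shows one way the two layers might be connected; it is an illustration under stated assumptions, not the pipeline's actual glue code. The helper names (`extract_video_id`, `unique_video_ids`, `download_all`) are hypothetical, and the download function is passed in as a parameter so the block runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the video ID from a youtube.com/watch or youtu.be URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


def unique_video_ids(videos):
    """Deduplicate discovery results by video ID, keeping the first occurrence."""
    seen = {}
    for video in videos:
        video_id = extract_video_id(video["url"])
        if video_id and video_id not in seen:
            seen[video_id] = video
    return seen


def download_all(videos, download_fn, max_workers=8):
    """Call download_fn(video_id, scenario) concurrently for each unique video."""
    targets = unique_video_ids(videos)
    # Leaving the `with` block waits for every submitted download to finish.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            video_id: pool.submit(download_fn, video_id, video["scenario"])
            for video_id, video in targets.items()
        }
    results, failures = {}, {}
    for video_id, future in futures.items():
        try:
            results[video_id] = future.result()
        except Exception as exc:  # one failed download should not stop the batch
            failures[video_id] = exc
    return results, failures
```

With the functions above, a call such as `download_all(discover_by_scenario("kitchen", ["us", "jp"]), download_training_video)` would run the whole path. Collecting failures separately keeps a retry queue simple to build.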
The performance comparison after migration:
| Metric | Datacenter Proxy Phase | Residential Proxy Phase |
| --- | --- | --- |
| Daily download volume | 500-2,000 | 28,000-35,000 |
| Block rate | 65-75% | 0.2-0.4% |
| Engineering time on evasion | 60% | 8% |
| Geographic coverage | 12 countries | 78 countries |
| Average file completion rate | 45% | 99.7% |
| Time to 2M videos | 36 months (projected) | 6.5 weeks (actual) |
The geographic expansion was particularly valuable. Our model required understanding of indoor domestic environments across cultures. Datacenter collection produced 73% Western content, 18% East Asian, 9% other. Residential targeting produced proportional global distribution with intentional over-sampling of underrepresented regions. Model performance on non-Western benchmarks improved 31%.
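Over-sampling underrepresented regions can be expressed as per-region download quotas. The sketch below is one possible approach, not the allocation logic used in the project: the `allocate_quotas` function and its `boost` multipliers are hypothetical, and the input numbers are illustrative only.

```python
def allocate_quotas(total, population_share, boost=None):
    """Split `total` downloads across regions by boosted population share."""
    boost = boost or {}
    # Multiply each region's share by its boost factor (default 1.0).
    weights = {
        region: share * boost.get(region, 1.0)
        for region, share in population_share.items()
    }
    weight_sum = sum(weights.values())
    # Rounding means the quotas may differ from `total` by a few items.
    return {
        region: round(total * weight / weight_sum)
        for region, weight in weights.items()
    }


# Illustrative inputs only, not figures from the project.
quotas = allocate_quotas(
    total=1000,
    population_share={"region_a": 0.5, "region_b": 0.5},
    boost={"region_b": 3.0},
)
print(quotas)  # {'region_a': 250, 'region_b': 750}
```

Each quota can then be passed as the per-region result limit during discovery, so the geographic mix is set before any bandwidth is spent on downloads.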
The engineering time reallocation was equally significant. Engineers previously spending 60% of capacity on evasion adaptations redirected to quality filtering, deduplication, format normalization, and metadata enrichment. The data quality improvement from better preprocessing exceeded the quality improvement from higher collection volume.
For teams facing similar collection challenges, the diagnostic framework is straightforward. Measure your current block rate. If it exceeds 10%, your proxy infrastructure is insufficient for your volume targets. Measure your geographic distribution. If any continent represents less than 5% of your collection despite holding 15% of the global population, your proxy infrastructure lacks geographic diversity. Measure engineering time allocation. If evasion exceeds 20% of data engineering capacity, your proxy infrastructure is consuming resources that should improve data quality.
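The three checks can be run as a single function over collected metrics. This is a minimal sketch using the thresholds stated in the text; the `CollectionStats` structure and `diagnose` function are hypothetical names, and how you gather the underlying numbers is left to your own logging.

```python
from dataclasses import dataclass


@dataclass
class CollectionStats:
    blocked_requests: int
    total_requests: int
    collection_share: dict  # continent -> fraction of collected videos
    population_share: dict  # continent -> fraction of global population
    evasion_hours: float
    engineering_hours: float


def diagnose(stats):
    """Return one finding per threshold from the text that is exceeded."""
    findings = []
    if stats.total_requests:
        block_rate = stats.blocked_requests / stats.total_requests
        if block_rate > 0.10:
            findings.append(f"block rate {block_rate:.0%} exceeds 10%")
    for continent, population in stats.population_share.items():
        collected = stats.collection_share.get(continent, 0.0)
        # Under 5% of collection for a continent with 15% or more of population.
        if population >= 0.15 and collected < 0.05:
            findings.append(
                f"{continent} is under-represented ({collected:.0%} of collection)"
            )
    if stats.engineering_hours:
        evasion = stats.evasion_hours / stats.engineering_hours
        if evasion > 0.20:
            findings.append(f"evasion takes {evasion:.0%} of engineering time")
    return findings
```

An empty list means none of the three thresholds is exceeded. Running the check weekly turns the framework into a trend rather than a one-off audit.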
Evaluate ThorData’s residential proxy infrastructure for your collection targets. Review session management for reliable large-file downloads. Explore geographic targeting for diverse training corpora.
The first 200,000 videos taught us what doesn’t work. The remaining 1.8 million taught us what does.