We Downloaded 2 Million YouTube Videos for Model Training.

The numbers tell a story that our engineering retrospectives still reference. Eight months to collect 200,000 videos. Then six weeks to collect the remaining 1.8 million. Same team. Same budget. Same target quality. The difference was infrastructure, specifically the proxy layer we treated as an afterthought.

Our initial architecture was standard for 2023. Python workers running yt-dlp. A rotating pool of datacenter proxies from a budget provider. Random delays between requests. Header rotation. We considered this sophisticated. It worked for the proof of concept. It worked for the first ten thousand videos. Then the problems started.

YouTube’s detection systems don’t block immediately. They observe. They build confidence scores. Our initial requests passed because our volume was low and our patterns were irregular enough to avoid immediate flags. As we scaled to 5,000 daily downloads, the evidence against us accumulated: IP reputation tied to datacenter ranges, a TLS fingerprint consistent with yt-dlp, request timing that showed statistical regularity despite randomization, and no human-like navigation patterns.

The degradation was gradual. Block rates increased from 5% to 15% to 35% over six weeks. We responded with engineering fixes. Longer delays. More proxy rotation. Headless browser configuration. Each fix bought two weeks of improved performance. Then detection adapted. The cycle repeated.

The breaking point came at 200,000 videos. Our block rate hit 70%. Our effective download rate dropped below 500 videos daily. At that velocity, our 2 million target would take three years. Our training schedule assumed completion in fourteen months. The gap was unbridgeable with our current approach.

We evaluated three solutions. Building our own proxy infrastructure was rejected because of its operational complexity and the legal risk of sourcing residential IPs. Rotating across multiple datacenter providers was attempted but failed, because platforms detect and blacklist entire ASN ranges regardless of provider. Residential proxy services were selected after we evaluated pool size, geographic distribution, session management, and API reliability.

ThorData was selected for specific capabilities relevant to our use case. Its pool of 50 million IPs provided enough distribution for sustained high-volume collection without pattern detection. Its 195-country coverage let us meet our model's geographic diversity requirements. The sticky session feature kept a consistent IP identity throughout multi-minute high-resolution downloads. Sub-second latency minimized worker idle time.

The migration required two weeks of engineering effort. The configuration was straightforward. The impact was transformative.

    import yt_dlp
    import requests

    THORDATA_PROXY = "http://user:pass@gate.thordata.com:10000"


    def download_training_video(video_id, scenario_type, target_quality="720"):
        """
        Download with sticky session for reliability.
        Rotate per video for discovery distribution.
        """
        # Sticky session maintains one IP for this download.
        # NOTE: how session options are attached to the proxy URL is
        # provider-specific; verify this suffix format against the
        # provider's documentation before use.
        session_key = f"train_{scenario_type}_{video_id[:8]}"
        sticky_proxy = f"{THORDATA_PROXY}&session={session_key}"

        ydl_opts = {
            # Best single-file format at or below the target height
            'format': f'best[height<={target_quality}]',
            'proxy': sticky_proxy,
            'outtmpl': f'./corpus/{scenario_type}/%(id)s.%(ext)s',
            'writethumbnail': True,
            'writeinfojson': True,
            'retries': 5,
            'fragment_retries': 5,
            'skip_unavailable_fragments': True,
            'quiet': True,
        }

        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(
                f"https://youtube.com/watch?v={video_id}",
                download=True,
            )
            return {
                "file_path": ydl.prepare_filename(info),
                "duration": info.get("duration"),
                "resolution": info.get("resolution"),
                "fps": info.get("fps"),
            }
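How session and country options reach a proxy gateway varies by provider, and text appended after the port generally does not parse as a valid proxy URL. A common convention among residential proxy services is to encode such options in the proxy username instead. The sketch below assumes that convention: the `build_proxy_url` helper and the `-country-` / `-session-` suffix format are hypothetical, not a documented ThorData API, so the real syntax should be taken from the provider's documentation.

```python
from urllib.parse import quote, urlsplit

# Placeholder credentials and gateway, matching the snippet above.
PROXY_USER = "user"
PROXY_PASS = "pass"
PROXY_GATEWAY = "gate.thordata.com:10000"


def build_proxy_url(session=None, country=None):
    """Build a proxy URL with targeting options encoded in the username.

    ASSUMPTION: the "-country-XX" and "-session-ID" suffixes are a
    hypothetical format used for illustration only.
    """
    username = PROXY_USER
    if country:
        username += f"-country-{country.lower()}"
    if session:
        username += f"-session-{session}"
    # Percent-encode credentials so special characters cannot break the URL.
    user = quote(username, safe="")
    password = quote(PROXY_PASS, safe="")
    return f"http://{user}:{password}@{PROXY_GATEWAY}"


# Same session key -> same URL, so a multi-minute download can stay pinned.
sticky = build_proxy_url(session="train_kitchen_dQw4w9Wg", country="DE")
parts = urlsplit(sticky)
print(parts.hostname, parts.port, parts.username)
```

Checking the result with `urlsplit` is a cheap way to confirm that the host and port survive whatever option encoding the provider actually requires.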

The discovery layer used per-request rotation for search and metadata operations:

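    import os


    def discover_by_scenario(scenario, regions, max_results=5000):
        """
        Search for training videos across geographic regions.
        Per-request rotation prevents search pattern detection.
        """
        videos = []
        for region in regions:
            # Each search request uses a different IP.
            # NOTE: the country suffix format is provider-specific; verify it
            # against the provider's documentation before use.
            proxy = f"{THORDATA_PROXY}&country={region}"
            session = requests.Session()
            session.proxies = {"http": proxy, "https": proxy}

            # Platform API or SERP query
            response = session.get(
                "https://serpapi.com/search",
                params={
                    "engine": "google",
                    "q": f"{scenario} video site:youtube.com",
                    "tbm": "vid",
                    "num": 100,
                    "gl": region,
                    # SerpApi requires an API key; the environment variable
                    # name used here is an assumption.
                    "api_key": os.environ["SERPAPI_API_KEY"],
                },
                timeout=30,
            )
            response.raise_for_status()
            results = response.json()

            for item in results.get("video_results", []):
                videos.append({
                    "url": item["link"],
                    "title": item["title"],
                    "region": region,
                    "scenario": scenario,
                })
                # Honor the max_results cap across all regions
                if len(videos) >= max_results:
                    return videos
        return videos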
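The discovery function returns result URLs, while the download function expects a bare video ID, and the same video can surface in several regional searches. A minimal sketch of the glue between the two, assuming results are `youtube.com/watch?v=...` or `youtu.be/...` links. The helper names `extract_video_id` and `dedupe_by_video_id` are illustrative and not part of the original pipeline.

```python
from urllib.parse import parse_qs, urlsplit


def extract_video_id(url):
    """Return the video ID from a YouTube watch or youtu.be URL, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parts.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch links carry the ID in the "v" query parameter.
        values = parse_qs(parts.query).get("v")
        return values[0] if values else None
    return None


def dedupe_by_video_id(videos):
    """Keep the first entry per video ID, preserving discovery order."""
    seen = set()
    unique = []
    for video in videos:
        video_id = extract_video_id(video["url"])
        if video_id is None or video_id in seen:
            continue
        seen.add(video_id)
        unique.append({**video, "video_id": video_id})
    return unique
```

Deduplicating before download avoids spending proxy bandwidth on a video that two regional searches both returned.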

The performance comparison after migration:

| Metric | Datacenter proxy phase | Residential proxy phase |
| --- | --- | --- |
| Daily download volume (videos) | 500-2,000 | 28,000-35,000 |
| Block rate | 65-75% | 0.2-0.4% |
| Share of engineering time spent on evasion | 60% | 8% |
| Geographic coverage | 12 countries | 78 countries |
| Average file completion rate | 45% | 99.7% |
| Time to 2M videos | 36 months (projected) | 6.5 weeks (actual) |

The geographic expansion was particularly valuable. Our model required understanding of indoor domestic environments across cultures. Datacenter collection produced 73% Western content, 18% East Asian, 9% other. Residential targeting produced proportional global distribution with intentional over-sampling of underrepresented regions. Model performance on non-Western benchmarks improved 31%.
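One way to turn "proportional distribution with intentional over-sampling" into per-region download targets is a weighted quota split. The sketch below is illustrative only: the `region_quotas` helper, the region names, the base shares, and the boost factor are all hypothetical and do not come from the original pipeline.

```python
def region_quotas(target_total, base_share, boost=None):
    """Split target_total across regions in proportion to boosted shares.

    base_share: region -> relative weight (for example, population share).
    boost: region -> multiplier above 1.0 for regions to over-sample.
    """
    boost = boost or {}
    weights = {
        region: share * boost.get(region, 1.0)
        for region, share in base_share.items()
    }
    total_weight = sum(weights.values())
    # Floor each quota, then hand leftover units to the heaviest regions
    # so the quotas sum exactly to target_total.
    quotas = {
        region: int(target_total * weight / total_weight)
        for region, weight in weights.items()
    }
    leftover = max(target_total - sum(quotas.values()), 0)
    for region in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        quotas[region] += 1
    return quotas


# Hypothetical example: double the weight of an underrepresented region.
quotas = region_quotas(
    1_200_000,
    base_share={"region_a": 5, "region_b": 3, "region_c": 2},
    boost={"region_c": 2.0},
)
print(quotas)
```

The quotas can then drive how many discovery queries are issued per region through the country-targeted proxy.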

The engineering time reallocation was equally significant. Engineers who had spent 60% of their capacity on evasion adaptations were redirected to quality filtering, deduplication, format normalization, and metadata enrichment. The data quality improvement from better preprocessing exceeded the quality improvement from higher collection volume.

For teams facing similar collection challenges, the diagnostic framework is straightforward. Measure your current block rate. If it exceeds 10%, your proxy infrastructure is insufficient for your volume targets. Measure your geographic distribution. If any continent represents less than 5% of your collection despite accounting for 15% of the global population, your proxy infrastructure lacks geographic diversity. Measure engineering time allocation. If evasion exceeds 20% of data engineering capacity, your proxy infrastructure is consuming resources that should go toward improving data quality.
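The three checks above can be sketched as a small function over collection statistics. The thresholds (10%, 5% against 15%, 20%) come from the text; the `CollectionStats` fields and the `diagnose` helper are assumed names for illustration, and the input numbers would come from your own logs.

```python
from dataclasses import dataclass


@dataclass
class CollectionStats:
    requests_total: int
    requests_blocked: int
    collected_by_continent: dict  # continent -> videos collected
    population_share: dict        # continent -> share of world population (0-1)
    evasion_hours: float
    engineering_hours: float


def diagnose(stats):
    """Return a list of findings; an empty list means all three checks pass."""
    findings = []

    block_rate = stats.requests_blocked / max(stats.requests_total, 1)
    if block_rate > 0.10:
        findings.append(f"block rate {block_rate:.1%} exceeds 10%")

    total = max(sum(stats.collected_by_continent.values()), 1)
    for continent, pop_share in stats.population_share.items():
        share = stats.collected_by_continent.get(continent, 0) / total
        if pop_share >= 0.15 and share < 0.05:
            findings.append(
                f"{continent} is {share:.1%} of collection "
                f"but {pop_share:.0%} of population"
            )

    evasion_share = stats.evasion_hours / max(stats.engineering_hours, 1)
    if evasion_share > 0.20:
        findings.append(f"evasion takes {evasion_share:.0%} of engineering time")

    return findings
```

Running this on a weekly snapshot makes gradual degradation visible before it becomes a breaking point.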

The first 200,000 videos taught us what doesn’t work. The remaining 1.8 million taught us what does.
