We Downloaded 2 Million YouTube Videos for Model Training.

The numbers tell a story that our engineering retrospectives still reference. Eight months to collect 200,000 videos. Then six weeks to collect the remaining 1.8 million. Same team. Same budget. Same target quality. The difference was infrastructure, specifically the proxy layer we treated as an afterthought.

Our initial architecture was standard for 2023. Python workers running yt-dlp. A rotating pool of datacenter proxies from a budget provider. Random delays between requests. Header rotation. We considered this sophisticated. It worked for the proof of concept. It worked for the first ten thousand videos. Then the problems started.

YouTube’s detection systems don’t block immediately. They observe. They build confidence scores. Our initial requests passed because our volume was low and our patterns were irregular enough to avoid immediate flags. As we scaled to 5,000 daily downloads, the confidence score accumulated. IP reputation from datacenter ranges. TLS fingerprint consistent with yt-dlp. Request timing that showed statistical regularity despite randomization. No human-like navigation patterns.

The degradation was gradual. Block rates increased from 5% to 15% to 35% over six weeks. We responded with engineering fixes. Longer delays. More proxy rotation. Headless browser configuration. Each fix bought two weeks of improved performance. Then detection adapted. The cycle repeated.
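
Tracking a drift like this requires a block rate measured over recent traffic rather than over the whole run. The post does not show how that was done. The following is a minimal sketch, assuming each download attempt can be classified as blocked or not (for example by an HTTP 403 or 429 response). The class name `BlockRateMonitor` and the window size are illustrative, not taken from the original pipeline.

```python
from collections import deque


class BlockRateMonitor:
    """Rolling block rate over the most recent `window` download attempts."""

    def __init__(self, window=1000):
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)

    def record(self, blocked):
        """Store one attempt: True if it was blocked, False if it succeeded."""
        self.outcomes.append(bool(blocked))

    def rate(self):
        """Fraction of attempts in the window that were blocked (0.0 if empty)."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)
```

A worker would call `record()` after every attempt and alert when `rate()` crosses a threshold. A rolling window surfaces a move from 5% to 15% within days, where a lifetime average would hide it.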

The breaking point came at 200,000 videos. Our block rate hit 70%. Our effective download rate dropped below 500 videos daily. At that velocity, our 2 million target would take three years. Our training schedule assumed completion in fourteen months. The gap was unbridgeable with our current approach.

We evaluated three solutions. Building our own proxy infrastructure was rejected due to operational complexity and the legal risk of sourcing residential IPs. Rotating across multiple datacenter providers was attempted but failed, because platforms detect and blacklist entire IP ranges by ASN regardless of provider. Residential proxy services were selected after evaluating pool size, geographic distribution, session management, and API reliability.

ThorData was selected based on specific capabilities relevant to our use case. The pool of 50 million IPs provided enough distribution for sustained high-volume collection without forming a detectable pattern. The 195-country coverage met the geographic diversity requirements of our model. The sticky session feature kept the same IP for the full length of multi-minute high-resolution downloads. The sub-second latency minimized worker idle time.

The migration required two weeks of engineering effort. The configuration was straightforward. The impact was transformative.

import yt_dlp
import requests

# Placeholder credentials and gateway; substitute your own account values.
THORDATA_PROXY = "http://user:pass@gate.thordata.com:10000"

def download_training_video(video_id, scenario_type, target_quality="720"):
    """
    Download one video through a sticky session so the exit IP stays
    fixed for the whole transfer. Each video gets its own session key,
    so the IP still rotates from one video to the next.
    """
    session_key = f"train_{scenario_type}_{video_id[:8]}"

    # ASSUMPTION: session options are encoded in the proxy username, as is
    # common among residential providers. The "-session-<key>" suffix is
    # illustrative; check the provider's documentation for the exact format.
    scheme, rest = THORDATA_PROXY.split("://", 1)
    user, tail = rest.split(":", 1)
    sticky_proxy = f"{scheme}://{user}-session-{session_key}:{tail}"

    ydl_opts = {
        # Best single-file format at or below the target height
        'format': f'best[height<={target_quality}]',
        'proxy': sticky_proxy,
        'outtmpl': f'./corpus/{scenario_type}/%(id)s.%(ext)s',
        'writethumbnail': True,
        'writeinfojson': True,
        'retries': 5,
        'fragment_retries': 5,
        'skip_unavailable_fragments': True,
        'quiet': True
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(
            f"https://youtube.com/watch?v={video_id}",
            download=True
        )
        return {
            "file_path": ydl.prepare_filename(info),
            "duration": info.get("duration"),
            "resolution": info.get("resolution"),
            "fps": info.get("fps")
        }
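
The post mentions Python workers but does not show how downloads were scheduled. As a rough sketch, assuming a simple thread pool is enough (downloads are I/O-bound), a batch runner could look like the following. The function name `run_batch` and the default worker count are hypothetical; `download_fn` stands for any callable that takes a video ID and either returns a result or raises.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(download_fn, video_ids, max_workers=8):
    """Run download_fn over video_ids concurrently.

    Returns (succeeded, failed): succeeded maps video ID to the download
    result, failed maps video ID to a string describing the exception.
    """
    succeeded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its video ID so failures can be attributed.
        futures = {pool.submit(download_fn, vid): vid for vid in video_ids}
        for future in as_completed(futures):
            vid = futures[future]
            try:
                succeeded[vid] = future.result()
            except Exception as exc:
                # One failed video should not stop the rest of the batch.
                failed[vid] = repr(exc)
    return succeeded, failed
```

The `failed` map gives a per-batch failure count, which is one way to feed block-rate and completion-rate metrics.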

The discovery layer used per-request rotation for search and metadata operations:

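import os

def discover_by_scenario(scenario, regions, max_results=5000):
    """
    Search for training videos across geographic regions.
    Each region's query leaves through a different exit IP, which
    prevents search pattern detection. Stops once max_results is reached.
    """
    videos = []

    # ASSUMPTION: geo-targeting is encoded in the proxy username, as is
    # common among residential providers. The "-country-<code>" suffix is
    # illustrative; check the provider's documentation for the exact format.
    scheme, rest = THORDATA_PROXY.split("://", 1)
    user, tail = rest.split(":", 1)

    for region in regions:
        proxy = f"{scheme}://{user}-country-{region}:{tail}"

        session = requests.Session()
        session.proxies = {"http": proxy, "https": proxy}

        # Platform API or SERP query
        response = session.get(
            "https://serpapi.com/search",
            params={
                "engine": "google",
                "q": f"{scenario} video site:youtube.com",
                "tbm": "vid",
                "num": 100,
                "gl": region,
                # SerpApi requires an API key; the variable name is a placeholder.
                "api_key": os.environ["SERPAPI_KEY"]
            },
            timeout=30
        )
        response.raise_for_status()  # fail loudly instead of parsing an error page
        results = response.json()

        for item in results.get("video_results", []):
            videos.append({
                "url": item["link"],
                "title": item["title"],
                "region": region,
                "scenario": scenario
            })
            if len(videos) >= max_results:
                return videos

    return videos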
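
Discovery returns URLs, while the download function takes a video ID, and the same video can surface in several regions. The post does not show the glue between the two steps. One possible sketch, using only the standard library, is below. The helper names are hypothetical, and only the `watch`, `shorts`, and `youtu.be` URL shapes are handled.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the YouTube video ID from a URL, or None if it is not recognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # parse_qs returns lists; take the first "v" value if present.
            return parse_qs(parsed.query).get("v", [None])[0]
        if parsed.path.startswith("/shorts/"):
            return parsed.path.split("/")[2] or None
    return None


def dedupe_by_video_id(videos):
    """Keep the first record seen for each video ID and attach the ID to it."""
    unique = {}
    for video in videos:
        video_id = extract_video_id(video["url"])
        if video_id is None or video_id in unique:
            continue
        unique[video_id] = {**video, "video_id": video_id}
    return list(unique.values())
```

Each surviving record carries a `video_id` field that can be passed directly to the download step.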

The performance comparison after migration:

| Metric | Datacenter Proxy Phase | Residential Proxy Phase |
|---|---|---|
| Daily download volume | 500-2,000 | 28,000-35,000 |
| Block rate | 65-75% | 0.2-0.4% |
| Engineering time on evasion | 60% | 8% |
| Geographic coverage | 12 countries | 78 countries |
| Average file completion rate | 45% | 99.7% |
| Time to 2M videos | 36 months (projected) | 6.5 weeks (actual) |

The geographic expansion was particularly valuable. Our model required understanding of indoor domestic environments across cultures. Datacenter collection produced 73% Western content, 18% East Asian, 9% other. Residential targeting produced proportional global distribution with intentional over-sampling of underrepresented regions. Model performance on non-Western benchmarks improved 31%.

The engineering time reallocation was equally significant. Engineers previously spending 60% of capacity on evasion adaptations redirected to quality filtering, deduplication, format normalization, and metadata enrichment. The data quality improvement from better preprocessing exceeded the quality improvement from higher collection volume.

For teams facing similar collection challenges, the diagnostic framework is straightforward. Measure your current block rate. If it exceeds 10%, your proxy infrastructure is insufficient for your volume targets. Measure your geographic distribution. If any continent accounts for less than 5% of your collection while holding 15% of the global population, your proxy infrastructure lacks geographic diversity. Measure engineering time allocation. If evasion work exceeds 20% of data engineering capacity, your proxy infrastructure is consuming resources that should go toward improving data quality.
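
The three checks can be written down as one small function. This is a sketch, not a tool from the original pipeline: the function name, the input shapes, and the reading of "15% of global population" as "at least 15%" are assumptions. All values are fractions between 0 and 1.

```python
def diagnose_collection(block_rate, continent_shares, evasion_share):
    """Apply the three thresholds from the text and return the checks that fail.

    continent_shares maps a continent name to a tuple of
    (share of collected videos, share of global population).
    """
    findings = []
    if block_rate > 0.10:
        findings.append("block_rate")
    for continent, (collected, population) in sorted(continent_shares.items()):
        # Under-represented: a large population share, a tiny slice of the corpus.
        if population >= 0.15 and collected < 0.05:
            findings.append(f"geography:{continent}")
    if evasion_share > 0.20:
        findings.append("evasion_time")
    return findings
```

Calling it with the datacenter-phase figures from the table (a block rate of about 0.70 and 0.60 of engineering time on evasion) flags both `block_rate` and `evasion_time`. The geographic check needs per-continent shares that you supply yourself.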

Evaluate ThorData’s residential proxy infrastructure for your collection targets. Review session management for reliable large-file downloads. Explore geographic targeting for diverse training corpora.

The first 200,000 videos taught us what doesn’t work. The remaining 1.8 million taught us what does.