When ChatGPT launched, every developer tried asking it: “Write me a Python script to scrape Amazon.”
The result? Usually a basic script using `requests` and `BeautifulSoup` that gets blocked instantly by Amazon’s anti-bot protection. While useful for beginners, this approach fails in production. The problem isn’t the AI; it’s the context you provide.
In 2026, we don’t just ask ChatGPT to write code; we integrate Large Language Models (LLMs) into the scraping pipeline. In this guide, I will show you the two distinct ways to leverage AI: generating robust, proxy-aware code, and using the OpenAI API to parse unstructured data that traditional selectors cannot handle.
If you ask generic questions, you get generic code. To get a script that actually works in the real world, you need to prompt-engineer for “Resilience.”
Take the generic prompt from the introduction, "Write me a Python script to scrape Amazon." The result: a script with no headers, no error handling, and no proxies. It will fail after 5 requests.

Now ask instead for a Playwright script that routes traffic through a residential proxy, randomizes delays to mimic human behavior, and wraps navigation in proper error handling. Here is the kind of high-quality code ChatGPT generates with that prompt:
```python
from playwright.sync_api import sync_playwright
import time
import random

# Thordata Residential Proxy (Host:Port)
PROXY_SERVER = "http://gate.thordata.com:12345"
PROXY_USERNAME = "user"
PROXY_PASSWORD = "pass"

def run():
    with sync_playwright() as p:
        # Launch Chromium with the proxy configuration
        browser = p.chromium.launch(
            proxy={
                "server": PROXY_SERVER,
                "username": PROXY_USERNAME,
                "password": PROXY_PASSWORD,
            }
        )
        page = browser.new_page()
        try:
            page.goto("https://example.com", timeout=60000)
            # Random delay to mimic human behavior
            time.sleep(random.uniform(2, 5))
            print(page.title())
        except Exception as e:
            print(f"Error: {e}")
        finally:
            browser.close()

if __name__ == "__main__":
    run()
```
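Before pointing this at a real target, it is worth a quick sanity check that traffic actually leaves through the proxy. A minimal sketch using `requests` (the gateway and credentials below are the same placeholders as in the script above; substitute your own):

```python
import requests

# Placeholder Thordata gateway and credentials; replace with your own.
proxies = {
    "http": "http://user:pass@gate.thordata.com:12345",
    "https": "http://user:pass@gate.thordata.com:12345",
}

# httpbin echoes the IP it sees; it should be the proxy's IP, not yours.
print(requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30).json())
```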
This is the exciting part. Sometimes, data is unstructured. Maybe it’s a messy address block like “123 Main St., near the old oak tree, Apt 4B, NY”. Regex will fail here. XPath will fail here.
But LLMs excel at this. We can download the HTML, strip the tags, and ask the OpenAI API to format it into JSON.
Do not send the entire `<html>` body to OpenAI. It will cost a fortune in tokens. Use BeautifulSoup to extract only the text content of the target div, and send that to the API.
```python
import openai
from bs4 import BeautifulSoup

client = openai.OpenAI(api_key="YOUR_OPENAI_KEY")

# Imagine we scraped this messy string from a website
raw_html_content = "<div>Contact: John Doe (Manager) - Phone: 555-0199 or email johnd@corp.com</div>"

# 1. Clean it first (reduce tokens)
soup = BeautifulSoup(raw_html_content, "html.parser")
clean_text = soup.get_text()

# 2. Ask the LLM to format it
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Extract name, role, phone, and email into JSON."},
        {"role": "user", "content": clean_text},
    ],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
# Output: {"name": "John Doe", "role": "Manager", "phone": "555-0199", "email": "johnd@corp.com"}
```
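Because JSON mode guarantees syntactically valid JSON, you can load the response straight into a Python dictionary. A quick follow-on to the script above:

```python
import json

data = json.loads(response.choices[0].message.content)
print(data["email"])  # johnd@corp.com
```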
We are moving towards “Agentic” workflows. Tools like LangChain allow you to create an AI agent that can browse the web, click buttons, and decide what to scrape next.
However, agents are currently slow and expensive per page. For high-volume data extraction (e.g., scraping millions of e-commerce products), the traditional approach of “Optimized Script + Rotating Proxies” remains 100x cheaper and faster.
AI is best used to write the script or clean the specific data points that break your regex rules.
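In practice, that hybrid looks like a simple fallback: run the cheap deterministic parser first, and only pay for an LLM call on records it cannot handle. A minimal sketch, reusing the OpenAI client from above (the phone-number regex and JSON shape are illustrative assumptions):

```python
import json
import re

import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_KEY")

# Fast path: a deterministic pattern that covers the common case for free.
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")

def extract_phone(text: str) -> str | None:
    match = PHONE_RE.search(text)
    if match:
        return match.group()
    # Slow path: only records that break the regex reach the LLM.
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": 'Extract the phone number as JSON: {"phone": "..."}'},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content).get("phone")

print(extract_phone("Phone: 555-0199"))  # regex hit, no API call
```

The key design point is that the LLM sits behind the deterministic path, so token costs scale with your failure rate rather than your page count.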
ChatGPT is a force multiplier for data engineers. Use it to generate boilerplate code that includes Thordata Residential Proxies, and use its API to parse the unparsable. But remember: code generated by AI is only as good as the prompt you give it.
Frequently asked questions
Can ChatGPT browse the web to scrape data?
Yes, ChatGPT Plus (with browsing) can read pages, but it is not scalable for bulk scraping. For production, you should use the OpenAI API to parse HTML that you have downloaded using a dedicated scraper.
How do I fix ChatGPT generating outdated scraping code?
ChatGPT’s training data has a cutoff. You must provide specific prompts asking for ‘modern libraries’ (like Playwright instead of Selenium) and explicitly request error handling patterns.
Is it expensive to use LLMs for scraping?
It can be. To reduce costs, never send the full HTML markup to the API. Instead, extract only the relevant text or specific divs before sending them to the LLM for parsing.

About the author
Kael is a Senior Technical Copywriter at Thordata. He works closely with data engineers to document best practices for bypassing anti-bot protections. He specializes in explaining complex infrastructure concepts like residential proxies and TLS fingerprinting to developer audiences. All code examples in this article have been tested in real-world scraping scenarios.
The thordata Blog offers all its content in its original form and solely for informational purposes. We make no guarantees regarding the information found on the thordata Blog or on any external sites it may direct you to. Before engaging in any scraping endeavors, seek legal counsel and thoroughly examine the specific terms of service of the website in question, or obtain a scraping permit if required.