ChatGPT Web Scraping: AI Code & Parsing Guide

[Image: A visualization of an AI brain analyzing HTML code structure]

Kael Odin · Last updated on 2026-01-13 · 6 min read
📌 Key Takeaways
  • Two Roles: Use ChatGPT as a Code Generator to write your scripts, and use the OpenAI API as a Data Parser to clean up messy HTML into JSON.
  • Prompt Engineering: Standard prompts produce code that gets banned. You must instruct the LLM to include “retry logic,” “headless mode,” and “proxy configuration.”
  • Cost Efficiency: Don’t send entire HTML pages to the API. Use BeautifulSoup to extract only relevant text nodes before processing with AI to save 90% on tokens.

When ChatGPT launched, every developer tried asking it: “Write me a Python script to scrape Amazon.”

The result? Usually a basic script using `requests` and `BeautifulSoup` that gets blocked instantly by Amazon’s anti-bot protection. While useful for beginners, this approach fails in production. The problem isn’t the AI; it’s the context you provide.

In 2026, we don’t just ask ChatGPT to write code; we integrate Large Language Models (LLMs) into the scraping pipeline. In this guide, I will show you the two distinct ways to leverage AI: generating robust, proxy-aware code, and using the OpenAI API to parse unstructured data that traditional selectors cannot handle.

Part 1: The Code Generator (Writing Better Prompts)

If you ask generic questions, you get generic code. To get a script that actually works in the real world, you need to prompt-engineer for “Resilience.”

The “Bad” Prompt

“Write a Python script to scrape product prices from example.com.”

Result: a script with no headers, no error handling, and no proxies. On any protected site, it will be blocked within a handful of requests.
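
For reference, here is roughly what that prompt produces. This is a minimal sketch; the URL and the `.price` selector are placeholders, not a real target:

import requests
from bs4 import BeautifulSoup

# No custom headers, no retries, no proxies: protected sites block this quickly
response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

for price in soup.select(".price"):
    print(price.get_text(strip=True))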

The “Pro” Prompt

Recommended Prompt Template

“Act as a Senior Data Engineer. Write a Python script using Playwright (sync API) to scrape product titles from a URL. The script must include:
1. Proxy integration (using `proxy` argument in launch options).
2. Retry logic (try 3 times before failing).
3. Random delays between actions (human-like behavior).
4. Error handling that prints specific error messages.
Use the variable `THORDATA_PROXY` for the proxy string.”

Here is the kind of high-quality code ChatGPT generates with that prompt:

from playwright.sync_api import sync_playwright
import time
import random

# Thordata Residential Proxy (Host:Port), named as requested in the prompt
THORDATA_PROXY = "http://gate.thordata.com:12345"
PROXY_AUTH = {"username": "user", "password": "pass"}

def run():
    with sync_playwright() as p:
        # Launch with the proxy configuration (credentials come from PROXY_AUTH)
        browser = p.chromium.launch(
            proxy={"server": THORDATA_PROXY, **PROXY_AUTH}
        )
        page = browser.new_page()
        
        try:
            page.goto("https://example.com", timeout=60000)
            # Random delay to mimic human behavior
            time.sleep(random.uniform(2, 5))
            print(page.title())
        except Exception as e:
            print(f"Error: {e}")
        finally:
            browser.close()

if __name__ == "__main__":
    run()
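
This snippet covers the proxy configuration and human-like delays; the retry requirement from the prompt can be layered on separately. Here is a minimal sketch of that pattern, which can be dropped into the script above in place of the bare `page.goto` call (the three-attempt count mirrors the prompt):

import random
import time

def goto_with_retries(page, url, attempts=3):
    # Try to load the page up to `attempts` times before giving up
    for attempt in range(1, attempts + 1):
        try:
            page.goto(url, timeout=60000)
            return
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt == attempts:
                raise  # retries exhausted; let the caller handle the failure
            time.sleep(random.uniform(2, 5))  # human-like backoff before retrying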

Part 2: The Data Parser (AI as a Tool)

This is the exciting part. Sometimes, data is unstructured. Maybe it’s a messy address block like “123 Main St., near the old oak tree, Apt 4B, NY”. Regex will fail here. XPath will fail here.

But LLMs excel at this. We can download the HTML, strip the tags, and ask the OpenAI API to format it into JSON.

COST SAVING STRATEGY

Do not send the entire `<html>` body to OpenAI. It will cost a fortune in tokens. Use BeautifulSoup to extract only the text content of the target div, and send that to the API.

import openai
from bs4 import BeautifulSoup

client = openai.OpenAI(api_key="YOUR_OPENAI_KEY")

# Imagine we scraped this messy string from a website
raw_html_content = "<div>Contact: John Doe (Manager) - Phone: 555-0199 or email johnd@corp.com</div>"

# 1. Clean it first (Reduce tokens)
soup = BeautifulSoup(raw_html_content, "html.parser")
clean_text = soup.get_text()

# 2. Ask the LLM to format it (JSON mode requires the word "JSON" in the prompt)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Extract name, role, phone, and email into JSON."},
        {"role": "user", "content": clean_text}
    ],
    response_format={"type": "json_object"}
)

print(response.choices[0].message.content)
# Example output: {"name": "John Doe", "role": "Manager", "phone": "555-0199", "email": "johnd@corp.com"}
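
To verify how much the cleaning step actually saves, you can count tokens locally before calling the API. Here is a minimal sketch using the `tiktoken` library; the page below is a hypothetical stand-in for real scraped HTML:

import tiktoken
from bs4 import BeautifulSoup

# cl100k_base is the tokenizer used by GPT-4-class models
enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical page: mostly noise, plus the one div we actually care about
raw_html = ("<html><body>"
            + "<div class='ad'>Buy now! Limited offer!</div>" * 200
            + "<div id='contact'>Contact: John Doe (Manager) - Phone: 555-0199</div>"
            + "</body></html>")

soup = BeautifulSoup(raw_html, "html.parser")
clean_text = soup.select_one("#contact").get_text()

print("Full HTML tokens:", len(enc.encode(raw_html)))
print("Cleaned tokens:", len(enc.encode(clean_text)))
# The cleaned payload is a small fraction of the full page; that gap is the token saving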

Part 3: The Future (Agentic Scraping)

We are moving towards “Agentic” workflows. Tools like LangChain allow you to create an AI agent that can browse the web, click buttons, and decide what to scrape next.

However, agents are currently slow and expensive per page. For high-volume data extraction (e.g., scraping millions of e-commerce products), the traditional approach of “Optimized Script + Rotating Proxies” remains orders of magnitude cheaper and faster.

AI is best used to write the script or clean the specific data points that break your regex rules.

Conclusion

ChatGPT is a force multiplier for data engineers. Use it to generate boilerplate code that includes Thordata Residential Proxies, and use its API to parse the unparsable. But remember: code generated by AI is only as good as the prompt you give it.


Frequently asked questions

Can ChatGPT browse the web to scrape data?

Yes, ChatGPT Plus (with browsing) can read pages, but it is not scalable for bulk scraping. For production, you should use the OpenAI API to parse HTML that you have downloaded using a dedicated scraper.

How do I fix ChatGPT generating outdated scraping code?

ChatGPT’s training data has a cutoff. You must provide specific prompts asking for ‘modern libraries’ (like Playwright instead of Selenium) and explicitly request error handling patterns.

Is it expensive to use LLMs for scraping?

It can be. To reduce costs, never send full HTML tags to the API. Instead, extract only the relevant text or specific divs before sending them to the LLM for parsing.

About the author

Kael is a Senior Technical Copywriter at Thordata. He works closely with data engineers to document best practices for bypassing anti-bot protections. He specializes in explaining complex infrastructure concepts like residential proxies and TLS fingerprinting to developer audiences. All code examples in this article have been tested in real-world scraping scenarios.

The thordata Blog offers all its content in its original form and solely for informational purposes. We do not offer any guarantees regarding the information found on the thordata Blog or any external sites that it may direct you to. It is essential that you seek legal counsel and thoroughly examine the specific terms of service of any website before engaging in any scraping endeavors, or obtain a scraping permit if required.