Web crawlers have quietly become a cost-effective way for engineering teams, data analysts, and product developers to transform public web pages into structured datasets—without relying on expensive data vendors or resorting to inefficient manual copy-and-paste processes. As long as they’re conducted in compliance with regulations and with a sense of responsibility, Python web crawlers can support a wide range of applications, from price intelligence and product catalog monitoring to content aggregation and competitive analysis.
Web crawling is the process of automatically extracting data from web pages (typically HTML) and converting it into a structured format such as tables, spreadsheets, or databases. Instead of manually opening dozens of pages and copying fields one by one, a script sends HTTP requests, parses the page content, and writes the results to a file or storage layer.
●E-commerce intelligence: product name, price, description, inventory status
●Real estate analysis: property listings, price changes
●Media and news aggregation: title, author, category, publication time
●Community and content curation: quotes, jokes, comments, forum posts (within permissible limits)
●Dataset initialization: seed training data for internal NLP classification or annotation pipelines
import asyncio, aiohttp, time
from datetime import datetime

URLS = [
    "https://example.com/page1",
    "https://example.com/page2",
]
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)"
}

async def fetch(session, url, timeout=20):
    t0 = time.time()
    try:
        async with session.get(url, headers=HEADERS, timeout=timeout) as r:
            text = await r.text()
            ms = int((time.time() - t0) * 1000)
            print(f"{datetime.utcnow().isoformat()}Z INFO fetch url={url} status={r.status} latency_ms={ms} bytes={len(text)}")
            return r.status, text
    except Exception as e:
        ms = int((time.time() - t0) * 1000)
        print(f"{datetime.utcnow().isoformat()}Z ERROR fetch url={url} latency_ms={ms} err={type(e).__name__} msg={e}")
        return None, None

async def main():
    # Cap concurrency at 20 connections; leave TLS verification on in production
    connector = aiohttp.TCPConnector(limit=20)
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*(fetch(session, u) for u in URLS))

if __name__ == "__main__":
    asyncio.run(main())
Crawlers often serve as the “data ingestion layer”: web → data cleansing → analysis/dashboard → alerting/automation.
A usable crawler typically consists of several verified components:
requests is the mainstream library for fetching HTML. It can:
●Send GET requests to public pages
●Submit POST requests to login forms
●Manage request headers (especially the User-Agent) to look more browser-like
●Maintain cookie and session state via Session()
BeautifulSoup (bs4) parses HTML into a searchable tree structure, letting you locate elements by:
●Tag name (div, h4, img, a)
●Class and attributes
●Hierarchical structure (e.g., product card container → title → price)
Using lxml as the parser improves performance and robustness against "dirty HTML."
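A minimal sketch combining the two libraries (the URL and the product-card class name are hypothetical examples, not a real site's markup):

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}

def fetch_html(url, session=None, timeout=20):
    """GET a public page; a requests.Session() keeps cookies across calls."""
    s = session or requests.Session()
    s.headers.update(HEADERS)
    resp = s.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text

def extract_titles(html):
    """Parse the HTML tree and pull each product-card title."""
    # "html.parser" is stdlib; swap in "lxml" for speed on dirty HTML
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.select("div.product-card h4")]

# Usage: titles = extract_titles(fetch_html("https://example.com/products"))
```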
Data extraction is only half the job; a professional pipeline must ultimately produce clean output:
●Structured field columns (name, price, description, image link, product detail page link)
●A consistent encoding and cleansing strategy
●Predictable file naming and directory structure for easy downstream use
Excel export is particularly common because it’s suitable for both engineers and non-technical stakeholders.
Before writing code, you need to “understand” the page just like a browser does.
Open Chrome DevTools → Inspect Element, then locate:
●Product card: a div container that typically repeats on the page
●Title/name: usually a heading tag (h4, h5) or a link tag (a)
●Price: a span (or heading) carrying a price-related class
●Image link: an img tag, usually in src (or a lazy-loaded attribute such as data-src)
●Detail page link: an anchor of the form a href="/product/123"
Key best practice: Avoid using selectors that are prone to change (such as paths with excessively deep nesting levels). Instead, prioritize stable class names or recurring structural features.
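As an illustration (the HTML snippet and class names below are made up), locating those fields with stable selectors rather than deep nesting paths:

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card">
  <a href="/product/123"><h4>Mini Keyboard</h4></a>
  <span class="price">$19.99</span>
  <img src="/images/item123.jpg" data-src="/images/item123-hd.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.select_one("div.product-card")        # stable class, not a deep path
title = card.select_one("h4").get_text(strip=True)
price = card.select_one("span.price").get_text(strip=True)
image = card.img.get("data-src") or card.img.get("src")  # prefer the lazy-load attr
link = card.a["href"]
print(title, price, image, link)
```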
Most crawlers evolve through three stages. The first is a minimal single-page script:
●Send a request
●Parse the HTML
●Find the card list
●Extract a small number of fields
●Print the output for verification
This stage emphasizes correctness and reproducibility, without pursuing speed.
The second stage is pagination. If a site has 7 pages, a common pattern is to loop over a URL parameter:
●page=1 → page=2 → … → page=7
Within each pagination iteration, loop through all product cards on the current page and extract the fields.
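That loop can be sketched like this (the URL pattern is a hypothetical example):

```python
BASE = "https://example.com/catalogue/?page={}"   # hypothetical pagination pattern

def page_urls(last_page):
    """Yield page=1 … page=last_page, one URL at a time."""
    for page in range(1, last_page + 1):
        yield BASE.format(page)

for url in page_urls(7):
    # In a real crawler: fetch url here, then loop over each product card on the page
    print(url)
```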
The third stage targets detail pages. List pages typically provide only partial information, so a more robust pipeline is:
●Scrape the listing page to collect product detail page URLs.
●Request each detail page
●Parse the richer description/specification module
●Merge into the output line
At this point, the simple script has been upgraded to a genuine data ingestion task.
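A sketch of that listing-to-detail flow, operating on HTML strings so the parsing logic stays visible (the selectors and BASE are assumptions; in a real crawler each string comes from an HTTP response):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://example.com"

def detail_urls(listing_html):
    """Step 1: collect absolute detail-page URLs from a listing page."""
    soup = BeautifulSoup(listing_html, "html.parser")
    return [urljoin(BASE, a["href"]) for a in soup.select("div.product-card a[href]")]

def parse_detail(detail_html, url):
    """Steps 2-3: parse the richer description module on the detail page."""
    soup = BeautifulSoup(detail_html, "html.parser")
    return {
        "url": url,
        "name": soup.select_one("h1").get_text(strip=True),
        "description": soup.select_one("div.description").get_text(strip=True),
    }

# Step 4: merge each parse_detail() dict into the output rows
```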
Many websites return links that look like this:
●/images/item123.jpg
●/products/sku-001
These are relative URLs and must be combined with the site’s root domain to form absolute URLs:
●https://example.com + /products/sku-001 → https://example.com/products/sku-001
Professional crawlers treat URL normalization as a crucial step, because downstream tasks—such as image downloading and detail page scraping—depend on it.
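Python's standard library handles this normalization directly with urllib.parse.urljoin:

```python
from urllib.parse import urljoin

base = "https://example.com/products/"

print(urljoin(base, "/images/item123.jpg"))  # root-relative -> https://example.com/images/item123.jpg
print(urljoin(base, "sku-001"))              # page-relative -> https://example.com/products/sku-001
print(urljoin(base, "https://cdn.example.com/a.png"))  # already absolute -> passed through unchanged
```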
A more memory-efficient, cleaner pipeline
When crawling a large number of pages, the “naive approach” often involves storing everything in a list:
●Pagination URL List
●Product URL List
●Extract result row list
It can run, but it’ll soon hit a bottleneck.
A generator produces only one URL or one record at a time:
●Lower memory usage
●Easier to stream write into Excel
●Easier for subsequent integration with queues, databases, or ETL processes.
A clear architecture is:
URL Generator → Page Request → Parsing → Record Generator → Output Writer
This provides a foundation for scaling to tens of thousands of records without having to rewrite the entire codebase.
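A minimal sketch of that architecture with generators (the URL pattern and the line-splitting "parser" are illustrative stand-ins):

```python
def url_generator(last_page):
    """Yield pagination URLs lazily instead of building a list up front."""
    for page in range(1, last_page + 1):
        yield f"https://example.com/catalogue/?page={page}"

def record_generator(pages):
    """Consume (url, html) pairs and yield one record at a time."""
    for url, html in pages:
        for line in html.splitlines():        # stand-in for real HTML parsing
            yield {"source": url, "value": line}

# The writer consumes records lazily, so memory use stays flat:
#   for record in record_generator(fetched_pages):
#       writer.append(record)
```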
Web scraping is not just a technical exercise—it also means being a “rule-abiding internet citizen” and doing your best to avoid triggering automated detection systems.
Adding a small sleep() interval between requests reduces server load and mitigates the following risks:
●IP Blocking
●CAPTCHA verification
●Response throttling
Servers often flag "unknown clients." The standard practice is to set:
●User-Agent
●Sometimes Accept-Language
●Sometimes Referer (only when it is reasonable and lawful for your workflow)
Professional crawlers should log and handle:
●429 Too Many Requests (slow down)
●403 Forbidden (stop; access denied)
●5xx (server instability)
Even if the code is lightweight, the design should anticipate failures and be able to recover from them.
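One way to build that in (a sketch; the session object only needs a requests.Session-style .get() method):

```python
import time

def fetch_with_backoff(session, url, max_retries=4, base_delay=1.0):
    """Retry transient failures with exponential backoff; never retry a 403."""
    for attempt in range(max_retries):
        resp = session.get(url, timeout=20)
        if resp.status_code == 429 or 500 <= resp.status_code < 600:
            # Honor Retry-After if the server sends one, else back off exponentially
            delay = float(resp.headers.get("Retry-After", base_delay * 2 ** attempt))
            time.sleep(delay)
            continue
        if resp.status_code == 403:
            raise PermissionError(f"403 Forbidden for {url} -- stop, do not retry")
        return resp
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```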
Excel is popular because it is:
●Easy to share
●Auditable
●Highly compatible with analytics workflows and BI imports
A good export should include:
●A consistent schema: ProductName | Price | Description | ImageURL | ProductURL
●Normalized line breaks and redundant spaces removed
●Column widths set for readability
●Incremental writes (especially when fetching many pages)
This is precisely where openpyxl and xlsxwriter shine: you can generate tables that are ready to use right away, rather than a messy pile of data.
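A sketch with openpyxl (the filename and sample row are illustrative):

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Products"

# Consistent schema in row 1
ws.append(["ProductName", "Price", "Description", "ImageURL", "ProductURL"])

# append() writes row by row, so a large crawl can stream in incrementally
ws.append(["Mini Keyboard", 19.99, "Compact 60% layout",
           "https://example.com/images/item123.jpg",
           "https://example.com/product/123"])

# Widen columns for readability
for col, width in zip("ABCDE", (24, 10, 40, 45, 45)):
    ws.column_dimensions[col].width = width

wb.save("products.xlsx")
```

For very large exports, openpyxl's write-only mode (Workbook(write_only=True)) keeps memory flat by streaming rows straight to disk.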
Professional technical readers should take this as a given: being able to scrape does not mean being allowed to scrape.
Key operational guidelines:
●Respect robots.txt where applicable (it is not law, but it is a strong norm)
●Comply with the website’s Terms of Service and access restrictions.
●Avoid collecting personal data or sensitive information.
●Apply strict rate limits, and identify your crawler when required
●When an official API is available, prioritize using the API.
In a corporate setting, it’s best to involve legal/compliance as early as possible—especially for large-scale, long-term, and periodic data collection projects.
A robust Python web crawler is not merely about HTML parsing—it’s a complete pipeline. By treating the crawler as an engineering system rather than a one-off script, you can build a reusable data access layer that delivers ongoing value for analytics, product features, and operational monitoring.
Frequently asked questions
What’s the difference between BeautifulSoup and Scrapy? How should I choose between them?
BeautifulSoup is suitable for lightweight parsing and fast pipelines; Scrapy is a full-fledged framework with built-in scheduling, retry mechanisms, and pipelines. Choose Scrapy when you need large-scale web scraping and long-term maintenance.
How can I write the scraped data into a database instead of Excel?
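A common route, sketched here with the stdlib sqlite3 module (the table name, columns, and sample row are illustrative), is to swap the Excel writer for database inserts, with a UNIQUE constraint making repeated runs idempotent:

```python
import sqlite3

conn = sqlite3.connect("products.db")
conn.execute("""CREATE TABLE IF NOT EXISTS products (
    name TEXT, price REAL, url TEXT UNIQUE)""")

rows = [("Mini Keyboard", 19.99, "https://example.com/product/123")]
# INSERT OR IGNORE + the UNIQUE url column skips duplicates on re-runs
conn.executemany("INSERT OR IGNORE INTO products VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()
```

For PostgreSQL or MySQL the shape is the same via a driver such as psycopg2 or an ORM like SQLAlchemy; only the connection setup and placeholder syntax change.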
Can Requests capture pages rendered by JavaScript?
Requests can only retrieve static HTML returned by the server. If the content is rendered on the frontend, you may need to call the actual JSON API (which can usually be found in the Network tab of DevTools) or use browser automation tools such as Playwright.
About the author
Xyla is a technical writer who turns complex networking and data topics into practical, easy-to-follow guides, treating content like troubleshooting: start from real scenarios, validate with data, and explain the “why” behind each solution. Outside of work, she’s a Level 2 badminton referee and marathon trainee—finding her best ideas between the court and the finish line.
The thordata Blog offers all its content in its original form and solely for informational intent. We do not offer any guarantees regarding the information found on the Thordata blog or any external sites that it may direct you to. It is essential that you seek legal counsel and thoroughly examine the specific terms of service of any website before engaging in any scraping endeavors or obtain a scraping permit if required.