Web Scraping with Python using Requests


Xyla Huxley
Last updated on 2026-03-03
6 min read

Web crawlers have quietly become a cost-effective way for engineering teams, data analysts, and product developers to transform public web pages into structured datasets—without relying on expensive data vendors or resorting to inefficient manual copy-and-paste processes. As long as they’re conducted in compliance with regulations and with a sense of responsibility, Python web crawlers can support a wide range of applications, from price intelligence and product catalog monitoring to content aggregation and competitive analysis.

What is a Python web crawler? Why do technical teams use it?

Web crawling is the process of automatically extracting data from web pages (typically HTML) and converting it into a structured format—such as tables, spreadsheets, or database records. Instead of manually opening dozens of pages and copying fields one by one, a script sends HTTP requests, parses the page content, and writes the results to a file or storage layer.

Application scenarios

E-commerce Intelligence: Product Name, Price, Description, Inventory Status

Real Estate Analysis: Property Listings, Price Changes

Media and News Aggregation: Title, Author, Category, Time

Community and Content Curation: Quotes, Jokes, Comments, Forum Posts (within permissible limits)

Dataset Initialization: Provides seed training data for internal NLP classification or annotation pipelines.

For example, a minimal concurrent fetcher built on aiohttp (an asynchronous alternative to requests):

import asyncio
import time
from datetime import datetime, timezone

import aiohttp

URLS = [
    "https://example.com/page1",
    "https://example.com/page2",
]

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)"
}

def ts():
    # timezone-aware timestamp (datetime.utcnow() is deprecated)
    return datetime.now(timezone.utc).isoformat()

async def fetch(session, url, timeout=20):
    t0 = time.time()
    try:
        async with session.get(
            url, headers=HEADERS, timeout=aiohttp.ClientTimeout(total=timeout)
        ) as r:
            text = await r.text()
            ms = int((time.time() - t0) * 1000)
            print(f"{ts()} INFO fetch url={url} status={r.status} latency_ms={ms} bytes={len(text)}")
            return r.status, text
    except Exception as e:
        ms = int((time.time() - t0) * 1000)
        print(f"{ts()} ERROR fetch url={url} latency_ms={ms} err={type(e).__name__} msg={e}")
        return None, None

async def main():
    # cap concurrent connections; keep TLS certificate verification enabled
    connector = aiohttp.TCPConnector(limit=20)
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*(fetch(session, u) for u in URLS))

if __name__ == "__main__":
    asyncio.run(main())

Crawlers often serve as the “data ingestion layer”: web → data cleansing → analysis/dashboard → alerting/automation.

Core Technology Stack for Python Web Scraping

A usable crawler typically consists of several verified components:

HTTP with Python Requests:

requests is the mainstream library for fetching HTML and can:

Initiate a GET request to the public page.

Submit a POST request to the login form

Manage request headers (especially the User-Agent) to make requests look more browser-like

Use Session() to maintain cookies and session state
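Putting these capabilities together, a minimal sketch with requests (the login-form field names and the bot URL in the User-Agent are assumptions, not a real site's API):

```python
import requests

# Identifying User-Agent; the contact URL is a placeholder.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)",
    "Accept-Language": "en-US,en;q=0.9",
}

def make_session() -> requests.Session:
    # A Session reuses TCP connections and persists cookies between requests.
    s = requests.Session()
    s.headers.update(HEADERS)
    return s

def get_page(session: requests.Session, url: str) -> str:
    r = session.get(url, timeout=10)
    r.raise_for_status()  # surface 4xx/5xx instead of parsing an error page
    return r.text

def login(session: requests.Session, url: str, user: str, pw: str):
    # Form field names are assumptions; inspect the real login form first.
    return session.post(url, data={"username": user, "password": pw}, timeout=10)
```

Because the session carries both headers and cookies, a login followed by `get_page` calls behaves like one continuous browser visit.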

Parse HTML using BeautifulSoup + lxml

BeautifulSoup (bs4) parses HTML into a searchable tree structure, letting you locate elements by:

Tag name (div, h4, img, a)

Class and Attributes

Hierarchical structure (e.g., product card container → title → price)

Using lxml as the parser improves performance and robustness against “dirty HTML.”
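A small sketch of this parsing flow, using an inline HTML snippet in place of a real page (the built-in "html.parser" is used so the example runs anywhere; pass "lxml" instead for speed if it is installed):

```python
from bs4 import BeautifulSoup

# An inline snippet standing in for a real listing page.
html = """
<div class="product-card">
  <h4><a href="/products/sku-001">Widget</a></h4>
  <span class="price">$19.99</span>
  <img data-src="/images/widget.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # or "lxml" if installed
card = soup.find("div", class_="product-card")

name = card.h4.get_text(strip=True)
price = card.find("span", class_="price").get_text(strip=True)
link = card.a["href"]
# Fall back to data-src for lazy-loaded images.
image = card.img.get("src") or card.img.get("data-src")
```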

Export Excel files using openpyxl or xlsxwriter in a real-world pipeline.

Data extraction is only half the job; a professional pipeline must ultimately produce clean results:

Structured field columns (name, price, description, image link, product detail page link)

Consistent Coding and Cleansing Strategies

Predictable file naming and directory structure for easy downstream use.

Excel export is particularly common because it’s suitable for both engineers and non-technical stakeholders.

How to Check HTML for Web Scrapers

Before writing code, you need to “understand” the page just like a browser does.

Use developer tools to find a stable HTML anchor.

Open Chrome DevTools → Inspect Element, then locate:

Product card: typically a repeated div container on the page

Title/name: commonly a heading tag (h4, h5) or a link tag (a)

Price: a span (or heading tag) with a price-related class

Image link: an img tag, usually in src (or in a lazy-loading attribute such as data-src)

Detail page link: an anchor of the form a href="/product/123"

Key best practice: Avoid using selectors that are prone to change (such as paths with excessively deep nesting levels). Instead, prioritize stable class names or recurring structural features.
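A quick illustration of the difference on a toy snippet: the deep selector breaks the moment the nesting changes, while the class-based one survives:

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main"><div><div class="product-card">'
    '<h4>Widget</h4></div></div></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Fragile: depends on the exact nesting depth.
fragile = soup.select_one("div#main > div > div > h4")

# Robust: anchors on a stable, recurring class name.
robust = soup.select_one(".product-card h4")
```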

Python Web Crawler Workflow

Most crawlers go through three stages of evolution:

Stable single-page capture

Send request

Parse HTML

Find the card list

Extract a small number of fields

Print the results for verification

This stage emphasizes correctness and reproducibility, without pursuing speed.

Extend pagination to a known number of pages

If a site has 7 pages, a common pattern is to cyclically modify URL parameters:

page=1 → page=2 → … → page=7

Within each pagination iteration, loop through all product cards on the current page and extract the fields.
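That pattern can be sketched as a small URL generator (the ?page= parameter name is an assumption; check the real site's URLs in the browser first):

```python
def page_urls(base: str, total_pages: int):
    """Yield listing-page URLs for a known page count, e.g. ?page=1 .. ?page=7."""
    for page in range(1, total_pages + 1):
        yield f"{base}?page={page}"

# In the crawl loop you would fetch each URL in turn and, inside that loop,
# iterate over every product card on the page to extract its fields.
```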

Add a secondary request to the detail page.

List pages typically provide only partial information. A more robust pipeline is:

Scrape the listing page to collect product detail page URLs.

Request each detail page

Parse the richer description/specification module

Merge into the output line

At this point, the simple script has been upgraded to a genuine data ingestion task.
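A sketch of that merge step, with the fetch and parse functions injected so the pipeline can be exercised without network access (field names are illustrative):

```python
def scrape_with_details(listing_rows, fetch, parse_detail):
    """Merge listing-page fields with fields parsed from each detail page.

    fetch(url) -> html and parse_detail(html) -> dict are injected, which
    keeps the pipeline testable and lets you swap the HTTP layer later.
    """
    for row in listing_rows:
        detail_html = fetch(row["ProductURL"])
        merged = dict(row)                        # keep the listing fields
        merged.update(parse_detail(detail_html))  # add richer detail fields
        yield merged
```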

Handle relative URLs

Many websites return links that look like this:

/images/item123.jpg

/products/sku-001

These are relative URLs and must be combined with the site’s root domain to form absolute URLs:

https://example.com + /products/sku-001 → https://example.com/products/sku-001

Professional crawlers treat URL normalization as a crucial step, because downstream tasks—such as image downloading and detail page scraping—depend on it.
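The standard tool for this in Python is urllib.parse.urljoin, which handles both site-root-relative and page-relative links:

```python
from urllib.parse import urljoin

base = "https://example.com/products/"

# A site-root-relative link ("/...") replaces the entire path.
img_url = urljoin(base, "/images/item123.jpg")

# A page-relative link resolves against the current directory.
prod_url = urljoin(base, "sku-001")
```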

Using generators to improve the efficiency of Python web crawlers

More memory-efficient, cleaner pipeline

When crawling a large number of pages, the “naive approach” often involves storing everything in a list:

A list of pagination URLs

A list of product URLs

A list of extracted result rows

It can run, but it’ll soon hit a bottleneck.

Why are generators important in web crawlers?

A generator produces only one URL or one record at a time:

Lower memory usage

Easier to stream write into Excel

Easier for subsequent integration with queues, databases, or ETL processes.

A clear architecture is:
URL Generator → Page Request → Parsing → Record Generator → Output Writer

This provides a foundation for scaling to tens of thousands of records without having to rewrite the entire codebase.
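That architecture can be sketched as two small generators plus a writer; fetch, parse, and write are injected placeholders, not a library API:

```python
def url_gen(base, pages):
    # Stage 1: yield one URL at a time, nothing accumulated in memory.
    for page in range(1, pages + 1):
        yield f"{base}?page={page}"

def record_gen(urls, fetch, parse):
    # Stages 2-4: fetch each page and stream out parsed records.
    for url in urls:
        yield from parse(fetch(url))

def run(base, pages, fetch, parse, write):
    # Stage 5: the writer consumes records one by one (e.g. appending rows).
    for record in record_gen(url_gen(base, pages), fetch, parse):
        write(record)
```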

Speed Limits and Anti-Ban Measures: How to Avoid Getting Your Crawlers Banned

Web scraping is not just a technical exercise—it also means being a “rule-abiding internet citizen” and doing your best to avoid triggering automated detection systems.

Add a delay between requests

Adding a small sleep interval between requests reduces server load and mitigates the following risks:

IP Blocking

CAPTCHA verification

Response throttling
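A small sketch of such a delay helper, adding random jitter so requests don't fire at a fixed, detectable rhythm (the base/jitter values are illustrative):

```python
import random
import time

def polite_sleep(base: float = 1.0, jitter: float = 0.5) -> float:
    """Sleep for base seconds plus random jitter; returns the actual delay."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```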

Use real request headers (especially the User-Agent).

Servers often mark “unknown clients.” The standard practice is to set:

User-Agent

Sometimes add Accept-Language

Sometimes add Referer (only when it is reasonable and lawful for your workflow)

Pay attention to status codes and retry safely.

Professional crawlers should log and handle:

429 Too Many Requests (slow down)

403 Forbidden (stop; access denied)

5xx (server instability)

Even if the code is lightweight, the design should anticipate failures and be able to recover from them.
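One way to sketch that handling; the session object and retry thresholds here are illustrative, not a library API (any object exposing .get, such as a requests.Session, works):

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_retry(session, url, max_tries=3, backoff=2.0):
    """GET with exponential backoff; stop immediately on 403 rather than
    hammering a server that has rejected us."""
    for attempt in range(1, max_tries + 1):
        resp = session.get(url, timeout=10)
        if resp.status_code == 403:
            raise PermissionError(f"403 from {url}: stop and review access rules")
        if resp.status_code in RETRYABLE and attempt < max_tries:
            time.sleep(backoff ** attempt)  # e.g. 2s, then 4s
            continue
        resp.raise_for_status()  # raise on anything else that isn't 2xx
        return resp.text
```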

Excel Export for Python Web Scrapers

The reason Excel is popular is that it:

Easy to share

Auditable

Highly compatible with analytics workflows and BI imports.

Best Practices for Excel Export in a Crawling Pipeline

Consistent schema definition: ProductName | Price | Description | ImageURL | ProductURL

Normalized line breaks and redundant spaces

Set column widths to enhance readability.

Incremental write operations (especially when fetching a large number of pages)

This is precisely where openpyxl and xlsxwriter shine: you can generate tables that are ready to use right away, rather than a messy pile of data.
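A minimal openpyxl sketch covering these practices — a fixed schema, incremental writes from an iterable (or generator), and readable column widths (names are illustrative):

```python
from openpyxl import Workbook
from openpyxl.utils import get_column_letter

COLUMNS = ["ProductName", "Price", "Description", "ImageURL", "ProductURL"]

def export_rows(rows, path="products.xlsx"):
    """Write rows (an iterable of dicts, possibly a generator) to Excel."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Products"
    ws.append(COLUMNS)
    for row in rows:  # incremental: one row at a time, generator-friendly
        ws.append([row.get(col, "") for col in COLUMNS])
    for i in range(1, len(COLUMNS) + 1):
        ws.column_dimensions[get_column_letter(i)].width = 24  # readability
    wb.save(path)
```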

Law and Ethics: What Should Python Web Scrapers Pay Attention To?

Professional technical readers take this as a given: being able to scrape does not mean being allowed to scrape.

Key operational guidelines:

Respect robots.txt where applicable (it is not a law, but it is a strong and widely respected convention)

Comply with the website’s Terms of Service and access restrictions.

Avoid collecting personal data or sensitive information.

Apply strict rate limits, and identify your crawler when required

When an official API is available, prioritize using the API.

In a corporate setting, it’s best to involve legal/compliance as early as possible—especially for large-scale, long-term, and periodic data collection projects.

Summary

A robust Python web crawler is not merely about HTML parsing—it’s a complete pipeline. By treating the crawler as an engineering system rather than a one-off script, you can build a reusable data access layer that delivers ongoing value for analytics, product features, and operational monitoring.

 

Frequently asked questions

What’s the difference between BeautifulSoup and Scrapy? How should I choose between them?

 

BeautifulSoup is suitable for lightweight parsing and fast pipelines; Scrapy is a full-fledged framework with built-in scheduling, retry mechanisms, and pipelines. Choose Scrapy when you need large-scale web scraping and long-term maintenance.

How can I write the scraped data into a database instead of Excel?
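One common route is SQLite via Python's built-in sqlite3 module — same pipeline, different writer. A minimal sketch (the table and column names are illustrative):

```python
import sqlite3

def save_rows(rows, db_path="products.db"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               name TEXT, price TEXT, description TEXT,
               image_url TEXT, product_url TEXT PRIMARY KEY)"""
    )
    # Upsert keyed on product URL so re-runs refresh rows instead of duplicating.
    conn.executemany(
        "INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?, ?)",
        (
            (r.get("ProductName"), r.get("Price"), r.get("Description"),
             r.get("ImageURL"), r.get("ProductURL"))
            for r in rows
        ),
    )
    conn.commit()
    conn.close()
```

For larger deployments the same writer shape works with PostgreSQL or MySQL via their respective drivers, or with an ORM such as SQLAlchemy.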

 

 

Can Requests capture pages rendered by JavaScript?

 

Requests can only retrieve static HTML returned by the server. If the content is rendered on the frontend, you may need to call the actual JSON API (which can usually be found in the Network tab of DevTools) or use browser automation tools such as Playwright.

About the author

Xyla is a technical writer who turns complex networking and data topics into practical, easy-to-follow guides, treating content like troubleshooting: start from real scenarios, validate with data, and explain the “why” behind each solution. Outside of work, she’s a Level 2 badminton referee and marathon trainee—finding her best ideas between the court and the finish line.

The thordata Blog offers all its content in its original form and solely for informational intent. We do not offer any guarantees regarding the information found on the Thordata blog or any external sites that it may direct you to. It is essential that you seek legal counsel and thoroughly examine the specific terms of service of any website before engaging in any scraping endeavors or obtain a scraping permit if required.