Web Scraping eCommerce Websites with Python: Step-by-Step Guide & Enterprise Alternatives


Anna Stankevičiūtė
Last updated on 2026-2-26 · 10 min read

eCommerce web scraping has become a critical tool for retail brands, market researchers, and pricing analysts gathering competitive pricing data, consumer reviews, and inventory insights. Python is the most popular language for the task thanks to its mature scraping libraries and readable syntax, but custom scripts often struggle with anti-scraping bans, dynamic content, and compliance risks at scale. This guide provides verified code examples, anti-scraping best practices, and an enterprise-grade alternative to custom scripts.

Pre-Requisites for Web Scraping eCommerce with Python

1. Environment Setup

Developers need Python 3.8+ and a handful of core libraries suited to eCommerce scraping. Install the dependencies via pip, and check each library's documentation for version compatibility:

pip install requests beautifulsoup4 selenium webdriver-manager pandas

2. Compliance First

Before scraping any eCommerce website, teams must:

● Review the target site’s robots.txt file to identify allowed/disallowed paths.

● Adhere to GDPR, CCPA, and regional data privacy laws to avoid legal penalties.

● Avoid scraping sensitive personal data (e.g., user contact information) or copyrighted content.
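The robots.txt review above can be automated with the standard library's urllib.robotparser. This sketch assumes the robots.txt rules have already been fetched as text; the rules and paths shown are illustrative, not any real site's policy:

```python
from urllib.robotparser import RobotFileParser

def is_path_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Check a path against already-fetched robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

# Hypothetical rules for illustration
ROBOTS = """User-agent: *
Disallow: /checkout/
Allow: /
"""

print(is_path_allowed(ROBOTS, "my-scraper", "/dp/B0C1XKZJZ9"))  # True
print(is_path_allowed(ROBOTS, "my-scraper", "/checkout/cart"))  # False
```

Note that robots.txt is advisory: a site's terms of service and applicable privacy laws still apply even where a path is not disallowed.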

Step-by-Step Web Scraping for eCommerce Websites

1. Scraping Static eCommerce Pages

Pages whose key data is present in the initial HTML response (e.g., Amazon product detail pages) can be scraped with requests and BeautifulSoup4, with no JavaScript rendering required. This example extracts the product title and price for a given ASIN, with basic anti-scraping safeguards:

import requests
from bs4 import BeautifulSoup
import time
import random

# Mimic real browser headers to avoid detection
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

def scrape_amazon_product(asin):
    url = f"https://www.amazon.com/dp/{asin}"
    try:
        # Add a random delay (1-3s) to mimic human behavior
        time.sleep(random.uniform(1, 3))
        response = requests.get(url, headers=HEADERS, timeout=15)
        response.raise_for_status()  # Raise for HTTP 4xx/5xx

        soup = BeautifulSoup(response.text, "html.parser")
        # Look up each element once, with fallbacks for missing fields
        title_tag = soup.find("span", id="productTitle")
        price_tag = soup.find("span", class_="a-price-whole")

        return {
            "asin": asin,
            "title": title_tag.get_text(strip=True) if title_tag else "N/A",
            "price": price_tag.get_text(strip=True) if price_tag else "N/A",
            "url": url
        }
    except Exception as e:
        print(f"Failed to scrape ASIN {asin}: {e}")
        return None

# Example usage
product_data = scrape_amazon_product("B0C1XKZJZ9")
print(product_data)

2. Scraping Dynamic eCommerce Pages

Dynamic pages (e.g., JD.com’s user reviews) load content via AJAX, requiring a headless browser like Selenium to render content. This code extracts top 5 user reviews without opening a visible browser window:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time

def scrape_jd_reviews(product_id):
    # Configure headless Chrome to mimic real user behavior
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36")
    options.add_argument("--disable-blink-features=AutomationControlled") # Avoid bot detection
    
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    try:
        driver.get(f"https://item.jd.com/{product_id}.html#comment")
        time.sleep(3) # Wait for dynamic reviews to load
        
        # Extract top 5 verified reviews
        reviews = []
        review_elements = driver.find_elements(By.CSS_SELECTOR, ".comment-item")[:5]
        for elem in review_elements:
            review_text = elem.find_element(By.CSS_SELECTOR, ".comment-con").text.strip()
            reviews.append(review_text)
        
        return {"product_id": product_id, "reviews": reviews}
    except Exception as e:
        print(f"Failed to scrape JD reviews: {str(e)}")
        return None
    finally:
        driver.quit()

# Example usage
reviews = scrape_jd_reviews("100060123456")
print(reviews)

Common Anti-Scraping Challenges in eCommerce

1. IP Banning & Rate Limiting

● Symptom: The target site returns 403 Forbidden or demands a CAPTCHA.

● Root Cause: Frequent requests from a single IP trigger anti-scraping rules.

● Solution: Rotate IPs via proxy pools and implement exponential backoff for retries.
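A minimal round-robin rotation over a proxy pool can be sketched with itertools.cycle. The proxy URLs below are placeholders for a real pool, and the commented line shows where the rotation would plug into a requests call:

```python
import itertools

# Hypothetical proxy endpoints -- substitute your own pool
PROXY_POOL = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies() -> dict:
    """Return a requests-style proxies dict, advancing the rotation."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}

# Usage with requests (not executed here):
# response = requests.get(url, proxies=next_proxies(), timeout=15)

print(next_proxies()["http"])  # http://proxy1.example.com:8000
```

In production you would also evict proxies that repeatedly fail and weight the rotation by observed success rate, rather than cycling blindly.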

2. Dynamic Content Rendering

● Symptom: requests returns empty HTML for the required content.

● Root Cause: Content is loaded via React/Vue or AJAX after the initial page load.

● Solution: Use Selenium or Playwright for headless browser rendering.

3. Behavior Analysis & CAPTCHA

● Symptom: The site detects non-human behavior (e.g., uniform delays, missing cookies).

● Root Cause: Advanced anti-scraping tools (e.g., Cloudflare) analyze user behavior.

● Solution: Mimic real user behavior (random delays, cookie persistence) or use an enterprise-grade API.
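Cookie persistence and non-uniform delays can both be sketched with a requests.Session, which carries cookies across calls like a browser tab does. The User-Agent matches the earlier examples; this alone will not defeat advanced fingerprinting, but it avoids the most obvious tells:

```python
import random
import time

import requests

def make_browser_session() -> requests.Session:
    """Build a Session that persists cookies across requests."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/119.0.0.0 Safari/537.36"),
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session

def polite_get(session: requests.Session, url: str,
               min_delay: float = 1.0, max_delay: float = 3.0):
    """Sleep a random interval before each request so timing is not uniform."""
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, timeout=15)

session = make_browser_session()
# Cookies set by the first response are sent automatically on later calls:
# page1 = polite_get(session, "https://example.com/products?page=1")
# page2 = polite_get(session, "https://example.com/products?page=2")
```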

Why Choose Thordata Web Scraper API Over Custom Python Scripts?

For enterprise teams scraping eCommerce websites at scale, custom Python scripts often fail due to maintenance overhead, anti-scraping bans, and compliance risks. Here’s why Thordata Web Scraper API is a superior alternative:

1. Compliant & Anti-Scraping Ready: Thordata maintains a global pool of compliant residential IPs aligned with GDPR, CCPA, and China’s Personal Information Protection Law, minimizing legal risk. Its intelligent anti-scraping engine automatically adapts to rate limits, CAPTCHAs, and dynamic content, delivering a 99.2% success rate on major platforms such as Amazon, JD.com, and Taobao.

2. No Maintenance Overhead: Custom scripts can consume up to 40% of a developer’s time in updates whenever eCommerce sites change their HTML structure or anti-scraping rules. Thordata’s AI-powered content extraction engine automatically detects and adapts to page changes, freeing teams to focus on data analysis instead of script upkeep.

3. Cost-Effective Scalability: Building and maintaining a custom proxy pool costs $2,000-$5,000 per month in IP leases and server fees. Thordata’s pay-as-you-go pricing (based on successful requests) reduces costs by up to 30% for enterprise-scale scraping.

4. Enterprise-Grade SLA & Support: Custom scripts offer no uptime guarantees, which can disrupt critical operations like real-time pricing optimization. Thordata provides a 99.9% uptime SLA and 24/7 technical support, ensuring reliable data access.

Enterprise-Grade Web Scraping Best Practices

1. Implement Rate Limiting & Retry Logic

● Use exponential backoff (e.g., 1s, 2s, 4s delays) for failed requests instead of fixed delays

● Limit concurrent requests to avoid overwhelming target sites
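The exponential-backoff schedule above (1s, 2s, 4s...) can be sketched as a small retry wrapper; the flaky fetch function below is a stand-in that simulates two transient failures before succeeding:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0):
    """Call fetch(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Simulated flaky endpoint: fails twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated ban")
    return "ok"

print(fetch_with_backoff(flaky, base_delay=0.01))  # ok
```

For the concurrency limit, a semaphore or a bounded thread pool caps how many requests are in flight at once; the same wrapper composes with either.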

2. Clean & Validate Scraped Data

● Use Pandas to remove duplicates, standardize data formats, and filter invalid entries

● Implement data validation checks (e.g., ensure price fields are numeric)
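The cleaning steps above can be sketched with Pandas: drop duplicate ASINs, coerce the price field to numeric, and filter rows where extraction failed. The sample rows are hypothetical:

```python
import pandas as pd

# Hypothetical scraped rows for illustration
raw = pd.DataFrame([
    {"asin": "B0C1XKZJZ9", "title": "Wireless Mouse", "price": "24.99"},
    {"asin": "B0C1XKZJZ9", "title": "Wireless Mouse", "price": "24.99"},  # duplicate
    {"asin": "B0D2YLAKA0", "title": "USB-C Hub", "price": "N/A"},         # failed extraction
])

clean = (
    raw.drop_duplicates(subset="asin")
       # errors="coerce" turns non-numeric prices like "N/A" into NaN
       .assign(price=lambda df: pd.to_numeric(df["price"], errors="coerce"))
       .dropna(subset=["price"])
)
print(clean)  # one valid row remains
```

Coercing rather than crashing on bad values keeps the pipeline running while still making failed extractions easy to count and investigate.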

3. Monitor Scraping Performance

● Track success rates, response times, and ban rates via tools like Prometheus

● Set up alerts for abnormal activity (e.g., sudden drop in success rate)
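As an in-process stand-in for a full Prometheus setup, a small counter class can track the success rate that alerting would be built on:

```python
from collections import Counter

class ScrapeMetrics:
    """Lightweight in-process counters; export to Prometheus etc. in production."""

    def __init__(self):
        self.counts = Counter()

    def record(self, ok: bool):
        self.counts["total"] += 1
        self.counts["success" if ok else "failure"] += 1

    @property
    def success_rate(self) -> float:
        total = self.counts["total"]
        return self.counts["success"] / total if total else 0.0

metrics = ScrapeMetrics()
for ok in [True, True, True, False]:  # simulated request outcomes
    metrics.record(ok)
print(f"{metrics.success_rate:.0%}")  # 75%
```

An alert rule then becomes a simple threshold check, e.g. page the team when the rate over the last N requests drops below an agreed floor.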

 

Frequently asked questions

Is web scraping eCommerce websites legal?

 

Web scraping is generally legal when you comply with the target site’s terms of service and regional data privacy laws, but the rules vary by jurisdiction. Avoid scraping sensitive personal data or copyrighted content, and review robots.txt before starting.

How can I avoid getting banned when scraping eCommerce sites?

 

Mimic real user behavior (random delays, realistic User-Agents), rotate IPs, and avoid scraping during peak hours. For enterprise scale, use Thordata Web Scraper API’s anti-scraping safeguards.

What’s the best Python library for eCommerce scraping?

 

Use requests + BeautifulSoup4 for static pages, Selenium for dynamic pages, and Thordata Web Scraper API for enterprise-scale needs.


About the author

Anna is a content specialist who thrives on bringing ideas to life through engaging and impactful storytelling. Passionate about digital trends, she specializes in transforming complex concepts into content that resonates with diverse audiences. Beyond her work, Anna loves exploring new creative passions and keeping pace with the evolving digital landscape.

The thordata Blog offers all its content in its original form and solely for informational intent. We do not offer any guarantees regarding the information found on the thordata Blog or any external sites that it may direct you to. It is essential that you seek legal counsel and thoroughly examine the specific terms of service of any website before engaging in any scraping endeavors, or obtain a scraping permit if required.