BeautifulSoup Tutorial 2026: Parse HTML Data With Python
Content by Kael Odin
HTML pages are everywhere in 2026: product catalogs, job boards, pricing tables, documentation, news sites, and more. If you work with Python, BeautifulSoup is still one of the fastest ways to turn that raw HTML into structured data you can search, analyze, and feed into downstream systems.
This tutorial walks through a complete, copy-paste ready workflow: you’ll start with a small sample HTML file, learn how to parse it with BeautifulSoup, then move on to real HTTP responses, CSS selectors, and exporting data to CSV. Along the way, you’ll see how the same patterns scale to larger web scraping projects powered by managed infrastructure, so you don’t have to maintain brittle scrapers yourself—and how to combine this parser with solid Python basics like those covered in our syntax error and debugging guides.
In this tutorial you’ll install beautifulsoup4 and requests in a clean Python environment, then learn the core lookup methods: find, find_all, and select. We’ll assume you already have Python 3.10+ installed. If you’re on Windows, make sure you checked the “Add Python to PATH” box during installation so commands like python and pip work in your terminal.
```bash
python -m venv .venv

# Windows PowerShell
.\.venv\Scripts\Activate.ps1

# macOS / Linux
source .venv/bin/activate
```
We’ll use beautifulsoup4 for parsing and requests for making HTTP calls. Optionally, you can install lxml for faster parsing:
```bash
pip install beautifulsoup4 requests lxml
```
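To confirm the installation worked, you can run a small sanity check. This sketch probes which parser backends BeautifulSoup can actually use in your environment (lxml is optional, so it may or may not appear):

```python
# Quick sanity check: confirm that bs4 imports cleanly and report which
# parser backends are available in this environment.
from bs4 import BeautifulSoup

available = []
for parser in ("html.parser", "lxml"):
    try:
        BeautifulSoup("<p>ok</p>", parser)
        available.append(parser)
    except Exception:
        pass  # bs4 raises FeatureNotFound when a backend is missing

print("Available parsers:", available)
```

`html.parser` ships with the standard library, so it should always be listed; `lxml` appears only if the optional install succeeded.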
To understand the basics of BeautifulSoup, we’ll start with a simple, static HTML snippet representing a product list. Save the following content as sample_products.html in your project directory:
```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Sample Product List</title>
  </head>
  <body>
    <h1>Top Selling Products</h1>
    <ul id="products">
      <li class="product" data-sku="A100">
        <span class="name">Data Center Proxy Plan</span>
        <span class="price">49.00</span>
        <span class="currency">USD</span>
      </li>
      <li class="product" data-sku="A200">
        <span class="name">Residential Proxy Plan</span>
        <span class="price">99.00</span>
        <span class="currency">USD</span>
      </li>
      <li class="product featured" data-sku="A300">
        <span class="name">Web Scraper API Bundle</span>
        <span class="price">199.00</span>
        <span class="currency">USD</span>
      </li>
    </ul>
  </body>
</html>
```
This HTML is much simpler than a real e-commerce page, but it’s perfect for learning the core BeautifulSoup patterns.
Create a Python file named beautifulsoup_intro.py and paste the following code. This loads your local HTML file and prints out the top-level tags:
```python
from bs4 import BeautifulSoup

HTML_FILE = "sample_products.html"

with open(HTML_FILE, "r", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "html.parser")

print("Document title:", soup.title.string)
print("Main heading:", soup.h1.string)

print("\nAll direct children of <body>:")
for child in soup.body.children:
    if getattr(child, "name", None):
        print(" -", child.name)
```
Run it:
```bash
python beautifulsoup_intro.py
```
You should see output similar to:
```text
Document title: Sample Product List
Main heading: Top Selling Products

All direct children of <body>:
 - h1
 - ul
```
BeautifulSoup provides several powerful methods for locating elements:
| Method | Use Case | Example |
|---|---|---|
| `find()` | First match | `soup.find("ul", id="products")` |
| `find_all()` | All matches | `soup.find_all("li", class_="product")` |
| `select()` | CSS selectors | `soup.select("ul#products li.product")` |
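Here is a short, self-contained sketch of all three methods side by side. It inlines a trimmed copy of the sample markup so you can run it without the HTML file:

```python
from bs4 import BeautifulSoup

# Inline copy of the sample markup so this snippet runs standalone.
html = """
<ul id="products">
  <li class="product" data-sku="A100"><span class="name">Data Center Proxy Plan</span></li>
  <li class="product featured" data-sku="A300"><span class="name">Web Scraper API Bundle</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li", class_="product")            # first match only
print(first["data-sku"])                             # A100

all_items = soup.find_all("li", class_="product")    # every match, as a list
print(len(all_items))                                # 2

featured = soup.select("li.product.featured .name")  # full CSS selector syntax
print(featured[0].get_text(strip=True))              # Web Scraper API Bundle
```

Note that `find` returns a single element (or `None`), while `find_all` and `select` always return lists, which may be empty.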
Let’s parse all products into a list of dictionaries. Create parse_products.py:
```python
from bs4 import BeautifulSoup
from pathlib import Path

HTML_FILE = "sample_products.html"


def parse_products(html: str):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for li in soup.select("ul#products li.product"):
        name_el = li.select_one(".name")
        price_el = li.select_one(".price")
        currency_el = li.select_one(".currency")
        items.append(
            {
                "sku": li.get("data-sku"),
                "name": name_el.get_text(strip=True) if name_el else "",
                "price": float(price_el.get_text(strip=True)) if price_el else None,
                "currency": currency_el.get_text(strip=True) if currency_el else "",
                "featured": "featured" in li.get("class", []),
            }
        )
    return items


def main() -> None:
    html = Path(HTML_FILE).read_text(encoding="utf-8")
    products = parse_products(html)
    print(f"Found {len(products)} products:")
    for p in products:
        print(
            f" - {p['sku']}: {p['name']} ({p['price']} {p['currency']})"
            + (" [FEATURED]" if p["featured"] else "")
        )


if __name__ == "__main__":
    main()
```
Run it and you should get a neatly formatted list of products extracted from your HTML file.
So far we’ve worked with a local file. In real web scraping projects, you’ll usually fetch HTML over HTTP (using requests or a managed scraper) and then pass the response text to BeautifulSoup.
Here’s a minimal example that fetches https://httpbin.org/html and prints the main heading text:
```python
import requests
from bs4 import BeautifulSoup

URL = "https://httpbin.org/html"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.find("h1")
print("Page heading:", title.get_text(strip=True) if title else "(not found)")
```
Once you’ve parsed HTML into Python objects, you’ll often want to export that data into CSV for analysis in tools like Excel, Google Sheets, or a data warehouse.
Let’s extend our product parser to write a CSV file using pandas:
```python
from bs4 import BeautifulSoup
from pathlib import Path

import pandas as pd

HTML_FILE = "sample_products.html"


def parse_products(html: str):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for li in soup.select("ul#products li.product"):
        name_el = li.select_one(".name")
        price_el = li.select_one(".price")
        currency_el = li.select_one(".currency")
        items.append(
            {
                "sku": li.get("data-sku"),
                "name": name_el.get_text(strip=True) if name_el else "",
                "price": float(price_el.get_text(strip=True)) if price_el else None,
                "currency": currency_el.get_text(strip=True) if currency_el else "",
            }
        )
    return items


def main() -> None:
    html = Path(HTML_FILE).read_text(encoding="utf-8")
    products = parse_products(html)
    df = pd.DataFrame(products)
    df.to_csv("products.csv", index=False, encoding="utf-8")
    print("Exported products.csv with", len(df), "rows")


if __name__ == "__main__":
    main()
```
After running this script, you should see a new products.csv file in your project directory containing the parsed data.
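If you don't want the pandas dependency, the standard library's csv module handles this case just as well. A minimal sketch, using hardcoded rows shaped like the output of `parse_products()` above:

```python
import csv

# Illustrative rows shaped like the dicts returned by parse_products().
products = [
    {"sku": "A100", "name": "Data Center Proxy Plan", "price": 49.0, "currency": "USD"},
    {"sku": "A200", "name": "Residential Proxy Plan", "price": 99.0, "currency": "USD"},
]

# newline="" prevents blank lines between rows on Windows.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["sku", "name", "price", "currency"])
    writer.writeheader()
    writer.writerows(products)

print("Wrote", len(products), "rows")
```

`csv.DictWriter` maps each dict to a row by the `fieldnames` you declare, so column order stays stable regardless of dict ordering.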
BeautifulSoup’s select() and select_one() methods support a useful subset of CSS selectors. Here are a few patterns you’ll use frequently:
| Pattern | Selector | Description |
|---|---|---|
| By ID | `soup.select_one("#products")` | Element with `id="products"` |
| By class | `soup.select(".product.featured")` | All elements with both classes "product" and "featured" |
| Tag + class | `soup.select("li.product .price")` | All elements with class "price" inside `<li class="product">` |
| Attribute | `soup.select('li[data-sku="A200"]')` | The product with SKU A200 |
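The attribute and descendant patterns from the table can be combined in a few lines. This snippet inlines a trimmed copy of the sample markup so it runs on its own:

```python
from bs4 import BeautifulSoup

# Minimal markup mirroring the sample file, inlined for a standalone run.
html = """
<ul id="products">
  <li class="product" data-sku="A100"><span class="price">49.00</span></li>
  <li class="product" data-sku="A200"><span class="price">99.00</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: exact match on the data-sku attribute.
li = soup.select_one('li[data-sku="A200"]')
print(li.select_one(".price").get_text(strip=True))  # 99.00

# Descendant selector: every price inside a product row.
prices = [p.get_text(strip=True) for p in soup.select("li.product .price")]
print(prices)  # ['49.00', '99.00']
```

`select_one` is handy when a selector should match exactly one element; it returns the first match or `None`.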
BeautifulSoup is perfect for parsing HTML you already have, but it doesn’t execute JavaScript. If your target pages are heavily dynamic (client-side rendering, infinite scroll, complex anti-bot protections), you’ll need an additional layer to render or fetch HTML reliably.
Many teams choose a hybrid approach: use a managed scraping platform to handle JavaScript rendering, IP rotation, and anti-bot logic, then feed the resulting HTML into BeautifulSoup. This separation lets your Python code stay small and focused on parsing and business logic, while the infrastructure concerns are handled elsewhere.
For example, Thordata provides scraping APIs and tools designed to return clean, structured results from complex targets. You can manage your API tokens, monitor usage, and configure scraping jobs in the Thordata Dashboard, while keeping your parsing logic in Python with BeautifulSoup. To see how Thordata’s Python SDK works in practice, check out the open source repository here: Thordata Python SDK.
A few pitfalls to keep in mind: calling .get_text() on None raises an exception, so always guard with if el checks or small helper functions. If parsing feels slow, switch "html.parser" to "lxml" (after installing lxml). And if content is rendered by JavaScript, responses from requests may not include it; use a headless browser or a managed scraper to get the final HTML.

| Goal | BeautifulSoup Pattern |
|---|---|
| Create soup | `soup = BeautifulSoup(html, "html.parser")` |
| Find first tag | `soup.find("h1")` |
| Find all tags | `soup.find_all("li")` |
| Find by id | `soup.find("ul", id="products")` |
| Find by class | `soup.find_all("li", class_="product")` |
| CSS selector | `soup.select("ul#products li.product .price")` |
| Get text | `element.get_text(strip=True)` |
| Get attribute | `element["data-sku"]` or `element.get("href")` |
Frequently asked questions
Is BeautifulSoup easy to learn?
Yes. BeautifulSoup has a relatively low learning curve. If you understand basic Python and HTML, you can start extracting data quickly using methods like find, find_all, and select. The official documentation also includes many examples to help you progress.
Is BeautifulSoup enough for production web scraping?
BeautifulSoup is excellent for parsing HTML and XML, but it doesn’t handle JavaScript rendering, IP rotation, or large-scale crawling. For serious production workloads, you typically combine BeautifulSoup with robust HTTP clients, headless browsers, or managed scraping platforms that provide anti-bot handling and infrastructure.
Should I use BeautifulSoup or Scrapy?
Use BeautifulSoup for smaller tasks and focused HTML parsing when you already have the page content. Scrapy is a full web scraping framework that adds built-in crawling, concurrency, and pipeline features. Many teams start with BeautifulSoup and later adopt Scrapy or managed scraping solutions as their projects grow.
Can I use BeautifulSoup with other data tools?
Absolutely. BeautifulSoup works well with libraries like pandas and SQLAlchemy, and with cloud storage or data warehouses. It’s common to parse HTML with BeautifulSoup, turn the results into a pandas DataFrame, and then export to CSV, Parquet, or a database for downstream analysis.
About the author
Kael is a Senior Technical Copywriter at Thordata. He works closely with data engineers to document best practices for web scraping, HTML parsing, and API integrations. His focus is on creating hands-on tutorials that can be copied, run, and adapted to real-world projects.
The thordata Blog offers all its content in its original form and solely for informational purposes. We make no guarantees regarding the information found on the thordata Blog or on any external sites it may link to. Before engaging in any scraping activity, seek legal counsel, thoroughly review the target website’s terms of service, and obtain a scraping permit if required.