AutoScraper Python: Smart Scraping & Proxy Guide

[Image: Python code referencing the AutoScraper library against an AI neural network background]

Kael Odin
Last updated on 2026-01-12 · 11 min read
📌 Key Takeaways
  • Zero-Code Parsing: AutoScraper allows you to scrape data by simply providing examples of what you want, removing the need to inspect HTML or write XPaths.
  • Model Persistence: Learn to save your trained scraping models to files (.save()) so you can reuse them later in production without retraining.
  • Proxy Injection: AutoScraper relies on requests. We show you how to inject Thordata Residential Proxies via request_args to avoid IP bans.

Imagine scraping a website without ever pressing F12 to inspect the source code. Imagine just telling Python: “I want the data that looks like ‘iPhone 15’” and having the script figure out the CSS selectors automatically.

This is the promise of AutoScraper, a “Smart, Automatic, Fast and Lightweight Web Scraper for Python.” It uses a form of fuzzy matching to learn scraping rules from the examples you provide.

While many blogs cover the “Hello World” of AutoScraper, few show how to use it in the real world—where websites ban IPs and structures get complex. In this guide, I’ll take you from the basics to an advanced setup with proxy integration.

1. The “Magic” of Learning by Example

First, install the library. It is lightweight and depends on requests and bs4.

pip install autoscraper

Let’s say we want to scrape book titles from a test site. Instead of hunting for a selector like article.product_pod h3 a, we just pick a title we see on the page, like “A Light in the Attic”, and give it to AutoScraper.

from autoscraper import AutoScraper

url = 'https://books.toscrape.com/'

# We want to scrape titles and prices. We give one example of each.
wanted_list = ["A Light in the Attic", "£51.77"]

scraper = AutoScraper()

# The build() function learns the rules
result = scraper.build(url, wanted_list)

print(result)
# Output: ['A Light in the Attic', 'Tipping the Velvet', ..., '£51.77', '£53.74', ...]
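Note that build() returns every match for the learned rules. The library also exposes get_result_exact(), which returns only the elements sitting in the same position as your original examples, which is handy for page-level fields like a single heading. A quick sketch of the difference:

# get_result_similar(): all elements matching the learned rules (every title, every price)
similar = scraper.get_result_similar(url)

# get_result_exact(): only the elements at the exact positions of your examples
exact = scraper.get_result_exact(url)
print(exact)  # e.g. ['A Light in the Attic', '£51.77']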

2. Grouping Data & Saving Models

The output above is a flat list. In a real project, you want a dictionary where titles are linked to prices. We can ask AutoScraper to group the rules and then save the model for later use.

# Re-run the extraction, grouping results by the learned rule IDs
grouped = scraper.get_result_similar(url, grouped=True)

print(grouped.keys())
# Output: dict_keys(['rule_1ab2', 'rule_8x9y'])
# (Your rule IDs will differ; copy them from this output.)

# Alias the rules we need and discard the rest
scraper.set_rule_aliases({'rule_1ab2': 'Title', 'rule_8x9y': 'Price'})
scraper.keep_rules(['rule_1ab2', 'rule_8x9y'])
scraper.save('books_model')
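
Once the aliases are set, group_by_alias=True returns parallel lists keyed by 'Title' and 'Price'. Here is a minimal sketch of stitching those lists into one record per book; it assumes both rules matched the same number of elements in the same order:

# Fetch results keyed by the aliases we defined above
data = scraper.get_result_similar(url, group_by_alias=True)

# Combine the parallel lists into one record per book
books = [{'title': t, 'price': p} for t, p in zip(data['Title'], data['Price'])]
print(books[0])  # e.g. {'title': 'A Light in the Attic', 'price': '£51.77'}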

3. The Missing Piece: Proxy Integration

This is where most AutoScraper tutorials fail. If you try to run your saved model on Amazon or Google, you will be blocked immediately. AutoScraper uses the requests library internally, which means we can pass arguments to it via request_args.

THORDATA INTEGRATION

By routing traffic through Thordata’s residential gateway, your AutoScraper requests originate from real household IPs, which makes them far harder to flag and block. The gateway handles IP rotation automatically.

from autoscraper import AutoScraper

# Load the trained model
scraper = AutoScraper()
scraper.load('books_model')

# Thordata Proxy Configuration (Username:Password)
proxies = {
    "http": "http://USER:PASS@gate.thordata.com:12345",
    "https": "http://USER:PASS@gate.thordata.com:12345",
}

# Scrape a new page using the proxy
target_url = 'https://books.toscrape.com/catalogue/page-2.html'

results = scraper.get_result_similar(
    target_url, 
    group_by_alias=True,
    request_args={'proxies': proxies, 'timeout': 10}
)

print(results['Title'][0])
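
Because request_args is forwarded directly to requests.get(), anything that function accepts works here too. For example, you can send a realistic User-Agent alongside the proxy (the header values below are illustrative, not required by Thordata):

request_args = {
    'proxies': proxies,
    'timeout': 10,
    # Any other requests.get() keyword is valid here, e.g. headers:
    'headers': {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept-Language': 'en-US,en;q=0.9',
    },
}

results = scraper.get_result_similar(
    target_url,
    group_by_alias=True,
    request_args=request_args,
)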
⚠️ Limitations of AutoScraper: AutoScraper is excellent for static HTML. However, it cannot render JavaScript. If the website requires clicking buttons or infinite scrolling, you should switch to Thordata Web Unlocker, which handles JS rendering for you.

4. When to Use (and Avoid) AutoScraper

As a data engineer, I can tell you that picking the right tool is half the battle. Here is my verdict on when AutoScraper shines and when it falls short:

Use AutoScraper when…
  • You need a quick prototype (POC) in 5 minutes.
  • The site structure is simple and consistent.
  • You want to avoid learning XPath/CSS selectors.

Avoid AutoScraper when…
  • You are scraping a highly dynamic SPA (React/Vue).
  • You need complex pagination logic or login handling.
  • You need high-performance concurrency (Scrapy is better).

Conclusion

AutoScraper is a brilliant tool for “lazy” scraping. It lowers the barrier to entry significantly. However, for enterprise-grade data extraction, it must be paired with robust infrastructure. By adding Thordata Residential Proxies to the mix, you turn this lightweight tool into a capable scraper for moderate workloads.

Frequently asked questions

How does AutoScraper work?

AutoScraper learns parsing rules by comparing your input (e.g., a specific product title) against the page’s HTML structure. It automatically detects the XPaths/CSS selectors needed to find similar items.
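If you want to peek under the hood, the learned rules are stored on the scraper object itself. In current versions of the library this is the stack_list attribute; it is internal, so treat its exact structure as an implementation detail that may change between versions:

# Inspect the rules AutoScraper learned during build()
# (stack_list is an internal attribute; its format may vary by version)
for rule in scraper.stack_list:
    print(rule.get('stack_id'), rule.get('alias'))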

Can I use proxies with AutoScraper?

Yes. AutoScraper is built on top of the requests library, so you can pass a proxy dictionary via the request_args parameter of build(), get_result_similar(), or get_result_exact().
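For example, you can train through the proxy as well, since build() accepts the same request_args (the credentials and gateway address below are placeholders):

from autoscraper import AutoScraper

proxies = {
    "http": "http://USER:PASS@gate.thordata.com:12345",
    "https": "http://USER:PASS@gate.thordata.com:12345",
}

scraper = AutoScraper()
# request_args is forwarded to requests.get(), so the proxy is used at training time too
result = scraper.build(
    'https://books.toscrape.com/',
    wanted_list=["A Light in the Attic"],
    request_args={'proxies': proxies, 'timeout': 10},
)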

Is AutoScraper better than BeautifulSoup?

It is faster for prototyping because you don’t need to inspect HTML code. However, for production-grade scraping, where site structures change slightly over time, BeautifulSoup or Scrapy offer more reliability.
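For comparison, here is what the same scrape looks like in BeautifulSoup, where you must already know the selectors (these match the current markup of books.toscrape.com):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://books.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')

# You have to find and hard-code these selectors yourself
titles = [a['title'] for a in soup.select('article.product_pod h3 a')]
prices = [p.get_text() for p in soup.select('article.product_pod p.price_color')]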

About the author

Kael is a Senior Technical Copywriter at Thordata. He works closely with data engineers to document best practices for bypassing anti-bot protections. He specializes in explaining complex infrastructure concepts like residential proxies and TLS fingerprinting to developer audiences. All code examples in this article have been tested in real-world scraping scenarios.

The Thordata Blog offers all its content in its original form and solely for informational purposes. We do not offer any guarantees regarding the information found on the Thordata Blog or any external sites it may direct you to. It is essential that you seek legal counsel and thoroughly examine the specific terms of service of any website before engaging in any scraping endeavors, or obtain a scraping permit if required.