engine="google_scholar" parameter that handles CAPTCHAs, parsing, and proxy rotation automatically.Google Scholar is the gold mine of academic research. Whether you are conducting a systematic literature review, analyzing citation trends, or feeding an LLM with scientific papers, the data here is invaluable.
But scraping it is a nightmare. Unlike scraping a standard e-commerce site, Google Scholar defends its data aggressively. If you try to loop through search results with a standard script, you will hit a CAPTCHA wall within seconds.
In this guide, I will show you two methods: the “Manual Way” using Python and proxies for those who want to build it from scratch, and the “Enterprise Way” using the Thordata SDK for instant, scalable results.
Before coding, let’s look at the URL structure. A standard search looks like this:
https://scholar.google.com/scholar?q=machine+learning&hl=en&start=10
q: Your search query.
hl: Host language (keep this set to en for consistent parsing).
start: The pagination offset (0, 10, 20…).
We need a few robust libraries: requests for HTTP, beautifulsoup4 for HTML parsing, and the thordata-sdk for the enterprise method.
pip install requests beautifulsoup4 lxml pandas thordata-sdk
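Before writing the scraper itself, a minimal sketch shows how those three parameters combine into a paginated URL (the page-to-offset helper below is my own convenience, not part of Scholar's interface):

from urllib.parse import quote_plus

def scholar_url(query, page=1):
    # start is a zero-based offset: page 1 -> 0, page 2 -> 10, and so on
    start = (page - 1) * 10
    return f"https://scholar.google.com/scholar?q={quote_plus(query)}&hl=en&start={start}"

print(scholar_url("machine learning", page=2))
# https://scholar.google.com/scholar?q=machine+learning&hl=en&start=10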
If you run the following code, it might work once. On the second try, you will get a 429 Error or a 200 OK response containing a CAPTCHA instead of results.
import requests
# This will likely get blocked immediately
headers = {'User-Agent': 'Mozilla/5.0 ...'}
r = requests.get('https://scholar.google.com/scholar?q=AI', headers=headers)
print(r.status_code) # Likely 429 or CAPTCHA html
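Note that a block is not always a non-200 status code; Scholar often returns a normal-looking page that contains a CAPTCHA form instead of results. A rough way to detect that case is to look for CAPTCHA markers in the body; the strings below are heuristics I use, not a documented contract:

def looks_blocked(response):
    # Heuristic check: Scholar block pages typically contain one of these markers
    blocked_markers = ("gs_captcha", "recaptcha", "unusual traffic")
    body = response.text.lower()
    return response.status_code != 200 or any(m in body for m in blocked_markers)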
To succeed manually, you need to route every request through a different IP address. This mimics distinct users searching from different locations. We will use Thordata’s Residential Gateway for this.
import random
import requests
from urllib.parse import quote_plus

def get_scholar_page(query, start=0):
    # URL-encode the query so multi-word searches ("machine learning") work
    url = f"https://scholar.google.com/scholar?q={quote_plus(query)}&hl=en&start={start}"

    # Thordata Proxy Setup (Residential)
    proxy = "http://USERNAME:PASSWORD@gate.thordata.com:12345"
    proxies = {
        "http": proxy,
        "https": proxy,
    }

    # Rotate User-Agents so each request looks like a different browser
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
    ]
    headers = {"User-Agent": random.choice(user_agents)}

    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        if response.status_code == 200:
            return response.text
        print(f"Blocked: {response.status_code}")
        return None
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None
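A minimal driver loop (my own sketch, not part of the original walkthrough) shows how you might page through results while pausing between requests to avoid an obvious burst pattern:

import time

pages_html = []
for start in range(0, 30, 10):  # first three result pages
    html = get_scholar_page("machine learning", start=start)
    if html:
        pages_html.append(html)
    # Random delay between requests so traffic does not look scripted
    time.sleep(random.uniform(2, 5))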
Google Scholar’s HTML is messy. We need specific CSS selectors to extract titles and citation counts accurately.
from bs4 import BeautifulSoup

def parse_scholar_results(html):
    soup = BeautifulSoup(html, 'lxml')
    results = []

    for item in soup.select('.gs_r.gs_or.gs_scl'):
        try:
            title_tag = item.select_one('.gs_rt a')
            title = title_tag.text if title_tag else "No Title"
            link = title_tag['href'] if title_tag else "None"

            # Extract the "Cited by" count from the footer line, if present
            citations = 0
            footer = item.select_one('.gs_fl')
            if footer and "Cited by" in footer.text:
                # Robust parsing for "Cited by 400"
                temp = footer.text.split("Cited by")[1].split()[0]
                citations = int(temp)

            results.append({
                'title': title,
                'link': link,
                'citations': citations,
            })
        except Exception as e:
            print(f"Parsing error: {e}")
            continue

    return results
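Putting the two pieces together, here is a short sketch that fetches one page, parses it, and saves the results to CSV with pandas (installed earlier but not used until now); the output filename is arbitrary:

import pandas as pd

html = get_scholar_page("machine learning", start=0)
if html:
    papers = parse_scholar_results(html)
    df = pd.DataFrame(papers)
    df.to_csv("scholar_results.csv", index=False)
    print(df.head())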
Building your own scraper is educational, but maintaining it is expensive. Google changes its HTML structure frequently, and CAPTCHAs evolve. The Thordata Python SDK offers a dedicated google_scholar engine that abstracts away all complexity.
Instead of managing proxies and HTML parsing, you get structured JSON data instantly.
from thordata import ThordataClient

# Initialize with your Scraper Token
client = ThordataClient(scraper_token="YOUR_TOKEN_HERE")

print("🚀 Fetching Google Scholar results...")

# The SDK handles CAPTCHAs and parsing automatically
results = client.serp_search(
    query="Generative Adversarial Networks",
    engine="google_scholar",
    num=10,
    country="us"
)

for paper in results.get("organic_results", []):
    print(f"Title: {paper.get('title')}\nLink: {paper.get('link')}\n")
The SDK approach is maintenance-free on your side: Thordata's engineers update the parsing logic whenever Google changes its layout. You also pay only for successful requests, which saves money compared with failed proxy attempts.
Scraping Google Scholar requires more than just code; it requires infrastructure. If you are a student or hobbyist, the manual method with residential proxies is a great way to learn.
However, for enterprise data needs, relying on the Thordata SDK ensures your data pipeline remains stable, scalable, and compliant.
Frequently asked questions
Is it legal to scrape Google Scholar?
Scraping publicly available data is generally considered legal in many jurisdictions (see, e.g., hiQ Labs v. LinkedIn), provided you do not harm the site's infrastructure. However, you should always review your local laws and the site's Terms of Service.
Why do I get blocked after only 5 requests?
Google Scholar applies extremely strict rate limiting based on IP address and TLS fingerprints. To avoid blocks, you must rotate high-quality residential proxies and mimic legitimate browser behavior; the Thordata SDK handles both for you.
How does the Thordata SDK handle CAPTCHAs?
The Thordata SDK routes requests through our specialized SERP infrastructure. When a CAPTCHA is detected, our system automatically solves it before returning the clean JSON data to your application.
About the author
Kael is a Senior Technical Copywriter at Thordata. He works closely with data engineers to document best practices for bypassing anti-bot protections. He specializes in explaining complex infrastructure concepts like residential proxies and TLS fingerprinting to developer audiences. All code examples in this article have been tested in real-world scraping scenarios.
The thordata Blog offers all its content in its original form and solely for informational purposes. We do not offer any guarantees regarding the information found on the thordata Blog or any external sites it may direct you to. Always seek legal counsel and thoroughly review the specific terms of service of any website before engaging in any scraping activity, or obtain a scraping permit if required.