Web scraping with Python is one of the most valuable skills for data professionals in 2026. Whether you need to collect product prices, monitor competitor websites, aggregate job listings, or build datasets for machine learning, Python offers powerful libraries that make scraping accessible to beginners and scalable for experts. This guide walks you through the three most popular scraping tools, provides working code examples, and covers essential topics like handling anti-bot systems, pagination, data storage, and proxy integration.

Tools Overview
Python's ecosystem offers three main approaches to web scraping, each suited to different scenarios:
- BeautifulSoup — A lightweight HTML/XML parser. Best for simple, static pages where you need to extract specific elements. Easy to learn, minimal setup.
- Scrapy — A full-featured scraping framework with built-in support for crawling, pipelines, middleware, and concurrent requests. Best for large-scale projects.
- Playwright — A browser automation library that renders JavaScript. Best for single-page applications (SPAs) and sites that load content dynamically.
| Feature | BeautifulSoup | Scrapy | Playwright |
|---|---|---|---|
| Learning Curve | Easy | Moderate | Moderate |
| JavaScript Support | No | No (without middleware) | Yes |
| Concurrency | Manual (threads) | Built-in | Built-in |
| Speed | Fast (no rendering) | Very fast | Slower (renders pages) |
| Best For | Small projects | Large-scale crawling | JS-heavy sites |

Getting Started with BeautifulSoup
BeautifulSoup is the best starting point for beginners. Install it alongside the requests library:
```bash
pip install beautifulsoup4 requests
```
Here is a complete example that scrapes article titles from a blog:
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"}

response = requests.get(url, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

articles = soup.find_all("article")
for article in articles:
    title = article.find("h2").get_text(strip=True)
    link = article.find("a")["href"]
    print(f"{title} — {link}")
```
Key points:
- Always set a realistic User-Agent header. The default Python requests User-Agent is blocked by many sites.
- Use html.parser (built-in) or install lxml for faster parsing: pip install lxml, then use BeautifulSoup(html, "lxml").
- find_all() returns a list of matching elements. find() returns the first match or None.
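Because find() returns None on a miss, chained calls like article.find("h2").get_text() raise AttributeError whenever a page deviates from the expected structure. A minimal defensive sketch (the HTML here is illustrative):

```python
from bs4 import BeautifulSoup

# Illustrative HTML: the second article is missing its <h2>
html = """
<article><h2>First post</h2><a href="/first">Read</a></article>
<article><a href="/untitled">Read</a></article>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for article in soup.find_all("article"):
    heading = article.find("h2")
    # Skip articles with no <h2> instead of crashing with AttributeError
    if heading is None:
        continue
    titles.append(heading.get_text(strip=True))

print(titles)  # → ['First post']
```

Guarding each find() call this way keeps a long crawl alive when a handful of pages are malformed.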
Extracting Structured Data
For product pages or structured content, you often need to extract multiple fields:
```python
products = []
for card in soup.find_all("div", class_="product-card"):
    product = {
        "name": card.find("h3").get_text(strip=True),
        "price": card.find("span", class_="price").get_text(strip=True),
        "rating": card.find("span", class_="rating").get_text(strip=True),
        "url": card.find("a")["href"],
    }
    products.append(product)
```
Scraping at Scale with Scrapy
For projects that need to crawl hundreds or thousands of pages, Scrapy is the industry standard:
```bash
pip install scrapy
scrapy startproject myproject
```
Create a spider in myproject/spiders/products.py:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]

    def parse(self, response):
        for product in response.css("div.product-card"):
            yield {
                "name": product.css("h3::text").get(),
                "price": product.css("span.price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }

        # Follow pagination links
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Run it with: scrapy crawl products -o products.json
Scrapy handles concurrency, retries, rate limiting, and data export automatically. It is significantly faster than BeautifulSoup for large crawls because it processes multiple requests in parallel.
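The concurrency, retry, and rate-limiting behavior mentioned above is controlled from the project's settings.py. A minimal sketch with illustrative values (tune them to the target site):

```python
# myproject/settings.py — illustrative throttling and retry settings
CONCURRENT_REQUESTS = 16              # parallel requests across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # cap per single domain
DOWNLOAD_DELAY = 0.5                  # base delay between requests to one site
RETRY_TIMES = 3                       # retry failed requests up to 3 times

AUTOTHROTTLE_ENABLED = True           # adapt delay to observed response times
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
```

AutoThrottle is usually the safest starting point: it backs off automatically when the server slows down instead of hammering it at a fixed rate.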
Handling JavaScript-Heavy Sites with Playwright
Many modern websites load content via JavaScript after the initial page load. BeautifulSoup and Scrapy cannot see this content because they do not execute JavaScript. Playwright solves this by controlling a real browser:
```bash
pip install playwright
playwright install chromium
```
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-page")

    # Wait for dynamic content to load
    page.wait_for_selector("div.product-card")

    cards = page.query_selector_all("div.product-card")
    for card in cards:
        name = card.query_selector("h3").inner_text()
        price = card.query_selector(".price").inner_text()
        print(f"{name}: {price}")

    browser.close()
```
Playwright is slower than BeautifulSoup because it launches a full browser, but it is the only reliable option for sites built with React, Vue, Angular, or other frontend frameworks.

Handling Anti-Bot Systems
Modern websites deploy anti-bot systems that block scrapers. Here are the most common defenses and how to handle them:
Rate Limiting (429 Errors)
Add delays between requests and implement exponential backoff:
```python
import time
import random

def polite_request(url, session, proxies):
    time.sleep(random.uniform(1, 3))  # Random delay between requests
    return session.get(url, proxies=proxies, timeout=30)
```
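The helper above only adds a random delay; the exponential-backoff half can be sketched as follows. This is a generic pattern, not a specific library's API: `get` stands in for any callable returning a requests-style response, e.g. `functools.partial(session.get, timeout=30)`:

```python
import time
import random

def request_with_backoff(get, url, max_retries=5):
    """Retry with exponential backoff whenever the server returns 429.

    `get` is any callable returning an object with a .status_code,
    e.g. functools.partial(session.get, timeout=30).
    """
    for attempt in range(max_retries):
        response = get(url)
        if response.status_code != 429:
            return response
        # Wait 1s, 2s, 4s, 8s, ... plus jitter before the next attempt
        time.sleep(2 ** attempt + random.uniform(0, 1))
    return response  # Give up and return the last 429 response
```

The jitter matters: many scrapers retrying in lockstep at exact powers of two look like a coordinated bot swarm.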
CAPTCHAs
When you encounter CAPTCHAs, the best strategy is to rotate your IP and retry. Most CAPTCHAs are triggered by IP reputation, not individual requests. Using residential proxies dramatically reduces CAPTCHA frequency.
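The rotate-and-retry strategy can be sketched generically. Everything here is an assumption for illustration: `fetch` is any requests-style callable, `proxy_pool` is an iterator of proxy URLs you supply, and the substring check is a crude stand-in for real CAPTCHA detection, which depends on the target site's markup:

```python
def fetch_avoiding_captcha(fetch, url, proxy_pool, max_attempts=3):
    """Retry from a fresh IP when the response looks like a CAPTCHA page.

    `fetch` is any callable like functools.partial(requests.get, timeout=30);
    `proxy_pool` is an iterator of proxy URLs, e.g. itertools.cycle([...]).
    """
    for _ in range(max_attempts):
        proxy = next(proxy_pool)
        response = fetch(url, proxies={"http": proxy, "https": proxy})
        # Crude heuristic; real detection depends on the site's CAPTCHA markup
        if "captcha" not in response.text.lower():
            return response
    raise RuntimeError("CAPTCHA served on every attempt; rotate to a fresh pool")
```

Each retry goes out through the next proxy in the pool, so a CAPTCHA triggered by one IP's reputation does not poison the next attempt.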
Browser Fingerprinting
For Playwright/Selenium scrapers, use stealth plugins to avoid detection:
```bash
pip install playwright-stealth
```
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)  # Apply stealth patches
    page.goto("https://protected-site.com")
```
Handling Pagination
Most scraping jobs require navigating through multiple pages. Here are two common patterns:
URL-Based Pagination
```python
base_url = "https://example.com/products?page={}"
all_products = []

for page_num in range(1, 50):
    url = base_url.format(page_num)
    response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    products = soup.find_all("div", class_="product-card")
    if not products:  # No more results
        break

    for product in products:
        all_products.append(product.find("h3").get_text(strip=True))

    time.sleep(random.uniform(1, 2))
```
Infinite Scroll (JavaScript)
For infinite scroll pages, use Playwright to simulate scrolling:
```python
# Scroll to bottom repeatedly until no new content loads
previous_height = 0
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.wait_for_timeout(2000)  # Wait for content to load
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break
    previous_height = current_height
```
Storing Scraped Data
Once you have collected data, you need to store it. Common options:
CSV (Simple, Universal)
```python
import csv

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(products)
```
JSON (Structured, API-Friendly)
```python
import json

with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, indent=2, ensure_ascii=False)
```
SQLite (Queryable, No Server Needed)
```python
import sqlite3

conn = sqlite3.connect("products.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, url TEXT)")
cursor.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(p["name"], p["price"], p["url"]) for p in products],
)
conn.commit()
conn.close()
```
Proxy Integration
For any serious scraping project, proxy integration is essential. Without proxies, you will get blocked after a few hundred requests on most commercial websites.
Integrate rotating proxies to distribute your requests across thousands of IPs and maintain success rates above 95%. Here is how to add proxies to each tool:
BeautifulSoup/requests: Pass the proxies parameter (see our Python proxy setup guide).
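A minimal sketch of the proxies dict for requests, with placeholder credentials and the gateway URL used elsewhere in this guide standing in for your provider's endpoint:

```python
import requests

# Placeholder gateway; substitute your provider's host, port, and credentials
PROXY = "http://user:pass@gate.resproxy.io:7000"
proxies = {"http": PROXY, "https": PROXY}

session = requests.Session()
session.proxies.update(proxies)
# Every request made on this session now routes through the proxy:
# response = session.get("https://example.com", timeout=30)
```

Setting the proxies on a Session once is less error-prone than passing proxies= to every individual get() call.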
Scrapy: Scrapy's built-in HttpProxyMiddleware reads proxies from the http_proxy/https_proxy environment variables, or from request.meta["proxy"] on each request. Make sure the middleware is enabled in settings.py:
```python
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# Then set the proxy per request in your spider, e.g.:
# yield scrapy.Request(url, meta={"proxy": "http://user:pass@gate.resproxy.io:7000"})
```
Playwright: Set the proxy at browser launch:
```python
browser = p.chromium.launch(proxy={
    "server": "http://gate.resproxy.io:7000",
    "username": "user",
    "password": "pass",
})
```
If you are new to proxies, start with our getting started guide to set up your account in minutes.
For comprehensive Python documentation, see the official Python requests library docs.
FAQ
Which Python library should I start with for web scraping?
Start with BeautifulSoup and the requests library. They are the easiest to learn and handle 70-80% of scraping tasks. Move to Scrapy when you need scale, and Playwright when you need JavaScript rendering.
Is web scraping legal?
Scraping publicly available data is generally legal in most jurisdictions, especially after the Ninth Circuit's 2022 hiQ v. LinkedIn ruling in the US, which held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act. However, always check the target site's Terms of Service and respect robots.txt directives. Avoid scraping personal data without consent.
How do I avoid getting blocked while scraping?
Use rotating residential proxies, set realistic User-Agent headers, add random delays between requests, and avoid scraping at aggressive rates. Our large-scale scraping tips cover this topic in depth.
How fast can I scrape with Python?
With BeautifulSoup (sequential): 1-5 pages/second. With Scrapy (concurrent): 50-200+ pages/second. With Playwright (browser rendering): 0.5-2 pages/second. Using proxies does not significantly reduce speed — the bottleneck is usually the target site's response time.
Founder & CEO
Founder of ResProxy and JC Media Agency. Over 5 years of experience in proxy infrastructure, digital advertising, and SaaS product development. Building premium proxy solutions for businesses worldwide.