Data Scraping · 15 min read

Python Web Scraping Guide for Beginners — 2026 Edition

Hieu Nguyen

Web scraping with Python is one of the most valuable skills for data professionals in 2026. Whether you need to collect product prices, monitor competitor websites, aggregate job listings, or build datasets for machine learning, Python offers powerful libraries that make scraping accessible to beginners and scalable for experts. This guide walks you through the three most popular scraping tools, provides working code examples, and covers essential topics like handling anti-bot systems, pagination, data storage, and proxy integration.


Tools Overview

Python's ecosystem offers three main approaches to web scraping, each suited to different scenarios:

  • BeautifulSoup — A lightweight HTML/XML parser. Best for simple, static pages where you need to extract specific elements. Easy to learn, minimal setup.
  • Scrapy — A full-featured scraping framework with built-in support for crawling, pipelines, middleware, and concurrent requests. Best for large-scale projects.
  • Playwright — A browser automation library that renders JavaScript. Best for single-page applications (SPAs) and sites that load content dynamically.
| Feature | BeautifulSoup | Scrapy | Playwright |
|---|---|---|---|
| Learning Curve | Easy | Moderate | Moderate |
| JavaScript Support | No | No (without middleware) | Yes |
| Concurrency | Manual (threads) | Built-in | Built-in |
| Speed | Fast (no rendering) | Very fast | Slower (renders pages) |
| Best For | Small projects | Large-scale crawling | JS-heavy sites |
Web scraping tools and frameworks overview

Getting Started with BeautifulSoup

BeautifulSoup is the best starting point for beginners. Install it alongside the requests library:

```bash
pip install beautifulsoup4 requests
```

Here is a complete example that scrapes article titles from a blog:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"}

response = requests.get(url, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

articles = soup.find_all("article")
for article in articles:
    title = article.find("h2").get_text(strip=True)
    link = article.find("a")["href"]
    print(f"{title} — {link}")
```

Key points:

  • Always set a realistic User-Agent header; the default python-requests User-Agent is blocked by many sites.
  • Use html.parser (built-in) or install lxml for faster parsing: pip install lxml, then use BeautifulSoup(html, "lxml").
  • find_all() returns a list of matching elements; find() returns the first match or None.
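Because find() returns None when no match exists, chained calls like article.find("h2").get_text() raise AttributeError on pages with missing elements. A minimal defensive variant, parsing an inline HTML snippet so it runs without a network request (the snippet and its tags are illustrative):

```python
from bs4 import BeautifulSoup

# Inline snippet standing in for a fetched page; the second <article>
# deliberately lacks an <h2> heading.
html = """
<article><h2>First post</h2><a href="/first">Read</a></article>
<article><a href="/second">Read</a></article>
"""

soup = BeautifulSoup(html, "html.parser")
titles = []
for article in soup.find_all("article"):
    heading = article.find("h2")   # None when the tag is missing
    if heading is None:
        continue                   # skip malformed entries instead of crashing
    titles.append(heading.get_text(strip=True))

print(titles)
```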

Extracting Structured Data

For product pages or structured content, you often need to extract multiple fields:

```python
products = []
for card in soup.find_all("div", class_="product-card"):
    product = {
        "name": card.find("h3").get_text(strip=True),
        "price": card.find("span", class_="price").get_text(strip=True),
        "rating": card.find("span", class_="rating").get_text(strip=True),
        "url": card.find("a")["href"],
    }
    products.append(product)
```

Scraping at Scale with Scrapy

For projects that need to crawl hundreds or thousands of pages, Scrapy is the industry standard:

```bash
pip install scrapy
scrapy startproject myproject
```

Create a spider in myproject/spiders/products.py:

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]

    def parse(self, response):
        for product in response.css("div.product-card"):
            yield {
                "name": product.css("h3::text").get(),
                "price": product.css("span.price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }

        # Follow pagination links
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Run it with: scrapy crawl products -o products.json

Scrapy handles concurrency, retries, rate limiting, and data export automatically. It is significantly faster than BeautifulSoup for large crawls because it processes multiple requests in parallel.
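Those behaviors are configured in the project's settings.py. A sketch with illustrative values; the setting names are standard Scrapy settings, but the numbers are examples to tune per target site:

```python
# settings.py — concurrency, politeness, and retry knobs
CONCURRENT_REQUESTS = 16                 # parallel requests across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8       # cap per target site
DOWNLOAD_DELAY = 0.5                     # base delay between requests to one site
RETRY_ENABLED = True
RETRY_TIMES = 3                          # re-queue failed responses (5xx, timeouts)
AUTOTHROTTLE_ENABLED = True              # adapt delay to the server's latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
FEED_EXPORT_ENCODING = "utf-8"           # for -o products.json exports
```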

Handling JavaScript-Heavy Sites with Playwright

Many modern websites load content via JavaScript after the initial page load. BeautifulSoup and Scrapy cannot see this content because they do not execute JavaScript. Playwright solves this by controlling a real browser:

```bash
pip install playwright
playwright install chromium
```

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-page")

    # Wait for dynamic content to load
    page.wait_for_selector("div.product-card")

    cards = page.query_selector_all("div.product-card")
    for card in cards:
        name = card.query_selector("h3").inner_text()
        price = card.query_selector(".price").inner_text()
        print(f"{name}: {price}")

    browser.close()
```

Playwright is slower than BeautifulSoup because it launches a full browser, but it is the only reliable option for sites built with React, Vue, Angular, or other frontend frameworks.

Integrating proxies with Python scraping scripts

Handling Anti-Bot Systems

Modern websites deploy anti-bot systems that block scrapers. Here are the most common defenses and how to handle them:

Rate Limiting (429 Errors)

Add delays between requests and implement exponential backoff:

```python
import time
import random

def polite_request(url, session, proxies):
    time.sleep(random.uniform(1, 3))  # Random delay
    return session.get(url, proxies=proxies, timeout=30)
```

CAPTCHAs

When you encounter CAPTCHAs, the best strategy is to rotate your IP and retry. Most CAPTCHAs are triggered by IP reputation, not individual requests. Using residential proxies dramatically reduces CAPTCHA frequency.
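That rotate-and-retry strategy can be sketched as follows. PROXY_POOL and looks_like_captcha are hypothetical placeholders; with a rotating gateway, every new connection gets a fresh exit IP automatically and a manual pool is unnecessary:

```python
import random

# Hypothetical pool of proxy endpoints
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch_with_rotation(url, session, looks_like_captcha, max_attempts=3):
    """Retry through a different proxy whenever the response looks like a CAPTCHA.

    looks_like_captcha is a caller-supplied predicate, e.g. one that checks
    for a challenge marker in response.text.
    """
    response = None
    for _ in range(max_attempts):
        proxy = random.choice(PROXY_POOL)
        response = session.get(url, proxies={"http": proxy, "https": proxy},
                               timeout=30)
        if not looks_like_captcha(response):
            return response  # got real content
    return response  # every attempt was challenged; give up
```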

Browser Fingerprinting

For Playwright/Selenium scrapers, use stealth plugins to avoid detection:

```bash
pip install playwright-stealth
```

```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)  # Apply stealth patches
    page.goto("https://protected-site.com")
```

Handling Pagination

Most scraping jobs require navigating through multiple pages. Here are two common patterns:

URL-Based Pagination

```python
base_url = "https://example.com/products?page={}"
all_products = []

for page_num in range(1, 50):
    url = base_url.format(page_num)
    response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    products = soup.find_all("div", class_="product-card")
    if not products:  # No more results
        break

    for product in products:
        all_products.append(product.find("h3").get_text(strip=True))

    time.sleep(random.uniform(1, 2))
```

Infinite Scroll (JavaScript)

For infinite scroll pages, use Playwright to simulate scrolling:

```python
# Scroll to bottom repeatedly until no new content loads
previous_height = 0
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.wait_for_timeout(2000)  # Wait for content to load
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break
    previous_height = current_height
```

Storing Scraped Data

Once you have collected data, you need to store it. Common options:

CSV (Simple, Universal)

```python
import csv

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(products)
```

JSON (Structured, API-Friendly)

```python
import json

with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, indent=2, ensure_ascii=False)
```

SQLite (Queryable, No Server Needed)

```python
import sqlite3

conn = sqlite3.connect("products.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, url TEXT)")
cursor.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(p["name"], p["price"], p["url"]) for p in products],
)
conn.commit()
conn.close()
```

Proxy Integration

For any serious scraping project, proxy integration is essential. Without proxies, you will get blocked after a few hundred requests on most commercial websites.

Integrate rotating proxies to distribute your requests across thousands of IPs and maintain success rates above 95%. Here is how to add proxies to each tool:

BeautifulSoup/requests: Pass the proxies parameter (see our Python proxy setup guide).
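For example, a small helper that builds a requests.Session routed through a single proxy endpoint. The credentials are placeholders and the gateway host is the illustrative one used elsewhere in this guide; substitute your own:

```python
import requests

# Placeholder endpoint; replace with your own gateway and credentials
PROXY_URL = "http://user:pass@gate.resproxy.io:7000"

def proxied_session(proxy_url=PROXY_URL):
    """Return a requests.Session that routes all HTTP(S) traffic through one proxy."""
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
    )
    return session

# Usage: response = proxied_session().get("https://example.com/products", timeout=30)
```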

Scrapy: Use the built-in HttpProxyMiddleware, which reads the proxy from request.meta["proxy"] (or the standard http_proxy environment variable), not from a custom setting:

```python
# settings.py — HttpProxyMiddleware is enabled by default; listing it here
# only adjusts its priority
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# in your spider, set the proxy per request:
# yield scrapy.Request(url, meta={"proxy": "http://user:pass@gate.resproxy.io:7000"})
```

Playwright: Set the proxy at launch:

```python
browser = p.chromium.launch(
    proxy={
        "server": "http://gate.resproxy.io:7000",
        "username": "user",
        "password": "pass",
    }
)
```

If you are new to proxies, start with our getting started guide to set up your account in minutes.

For a comprehensive API reference, see the official requests library documentation.

FAQ

Which Python library should I start with for web scraping?

Start with BeautifulSoup and the requests library. They are the easiest to learn and handle 70-80% of scraping tasks. Move to Scrapy when you need scale, and Playwright when you need JavaScript rendering.

Is web scraping legal?

Scraping publicly available data is generally legal in most jurisdictions, especially after the 2022 hiQ v. LinkedIn ruling in the US. However, always check the target site's Terms of Service and respect robots.txt directives. Avoid scraping personal data without consent.

How do I avoid getting blocked while scraping?

Use rotating residential proxies, set realistic User-Agent headers, add random delays between requests, and avoid scraping at aggressive rates. Our large-scale scraping tips cover this topic in depth.

How fast can I scrape with Python?

With BeautifulSoup (sequential): 1-5 pages/second. With Scrapy (concurrent): 50-200+ pages/second. With Playwright (browser rendering): 0.5-2 pages/second. Using proxies does not significantly reduce speed — the bottleneck is usually the target site's response time.

Hieu Nguyen

Founder & CEO

Founder of ResProxy and JC Media Agency. Over 5 years of experience in proxy infrastructure, digital advertising, and SaaS product development. Building premium proxy solutions for businesses worldwide.