
10 Tips for Large-Scale Data Scraping Without Getting Blocked

Hieu Nguyen

Scaling web scraping from hundreds to millions of requests requires careful planning and the right infrastructure. A scraper that works perfectly on 100 pages can break catastrophically at 100,000 pages — getting blocked, running out of memory, losing data, or racking up unexpected costs. In this guide, we cover 10 proven techniques for building scraping pipelines that scale reliably, maintain high success rates, and avoid the common pitfalls that derail large operations.

Large-scale data scraping infrastructure

Tip 1: Rotate IPs Intelligently

IP rotation is the foundation of large-scale scraping. But "rotate IPs" does not mean "use a random IP for every request." Smart rotation means matching your rotation strategy to the target site's detection patterns.

Per-request rotation works for sites that track IPs on a per-request basis (most e-commerce sites, search engines). Each request comes from a different IP, making it much harder for the target to build a behavioral profile tied to any one address.

Time-based rotation (sticky sessions) works for multi-step workflows where you need to maintain session state — logging in, navigating paginated results, or completing checkout flows.

Failure-triggered rotation means switching IPs only when you detect a block (403, 429, CAPTCHA). This conserves your IP pool by keeping working IPs in use longer.

```python
import requests
import random

class SmartRotator:
    def __init__(self, proxy_base, max_failures=3):
        self.proxy_base = proxy_base
        self.max_failures = max_failures
        self.failures = 0
        self.session_id = random.randint(100000, 999999)

    def get_proxy(self):
        if self.failures >= self.max_failures:
            # Too many failures — rotate to a fresh session/IP
            self.session_id = random.randint(100000, 999999)
            self.failures = 0
        return {
            "http": f"http://user-session-{self.session_id}:pass@{self.proxy_base}",
            "https": f"http://user-session-{self.session_id}:pass@{self.proxy_base}",
        }

    def fetch(self, url):
        resp = requests.get(url, proxies=self.get_proxy(), timeout=30)
        if resp.status_code in (403, 429, 503):
            self.failures += 1
        else:
            self.failures = 0
        return resp
```

Tip 2: Respect Rate Limits with Adaptive Delays

Hammering a target site with rapid-fire requests is the fastest way to get blocked. Implement adaptive delays that respond to the target site's signals:

```python
import time
import random

def adaptive_delay(response, base_delay=1.0):
    if response.status_code == 429:
        # Rate limited — back off significantly
        retry_after = int(response.headers.get("Retry-After", 30))
        time.sleep(retry_after)
    elif response.status_code == 200:
        # Success — use normal delay with jitter
        time.sleep(base_delay + random.uniform(0.5, 1.5))
    else:
        # Error — moderate backoff
        time.sleep(base_delay * 2)
```

Key principle: Random jitter makes your request pattern look human. A constant 1-second delay between every request is a fingerprint in itself.

Tip 3: Use Residential Proxies for Protected Sites

Datacenter proxies work fine for unprotected sites, but any target with anti-bot protection (Cloudflare, Akamai, PerimeterX) requires residential proxies. Residential IPs come from real ISPs and pass IP reputation checks that datacenter IPs cannot.

For large-scale operations, per-day pricing models save dramatically over per-GB models. At 100+ GB/month, the difference can be 90% or more. ResProxy's plans start at $0.24/day with unlimited bandwidth — check rotating proxy pricing for details.

Tip 4: Manage Browser Fingerprints

At scale, your browser fingerprint becomes as important as your IP. Anti-bot systems analyze:

  • User-Agent string — Rotate through realistic, current browser User-Agents
  • Accept-Language and Accept-Encoding headers — Match these to your proxy's geographic location
  • TLS fingerprint — Different HTTP libraries create different TLS handshake signatures. curl looks different from Python requests, which looks different from a real Chrome browser
  • JavaScript execution patterns — If using a headless browser, stealth plugins are essential

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }
```

Tip 5: Handle CAPTCHAs Gracefully

At scale, some CAPTCHAs are inevitable. The best strategy is a tiered approach:

  1. Prevention first — Residential proxies + realistic fingerprints + respectful rate limits prevent the vast majority of CAPTCHAs before they appear
  2. Detection — Monitor for CAPTCHA indicators in responses (check for known CAPTCHA provider scripts, specific HTTP status codes, or response size anomalies)
  3. Rotation on CAPTCHA — When you hit a CAPTCHA, rotate your IP immediately. Often, the same URL works fine from a different IP
  4. CAPTCHA solving as last resort — Services like 2Captcha or Anti-Captcha can solve CAPTCHAs programmatically, but they add cost and latency
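The detection step above can be sketched as a small helper. The marker strings and the size threshold below are illustrative assumptions, not a complete list — tune them against the responses you actually see from your targets:

```python
# Common CAPTCHA/challenge indicators (illustrative, not exhaustive)
CAPTCHA_MARKERS = [
    "g-recaptcha",    # Google reCAPTCHA widget
    "hcaptcha.com",   # hCaptcha script
    "cf-challenge",   # Cloudflare challenge page
]

def looks_like_captcha(status_code, body, min_page_bytes=2000):
    """Return True if a response is probably a CAPTCHA or challenge page."""
    if status_code in (403, 429):
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in CAPTCHA_MARKERS):
        return True
    # Real content pages are rarely tiny; a very small 200 response
    # is often an interstitial challenge rather than the page you wanted.
    return status_code == 200 and len(body) < min_page_bytes
```

Run this check on every response before parsing, and feed a positive result into your rotation logic (tip 1) and your metrics (tip 6).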

Tip 6: Monitor Success Rates in Real Time

You cannot optimize what you do not measure. Build a monitoring dashboard that tracks:

  • Success rate — Percentage of 200 responses vs total requests. Target: above 95%
  • Block rate — Percentage of 403/429/503 responses. If this exceeds 10%, something is wrong
  • Average response time — Increasing latency often precedes blocks
  • Data quality — Are you getting the actual page content, or empty/redirect pages?

```python
from collections import defaultdict

class ScrapeMetrics:
    def __init__(self):
        self.status_counts = defaultdict(int)
        self.total_requests = 0

    def record(self, status_code):
        self.status_counts[status_code] += 1
        self.total_requests += 1

    def success_rate(self):
        if self.total_requests == 0:
            return 0
        return self.status_counts[200] / self.total_requests * 100

    def report(self):
        print(f"Total requests: {self.total_requests}")
        print(f"Success rate: {self.success_rate():.1f}%")
        # Keys can mix ints and strings (e.g. "FAILED"), so sort by str
        for code, count in sorted(self.status_counts.items(), key=lambda kv: str(kv[0])):
            print(f"  HTTP {code}: {count}")
```

Tip 7: Use Sticky Sessions Wisely

Sticky sessions (keeping the same IP for multiple requests) are essential for:

  • Paginated results — Crawling page 1, 2, 3 of search results with the same IP looks natural
  • Login-required content — Maintaining an authenticated session
  • Shopping cart workflows — Add-to-cart and checkout sequences

But do not overuse sticky sessions. Sending 500 requests from the same IP in 10 minutes will get that IP burned. Set a session limit (e.g., 20-50 requests per session) and rotate after.
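A minimal sketch of that per-session cap, assuming the common `user-session-{id}` username convention from the earlier rotation example (check your provider's docs for the exact sticky-session syntax):

```python
import random

class StickySession:
    """Keep one session ID for a bounded number of requests, then rotate.
    The user-session-{id} username format is an assumption; substitute
    your provider's actual sticky-session syntax."""

    def __init__(self, proxy_base, max_per_session=30):
        self.proxy_base = proxy_base
        self.max_per_session = max_per_session
        self._new_session()

    def _new_session(self):
        self.session_id = random.randint(100000, 999999)
        self.count = 0

    def get_proxy(self):
        if self.count >= self.max_per_session:
            self._new_session()  # burn limit reached — rotate to a new IP
        self.count += 1
        proxy = f"http://user-session-{self.session_id}:pass@{self.proxy_base}"
        return {"http": proxy, "https": proxy}
```

Pick `max_per_session` based on how a real user would behave on the target: 20-50 page views per visit is plausible for most sites.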

Tip 8: Implement Retry Logic with Exponential Backoff

Not every failure means you are blocked. Network hiccups, server overload, and temporary rate limits are common. A proper retry strategy handles these gracefully:

```python
import time
import random
import requests

def fetch_with_backoff(url, session, max_retries=5):
    for attempt in range(max_retries):
        try:
            resp = session.get(url, timeout=30)
            if resp.status_code == 200:
                return resp
            if resp.status_code in (403, 429):
                wait = min(2 ** attempt + random.uniform(0, 1), 60)
                print(f"Blocked (HTTP {resp.status_code}). Retrying in {wait:.1f}s...")
                time.sleep(wait)
                # Rotate IP for next attempt (get_new_proxy is a
                # provider-specific helper defined elsewhere)
                session.proxies = get_new_proxy()
            elif resp.status_code >= 500:
                time.sleep(2 ** attempt)
        except requests.exceptions.Timeout:
            time.sleep(2)
        except requests.exceptions.ConnectionError:
            time.sleep(5)
    return None  # All retries exhausted
```

Key points: Rotate IP on 403/429 errors. Cap the maximum wait time (60 seconds is reasonable). Return None rather than raising an exception so the caller can decide what to do.

Tip 9: Distribute Across Geos

Many websites serve different content and apply different rate limits based on visitor location. Distributing your requests across multiple geographic regions:

  • Spreads the load across different anti-bot infrastructure
  • Reduces the request rate from any single region
  • Gives you access to geo-specific content and pricing
  • Makes your traffic pattern look like organic global traffic

Use your proxy provider's geo-targeting to assign specific countries or cities to different parts of your crawl.
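One simple way to spread a crawl is round-robin assignment of countries to URLs. The `user-country-{cc}` username format below is a common geo-targeting convention but an assumption here, and `proxy.example.com:8000` is a placeholder for your actual gateway:

```python
import itertools

# Regions to spread the crawl across (pick ones your provider supports)
COUNTRIES = ["us", "gb", "de", "jp", "au"]

def assign_geo(urls):
    """Round-robin URLs across countries so no single region
    carries the whole crawl."""
    cycle = itertools.cycle(COUNTRIES)
    return [(url, next(cycle)) for url in urls]

def geo_proxy(country, proxy_base="proxy.example.com:8000"):
    """Build a proxies dict targeting one country via the username.
    The user-country-{cc} format is an assumption — check your
    provider's geo-targeting syntax."""
    proxy = f"http://user-country-{country}:pass@{proxy_base}"
    return {"http": proxy, "https": proxy}
```

If parts of your crawl need geo-specific content (local pricing, regional listings), pin those URL groups to fixed countries instead of round-robining them.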

Tip 10: Cache Responses and Deduplicate

At scale, you will inevitably request the same URL multiple times — pagination loops, retry logic, and overlapping crawl batches all cause duplicates. Implement caching to avoid waste:

```python
import hashlib
import json
import os

class ResponseCache:
    def __init__(self, cache_dir="./cache"):
        os.makedirs(cache_dir, exist_ok=True)
        self.cache_dir = cache_dir

    def _key(self, url):
        return hashlib.md5(url.encode()).hexdigest()

    def get(self, url):
        path = os.path.join(self.cache_dir, self._key(url))
        if os.path.exists(path):
            with open(path, "r") as f:
                return json.load(f)
        return None

    def set(self, url, data):
        path = os.path.join(self.cache_dir, self._key(url))
        with open(path, "w") as f:
            json.dump(data, f)
```

Caching reduces proxy bandwidth usage, speeds up re-runs, and lets you separate the "fetch" phase from the "parse" phase of your pipeline.

Infrastructure Advice for Scale

When you move beyond a single script on your laptop, consider these architectural decisions:

  • Use a job queue (Redis Queue, Celery, or even a simple database table) to manage URLs to scrape. This lets you distribute work across multiple machines and resume after failures.
  • Separate fetching from parsing — Fetch raw HTML and store it. Parse later in a separate step. This lets you re-parse without re-fetching if your parser has bugs.
  • Run scrapers in containers — Docker containers make it easy to scale horizontally and ensure consistent environments.
  • Schedule with cron or Airflow — For recurring scraping jobs (daily price monitoring, weekly competitor analysis), use a scheduler rather than manual execution.
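The "simple database table" queue option can be sketched with SQLite so the example stays self-contained. A real multi-machine deployment would use Redis Queue or Celery (or at least proper row locking), but the claim/complete pattern is the same, and it is what lets a crashed worker's batch be resumed:

```python
import sqlite3

class UrlQueue:
    """Minimal database-table job queue (single-process sketch)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(url TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')"
        )

    def add(self, urls):
        # INSERT OR IGNORE deduplicates URLs across overlapping batches
        self.db.executemany(
            "INSERT OR IGNORE INTO jobs (url) VALUES (?)", [(u,) for u in urls]
        )
        self.db.commit()

    def claim(self):
        """Claim one pending URL, or return None when the queue is drained."""
        row = self.db.execute(
            "SELECT url FROM jobs WHERE status = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE jobs SET status = 'working' WHERE url = ?", row)
        self.db.commit()
        return row[0]

    def done(self, url):
        self.db.execute("UPDATE jobs SET status = 'done' WHERE url = ?", (url,))
        self.db.commit()
```

On restart, reset any stale `'working'` rows to `'pending'` and workers pick up exactly where the crashed run left off.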

Error Handling Patterns for Production

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

def scrape_url(url, session, metrics):
    try:
        resp = fetch_with_backoff(url, session)
        if resp is None:
            logger.error(f"All retries failed: {url}")
            metrics.record("FAILED")
            return None
        metrics.record(resp.status_code)
        if resp.status_code == 200:
            return resp.text
        logger.warning(f"Unexpected status {resp.status_code}: {url}")
        return None
    except Exception as e:
        logger.exception(f"Unhandled error for {url}: {e}")
        metrics.record("EXCEPTION")
        return None
```

Never let exceptions crash your scraper silently. Log every error with the URL that caused it so you can debug and retry later.

For proxy setup and integration, check our Python rotating proxy tutorial and rotating proxy plans starting at $0.24/day. New to scraping? Start with our Python web scraping guide for beginners.

For the technical fundamentals of efficient web requests and performance monitoring, see Google's web.dev performance guide.

FAQ

How many requests per second can I make without getting blocked?

There is no universal number — it depends entirely on the target site. Conservative starting points: 1 request/second for heavily protected sites (Amazon, Google), 5-10 requests/second for moderately protected sites, and 20+ requests/second for unprotected sites. Monitor your success rate and adjust.

Do I need a dedicated server for large-scale scraping?

For anything above 100,000 requests/day, yes. A VPS or cloud instance with 2-4 CPU cores and 4-8 GB RAM is sufficient for most Scrapy-based crawlers. For browser-based scraping (Playwright/Puppeteer), you need more resources — at least 4 GB RAM per concurrent browser instance.

How do I handle websites that change their HTML structure?

Build defensive parsers that handle missing elements gracefully. Use CSS selectors rather than XPath when possible (they are more resilient to minor HTML changes). Monitor for parsing failures and alert when the rate exceeds a threshold — this usually indicates a site redesign.
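One way to express that pattern is a single helper that every field access goes through. Here `select_one` stands in for any BeautifulSoup-style call that returns None when nothing matches, and the `"N/A"` default and selector names are arbitrary choices for illustration:

```python
def extract(page, selector, default="N/A"):
    """Return the stripped text of the first match, or a default —
    never an exception. `page` is any parser object with a
    BeautifulSoup-style select_one() that returns None on no match."""
    try:
        node = page.select_one(selector)
        return node.get_text(strip=True) if node is not None else default
    except Exception:
        return default
```

With every field routed through a helper like this, a site redesign produces rows full of `"N/A"` values you can count and alert on, instead of a crashed crawl.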

What is the most common mistake in large-scale scraping?

Not implementing proper error handling and monitoring. Teams often build scrapers that work in testing but fail silently in production — returning empty data, hitting rate limits, or getting blocked without anyone noticing. Invest in logging and metrics from day one.

Hieu Nguyen

Founder & CEO

Founder of ResProxy and JC Media Agency. Over 5 years of experience in proxy infrastructure, digital advertising, and SaaS product development. Building premium proxy solutions for businesses worldwide.