Tutorials · 12 min read

How to Use Proxies with Scrapy — Middleware Configuration Guide

ResProxy Team

Scrapy is one of Python's most widely used web scraping frameworks, and its middleware architecture makes it a natural fit for integrating rotating proxies into large-scale crawling jobs. Unlike browser-based tools, Scrapy operates at the HTTP level, which means lower resource usage, higher throughput, and more precise control over request routing.

This guide walks through configuring proxy middleware in Scrapy from scratch. You will learn how to build a custom middleware, rotate IPs per request, handle authentication, and integrate with ResProxy's rotating endpoint for production-grade scraping.

[Image: Scrapy proxy middleware architecture]

Why Scrapy Needs Proxy Middleware

Scrapy can send thousands of concurrent requests, which is exactly why you need proxies. Without IP rotation, your scraper will trigger rate limits and bans within seconds on most commercial websites. Scrapy's downloader middleware system lets you inject a proxy into every outgoing request transparently, so your spiders do not need any proxy-related code.

By using rotating residential proxies, each request gets routed through a different real IP address, making your scraper look like thousands of different users browsing the site organically.

Prerequisites

You need Python 3.9 or later and Scrapy installed:

```bash
pip install scrapy
```

You also need proxy credentials from your provider. If you are using ResProxy, grab your username, password, and gateway address from the dashboard.

Method 1: Simple Proxy via Meta

The quickest way to add a proxy to a Scrapy request is through the meta parameter:

```python
import scrapy


class SimpleProxySpider(scrapy.Spider):
    name = "simple_proxy"
    start_urls = ["https://httpbin.org/ip"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={"proxy": "http://username:password@gate.resproxy.io:7777"},
            )

    def parse(self, response):
        self.logger.info(f"Response from IP: {response.text}")
```

This works for simple spiders, but it requires adding the meta parameter to every request. For real projects, a middleware is far more maintainable.

[Image: Custom Scrapy proxy middleware]

Method 2: Custom Proxy Middleware

A custom downloader middleware automatically injects proxy settings into every request. Create a file called middlewares.py in your Scrapy project:

```python
import base64
import logging

logger = logging.getLogger(__name__)


class ResProxyMiddleware:
    """Downloader middleware that routes all requests through a rotating proxy."""

    def __init__(self, proxy_url, proxy_user, proxy_pass):
        self.proxy_url = proxy_url
        self.proxy_auth = "Basic " + base64.b64encode(
            f"{proxy_user}:{proxy_pass}".encode()
        ).decode()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            proxy_url=crawler.settings.get("RESPROXY_URL", "http://gate.resproxy.io:7777"),
            proxy_user=crawler.settings.get("RESPROXY_USER", ""),
            proxy_pass=crawler.settings.get("RESPROXY_PASS", ""),
        )

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy_url
        request.headers["Proxy-Authorization"] = self.proxy_auth
        logger.debug(f"Proxying {request.url} through {self.proxy_url}")

    def process_response(self, request, response, spider):
        if response.status == 407:
            logger.error("Proxy authentication failed — check credentials")
        return response

    def process_exception(self, request, exception, spider):
        logger.warning(f"Proxy error for {request.url}: {exception}")
        return None  # Let Scrapy's retry middleware handle the request
```

Now enable the middleware in your settings.py:

```python
# settings.py

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ResProxyMiddleware": 350,
}

# Proxy configuration
RESPROXY_URL = "http://gate.resproxy.io:7777"
RESPROXY_USER = "your_username"
RESPROXY_PASS = "your_password"

# Recommended Scrapy settings for proxy usage
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 1
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 3
RETRY_HTTP_CODES = [407, 429, 500, 502, 503]

# robots.txt compliance (set to True to respect robots.txt)
ROBOTSTXT_OBEY = False

# User agent (rotate this in production)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
```

With this setup, every request from every spider in your project automatically goes through the rotating proxy. No changes needed in your spider code.
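If you want a quick sanity check that your credentials are encoded correctly, the Proxy-Authorization value the middleware builds is plain HTTP Basic auth and can be reproduced in isolation (the helper name here is ours, not a Scrapy API):

```python
import base64


def basic_proxy_auth(user: str, password: str) -> str:
    """Build the Proxy-Authorization header value the middleware sets."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


print(basic_proxy_auth("user", "pass"))  # Basic dXNlcjpwYXNz
```

Comparing this output against what a tool like curl sends with `-x http://user:pass@...` is a fast way to rule out credential-encoding issues before blaming the proxy.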

Method 3: Rotating Proxy Middleware with Failover

For production workloads, you want a middleware that can handle proxy failures and rotate through multiple endpoints:

```python
import base64
import logging
import random

logger = logging.getLogger(__name__)


class RotatingProxyMiddleware:
    """Advanced middleware with multiple proxy endpoints and failover."""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.failed_proxies = set()

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.get("PROXY_LIST", [
            {
                "url": "http://gate.resproxy.io:7777",
                "user": "your_username",
                "pass": "your_password",
            },
        ])
        return cls(proxy_list)

    def process_request(self, request, spider):
        available = [
            p for p in self.proxy_list if p["url"] not in self.failed_proxies
        ]
        if not available:
            # All endpoints have failed at least once; reset and start over
            self.failed_proxies.clear()
            available = self.proxy_list

        proxy = random.choice(available)
        request.meta["proxy"] = proxy["url"]
        credentials = base64.b64encode(
            f'{proxy["user"]}:{proxy["pass"]}'.encode()
        ).decode()
        request.headers["Proxy-Authorization"] = f"Basic {credentials}"

    def process_response(self, request, response, spider):
        if response.status in (407, 502, 503):
            proxy_url = request.meta.get("proxy", "")
            self.failed_proxies.add(proxy_url)
            logger.warning(f"Proxy {proxy_url} returned {response.status}, marking as failed")
            return request.replace(dont_filter=True)
        return response

    def process_exception(self, request, exception, spider):
        proxy_url = request.meta.get("proxy", "")
        self.failed_proxies.add(proxy_url)
        logger.warning(f"Proxy {proxy_url} raised exception: {exception}")
        return request.replace(dont_filter=True)
```

This middleware tracks failing proxies and avoids them on subsequent requests, then resets the failed list when all proxies have been tried.
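Assuming the middleware above lives in `myproject/middlewares.py`, a matching settings.py fragment might look like the following (endpoint ports and credentials are placeholders):

```python
# settings.py — sketch for the rotating middleware above

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}

# Each entry carries its own credentials so endpoints can differ per provider
PROXY_LIST = [
    {"url": "http://gate.resproxy.io:7777", "user": "your_username", "pass": "your_password"},
    {"url": "http://gate.resproxy.io:7778", "user": "your_username", "pass": "your_password"},
]
```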

[Image: Scrapy proxy settings optimization]

Integrating with scrapy-rotating-proxies

If you prefer a battle-tested third-party solution, the scrapy-rotating-proxies package handles rotation, banning, and cooldown logic out of the box:

```bash
pip install scrapy-rotating-proxies
```

```python
# settings.py

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}

ROTATING_PROXY_LIST = [
    "http://user:pass@gate.resproxy.io:7777",
    "http://user:pass@gate.resproxy.io:7778",
    "http://user:pass@gate.resproxy.io:7779",
]

ROTATING_PROXY_PAGE_RETRY_TIMES = 5
ROTATING_PROXY_BACKOFF_BASE = 300
```

Handling Common Errors

Here are the most common proxy-related errors in Scrapy and how to handle them:

| Error | Cause | Solution |
| --- | --- | --- |
| 407 Proxy Auth Required | Wrong credentials | Check RESPROXY_USER and RESPROXY_PASS |
| 429 Too Many Requests | Rate limited | Increase DOWNLOAD_DELAY |
| 503 Service Unavailable | Proxy overloaded | Add retry logic or use multiple endpoints |
| Timeout | Slow proxy or target | Increase DOWNLOAD_TIMEOUT |
| ConnectionRefused | Proxy down | Use failover middleware |
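The table's decision logic can also be sketched as a tiny helper for use in an errback or monitoring hook (the function and its return labels are ours, not a Scrapy API):

```python
def proxy_error_action(status=None, connection_failed=False):
    """Map a proxy-related failure to the remediation from the table above."""
    if connection_failed:
        return "use failover middleware"   # proxy down / connection refused
    if status == 407:
        return "check credentials"
    if status == 429:
        return "increase DOWNLOAD_DELAY"
    if status == 503:
        return "retry or rotate endpoints"
    return "no action"


print(proxy_error_action(status=429))  # increase DOWNLOAD_DELAY
```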

Optimizing Performance

Scrapy's async architecture can push thousands of requests per minute through rotating proxies. Here are the key settings to tune:

```python
# settings.py — optimized for proxy usage

# Concurrent connections
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Delay between requests to the same domain
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True  # Multiplies the delay by a random 0.5x-1.5x factor

# Timeout and retries
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 3

# AutoThrottle (recommended with proxies)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0

# Enable HTTP caching to avoid re-fetching unchanged pages
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600
```

The AutoThrottle extension is particularly useful with proxies because it automatically adjusts request speed based on observed server response times.
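Roughly speaking, AutoThrottle nudges each download slot's delay toward latency / target_concurrency after every response; a simplified sketch of that update rule (our approximation, not the exact Scrapy source):

```python
def autothrottle_next_delay(current_delay, latency, target_concurrency,
                            min_delay, max_delay):
    """Approximate AutoThrottle's per-response delay update."""
    target_delay = latency / target_concurrency
    # Move halfway toward the target delay, but never below it
    new_delay = max(target_delay, (current_delay + target_delay) / 2.0)
    # Clamp to the configured AUTOTHROTTLE_START_DELAY / AUTOTHROTTLE_MAX_DELAY bounds
    return min(max(min_delay, new_delay), max_delay)


# A slow response (16 s latency at target concurrency 8) pushes the delay up to 2 s
print(autothrottle_next_delay(0.5, 16.0, 8.0, 1.0, 10.0))  # 2.0
```

This is why slow or overloaded proxies naturally slow the crawl down instead of triggering timeout cascades.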

Complete Spider Example

Here is a full spider that scrapes product data using the proxy middleware:

```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 1,
    }

    def parse(self, response):
        for product in response.css("div.product-card"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }

        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Notice how the spider has zero proxy-related code. The middleware handles everything transparently.

Best Practices

  1. Use middleware over meta — Keep proxy logic separate from spider logic
  2. Enable AutoThrottle — Let Scrapy adapt to target site speed automatically
  3. Set reasonable concurrency — 16-32 concurrent requests is a good starting point
  4. Monitor ban rates — If more than 10 percent of requests fail, reduce concurrency or increase delays
  5. Use residential proxies for protected sites — Rotating residential proxies have the highest success rates
  6. Cache responses — Avoid re-fetching pages that have not changed

Getting Started

Create your ResProxy account through the getting started guide, grab your credentials, and paste them into your Scrapy settings. For the full Scrapy documentation including middleware architecture details, visit docs.scrapy.org.

You might also want to check our Python web scraping guide for a broader overview of scraping tools and techniques.


Editorial Team

The ResProxy editorial team combines expertise in proxy technology, web scraping, network infrastructure, and online privacy to deliver actionable guides and industry insights.