Muhammad Ikramullah Khan

Scrapy Settings Deep Dive: The Complete Guide (What Actually Matters)

When I first opened Scrapy's settings.py, I panicked. There were like 200 settings. I had no idea which ones mattered and which ones I could ignore.

I spent days tweaking random settings, hoping they'd make my spider faster or more reliable. Most of them did nothing. Some made things worse.

After scraping hundreds of websites, I finally figured out which settings actually matter and which are just noise. Let me save you the trial and error.


The Big Picture: Three Levels of Settings

Scrapy has three levels of settings, and understanding this is crucial:

Level 1: Default Settings (in Scrapy's code)

  • The baseline for everything
  • You never edit these directly
  • Located in Scrapy's source code

Level 2: Project Settings (your settings.py)

  • Apply to ALL spiders in your project
  • Most common place to configure things
  • Located in your project's settings.py

Level 3: Spider Settings (inside spider class)

  • Apply only to that specific spider
  • Override project settings
  • Set via custom_settings attribute

Priority: command-line overrides (-s) > spider custom_settings > project settings.py > Scrapy's defaults

# Project settings (settings.py)
DOWNLOAD_DELAY = 2

# Spider settings (overrides project settings)
class FastSpider(scrapy.Spider):
    name = 'fast'
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5  # This spider goes faster
    }
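If you're ever unsure which value won, every spider can read the merged result through its settings attribute once it's running. A quick sanity check:

import scrapy

class FastSpider(scrapy.Spider):
    name = 'fast'
    custom_settings = {'DOWNLOAD_DELAY': 0.5}

    def parse(self, response):
        # self.settings holds the final, merged settings
        self.logger.info('Effective delay: %s',
                         self.settings.getfloat('DOWNLOAD_DELAY'))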

The 15 Settings That Actually Matter

Let me cut through the noise. Here are the settings you'll actually use.

1. BOT_NAME (Identity)

What it does: Names your bot. Scrapy uses it in log output, and it's the obvious thing to include in a custom User-Agent.

Default: Your project name

When to change: Always. Make it descriptive.

BOT_NAME = 'my_company_scraper'

Why it matters: Helps websites identify your bot. Be honest about who you are.


2. USER_AGENT (How You Identify Yourself)

What it does: The User-Agent string sent with every request.

Default: Scrapy/VERSION (+https://scrapy.org)

When to change: Always. The default screams "I'm a bot!"

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

What the docs don't tell you:

  • Many websites block the default Scrapy User-Agent immediately
  • Use a real browser's User-Agent
  • Rotate User-Agents for serious scraping (use a middleware)
# Better: Rotate User-Agents
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
# Use a middleware to rotate these
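Scrapy won't read a USER_AGENTS list by itself; a small downloader middleware has to do the rotating. Here's a minimal sketch (RotateUserAgentMiddleware and the USER_AGENTS setting are my own names, not Scrapy built-ins):

# middlewares.py -- minimal sketch, names are illustrative
import random

class RotateUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Stamp every outgoing request with a random User-Agent
        request.headers['User-Agent'] = random.choice(self.user_agents)

Then enable it in settings.py, and switch off the built-in User-Agent middleware so it doesn't fight yours:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
}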

3. ROBOTSTXT_OBEY (Respect robots.txt)

What it does: Whether to respect robots.txt rules.

Default: True in generated projects (the startproject template enables it; Scrapy's own built-in default is False)

When to change: Only if you have permission to ignore it.

ROBOTSTXT_OBEY = True  # Be respectful

What the docs don't tell you:

  • Setting this to False doesn't make you invisible
  • Websites can still block you by IP or behavior
  • Respecting robots.txt is just being a good citizen

4. CONCURRENT_REQUESTS (How Many at Once)

What it does: Maximum number of concurrent requests.

Default: 16

When to change: When you need to go faster or slower.

# Go slower (more polite)
CONCURRENT_REQUESTS = 4

# Go faster (more aggressive)
CONCURRENT_REQUESTS = 32

What the docs don't tell you:

  • Higher isn't always faster (websites throttle you)
  • Start at 8-16 for most sites
  • Increase slowly and monitor for blocks
  • Large sites can handle more; small sites can't

5. CONCURRENT_REQUESTS_PER_DOMAIN (Per-Site Limit)

What it does: Maximum concurrent requests to a single domain.

Default: 8

When to change: When scraping multiple domains or being extra polite.

# Be more polite
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# More aggressive (for large sites)
CONCURRENT_REQUESTS_PER_DOMAIN = 16

What the docs don't tell you:

  • This matters MORE than CONCURRENT_REQUESTS
  • Set lower for small websites
  • Large e-commerce sites can handle 16+
  • News sites often need 2-4

6. DOWNLOAD_DELAY (Time Between Requests)

What it does: Minimum time (in seconds) between requests to the same domain.

Default: 0 (no delay)

When to change: Always. Zero delay is aggressive.

# Polite scraping
DOWNLOAD_DELAY = 2

# Aggressive (use with caution)
DOWNLOAD_DELAY = 0.5

# Very polite (for fragile sites)
DOWNLOAD_DELAY = 5

What the docs don't tell you:

  • This is your #1 defense against getting blocked
  • Random delays are better than fixed delays
  • Use AutoThrottle instead for smart delays
  • Most sites are happy with 1-3 seconds

Better approach:

# Random delay between 1-3 seconds
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True  # Waits 0.5x to 1.5x of DOWNLOAD_DELAY

7. AUTOTHROTTLE (Smart Speed Control)

What it does: Automatically adjusts speed based on site's response time.

Default: Disabled

When to change: Enable for production scrapers.

# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

What the docs don't tell you:

  • AutoThrottle is better than fixed delays
  • It automatically slows down if site is struggling
  • Speeds up if site is fast
  • TARGET_CONCURRENCY is the average number of parallel requests Scrapy aims to keep open per remote site
  • Start with 1.0-2.0 for TARGET_CONCURRENCY
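Two more things worth knowing: AutoThrottle never goes below DOWNLOAD_DELAY (it's treated as the floor), and there's a debug flag for watching its decisions:

AUTOTHROTTLE_DEBUG = True  # Log latency/delay stats for every response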

8. COOKIES_ENABLED (Cookie Handling)

What it does: Whether to handle cookies automatically.

Default: True

When to change: Rarely. Usually you want this enabled.

COOKIES_ENABLED = True  # Let Scrapy handle cookies

What the docs don't tell you:

  • Scrapy handles cookies automatically per spider
  • Each spider has its own cookie jar
  • Disable only if cookies cause problems (rare) — or skip them per request, as shown below
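Scrapy's dont_merge_cookies meta key lets individual requests bypass the jar without turning cookies off globally (url and parse_page below are placeholders for your own):

# Bypass the cookie jar for a single request
yield scrapy.Request(
    url,
    meta={'dont_merge_cookies': True},
    callback=self.parse_page,
)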

9. HTTPCACHE_ENABLED (Speed Up Development)

What it does: Caches responses to avoid re-downloading during development.

Default: False

When to change: Enable during development, disable in production.

# Development
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0  # Never expire

# Production
HTTPCACHE_ENABLED = False

What the docs don't tell you:

  • This is a lifesaver during development
  • Run once to populate cache, then test infinitely
  • Remember to clear the cache when the site structure changes (see below for where it lives)
  • Don't deploy with cache enabled!
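The cache lives on disk inside your project's .scrapy directory by default; delete that folder to reset it. The relevant settings, shown at their defaults:

HTTPCACHE_DIR = 'httpcache'  # Relative to the .scrapy directory
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'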

10. LOG_LEVEL (How Verbose)

What it does: Controls what gets logged.

Default: 'DEBUG'

When to change: Set to INFO for production.

# Development (see everything)
LOG_LEVEL = 'DEBUG'

# Production (only important stuff)
LOG_LEVEL = 'INFO'

# Only problems
LOG_LEVEL = 'WARNING'

Levels, from most to least verbose: DEBUG, INFO, WARNING, ERROR, CRITICAL
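In production, pair it with LOG_FILE so the output survives the run:

LOG_LEVEL = 'INFO'
LOG_FILE = 'spider.log'  # Logs go to this file instead of the console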


11. ITEM_PIPELINES (Data Processing)

What it does: Enables and orders your pipelines.

Default: {} (no pipelines)

When to change: When you have pipelines.

ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 100,
    'myproject.pipelines.CleaningPipeline': 200,
    'myproject.pipelines.DatabasePipeline': 300,
}

What the docs don't tell you:

  • Lower numbers run first
  • Use multiples of 100 (leaves room to insert pipelines later)
  • Order matters! Clean before saving
  • Set a pipeline's value to None to disable it (sketch below)
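That last point works per spider too. Dict settings merge per key across priority levels, so a spider can switch off one inherited pipeline without redefining the rest (a sketch reusing the pipeline names from above):

class NoDatabaseSpider(scrapy.Spider):
    name = 'no_db'
    custom_settings = {
        'ITEM_PIPELINES': {
            # Keep validation and cleaning, drop the database write
            'myproject.pipelines.DatabasePipeline': None,
        }
    }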

12. RETRY Settings (Handling Failures)

What it does: Controls automatic retries of failed requests.

Defaults:

  • RETRY_ENABLED = True
  • RETRY_TIMES = 2
  • RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

When to change: Increase retries for unreliable sites.

# More aggressive retries
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 403]

What the docs don't tell you:

  • 429 (rate limit) already retries by default
  • Add 403 if you're getting temporarily blocked
  • Don't retry 404s (page truly doesn't exist)
  • Retried requests are just re-queued with slightly lower priority (RETRY_PRIORITY_ADJUST, default -1); the built-in middleware does not add exponential backoff

13. FEED_EXPORT_ENCODING (Output Encoding)

What it does: Character encoding for exported files.

Default: None, which means UTF-8 for every format except JSON, where non-ASCII characters get escaped

When to change: When dealing with non-English text.

FEED_EXPORT_ENCODING = 'utf-8'  # Handles all languages

What the docs don't tell you:

  • utf-8 handles 99% of use cases
  • Windows might need 'utf-8-sig' for Excel compatibility (example below)
  • Never use 'ascii' unless you only have English
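If your CSV exports open as garbage in Excel, the BOM variant is the usual fix:

FEED_EXPORT_ENCODING = 'utf-8-sig'  # UTF-8 with a BOM; Excel reads it correctly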

14. DNS_TIMEOUT (Resolve Timeouts)

What it does: Timeout for DNS lookups.

Default: 60 seconds

When to change: Lower for faster failure detection.

DNS_TIMEOUT = 10  # Fail fast on DNS issues

15. DOWNLOAD_TIMEOUT (Page Load Timeout)

What it does: Timeout for downloading pages.

Default: 180 seconds

When to change: Lower for faster scraping.

DOWNLOAD_TIMEOUT = 30  # 30 seconds max per page

What the docs don't tell you:

  • 180 seconds is way too long
  • 15-30 seconds is reasonable for most sites
  • If pages take longer, the site might be blocking you

Command-Line Overrides

You can override settings from the command line:

# Change log level
scrapy crawl myspider -s LOG_LEVEL=WARNING

# Change download delay
scrapy crawl myspider -s DOWNLOAD_DELAY=5

# Multiple settings
scrapy crawl myspider -s LOG_LEVEL=INFO -s DOWNLOAD_DELAY=2

Spider-Level Settings (Per-Spider Configuration)

Override settings for specific spiders:

class FastSpider(scrapy.Spider):
    name = 'fast'

    custom_settings = {
        'DOWNLOAD_DELAY': 0.5,
        'CONCURRENT_REQUESTS': 32,
        'LOG_LEVEL': 'WARNING'
    }

class SlowSpider(scrapy.Spider):
    name = 'slow'

    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True
    }

Settings That DON'T Matter (Usually)

These settings exist but you'll rarely touch them:

  • SPIDER_LOADER_CLASS (only for advanced customization)
  • STATS_CLASS (default is fine)
  • TELNETCONSOLE_ENABLED (debugging feature)
  • MEMDEBUG_ENABLED (only for memory debugging)
  • DEPTH_PRIORITY (default works)

My Recommended Settings Template

Here's my starting point for most projects:

# settings.py

# Identity
BOT_NAME = 'my_scraper'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

# Politeness
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Or lean on AutoThrottle instead (DOWNLOAD_DELAY above acts as its minimum)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Cookies
COOKIES_ENABLED = True

# Retries
RETRY_ENABLED = True
RETRY_TIMES = 3

# Timeouts
DNS_TIMEOUT = 10
DOWNLOAD_TIMEOUT = 30

# Development only
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0

# Logging
LOG_LEVEL = 'INFO'

# Pipelines (add yours)
ITEM_PIPELINES = {
    # 'myproject.pipelines.MyPipeline': 300,
}

Settings for Different Scenarios

Scenario 1: Fast Scraping (Big Sites)

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.5
AUTOTHROTTLE_ENABLED = False  # We're going fast

Scenario 2: Polite Scraping (Small Sites)

CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True

Scenario 3: Development/Testing

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
LOG_LEVEL = 'DEBUG'
CONCURRENT_REQUESTS = 1  # Easier to debug

Scenario 4: Production

HTTPCACHE_ENABLED = False
LOG_LEVEL = 'INFO'
LOG_FILE = 'spider.log'
AUTOTHROTTLE_ENABLED = True
RETRY_TIMES = 5

Common Mistakes

Mistake #1: Setting Everything

# DON'T do this
COOKIES_DEBUG = True
SPIDER_MIDDLEWARES_BASE = {...}
DOWNLOADER_MIDDLEWARES_BASE = {...}
# ... 50 more settings

Most settings have good defaults. Only change what you need.

Mistake #2: No Delays

# BAD (will get blocked)
DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS = 64

This is asking to get blocked.

Mistake #3: Cache in Production

# BAD (stale data in production)
HTTPCACHE_ENABLED = True

Cache is for development only!
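One way to make this mistake impossible is to key the cache off an environment variable; SCRAPY_ENV here is my own convention, not something Scrapy defines:

import os

# Cache everywhere except production; SCRAPY_ENV is a made-up convention
HTTPCACHE_ENABLED = os.environ.get('SCRAPY_ENV', 'dev') != 'production'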


Quick Reference

Speed (from slow to fast)

# Very slow/polite
DOWNLOAD_DELAY = 5
CONCURRENT_REQUESTS = 2

# Normal
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 8

# Fast
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS = 32

Essential Settings Checklist

  • [ ] Set USER_AGENT to a real browser string
  • [ ] Set DOWNLOAD_DELAY or enable AUTOTHROTTLE
  • [ ] Set CONCURRENT_REQUESTS appropriately
  • [ ] Enable HTTPCACHE for development
  • [ ] Disable HTTPCACHE for production
  • [ ] Set LOG_LEVEL to INFO for production
  • [ ] Configure ITEM_PIPELINES if you have any

Summary

The 5 settings you'll change most:

  1. USER_AGENT (always)
  2. DOWNLOAD_DELAY or AUTOTHROTTLE (always)
  3. CONCURRENT_REQUESTS (based on site)
  4. HTTPCACHE_ENABLED (dev vs prod)
  5. LOG_LEVEL (dev vs prod)

Start simple:

  • Use my recommended template
  • Adjust delays based on site response
  • Add settings as you need them
  • Don't prematurely optimize

Remember:

  • Most settings have good defaults
  • Slower scraping = more reliable scraping
  • Test before deploying
  • Monitor and adjust

Happy scraping! 🕷️
