When I first opened Scrapy's settings.py, I panicked. There were like 200 settings. I had no idea which ones mattered and which ones I could ignore.
I spent days tweaking random settings, hoping they'd make my spider faster or more reliable. Most of them did nothing. Some made things worse.
After scraping hundreds of websites, I finally figured out which settings actually matter and which are just noise. Let me save you the trial and error.
The Big Picture: Three Levels of Settings
Scrapy has three levels of settings, and understanding this is crucial:
Level 1: Default Settings (in Scrapy's code)
- The baseline for everything
- You never edit these directly
- Located in Scrapy's source code
Level 2: Project Settings (your settings.py)
- Apply to ALL spiders in your project
- Most common place to configure things
- Located in your project's settings.py
Level 3: Spider Settings (inside spider class)
- Apply only to that specific spider
- Override project settings
- Set via the custom_settings attribute
Priority: Spider settings > Project settings > Default settings
# Project settings (settings.py)
DOWNLOAD_DELAY = 2

# Spider settings (overrides project settings)
class FastSpider(scrapy.Spider):
    name = 'fast'
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5  # This spider goes faster
    }
The 15 Settings That Actually Matter
Let me cut through the noise. Here are the settings you'll actually use.
1. BOT_NAME (Identity)
What it does: Identifies your bot in User-Agent and logs.
Default: Your project name
When to change: Always. Make it descriptive.
BOT_NAME = 'my_company_scraper'
Why it matters: Helps websites identify your bot. Be honest about who you are.
2. USER_AGENT (How You Identify Yourself)
What it does: The User-Agent string sent with every request.
Default: Scrapy/VERSION (+https://scrapy.org)
When to change: Always. The default screams "I'm a bot!"
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
What the docs don't tell you:
- Many websites block the default Scrapy User-Agent immediately
- Use a real browser's User-Agent
- Rotate User-Agents for serious scraping (use a middleware)
# Better: Rotate User-Agents
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
# Use a middleware to rotate these
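If you want a concrete idea of what that rotation middleware could look like, here's a minimal sketch; the class name, module path, and priority are placeholders, not Scrapy built-ins:
import random

class RandomUserAgentMiddleware:
    """Pick a random User-Agent from the USER_AGENTS list for each request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)

# Enable it in settings.py (module path and priority are placeholders):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RandomUserAgentMiddleware': 400,
# }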
3. ROBOTSTXT_OBEY (Respect robots.txt)
What it does: Whether to respect robots.txt rules.
Default: False in Scrapy's own defaults, but the settings.py generated by scrapy startproject sets it to True.
When to change: Only if you have permission to ignore it.
ROBOTSTXT_OBEY = True # Be respectful
What the docs don't tell you:
- Setting this to False doesn't make you invisible - websites can still block you by IP or behavior
- Respecting robots.txt is just being a good citizen
4. CONCURRENT_REQUESTS (How Many at Once)
What it does: Maximum number of concurrent requests.
Default: 16
When to change: When you need to go faster or slower.
# Go slower (more polite)
CONCURRENT_REQUESTS = 4
# Go faster (more aggressive)
CONCURRENT_REQUESTS = 32
What the docs don't tell you:
- Higher isn't always faster (websites throttle you)
- Start at 8-16 for most sites
- Increase slowly and monitor for blocks
- Large sites can handle more; small sites can't
5. CONCURRENT_REQUESTS_PER_DOMAIN (Per-Site Limit)
What it does: Maximum concurrent requests to a single domain.
Default: 8
When to change: When scraping multiple domains or being extra polite.
# Be more polite
CONCURRENT_REQUESTS_PER_DOMAIN = 2
# More aggressive (for large sites)
CONCURRENT_REQUESTS_PER_DOMAIN = 16
What the docs don't tell you:
- This matters MORE than CONCURRENT_REQUESTS
- Set lower for small websites
- Large e-commerce sites can handle 16+
- News sites often need 2-4
6. DOWNLOAD_DELAY (Time Between Requests)
What it does: Minimum time (in seconds) between requests to same domain.
Default: 0 (no delay)
When to change: Always. Zero delay is aggressive.
# Polite scraping
DOWNLOAD_DELAY = 2
# Aggressive (use with caution)
DOWNLOAD_DELAY = 0.5
# Very polite (for fragile sites)
DOWNLOAD_DELAY = 5
What the docs don't tell you:
- This is your #1 defense against getting blocked
- Random delays are better than fixed delays
- Use AutoThrottle instead for smart delays
- Most sites are happy with 1-3 seconds
Better approach:
# Random delay between 1-3 seconds
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True # Adds ±50% randomness
7. AUTOTHROTTLE (Smart Speed Control)
What it does: Automatically adjusts speed based on site's response time.
Default: Disabled
When to change: Enable for production scrapers.
# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
What the docs don't tell you:
- AutoThrottle is better than fixed delays
- It automatically slows down if site is struggling
- Speeds up if site is fast
- AUTOTHROTTLE_TARGET_CONCURRENCY is the average number of parallel requests Scrapy aims for per remote site
- Start with 1.0-2.0 for TARGET_CONCURRENCY
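One extra switch I find useful while tuning: AUTOTHROTTLE_DEBUG logs the delay and concurrency chosen for every response, so you can see what AutoThrottle is actually doing. Turn it off once you're happy:
AUTOTHROTTLE_DEBUG = True  # log throttling stats for every response while tuning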
8. COOKIES_ENABLED (Cookie Handling)
What it does: Whether to handle cookies automatically.
Default: True
When to change: Rarely. Usually you want this enabled.
COOKIES_ENABLED = True # Let Scrapy handle cookies
What the docs don't tell you:
- Scrapy handles cookies automatically per spider
- Each spider has its own cookie jar
- Disable only if cookies cause problems (rare)
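If a single request shouldn't touch the cookie jar (say, a status endpoint or static asset), the dont_merge_cookies meta key handles that without turning cookies off globally; the URL below is just an example:
# Inside a spider callback: bypass the cookie jar for this one request
yield scrapy.Request(response.urljoin('/status'), meta={'dont_merge_cookies': True})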
9. HTTPCACHE_ENABLED (Speed Up Development)
What it does: Caches responses to avoid re-downloading during development.
Default: False
When to change: Enable during development, disable in production.
# Development
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Never expire
# Production
HTTPCACHE_ENABLED = False
What the docs don't tell you:
- This is a lifesaver during development
- Run once to populate cache, then test infinitely
- Remember to clear cache when site structure changes
- Don't deploy with cache enabled!
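Clearing the cache just means deleting the cache directory; by default it lives inside your project's .scrapy folder, controlled by HTTPCACHE_DIR:
HTTPCACHE_DIR = 'httpcache'  # the default; resolved relative to the project's .scrapy data dir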
10. LOG_LEVEL (How Verbose)
What it does: Controls what gets logged.
Default: 'DEBUG'
When to change: Set to INFO for production.
# Development (see everything)
LOG_LEVEL = 'DEBUG'
# Production (only important stuff)
LOG_LEVEL = 'INFO'
# Only problems
LOG_LEVEL = 'WARNING'
Levels, from most to least verbose: DEBUG > INFO > WARNING > ERROR > CRITICAL
11. ITEM_PIPELINES (Data Processing)
What it does: Enables and orders your pipelines.
Default: {} (no pipelines)
When to change: When you have pipelines.
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 100,
    'myproject.pipelines.CleaningPipeline': 200,
    'myproject.pipelines.DatabasePipeline': 300,
}
What the docs don't tell you:
- Lower numbers run first
- Use multiples of 100 (leaves room to insert pipelines later)
- Order matters! Clean before saving
- Set a pipeline's value to None to disable it
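For context, each entry in that dict is just a class with a process_item method. A minimal sketch of a validation pipeline (the class name and the 'title' field are illustrative):
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        # Runs first (priority 100), so invalid items never reach later pipelines
        if not item.get('title'):
            raise DropItem('Missing title')
        return item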
12. RETRY Settings (Handling Failures)
What it does: Controls automatic retries of failed requests.
Defaults:
RETRY_ENABLED = True
RETRY_TIMES = 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
When to change: Increase retries for unreliable sites.
# More aggressive retries
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 403]
What the docs don't tell you:
- 429 (rate limited) is already retried by default, since it's in RETRY_HTTP_CODES
- Add 403 if you're getting temporarily blocked
- Don't retry 404s (page truly doesn't exist)
- Scrapy's built-in retry middleware re-queues failed requests at lower priority (RETRY_PRIORITY_ADJUST); it doesn't add exponential backoff delays
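There's also a per-request escape hatch: the dont_retry meta key skips the retry middleware entirely, which is handy for optional resources that aren't worth a second attempt (the URL is a placeholder):
# Don't spend retries on an optional resource
yield scrapy.Request('https://example.com/optional.json', meta={'dont_retry': True})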
13. FEED_EXPORT_ENCODING (Output Encoding)
What it does: Character encoding for exported files.
Default: None, which behaves as UTF-8 for every format except JSON, where output falls back to escaped ASCII.
When to change: When dealing with non-English text.
FEED_EXPORT_ENCODING = 'utf-8' # Handles all languages
What the docs don't tell you:
- utf-8 handles 99% of use cases
- Windows might need 'utf-8-sig' for Excel compatibility
- Never use 'ascii' unless you only have English
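On Scrapy 2.1+ you can also set the encoding per feed through the FEEDS setting; the filename here is just an example:
FEEDS = {
    'items.csv': {
        'format': 'csv',
        'encoding': 'utf-8-sig',  # BOM so Excel on Windows picks up UTF-8
    },
}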
14. DNS_TIMEOUT (Resolve Timeouts)
What it does: Timeout for DNS lookups.
Default: 60 seconds
When to change: Lower for faster failure detection.
DNS_TIMEOUT = 10 # Fail fast on DNS issues
15. DOWNLOAD_TIMEOUT (Page Load Timeout)
What it does: Timeout for downloading pages.
Default: 180 seconds
When to change: Lower for faster scraping.
DOWNLOAD_TIMEOUT = 30 # 30 seconds max per page
What the docs don't tell you:
- 180 seconds is way too long
- 15-30 seconds is reasonable for most sites
- If pages take longer, the site might be blocking you
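You can also raise the limit for a single known-slow endpoint via the download_timeout meta key instead of loosening the global setting (slow_report_url is a placeholder):
# Give one slow endpoint more time without changing DOWNLOAD_TIMEOUT globally
yield scrapy.Request(slow_report_url, meta={'download_timeout': 60})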
Command-Line Overrides
You can override settings from the command line:
# Change log level
scrapy crawl myspider -s LOG_LEVEL=WARNING
# Change download delay
scrapy crawl myspider -s DOWNLOAD_DELAY=5
# Multiple settings
scrapy crawl myspider -s LOG_LEVEL=INFO -s DOWNLOAD_DELAY=2
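If you're ever unsure which value wins after all these layers, the scrapy settings command prints the resolved project-level value (spider custom_settings still apply only at crawl time):
# Check what Scrapy will actually use
scrapy settings --get DOWNLOAD_DELAY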
Spider-Level Settings (Per-Spider Configuration)
Override settings for specific spiders:
class FastSpider(scrapy.Spider):
    name = 'fast'
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5,
        'CONCURRENT_REQUESTS': 32,
        'LOG_LEVEL': 'WARNING'
    }

class SlowSpider(scrapy.Spider):
    name = 'slow'
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True
    }
Settings That DON'T Matter (Usually)
These settings exist but you'll rarely touch them:
- SPIDER_LOADER_CLASS (only for advanced customization)
- STATS_CLASS (default is fine)
- TELNETCONSOLE_ENABLED (debugging feature)
- MEMDEBUG_ENABLED (only for memory debugging)
- DEPTH_PRIORITY (default works)
My Recommended Settings Template
Here's my starting point for most projects:
# settings.py
# Identity
BOT_NAME = 'my_scraper'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
# Politeness
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
# Or use AutoThrottle instead
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Cookies
COOKIES_ENABLED = True
# Retries
RETRY_ENABLED = True
RETRY_TIMES = 3
# Timeouts
DNS_TIMEOUT = 10
DOWNLOAD_TIMEOUT = 30
# Development only
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
# Logging
LOG_LEVEL = 'INFO'
# Pipelines (add yours)
ITEM_PIPELINES = {
    # 'myproject.pipelines.MyPipeline': 300,
}
Settings for Different Scenarios
Scenario 1: Fast Scraping (Big Sites)
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.5
AUTOTHROTTLE_ENABLED = False # We're going fast
Scenario 2: Polite Scraping (Small Sites)
CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
Scenario 3: Development/Testing
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
LOG_LEVEL = 'DEBUG'
CONCURRENT_REQUESTS = 1 # Easier to debug
Scenario 4: Production
HTTPCACHE_ENABLED = False
LOG_LEVEL = 'INFO'
LOG_FILE = 'spider.log'
AUTOTHROTTLE_ENABLED = True
RETRY_TIMES = 5
Common Mistakes
Mistake #1: Setting Everything
# DON'T do this
COOKIES_DEBUG = True
SPIDER_MIDDLEWARES_BASE = {...}
DOWNLOADER_MIDDLEWARES_BASE = {...}
# ... 50 more settings
Most settings have good defaults. Only change what you need.
Mistake #2: No Delays
# BAD (will get blocked)
DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS = 64
This is asking to get blocked.
Mistake #3: Cache in Production
# BAD (stale data in production)
HTTPCACHE_ENABLED = True
Cache is for development only!
Quick Reference
Speed (from slow to fast)
# Very slow/polite
DOWNLOAD_DELAY = 5
CONCURRENT_REQUESTS = 2
# Normal
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 8
# Fast
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS = 32
Essential Settings Checklist
- [ ] Set USER_AGENT to real browser
- [ ] Set DOWNLOAD_DELAY or enable AUTOTHROTTLE
- [ ] Set CONCURRENT_REQUESTS appropriately
- [ ] Enable HTTPCACHE for development
- [ ] Disable HTTPCACHE for production
- [ ] Set LOG_LEVEL to INFO for production
- [ ] Configure ITEM_PIPELINES if you have any
Summary
The 5 settings you'll change most:
- USER_AGENT (always)
- DOWNLOAD_DELAY or AUTOTHROTTLE (always)
- CONCURRENT_REQUESTS (based on site)
- HTTPCACHE_ENABLED (dev vs prod)
- LOG_LEVEL (dev vs prod)
Start simple:
- Use my recommended template
- Adjust delays based on site response
- Add settings as you need them
- Don't prematurely optimize
Remember:
- Most settings have good defaults
- Slower scraping = more reliable scraping
- Test before deploying
- Monitor and adjust
Happy scraping! 🕷️