When I first opened Scrapy's settings.py, I panicked. There were like 200 settings. I had no idea which ones mattered and which ones I could ignore.
I spent days tweaking random settings, hoping they'd make my spider faster or more reliable. Most of them did nothing. Some made things worse.
After scraping hundreds of websites, I finally figured out which settings actually matter and which are just noise. Let me save you the trial and error.
The Big Picture: Three Levels of Settings
Scrapy has three levels of settings, and understanding this is crucial:
Level 1: Default Settings (in Scrapy's code)
- The baseline for everything
- You never edit these directly
- Located in Scrapy's source code
Level 2: Project Settings (your settings.py)
- Apply to ALL spiders in your project
- Most common place to configure things
- Located in your project's settings.py
Level 3: Spider Settings (inside spider class)
- Apply only to that specific spider
- Override project settings
- Set via the custom_settings attribute
Priority: Spider settings > Project settings > Default settings
# Project settings (settings.py)
DOWNLOAD_DELAY = 2

# Spider settings (overrides project settings)
class FastSpider(scrapy.Spider):
    name = 'fast'
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5  # This spider goes faster
    }
The 15 Settings That Actually Matter
Let me cut through the noise. Here are the settings you'll actually use.
1. BOT_NAME (Identity)
What it does: Identifies your bot in User-Agent and logs.
Default: Your project name
When to change: Always. Make it descriptive.
BOT_NAME = 'my_company_scraper'
Why it matters: Helps websites identify your bot. Be honest about who you are.
2. USER_AGENT (How You Identify Yourself)
What it does: The User-Agent string sent with every request.
Default: Scrapy/VERSION (+https://scrapy.org)
When to change: Always. The default screams "I'm a bot!"
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
What the docs don't tell you:
- Many websites block the default Scrapy User-Agent immediately
- Use a real browser's User-Agent
- Rotate User-Agents for serious scraping (use a middleware)
# Better: Rotate User-Agents
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
# Use a middleware to rotate these
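If you want a concrete idea of what that rotation middleware could look like, here's a minimal sketch; the class name, module path, and priority are placeholders, not Scrapy built-ins:
import random

class RandomUserAgentMiddleware:
    """Pick a random User-Agent from the USER_AGENTS list for each request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)

# Enable it in settings.py (module path and priority are placeholders):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RandomUserAgentMiddleware': 400,
# }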
3. ROBOTSTXT_OBEY (Respect robots.txt)
What it does: Whether to respect robots.txt rules.
Default: False in Scrapy's own defaults, but the settings.py generated by scrapy startproject sets it to True.
When to change: Only if you have permission to ignore it.
ROBOTSTXT_OBEY = True # Be respectful
What the docs don't tell you:
- Setting this to False doesn't make you invisible - websites can still block you by IP or behavior
- Respecting robots.txt is just being a good citizen
4. CONCURRENT_REQUESTS (How Many at Once)
What it does: Maximum number of concurrent requests.
Default: 16
When to change: When you need to go faster or slower.
# Go slower (more polite)
CONCURRENT_REQUESTS = 4
# Go faster (more aggressive)
CONCURRENT_REQUESTS = 32
What the docs don't tell you:
- Higher isn't always faster (websites throttle you)
- Start at 8-16 for most sites
- Increase slowly and monitor for blocks
- Large sites can handle more; small sites can't
5. CONCURRENT_REQUESTS_PER_DOMAIN (Per-Site Limit)
What it does: Maximum concurrent requests to a single domain.
Default: 8
When to change: When scraping multiple domains or being extra polite.
# Be more polite
CONCURRENT_REQUESTS_PER_DOMAIN = 2
# More aggressive (for large sites)
CONCURRENT_REQUESTS_PER_DOMAIN = 16
What the docs don't tell you:
- This matters MORE than CONCURRENT_REQUESTS
- Set lower for small websites
- Large e-commerce sites can handle 16+
- News sites often need 2-4
6. DOWNLOAD_DELAY (Time Between Requests)
What it does: Minimum time (in seconds) between requests to same domain.
Default: 0 (no delay)
When to change: Always. Zero delay is aggressive.
# Polite scraping
DOWNLOAD_DELAY = 2
# Aggressive (use with caution)
DOWNLOAD_DELAY = 0.5
# Very polite (for fragile sites)
DOWNLOAD_DELAY = 5
What the docs don't tell you:
- This is your #1 defense against getting blocked
- Random delays are better than fixed delays
- Use AutoThrottle instead for smart delays
- Most sites are happy with 1-3 seconds
Better approach:
# Random delay between 1-3 seconds
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True # Adds ±50% randomness
7. AUTOTHROTTLE (Smart Speed Control)
What it does: Automatically adjusts speed based on site's response time.
Default: Disabled
When to change: Enable for production scrapers.
# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
What the docs don't tell you:
- AutoThrottle is better than fixed delays
- It automatically slows down if site is struggling
- Speeds up if site is fast
- AUTOTHROTTLE_TARGET_CONCURRENCY is the average number of parallel requests Scrapy aims for per remote site
- Start with 1.0-2.0 for TARGET_CONCURRENCY
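One extra switch I find useful while tuning: AUTOTHROTTLE_DEBUG logs the delay and concurrency chosen for every response, so you can see what AutoThrottle is actually doing. Turn it off once you're happy:
AUTOTHROTTLE_DEBUG = True  # log throttling stats for every response while tuning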
8. COOKIES_ENABLED (Cookie Handling)
What it does: Whether to handle cookies automatically.
Default: True
When to change: Rarely. Usually you want this enabled.
COOKIES_ENABLED = True # Let Scrapy handle cookies
What the docs don't tell you:
- Scrapy handles cookies automatically per spider
- Each spider has its own cookie jar
- Disable only if cookies cause problems (rare)
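If a single request shouldn't touch the cookie jar (say, a status endpoint or static asset), the dont_merge_cookies meta key handles that without turning cookies off globally; the URL below is just an example:
# Inside a spider callback: bypass the cookie jar for this one request
yield scrapy.Request(response.urljoin('/status'), meta={'dont_merge_cookies': True})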
9. HTTPCACHE_ENABLED (Speed Up Development)
What it does: Caches responses to avoid re-downloading during development.
Default: False
When to change: Enable during development, disable in production.
# Development
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Never expire
# Production
HTTPCACHE_ENABLED = False
What the docs don't tell you:
- This is a lifesaver during development
- Run once to populate cache, then test infinitely
- Remember to clear cache when site structure changes
- Don't deploy with cache enabled!
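Clearing the cache just means deleting the cache directory; by default it lives inside your project's .scrapy folder, controlled by HTTPCACHE_DIR:
HTTPCACHE_DIR = 'httpcache'  # the default; resolved relative to the project's .scrapy data dir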
10. LOG_LEVEL (How Verbose)
What it does: Controls what gets logged.
Default: 'DEBUG'
When to change: Set to INFO for production.
# Development (see everything)
LOG_LEVEL = 'DEBUG'
# Production (only important stuff)
LOG_LEVEL = 'INFO'
# Only problems
LOG_LEVEL = 'WARNING'
Levels, from most to least verbose: DEBUG > INFO > WARNING > ERROR > CRITICAL
11. ITEM_PIPELINES (Data Processing)
What it does: Enables and orders your pipelines.
Default: {} (no pipelines)
When to change: When you have pipelines.
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 100,
    'myproject.pipelines.CleaningPipeline': 200,
    'myproject.pipelines.DatabasePipeline': 300,
}
What the docs don't tell you:
- Lower numbers run first
- Use multiples of 100 (leaves room to insert pipelines later)
- Order matters! Clean before saving
- Set a pipeline's value to None to disable it
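For context, each entry in that dict is just a class with a process_item method. A minimal sketch of a validation pipeline (the class name and the 'title' field are illustrative):
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        # Runs first (priority 100), so invalid items never reach later pipelines
        if not item.get('title'):
            raise DropItem('Missing title')
        return item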
12. RETRY Settings (Handling Failures)
What it does: Controls automatic retries of failed requests.
Defaults:
RETRY_ENABLED = True
RETRY_TIMES = 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
When to change: Increase retries for unreliable sites.
# More aggressive retries
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 403]
What the docs don't tell you:
- 429 (rate limited) is already retried by default, since it's in RETRY_HTTP_CODES
- Add 403 if you're getting temporarily blocked
- Don't retry 404s (page truly doesn't exist)
- Scrapy's built-in retry middleware re-queues failed requests at lower priority (RETRY_PRIORITY_ADJUST); it doesn't add exponential backoff delays
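There's also a per-request escape hatch: the dont_retry meta key skips the retry middleware entirely, which is handy for optional resources that aren't worth a second attempt (the URL is a placeholder):
# Don't spend retries on an optional resource
yield scrapy.Request('https://example.com/optional.json', meta={'dont_retry': True})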
13. FEED_EXPORT_ENCODING (Output Encoding)
What it does: Character encoding for exported files.
Default: None, which behaves as UTF-8 for every format except JSON, where output falls back to escaped ASCII.
When to change: When dealing with non-English text.
FEED_EXPORT_ENCODING = 'utf-8' # Handles all languages
What the docs don't tell you:
- utf-8 handles 99% of use cases
- Windows might need 'utf-8-sig' for Excel compatibility
- Never use 'ascii' unless you only have English
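On Scrapy 2.1+ you can also set the encoding per feed through the FEEDS setting; the filename here is just an example:
FEEDS = {
    'items.csv': {
        'format': 'csv',
        'encoding': 'utf-8-sig',  # BOM so Excel on Windows picks up UTF-8
    },
}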
14. DNS_TIMEOUT (Resolve Timeouts)
What it does: Timeout for DNS lookups.
Default: 60 seconds
When to change: Lower for faster failure detection.
DNS_TIMEOUT = 10 # Fail fast on DNS issues
15. DOWNLOAD_TIMEOUT (Page Load Timeout)
What it does: Timeout for downloading pages.
Default: 180 seconds
When to change: Lower for faster scraping.
DOWNLOAD_TIMEOUT = 30 # 30 seconds max per page
What the docs don't tell you:
- 180 seconds is way too long
- 15-30 seconds is reasonable for most sites
- If pages take longer, the site might be blocking you
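You can also raise the limit for a single known-slow endpoint via the download_timeout meta key instead of loosening the global setting (slow_report_url is a placeholder):
# Give one slow endpoint more time without changing DOWNLOAD_TIMEOUT globally
yield scrapy.Request(slow_report_url, meta={'download_timeout': 60})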
Command-Line Overrides
You can override settings from the command line:
# Change log level
scrapy crawl myspider -s LOG_LEVEL=WARNING
# Change download delay
scrapy crawl myspider -s DOWNLOAD_DELAY=5
# Multiple settings
scrapy crawl myspider -s LOG_LEVEL=INFO -s DOWNLOAD_DELAY=2
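If you're ever unsure which value wins after all these layers, the scrapy settings command prints the resolved project-level value (spider custom_settings still apply only at crawl time):
# Check what Scrapy will actually use
scrapy settings --get DOWNLOAD_DELAY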
Spider-Level Settings (Per-Spider Configuration)
Override settings for specific spiders:
class FastSpider(scrapy.Spider):
    name = 'fast'
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5,
        'CONCURRENT_REQUESTS': 32,
        'LOG_LEVEL': 'WARNING'
    }

class SlowSpider(scrapy.Spider):
    name = 'slow'
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True
    }
Settings That DON'T Matter (Usually)
These settings exist but you'll rarely touch them:
- SPIDER_LOADER_CLASS (only for advanced customization)
- STATS_CLASS (default is fine)
- TELNETCONSOLE_ENABLED (debugging feature)
- MEMDEBUG_ENABLED (only for memory debugging)
- DEPTH_PRIORITY (default works)
My Recommended Settings Template
Here's my starting point for most projects:
# settings.py
# Identity
BOT_NAME = 'my_scraper'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
# Politeness
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
# Or use AutoThrottle instead
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Cookies
COOKIES_ENABLED = True
# Retries
RETRY_ENABLED = True
RETRY_TIMES = 3
# Timeouts
DNS_TIMEOUT = 10
DOWNLOAD_TIMEOUT = 30
# Development only
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
# Logging
LOG_LEVEL = 'INFO'
# Pipelines (add yours)
ITEM_PIPELINES = {
    # 'myproject.pipelines.MyPipeline': 300,
}
Settings for Different Scenarios
Scenario 1: Fast Scraping (Big Sites)
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.5
AUTOTHROTTLE_ENABLED = False # We're going fast
Scenario 2: Polite Scraping (Small Sites)
CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
Scenario 3: Development/Testing
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
LOG_LEVEL = 'DEBUG'
CONCURRENT_REQUESTS = 1 # Easier to debug
Scenario 4: Production
HTTPCACHE_ENABLED = False
LOG_LEVEL = 'INFO'
LOG_FILE = 'spider.log'
AUTOTHROTTLE_ENABLED = True
RETRY_TIMES = 5
Common Mistakes
Mistake #1: Setting Everything
# DON'T do this
COOKIES_DEBUG = True
SPIDER_MIDDLEWARES_BASE = {...}
DOWNLOADER_MIDDLEWARES_BASE = {...}
# ... 50 more settings
Most settings have good defaults. Only change what you need.
Mistake #2: No Delays
# BAD (will get blocked)
DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS = 64
This is asking to get blocked.
Mistake #3: Cache in Production
# BAD (stale data in production)
HTTPCACHE_ENABLED = True
Cache is for development only!
Quick Reference
Speed (from slow to fast)
# Very slow/polite
DOWNLOAD_DELAY = 5
CONCURRENT_REQUESTS = 2
# Normal
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 8
# Fast
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS = 32
Essential Settings Checklist
- [ ] Set USER_AGENT to real browser
- [ ] Set DOWNLOAD_DELAY or enable AUTOTHROTTLE
- [ ] Set CONCURRENT_REQUESTS appropriately
- [ ] Enable HTTPCACHE for development
- [ ] Disable HTTPCACHE for production
- [ ] Set LOG_LEVEL to INFO for production
- [ ] Configure ITEM_PIPELINES if you have any
Summary
The 5 settings you'll change most:
- USER_AGENT (always)
- DOWNLOAD_DELAY or AUTOTHROTTLE (always)
- CONCURRENT_REQUESTS (based on site)
- HTTPCACHE_ENABLED (dev vs prod)
- LOG_LEVEL (dev vs prod)
Start simple:
- Use my recommended template
- Adjust delays based on site response
- Add settings as you need them
- Don't prematurely optimize
Remember:
- Most settings have good defaults
- Slower scraping = more reliable scraping
- Test before deploying
- Monitor and adjust
Happy scraping! 🕷️