Web scraping in 2025 is harder than it used to be. Sites that once responded to a simple HTTP request now combine fingerprinting, behavioral checks, JavaScript-heavy frontends, and stricter rate limits.
That does not make scraping impossible, but it does change the work. A parser that looks fine on paper can still fail because the browser fingerprint looks wrong, the IP reputation is weak, or the site is returning a challenge page instead of the content you expected.
This guide walks through seven problems that show up repeatedly in modern scraping systems, along with the practical trade-offs behind each one.
Why Web Scraping Has Become an Arms Race
So why is scraping such a nightmare now? Website owners aren't stupid, and they have real reasons to lock things down: competitors harvesting their pricing strategies, bots hammering their servers and driving up hosting costs, and people scraping entire sites to republish the content elsewhere. If I ran a site, I'd be paranoid too. So they fight back, hard.
Here's the part that surprises people: many commercial websites in 2025 run several anti-bot systems simultaneously. Browser fingerprinting, TLS fingerprinting, behavioral analysis, honeypot traps, rate limiting that adapts in real time. Some vendors even train machine learning models on huge volumes of bot traffic to score each request. They're playing 4D chess while most scrapers are still figuring out checkers.
The Modern Anti-Bot Stack
- Cloudflare Bot Management: Common on commercial sites with aggressive bot protection
- PerimeterX/HUMAN: Behavioral analysis and device fingerprinting
- DataDome: Real-time bot detection with ML
- Akamai Bot Manager: Enterprise-grade protection
- reCAPTCHA v3: Invisible scoring system
- Custom WAF rules: Site-specific detection logic
And these anti-bot systems are not cheap: enterprise plans can run to thousands of dollars per month. Sites are often paying more to keep bots out than most of us spend on scraping infrastructure, which tells you how seriously they take this. It's also exactly why your basic Python script keeps failing.
Challenge #1: The CAPTCHA Nightmare
CAPTCHAs are still one of the first signals developers think about, but the visible puzzle is usually only the last stage of detection. By the time you see it, the site has already decided your traffic looks suspicious.
The newer generation of CAPTCHAs, like reCAPTCHA v3 and hCaptcha, is sneaky. These systems work invisibly in the background, watching everything you do: mouse movements, typing patterns, how your browser looks, your IP's reputation. They analyze dozens of signals, and if anything seems off (say, requests from a datacenter IP with suspiciously consistent timing), you're blocked before you ever see a puzzle.
```python
# What happens when you ignore CAPTCHAs
import requests
from bs4 import BeautifulSoup

url = "https://target-site.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

products = soup.find_all('div', class_='product')
print(f"Found {len(products)} products")
# Output: Found 0 products
# Why? You got served a CAPTCHA challenge page instead of product data
# Your script has no idea it's been blocked
```
CAPTCHA solving services can help in some setups, but they add latency, cost, and another failure point. On tougher targets, they also do not solve the underlying issue: the request already looks automated.
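Whatever strategy you choose, the first step is noticing that you were blocked at all, since a challenge page often comes back with a 200 status. A minimal detection sketch; the marker strings and status codes here are illustrative and should be tuned per target:

```python
# Strings that commonly appear on challenge pages rather than real content.
# This marker list is illustrative -- extend it for your own targets.
CHALLENGE_MARKERS = [
    "captcha",
    "cf-challenge",
    "just a moment",          # Cloudflare interstitial title
    "verify you are human",
]

def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic: does this response look like a block/challenge page?"""
    if status_code in (403, 429, 503):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)

# A 200 response can still be a challenge page:
print(looks_blocked(200, "<title>Just a moment...</title>"))    # True
print(looks_blocked(200, "<div class='product'>Widget</div>"))  # False
```

Feeding a check like this into your retry logic prevents the silent "Found 0 products" failure mode shown above.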
The Real Solution: Prevention Over Solving
The more reliable approach is prevention. If the browser fingerprint, IP quality, and request timing look reasonable, you often avoid the challenge entirely. That is usually more stable than trying to solve puzzles after detection has already happened.
```bash
# ScrapingBot - CAPTCHAs simply don't appear
# Goal: receive the page content without tripping a challenge
curl "https://api.scrapingbot.io/v1/scrape" \
  -H "x-api-key: YOUR_KEY" \
  -d "url=https://target-site.com/products" \
  -d "render_js=true" \
  -d "premium_proxy=true"
```

```json
{
  "success": true,
  "html": "... actual product data ...",
  "statusCode": 200
}
```
In practice, that usually means better IPs, browser rendering with a believable fingerprint, and request timing that does not look mechanical.
Challenge #2: IP Blocking and Ban Hammers
IP blocking is old school, but don't let that fool you—it's gotten way more sophisticated. Websites aren't just counting how many requests come from your IP anymore. They're checking your IP's reputation, instantly identifying datacenter IP ranges, tracking behavior patterns across multiple IPs, and sharing blocklists with other sites. It's like a permanent record that follows you around the internet.
This is also where costs start to climb. A proxy pool that looks healthy at the start of a project can degrade quickly once a target begins scoring and blocking traffic more aggressively.
Types of IP Bans You'll Face
- 🚫 Hard bans: Your IP is completely blocked, sometimes permanently
- 🚫 Soft bans: You get served fake/stale data or endless loading
- 🚫 Rate limits: Throttled to 1 request per minute or worse
- 🚫 Subnet bans: Your entire IP range gets blacklisted
- 🚫 Geo-blocks: Datacenter IPs from certain regions auto-banned
Why Datacenter Proxies Usually Fail
Datacenter proxies are cheap and fast, which is why many teams start there. The downside is that many sites can classify those IP ranges quickly, so the cheapest network is not always the one that stays usable.
```python
# Managing proxies manually = nightmare fuel
import requests
import random
import time
from datetime import datetime

proxies = load_proxy_list()  # 1000 proxies you paid for
working_proxies = proxies.copy()
failed_proxies = []

def test_proxy(proxy):
    """Test if a proxy is still working"""
    try:
        response = requests.get(
            'https://httpbin.org/ip',
            proxies={'http': proxy, 'https': proxy},
            timeout=5
        )
        return response.status_code == 200
    except:
        return False

for url in urls_to_scrape:
    attempt = 0
    success = False
    while attempt < 5 and not success:
        if len(working_proxies) == 0:
            print("CRITICAL: No working proxies left!")
            break
        proxy = random.choice(working_proxies)
        try:
            response = requests.get(
                url,
                proxies={'http': proxy, 'https': proxy},
                timeout=10,
                headers={'User-Agent': random_user_agent()}
            )
            if response.status_code == 200:
                success = True
                parse_data(response)
            elif response.status_code == 403:
                # Proxy is banned, remove it
                working_proxies.remove(proxy)
                failed_proxies.append({
                    'proxy': proxy,
                    'failed_at': datetime.now(),
                    'reason': 'banned'
                })
            else:
                # Other error, try different proxy
                pass
        except requests.exceptions.Timeout:
            # Proxy too slow, mark as unreliable
            working_proxies.remove(proxy)
        except Exception:
            # Connection error, proxy likely dead
            working_proxies.remove(proxy)
        attempt += 1
        time.sleep(random.uniform(3, 8))
    if not success:
        print(f"Failed to scrape {url} after 5 attempts")

# Monitor proxy health
proxy_health = len(working_proxies) / len(proxies) * 100
print(f"Proxy pool health: {proxy_health:.1f}%")
if len(working_proxies) < 50:
    print("WARNING: Running out of proxies! Need to buy more...")
    # Emergency: buy more proxies or pause scraping

# This code requires:
# - Constant monitoring
# - Regular proxy replacement
# - Error handling for dozens of edge cases
# - Database to track proxy performance
# - Alerts when proxy pool degrades
```
Residential proxies usually improve survivability, but they also add cost and operational overhead. You still have to manage sessions, geography, retries, and the quality of the pool itself.
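One recurring piece of that session work is keeping a logical session pinned to the same exit IP, because rotating IPs mid-session is itself a strong bot signal. A minimal sketch of sticky proxy assignment; the pool entries below are placeholders, not real endpoints:

```python
import hashlib

# Hypothetical residential proxy pool -- in practice this list
# comes from your provider.
PROXY_POOL = [
    "http://user:pass@res-proxy-1.example:8000",
    "http://user:pass@res-proxy-2.example:8000",
    "http://user:pass@res-proxy-3.example:8000",
]

def sticky_proxy(session_id: str, pool=PROXY_POOL) -> str:
    """Map a logical session to the same proxy on every call,
    so cookies and exit IP stay consistent for that session."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return pool[int(digest, 16) % len(pool)]

# The same session id always resolves to the same proxy:
print(sticky_proxy("user-42") == sticky_proxy("user-42"))  # True
```

Hash-based assignment has the nice property that it needs no shared state, but a real implementation still has to handle proxies dropping out of the pool.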
Managed services are useful here because they absorb that operational work. Instead of maintaining the pool yourself, you work against an API and let the provider handle replacement, rotation, and retry strategy.
Challenge #3: JavaScript Rendering Hell
A lot of modern sites depend heavily on JavaScript. If you send a plain HTTP request, you often get a shell page back instead of the data you expected to parse.
The Headless Browser Trap
The obvious answer is to switch to a headless browser. That works, but it also changes the economics of the scraper. Browsers consume far more memory and CPU than simple HTTP clients, and they add a new detection surface.
A basic Puppeteer setup is enough to render the page, but it is usually not enough to stay undetected:
```javascript
// Basic Puppeteer scraper - will get detected
const puppeteer = require('puppeteer');

async function scrapeWithPuppeteer(url) {
  const browser = await puppeteer.launch({
    headless: true,  // Detectable!
    args: ['--no-sandbox']
  });
  const page = await browser.newPage();

  // Set basic user agent
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  );

  try {
    await page.goto(url, { waitUntil: 'networkidle0' });

    // Wait for content to load
    await page.waitForSelector('.product-list', { timeout: 10000 });

    // Extract data
    const data = await page.evaluate(() => {
      const products = [];
      document.querySelectorAll('.product-item').forEach(item => {
        products.push({
          name: item.querySelector('.product-name')?.textContent,
          price: item.querySelector('.price')?.textContent
        });
      });
      return products;
    });

    await browser.close();
    return data;
  } catch (error) {
    await browser.close();
    // Common errors you'll see:
    // - TimeoutError: Waiting for selector timed out
    // - Navigation failed because page crashed
    // - Access denied (you've been detected)
    throw error;
  }
}

// Problems with this approach:
// 1. navigator.webdriver is true (instant detection)
// 2. Missing browser plugins (Chrome extensions, PDF viewer)
// 3. No WebGL fingerprint
// 4. Wrong canvas fingerprint
// 5. Inconsistent screen dimensions
// 6. No audio context
// 7. Suspicious timing (too fast/consistent)
```
To avoid detection, you need stealth plugins and proper configuration:
```javascript
// Stealth Puppeteer - much better, but still complex
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

// Add stealth plugins
puppeteer.use(StealthPlugin());
puppeteer.use(AdblockerPlugin({ blockTrackers: true }));

async function scrapeWithStealth(url) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--disable-gpu',
      '--window-size=1920,1080',
      '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    ]
  });
  const page = await browser.newPage();

  // Set viewport to match window size
  await page.setViewport({ width: 1920, height: 1080 });

  // Set additional headers
  await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
  });

  // Patch a few properties to appear human
  await page.evaluateOnNewDocument(() => {
    // Override the navigator properties
    Object.defineProperty(navigator, 'webdriver', {
      get: () => false,
    });
    // Add Chrome runtime
    window.chrome = { runtime: {} };
    // Override permissions
    const originalQuery = window.navigator.permissions.query;
    window.navigator.permissions.query = (parameters) => (
      parameters.name === 'notifications'
        ? Promise.resolve({ state: Notification.permission })
        : originalQuery(parameters)
    );
  });

  try {
    // Navigate with a generous timeout
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

    // Random mouse movements (appear human)
    await page.mouse.move(100, 100);
    await page.mouse.move(200, 200);

    // Random scroll
    await page.evaluate(() => {
      window.scrollBy(0, Math.floor(Math.random() * 500) + 100);
    });

    // Wait with random delay
    await page.waitForTimeout(Math.random() * 2000 + 1000);

    const data = await page.evaluate(() => {
      // Extract data...
      return extractProducts();
    });

    await browser.close();
    return data;
  } catch (error) {
    await browser.close();
    throw error;
  }
}

// Better, but still requires:
// - Managing browser instances
// - Handling crashes and memory leaks
// - Load balancing across multiple browsers
// - Monitoring and auto-restart on failures
// - Keeping stealth plugins updated
```
Real Costs of Running Your Own Browsers
The hard part is that rendering the page is only half the problem. Many sites also inspect browser features closely enough to tell the difference between a default automation setup and a normal user session.
Challenge #4: Rate Limiting and Throttling
Even if the requests are rendering correctly and the IPs look acceptable, rate limiting can still stop the scraper. Sites can track request volume per IP, session, user agent, and time window.
The frustrating part is that every site sets those limits differently, and most of them do not tell you where the line is. You usually discover it by watching success rates fall off.
```python
# Naive approach - instant ban
import requests

urls = [f"https://site.com/page/{i}" for i in range(1000)]

for url in urls:
    response = requests.get(url)
    parse_data(response)

# Banned after request #15
# Why? 15 requests in 3 seconds is obviously a bot
```
```python
# Adaptive rate limiter that learns from responses
import time
import random

import requests

class AdaptiveRateLimiter:
    def __init__(self, initial_delay=2.0):
        self.delay = initial_delay
        self.success_count = 0
        self.failure_count = 0
        self.last_request_time = None

    def record_success(self):
        self.success_count += 1
        self.failure_count = 0
        # Speed up if consistently successful
        if self.success_count > 10:
            self.delay = max(1.0, self.delay * 0.95)
            self.success_count = 0

    def record_failure(self, status_code):
        self.failure_count += 1
        self.success_count = 0
        if status_code == 429:    # Too Many Requests
            self.delay = min(30.0, self.delay * 2.0)
        elif status_code == 403:  # Forbidden
            self.delay = min(60.0, self.delay * 3.0)
        if self.failure_count > 3:
            self.delay = min(120.0, self.delay * 2.0)

    def wait(self):
        if self.last_request_time:
            elapsed = time.time() - self.last_request_time
            sleep_time = max(0, self.delay - elapsed)
            if sleep_time > 0:
                # Add random jitter (±20%)
                jitter = random.uniform(-0.2, 0.2) * sleep_time
                time.sleep(sleep_time + jitter)
        self.last_request_time = time.time()

# Usage
rate_limiter = AdaptiveRateLimiter(initial_delay=3.0)

for url in urls_to_scrape:
    rate_limiter.wait()
    try:
        response = requests.get(url)
        if response.status_code == 200:
            rate_limiter.record_success()
            process_data(response)
        else:
            rate_limiter.record_failure(response.status_code)
    except Exception:
        rate_limiter.record_failure(500)

# Still requires:
# - Per-domain rate limiting
# - IP-based tracking
# - Retry-After header handling
# - Exponential backoff
```
This is one place where managed infrastructure helps. Instead of implementing per-target timing and retry rules yourself, you can rely on a provider to absorb some of that variability.
Challenge #5: Ever-Changing Website Structures
Oh, this one's my favorite. You spend a week building the perfect scraper. Your CSS selectors are clean, your parsing logic feels bulletproof, and it works perfectly for exactly two weeks. Then the site ships a minor update, changes one class name, and suddenly your scraper returns nothing but errors.
I've seen scrapers break because a site changed a class name from "product-title" to "product_title" (yes, just a dash to an underscore), reorganized its DOM structure, or renamed an ID. Each of these tiny changes means hours of debugging and updating code, and if you're running scrapers against multiple sites, that pain multiplies by however many sites you're scraping.
AI to the Rescue
This is where AI actually becomes useful (for once). Instead of writing fragile CSS selectors that break when the wind blows, you just tell the AI what you want in plain English and let it figure out where to find it:
```bash
# AI extraction - works even when HTML changes
curl "https://api.scrapingbot.io/v1/scrape" \
  -H "x-api-key: YOUR_KEY" \
  -d "url=https://ecommerce-site.com/product/12345" \
  -d "render_js=true" \
  -d "ai_query=extract the product name, price, rating, and availability"
```

```json
{
  "success": true,
  "ai_result": {
    "product_name": "Wireless Bluetooth Headphones",
    "price": "$89.99",
    "rating": "4.5",
    "availability": "In Stock"
  }
}
```
The AI understands what you're asking for semantically instead of matching specific CSS classes. If the site changes its HTML, the AI adapts. It's like having a human who can actually read the page rather than blindly following selector instructions.
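If you stay with hand-written selectors, a cheaper middle ground is to list several candidate selectors per field, so a renamed class degrades gracefully instead of silently returning nothing. A sketch using BeautifulSoup; the selector lists and sample HTML are illustrative:

```python
from bs4 import BeautifulSoup

# Several selector candidates per field: when a site renames a class,
# the scraper falls through to the next candidate instead of breaking.
FIELD_SELECTORS = {
    "name":  [".product-title", ".product_title", "h1[itemprop=name]"],
    "price": [".price", ".product-price", "span[itemprop=price]"],
}

def extract(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selectors in FIELD_SELECTORS.items():
        for sel in selectors:
            node = soup.select_one(sel)
            if node:
                result[field] = node.get_text(strip=True)
                break
        else:
            result[field] = None  # every candidate failed -- alert, don't guess
    return result

html = '<h1 class="product_title">Widget</h1><span class="product-price">$9.99</span>'
print(extract(html))  # {'name': 'Widget', 'price': '$9.99'}
```

This doesn't eliminate maintenance, but it turns "scraper silently broken" into "one fallback consumed", which is much easier to monitor.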
Challenge #6: Handling Cookies and Sessions
And then there's the joy of session management. A lot of sites need you to maintain state—login cookies, shopping cart sessions, user preferences. You can't just fire off stateless requests and call it a day. You need to juggle cookies, session tokens, CSRF tokens, and sometimes even localStorage. Now try doing that for thousands of concurrent sessions without losing your mind.
```python
# Managing sessions manually
import requests

class SessionManager:
    def __init__(self):
        self.sessions = {}

    def get_or_create_session(self, site_id):
        if site_id not in self.sessions:
            session = requests.Session()
            # Set headers that persist across requests
            session.headers.update({
                'User-Agent': 'Mozilla/5.0 ...',
                'Accept': 'text/html,application/xhtml+xml...',
                'Accept-Language': 'en-US,en;q=0.9',
            })
            self.sessions[site_id] = session
        return self.sessions[site_id]

    def scrape_with_session(self, url, site_id):
        session = self.get_or_create_session(site_id)

        # First request might set cookies
        response = session.get(url)

        # Subsequent requests automatically include cookies
        if 'login' in response.url:
            # Need to handle login flow
            csrf_token = extract_csrf_token(response.text)
            login_data = {
                'username': 'user',
                'password': 'pass',
                'csrf_token': csrf_token
            }
            login_response = session.post(
                'https://site.com/login',
                data=login_data
            )
            # Now cookies are set for authenticated requests
            response = session.get(url)
        return response

# Challenges:
# - Sessions expire and need refresh
# - CSRF tokens change frequently
# - Some sites use LocalStorage (not accessible via requests)
# - Rate limiting applies per-session
# - Need to detect when session is invalid
```
ScrapingBot handles all this session stuff automatically. Custom cookies, session persistence, and localStorage support when you're using browser rendering are all built in. That means you do not have to maintain a separate session management layer just to keep requests stable.
```bash
# ScrapingBot handles sessions easily
curl "https://api.scrapingbot.io/v1/scrape" \
  -H "x-api-key: YOUR_KEY" \
  -d "url=https://site.com/account/orders" \
  -d "render_js=true" \
  -d "cookies=session_id=abc123;user_token=xyz789"

# Cookies are automatically maintained across requests
# No session management infrastructure needed
```
Challenge #7: Legal and Ethical Considerations
Legal questions are part of the work too. What is acceptable depends on the data, the jurisdiction, the way the scraper accesses the site, and the way the data is used afterward. This is not legal advice, but it is worth treating the legal review as part of the system design.
Key Legal Considerations
- Public vs. Private Data: Scraping public data is generally safer than scraping behind logins
- Terms of Service: Many sites prohibit automated access in their ToS
- robots.txt: Respect these directives when possible
- Copyright: Don't republish copyrighted content without permission
- Privacy Laws: GDPR, CCPA, and similar laws protect personal data
- Server Load: Don't hammer servers with excessive requests
The safest habit is to treat data collection as a compliance problem as well as an engineering problem. If the use case matters to the business, it is worth getting a real legal review early.
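On the robots.txt point specifically, honoring it costs almost nothing in code: Python's standard library can parse the file and answer per-path questions. A small example using `urllib.robotparser`; it parses a sample file inline (normally you would call `set_url(...)` and `read()` against the live site):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt parsed inline to keep the example offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("my-scraper", "https://example.com/products"))   # True
print(rp.can_fetch("my-scraper", "https://example.com/private/x"))  # False
print(rp.crawl_delay("my-scraper"))                                 # 10
```

Checking `can_fetch` before each request, and feeding `crawl_delay` into your rate limiter, is a cheap way to stay on the defensible side of the line.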
Real-World Example: E-Commerce Price Monitoring
A good example is competitor price monitoring on a large e-commerce site. That kind of target often combines several of the problems above at the same time:
- ✗ Cloudflare protection
- ✗ reCAPTCHA v3
- ✗ JavaScript-rendered prices
- ✗ Aggressive rate limiting
- ✗ Weekly HTML structure changes
A DIY stack for that workflow usually means paid proxies, browser infrastructure, retry logic, and regular maintenance every time the site changes. The exact cost varies, but the pattern is consistent: the scraper itself becomes an internal product that needs ongoing support.
A managed service changes that trade-off by moving most of the anti-detection work and infrastructure maintenance behind an API.
The code difference is usually the easiest way to see the trade-off:
```python
# Complete DIY e-commerce scraper (simplified)
import logging
import random
import time
from datetime import datetime

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class ProductScraper:
    def __init__(self):
        self.proxy_pool = load_proxies()        # $300/month
        self.working_proxies = self.proxy_pool.copy()
        self.rate_limiter = AdaptiveRateLimiter()
        self.session_manager = SessionManager()
        self.captcha_solver = CaptchaSolver()   # $100/month

    def scrape_product(self, url, retries=3):
        for attempt in range(retries):
            try:
                # Get working proxy
                if len(self.working_proxies) == 0:
                    logging.error("No working proxies!")
                    return None
                proxy = random.choice(self.working_proxies)

                # Respect rate limits
                self.rate_limiter.wait()

                # Launch browser with proxy
                options = webdriver.ChromeOptions()
                options.add_argument(f'--proxy-server={proxy}')
                options.add_argument('--headless')
                driver = webdriver.Chrome(options=options)

                try:
                    driver.get(url)
                    time.sleep(random.uniform(2, 5))

                    # Check for CAPTCHA
                    if 'captcha' in driver.page_source.lower():
                        logging.warning(f"CAPTCHA detected on {url}")
                        captcha_solution = self.captcha_solver.solve(driver)
                        if not captcha_solution:
                            self.working_proxies.remove(proxy)
                            driver.quit()
                            continue

                    # Wait for dynamic content
                    WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.CLASS_NAME, "product"))
                    )

                    # Extract data
                    html = driver.page_source
                    soup = BeautifulSoup(html, 'html.parser')
                    name = soup.select_one('.product-name')
                    price = soup.select_one('.price')
                    stock = soup.select_one('.stock')
                    product = {
                        'name': name.text if name else None,
                        'price': price.text if price else None,
                        'availability': stock.text if stock else None,
                        'scraped_at': datetime.now()
                    }
                    driver.quit()
                    self.rate_limiter.record_success()
                    return product

                except TimeoutException:
                    logging.error(f"Timeout on {url}")
                    driver.quit()
                    self.working_proxies.remove(proxy)
                except Exception as e:
                    logging.error(f"Error: {e}")
                    driver.quit()
            except Exception as e:
                logging.error(f"Fatal error: {e}")
        return None

    def scrape_catalog(self, urls):
        results = []
        failed = []
        for url in urls:
            product = self.scrape_product(url)
            if product:
                results.append(product)
            else:
                failed.append(url)

            # Monitor proxy health
            health = len(self.working_proxies) / len(self.proxy_pool)
            if health < 0.2:
                logging.critical("Proxy pool critically low!")
                # Need to buy more proxies...
        return results, failed

# This requires:
# - 800+ lines of additional code (error handling, retry logic, monitoring)
# - Database to track proxy performance and failures
# - Monitoring dashboard and alerts
# - Regular updates when sites change
# - Dedicated server (m5.2xlarge: $280/month)
# - 20+ hours/month maintenance
```
Yeah, that's... a lot. Now compare it to this:
```python
# Complete ScrapingBot e-commerce scraper
import requests

API_KEY = "your_key"
BASE_URL = "https://api.scrapingbot.io/v1/scrape"

def scrape_product(url):
    response = requests.get(BASE_URL, params={
        "url": url,
        "render_js": "true",
        "premium_proxy": "true",
        "ai_query": "extract product name, price, and availability"
    }, headers={"x-api-key": API_KEY})
    data = response.json()
    return data["ai_result"] if data["success"] else None

def scrape_catalog(urls):
    results = []
    for url in urls:
        product = scrape_product(url)
        if product:
            results.append(product)
    return results

# Minimal integration example
# No proxies to manage
# No browsers to run
# No CAPTCHAs to solve
# No infrastructure to maintain
# Works reliably at scale
```
A Practical Cost Comparison
| Component | DIY Solution | ScrapingBot |
|---|---|---|
| Development Time | 3-6 weeks | 30 minutes |
| Proxy Costs | $300-1,000/month | Included |
| CAPTCHA Solving | $100-500/month | Included |
| Server/Infrastructure | $200-800/month | $0 |
| Maintenance Hours | 15-30/month | 0/month |
| Success Rate | Varies widely by site and maintenance effort | Depends on the provider and target |
| Dealing with Site Updates | Manual fixes required | AI adapts automatically |
| Total Monthly Cost | $1,500-5,000+ | $49-249 |
The Build vs. Buy Decision
Most engineering teams can build their own scraping stack. The harder question is whether they should. For many products, the custom logic that matters lives in the data pipeline and the business logic, not in maintaining anti-detection infrastructure.
Once the target sites become difficult enough, the infrastructure starts to look like a product of its own. That can be justified, but it should be a deliberate decision rather than an accidental side effect of a data collection project.
When you compare approaches, it helps to count engineer time and maintenance burden alongside vendor cost. That usually gives a more honest picture than comparing request prices alone.
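A rough way to do that accounting is to price maintenance hours alongside infrastructure and tooling. A back-of-envelope sketch; every number here is a placeholder, so substitute your own team's rates and vendor quotes:

```python
# Build-vs-buy cost model. All inputs are illustrative placeholders.
def monthly_cost(infra: float, tooling: float,
                 maintenance_hours: float, hourly_rate: float) -> float:
    """Total monthly cost including the engineer time most comparisons omit."""
    return infra + tooling + maintenance_hours * hourly_rate

diy = monthly_cost(infra=500, tooling=400, maintenance_hours=20, hourly_rate=90)
managed = monthly_cost(infra=0, tooling=249, maintenance_hours=2, hourly_rate=90)

print(f"DIY: ${diy:,.0f}/month, managed: ${managed:,.0f}/month")
# DIY: $2,700/month, managed: $429/month
```

The point is not the specific totals but the shape of the model: once maintenance hours are priced in, the gap between the two columns in the table above usually widens.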
Getting Started: A Practical Roadmap
If you want to test the managed route, start small and measure the output against a real target:
Your First ScrapingBot Project
1. Sign up and get 100 free credits. No credit card required; test on your actual target sites.
2. Start with the playground. Test different options (JS rendering, proxies, AI extraction) to see what works.
3. Integrate with your code. We have SDKs for Python, Node.js, and PHP, plus simple cURL examples.
4. Scale up gradually. Start small, monitor results, then increase volume as you validate data quality.
Final Thoughts: Focus on What Actually Matters
You can absolutely build your own scraping infrastructure. For some teams, that will be the right call. But it is worth being honest about the maintenance burden before you commit to it.
The important question is not whether the stack is buildable. It is whether building it helps the product enough to justify the maintenance and operational cost.
"Focus engineering time on the part of the system that creates value. The more scraping infrastructure turns into routine maintenance, the more reasonable it is to outsource it."
Modern scraping means solving a cluster of interconnected problems: CAPTCHAs, IP rotation, JavaScript rendering, rate limiting, and constant frontend change. Those are real engineering problems, but they are not always the problems your team needs to own directly.
If the goal is to ship a data product quickly, using a managed service can be a sensible shortcut. If the goal is to control every layer yourself, the earlier sections in this article give you a clearer picture of what that decision really entails.