Have run into exactly this before. Wrote a scraper that retrieved results from a trivia league website. Tried to be a polite scraper (under 1 request per second), but the site still crashed, even with 5 seconds of sleep between requests. They were doing something weird with DB connection management (maybe just forgetting to close connections and letting them time out? I remember figuring it out, but it's been quite a while), so after N very reasonably spaced queries the site would reproducibly start throwing an uncaught MAX_DB_CONNECTIONS_EXCEEDED and just be down for everybody, everywhere, who might've wanted to use it.
It seems like you could easily hit those scaling issues by manually browsing the website. While I agree that it sucks to take down a site by scraping, in that specific case it sounds like the performance issues are their fault and not yours. That said, once I realized the effect my scraping was having, I would (hopefully) stop.
So the thing is, I could totally believe they never saw this traffic pattern under normal load. I'd expect bar trivia scores in a certain mid-sized US city to be one of those niche things where you have a very low number of uniques, but each unique then pokes around on 9 or 10 pages while they're there. The fact that the site didn't crash during normal browsing was what originally led me to speculate they were maintaining an open DB connection per session. If that was indeed the issue, I could totally imagine they'd only rarely (never?) had 100+ "concurrent-ish" unique visitors.
Ok, then why couldn't you revise your scraper so that it did everything in a single session, to avoid this problem?
To me, for private, personal use, a scraper should emulate a normal human browser as much as possible to avoid causing site problems and to avoid detection. If what you're doing can be done in the background, or by a cron process at some odd hour, it doesn't have to be fast at all, and you can set the timings to be similar to a normal human.
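For whatever it's worth, the shape of that is pretty simple. Here's a rough Python sketch of what I mean; the URLs, headers, and delay values are just placeholders, not anyone's actual setup:

    import random
    import time

    import requests

    # Hypothetical list of result pages; the real URLs depend on the site.
    urls = ["https://example.com/results?page=%d" % n for n in range(1, 11)]

    # One Session object keeps cookies across requests, like a real browser tab.
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0 (personal archiving script)"

    for url in urls:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        # ... parse/save response.text here ...
        # Sleep a human-ish, slightly randomized interval between page loads.
        time.sleep(random.uniform(5, 15))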
Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.
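To make that concrete, the anti-pattern I'm imagining would look roughly like this. This is a made-up Flask/Postgres sketch, not their actual stack, just to illustrate how cookieless requests would strand connections:

    import os

    import psycopg2
    from flask import Flask, session

    app = Flask(__name__)
    app.secret_key = "dev"

    # One open connection per session, never closed; this is the bug.
    connections = {}

    @app.route("/results")
    def results():
        # With cookies disabled, the client never sends the session cookie back,
        # so every request looks like a brand-new session and opens (and strands)
        # another connection until the DB hits its connection limit.
        sid = session.setdefault("id", os.urandom(8).hex())
        if sid not in connections:
            connections[sid] = psycopg2.connect("dbname=trivia")
        conn = connections[sid]
        with conn.cursor() as cur:
            cur.execute("SELECT team, score FROM results LIMIT 50")
            rows = cur.fetchall()
        return {"rows": rows}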
I'd consider that a bug not a feature but I still think it's incumbent on me, the guy scraping the website, not to trigger it.
That is a classic connection pooling/lifecycle bug, and usually one that gets caught within the first few days of having multiple people using a product/service, worst case.
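The textbook fix is to check connections out of a small, bounded pool per request and always hand them back, instead of tying them to sessions. Roughly, using psycopg2's built-in pool as a sketch (not a drop-in for whatever they were actually running):

    from psycopg2.pool import ThreadedConnectionPool

    # At most 10 connections, no matter how many sessions/visitors show up.
    pool = ThreadedConnectionPool(1, 10, dsn="dbname=trivia")

    def fetch_scores():
        conn = pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT team, score FROM results")
                return cur.fetchall()
        finally:
            pool.putconn(conn)  # always returned, so connections can't pile up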
If someone's production site, one that's been around for a while, had a bug like this that can be triggered by what you describe, I'd love to see how many real users they have. I'm sure it's possible under certain circumstances, but it's definitely bad engineering that would be tripped by literally any traffic.
You can avoid triggering this in your scraper by activating a cookie jar. Pretty simple most of the time. Even command-line cURL and wget support it. I'm sure you figured that out already, but just for anyone who's wondering. ;)
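In Python, the standard-library spelling of "activate a cookie jar" looks something like this (placeholder URL, obviously):

    import urllib.request
    from http.cookiejar import CookieJar

    # The jar holds the site's session cookie between requests, so the server
    # sees one ongoing session instead of a new one per page fetch.
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    html = opener.open("https://example.com/results?page=1").read()
    # Later fetches through the same opener reuse the stored cookies.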
That said, while obviously you want to avoid triggering the bug since it offlines your data source, this is definitely in the site's court to fix and could easily be triggered by normal usage. Some people browse with cookies disabled, especially since the EU passed its "cookie law", requiring sites to get consent before storing a cookie on visitors' machines. If you've started to notice more sites talking about cookies over the last year, that's why. [0]
>Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.
Could also be something like storing Hibernate's second-level cache in the session. Unfortunately, I've seen this: a significant chunk of the database was being copied into each user's session.