Have run into exactly this before. Wrote a scraper that retrieved results from a trivia league website. Tried to be a polite scraper (under 1 request per second), but the site still crashed, even with 5 seconds of sleep between requests. They were doing something weird with DB connection management (maybe just forgetting to close connections and letting them time out? I remember figuring it out, but it's been quite a while), so after N very reasonably spaced queries the site would reproducibly start throwing an uncaught MAX_DB_CONNECTIONS_EXCEEDED and just be down for everybody, everywhere, who might've wanted to use it.
It seems like you could easily hit those scaling issues by manually browsing the website. While I agree that it sucks to take down a site by scraping, in that specific case it sounds like the performance issues are their fault and not yours. That said, once I realized the effect my scraping was having, I would (hopefully) stop.
So the thing is, I could totally believe they never saw this traffic pattern under normal load. I'd expect bar trivia scores in a certain mid-sized US city to be one of those niche things where you have a very low number of uniques, but each unique then pokes around on 9 or 10 pages while they're there. The fact that the site didn't crash during normal browsing was what originally led me to speculate they were maintaining an open DB connection per session. If that was indeed the issue, I could totally imagine they'd only rarely (never?) had 100+ "concurrent-ish" unique visitors.
Ok, then why couldn't you revise your scraper so that it did everything in a single session, to avoid this problem?
To me, for private, personal use, a scraper should emulate a normal human browser as much as possible to avoid causing site problems and to avoid detection. If what you're doing can be done in the background, or by a cron process at some odd hour, it doesn't have to be fast at all, and you can set the timings to be similar to a normal human.
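For whatever it's worth, the shape of that is pretty simple. Here's a rough Python sketch of what I mean; the URLs, headers, and delay values are just placeholders, not anyone's actual setup:

    import random
    import time

    import requests

    # Hypothetical list of result pages; the real URLs depend on the site.
    urls = ["https://example.com/results?page=%d" % n for n in range(1, 11)]

    # One Session object keeps cookies across requests, like a real browser tab.
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0 (personal archiving script)"

    for url in urls:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        # ... parse/save response.text here ...
        # Sleep a human-ish, slightly randomized interval between page loads.
        time.sleep(random.uniform(5, 15))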
Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.
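To make that concrete, the anti-pattern I'm imagining would look roughly like this. This is a made-up Flask/Postgres sketch, not their actual stack, just to illustrate how cookieless requests would strand connections:

    import os

    import psycopg2
    from flask import Flask, session

    app = Flask(__name__)
    app.secret_key = "dev"

    # One open connection per session, never closed; this is the bug.
    connections = {}

    @app.route("/results")
    def results():
        # With cookies disabled, the client never sends the session cookie back,
        # so every request looks like a brand-new session and opens (and strands)
        # another connection until the DB hits its connection limit.
        sid = session.setdefault("id", os.urandom(8).hex())
        if sid not in connections:
            connections[sid] = psycopg2.connect("dbname=trivia")
        conn = connections[sid]
        with conn.cursor() as cur:
            cur.execute("SELECT team, score FROM results LIMIT 50")
            rows = cur.fetchall()
        return {"rows": rows}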
I'd consider that a bug not a feature but I still think it's incumbent on me, the guy scraping the website, not to trigger it.
That is a classic connection pooling/lifecycle bug, and usually one that gets caught within the first few days of having multiple people using a product/service, worst case.
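The textbook fix is to check connections out of a small, bounded pool per request and always hand them back, instead of tying them to sessions. Roughly, using psycopg2's built-in pool as a sketch (not a drop-in for whatever they were actually running):

    from psycopg2.pool import ThreadedConnectionPool

    # At most 10 connections, no matter how many sessions/visitors show up.
    pool = ThreadedConnectionPool(1, 10, dsn="dbname=trivia")

    def fetch_scores():
        conn = pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT team, score FROM results")
                return cur.fetchall()
        finally:
            pool.putconn(conn)  # always returned, so connections can't pile up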
If someone's production site, one that's been around for a while, had a bug like this that can be triggered by what you describe, I'd love to see how many real users they have. I'm sure it's possible under certain circumstances, but it's definitely bad engineering that would be tripped by literally any traffic.
You can avoid triggering this in your scraper by activating a cookie jar. Pretty simple most of the time. Even command-line cURL and wget support it. I'm sure you figured that out already, but just for anyone who's wondering. ;)
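In Python, the standard-library spelling of "activate a cookie jar" looks something like this (placeholder URL, obviously):

    import urllib.request
    from http.cookiejar import CookieJar

    # The jar holds the site's session cookie between requests, so the server
    # sees one ongoing session instead of a new one per page fetch.
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    html = opener.open("https://example.com/results?page=1").read()
    # Later fetches through the same opener reuse the stored cookies.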
That said, while obviously you want to avoid triggering the bug since it offlines your data source, this is definitely in the site's court to fix and could easily be triggered by normal usage. Some people browse with cookies disabled, especially since the EU passed its "cookie law", requiring sites to get consent before storing a cookie on visitors' machines. If you've started to notice more sites talking about cookies over the last year, that's why. [0]
>Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.
Could also be something like storing Hibernate's second-level cache in the session. Unfortunately, I've seen this: a significant chunk of the database was being copied into each user's session.