With the right combination of proxies, user agents and browsers, you can scrape every website. Even those that seem unscrapable.
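In practice, that combination usually means rotating both the exit IP and the browser fingerprint on every request. A minimal Python sketch of the idea; the proxy endpoints and user-agent strings here are hypothetical placeholders, and a real setup would pull them from a proxy provider and a maintained UA list:

```python
import random

import requests

# Hypothetical proxy endpoints and user agents; a real setup would pull
# these from a proxy provider and a maintained UA list.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def fetch(url: str) -> requests.Response:
    # Rotate the exit IP and browser fingerprint per request so the
    # traffic doesn't present a single, easily blockable pattern.
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        headers={"User-Agent": random.choice(USER_AGENTS)},
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```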
> This outcome was great news for web scrapers, as it means that so long as a website has made its data public you are not in violation of the CFAA when you scrape that data, even if scraping is prohibited in some other way (T&Cs, robots.txt, etc.).
Just because you can doesn't mean you should. It would be better, I think, if there were a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.
100% agree; scraping should always be done respectfully (a rough sketch of these habits in code follows the list):
- If they provide an API, then use it.
- Don't slam a website; ideally spread requests out over the hours of the day when their target audience is least active (night time).
- If cached data from somewhere else works for your use case, then use that.
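By way of illustration, a minimal Python sketch of those habits: honoring robots.txt, throttling, and reusing cached responses. The "my-bot" user agent, the example.com target, and the five-second delay are placeholder assumptions, not figures from this thread:

```python
import time
import urllib.robotparser

import requests

# Fetch and parse the site's robots.txt up front.
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

cache: dict[str, str] = {}  # naive in-memory cache; reuse instead of re-fetching

def polite_get(url: str, delay: float = 5.0) -> str | None:
    if url in cache:                     # prefer cached data
        return cache[url]
    if not rp.can_fetch("my-bot", url):  # skip anything robots.txt disallows
        return None
    time.sleep(delay)                    # throttle: one request every few seconds
    resp = requests.get(url, headers={"User-Agent": "my-bot"}, timeout=10)
    cache[url] = resp.text
    return resp.text
```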
Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also from a cost and resources point of view. Scraping is resource intensive, and proxy costs can quickly rise to $1,000-$10,000 per month, so most scrape only the minimum they need.
The other thing here is that a lot of the most scraped sites are massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors.
Don't take my home address, name, family members' names, salary, and cell phone number, aggregate and sell them, and then claim "it's all publicly available anyway".
If you post that data on a public website, it is publicly available. It's like writing that info on a piece of cardboard, putting it up in the town square, and then asking 'why are you people stealing my data?'
I disagree because there is a difference between posting something publicly for humans and posting something publicly for bots/large scale analysis. I'm ok with my employer possibly being able to see whether I am looking for a new job or not on LinkedIn if that means they would need to have a human looking at my LinkedIn page. I am not ok with them training some ML algorithm to monitor my LinkedIn page to determine how likely I am to leave the company at all times.
Another danger is when public but not easily accessible data can be used to deanonymize datasets, which is probably the norm rather than the exception for 'anonymized' datasets. Sure, there are technical measures that make it harder, but at the end of the day I think a lot of privacy is about respecting social boundaries and not breaking these protections even when it's technically possible. Most of the time, these measures are about keeping honest people honest, not about stopping dedicated attackers.
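As a toy illustration of that linkage risk (fabricated records, hypothetical column names): joining an "anonymized" dataset against a public one on shared quasi-identifiers is often enough to re-attach names.

```python
import pandas as pd

# Fabricated records. The "anonymized" dataset has no names, but it still
# carries quasi-identifiers (ZIP code, birth date, sex).
anonymized = pd.DataFrame({
    "zip": ["02139"], "birth_date": ["1965-07-01"], "sex": ["F"],
    "diagnosis": ["hypertension"],
})
public = pd.DataFrame({
    "name": ["Jane Doe"], "zip": ["02139"],
    "birth_date": ["1965-07-01"], "sex": ["F"],
})

# A plain join on the quasi-identifiers re-attaches a name to the record.
print(anonymized.merge(public, on=["zip", "birth_date", "sex"]))
```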
No they don't; Google and Bing respect robots.txt. Most websites open it up to them because they need the traffic, so it's a type of scraping that is beneficial.
Any other scraping, especially when ignoring robots.txt, is unsolicited. And if said website takes additional advanced anti-scraping measures, and you persist in bypassing that too, then to me you're clearly unethical, even if it's technically legal.
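For reference, robots.txt makes exactly that distinction easy to express. A hypothetical file that welcomes the major search engines and disallows everyone else might look like:

```
# Allow the major search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Disallow everything else
User-agent: *
Disallow: /
```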
"It's public" is a legal defense, not an ethical one. It's public for readers, not for scrapers. It's public within the original context of the website, which may include monetization.
Photographing every page of a book and then reading it that way may be legally allowed, but it's still unethical.
There's somebody in our neighborhood who, instead of paying for private trash collection, takes tiny bags of his trash to the park and dumps them into the public trash cans.
No. robots.txt is not defined or enforced by law. Just because someone came up with a 'recommendation' like robots.txt does not mean it is the law.
> Any other scraping, especially when ignoring robots.txt, is unsolicited. And if said website takes additional advanced anti-scraping measures, and you persist in bypassing that too, then to me you're clearly unethical, even if it's technically legal.
I suppose it just comes down to your own morals, but I see nothing at all unethical about scraping a site for personal use provided that it's done gently enough to avoid DoS or disruption. The idea that saving webpages to read later is parasitic or unethical if a website uses robots.txt to discourage commercial scrapers and data-mining goes way too far.
You're really taking the most innocent stance possible on scraping.
The article talks about large-scale scraping, which involves all kinds of bypass tools, proxies, hardware, and commercial services that abstract this away.
This industrial-scale scraping is not the same thing as you saving a local copy of 3 web pages. The scale is a million times bigger, and it's certainly not for personal use.
What you fail to acknowledge is that Bing, Google, et cetera have an effective monopoly on search. They can afford to respect robots.txt because everyone wants them to scrape their site.
The first-mover advantage is so huge in this case that, without allowing scraping, it's hard to see how anyone could ever compete with these monoliths.
I myself wrote a webserver, albeit a specialised one, and out of curiosity I also created a few pages which were in no way accessible unless you knew their web addresses. There were no links to these pages from the home page or anywhere else, and I didn't even tell anyone about them; yet in my logs, I could see those pages being spidered!
My robots.txt was set up as an instruction to proceed no further, so I think there are other feedback mechanisms guiding the spiders, but I haven't worked out whether it's the web browser or actual infrastructure like switches or routers.
On an eCommerce site I'm responsible for, I changed some links from a GET to a POST. "BingPreview" continued hitting those links with GET requests, polluting my logs with hundreds of "method not allowed" entries. So I blocked that UA from those links; nothing changed. I banned the bot altogether; it kept hitting my site. This went on for well over a year.
What does that mean exactly? An actual user can't be involved because the links that trigger a GET simply aren't there anymore. Therefore I assume it's a bot hitting stale links it finds in its cache.
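For what it's worth, the two countermeasures described (POST-only endpoints and a UA ban) are straightforward to express at the application layer. A hypothetical Flask sketch, with an assumed route name since the site's actual stack isn't given; note that, as in the anecdote, this only rejects the requests with errors rather than stopping them from arriving:

```python
from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_UAS = ("BingPreview",)

@app.before_request
def ban_bad_bots():
    # Reject any request whose User-Agent matches a banned bot outright.
    ua = request.headers.get("User-Agent", "")
    if any(bad in ua for bad in BLOCKED_UAS):
        abort(403)

# Hypothetical endpoint: after the GET-to-POST change, stale GETs from a
# bot's cache receive an automatic 405 "method not allowed" from Flask.
@app.route("/cart/add", methods=["POST"])
def add_to_cart():
    return "added"
```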
This sort of implies that the 'ethics' would end up meaning you shouldn't scrape if it's not wanted, although I suppose there can be ethical or other non-commercial requirements that mean you should.