
Time for me to advocate again for people to use Common Crawl. Please don't slam people's websites; look for alternatives before scraping. There are probably other, better options: APIs, data set downloads, etc.

https://commoncrawl.org/



I'd guess that for many popular scraping use cases this is not really useful, as they're usually about being quick and up to date (job postings, availability information, e-commerce, SERPs, ...), not about having a big corpus of historic data.


Have you used this in real-world scenarios? Or is it just a nice hypothetical that sounds great in theory but almost never works in practice?


Common Crawl is missing far too many URLs for it to be useful in a real-world scenario.


But can't you add to their index?


No. You can add to the Wayback Machine at web.archive.org via their "Save Page Now" interface. Common Crawl is attempting to be a sample of the web, and doesn't take URL suggestions.


I wish web.archive.org had an index by someone like Common Crawl. There is lots of great stuff on archive.org.


web.archive.org has a CDX index, similar to Common Crawl.

Since I use both of these archives together, I wrote this code to iron out the differences between them:

https://github.com/cocrawler/cdx_toolkit
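
To give a rough idea of the kind of differences involved, here is a minimal standard-library sketch; it is not how cdx_toolkit itself works. It assumes the Wayback Machine CDX endpoint returns a JSON array whose first row is a header, while a Common Crawl index endpoint returns one JSON object per line. The endpoint URLs, field names, the crawl ID, and the helper names (`build_query`, `parse_wayback`, `parse_common_crawl`, `normalize`) are all assumptions for illustration, so check them against each service's documentation.

```python
import json
from urllib.parse import urlencode

# Assumed endpoints. The Common Crawl crawl ID is only a placeholder;
# each crawl has its own index name.
WAYBACK_CDX = "https://web.archive.org/cdx/search/cdx"
COMMON_CRAWL_CDX = "https://index.commoncrawl.org/{crawl_id}-index"


def build_query(source, url, limit=5, crawl_id="CC-MAIN-2024-10"):
    """Build a CDX query URL for 'ia' (Wayback) or 'cc' (Common Crawl)."""
    params = urlencode({"url": url, "output": "json", "limit": limit})
    if source == "ia":
        return f"{WAYBACK_CDX}?{params}"
    if source == "cc":
        return f"{COMMON_CRAWL_CDX.format(crawl_id=crawl_id)}?{params}"
    raise ValueError(f"unknown source: {source!r}")


def parse_wayback(body):
    """Parse a JSON array of rows; the first row is assumed to be the header."""
    rows = json.loads(body) if body.strip() else []
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def parse_common_crawl(body):
    """Parse newline-delimited JSON, one capture per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def normalize(record):
    """Map the (assumed) differing field names onto one common shape."""
    return {
        "url": record.get("url") or record.get("original"),
        "timestamp": record.get("timestamp"),
        "status": record.get("status") or record.get("statuscode"),
    }
```

The response body for a query URL could be fetched with `urllib.request.urlopen` and passed to the matching parser; please keep `limit` small and be polite to both services.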


Hey! I was using your tool a couple months ago. It was super helpful for my project.


Thanks! I rarely hear from users, great to hear from you!


They do, and it's better than Common Crawl's by my testing.


That looks like a great resource! How often is the data set "updated"?

I'd imagine most people's use cases need data that can change from day to day or week to week, but I do think this would be fantastic if I were to have a project looking at data across a longer timeframe.


That is too much data to parse for a simple website scrape.

I do think Common Crawl has a lot of potential for people to use instead of scraping, but I think it's for larger projects. It gave me the idea to look at the links to identify whether they belong to a business or non-business website.



