Time for me to advocate again for people to use Common Crawl. Please don't slam ...

dewey · on Jan 12, 2022

I'd guess that for many popular scraping use cases this is not really useful, as they're usually about being quick and up to date (job postings, availability information, e-commerce, SERPs, ...), not about having a big corpus of historic data.

weird-eye-issue · on Jan 12, 2022

Have you used this in real world scenarios? Or is it just a nice hypothetical that sounds great in theory but almost never works in practice?

LunaSea · on Jan 12, 2022

Common Crawl is missing far too many URLs for it to be useful in a real world scenario.

Chris2048 · on Jan 12, 2022

But can't you add to their index?

wumpus · on Jan 12, 2022

No. You can add to the Wayback Machine at web.archive.org via their "save page now" interface... Common Crawl is attempting to be a sample of the web, and doesn't take URL suggestions.

mycall · on Jan 12, 2022

I wish web.archive.org had an index by someone like Common Crawl. There is lots of great stuff on archive.org.

wumpus · on Jan 12, 2022

web.archive.org has a CDX index, similar to Common Crawl.

Since I use both of these archives together, I wrote this code to iron out the differences between them:

https://github.com/cocrawler/cdx_toolkit
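To give a rough idea of the kind of difference involved, here is a minimal sketch that does not use cdx_toolkit itself and is not how that library is implemented. It assumes both CDX servers accept `url`, `output=json`, and `limit` query parameters, that the Wayback Machine answers with one JSON array whose first row is a header, and that Common Crawl answers with one JSON object per line. The function names and the crawl ID `CC-MAIN-2021-49` are illustrative choices, not anything from the toolkit.

```python
import json
from urllib.parse import urlencode

WAYBACK_CDX = "https://web.archive.org/cdx/search/cdx"
# Assumption: Common Crawl exposes one index endpoint per crawl.
COMMON_CRAWL_CDX = "https://index.commoncrawl.org/{crawl_id}-index"


def build_cdx_query(source, url, limit=5, crawl_id="CC-MAIN-2021-49"):
    """Build a CDX query URL for 'ia' (Wayback Machine) or 'cc' (Common Crawl)."""
    params = urlencode({"url": url, "output": "json", "limit": limit})
    if source == "ia":
        return f"{WAYBACK_CDX}?{params}"
    if source == "cc":
        return f"{COMMON_CRAWL_CDX.format(crawl_id=crawl_id)}?{params}"
    raise ValueError(f"unknown source: {source!r}")


def parse_cdx_response(source, text):
    """Normalize a raw response body into a list of dicts, one per capture."""
    if source == "ia":
        # Assumed shape: [[header...], [row...], [row...]]
        rows = json.loads(text)
        if not rows:
            return []
        header, records = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in records]
    if source == "cc":
        # Assumed shape: newline-delimited JSON objects
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unknown source: {source!r}")
```

Fetching is left out on purpose: pass the URL from `build_cdx_query` to whatever HTTP client you prefer, then hand the response body to `parse_cdx_response`. Field names still differ between the two archives after this step, which is one more thing a real tool has to reconcile.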

kevinsundar · on Jan 12, 2022

Hey! I was using your tool a couple months ago. It was super helpful for my project.

wumpus · on Jan 13, 2022

Thanks! I rarely hear from users, great to hear from you!

kevinsundar · on Jan 12, 2022

They do, and it's better than Common Crawl's by my testing.

joe_91 · on Jan 12, 2022

That looks like a great resource! How often is the data set "updated"?

I'd imagine most people's use cases need data that can change from day to day or week to week, but I do think this would be fantastic for a project looking at data across a longer timeframe.

jimkri · on Jan 12, 2022

That is too much data to parse for a simple website scrape.

I do think Common Crawl has a lot of potential for people to use instead of scraping, but I think it's for larger projects. It gave me the idea to look at the links to identify whether each one is a business or non-business website.