
> open-source the entire search index db and accompanying webserver software, making it easy for users to setup their own local instance of DDG which is actually auditable

Easy to self-host? How large do you suppose the Bing index is, for example? Simply storing the index would be an immense undertaking beyond the reach of probably everyone who has ever self-hosted anything, ever. This ignores the compute required to actually search it, as well as how it would get updated.

I'm not sure your request is remotely reasonable.



I was curious, so as a point of comparison, the latest Common Crawl [0] is 3.1 billion pages and 370 TB uncompressed. I would presume Bing's index is significantly larger, given commercial interests.

[0]: https://commoncrawl.org/connect/blog/
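
For a rough sense of scale, here's some napkin math with those two figures (my arithmetic only, nothing from the crawl docs):

    # rough average page size implied by the Common Crawl figures above
    pages = 3.1e9        # pages in the latest crawl
    size_tb = 370.0      # uncompressed size, TB
    bytes_per_page = size_tb * 1e12 / pages
    print(f"~{bytes_per_page / 1e3:.0f} KB per page on average")  # ~119 KB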


If Google and AskJeeves worked perfectly fine 20 years ago for millions of monthly users, I find it hard to believe a powerful modern computer lacks the resources to run a search engine for a single person.


What is the largest hard disk one can buy nowadays? I found a WD Gold 20TB. You'd need 19 of them plugged into your computer just to hold the uncompressed archive from Common Crawl.
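
A quick sanity check on that drive count (ignoring filesystem overhead and any redundancy):

    import math
    archive_tb = 370   # uncompressed Common Crawl size quoted above
    drive_tb = 20      # one WD Gold 20TB
    print(math.ceil(archive_tb / drive_tb))  # 19 drives, with no redundancy at all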


Yet somehow search engines like Google and AskJeeves existed and worked alright 20+ years ago on hardware 1/1000th as powerful as it is today.


Firstly, Google was founded in 1998, which is 23 years ago.

Secondly, from 2000 to 2018 the internet went from having ~17,000,000 unique domains to having ~1,600,000,000 unique domains. See: https://www.internetlivestats.com/total-number-of-websites/

The performance of desktop computers has actually not increased as much as you would think: https://www.karlrupp.net/2015/06/40-years-of-microprocessor-...

Your assumption is correct if you look at supercomputers, where the fastest in the world in 1999 could produce ~2.3 TFLOPS and the fastest in 2018 could produce ~122 PFLOPS, which is roughly a 50,000x increase in FLOPS.
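
Taking those two figures at face value, the ratio works out like this:

    tflops_1999 = 2.3        # fastest supercomputer in 1999, TFLOPS
    pflops_2018 = 122.0      # fastest supercomputer in 2018, PFLOPS
    print(pflops_2018 * 1000 / tflops_1999)  # ~53,000x increase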

But I doubt most of the people who would want to go through this index have access to a supercomputer.


I wouldn't be surprised if the indexed subset of Facebook alone were more than 1000x larger than all of the indexed web 20 years ago. The web in general has probably expanded many millions or hundreds of millions of times.


Personally I wouldn't mind if trash/spam sites like Facebook/Twitter were omitted from the database, as well as non-English content, seeing as I only speak English. Remove trash/spam/non-English content from the db and that 300 TB will shrink substantially, to the point where it's feasible for a single person to store. After all, even storing the whole 300 TB db would cost about $4000 in hard drives, which is not as totally out of reach as some people here are making it seem.
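
That $4000 figure is roughly consistent with bulk hard drive pricing if you assume something like $13-14 per TB (the price per TB here is my assumption, not a quoted price):

    db_tb = 300          # the rounded figure used above
    usd_per_tb = 13.5    # assumed bulk HDD price; actual prices vary
    print(db_tb * usd_per_tb)  # ~$4050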


I don't think the web had the same number of websites 20 years ago...


That was a very different internet. Search engines aren't something you build once and then you just have them. Constant, extensive work is necessary. It's quite literally a global-scale task to do this effectively.



