
What large caches of undigitized content exist? Surely not everything has been digitized, but I can’t imagine it’s much in percentage terms.



The amount of data locked up inside private internal databases is huge, especially in regulated industries. There is a wealth of it - financial data showing how to budget for things, pricing data on B2B products, standard operating procedures at mature companies that have gone through many revisions, designs for manufacturing plants so people don't keep reinventing things and making the same mistakes, and on and on.

I think it's implied that they're not talking about private data when they say they've run out.

fair. I want to +1 the fact that there is a large amount of data unseen by LLMs.

I think there are post-training tweaks that can be done with corporate data to fit an AI to a specific corporation. But I don’t think that private data will deliver us AGI. The knowledge for AGI is out in the world, not hidden inside corporations. Private data brings us knowledge of the XYZ project status, the division ABC budget, and whether Bob wants a chocolate cake for his going-away dinner or not.

I'm not seeing it the same way. Businesses in various industries have several types of moats - money, knowledge, experience, skills, etc. There is a ton of competitive intelligence hidden in private data.

It's one of the reasons you can't use ChatGPT and start manufacturing chips, vaccines, or anti-cancer medication. There is a huge gap between the publicly available data that informs academic "core science" research and the specific product-based knowledge that shows you how to make a successful drug candidate - one that can withstand regulatory scrutiny and be a safe and effective drug for the world's population.

We could iterate so much faster if this private data were democratized.


The Vatican Library contains roughly 1.1 million printed books and around 75,000 codices, only a small percentage of which have been digitised.

Reddit alone contains about the same quantity of text (~10 billion posts * 10 words per post, vs 1 million books * 100k words per book). Messaging and document platforms (Google Docs, Slack, Discord, Telegram, etc.) probably each hold 1-3 orders of magnitude more than Reddit. To your/GP's point though, those private platforms probably haven't been slurped up by LLMs yet.
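As a rough sanity check, here is that back-of-envelope arithmetic spelled out in Python (the post count and words-per-item figures are the comment's own assumptions, not measured values):

    # Back-of-envelope comparison of text volume. All figures are rough
    # assumptions taken from the comment above, not measurements.
    reddit_posts = 10_000_000_000     # ~10 billion posts (assumed)
    words_per_post = 10               # short comments dominate (assumed)
    vatican_books = 1_100_000         # ~1.1 million printed books
    words_per_book = 100_000          # typical full-length book (assumed)

    reddit_words = reddit_posts * words_per_post      # 1.0e11 words
    vatican_words = vatican_books * words_per_book    # 1.1e11 words

    print(f"Reddit:  ~{reddit_words:.1e} words")
    print(f"Vatican: ~{vatican_words:.1e} words")

Both land around 1e11 words, i.e. the same order of magnitude.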

Which is what percent of the world’s content? 0.000000001% or something similar. It’s nothing in the scheme of things. To put it another way, if we were to digitize that content and train on it, our AIs would not get noticeably better in any way. It doesn’t move the needle.

1.1 million being 0.000000001% implies a total count of 1e17 books in the world - the real number is closer to 1e8.
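Making the implied total explicit (a minimal check, taking the parent's percentage literally):

    # What world total does "1.1 million books = 0.000000001%" imply?
    vatican_books = 1.1e6
    claimed_fraction = 0.000000001 / 100   # 0.000000001% as a fraction = 1e-11
    implied_total = vatican_books / claimed_fraction
    print(f"Implied total: {implied_total:.1e} books")   # ~1.1e+17

Estimates of all books ever published are on the order of 1e8, so the quoted percentage overstates the world total by roughly nine orders of magnitude.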

You’re missing the point. And we’re not just talking about books, whatever that might mean. We’re talking about all documents ever made. Every magazine article, every blog and web page, every Word doc, etc. I’m pretty sure that whatever is in the Vatican archives is tiny by comparison. Given the age of the Vatican archives, I can also guarantee that many of those “books” are nothing more than page fragments. Very few will be full codices or long scrolls. Many will date before the printing press when document production was slow and laborious.

What makes you believe that most things have been digitised in the first place?

Has the whole of YouTube been indexed?

I’m sure Gemini has done it at some level. Google was pretty much founded on the assumption that more data is better. That has driven them to build or buy data sets that they can mine (Gmail, YouTube, etc.).



