The amount of data locked up inside private internal databases is huge. This is especially true of regulated industries. There is a wealth of data: financial data showing how to budget for things, pricing data on various B2B products, standard operating procedures at mature companies that have gone through various revisions, designs for manufacturing plants so people don't keep reinventing and making the same mistakes again, and on and on.
I think there are post-training tweaks that can be done with corporate data to help fit an AI to a specific corporation. But I don’t think that private data will deliver us AGI. The knowledge for AGI is out in the world, not hidden inside corporations. Private data brings us knowledge of the XYZ project status and the division ABC budget and whether Bob wants a chocolate cake for his going away dinner or not.
I'm not seeing it the same way. Businesses in various industries have several types of moats - money, knowledge, experience, skills, etc. There is a ton of competitive intelligence hidden in private data.
It's one of the reasons you can't use ChatGPT and start manufacturing chips, vaccines, or anti-cancer medication. There is a gap between the publicly available data that informs academic "core science" research and the specific product-based knowledge that shows you how to make a successful drug candidate that can withstand regulatory scrutiny or be a safe and effective drug for the world's population.
We could iterate so quickly if this private data set was democratized.
Reddit alone contains about the same quantity of text (~10 billion posts * 10 words per post, vs 1 million books * 100k words per book). Messaging and document platforms (Google Docs, Slack, Discord, Telegram, etc.) probably each have 1-3 orders of magnitude more than Reddit. To your/GP's point though, those private platforms probably haven't been slurped up by LLMs yet.
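For anyone who wants to check the arithmetic, here is a quick back-of-envelope sketch. The per-post and per-book averages are the rough guesses from the comment above, not measured figures, and the variable names are made up for illustration.

```python
# Back-of-envelope word counts using the rough estimates from the comment.
# All inputs are guesses, not measurements.
reddit_posts = 10_000_000_000   # ~10 billion posts
words_per_post = 10             # assumed average words per post
books = 1_000_000               # 1 million books
words_per_book = 100_000        # assumed average words per book

reddit_words = reddit_posts * words_per_post
book_words = books * words_per_book

# Private platforms guessed at 1-3 orders of magnitude more than Reddit.
private_low = reddit_words * 10
private_high = reddit_words * 1_000

print(f"Reddit words:      {reddit_words:.0e}")
print(f"Book words:        {book_words:.0e}")
print(f"Private platforms: {private_low:.0e} to {private_high:.0e} each")
```

Both totals come out to about 10^11 words, which is why the two corpora are described as roughly the same size; the private-platform guess would put each of those at 10^12 to 10^14 words.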
Which is what percent of the world’s content? 0.000000001% or something similar. It’s nothing in the scheme of things. To put it another way, if we were to digitize that content and train on it, our AIs would not get noticeably better in any way. It doesn’t move the needle.
You’re missing the point. And we’re not just talking about books, whatever that might mean. We’re talking about all documents ever made. Every magazine article, every blog and web page, every Word doc, etc. I’m pretty sure that whatever is in the Vatican archives is tiny by comparison. Given the age of the Vatican archives, I can also guarantee that many of those “books” are nothing more than page fragments. Very few will be full codices or long scrolls. Many will date before the printing press when document production was slow and laborious.
I’m sure Gemini has done it at some level. Google was pretty much founded on the assumption that more data is better. That has driven them to build or buy data sets that they can mine (Gmail, YouTube, etc.).