You are speaking of some future state. However likely that may be, it doesn't exist on the fediverse now, whereas it does exist on the corporate platforms now. Obtaining this information by extracting it from ActivityPub streams is also a far cry from getting it from a centralized platform that is eager to sell it to anyone with money.
It has happened, and more than once, that I have found a fediverse post when doing a Google search. So web crawlers are already scooping the data up. And I mean, why shouldn't they? It's a publicly posted (micro)blog. There is no robots.txt telling them to stay away? The servers are handing it out for free to anyone who asks.
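For what it's worth, here is a rough sketch of how a well-behaved crawler would decide whether it may fetch a public post, using Python's standard `urllib.robotparser`. The host `social.example`, the post URL, and both robots.txt bodies are made up for illustration; I am not claiming any real fediverse server publishes either of them.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # Parse the rules from a string so the sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical public post URL (assumption, not a real server).
post_url = "https://social.example/@alice/1"

# Case 1: no rules at all, so a polite crawler considers everything allowed.
no_rules = ""

# Case 2: the server asks every crawler to stay away from everything.
block_all = "User-agent: *\nDisallow: /\n"

print(may_fetch(no_rules, post_url))
print(may_fetch(block_all, post_url))
```

Even in the second case this is only a request. Nothing technically stops a crawler that chooses to ignore it.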
The only missing part is using it to build profiles and join those profiles together. It may already exist, but I haven't heard of anyone doing it... yet.
I beg to differ. There is no technological hurdle to prevent this. Only privacy-through-obscurity, and who knows how long that will last?