Hacker News
Reddit's Sale of User Data for AI Training Draws FTC Investigation (wired.com)
131 points by thunderbong on March 16, 2024 | 53 comments


When I signed up for Reddit, AI wasn’t even a thing. Now everything I’ve ever written is being mined against my wishes to build a machine that I do not wish my information to be used to train. And I have no option to opt out, at all.

This isn’t a good situation, and if I had known I was going to end up training AI for free, I never would have started with Reddit. This was never an intended use.

I hope they put some restrictions on companies like Reddit. They’re moving far beyond what anyone ever intended to provide them, and these use cases are probably something most users wouldn’t want if they had a choice.


Did you know that HN has an API? Technically all your HN posts could be downloaded and used for AI training as well.

I don’t think we’ll be safe anywhere.


HN is hostile to the idea of user control of one's own content, so there was a signal from the start not to use it for more than anonymous throwaway comments. But Reddit allowed editing and deletion of comments so people used it more freely for years before the trust was broken. Now there's little confidence that deleted comments won't continue to be used.


This is exactly aligned with my point. Expectations matter. Social contracts matter.

Legality is always the argument that tyrants hide behind.


>Did you know that HN has an API? Technically all your HN posts could be downloaded and used for AI training as well.

It already has been used by LLMs


dang will get his own layer...


Why are you posting on the permanent public web if you don't want it to be read forever? You put up fliers at the local bar and are mad someone came by, took pictures, and made a scrapbook of it.

Trust me, I'm as annoyed as anyone about AI siphoning our culture to create bland grey replicas but people need to remember what websites are. They are other people's property, filled with cameras and surrounded by windows. When you go there and store your stuff, put up notes on the bulletin board, and whisper secrets to your friends, you're doing that on someone else's property, and they are watching you very closely. You are doing this in public.


Putting up fliers, choosing to be public is vastly, vastly different from using information previously provided under a different pretext for an entirely unwanted and unintended purpose, especially when it is to train digital intelligences that have inherently massive problematic social costs without explicit consent.

Put another way, it’s fine with me if someone wants to use the content that I provided to drive value on a social website.

It’s not OK with me that someone uses the content that I provided to replicate the author of what was written (without permission).

One thing to really understand about artificial intelligence is that it is, in some ways, essentially a reverse engineering of the person providing the content. I did not authorize a digital twin to be created from the content that I provided. I did not authorize a digital copy of myself, but that is what all of these companies are doing.

A fundamental stripping of the authorship to the point where you can replicate a proxy of the author and not just use the derivatives of that author.

This is a fundamental difference, because no other technology has ever been able to go back up the food chain and replicate the author in a way that allows for not only replication of the content and its relative functions, but a complete photocopying of the ideas and thinking and other mechanisms and patterns of the person that’s creating it.

It’s a profound difference that people have a hard time seeing.

I gave permission to use the content I provided. I did not give permission for someone to act as me or make a proxy of me based on what I have provided.

I consented for my media to be used, but not me, and that’s a huge difference.


While I philosophically agree with you, you're using a lot of words to describe what you felt but not what you actually agreed to.

What you consented to is an ever changing EULA that says something new every few months.

The "pretext" was that you thought you owned what you were posting but in reality you agreed that you didn't.

You don't sign a contract to put up fliers in public. You thought you were posting on the public commons but you were leaving your posts on someone's property after they warned you they were going to use them for whatever they wanted.

I personally hate this reality and scorn and avoid the cloud and SaaS as much as I can in my personal life. I deleted my accounts everywhere, and deleted all my content everywhere. Corporate cloud is shitty now.

We need free, open, peer to peer, opt in, civilian run network infrastructure like yesterday.


The old Reddit at least (I don't use the new UI, as it is absolutely horrible) had an option, "allow my data to be used for research purposes". It wasn't for AI, but I would imagine anyone who opted out of it would also not want to have their data shared for AI.


Doesn't the Reddit TOS give them a license to do basically whatever they want with your submissions?


And this was actually a significant controversy some years back.

A lot of users, including myself - though definitely still a small minority - quit Reddit over the increasingly aggressive monetisation attempts; some of us used automated history editors to prevent our posts from being scraped or searched (though it almost certainly will not remove them from back end databases).

We didn't know that this was the specific way our content was going to be used, but we knew it was probably going to be something we wouldn't like.

We have now been proved right, and if you decided, back then, to stay and keep contributing to the multi-billion-dollar site for free, now you know why the Cassandras were raising such a fuss.


Yea count me the fool on this one.

But like I said earlier, legality is always the argument that tyrants hide behind.


There are scripts to delete all your Reddit posts.


Which is great, assuming you trust that a "delete" actually results in the data being removed from Reddit's possession...


There is a right to be forgotten in the EU, and probably state-by-state laws in the US, but it might be too little too late.


After I did this, they just restored all my previous posts!


I thought this too at the time. I ran the scripts, checked my profile to make sure they were gone, then deleted my account. Went back the next day and a lot of my posts were back. And I couldn’t delete them anymore since I had killed the account.

But I just checked, and they are gone. I can search Google for my old username, and only references to my u/ by others show up.

You might want to check again; I think there was some cache or mirroring that the deletes took a long time to propagate through.


Yea, done it twice and both times my account was magically restored.


Your sentiments definitely resonate, but what about an alternative path forward:

We collectively agree to abolish all notions of intellectual property in recognition of the fact that everyone is standing on the shoulders of giants.

As it stands, IP law is only used against the average person in favor of the wealthy exploiters in charge.

Obviously this move would lead to cascading fundamental changes to society, but I for one welcome a deviation from this late stage capitalist hellscape which only seems to be getting worse.


[flagged]


His writings do seem prescient. His methods of expressing his distaste for society's direction still suck though.


Not least of which because they were woefully ineffective while they were inflicting suffering and fear. Still, one can sympathise with his hopeless, impotent rage more, the longer one sticks around on this planet.


He did teach us that nothing will change because we don't want it to. I think that, at least, is significant.


AI is a lot older than Reddit.


Deep learning, as in "efficiently trained neural networks utilizing gradient descent", really isn't.


Companies have been syphoning this data for free for years to train their models. Why would the proprietor of the data not be able to sell it? Assuming, of course, their terms of use plainly and clearly state that the data belongs to them.


things changed, and they want a piece of the cake now.

they get the chance to adapt to new stuff that's happening and profit from it, however the only ones left without a say are the users that generated that content.

I think it would be fair to say that previously accepted terms and conditions shouldn't ethically count for this, and user data previously generated should be shared for AI training only on an opt in basis.


> user data previously generated should be shared for AI training only on an opt in basis.

This seems obviously backwards and a boon to enormous corporations.

If you want to build a search engine, you have to index everything possible, or it won't be able to find that and then it's useless. Transacting with each individual person in the world would only be possible for megacorps -- it's already expensive enough to index everything if you don't have to do that. It's also perverse that someone could have a right to prevent someone else from providing true information about them. If you're standing there making a false claim, you shouldn't have a right to prevent someone else from proving you wrong just because the evidence is from your own past.

But search engines have been ML since even before LLMs, and LLMs are fundamentally the same. How can someone have a default right to deny the public access to facts?

Now, maybe there are some things you want to keep private, and then if you share them with someone you want to bind them to not sharing that information with others, like an NDA. But that's opt-out, not opt-in, and you can't really do that for things that are public.


The EU disagrees, and has implemented a right to have data erased from search indexes. It is in the GDPR and has applied to Google since 2016, it says here:

https://en.wikipedia.org/wiki/Right_to_be_forgotten


That's opt-out, not opt-in.


Is it? It's not "like an NDA". You get the data erased (if it is agreed that your privacy is more important than the public interest) after the fact of the data being made public.


Which is why it's controversial, and inconsistent with free speech.

If you want to be forgotten, you change your name and move somewhere else. That should be a right, because it puts it into the hands of the person who wants it to happen. You should be able to e.g. change your social security number, without the government telling anyone what your old one was. Now you're "forgotten" and get to start over.

How do you have a right to selectively suppress information about your past? That's just a boon to liars and scammers because they can do it continuously at no cost to themselves.

And even then, it's still something you have to explicitly ask for, not something that happens by default, notwithstanding that you can do it retroactively.


Right, getting the data erased is the opt-out part. Opt-in would be if they had to give permission to crawl in the first place, which is what folks are suggesting for AI training.


> user data previously generated should be shared for AI training only on an opt in basis.

Wouldn't all of the popular models and applications (ChatGPT, Copilot, etc.) have to be burned to the ground and started clean? My understanding was that pretty much every one of these was trained on "stolen" data, i.e. without prior agreement with the authors / content creators.


I think the law was clear before that any data on the public internet which doesn't require signing the TOS or logging in is publicly scrapable. People have been doing it for decades for things like market sentiment analysis, and even reselling the data for the same. Why would the AI wave suddenly change it?


I don't think that would be a concern if a proper law were passed.


> Why would the proprietor of the data not be able to sell it? Assuming of course their terms of use plainly and clearly state that the data belongs to them

And assuming we collectively (as “we the people”) grant them the ability to actually own this data. There are plenty of good reasons to consider that the data doesn't belong to Reddit, that it's just public stuff that they happen to host in exchange for user traffic and advertising revenues.


Legally maybe that all works, but ethically the data doesn't belong to Reddit any more than it belongs to Google.

They should find a better economic model than selling user data.


I'll just start an online therapy company, and when I have enough data, I'll just change the ToS and sell it all to the highest bidder. See, if I do this right before IPO, it's good!


Change all people and their names to the opposite gender (Alex -> Alexa, Joan -> John) and you’ll be 100% HIPAA compliant.


You have to fuzz their age, zip3, and other demographic factors too.


Perhaps more frustratingly, I’ve found that even when I manually went through my Reddit history and edited my posts, and then later went through again and deleted them all, they still exist in the original unedited form. I can find old posts via Google search for my username, or posts I commented on, despite Reddit itself telling me they have no history of my posts. I also, rarely, get a necro reply where someone replies to a very old post I wrote, which reminds me that these are still out there and discoverable, despite my attempts to edit and delete the information.

Reddit has no respect for its users. Your posts are not your posts. They belong to Reddit. As an EU citizen, it feels like what they are doing is probably illegal, but there is no recourse for the average user.


Reddit only associates the last 1000 comments with your profile. I’ve also spent the last few months regularly checking Google for another stray comment of mine to edit and delete.

However, if you get a Reddit data takeout, all your posts and comments should show up in there.


> I can find old posts via google search for my username

This has nothing to do with Reddit, I believe. Data for such results is stored in Google's indexes, created by scraping Reddit back when your original comments were in place.


Oh, I mean I can actually go to Reddit and read them there, along with my username.


I even wonder if this can be reported to whatever EU commission or group monitors and regulates data usage. I'm pretty ignorant on the matter.


AI looks at TRILLIONS of words. Doubt a post complaining that it’s impossible to get a girlfriend is going to add much to an LLM. That seems to be about 50% of the posts.

AKA today on Reddit: My GF says we have to wait 4 years before we have sex. She left me this week for a guy she met on a plane; she has not left this guy's hotel room in a week.

What do I do?

And the “advice” is endless posts.


By the same argument, it should be legal to steal candy bars from the grocery store since “they have so many other things”


Some hypermarkets do ignore small-scale shoplifting below a certain threshold. One can feel free to steal 0.0001 of a penny if they can afford their own transaction cost, and no police will prosecute this either.



This might be one of the few times a big tech name you know is actually selling your data. Most of the time when people say that, they mean Google and Facebook, but they actually keep your data to themselves and use it for ad targeting.


Does anyone know why they stopped with the awards thing? I really thought it was a clever way of monetizing, and I still regularly feel like buying a Reddit comment an award.

Was it really that they were embarrassed by how little money it was making?


I thought it was a really bad idea... I would rather send someone cash or crypto...

Glad they stopped it, but I was not aware of it because I stopped using reddit.



