As someone who has been aggressively cataloging "data" (posts, comments, subreddits, etc.) from Reddit and, importantly in this context, keeping those records relatively up-to-date, it's absolutely astonishing how much spam there is.
I hash every string with a SimHash and perform a Hamming distance query against those hashes for any hash that belongs to more than 3 accounts, i.e., any full string (> 42 characters) which was posted as a post title, post body, comment body, or account "description" by more than 3 accounts.
Regularly, this exposes huge networks of both fresh accounts and what I have to assume are stolen, credentialed "aged" accounts being used to spam that just recycle the same or very similar (Hamming distance < 5 on strings > 42 characters) titles/bodies. We're talking thousands of accounts over months just posting the same content over and over to the same range of subreddits.
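For anyone curious what that check might look like, here is a minimal sketch of the technique described (a SimHash per string, then Hamming-distance grouping, keeping clusters shared by more than 3 accounts). All function and parameter names are mine, not the commenter's actual pipeline, and a real deployment would use an indexed lookup rather than this linear scan:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Minimal SimHash over whitespace tokens."""
    v = [0] * bits
    for token in text.lower().split():
        # Stable 64-bit hash per token, derived from md5.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate_clusters(posts, max_dist=5, min_accounts=4, min_len=42):
    """posts: iterable of (account, string) pairs.
    Groups strings longer than min_len chars whose SimHashes are within
    max_dist bits of each other, and keeps groups posted by at least
    min_accounts accounts (i.e. 'more than 3 accounts' with the defaults)."""
    clusters = []  # each: {"hash": int, "accounts": set, "strings": list}
    for acct, s in posts:
        if len(s) <= min_len:
            continue
        h = simhash(s)
        for c in clusters:
            if hamming(h, c["hash"]) <= max_dist:
                c["accounts"].add(acct)
                c["strings"].append(s)
                break
        else:
            clusters.append({"hash": h, "accounts": {acct}, "strings": [s]})
    return [c for c in clusters if len(c["accounts"]) >= min_accounts]
```

At scale you wouldn't compare against every cluster; the standard trick is to split each 64-bit hash into chunks and index on them, so candidates within a small Hamming distance can be found with a handful of exact lookups.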
I'm just some random Laravel enjoyer, and I've automated the 'banning' of these accounts (really, I flag the strings, and any account that posts them is then flagged).
This doesn't even touch on the media... (I've basically done the same thing with hashing the media to detect duplicate or very, very similar content via pHash). Thousands and thousands of accounts are spamming the same images over and over and over.
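pHash proper is DCT-based; as a hedged illustration of the same idea, here is the simpler average-hash variant, operating on an already-downscaled 8x8 grayscale grid. Names are illustrative, and a real pipeline would first decode and resize the image (e.g. with a library like Pillow):

```python
def average_hash(pixels):
    """pixels: 8x8 grid of grayscale values (0-255), already downscaled.
    Each pixel contributes one bit: 1 if it is brighter than the mean.
    Near-identical images produce hashes a few bits apart."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    bits = 0
    for i, p in enumerate(flat):
        if p > avg:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")
```

Duplicate detection then works exactly as with the text hashes: flag any image whose hash sits within a small Hamming distance of an already-flagged one.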
From my numbers, 59% of the content on Reddit is spam, and 51% of the accounts are spam, and that's not including the media-flagged spammers.
They don't seem to care about the spam, or they're completely inept. With the resources at their disposal, a huge portion of this could be moderated before it ever reaches the API or goes live.
> They don't seem to care about the spam, or they're completely inept
They don't care, at least that's the conclusion I've reached after repeatedly reporting content farms. I think they drive engagement anyway, so it's good for business at the end of the day. I didn't do a thorough study like you apparently did, but anecdotally, from the popular subreddits, I've spent enough time on Reddit (unfortunately) to recognize rehashed content that's reposted periodically to 'mine' karma. At some point, Reddit will be just a bunch of spam bots talking to each other and upvoting each other's content, while humans will be spectators or their content will be buried. Either way, this will be great for Reddit, as it's good for business (ad impressions). It will be bad for Google, as they're training their AIs on spam, and they will notice.
> Couldn’t Reddit solve this simply by making karma worthless, like HN?
Why would they solve this? These bots create engagement. Until some bot farm is weaponized to organize some sort of IRL 'bad thing', I bet nobody at Reddit actually gives a fuck.
Thanks, but after reading that I still don’t get it. What makes an account with 10,000 karma worth money as opposed to one with 100 karma? Do high karma accounts get more prominent display, bolder text, or something else that’s actually worth money?
> What makes an account with 10,000 karma worth money as opposed to one with 100 karma
Astroturfing. Say you're on /r/AppleWatch and you see a post asking for watch band recommendations. People start posting recommendations, and you check who's posting what. You will instinctively trust the 3-year-old account with 10k karma that's very popular on /r/pics over the 2-month-old account with 100 karma. That's one example of why some of these accounts are actually bought and sold: https://www.epicnpc.com/forums/reddit-accounts.1277/
Political activism is another scenario for these sockpuppet accounts with heavy karma. I've noticed A LOT of pro-Palestinian and pro-Russian propaganda from accounts with huge amounts of karma quickly gained from a handful of posts on mainstream subreddits like /r/pics or /r/funny, all of it content reposted from 3-4 years ago.
Higher karma accounts are less likely to get soft-locked out of communities ("comment/post deleted: you must have x karma to post here") and probably "look more authentic". That perception is probably still subliminally there for a bit longer.
There are commercial influencing operations on Reddit, but I think what you're describing doesn't really affect the usual user experience.
I suspect that the objective of these bulk spamming operations isn't to promote stuff on the platform, but to mess with other apps. LLMs trained on Reddit content, search engines that rank Reddit posts highly, etc.
Some time shortly before the API changes I did see _a lot_ of the spam content was clearly aimed towards prompting for comments to answer all kinds of rather generic questions about various life experiences. I can only imagine what it was used to train.
Now... so much OnlyFans. The OnlyFans spam dwarfs the rest. (I should mention that there are quite a few political/news subreddits I just flat-out ignore due to the amount of spam and astroturfing, so there's likely quite a bit there that I'm not seeing.)
> Some copypasta probably needs more sophisticated filtering.
I did eventually have to build a process to function as an 'allowlist' for legitimate strings that would otherwise have all the characteristics of spam.
Really, it's very subreddit dependent. I process a 'spam score' for each subreddit daily and there exists quite a bit of variance. Most of the larger subreddits are pretty well moderated.
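A daily per-subreddit 'spam score' like the one described could be as simple as the flagged fraction of that day's items, with allowlisted strings excluded before counting. This is just a guess at the shape of it; all names are illustrative:

```python
from collections import defaultdict

def daily_spam_scores(items, flagged_hashes, allowlist):
    """items: iterable of (subreddit, content_hash) for one day's content.
    flagged_hashes: hashes previously marked as spam.
    allowlist: legitimate strings that would otherwise look like spam
    (e.g. common copypasta); these are skipped entirely.
    Returns subreddit -> fraction of its counted items that were flagged."""
    total = defaultdict(int)
    spam = defaultdict(int)
    for sub, h in items:
        if h in allowlist:
            continue
        total[sub] += 1
        if h in flagged_hashes:
            spam[sub] += 1
    return {sub: spam[sub] / total[sub] for sub in total}
```

Tracking this daily per subreddit is what surfaces the variance the comment mentions: well-moderated subreddits hover near zero while neglected ones climb.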
This is so interesting. How do you manage to catalog all the subreddits in existence? Is there a page which lists them all? I assume the process from there is retrieving by most recent?
> How do you manage to catalog all the subreddits in existence?
Oh, I don't want to give the wrong impression. I'm not cataloging anywhere near _all_ subreddits, or all of anything. More or less, I started one day with one subreddit and built a system that just churns through what's there. The API is limited, and there are only so many creative ways to request the data (while staying within the TOS). Since I want to remain able to function, I've made sure to stay within the boundaries set forth.
Rather than trying to get _everything_ (there are services out there with databases of a lot of past/current Reddit data), which ends up as stale data (that may be useful for a content farm), I'm interested instead in a relatively accurate, current picture.
This project initially grew out of an interest in building an automated moderation bot to help out subreddits being spammed with content from accounts that are so obviously spam at the time of posting that it's astounding it ever makes it live. A few months into developing the initial crawling/database/hashing setup and getting things all tuned up, they announced the API changes. I lost all interest in the moderation aspects, but I had enjoyed using the project as a test bed for learning new things. (I came into this having no idea what a Hamming distance was.)
Despite any PR answer they may have provided to cover themselves, this is precisely the reason they limited the visibility of data such as the number of downvotes on a post.
They want to hide certain correlations around controversial posts from long-held accounts that otherwise have good post and comment karma. If one had access to timestamps too, one could essentially see some of the astroturfing and intentionally directed bot spam clearly. Too much data would allow one to see such corruption in very clear ways, and even analyze the rhetorical anti-communication hostility tactics.
Simultaneously, this gives them an excuse to ban any account that is productive in communication (via some slight breaking of some rule), with the explanation that they are too swamped to make discerning decisions.