
As someone who has been aggressively cataloging "data" (posts, comments, subreddits, etc.) from Reddit and, importantly in this context, keeping those records relatively up-to-date, it's absolutely astonishing how much spam there is.

I hash every string with a SimHash and perform a Hamming distance query against those hashes for any hash that belongs to more than 3 accounts, i.e., any full string (> 42 characters) which was posted as a post title, post body, comment body, or account "description" by more than 3 accounts.

Regularly, this exposes huge networks of both fresh accounts and what I have to assume are stolen, credentialed "aged" accounts being used to spam that just recycle the same or very similar (Hamming distance < 5 on strings > 42 characters) titles/bodies. We're talking thousands of accounts over months just posting the same content over and over to the same range of subreddits.
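For readers who, like the commenter once did, have never met a Hamming distance: below is a minimal Python sketch of SimHash fingerprinting plus a near-duplicate check. It is not the commenter's code (their stack is Laravel). The word tokenizer, the use of MD5 for per-token hashes, the 64-bit width, and the names `simhash`, `hamming`, and `is_near_duplicate` are all assumptions for illustration; only the thresholds (strings > 42 characters, distance < 5) come from the comment.

```python
import hashlib
import re

MIN_LENGTH = 42    # from the comment: only strings longer than 42 characters
MAX_DISTANCE = 5   # from the comment: Hamming distance < 5 counts as "very similar"


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` (assumed: word tokens, MD5 per token)."""
    tokens = re.findall(r"\w+", text.lower())
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str) -> bool:
    """True if both strings are long enough and their fingerprints are close."""
    if len(a) <= MIN_LENGTH or len(b) <= MIN_LENGTH:
        return False
    return hamming(simhash(a), simhash(b)) < MAX_DISTANCE
```

Because similar token sets vote the same way on most bit positions, small edits to a long string tend to flip only a few bits, which is what makes a small Hamming distance a usable similarity test.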

I'm just some random Laravel enjoyer, and I've automated the 'banning' of these accounts (really, I flag the strings, and any account that posts them is then flagged).
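The "flag the strings, then flag the accounts" step might look roughly like the sketch below. This is a guess at the shape of the logic, not the commenter's implementation: the input format (pairs of account name and precomputed 64-bit fingerprint), the naive pairwise comparison, and the names `flag_accounts`, `max_distance`, and `min_accounts` are all hypothetical. A real system at this scale would need an indexed Hamming distance query rather than an O(n²) loop.

```python
from collections import defaultdict


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def flag_accounts(posts, max_distance=5, min_accounts=3):
    """posts: iterable of (account, fingerprint) pairs (assumed format).

    Returns (flagged_fingerprints, flagged_accounts). A fingerprint is flagged
    when it, together with its near-duplicates, was posted by more than
    `min_accounts` distinct accounts.
    """
    accounts_by_fp = defaultdict(set)
    for account, fingerprint in posts:
        accounts_by_fp[fingerprint].add(account)

    fingerprints = list(accounts_by_fp)
    flagged_fps = set()
    for fp in fingerprints:
        posters = set()
        # Naive scan: collect everyone who posted something close to `fp`.
        for other in fingerprints:
            if hamming(fp, other) < max_distance:
                posters |= accounts_by_fp[other]
        if len(posters) > min_accounts:
            flagged_fps.add(fp)

    # Any account that posted a flagged string is flagged too.
    flagged_accounts = set()
    for fp in flagged_fps:
        flagged_accounts |= accounts_by_fp[fp]
    return flagged_fps, flagged_accounts
```

Flagging the string rather than the account means a new account gets caught the first time it posts a known string, with no per-account review.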

This doesn't even touch on the media... (I've basically done the same thing with hashing the media to detect duplicate or very, very similar content via pHash). Thousands and thousands of accounts are spamming the same images over and over and over.
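For the media side, here is a rough pure-Python sketch of a DCT-based perceptual hash in the pHash style. It assumes the image has already been decoded, converted to grayscale, and resized to a small square grid (commonly 32x32); that preprocessing is left out. The function names and the choice of the median as the bit threshold are assumptions, and the commenter's actual pHash implementation is not shown in the thread.

```python
import math
from statistics import median


def dct_1d(values):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(matrix):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def phash(gray, hash_size=8):
    """gray: square grid of grayscale values (assumed already resized).

    Keeps the hash_size x hash_size lowest-frequency DCT coefficients and
    sets one bit per coefficient that is above their median.
    """
    coeffs = dct_2d(gray)
    low = [coeffs[y][x] for y in range(hash_size) for x in range(hash_size)]
    threshold = median(low)
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > threshold else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two images would then be compared the same way as the text fingerprints: a small `hamming(phash(a), phash(b))` suggests a duplicate or a lightly edited copy.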

From my numbers, 59% of the content on Reddit is spam, and 51% of the accounts are spam, and that's not including the media-flagged spammers.

They don't seem to care about the spam, or they're completely inept. With the resources at their disposal, a huge portion of this could be moderated before it ever reaches the API or goes live.



> They don't seem to care about the spam, or they're completely inept

They don't care; at least, that is the conclusion I've reached after repeatedly reporting content farms. I think they drive engagement anyway, so it's good for business at the end of the day. I did not do a thorough study like you apparently did, but anecdotally, from the popular subreddits, I've spent enough time on Reddit (unfortunately) to recognize rehashed content that's reposted periodically to 'mine' karma. At some point, Reddit will be just a bunch of spam bots talking to each other and upvoting each other's content while humans will be spectators, or their content will be buried. Either way, this will be great for Reddit as it's good for business (ad impressions). It will be bad for Google, as they're training their AIs on spam, and they will notice.


> Either way, this will be great for Reddit as it's good for business (ad impressions).

They probably have a slush fund to pay the spammers to boost traffic numbers


I’m not much of a Redditor. Why do spammers want to “mine” karma? What makes it valuable?

Couldn’t Reddit solve this simply by making karma worthless, like HN?


See https://www.reddit.com/r/KarmaBotKillers/wiki/index/

> Couldn’t Reddit solve this simply by making karma worthless, like HN?

Why would they solve this? These bots create engagement. Until some bot farm is weaponized to organize some sort of IRL 'bad thing', I bet nobody at Reddit actually gives a fuck.


Thanks, but after reading that I still don’t get it. What makes an account with 10,000 karma worth money as opposed to one with 100 karma? Do high karma accounts get more prominent display, bolder text, or something else that’s actually worth money?


> What makes an account with 10,000 karma worth money as opposed to one with 100 karma

Astroturfing. Say you're on /r/AppleWatch and you see a post asking for some watch band recommendations. People start posting said recommendations and you check to see who's posting what. You will instinctively trust the 3-year-old account with 10k karma that's very popular on /r/pics over the 2-month-old account with 100 karma. That's one example of why some of these accounts are actually bought and sold: https://www.epicnpc.com/forums/reddit-accounts.1277/

Political activism is another scenario for these sockpuppet accounts with heavy karma. I've noticed A LOT of pro-Palestinian and pro-Russian propaganda from accounts with huge amounts of karma quickly gained from a handful of posts on mainstream reddits like /r/pics or /r/funny which are reposted content from 3-4 years ago.


Higher karma accounts are less likely to get soft-locked out of communities ("comment/post deleted: you must have x karma to post here") and probably "look more authentic". That perception is probably still subliminally there for a bit longer.


Got it. So ironically, the subreddits are kind of causing the problem themselves by insisting on karma thresholds.


They really don't have much choice. It's likely one of the most effective ways to minimize spam from 5 min old accounts.


There are commercial influencing operations on Reddit, but I think what you're describing doesn't really affect the usual user experience.

I suspect that the objective of these bulk spamming operations isn't to promote stuff on the platform, but to mess with other apps. LLMs trained on Reddit content, search engines that rank Reddit posts highly, etc.


Some time shortly before the API changes I did see _a lot_ of the spam content was clearly aimed towards prompting for comments to answer all kinds of rather generic questions about various life experiences. I can only imagine what it was used to train.

Now... so much onlyfans. The onlyfans spam dwarfs the rest (I should mention that there's quite a few political/news subreddits I just flat out ignore due to the amount of spam and astroturfing - so likely there's quite a bit there that I'm not seeing)


Is this catalog project open source by chance?


It's not but there's a handful of catalogs out there that are _massive_ databases of reddit content.


How long have you been doing this?

And have you accounted for reddit-isms, such as posting huge chains of "Cat" or "nice"?


> How long have you been doing this?

A few years.

> And have you accounted for reddit-isms, such as posting huge chains of "Cat" or "nice"?

Like others have said, the 42 character limit is a bit of a sweet spot that I've found where basically everything is spam.

I do take a look at strings under 42 characters but have to take a bit of a manual approach there as natural repeated strings start breaking through.


I think the comment memes usually fall short of the 42 cutoff. Some copypasta probably needs more sophisticated filtering.


> Some copypasta probably needs more sophisticated filtering.

I did eventually have to build a process to function as an 'allowlist' for legitimate strings that would otherwise have all the characteristics of spam.
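A toy version of how the length cutoff and an allowlist could combine is sketched below. The 42-character and 3-account thresholds come from earlier in the thread; everything else (the normalization, exact-match allowlist lookup, and the names `normalize` and `should_flag`) is an assumption, and the real process presumably matches on fingerprints rather than exact strings.

```python
MIN_LENGTH = 42    # from the thread: shorter strings need manual review
MIN_ACCOUNTS = 3   # from the thread: posted by more than 3 accounts


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def should_flag(text: str, account_count: int, allowlist: set) -> bool:
    """Decide whether a repeated string is a spam candidate.

    `allowlist` holds normalized strings known to be legitimate despite
    being repeated widely (for example, bot footers or popular copypasta).
    """
    norm = normalize(text)
    if len(norm) <= MIN_LENGTH:
        return False   # short memes like "nice" are left for manual review
    if norm in allowlist:
        return False   # looks like spam statistically, but known to be fine
    return account_count > MIN_ACCOUNTS
```

Checking the allowlist before the account count keeps legitimate, widely repeated strings from ever entering the flagged set.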


A 42 character min will filter those out, no?


Are these posts and comments being upvoted or just being sent into a void that no one is seeing?


Both. Some of this stuff is _so obviously_ spam and yet there's hundreds of comments from all kinds of... not smart people. Just feasting on the bait.

Then there's lots that are just spammed into the abyss, accounts banned or suspended shortly after.


Do you work for Reddit?

I rarely read Reddit, but when I do I rarely see outright spam (I see plenty of noise).


Really, it's very subreddit dependent. I process a 'spam score' for each subreddit daily and there exists quite a bit of variance. Most of the larger subreddits are pretty well moderated.
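The comment does not say how the 'spam score' is computed. One simple possibility, shown purely as an assumption, is the fraction of a day's items in each subreddit that were flagged. The input format and the name `spam_scores` are hypothetical.

```python
from collections import Counter


def spam_scores(items):
    """items: iterable of (subreddit, is_flagged) pairs for one day (assumed format).

    Returns a dict mapping each subreddit to its flagged fraction, 0.0 to 1.0.
    """
    totals = Counter()
    flagged = Counter()
    for subreddit, is_flagged in items:
        totals[subreddit] += 1
        if is_flagged:
            flagged[subreddit] += 1
    return {sub: flagged[sub] / totals[sub] for sub in totals}
```

Recomputing this daily would surface the variance the commenter mentions, with well-moderated subreddits sitting near zero.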


This is so interesting. How do you manage to catalog all the subreddits in existence? Is there a page which lists them all? I assume the process from there is retrieving by most recent


> How do you manage to catalog all the subreddits in existence?

Oh, I don't want to give the wrong impression. I'm not cataloging anywhere near _all_ subreddits. Or all of anything. More or less I started one day with one subreddit and built a system that just churns through what's there. The API is limited and there's only so many creative ways to request the data (while staying within TOS) - as I've wanted to remain able to function I've made sure to stay within the boundaries set forth.

Rather than try to get _everything_ (there are services out there with databases of a lot of past/current Reddit data), which ends up as stale data (which may be useful for a content farm), I'm interested in a relatively accurate, up-to-date picture.

This project initially grew out of an interest in building an automated moderation bot to help out subreddits being spammed with content from accounts that are so obviously spam when the content was posted that it's astounding it ever makes it live. A few months into developing the initial crawling/database/hashing setup and getting things all tuned up, they announced the API changes and I lost all interest in the moderation aspects, but I had enjoyed using it as a test bed for learning new things. (I came into this having no idea what a Hamming distance was.)


https://www.reddit.com/subreddits/

Not sure if this lists all of them, but it should be a good start.


Despite any PR answer they probably provided to cover themselves, this is precisely the reason they altered the readability of data such as the number of downvotes on a post.

They want to hide certain correlations around controversial posts from long held accounts who otherwise have a good post and comment karma. If one had access to timestamps too then one could essentially see some of the astroturfing and intentionally directed bot spam clearly. Too much data would allow one to see such corruption in very clear ways. And even analyze the rhetorical anti-communication hostility tactics.

Simultaneously this gives them an excuse to ban all accounts productive in communication (via some slight breaking of some rule) with the explanation that they are too swamped to make discerning decisions.


Could you spill which subreddits have the most garbage in them?



