As someone who has been aggressively cataloging "data" (posts, comments, subreddits, etc.) from Reddit and, importantly in this context, keeping those records relatively up-to-date, it's absolutely astonishing how much spam there is.
I hash every string with a SimHash and perform a Hamming distance query against those hashes for any hash that belongs to more than 3 accounts, i.e., any full string (> 42 characters) which was posted as a post title, post body, comment body, or account "description" by more than 3 accounts.
Regularly, this exposes huge networks of both fresh accounts and what I have to assume are stolen, credentialed "aged" accounts being used to spam that just recycle the same or very similar (Hamming distance < 5 on strings > 42 characters) titles/bodies. We're talking thousands of accounts over months just posting the same content over and over to the same range of subreddits.
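For anyone curious what that check might look like, here is a minimal sketch of the technique described (a SimHash per string, then Hamming-distance grouping, keeping clusters shared by more than 3 accounts). All function and parameter names are mine, not the commenter's actual pipeline, and a real deployment would use an indexed lookup rather than this linear scan:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Minimal SimHash over whitespace tokens."""
    v = [0] * bits
    for token in text.lower().split():
        # Stable 64-bit hash per token, derived from md5.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate_clusters(posts, max_dist=5, min_accounts=4, min_len=42):
    """posts: iterable of (account, string) pairs.
    Groups strings longer than min_len chars whose SimHashes are within
    max_dist bits of each other, and keeps groups posted by at least
    min_accounts accounts (i.e. 'more than 3 accounts' with the defaults)."""
    clusters = []  # each: {"hash": int, "accounts": set, "strings": list}
    for acct, s in posts:
        if len(s) <= min_len:
            continue
        h = simhash(s)
        for c in clusters:
            if hamming(h, c["hash"]) <= max_dist:
                c["accounts"].add(acct)
                c["strings"].append(s)
                break
        else:
            clusters.append({"hash": h, "accounts": {acct}, "strings": [s]})
    return [c for c in clusters if len(c["accounts"]) >= min_accounts]
```

At scale you wouldn't compare against every cluster; the standard trick is to split each 64-bit hash into chunks and index on them, so candidates within a small Hamming distance can be found with a handful of exact lookups.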
I'm just some random Laravel enjoyer, and I've automated the 'banning' of these accounts (really, I flag the strings, and any account that posts them is then flagged).
This doesn't even touch on the media... (I've basically done the same thing with hashing the media to detect duplicate or very, very similar content via pHash). Thousands and thousands of accounts are spamming the same images over and over and over.
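pHash proper is DCT-based; as a hedged illustration of the same idea, here is the simpler average-hash variant, operating on an already-downscaled 8x8 grayscale grid. Names are illustrative, and a real pipeline would first decode and resize the image (e.g. with a library like Pillow):

```python
def average_hash(pixels):
    """pixels: 8x8 grid of grayscale values (0-255), already downscaled.
    Each pixel contributes one bit: 1 if it is brighter than the mean.
    Near-identical images produce hashes a few bits apart."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    bits = 0
    for i, p in enumerate(flat):
        if p > avg:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")
```

Duplicate detection then works exactly as with the text hashes: flag any image whose hash sits within a small Hamming distance of an already-flagged one.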
From my numbers, 59% of the content on Reddit is spam, and 51% of the accounts are spam, and that's not including the media-flagged spammers.
They don't seem to care about the spam, or they're completely inept. With the resources at their disposal, a huge portion of this could be moderated before it ever reaches the API or goes live.
> They don't seem to care about the spam, or they're completely inept
They don't care, at least that's the conclusion I've reached after repeatedly reporting content farms. I think they drive engagement anyway, so it's good for business at the end of the day. I didn't do a thorough study like you apparently did, but anecdotally, from the popular subreddits, I've spent enough time on Reddit (unfortunately) to recognize rehashed content that's reposted periodically to 'mine' karma. At some point, Reddit will be just a bunch of spam bots talking to each other and upvoting each other's content, while humans will be spectators or their content will be buried. Either way, this will be great for Reddit, as it's good for business (ad impressions). It will be bad for Google, as they're training their AIs on spam, and they will notice.
> Couldn’t Reddit solve this simply by making karma worthless, like HN?
Why would they solve this? These bots create engagement. Until some bot farm is weaponized to organize some sort of IRL 'bad thing', I bet nobody at Reddit actually gives a fuck.
Thanks, but after reading that I still don’t get it. What makes an account with 10,000 karma worth money as opposed to one with 100 karma? Do high karma accounts get more prominent display, bolder text, or something else that’s actually worth money?
> What makes an account with 10,000 karma worth money as opposed to one with 100 karma
Astroturfing. Say you're on /r/AppleWatch and you see a post asking for watch band recommendations. People start posting recommendations, and you check who's posting what. You will instinctively trust the 3-year-old account with 10k karma that's very popular on /r/pics over the 2-month-old account with 100 karma. That's one example of why some of these accounts are actually bought and sold: https://www.epicnpc.com/forums/reddit-accounts.1277/
Political activism is another scenario for these sockpuppet accounts with heavy karma. I've noticed A LOT of pro-Palestinian and pro-Russian propaganda from accounts with huge amounts of karma quickly gained from a handful of posts on mainstream subreddits like /r/pics or /r/funny, all of it content reposted from 3-4 years ago.
Higher karma accounts are less likely to get soft-locked out of communities ("comment/post deleted: you must have x karma to post here") and probably "look more authentic". That perception is probably still subliminally there for a bit longer.
There are commercial influencing operations on Reddit, but I think what you're describing doesn't really affect the usual user experience.
I suspect that the objective of these bulk spamming operations isn't to promote stuff on the platform, but to mess with other apps. LLMs trained on Reddit content, search engines that rank Reddit posts highly, etc.
Some time shortly before the API changes I did see _a lot_ of the spam content was clearly aimed towards prompting for comments to answer all kinds of rather generic questions about various life experiences. I can only imagine what it was used to train.
Now... so much OnlyFans. The OnlyFans spam dwarfs the rest. (I should mention that there are quite a few political/news subreddits I just flat-out ignore due to the amount of spam and astroturfing, so there's likely quite a bit there that I'm not seeing.)
> Some copypasta probably needs more sophisticated filtering.
I did eventually have to build a process to function as an 'allowlist' for legitimate strings that would otherwise have all the characteristics of spam.
Really, it's very subreddit dependent. I process a 'spam score' for each subreddit daily and there exists quite a bit of variance. Most of the larger subreddits are pretty well moderated.
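A daily per-subreddit 'spam score' like the one described could be as simple as the flagged fraction of that day's items, with allowlisted strings excluded before counting. This is just a guess at the shape of it; all names are illustrative:

```python
from collections import defaultdict

def daily_spam_scores(items, flagged_hashes, allowlist):
    """items: iterable of (subreddit, content_hash) for one day's content.
    flagged_hashes: hashes previously marked as spam.
    allowlist: legitimate strings that would otherwise look like spam
    (e.g. common copypasta); these are skipped entirely.
    Returns subreddit -> fraction of its counted items that were flagged."""
    total = defaultdict(int)
    spam = defaultdict(int)
    for sub, h in items:
        if h in allowlist:
            continue
        total[sub] += 1
        if h in flagged_hashes:
            spam[sub] += 1
    return {sub: spam[sub] / total[sub] for sub in total}
```

Tracking this daily per subreddit is what surfaces the variance the comment mentions: well-moderated subreddits hover near zero while neglected ones climb.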
This is so interesting. How do you manage to catalog all the subreddits in existence? Is there a page which lists them all? I assume the process from there is retrieving by most recent?
> How do you manage to catalog all the subreddits in existence?
Oh, I don't want to give the wrong impression. I'm not cataloging anywhere near _all_ subreddits, or all of anything. More or less, I started one day with one subreddit and built a system that just churns through what's there. The API is limited, and there are only so many creative ways to request the data (while staying within the TOS). Since I want to remain able to function, I've made sure to stay within the boundaries set forth.
Rather than trying to get _everything_ (there are services out there with databases of a lot of past/current Reddit data), which ends up as stale data (that may be useful for a content farm), I'm interested instead in a relatively accurate, current picture.
This project initially grew out of an interest in building an automated moderation bot to help out subreddits being spammed with content from accounts that are so obviously spam at the time of posting that it's astounding it ever makes it live. A few months into developing the initial crawling/database/hashing setup and getting things all tuned up, they announced the API changes. I lost all interest in the moderation aspects, but I had enjoyed using the project as a test bed for learning new things. (I came into this having no idea what a Hamming distance was.)
Despite any PR answer they may have provided to cover themselves, this is precisely the reason they limited the visibility of data such as the number of downvotes on a post.
They want to hide certain correlations around controversial posts from long-held accounts that otherwise have good post and comment karma. If one had access to timestamps too, one could essentially see some of the astroturfing and intentionally directed bot spam clearly. Too much data would allow one to see such corruption in very clear ways, and even analyze the rhetorical anti-communication hostility tactics.
Simultaneously, this gives them an excuse to ban any account that is productive in communication (via some slight breaking of some rule), with the explanation that they are too swamped to make discerning decisions.