
> Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

I bet people being fucking DDoSed by AI bots disagree

Also the fucking ignorance assuming it's "static content" and not something needing code running



I think the parent is just pointing out that these things lie on a spectrum. I have a website that consists largely of static content, and the (significant) scraping that occurs doesn't impact the site for general users, so I don't mind (and it means I get good, up-to-date answers from LLMs on the niche topic my site covers). If it did have an impact on real users, or cost me significant money, I would feel pretty differently.


Putting everything on a spectrum is what got us into this mess of zero regulation and moving goal posts. It's slippery slope thinking no matter which way we cut it, because every time someone calls for a stop sign to be put up after giving an inch, the very people who would have to stop will argue tirelessly for the extra mile.


What mess are you talking about? The existence of LLMs? I think it's pretty neat that I can now get answers to questions I have.

This is something I couldn't have done before, because people very often don't have the patience to answer questions. Even Googling ended up in loops of "just use Google", or "closed: this is a duplicate of X" where X doesn't actually answer the question, or links that were dead.

Are there downsides to this? Sure, but imo AI is useful.


It's just repackaged Google results masquerading as an 'answer.' PageRank pulled the ten most relevant links for a query; the LLM pulls the tokens most relevant to the prompt.

Just prompt it.


1. LLMs can translate text far better than any previous machine translation system. They can even do so for relatively small languages that typically had poor translation support. We all remember how funny text would get when you did English -> Japanese -> English. With LLMs you can do that (and even use a different LLM for the second step) and the texts remain very close.

2. Audio-input-capable LLMs can transcribe audio far better than any previous system I've used. They understand my speech without problems. YouTube's old closed-captioning system wasn't anywhere close to as good, and Microsoft's was unusable for me. LLMs have no such problems (makes me wonder if my speech patterns are in the training data, since I've made a lot of YouTube videos, and that's why they work so well for me).

3. You can feed LLMs local files (and run the LLM locally). Even if it is "just" PageRank, it's local PageRank now.

4. I can ask an LLM questions and then clarify what I wanted in natural language. You can't really refine a Google search in such a way. Trying to explain a Google search with more details usually doesn't help.

5. Iye mkx kcu kx VVW dy nomszrob dohd. Qyyqvo nyocx'd ny drkd pyb iye. - Google won't tell you what this means without you knowing what it is (a brute-force sketch follows below).

LLMs aren't magic, but I think they can do a whole bunch of things we couldn't really do before. Or at least we couldn't have a machine do those things well.
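
For the curious: point 5 above is a plain Caesar shift, and a dozen lines of ordinary Python (my own illustrative sketch, nothing the GP posted) will brute-force it:

    # Try all 26 Caesar shifts of the ciphertext and print each candidate.
    def shift(text: str, n: int) -> str:
        out = []
        for ch in text:
            if ch.isalpha():
                base = ord('A') if ch.isupper() else ord('a')
                out.append(chr((ord(ch) - base + n) % 26 + base))
            else:
                out.append(ch)
        return ''.join(out)

    ciphertext = "Iye mkx kcu kx VVW dy nomszrob dohd. Qyyqvo nyocx'd ny drkd pyb iye."
    for n in range(26):
        print(f"shift {n:2d}: {shift(ciphertext, n)}")
    # Shift 16 comes out as readable English. The difference is that an LLM
    # spots and reverses the cipher without being told one is there.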


I'd argue that putting everything in terms of black and white is a bigger issue than acknowledging nuance.


Generalizing with universal qualifiers like "everything" and "all" is exactly the kind of black/white divide you're arguing against. What happened to your nuanced reality within a single sentence? Not everything is black and white, but some situations are.


The person he's replying to argued against putting things on a spectrum. Does that not imply painting everything in black and white? Thus his response seems perfectly sensible to me.


He argued against putting things on a spectrum in the many instances where that would be wrong, including the case in question. What's your argument against that idea? LLM'ed too much lately?


He argued against it, and the response presented a counterargument. Both were based around social costs and used the same wording (i.e. "everything").

You made a specious dismissal. Now you're making personal attacks. Perhaps it's actually you who is having difficulty reasoning properly here?


I miss the www where the .html was written in vim or notepad.


It still can be. Do it. Go make your website in M$ FrontPage, for all I care


Shameless plug: My music homepage follows the HTML 2.0 spec and is written by hand

https://sampleoffline.com/


heck yeah B)


Just did that for a test frontend for a module I needed to build (not my primary job, so I don't know anything about UI, but running in browsers was a requirement): basic HTML with the bare minimum of JS, all plain DOM. Colleagues were very surprised. And yes, vim is still the go-to editor, and will be for a long time now that all the "IDEs" are pushing "AI" slop everywhere.


ahh yes, fresh off reading "HTML for Dummies" I made my first tripod.com site


For me it was making a petpage for my neopets using https://lissaexplains.com/

It's still up in all its glory.


This is great! The name reference also made me smile.


Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article. Authors spend their blood, sweat, and tears writing, and then OpenAI comes to hoover it up without a care in the world about license, copyright, or what constitutes fair use. But don't you dare scrape their slop.


> Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article.

Exactly. I think the unfairness could be mitigated if any model trained on public information, or on data generated by such a model, or with either of those anywhere in its ancestry, had to be made public.

Then we wouldn't have to hit (for example) Anthropic's service; we could download and use the models as we see fit, without Anthropic whining that users are using too much capacity.


[flagged]


The library's archive is not a service provided by the newspaper


So? If the newspaper's website is willing to serve the documents, what's the problem?

The point is, if you're pleading with others to respect ""intellectual property"" then you're a worm serving corporate interests against your own.


I may be a worm, but at least I respect that others might have a different take on how best to make creative work an attainable way of life, since before copyright law it was basically "have a wealthy patron who steered, if not outright commissioned, what you would produce"


> I bet people being fucking DDoSed by AI bots disagree

Are you sure it's a DDoS and not just a DoS?


Yes, it is. The worst offenders hammer us (and others) with thousands upon thousands of requests, and each request comes from a unique IP address, making all per-IP limits useless.

We implemented an anti-bot challenge and it helped for a while. Then our server collapsed again recently. The perf command showed that the actual TLS handshakes inside nginx were using over 50% of our server's CPU, starving other stuff on the machine.

It's a DDoS.
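
For anyone wondering why per-IP limits are useless here, a minimal token-bucket sketch (a hypothetical illustration with assumed numbers, not our actual setup) makes it obvious:

    import time
    from collections import defaultdict

    RATE = 1.0    # tokens refilled per second, per IP (assumed)
    BURST = 10.0  # bucket capacity (assumed)

    # One bucket per client IP, created full on first sight.
    buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(ip: str) -> bool:
        # Classic token bucket: refill based on elapsed time, then spend one.
        b = buckets[ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False

    # A crawler that never reuses an IP hits a brand-new, full bucket on every
    # request, so allow() returns True every time and the limiter never fires.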


You should see Cloudflare's control panel for AI bot blocking. There are dozens of different AI bots you can choose to block, and that doesn't even count the different ASNs they might use. So in this case I'd say that a DDoS is a decent description. It's not as bad as every home router on the eastern seaboard or something, but it's pretty bad.


When every AI company does it from multiple data centers... yes it's distributed.


Uncoordinated DDoS, when multiple search and AI companies are hammering your server.


> Are you sure it's a DDoS and not just a DoS?

I think these days it’s ‘DAIS’, as in your site just DAIS - from Distributed/Damned AI Scraping


Off topic, but why is a DoS considered something that must be acted on, often by shutting down the service altogether? That results in the same denial of service, just caused by the operator rather than by congestion. Actually it's worse, because now requests will never be answered at all rather than answered after some delay. Why isn't the default to just do nothing?


It keeps the other projects hosted on the same server or network online. Blackhole routes are pushed upstream to the really big networks and they push them to their edge routers, so traffic to the affected IPs is dropped near the sender's ISP and doesn't cause network congestion.

DDoSers who really want to cause damage now target random IPs in the same network as their actual target. That way, it can't be blackholed without blackholing the entire hosting provider.


> Why isn't the default to just do nothing?

Because ingress and compute costs often increase with every request, to the point where AI bot requests rack up bills hundreds or thousands of dollars higher than the hobbyist operator was expecting to spend.
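
A back-of-envelope sketch (every number below is an illustrative assumption, not anyone's actual bill):

    requests = 5_000_000         # crawler hits in a month (assumed)
    page_mb = 1.5                # average response size in MB (assumed)
    egress_per_gb = 0.09         # USD per GB egress, a typical cloud rate
    compute_per_req = 0.0000004  # USD of compute per request (assumed)

    egress = requests * page_mb / 1024 * egress_per_gb
    compute = requests * compute_per_req
    print(f"egress ${egress:,.0f} + compute ${compute:,.2f} = ${egress + compute:,.2f}")
    # ~ $659 of egress alone -- painful if you budgeted ten dollars for traffic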


I think some people use hosting that is paid per request/load, so having crawlers make unwanted requests costs them money.


> Also the fucking ignorance assuming it's "static content" and not something needing code running

Wild, eh.

If it's not AI, it now gets labelled "static content" and "near-zero marginal cost" by default.


What's a database after all.


All this reactionary outrage in the comments is funny. And lame.

Yes, for the vast majority of the internet, serving traffic is near zero marginal cost. Not for LLMs though – those requests are orders of magnitude more expensive.
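
Rough orders of magnitude, with every number an assumption of mine (real costs vary wildly), show the gap:

    static_bytes = 100 * 1024  # one cached 100 KB page (assumed)
    egress_per_gb = 0.09       # USD per GB egress (typical cloud rate)
    static_cost = static_bytes / (1024 ** 3) * egress_per_gb

    gpu_per_hour = 2.00   # USD to rent one inference GPU (assumed)
    tokens_per_sec = 50   # generation throughput (assumed)
    answer_tokens = 800   # one LLM answer (assumed)
    llm_cost = answer_tokens / tokens_per_sec / 3600 * gpu_per_hour

    print(f"static ~${static_cost:.7f}/req, LLM ~${llm_cost:.4f}/req, "
          f"ratio ~{llm_cost / static_cost:,.0f}x")
    # Around a thousand-fold difference: "orders of magnitude" in practice.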

This isn't controversial at all, it's a well understood fact, outside of this irrationally angry thread at least. I don't know, maybe you don't understand the economic term "marginal cost", thus not understanding the limited scope of my statement.

If such DDoSes as you mention were common, such a scraping strategy would not have worked for the scraper at all. But no, they're rare edge cases, arising from a combination of shoddy scrapers and shoddy website implementations, including the lack of even basic throttling for expensive-to-serve resources.

The vast majority of websites handle AI traffic fine though, either because they don't have expensive-to-serve resources, or because they properly protect such resources from abuse.

If you're an edge case who is harmed by overly aggressive scrapers, take countermeasures. Everyone with that problem should, that's neither new nor controversial.


"such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all"

They are common. The strategy works for the LLM company, but not for the website owner or the users who can't use the site during an attack.

The majority of sites are not handling AI traffic fine. Getting DDoSed even part of the time is not acceptable. Countermeasures like blocking huge IP ranges can help, but they also lock out legitimate users.


> They are common

Any actual evidence of the alleged scope of this problem, or just anecdotes from devs who are mad at AI, blown out of proportion?


I love AI, so it can't be that. And these aren't devs, they're website owners. Yes, ask an AI for the stats.


It's not a cost for me to scrape an LLM.

It is a cost for me when an LLM scrapes me.

Why should I care about the costs they have when they don't care about the costs I have?


The extent of the utilization is new.

The number of bots that try to hide who they are, and don't bother to even check robots.txt is new.


One euro is marginal for me; for someone else it's their daily meal.


"They are rare edge cases" are we on the same internet?





