
One of the things that a lot of LLM scrapers are fetching is git repositories. They could just use git clone to fetch everything at once, but instead they fetch them commit by commit. That's about as static as you can get, and it is absolutely NOT a non-issue.



No... Basically all git servers have to generate the file contents, diffs etc. on-demand because they don't store static pages for every single possible combination of view parameters. Git repositories also typically don't store full copies of all versions of a file that have ever existed either; they're incremental. You could pre-render everything statically, but that could take up gigabytes or more for any repo of non-trivial size.

> Git repositories also typically don't store full copies of all versions of a file that have ever existed either; they're incremental

This is wrong. Git does store full copies.


git stores files as objects. Loose objects are stored as full (zlib-compressed) copies; once objects are packed into packfiles they may be deltified, in which case they're stored as deltas against other objects. https://codewords.recurse.com/issues/three/unpacking-git-pac...
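To make the "full copies" half of that concrete, here is a rough Python sketch of how a loose blob object is built, assuming a repository using the default SHA-1 object format. The function name `loose_blob` and the sample contents are made up for illustration; this is not git's own code.

```python
import hashlib
import zlib


def loose_blob(content: bytes) -> tuple[str, bytes]:
    """Build a loose blob the way git does (SHA-1 object format assumed).

    Returns the object id and the zlib-compressed bytes that would be
    written under .git/objects/<first 2 hex chars>/<remaining 38>.
    """
    # Header is "blob <size>\0", followed by the complete file content.
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


# Two versions of the same (hypothetical) file: each becomes its own
# object holding the whole content, not a diff against the other.
v1 = b"hello\n"
v2 = b"hello\nworld\n"
id1, stored1 = loose_blob(v1)
id2, stored2 = loose_blob(v2)

print(id1, len(stored1))
print(id2, len(stored2))
```

Deltas only enter the picture when those objects are later repacked, which is why a web frontend serving an old version usually has to reconstruct it on request.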

Thank you for the insights.

... which, in the context that is being discussed, is unusual.

that's a pretty niche issue, but fairly easy to solve.

Statically prebuild the most common commits (the last XX) and heavily rate-limit deeper ones


1. that doesn't appear to match the fetching patterns of the scrapers at all

2. 1M independent IPs hitting random commits from across a 25 year history is not, in fact, "easy to solve". It is addressable, but not easy ...

3. why should I have to do anything at all to deal with these scrapers? why is the onus not on them to do the right thing?



