One of the things that a lot of LLM scrapers are fetching are git repositories. ...

LoganDark · 2026-02-27T23:29:26 1772234966

No... Basically all git servers have to generate the file contents, diffs, etc. on demand because they don't store static pages for every possible combination of view parameters. Git repositories also typically don't store full copies of all versions of a file that have ever existed either; they're incremental. You could pre-render everything statically, but that could take up gigabytes or more for any repo of non-trivial size.

KolmogorovComp · 2026-02-27T23:40:58 1772235658

> Git repositories also typically don't store full copies of all versions of a file that have ever existed either; they're incremental

This is wrong. Git does store full copies.

meatmanek · 2026-02-28T00:21:37 1772238097

Git stores files as objects. Loose objects are stored as full copies; objects that have been packed into packfiles and deltified are stored as deltas instead. https://codewords.recurse.com/issues/three/unpacking-git-pac...
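To make the "full copies" half concrete, here is a rough Python sketch of how a loose blob object is built: a `blob <size>\0` header followed by the entire file content, hashed with SHA-1 (assuming the default object format) and zlib-compressed on disk. The function names are made up for illustration, and packfile deltification is not shown.

```python
import hashlib
import zlib


def blob_object(content: bytes) -> bytes:
    """Uncompressed loose-object bytes for a blob: header plus the full content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return header + content


def blob_id(content: bytes) -> str:
    """Object ID for a blob, assuming a SHA-1 repository (the default format)."""
    return hashlib.sha1(blob_object(content)).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """Bytes written for a loose object: the zlib-compressed object."""
    return zlib.compress(blob_object(content))


v1 = b"hello world\n"
v2 = b"hello world\nsecond line\n"

# Each version is its own object holding the whole file, not a diff against v1.
stored_v2 = zlib.decompress(loose_object_bytes(v2))
print(blob_id(v1))
print(stored_v2.split(b"\0", 1)[1] == v2)
```

Two versions of a file give two independent objects until `git gc` or `git repack` moves them into a packfile, where one may end up stored as a delta against the other.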

KolmogorovComp · 2026-02-28T10:37:52 1772275072

Thank you for the insights.

PaulDavisThe1st · 2026-02-28T15:39:25 1772293165

... which, in the context that is being discussed, is unusual.

neoromantique · 2026-02-27T23:38:58 1772235538

That's a pretty niche issue, but fairly easy to solve.

Statically prebuild the most commonly requested commits (the last XX) and heavily rate-limit deeper ones.
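A minimal sketch of that idea in Python, under some loud assumptions: `DeepCommitGate` and its parameters are hypothetical names, the set of prebuilt commits is supplied by the caller, and the limit is a single global token bucket rather than a per-client one. It is not tied to any particular git web frontend.

```python
import time
from typing import Callable, Set


class DeepCommitGate:
    """Serve prebuilt commits freely; meter all others with one global token bucket."""

    def __init__(self, prebuilt: Set[str], rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.prebuilt = prebuilt          # commits that already have static pages
        self.rate = rate_per_sec          # tokens added per second
        self.burst = float(burst)         # maximum tokens that can accumulate
        self.clock = clock                # injectable so the sketch can be tested
        self.tokens = float(burst)
        self.last = clock()

    def allow(self, commit: str) -> bool:
        if commit in self.prebuilt:
            return True  # static page, no on-demand rendering needed
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would reject, e.g. with HTTP 429


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
gate = DeepCommitGate({"c3", "c2"}, rate_per_sec=1.0, burst=2,
                      clock=lambda: fake_now[0])
results = [gate.allow("c3"), gate.allow("old1"),
           gate.allow("old2"), gate.allow("old3")]
fake_now[0] += 1.0  # one second later, one token has been refilled
results.append(gate.allow("old3"))
print(results)
```

A global bucket caps the total rendering cost of deep history regardless of how many clients ask, at the price of also throttling legitimate visitors who browse old commits.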

PaulDavisThe1st · 2026-02-28T15:40:29 1772293229

1. That doesn't appear to match the fetching patterns of the scrapers at all.

2. 1M independent IPs hitting random commits from across a 25-year history is not, in fact, "easy to solve". It is addressable, but not easy ...

3. Why should I have to do anything at all to deal with these scrapers? Why is the onus not on them to do the right thing?