Well, basically everything on the arXiv has the .tex available, but that doesn't seem to make the problem much better. The problem isn't getting raw access to the text. With some labor it's also possible to copy-paste from PDFs, or to examine a PDF's internals (PDF is its own typesetting format, like .tex). The problem is that this data is very difficult to parse.