
I put together a document text extraction server using Apache Tika (about 30 lines of code). The extracted text can then be vectorized for retrieval-augmented generation or used to build LLM training datasets.

Much credit to the tika-python project for making the Python bindings!
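
For anyone curious what a server like this can look like, here is a minimal sketch. It is not the code from the project itself. It assumes the tika-python package (`pip install tika`) with a Java runtime available, and it uses the standard library's `http.server` for the HTTP layer. The names `tika_parse`, `extract_text`, `ExtractHandler`, and `serve`, the port, and the JSON response shape are all made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def tika_parse(data: bytes) -> dict:
    """Hand raw bytes to Tika through tika-python (imported lazily; needs Java)."""
    from tika import parser  # assumption: installed with `pip install tika`
    return parser.from_buffer(data)


def extract_text(data: bytes, parse=tika_parse) -> dict:
    """Return stripped text plus metadata for one document."""
    parsed = parse(data)
    # Tika may report no content at all (e.g. an empty file), so guard against None.
    text = (parsed.get("content") or "").strip()
    return {"text": text, "metadata": parsed.get("metadata") or {}}


class ExtractHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint: POST the raw file bytes, get JSON back."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        result = extract_text(self.rfile.read(length))
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving extraction requests on the given port."""
    HTTPServer(("", port), ExtractHandler).serve_forever()
```

Calling `serve()` starts the server, and something like `curl --data-binary @report.pdf http://localhost:8000/` should return the extracted text as JSON. Keeping `extract_text` separate from the handler means the parsing step can be swapped out or tested without Java or a running server.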
