A document text extraction server built on Apache Tika in about 30 lines of code. The extracted text can be vectorized for retrieval-augmented generation or used to create LLM training datasets.
Much credit to the tika-python project for making the Python bindings!
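
To give a feel for how such a server can fit together, here is a minimal sketch, not the project's actual code. It assumes `pip install tika` and a Java runtime (tika-python starts a local Tika server on first use). The names `tika_extract`, `make_handler`, and `make_server`, the default port 8000, and the `{"text": ...}` JSON response shape are all hypothetical choices made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def tika_extract(data: bytes) -> str:
    """Extract plain text from raw document bytes via tika-python."""
    # Imported lazily so this module loads even without tika installed.
    from tika import parser  # assumption: tika-python and Java are available

    # from_buffer returns a dict with "content" and "metadata" keys;
    # "content" can be None when Tika finds no text.
    parsed = parser.from_buffer(data)
    return (parsed.get("content") or "").strip()


def make_handler(extract=tika_extract):
    """Build a request handler that runs `extract` on each POST body."""

    class ExtractHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            try:
                payload, status = {"text": extract(body)}, 200
            except Exception as exc:  # report extraction failures as JSON
                payload, status = {"error": str(exc)}, 500
            out = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(out)))
            self.end_headers()
            self.wfile.write(out)

        def log_message(self, *args):
            pass  # keep the sketch quiet; remove to restore access logs

    return ExtractHandler


def make_server(host="127.0.0.1", port=8000, extract=tika_extract):
    """Create the HTTP server without starting it."""
    return ThreadingHTTPServer((host, port), make_handler(extract))
```

Calling `make_server().serve_forever()` starts the server, after which a document can be posted with something like `curl --data-binary @report.pdf http://127.0.0.1:8000/`. Passing the extractor in as a parameter keeps the HTTP layer independent of Tika, so it can be exercised with a stand-in function when Java is not available.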