PDF to Text Python Packages

docling

Get your documents ready for gen AI

17.6M 63K 4K

unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

5.3M 15K 1K

docling-slim

Get your documents ready for gen AI

4.3M 63K 4K

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

146K 875 99

pdf-oxide-fips

8K 875 99

olgadoc

Four formats. One engine. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Rust core with strictly-typed Python bindings.

1K 9 1

markdrop

A Python package for converting PDFs to Markdown while extracting images and tables and generating text descriptions of the extracted tables and images using several LLM clients, among other functionality. Markdrop is available on PyPI.

861 210 18

churro-ocr

CHURRO is an OCR toolkit for historical document transcription, built to make handwritten and printed sources readable at high accuracy and lower cost.

481 51 4

tikara

The metadata and text content extractor for almost every file type.

425 10 0

bangla-pdf-ocr

Bangla PDF to text converter that works on Windows, macOS, and Linux without any extra downloads or configurations.

364 21 3

unstructured-cpu

320 15K 1K

docling-google-ocr

SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.

299 63K 4K

docling-enhanced

SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.

168 63K 4K

pcu-pdf

PDF parser component (Apache Tika) for the PCU project

137 1 0

mseep-docling

Get your documents ready for gen AI

130 63K 4K

extended-docling

Get your documents ready for gen AI

128 63K 4K

pcu-io

IO management for the PCU project

121 0 0