PyRank

PDF to Text Python Packages

Python packages tagged with the GitHub topic pdf-to-text, sorted by relevance and shown with stars and monthly downloads.
docling-project
docling

Get your documents ready for gen AI

7.2M 60K 4K
Unstructured-IO
unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

5.4M 15K 1K
docling-project
docling-slim

Get your documents ready for gen AI

1.2M 60K 4K
yfedoseev
pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

132K 756 82
yfedoseev
pdf-oxide-fips

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

4K 756 82
Unstructured-IO
unstructured-cpu

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

3K 15K 1K
Hugues-DTANKOUO
olgadoc

Four formats. One engine. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Rust core with strictly-typed Python bindings.

1K 8 0
shoryasethia
markdrop

A Python package for converting PDFs to Markdown while extracting images and tables, and for generating text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

674 204 18
stanford-oval
churro-ocr

CHURRO is an OCR toolkit for historical document transcription, built to make handwritten and printed sources readable at high accuracy and lower cost.

459 43 4
asiff00
bangla-pdf-ocr

A package to extract Bengali text from PDFs using OCR

327 21 3
baughmann
tikara

The metadata and text content extractor for almost every file type.

278 9 0
DS4SD
docling-google-ocr

Get your documents ready for gen AI

230 60K 4K
docling-project
docling-enhanced

SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.

97 60K 4K
zevio
pcu-io

IO management for PCU project

94 0 0
DS4SD
extended-docling

Get your documents ready for gen AI

87 60K 4K
zevio
pcu-pdf

PDF parser component (Apache Tika) for PCU project

84 1 0
docling-project
mseep-docling

Get your documents ready for gen AI

74 60K 4K
Data from PyPI, GitHub, ClickHouse, and BigQuery