PyRank

Document Parsing Python Packages

Python packages with the GitHub topic document-parsing. Sorted by relevance, with stars and monthly downloads.
docling-project
docling

Get your documents ready for gen AI

7.2M 60K 4K
Unstructured-IO
unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

5.4M 15K 1K
PaddlePaddle
paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2.2M 78K 10K
docling-project
docling-slim

Get your documents ready for gen AI

1.1M 60K 4K
opendataloader-project
opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

134K 21K 2K
Topdu
openocr-python

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and document parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

6K 1K 127
NameetP
pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

3K 63 7
opendataloader-project
langchain-opendataloader-pdf

A LangChain integration for OpenDataLoader PDF

3K 33 4
Unstructured-IO
unstructured-cpu

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

3K 15K 1K
retospect
acatome-extract

PDF extraction pipeline for acatome — Marker/fitz, metadata, block chunking

2K 0 0
flamehaven01
flamehaven-filesearch

Self-hosted RAG search engine — 34 formats, BM25+hybrid search, multi-LLM (Gemini/OpenAI/Claude/Ollama), FastAPI + Docker, production-ready in 3 min

2K 101 13
nanonets
docstrange

Extract and convert data from any document (images, PDFs, Word documents, PowerPoint files, or URLs) into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

2K 1K 132
Hugues-DTANKOUO
olgadoc

Four formats. One engine. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Rust core with strictly-typed Python bindings.

1K 8 0
ENDEVSOLS
longparser

Privacy-first document intelligence engine — parse PDFs, DOCX, PPTX, XLSX & CSV into AI-ready chunks for RAG pipelines. Includes HITL review, 3-layer memory chat, and a production FastAPI server.

1K 26 2
PaddlePaddle
fadoudou2

Awesome OCR toolkits based on PaddlePaddle (8.6M ultra-lightweight pre-trained model; supports training and deployment on server, mobile, embedded, and IoT devices)

481 78K 10K
stanford-oval
churro-ocr

CHURRO is an OCR toolkit for historical document transcription, built to make handwritten and printed sources readable at high accuracy and lower cost.

464 43 4
PaddlePaddle
paddleocrwordleveldetection

Awesome OCR toolkits based on PaddlePaddle (8.6M ultra-lightweight pre-trained model; supports training and deployment on server, mobile, embedded, and IoT devices)

296 78K 10K
baughmann
tikara

The metadata and text content extractor for almost every file type.

279 9 0
alexvargashn
doc23

Powerful Python library to convert documents (PDF, DOCX, TXT) into structured JSON trees for legal, institutional, and NLP applications.

251 0 0
DS4SD
docling-google-ocr

Get your documents ready for gen AI

232 60K 4K
PaddlePaddle
langchain-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

231 78K 10K
PaddlePaddle
je-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

231 78K 10K
Kyros-Groupe-Ltd
pdfstructx

Intelligent PDF parser with font-aware structure detection, table extraction, and multi-column support

217 0 0
anyparser
anyparser-crewai

Anyparser CrewAI Integration

209 2 0
    • Data from PyPI, GitHub, ClickHouse, and BigQuery