Document Parsing Python Packages

docling

Get your documents ready for gen AI

17.6M 63K 4K

unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

5.3M 15K 1K

docling-slim

Get your documents ready for gen AI

4.3M 63K 4K

paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2.5M 85K 11K

opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

142K 26K 2K

openocr-python

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and document parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

8K 1K 132

langchain-opendataloader-pdf

A LangChain integration for OpenDataLoader PDF

4K 44 5

pdfmuse

Deterministic PDF/DOCX parser for RAG/LLMs — Rust core with byte-identical Python/Node/WASM bindings

4K 3 1

pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

4K 73 12

ethos-pdf

Open-source verifier for document citations, source evidence, RAG, and agents, with built-in document parsing.

2K 5 0

docstrange

Extract and convert PDF, Word, PowerPoint, Excel, images, and URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.

2K 1K 135

langchain-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2K 85K 11K

terbium-parse

A god-level algorithmic multi-file parser (PDF/PPTX/XLSX/CSV) that scores its own confidence and only reaches for AI when it is genuinely stuck.

2K 0 0

grits-metric

GriTS metric for table extraction

1K 2 0

acatome-extract

PDF extraction pipeline for acatome — Marker/fitz, metadata, block chunking

1K 0 0

olgadoc

Four formats. One engine. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Rust core with strictly-typed Python bindings.

1K 9 1

flamehaven-filesearch

Self-hosted RAG search engine — 34 formats, BM25+hybrid search, multi-LLM (Gemini/OpenAI/Claude/Ollama), FastAPI + Docker, production-ready in 3 min

915 106 14

fadoudou2

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

531 85K 11K

doctape

Chop large PDFs into page windows, convert each with docling, and reassemble to Markdown

505 1 0

churro-ocr

OCR and page detection for historical documents

481 49 4

longparser

Privacy-first document intelligence engine — parse PDFs, DOCX, PPTX, XLSX & CSV into AI-ready chunks for RAG pipelines. Includes HITL review, 3-layer memory chat, and a production FastAPI server.

442 29 2

opendataloader-pdf-frankin

A Python wrapper for the opendataloader-pdf Java CLI.

426 26K 2K

tikara

The metadata and text content extractor for almost every file type.

425 10 0

llama-index-readers-pdfmuse

Deterministic PDF/DOCX parser for RAG/LLMs — Rust core with byte-identical Python/Node/WASM bindings

386 3 1