Document Processing Python Packages

llama-cloud

Python SDK for OCR and document parsing in the cloud with LlamaParse

30.1M 45 11

liteparse

A fast, helpful, and open-source document parser

274K 11K 760

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

146K 875 99

pyrhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

44K 101 14

docling-ocr-onnxtr

OnnxTR OCR plugin for Docling

36K 21 1

pdf-mcp

An MCP server that lets Claude Code and other AI agents work through large PDFs without overflowing their context — search by meaning or keyword, read only the pages that matter, and cleanly pull out tables, images, and scanned text, even from multi-column and Japanese layouts.

11K 73 7

openocr-python

OpenOCR: an open-source toolkit for general OCR research and applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and document parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

8K 1K 132

pdf-oxide-fips

8K 875 99

office-oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

5K 55 8

asset-aware-mcp

Asset-Aware MCP Server: lets AI agents precisely access tables, figures, and sections from PDFs, plus .docx round-trip editing (DFM). Provides 46 tools and 13 resources, segmentation export, layout overlay, OCR preprocessing, and a knowledge graph (LightRAG).

4K 0 2

bargain

Low-Cost LLM-Powered Data Processing with Theoretical Guarantees

3K 41 9

hwpkit

Read, fill, and edit Korean HWP (Hancom Office) documents in Python. Extract text for LLM / RAG pipelines, fill government & university forms programmatically, and rewrite the binary without corrupting it.

3K 1 0

pdf-email-optimizer

Claude & Codex agent skill and CLI that shrinks PDFs to email-safe sizes while preserving visual quality — photo brochures, design exports, scans, and reports.

3K 1 0

qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

3K 44 26

dm2xcod

Extract clean, structured Markdown from DOCX for LLM and RAG contexts.

3K 8 1

deepseek-ocr-cli

CLI tool for OCR using the DeepSeek-OCR model via Ollama. Runs locally with zero cloud dependencies.

3K 12 2

docling-graph

Transform unstructured documents into validated, rich and queryable knowledge graphs.

2K 169 25

appmod-catalog-blueprints

Library of composable well-architected CDK solution blueprints for real-world use cases — AI agents, document processing, serverless applications and more. Deploy or customize on AWS in minutes.

2K 15 1