PyRank

Document Processing Python Packages

Python packages with the GitHub topic document-processing. Sorted by relevance, with stars and monthly downloads.
run-llama
llama-cloud

Python SDK for OCR and document parsing in the cloud with LlamaParse

21.3M 29 7
yfedoseev
pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

127K 756 82
awslabs
pyrhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

44K 103 14
run-llama
liteparse

A fast, helpful, and open-source document parser

37K 5K 341
felixdittrich92
docling-ocr-onnxtr

OnnxTR OCR plugin for Docling

33K 20 0
Luizhcrs
template-engine-ia

Document normalization engine — learn a template from examples and convert any document automatically via LLM

12K 1 0
u9401066
asset-aware-mcp

Asset-Aware MCP Server — AI Agent precisely accesses tables, figures, sections from PDFs + .docx round-trip editing (DFM) with 46 tools / 13 resources, segmentation export, layout overlay, OCR preprocessing, knowledge graph (LightRAG)

11K 0 1
jztan
pdf-mcp

MCP server that lets Claude Code and other AI agents read large PDFs without hitting context limits. Chunked reading, hybrid search, OCR, table and image extraction, SQLite cache.

7K 35 5
Topdu
openocr-python

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications, integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

6K 1K 127
yfedoseev
pdf-oxide-fips

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

4K 756 82
docling-project
docling-graph

Transform unstructured documents into validated, rich and queryable knowledge graphs.

3K 149 23
martin-papy
qdrant-loader-core

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

3K 41 24
MantisAI
sieves

Plug-and-play document AI with zero-shot models.

3K 125 8
yuvaraj3855
preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4
martin-papy
qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

2K 41 24
martin-papy
qdrant-loader-mcp-server

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

2K 41 24
cdklabs
appmod-catalog-blueprints

Library of composable well-architected CDK solution blueprints for real-world use cases — AI agents, document processing, serverless applications and more. Deploy or customize on AWS in minutes.

2K 14 1
yfedoseev
office-oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

2K 14 1
Mellow-Artificial-Intelligence
openextract

Extract structured data from documents, images, audio, and video using LLMs

2K 16 2
KimSeogyu
dm2xcod

Extract clean, structured Markdown from DOCX for LLM and RAG contexts.

1K 5 1
axzml
pdfmark-ai

Convert PDF files to high-quality Markdown using LLM vision models

1K 1 0
stevereiner
flexible-graphrag-mcp

Python, LlamaIndex, LangChain, Docker Compose: 15 Property Graph, 4 RDF, 10 Vector, OpenSearch, Elasticsearch, Alfresco DBs. 13 data sources (9 auto-sync), KG auto-building, Ontologies, LLMs, Docling or LlamaParse doc processing, GraphRAG, RAG only, Hybrid Search, AI Chat. TypeScript React, Vue, Angular frontends, FastAPI REST backend, MCP Server.

1K 127 28
stevereiner
flexible-graphrag

Python, LlamaIndex, LangChain, Docker Compose: 15 Property Graph, 4 RDF, 10 Vector, OpenSearch, Elasticsearch, Alfresco DBs. 13 data sources (9 auto-sync), KG auto-building, Ontologies, LLMs, Docling or LlamaParse doc processing, GraphRAG, RAG only, Hybrid Search, AI Chat. TypeScript React, Vue, Angular frontends, FastAPI REST backend, MCP Server.

1K 127 28
r-uben
deepseek-ocr-cli

CLI tool for OCR using DeepSeek-OCR model via Ollama. Local processing with zero cloud dependencies.

971 12 2
Data from PyPI, GitHub, ClickHouse, and BigQuery