Document Parser Python Packages

docling

Get your documents ready for gen AI

17.6M 63K 4K

unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

5.3M 15K 1K

docling-slim

Get your documents ready for gen AI

4.3M 63K 4K

graphlit-client

Python client library for Graphlit Platform

17K 19 3

autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

8K 5K 406

rhwp-python

PyO3 Python bindings for rhwp — parser and renderer for HWP/HWPX documents (Korean Hancom word processor format)

7K 10 4

deepdoctection

A Repo For Document AI

6K 3K 194

openparse

Improved file parsing for LLMs

6K 3K 143

all2md

Convert PDF, Word, PowerPoint, HTML, email & 40+ formats to clean Markdown — and back. Built for LLMs, RAG & Python pipelines, with a built-in MCP server.

5K 12 1

dd-core

A Repo For Document AI

3K 3K 194

dd-datasets

A Repo For Document AI

3K 3K 194

docstrange

Extract and convert PDF, Word, PowerPoint, Excel, images, and URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.

2K 1K 135

vision-parse

Parse PDFs into markdown using Vision LLMs

1K 481 69

olgadoc

Four formats. One engine. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Rust core with strictly-typed Python bindings.

1K 9 1

semantic-ai

An open-source framework for Retrieval-Augmented System (RAG) that uses semantic search to retrieve the expected results and generates human-readable conversational responses with the help of an LLM (Large Language Model).

534 22 1

churro-ocr

OCR and page detection for historical documents

481 49 4

automated-document-parser

A powerful and automated document parser built with LangChain for intelligent document processing. Automatically detects file types and uses appropriate loaders for PDF, DOCX, CSV, JSON, HTML, and more.

479 1 0

llamarker

Your ultimate tool for effortlessly converting and parsing documents into clean, well-structured Markdown—fast, reliable, and 100% local! 💻✨

342 2 0

unstructured-cpu

A library that prepares raw documents for downstream ML tasks.

320 15K 1K

docling-google-ocr

SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.

299 63K 4K

anyparser-crewai

Anyparser CrewAI Integration

239 2 0

df-extract

DecisionFacts Extraction Library extracts content from PDF, PPTX, DOCX, PNG, and JPG files and converts it to structured JSON data.

207 14 0

marie-ai

Python library to integrate AI-powered features into your applications

172 93 11

docling-enhanced

SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.

168 63K 4K