PyRank

Document Parser Python Packages

Python packages with the GitHub topic document-parser. Sorted by relevance, with stars and monthly downloads.
docling-project
docling

Get your documents ready for gen AI

7.2M 60K 4K
Unstructured-IO
unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

5.4M 15K 1K
docling-project
docling-slim

Get your documents ready for gen AI

1.1M 60K 4K
graphlit
graphlit-client

Python client library for Graphlit Platform

25K 20 3
DanMeon
rhwp-python

PyO3 Python bindings for rhwp — parser and renderer for HWP/HWPX documents (Korean Hancom word processor format)

9K 5 1
deepdoctection
deepdoctection

A Repo For Document AI

8K 3K 191
Marker-Inc-Korea
autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

8K 5K 395
Filimoa
openparse

Improved file parsing for LLMs

5K 3K 140
deepdoctection
dd-core

A Repo For Document AI

4K 3K 191
Unstructured-IO
unstructured-cpu

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

3K 15K 1K
deepdoctection
dd-datasets

A Repo For Document AI

3K 3K 191
iamarunbrahma
vision-parse

Parse PDFs into markdown using Vision LLMs

2K 475 66
nanonets
docstrange

Extract and convert data from any document (images, PDFs, Word documents, PowerPoint files, or URLs) into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

2K 1K 132
Hugues-DTANKOUO
olgadoc

Four formats. One engine. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Rust core with strictly-typed Python bindings.

1K 8 0
decisionfacts
semantic-ai

Semantic AI RAG System

472 22 1
stanford-oval
churro-ocr

CHURRO is an OCR toolkit for historical document transcription, built to make handwritten and printed sources readable at high accuracy and lower cost.

464 43 4
RevanKumarD
llamarker

Your ultimate tool for effortlessly converting and parsing documents into clean, well-structured Markdown—fast, reliable, and 100% local! 💻✨

413 2 0
Pulkit12dhingra
automated-document-parser

A powerful and automated document parser built with LangChain for intelligent document processing

376 1 0
DS4SD
docling-google-ocr

Get your documents ready for gen AI

232 60K 4K
anyparser
anyparser-crewai

Anyparser CrewAI Integration

209 2 0
decisionfacts
df-extract

DecisionFacts Extraction Library extracts content from PDF, PPTX, DOCX, PNG, and JPG files and converts it to structured JSON data.

156 14 0
marieai
marie-ai

Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pipelines (GenAI, LLM, VLLM) into your applications, supporting various tasks such as document cleanup, optical character recognition (OCR), classification, splitting, named entity recognition, and form processing.

126 89 11
docling-project
docling-enhanced

SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.

95 60K 4K
DS4SD
extended-docling

Get your documents ready for gen AI

88 60K 4K
Data from PyPI, GitHub, ClickHouse, and BigQuery