PDF Parser Python Packages

pypdf

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

108.5M 10K 2K

pypdf2

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

29M 10K 2K

paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2.5M 85K 11K

mineru

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

365K 73K 6K

liteparse

A fast, helpful, and open-source document parser

274K 11K 760

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

146K 875 99

opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

142K 26K 2K

magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

72K 73K 6K

extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

31K 2K 96

oxidize-pdf

Python bindings for oxidize-pdf — generate, parse, split, merge & manipulate PDFs with native Rust performance. No C deps, no Java, no subprocesses.

28K 3 1

casparser

Parser for Consolidated Account Statements (CAS) generated from CAMS/Karvy/Kfintech

23K 209 86

pdf-oxide-fips

8K 875 99

langchain-opendataloader-pdf

A LangChain integration for OpenDataLoader PDF

4K 44 5

pdfmuse

Deterministic PDF/DOCX parser for RAG/LLMs — Rust core with byte-identical Python/Node/WASM bindings

4K 3 1

kritrim-smriti

A local-first, hybrid AI study assistant that refactors technical documentation into VARK learning tracks and automated Anki flashcards.

3K 2 1

pdfalyzer

Analyze PDFs with colors (and YARA)

3K 371 25

scipdf-parser

Python PDF parser for scientific publications: content and figures

2K 455 65

ethos-pdf

Open-source verifier for document citations, source evidence, RAG, and agents, with built-in document parsing.

2K 5 0

docstrange

Extract and Convert PDF, Word, PowerPoint, Excel, images, URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.

2K 1K 135

dedoc

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and metadata from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

2K 712 59

mineru-next-dev

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

2K 73K 6K

doctra

📄🔍 Parse, extract, and analyze documents with ease 📄🔍

2K 211 33

langchain-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2K 85K 11K

vision-parse

Parse PDFs into markdown using Vision LLMs

1K 481 69