PDF Extraction Python Packages

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

191K 9K 507

opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

142K 26K 2K

oxidize-pdf

Python bindings for oxidize-pdf — generate, parse, split, merge & manipulate PDFs with native Rust performance. No C deps, no Java, no subprocesses.

28K 3 1

pdf-mcp

An MCP server that lets Claude Code and other AI agents work through large PDFs without overflowing their context — search by meaning or keyword, read only the pages that matter, and cleanly pull out tables, images, and scanned text, even from multi-column and Japanese layouts.

11K 73 7

pytr

Use TradeRepublic in terminal and mass download all documents

11K 761 150

langchain-opendataloader-pdf

A LangChain integration for OpenDataLoader PDF

4K 44 5

pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

4K 73 12

azure-di-financial-haystack

Haystack components for structured KV extraction from financial PDFs via Azure Document Intelligence

2K 0 0

ethos-pdf

Open-source verifier for document citations, source evidence, RAG, and agents, with built-in document parsing.

2K 5 0

phd-deepread-workflow

A professional CLI workflow for PhD students to extract, analyze, and visualize academic papers into structured Markdown and Obsidian Canvas.

1K 53 4

ocrqueen

Official Python SDK for the OCRQueen document extraction API

851 0 0

xberg

785 9K 507

pagewise-pdf-extractor

Page-wise PDF to Markdown extraction with text extraction, OCR, LLM fallback, and progress metadata.

688 0 0

iterationlayer

Official Python SDK for the Iteration Layer API

544 0 0

llama-index-readers-azure-tax-forms

LlamaIndex reader for IRS tax form extraction via Azure Document Intelligence — supports Form 1040, W-2, Schedule C/E/K-1, 1065, 1120, 1120-S

524 0 0

opendataloader-pdf-frankin

A Python wrapper for the opendataloader-pdf Java CLI.

426 26K 2K

stmtforge

Open-source Python tool to parse credit card PDF statements from Indian banks (HDFC, ICICI, SBI, Axis + 5 more) into structured data. Offline, privacy-first, Streamlit dashboard. pip install stmtforge

336 3 0

pdfstructx

The PDF parser built for AI pipelines. Structured sections, tables, images, and metadata, not just raw text.

322 0 0

mseep-kreuzberg

259 9K 508

ragtable-extract

PDF table extraction for RAG and LLM — convert PDF tables to clean HTML. Fast, local, no GPU. Handles merged cells, line-wrapped text, no serialization.

205 1 0

gdoczai

Python SDK for GdoczAI — OCR, extract, and segment documents

172 8 2