Table Extraction Python Packages

pymupdf

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

100.9M 10K 750

pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

48.9M 11K 893

pymupdfb

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

5.3M 10K 750

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

191K 9K 507

img2table

img2table is a table identification and extraction Python library for PDFs and images, based on OpenCV image processing.

174K 877 120

pdf-mcp

An MCP server that lets Claude Code and other AI agents work through large PDFs without overflowing their context — search by meaning or keyword, read only the pages that matter, and cleanly pull out tables, images, and scanned text, even from multi-column and Japanese layouts.

11K 73 7

bolivar

High-performance PDF table extraction library. Bindings for Python and JVM.

11K 1 0

extracttable

Python library to extract tabular data from images and scanned PDFs

10K 285 35

tablers

A blazingly fast PDF table extraction library with a Python API, powered by Rust

3K 11 1

mdify-cli

MDify is a document-to-Markdown conversion library for extracting structured content from complex PDFs and document images, including tables, charts, and scanned documents.

3K 0 0

grits-metric

GriTS metric for table extraction

1K 2 0

gridgulp

Automatically detect and extract tables from Excel, CSV, and text files.

1K 13 2

xberg

785 9K 507

docext

On-premises information extraction from documents

645 2K 147

pdfplumber-aemc

Plumb a PDF for detailed information about each char, rectangle, and line.

360 11K 893

pdfstructx

The PDF parser built for AI pipelines. Structured sections, tables, images, and metadata, not just raw text.

322 0 0

pdftablr

A fork of Kyle Cronan's Python 2.5 pdftable library, now updated for Python 3

276 2 0

mseep-kreuzberg

259 9K 508

depdf

PDF table & paragraph extractor

234 11 0

ragtable-extract

PDF table extraction for RAG and LLM — convert PDF tables to clean HTML. Fast, local, no GPU. Handles merged cells, line-wrapped text, no serialization.

205 1 0

aqpymupdf

A high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

184 10K 750

tablecv

TableCV: Table extraction from images made easy.

177 11 2

quipucamayoc

dev repo for article

142 33 5