Text Extraction Python Packages

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

10.5M 6K 392

justext

Heuristic-based boilerplate removal tool

9M 818 90

srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

615K 533 53
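
To make "parsing, modifying, and composing" concrete, here is a minimal standard-library sketch of an SRT round trip. It does not use the srt package or reproduce its API. The names `Cue`, `parse_srt`, `shift`, and `compose_srt` are hypothetical, and the input is assumed to be well formed (index line, then a `start --> end` line, then the text).

```python
import re
from datetime import timedelta
from typing import NamedTuple


class Cue(NamedTuple):
    """One subtitle entry (hypothetical type, not from the srt package)."""
    index: int
    start: timedelta
    end: timedelta
    text: str


_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def parse_timestamp(value: str) -> timedelta:
    """Convert 'HH:MM:SS,mmm' into a timedelta."""
    match = _TIMESTAMP.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)


def format_timestamp(value: timedelta) -> str:
    """Convert a timedelta back into 'HH:MM:SS,mmm'."""
    total_ms = value // timedelta(milliseconds=1)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def parse_srt(data: str) -> list[Cue]:
    """Parse: split on blank lines, then read index, timing, and text."""
    cues = []
    normalized = data.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 2:
            continue  # skip empty or truncated blocks
        start, end = lines[1].split(" --> ")
        cues.append(Cue(int(lines[0]), parse_timestamp(start),
                        parse_timestamp(end), "\n".join(lines[2:])))
    return cues


def shift(cues: list[Cue], delta: timedelta) -> list[Cue]:
    """Modify: move every cue by the same offset."""
    return [c._replace(start=c.start + delta, end=c.end + delta) for c in cues]


def compose_srt(cues: list[Cue]) -> str:
    """Compose: write cues back out as SRT text."""
    blocks = [
        f"{c.index}\n{format_timestamp(c.start)} --> "
        f"{format_timestamp(c.end)}\n{c.text}\n"
        for c in cues
    ]
    return "\n".join(blocks)
```

A typical use would be `compose_srt(shift(parse_srt(text), timedelta(seconds=2)))` to delay every subtitle by two seconds. Real-world files often contain malformed indices, missing blank lines, or styling tags, which is the kind of input a dedicated library is meant to handle.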

html-to-markdown

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.

597K 793 61

tika

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

467K 2K 250

liteparse

A fast, helpful, and open-source document parser

274K 11K 760

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

191K 9K 507

boilerpy3

Python port of the Boilerpipe library

163K 96 17

sumy

Module for automatic summarization of text documents and HTML pages.

149K 4K 543

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

146K 875 99

breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

85K 205 27

python-hwpx

Pure Python HWPX automation: read, edit, generate, and validate documents without Hancom Office.

71K 89 32

mobi

Python-based software to unpack kindlegen-generated ebooks

33K 76 11

bolivar

High-performance PDF table extraction library. Bindings for Python and JVM.

11K 1 0

galeodes

Browsers options

11K 0 0

contextractor

Crawl any website and extract clean, boilerplate-free main-content text as Markdown, plain text, JSON, HTML, or raw original HTML — ready for LLMs, RAG, and vector databases. Built on rs-trafilatura + Crawlee/Playwright. Ships as a CLI, npm library, Python package, and Apify Actor.

10K 0 2

pdf-oxide-fips

8K 875 99

all2md

Convert PDF, Word, PowerPoint, HTML, email & 40+ formats to clean Markdown — and back. Built for LLMs, RAG & Python pipelines, with a built-in MCP server.

5K 12 1

office-oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

5K 55 8

fundus

A very simple news crawler with a funny name

4K 464 110

pdfmuse

Deterministic PDF/DOCX parser for RAG/LLMs — Rust core with byte-identical Python/Node/WASM bindings

4K 3 1

hwpkit

Read, fill, and edit Korean HWP (Hancom Office) documents in Python. Extract text for LLM / RAG pipelines, fill government & university forms programmatically, and rewrite the binary without corrupting it.

3K 1 0

legacy-doc

Pure-Python text extraction for classic Microsoft Word .doc files.

3K 0 0

pnlp

NLP pre-/post-processing tools.

2K 30 6