PyRank

Text Extraction Python Packages

Python packages with the GitHub topic text-extraction. Sorted by relevance, with stars and monthly downloads.
adbar
trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

8.9M 6K 371
miso-belica
justext

Heuristic-based boilerplate removal tool

6.6M 819 89
cdown
srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

571K 531 53
kreuzberg-dev
html-to-markdown

High-performance, CommonMark-compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.

494K 710 57
chrismattmann
tika

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

417K 2K 250
kreuzberg-dev
kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

174K 8K 488
jmriebold
boilerpy3

Python port of Boilerpipe library

166K 96 17
miso-belica
sumy

Module for automatic summarization of text documents and HTML pages.

147K 4K 545
yfedoseev
pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

127K 756 82
bookieio
breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)

95K 205 26
airmang
python-hwpx

Pure Python HWPX automation: read, edit, generate, and validate documents without Hancom Office.

38K 74 30
run-llama
liteparse

A fast, helpful, and open-source document parser

37K 5K 341
iscc
mobi

Python-based software to unpack kindlegen-generated ebooks

31K 77 11
harubi
bolivar

High-performance PDF table extraction library. Bindings for Python and JVM.

12K 1 0
qeeqbox
galeodes

Browsers options

9K 0 0
flairNLP
fundus

A very simple news crawler with a funny name

8K 455 109
yfedoseev
pdf-oxide-fips

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

4K 756 82
yuvaraj3855
preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4
iamarunbrahma
vision-parse

Parse PDFs into markdown using Vision LLMs

2K 475 66
amenezes
aiopytesseract

asyncio wrapper for Tesseract-OCR

2K 27 7
yfedoseev
office-oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

2K 14 1
OwenOrcan
yirabot

YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.

1K 17 0
kennipj
reap-pdf

Rust-first PDF text extraction with geometry-aware search and optional Python bindings

1K 3 1
meer-khan
pattex

Regex-based pattern extraction library for Python — emails, URLs, phones, IPs, and more.

1K 0 0
    • Data from PyPI, GitHub, ClickHouse, and BigQuery