PyRank

Data Extraction Python Packages

Python packages tagged with the GitHub topic data-extraction, sorted by relevance and shown with their GitHub stars and monthly downloads.
firecrawl
firecrawl-py

🔥 Search, scrape, and clean the web for AI agents.

6.7M 121K 7K
vi3k6i5
flashtext

Extract keywords from sentences or replace keywords in sentences.

2.3M 6K 597
firecrawl
firecrawl

🔥 Search, scrape, and clean the web for AI agents.

997K 121K 7K
D4Vinci
scrapling

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

893K 50K 5K
thinh-vu
vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

437K 1K 280
scrapfly
scrapfly-sdk

Official Python SDK for the Scrapfly platform: web scraping, screenshots, AI extraction, crawling, and a remote anti-bot browser. Integrates with Scrapy, LlamaIndex, and LangChain.

249K 55 15
yfedoseev
pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

132K 756 82
hhursev
recipe-scrapers

Python package for scraping recipe data

91K 2K 645
a-maliarov
amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

78K 490 91
jpjacobpadilla
stealth-requests

Undetected web-scraping & seamless HTML parsing in Python!

43K 470 48
linw1995
jsonpath-extractor

A query expression for extracting data from JSON.

15K 41 4
shcherbak-ai
contextgem

ContextGem: Effortless LLM extraction from documents

13K 2K 156
html-extract
hext

Domain-specific language for extracting structured data from HTML documents

7K 55 3
AIMLPM
markcrawl

Fast Python web crawler for RAG and AI ingestion. Extracts clean Markdown from any site for LLMs and vector stores.

6K 2 0
LeonTing1010
taprun

Local-first browser automation MCP for Claude Code / Cursor. taprun compiles once with AI, your authenticated Chrome replays forever at zero LLM tokens. Credentials never leave your machine.

6K 5 1
us
crw

Fast, lightweight Firecrawl alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/v1/scrape, /v1/crawl, /v1/search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.

6K 89 5
nppoly
cyac

High-performance Trie and Aho-Corasick automaton (AC automaton) keyword match & replace tool for Python. Correct case-insensitive implementation!

6K 94 15
thinh-vu
vnstock3

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

6K 1K 280
aborruso
scrape-cli

Extract HTML elements from the command line using CSS selectors or XPath. Pipe-friendly Python CLI.

5K 26 1
ironmussa
optimuspyspark

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

4K 2K 232
yfedoseev
pdf-oxide-fips

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

4K 756 82
gambolputty
wiktionary-de-parser

Extract data from German Wiktionary XML files.

3K 26 9
StabRise
scaledp

ScaleDP is an Open-Source extension of Apache Spark for Document Processing

3K 18 1
StoneSteel27
automatiq

A tool that watches you browse, then writes HTTP-based automation scripts

2K 49 5
Data from PyPI, GitHub, ClickHouse, and BigQuery