Data Extraction Python Packages

firecrawl-py

The API to search, scrape, and interact with the web at scale. 🔥

7.2M 144K 8K

flashtext

Extract keywords from sentences or replace keywords in sentences.

2.6M 6K 596

firecrawl

The API to search, scrape, and interact with the web at scale. 🔥

1.3M 144K 8K

scrapling

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

797K 68K 7K

openccu-data

Data-extraction pipeline and source-of-truth distribution for Homematic CCU configuration metadata — parses OCCU/RaspberryMatic TCL & JS into compact, typed JSON artifacts (easymodes, translations, profiles).

321K 0 0

apify

Apify SDK for Python. The official library for building Apify Actors: serverless cloud programs for web scraping, browser automation, data processing, and AI agents. Manages the Actor lifecycle, storages (datasets, key-value stores, request queues), events, proxies, and pay-per-event monetization. Built on top of the Apify API Client.

254K 172 25

scrapfly-sdk

Official Python SDK for the Scrapfly platform: web scraping, screenshots, AI extraction, crawling, and a remote anti-bot browser. Integrates with Scrapy, LlamaIndex, and LangChain.

191K 59 15

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

146K 875 99

socid-extractor

⛏️ The extraction engine behind Maigret: turn any profile URL into a structured OSINT record across 150+ sites

127K 1K 109

recipe-scrapers

Python package for scraping recipe data

92K 2K 661

vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

80K 1K 291

amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

61K 495 91

stealth-requests

Undetected web-scraping & seamless HTML parsing in Python!

46K 526 53

jsonpath-extractor

A query expression for extracting data from JSON.

19K 41 4

contextgem

ContextGem: Effortless LLM extraction from documents

13K 2K 158

crw

Fast, lightweight Firecrawl/Tavily alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/scrape, /crawl, /search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.

10K 263 22

pdf-oxide-fips

8K 875 99

office-oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

5K 55 8

vnstock3

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

5K 1K 291

hext

Domain-specific language for extracting structured data from HTML documents

5K 55 3

cyac

High-performance trie and Aho-Corasick automaton (AC automaton) keyword match and replace tool for Python. Correct case-insensitive implementation!

5K 93 15

wiktionary-de-parser

Extract data from German Wiktionary XML files.

4K 26 9

taprun

Local-first browser automation MCP for Claude Code / Cursor. taprun compiles once with AI, your authenticated Chrome replays forever at zero LLM tokens. Credentials never leave your machine.

4K 10 0

optimuspyspark

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

3K 2K 232