Extract Data Python Packages

pymupdf

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

100.9M 10K 750

pymupdfb

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

5.3M 10K 750

mineru

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

365K 73K 6K

meltano

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

300K 3K 250

magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

72K 73K 6K

html-table-extractor

Extract data from HTML tables.

21K 88 22
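
For readers unfamiliar with the technique, here is a minimal sketch of HTML table extraction using only the Python standard library. It is not the html-table-extractor API; the class `TableRows` and the function `extract_rows` are hypothetical names chosen for illustration, and the sketch assumes simple, non-nested tables without `rowspan` or `colspan`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left unclosed


def extract_rows(html):
    """Return the table contents of an HTML string as a list of rows."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling `extract_rows` on a string containing a table returns a list of lists of strings, which can then be passed to `csv.writer` or a DataFrame constructor.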

lyrics-extractor

Get the lyrics for any song in under 2 seconds by passing in the song name (spelled correctly or misspelled).

5K 60 18

tap-stackexchange

Singer tap for the StackExchange API

2K 3 1

extract-zip

Extracts all files from a zip archive, including files that are themselves zip archives, by running a single script.

2K 0 0
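
The idea of unpacking zip files that themselves contain zip files can be sketched with the standard-library `zipfile` module. This is an illustration of the technique, not the extract-zip package's own code; `extract_nested` is a hypothetical name, and the sketch assumes inner archives are small enough to hold in memory.

```python
import io
import zipfile
from pathlib import Path, PurePosixPath


def extract_nested(source, dest):
    """Extract a zip archive into dest, unpacking inner .zip members too.

    source may be a file path or a binary file-like object.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith(".zip"):
                # Inner archive: read it into memory and recurse into a
                # folder named after it (inner.zip -> inner/).
                # Drop ".." and root components so the folder stays under dest.
                parts = [p for p in PurePosixPath(info.filename).parts
                         if p not in ("..", "/")]
                target = dest.joinpath(*parts).with_suffix("")
                extract_nested(io.BytesIO(archive.read(info)), target)
            else:
                # ZipFile.extract sanitizes member paths itself.
                archive.extract(info, dest)
```

A call such as `extract_nested("bundle.zip", "out")` (both names are placeholders) would leave regular files under `out/` and the contents of each inner archive in a sibling folder named after that archive.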

mineru-next-dev

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

2K 73K 6K

doctra

📄🔍 Parse, extract, and analyze documents with ease 📄🔍

2K 211 33

mineru-selfhosted-mcp

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

1K 73K 6K

tap-dbt

Singer tap for dbt, built with the Singer SDK.

912 12 8

pullcite

Evidence-backed structured extraction. Pull data from documents with proof of where each value came from.

650 1 0

bluebird

Unofficial Python client for Twitter

414 44 14

wxpath

wxpath - declarative web crawling with XPath; a Web Query Language (WQL)

361 112 6

googleplaystorescrape

GooglePlayStoreScrape Package for Scraping Play Store Reviews and App Information

324 4 0

tubeframes

A Python package for retrieving YouTube data, including video statistics, captions, and channel information. TubeFrames outputs results in a user-friendly pandas DataFrame format, making it ideal for data analysis workflows — especially in Jupyter Notebooks.

253 3 0