PyRank

Content Extraction Python Packages

Python packages with the GitHub topic content-extraction. Sorted by relevance, with stars and monthly downloads.
Khamel83
argus-search

Retrieval platform for AI agents — 14 search providers, 12-step content extraction, dead URL recovery, site capture, and local corpus storage. One API.

2K 2 2
pankaj28843
article-extractor

Pure-Python article extraction library and HTTP API - Extract clean content from web pages as Markdown or HTML

2K 0 0
currentsapi
extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

1K 297 26
Vincentwei1021
llamaindex-agent-toolbox

LlamaIndex integration for Agent Toolbox API — 13 tools for AI agents: search, extract, screenshot, weather, finance, news, translate, geoip, whois, dns, pdf-extract, and qr code generation

450 3 1
herrkaefer
anything2md

Convert documents to Markdown using Cloudflare Workers AI toMarkdown.

401 1 0
tiramisu-sh
lurkers

Convenient API + CLI to fetch web content (HTML, YouTube, Twitter, RSS) for agents

334 0 0
baughmann
tikara

The metadata and text content extractor for almost every file type.

278 9 0
praveenc
fetchv2-mcp-server

A robust MCP server for fetching and extracting web content using Trafilatura

268 2 0
Vincentwei1021
langchain-agent-toolbox

LangChain integration for Agent Toolbox API — 13 tools for AI agents: search, extract, screenshot, weather, finance, news, translate, geoip, whois, dns, pdf-extract, and qr code generation

206 3 1
schatten
jsuite

Tools for parsing and manipulating JATS XML documents.

145 3 2
JakeLiuMe
webpeel

Fast web fetcher for AI agents — smart extraction, stealth mode, structured data

139 6 6
Prakashmaheshwaran
docscraperforai

A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.

101 0 0
midstreeeam
peduncle

Content extraction from HTML

84 0 0
Data from PyPI, GitHub, ClickHouse, and BigQuery