Content Extraction Python Packages

web-researcher-mcp

An AI research assistant that searches the web and cites real sources honestly. Works with Claude, Cursor, and any MCP client.

30K 37 4

contextractor

Crawl any website and extract clean, boilerplate-free main-content text as Markdown, plain text, JSON, HTML, or raw original HTML — ready for LLMs, RAG, and vector databases. Built on rs-trafilatura + Crawlee/Playwright. Ships as a CLI, npm library, Python package, and Apify Actor.

10K 0 2

searchts

The missing layer between AI and the web. Open-source escalating web unlocker + read/search/transcribe/grab across web, GitHub, YouTube, Reddit, Twitter, LinkedIn, and RSS for AI agents.

3K 1 1

argus-search

Multi-provider web search broker for AI agents. 14 providers, budget-aware routing, content extraction — one API.

2K 3 1

article-extractor

Pure-Python article extraction library and HTTP API that extracts clean content from web pages as Markdown or HTML.

2K 0 0

netlens-mcp

MCP server for AI agents to read any web page in full — past the robots.txt and bot blocks that stop native fetch — as clean, ad-stripped Markdown. Plus web search.

911 0 0

pulpie

Pareto-optimal models for cleaning the web — fast, encoder-based main-content extraction from HTML.

892 14 0

extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package.

879 298 26

url2md4ai

Convert any URL into clean, token-efficient Markdown for LLMs. One thing, done well.

819 4 0

websearch-skill

Keyless, uv-native web search + read for AI agents: ddgs multi-engine search with de-correlated rank fusion, Trafilatura extraction to paginated Markdown, plus keyless arxiv and github search. One fenced agent face (web_search/web_fetch/web_open) and a FastMCP server. Installs via npx skills add, a Claude plugin, or uvx.

616 1 0

llamaindex-agent-toolbox

LlamaIndex integration for Agent Toolbox API — 13 tools for AI agents: search, extract, screenshot, weather, finance, news, translate, geoip, whois, dns, pdf-extract, and qr code generation

447 3 2

anything2md

Python package and CLI for converting documents to Markdown using Cloudflare Workers AI toMarkdown.

428 1 0

tikara

The metadata and text content extractor for almost every file type.

425 10 0

langchain-agent-toolbox

LangChain integration for Agent Toolbox API — 13 tools for AI agents: search, extract, screenshot, weather, finance, news, translate, geoip, whois, dns, pdf-extract, and qr code generation

361 3 2