HTML to Markdown Python Packages

trafilatura

Python package and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, and output as CSV, JSON, HTML, MD, TXT, or XML

10.5M 6K 392

firecrawl-py

The API to search, scrape, and interact with the web at scale. 🔥

7.2M 144K 8K

firecrawl

The API to search, scrape, and interact with the web at scale. 🔥

1.3M 144K 8K

spider-client

Python, JavaScript, and Rust libraries for the Spider Cloud API.

368K 26 9

confluence-markdown-exporter

Export Atlassian Confluence pages as Markdown files.

44K 487 120

pyhtml2md

Transform your HTML into clean, easy-to-read markdown with html2md.

19K 82 12

contextractor

Crawl any website and extract clean, boilerplate-free main-content text as Markdown, plain text, JSON, HTML, or raw original HTML — ready for LLMs, RAG, and vector databases. Built on rs-trafilatura + Crawlee/Playwright. Ships as a CLI, npm library, Python package, and Apify Actor.

10K 0 2

crw

Fast, lightweight Firecrawl/Tavily alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/scrape, /crawl, /search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.

10K 263 22

markdown-query

A jq-like Markdown query language for command-line processing

7K 955 17

all2md

Convert PDF, Word, PowerPoint, HTML, email & 40+ formats to clean Markdown — and back. Built for LLMs, RAG & Python pipelines, with a built-in MCP server.

5K 12 1

pagetomd

Convert web pages to clean Markdown

3K 3 0

llm-data-converter

Convert any document format into LLM-ready data format (markdown) with advanced intelligent document processing capabilities powered by pre-trained models.

2K 7 1

article-extractor

Pure-Python article extraction library and HTTP API - Extract clean content from web pages as Markdown or HTML

2K 0 0

file2txt

Turn supported file types (e.g. .docx) into a Markdown-structured text file. Optionally defangs indicators and extracts text from images. Built for threat intel use cases.

1K 12 2

markdown-crawler

A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file for each page, designed for LLM RAG

1K 455 55

netlens-mcp

MCP server for AI agents to read any web page in full — past the robots.txt and bot blocks that stop native fetch — as clean, ad-stripped Markdown. Plus web search.

911 0 0

pulpie

Pareto-optimal models for cleaning the web — fast, encoder-based main-content extraction from HTML.

892 14 0

url2md4ai

Convert any URL into clean, token-efficient Markdown for LLMs. One thing, done well.

819 4 0

markgrab

Universal web content extraction — any URL to LLM-ready markdown

738 0 0

document-data-extractor

Convert any document format into LLM-ready data format (markdown) with advanced intelligent document processing capabilities powered by pre-trained models.

493 7 1

markmaton

Lightweight HTML-to-Markdown parser for AI agent workflows — Python CLI and API around a fast Go engine, available on PyPI.

430 0 0

html2txt

Convert HTML to Markdown

412 1 2

spiderwebai-py

Python, JavaScript, and Rust libraries for the Spider Cloud API.

228 26 9

bookmarkdown

✅ Parse your browser's exported HTML bookmark file to Markdown.

204 18 0
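
For orientation, the core task these packages share, turning HTML tags into Markdown syntax, can be sketched with the Python standard library alone. This is a toy illustration and not the implementation of any package listed here: the class `MiniMarkdown` and the function `to_markdown` are hypothetical names, and only headings, paragraphs, bold, italics, links, and list items are handled.

```python
import re
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Hypothetical toy converter; covers h1-h3, p, b/strong, i/em, a, li."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []     # Markdown fragments, joined at the end
        self.href = None  # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("#" * int(tag[1]) + " ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "li":
            self.out.append("- ")
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self.out.append("\n\n")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "li":
            self.out.append("\n")
        elif tag == "a":
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        # Collapse whitespace runs; drop leading space at the start of a block.
        text = re.sub(r"\s+", " ", data)
        if not self.out or self.out[-1].endswith("\n"):
            text = text.lstrip()
        if text:
            self.out.append(text)


def to_markdown(html: str) -> str:
    """Convert a small HTML fragment to Markdown text."""
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


if __name__ == "__main__":
    print(to_markdown('<h1>Title</h1><p>Hello <b>world</b>.</p>'))
```

The sketch ignores nesting depth, tables, code blocks, images, and boilerplate removal, which is where the packages in this list differ from one another.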