PyRank

Warc Python Packages

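Python packages tagged with the GitHub topic warc, sorted by relevance. Each entry gives the GitHub owner, the package name, a short description, and three figures: monthly downloads, GitHub stars, and forks.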
chatnoir-eu
fastwarc

A robust web archive analytics toolkit

2.3M 138 18
chatnoir-eu
resiliparse

A robust web archive analytics toolkit

2.2M 138 18
webrecorder
warcio

Streaming WARC/ARC library for fast web archive IO

1.3M 457 69
ArchiveBox
archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

7K 28K 2K
cocrawler
cdx-toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

4K 206 34
oduwsdl
ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

4K 651 41
webrecorder
cdxj-indexer

CDXJ Indexing of WARC/ARCs

4K 34 15
mikwielgus
forum-dl

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

3K 117 6
openzim
warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

941 85 7
WEB-CHILD
internet-archive-extractor

Tool for querying the Internet Archive CDX API, downloading resources and packaging them as WARC files.

416 2 0
cocrawler
cocrawler

A modern web crawler framework for Python

354 194 24
q-m
scrapy-webarchive

A webarchive extension for Scrapy

303 9 1
internetarchive
scrapy-warcio

Scrapy WARC I/O

300 24 6
internetarchive
cdxsummary

Summarize web archive capture index (CDX) files

284 90 30
Florents-Tselai
warcdb

WarcDB: Web crawl data as SQLite databases.

269 405 10
oduwsdl
otmt

Tools for determining if web archive collections are Off-Topic

247 9 3
datacoon
metawarc

metawarc: a command-line tool for metadata extraction from files from WARC (Web ARChive)

191 34 2
ArchiveBox
archivebox-likn

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

169 28K 2K
Data from PyPI, GitHub, ClickHouse, and BigQuery