Web Archiving Python Packages

waybackpy

Wayback Machine API interface and command-line tool

5M 592 40

warcio

Streaming WARC/ARC library for fast web archive IO

774K 459 70

archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

72K 28K 2K

pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

10K 2K 238

cdx-toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

7K 211 34

storetle

Pip-installable wheel for streaming large document corpora (USPTO, arXiv, PubMed, web crawls). HTML-aware compression with random access, about 46% smaller than gzip WARC.

6K 0 0

cdxj-indexer

CDXJ Indexing of WARC/ARCs

5K 34 16

ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

4K 652 41

auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

4K 1K 107

hoardy-web

Passively capture, archive, and hoard your web browsing history, including the contents of the pages you visit, for later offline viewing, replay, mirroring, data scraping, and/or indexing. Your own personal private Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data.

1K 132 10

hoardy-web-sas

798 132 10