Open-source self-hosted web archiving

Preserve the web. On infrastructure you control.

ArchiveBox saves websites, bookmarks, RSS feeds, social posts, media, source code, and research material in durable files like HTML, PDF, PNG, TXT, JSON, WARC, MP4, and SQLite.




# Docker Compose is the recommended setup
$ mkdir -p ~/archivebox/data && cd ~/archivebox
$ curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
$ docker compose run archivebox init --install
-> created ./data/index.sqlite3
-> installed Chrome, wget, yt-dlp, SingleFile, readability
ok listening on http://127.0.0.1:8000
CLI · Web UI · REST API · Webhooks · Browser extension · Filesystem access

Why ArchiveBox

Designed for useful archives, not proprietary silos.

Readable for decades

Snapshots are stored as ordinary files and folders, with metadata in SQLite and JSON. You can browse the collection without depending on a hosted service.

Multiple captures per URL

ArchiveBox can save rendered HTML, screenshots, PDFs, WARC files, article text, headers, favicons, media, subtitles, and source repositories.

Bring your sources

Import one URL, pipe text into the CLI, upload exported bookmarks, or schedule recurring pulls from RSS feeds and other text-based source lists.

Self-hosted by default

Run it as a Docker web app, use one-off CLI commands, or automate with APIs while keeping private and public material under your own policy.

Who it is for

Archiving workflows for individuals, professionals, and institutions.

Personal archivists and self-hosters

Save bookmarks, browser history, RSS feeds, social media, forum content, videos, podcasts, music, photos, and personal knowledge collections.

  • Own your data and keep it on local or remote storage you control.
  • Use the browser extension, CLI, Web UI, and scheduled imports together.
  • Export static HTML or browse the filesystem directly.

Lawyers and journalists

Capture web pages, articles, source material, and public records while preserving reviewable copies outside of volatile platforms.

  • Store screenshots, PDFs, headers, WARC files, and text extraction.
  • Tag and review sources through the self-hosted web interface.
  • Tune privacy, authentication, and publishing settings for casework.

Researchers and institutions

Support OSINT, social media research, AI-powered research agents, libraries, governments, and collection-building teams.

  • Automate imports through the CLI, REST API, webhooks, and schedules.
  • Keep machine-readable metadata in JSON and SQLite.
  • Extend extraction pipelines through the ArchiveBox ecosystem.

Recommended install

Start with Docker Compose.

Docker Compose is the recommended ArchiveBox setup for the easiest install and update path, better isolation, and bundled archiving dependencies like Chrome, wget, yt-dlp, SingleFile, and readability tools.

1. Create a data directory

$ mkdir -p ~/archivebox/data && cd ~/archivebox

2. Download the Docker Compose config

$ curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml

3. Initialize and start

$ docker compose run archivebox init --install
$ docker compose up

4. Add your first URL

$ docker compose run archivebox add 'https://example.com'

Other install paths

ArchiveBox can also run with plain Docker, pip, brew, deb packages, or the optional setup script. Docker Compose remains the default recommendation, especially when you want bundled dependencies and clean upgrades.

Configuration

Configure ArchiveBox with powerful options.

ArchiveBox can be configured with environment variables, the archivebox config CLI, or by editing ./ArchiveBox.conf. The same configuration model works in Docker, Docker Compose, bare-metal installs, scheduled jobs, and one-off CLI runs.

Common settings people search for

Use TIMEOUT for slow networks, CHECK_SSL_VALIDITY for sites with broken certificates, PUBLIC_INDEX, PUBLIC_SNAPSHOTS, and PUBLIC_ADD_VIEW for publishing policy, and browser user-agent settings for sites that block obvious bots.

For authenticated or difficult sites, review CHROME_USER_DATA_DIR, COOKIES_FILE, CHROME_USER_AGENT, WGET_USER_AGENT, and CURL_USER_AGENT. Public archive operators should also configure instance branding and contact details with settings like FOOTER_INFO and CUSTOM_TEMPLATES_DIR.

archivebox config
$ archivebox config                         # view full config
$ archivebox config --get CHROME_BINARY
$ archivebox config --set TIMEOUT=240
$ archivebox config --set PUBLIC_INDEX=False
$ env CHROME_BINARY=chromium archivebox add 'https://example.com'
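When running under Docker Compose, the same settings can also be pinned as environment variables on the service. A sketch of the relevant fragment, assuming the stock docker-compose.yml names the service archivebox (the values shown are examples, not recommendations):

```yaml
services:
  archivebox:
    environment:
      - TIMEOUT=240             # allow slow networks more time per page
      - CHECK_SSL_VALIDITY=False  # tolerate sites with broken certificates
      - PUBLIC_INDEX=False      # keep the index behind a login
```

Environment variables, the config CLI, and ArchiveBox.conf all feed the same configuration model, so values set here behave the same as the commands above.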

Inputs and outputs

Feed it URLs from any source. Save redundant, portable captures.

Inputs

Archive one URL at a time or schedule imports from bookmarks, browser history, RSS, JSON, CSV, TXT, SQL, HTML, Markdown, Pocket, Pinboard, Instapaper, Shaarli, Wallabag, and more.
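Since plain text piped into the CLI is a supported input, any notes file or export can be pre-filtered down to its URLs first. An illustrative Python sketch, assuming nothing about ArchiveBox itself; the regex is a rough heuristic, not a full URL parser, and archivebox add can also extract URLs from raw text on its own:

```python
import re

# Rough heuristic: match http(s) URLs up to whitespace or common delimiters.
# Not a full RFC 3986 parser -- purely for illustration.
URL_RE = re.compile(r'https?://[^\s<>"\']+')

def extract_urls(text):
    """Return the unique URLs found in a blob of text, in first-seen order."""
    seen = []
    for url in URL_RE.findall(text):
        if url not in seen:
            seen.append(url)
    return seen

notes = 'See https://example.com and https://example.org/feed.xml for details.'
print('\n'.join(extract_urls(notes)))
```

The resulting one-URL-per-line list can then be piped straight into archivebox add like any other text source.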

Outputs

Each snapshot can include original HTML, rendered single-file HTML, PDF, screenshot PNG, WARC, title, article text, favicon, headers, media, subtitles, metadata, thumbnails, and git clones.

Access

Manage the same collection through the Web UI, CLI, REST API, Python API, SQLite, or the data folder itself. The tools are complementary, not separate products.
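Because the index is plain SQLite, the collection can be queried with any SQLite client or the Python standard library. A minimal sketch, demonstrated against an in-memory stand-in for data/index.sqlite3; the table and column names used here (core_snapshot with url, title, added) are assumptions about the Django-managed schema and may differ between versions:

```python
import sqlite3

# Stand-in for data/index.sqlite3 -- a real run would instead do:
#   db = sqlite3.connect('data/index.sqlite3')
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE core_snapshot (url TEXT, title TEXT, added TEXT)')
db.executemany(
    'INSERT INTO core_snapshot VALUES (?, ?, ?)',
    [
        ('https://example.com', 'Example Domain', '2021-04-06'),
        ('https://example.org', 'Example Org',    '2021-04-07'),
    ],
)

# List snapshots, newest first
rows = db.execute(
    'SELECT added, url, title FROM core_snapshot ORDER BY added DESC'
).fetchall()
print(rows)
```

The same query shape works from any language with a SQLite driver, which is the point of keeping the index in a standard format rather than a proprietary store.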

Disk layout, exporting, and security

A normal folder, a SQLite index, and files you can inspect without ArchiveBox.

All ArchiveBox state for a collection lives in one data folder: the SQLite database, configuration, logs, and archived content. Snapshots are organized on disk as ordinary files and folders so they can be backed up, searched, published carefully, or browsed by hand.

Archive layout

~/archivebox/data
data/
  index.sqlite3          # main metadata database
  ArchiveBox.conf        # collection configuration
  archive/
    1617687755/
      index.html
      index.json
      screenshot.png
      output.pdf
      warc/1617687755.warc.gz
      media/some_video.mp4
      git/somerepo.git

Each snapshot folder includes static index.html and index.json metadata plus extractor outputs such as HTML, PDFs, screenshots, WARC files, media, subtitles, headers, favicons, article text, and git repositories.
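Because each snapshot carries its own index.json, the whole collection can be enumerated without ArchiveBox running. A Python sketch that walks an archive/ folder, demonstrated here against a throwaway directory; the field names read from index.json (timestamp, url, title) are assumptions about its layout and may vary by version:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def iter_snapshots(archive_dir):
    """Yield (timestamp, url, title) from every snapshot's index.json."""
    for index_file in sorted(Path(archive_dir).glob('*/index.json')):
        meta = json.loads(index_file.read_text())
        yield meta.get('timestamp'), meta.get('url'), meta.get('title')

# Demo against a throwaway folder shaped like data/archive/:
with TemporaryDirectory() as tmp:
    snap = Path(tmp) / '1617687755'
    snap.mkdir()
    (snap / 'index.json').write_text(json.dumps({
        'timestamp': '1617687755',
        'url': 'https://example.com',
        'title': 'Example Domain',
    }))
    rows = list(iter_snapshots(tmp))

print(rows)
```

Pointing iter_snapshots at a real data/archive/ folder would produce the same tuples for every snapshot on disk.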

Static exports and publishing

You can export the archive index as static HTML, JSON, or CSV with archivebox list so collections can be reviewed without running the web server. Keep generated exports next to the archive/ folder so relative snapshot paths continue to work.

$ archivebox list --html --with-headers > index.html
$ archivebox list --json --with-headers > index.json
$ archivebox list --csv=timestamp,url,title > index.csv

Private content

If you archive private URLs, paywalled pages, unlisted media, Google Docs, or browser sessions, assume snapshot viewers may see private URLs, cookies, session tokens, headers, and page content unless you configure access carefully.

Restrict public access with PUBLIC_INDEX=False, PUBLIC_SNAPSHOTS=False, and PUBLIC_ADD_VIEW=False, then create authenticated users with archivebox manage createsuperuser.

Read the security overview

Archived JavaScript

Archived pages that execute JavaScript can make requests from the same domain as the web UI when viewed. If this risk matters for your collection, review the publishing guidance and consider disabling extractor outputs that execute archived JS when opened.

archivebox config --set SAVE_WGET=False SAVE_DOM=False

Read Issue #239

Storage requirements

ArchiveBox can use roughly 1 GB to 50 GB per 1,000 snapshots depending on media, video, audio, and extractor settings. Tune YTDLP_ENABLED and YTDLP_MAX_SIZE for media-heavy collections.

Keep index.sqlite3 on local storage or SSD when possible. Large archive/ folders can live on NFS, SMB, FUSE, S3-backed, or HDD storage, but Docker and fileshare setups may need PUID, PGID, and root-squash adjustments. Avoid older filesystems like EXT3 or FAT for very large archives.
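When an archive grows, it helps to know which snapshots dominate disk usage before tuning media settings. A self-contained Python sketch that totals the size of each snapshot folder, demonstrated against a throwaway directory shaped like data/archive/:

```python
from pathlib import Path
from tempfile import TemporaryDirectory

def snapshot_sizes(archive_dir):
    """Return {snapshot_folder_name: total_bytes} for an archive/ folder."""
    sizes = {}
    for snap in Path(archive_dir).iterdir():
        if snap.is_dir():
            sizes[snap.name] = sum(
                f.stat().st_size for f in snap.rglob('*') if f.is_file()
            )
    return sizes

# Demo on a throwaway folder; a real run would pass data/archive/ instead.
with TemporaryDirectory() as tmp:
    snap = Path(tmp) / '1617687755'
    snap.mkdir()
    (snap / 'screenshot.png').write_bytes(b'\x00' * 1024)
    (snap / 'output.pdf').write_bytes(b'\x00' * 2048)
    sizes = snapshot_sizes(tmp)

print(sizes)  # the largest folders are usually media-heavy snapshots
```

Sorting the result by size quickly surfaces the snapshots worth revisiting with stricter YTDLP_MAX_SIZE or disabled media extraction.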

Plan for large archives

Ecosystem

Core projects that make ArchiveBox extensible.

abx-plugins

Extractor and plugin packages used to preserve more types of content.


abx-dl

Download and extraction tooling for saving web content and media.


abxpkg

Runtime package management for dependencies used by ArchiveBox and plugins.


abxbus

Event bus infrastructure for ArchiveBox automation and integrations.


Background and motivation

Built for decentralized preservation.

ArchiveBox exists because link rot, platform churn, censorship, and disappearing media routinely erase useful knowledge. The project aims to make the web content you care about viewable with common software in 50 to 100 years, without requiring ArchiveBox or a hosted replay service to understand your files.

Why self-host?

Centralized public archives are essential, but not every page belongs in a global public service. ArchiveBox lets individuals and organizations save public or private material they can access, keep it locally or within their institution, and decide what to publish case by case.

The project differentiates itself by combining a comprehensive CLI, a self-hosted web UI, API surfaces, scheduled imports, standard output formats, and a simple on-disk data layout that remains useful without the app running.

Ethical and legal context

Responsible archiving depends on your jurisdiction, your use case, and your publishing policy. ArchiveBox is a tool; operators remain responsible for handling private data, copyright, DMCA or GDPR requests, and local legal requirements.

Public instances should publish contact and removal information, avoid monetizing copied content, and review the security and publishing documentation before exposing snapshots to anonymous viewers.

Related web archiving tools and communities

ArchiveBox is a broad, self-hosted archive manager. For specialized crawling, replay, bookmark management, or research workflows, the wider web archiving ecosystem is worth knowing.

Documentation

Further Reading.

The README and wiki cover supported sources, outputs, scheduled archiving, storage backends, authentication, security, publishing, Chrome setup, upgrades, troubleshooting, API usage, development, and the broader web archiving community.