ARK — verified knowledge compression

If the internet disappeared tomorrow, how much of human knowledge could one machine keep — and how would you prove none of it rotted?

ARK ingests open corpora — Wikipedia, Project Gutenberg, OpenAlex, Wikidata, OpenStreetMap, UniProt — through a pipeline of scan → normalize → deduplicate → chunk → compress → verify, climbing fifteen validated "scale rungs" from tiny synthetic corpora to a 47.7 GB Wikipedia slice. Each rung must pass completely before the next is attempted.

What makes it more than "ran zstd on some files"

Manifest-first design. Every artifact carries permanent provenance: source, content hash, transformation chain, byte offsets. The manifest is the archive's source of truth; the packs are just bytes it vouches for.
Verification is the product. Every archive must round-trip — decompress, re-hash, compare against source — before its intermediates may be deleted. Current state: 112/112 archives verified. This discipline caught a real incident: an audit script silently reading the wrong manifest reported "all verified" while checking nothing. The re-audit that exposed it is committed alongside the fix.
Content-defined chunking with measured wins — chunk-boundary and pack-layout optimization produced a 140× speedup on the hot path, measured, not estimated.
Queryable without decompression ceremony. The knowledge layer exposes full-text search (SQLite FTS5), columnar analytics (DuckDB), and graph queries (SPARQL over Wikidata) through one CLI — ark knowledge — built as the integration hook for downstream consumers.
Boring-by-design operations. Resumable phases, journaled deletions (every freed intermediate has a reversible record), and health checks wired into the platform's cron registry.

Why it matters

This is data engineering with the stakes made visible: petabyte instincts at terabyte scale, provenance as a first-class object, and a verification culture where "trust me" is never the answer. The same muscles transfer directly to dataset pipelines, archival systems, and any context where data integrity is the job.

Status & limits — archives through scale 15 are verified and intermediates freed (~249 GB reclaimed). The unrealized half of the vision is consumption: nothing reads the knowledge layer in anger yet. Choosing the first real consumer — which assistant queries compressed knowledge, for what query shapes — is the next design milestone.

What makes it more than "ran zstd on some files"

Why it matters

Stack