ARK — verified knowledge compression
An offline, provenance-rich archive of open human knowledge — compressed, integrity-verified end-to-end, and queryable without the internet.
If the internet disappeared tomorrow, how much of human knowledge could one machine keep — and how would you prove none of it rotted?
ARK ingests open corpora — Wikipedia, Project Gutenberg, OpenAlex, Wikidata, OpenStreetMap, UniProt — through a pipeline of scan → normalize → deduplicate → chunk → compress → verify, climbing fifteen validated "scale rungs" from tiny synthetic corpora to a 47.7 GB Wikipedia slice. Each rung must pass completely before the next is attempted.
What makes it more than "ran zstd on some files"
- Manifest-first design. Every artifact carries permanent provenance: source, content hash, transformation chain, byte offsets. The manifest is the archive's source of truth; the packs are just bytes it vouches for.
- Verification is the product. Every archive must round-trip — decompress, re-hash, compare against source — before its intermediates may be deleted. Current state: 112/112 archives verified. This discipline caught a real incident: an audit script silently reading the wrong manifest reported "all verified" while checking nothing. The re-audit that exposed it is committed alongside the fix.
- Content-defined chunking with measured wins — chunk-boundary and pack-layout optimization produced a 140× speedup on the hot path, measured, not estimated.
- Queryable without decompression ceremony. The knowledge layer exposes full-text
search (SQLite FTS5), columnar analytics (DuckDB), and graph queries (SPARQL over
Wikidata) through one CLI —
ark knowledge— built as the integration hook for downstream consumers. - Boring-by-design operations. Resumable phases, journaled deletions (every freed intermediate has a reversible record), and health checks wired into the platform's cron registry.
Why it matters
This is data engineering with the stakes made visible: petabyte instincts at terabyte scale, provenance as a first-class object, and a verification culture where "trust me" is never the answer. The same muscles transfer directly to dataset pipelines, archival systems, and any context where data integrity is the job.
Status & limits — archives through scale 15 are verified and intermediates freed (~249 GB reclaimed). The unrealized half of the vision is consumption: nothing reads the knowledge layer in anger yet. Choosing the first real consumer — which assistant queries compressed knowledge, for what query shapes — is the next design milestone.