Edge multimodal node — ROCK 5B+

The constraint was the point: how much of a modern multimodal assistant fits on a single-board computer with a 6 TOPS int8 NPU? Answer: thirteen systemd services, including streaming ASR (Whisper-small), two TTS engines (Kokoro ONNX plus a quantized NPU-resident fallback), a VLM describe endpoint, local LLM inference, voice activity detection, and a health sentry, all surviving reboots unattended.

The work most relevant to real jobs is the model engineering: converting and quantizing models through ONNX and Rockchip's RKNN toolchain, benchmarking each variant against the board's memory and NPU limits, and documenting the honest sweet spots (a 3B-class LLM is the right ceiling; bigger models technically run and practically don't).
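As a rough illustration of the sizing arithmetic behind that kind of benchmarking, the sketch below estimates weight storage alone from parameter count and bit width. It is a back-of-the-envelope aid, not the project's benchmark code: the function name is hypothetical, and it deliberately ignores KV cache, activations, and runtime overhead, all of which add to the real footprint.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate weight storage in decimal gigabytes.

    Weights only: KV cache, activations and runtime buffers are NOT
    included, so treat the result as a lower bound on memory use.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Illustrative model sizes and bit widths, not measured results.
    for params, bits in [(3, 8), (3, 4), (7, 8), (7, 4)]:
        size = weight_footprint_gb(params, bits)
        print(f"{params}B params at {bits}-bit: ~{size:.1f} GB of weights")
```

Measured throughput and peak memory on the board still decide the sweet spot; this estimate only shows which variants are worth benchmarking at all.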

The capture side is principled rather than greedy: camera and microphone samplers embed candidates with CLIP and YAMNet and keep only what's novel against the existing archive — data collection optimized for diversity per byte, not volume.
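The novelty test described above can be sketched as a nearest-neighbour check in embedding space. This is a minimal sketch under stated assumptions, not the samplers' actual code: the `NoveltyFilter` class, the cosine-similarity metric, and the 0.9 threshold are all hypothetical, and the short vectors stand in for real CLIP or YAMNet embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


class NoveltyFilter:
    """Keep a candidate only if nothing already archived is too similar to it."""

    def __init__(self, threshold=0.9):
        # Assumed cutoff: a candidate whose nearest archived neighbour has
        # cosine similarity >= threshold counts as a near-duplicate.
        self.threshold = threshold
        self.archive = []  # unit-length embeddings kept so far

    def offer(self, embedding):
        """Return True and archive the embedding if it is novel, else False."""
        query = normalize(embedding)
        nearest = max((cosine(query, kept) for kept in self.archive), default=-1.0)
        if nearest >= self.threshold:
            return False
        self.archive.append(query)
        return True


if __name__ == "__main__":
    novelty = NoveltyFilter(threshold=0.9)
    # Toy 2-D vectors; real embeddings have hundreds of dimensions.
    for candidate in ([1.0, 0.0], [0.99, 0.05], [0.0, 1.0]):
        print(candidate, "kept" if novelty.offer(candidate) else "skipped")
```

A linear scan like this is fine for a small archive; a larger one would likely want an approximate nearest-neighbour index instead, which is a design choice this sketch leaves open.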

Status & limits: edge services are live; the archive-ingest pipeline has a known hardlink-collision bug, and capture is paused until it's fixed. Imagery never leaves the LAN, so the architecture and benchmarks are the shareable artifacts.

Stack