
Edge multimodal node — ROCK 5B+

13 services live · 2025–2026

A complete multimodal AI stack on a 6-TOPS ARM board — ASR, TTS, VLM, and diversity-sampled capture, with models quantized and converted for the NPU by hand.

13 systemd services · RKNN/ONNX NPU conversions · CLIP + YAMNet diversity sampling · camera + mic + speaker I/O

The constraint was the point: how much of a modern multimodal assistant fits on a single-board computer with 6 TOPS of INT8 NPU compute? Answer: thirteen systemd services, including streaming ASR (Whisper-small), two TTS engines (Kokoro ONNX, plus a quantized NPU-resident fallback), a VLM describe endpoint, local LLM inference, voice activity detection, and a health sentry, all surviving reboots unattended.
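To make the health-sentry idea concrete, here is a minimal sketch of how such a check could look. It is not the project's actual sentry: the unit names are hypothetical, and the only real interface assumed is `systemctl is-active --quiet`, which exits with status 0 when a unit is active.

```python
import subprocess
from typing import Callable, Iterable


def is_active(unit: str, run: Callable = subprocess.run) -> bool:
    """Return True if systemd reports the unit as active (exit status 0)."""
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def failed_units(units: Iterable[str], run: Callable = subprocess.run) -> list[str]:
    """Return the subset of units that are not currently active."""
    return [unit for unit in units if not is_active(unit, run)]


if __name__ == "__main__":
    # Hypothetical unit names, for illustration only.
    watched = ["asr.service", "tts.service", "vlm.service"]
    down = failed_units(watched)
    print("all services active" if not down else f"not active: {down}")
```

The `run` parameter is injectable so the logic can be exercised without a live systemd; a real sentry would presumably also alert or restart rather than just print.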

The work most relevant to real jobs is the model engineering: converting and quantizing models through ONNX and Rockchip's RKNN toolchain, benchmarking each variant against the board's memory and NPU limits, and documenting the honest sweet spots (a 3B-class LLM is the right ceiling; bigger models technically run and practically don't).
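A rough way to see why quantization width and model size interact with a fixed memory budget is to estimate the weight footprint alone. The sketch below is back-of-envelope arithmetic, not the project's benchmark method: the parameter counts and the budget are hypothetical inputs, and the estimate ignores KV cache, activations, runtime overhead, and the other services sharing the board.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def fits(n_params: float, bits_per_weight: int, budget_gib: float) -> bool:
    """True if the weights alone fit inside the given memory budget."""
    return weight_footprint_gib(n_params, bits_per_weight) <= budget_gib


if __name__ == "__main__":
    # Hypothetical model sizes and bit widths, for illustration only.
    for name, n_params in [("3B", 3e9), ("7B", 7e9)]:
        for bits in (16, 8, 4):
            size = weight_footprint_gib(n_params, bits)
            print(f"{name} at {bits}-bit: {size:.2f} GiB of weights")
```

Fitting in memory is only a first-pass filter; a model that fits can still be impractical once throughput on the NPU is measured, which is why per-variant benchmarking matters.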

The capture side is principled rather than greedy: camera and microphone samplers embed candidates with CLIP and YAMNet and keep only what's novel against the existing archive — data collection optimized for diversity per byte, not volume.
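The novelty test described above can be sketched as a nearest-neighbour check over embeddings. This is an illustration under assumptions, not the project's sampler: it takes precomputed embedding vectors (from CLIP or YAMNet in the real system), uses cosine similarity, and the `max_similarity` threshold of 0.9 is a hypothetical value.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


class NoveltyFilter:
    """Keeps a candidate only if it is dissimilar to everything archived so far."""

    def __init__(self, dim: int, max_similarity: float = 0.9):
        self.max_similarity = max_similarity  # hypothetical threshold
        self.archive = np.empty((0, dim), dtype=np.float32)

    def offer(self, embedding: np.ndarray) -> bool:
        """Archive the embedding and return True if novel, else return False."""
        candidate = l2_normalize(embedding.reshape(1, -1).astype(np.float32))
        if len(self.archive) > 0:
            # Highest cosine similarity against any archived embedding.
            nearest = float(np.max(self.archive @ candidate.T))
            if nearest >= self.max_similarity:
                return False
        self.archive = np.vstack([self.archive, candidate])
        return True
```

A brute-force scan like this is linear in archive size; for a large archive an approximate nearest-neighbour index would be the usual replacement, at the cost of an extra dependency.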

Status & limits — edge services are live; the archive-ingest pipeline has a known hardlink-collision bug and capture is paused until it's fixed. Imagery never leaves the LAN — architecture and benchmarks are the shareable artifacts.

Stack

ROCK 5B+ (RK3588) · RKNN toolkit · ONNX · Whisper-small · Kokoro TTS · CLIP ViT-B/32 · YAMNet
