GPU timeshare — multi-tenant scheduling for one RTX 3090

When a transcription job, a benchmark run, and a vision service all want the same consumer GPU, you either babysit them manually forever or you build a scheduler. This is the scheduler — the same resource-arbitration problem cloud platforms solve, scaled to a single card, with all the sharp edges intact.

Design

Leases, not hope. Clients request VRAM through an API (Python helper, CLI wrapper, or a binary shim for tools I don't control). Effective free VRAM is computed as total VRAM minus the larger of physically used VRAM and the sum of outstanding reservations, so a reservation can't be stolen by an unmetered process in the gap before its owner allocates.
Priority with aging. A request's effective priority is its base priority plus a bonus that grows with waiting time, so low-priority batch work can't starve forever beneath a chatty interactive consumer.
Preemption is reversible. The novel part: instead of killing victims, the coordinator runs a two-phase pause — ask the consumer to yield, then verify from NVML that VRAM actually dropped before marking it paused. Paused processes park in RAM under a ledger capped at 32 GiB so "paused" can't quietly become "swapping the box to death." Tier-0 consumers support checkpoint-in-place via ptrace grants. Pause/resume costs are measured and persisted per consumer, so future scheduling decisions can price preemption instead of guessing.
The enforcer assumes nothing. A 15-second loop reconciles reality against the lease table. Unapproved GPU processes get a tree-aware audit that cross-references /proc ancestry and kernel process accounting, then a SIGTERM followed by a 300-second grace period before the kill, with full forensics logged. Anti-thrash timers stop pause/resume oscillation.
Failure-injection tested. 1,097 lines of tests include drills for the ugly paths: consumers that lie about yielding, processes that die mid-pause, leases that outlive their owners.
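A minimal sketch of the admission rule from "Leases, not hope", assuming VRAM is tracked in MiB and the physically used figure is an NVML reading supplied by the caller. `LeaseTable`, `effective_free`, and `try_grant` are hypothetical names, not the project's API.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseTable:
    """Hypothetical lease table; all VRAM figures are in MiB."""
    total_mib: int
    reservations: dict[str, int] = field(default_factory=dict)  # lease id -> MiB

    def effective_free(self, physically_used_mib: int) -> int:
        # Count whichever is larger: what the GPU reports as used, or what has
        # been promised. A reservation whose owner hasn't allocated yet still
        # counts, so its headroom is never handed out twice; if unmetered
        # processes push physical use above the promises, that use counts instead.
        committed = max(physically_used_mib, sum(self.reservations.values()))
        return self.total_mib - committed

    def try_grant(self, lease_id: str, want_mib: int, physically_used_mib: int) -> bool:
        # Admit only if the request fits in the effective free memory.
        if want_mib <= self.effective_free(physically_used_mib):
            self.reservations[lease_id] = want_mib
            return True
        return False
```

On a 24 GiB card (24,576 MiB) with a 10,000 MiB reservation that hasn't been allocated yet, `effective_free(500)` reports 14,576 MiB rather than 24,076, because the outstanding promise counts even though the hardware doesn't see it yet.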
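Priority with aging can be sketched as a linear bonus on top of base priority. The rate `AGING_PER_SECOND`, the `Waiter` type, and the tie-break on enqueue time are assumptions for illustration; the text only says that waiting time raises a request's standing in the queue.

```python
from dataclasses import dataclass

AGING_PER_SECOND = 0.01  # hypothetical rate: one priority point per 100 s of waiting


@dataclass
class Waiter:
    name: str
    base_priority: float
    enqueued_at: float  # seconds, on the same clock as `now`


def effective_priority(w: Waiter, now: float) -> float:
    # Waiting steadily raises effective priority, so a batch job queued long
    # enough eventually outranks a freshly arrived interactive request.
    return w.base_priority + AGING_PER_SECOND * (now - w.enqueued_at)


def next_to_admit(queue: list[Waiter], now: float) -> Waiter:
    # Highest effective priority wins; the earlier enqueue time breaks ties.
    return max(queue, key=lambda w: (effective_priority(w, now), -w.enqueued_at))
```

With these numbers, a base-priority-1 batch job that has waited 1,000 seconds scores 11 and outranks a base-priority-5 request that just arrived; after only 100 seconds it would still lose.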
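The two-phase pause and the 32 GiB RAM ledger might fit together as below. `request_yield` and `vram_used_mib` are hypothetical callbacks; in a real deployment the reading could come from NVML (for example pynvml's `nvmlDeviceGetComputeRunningProcesses`, whose entries carry `pid` and `usedGpuMemory` in bytes, converted to MiB). The threshold, timeout, and poll interval are assumed values, and `PauseLedger` is an illustrative name.

```python
import time
from typing import Callable

RAM_LEDGER_CAP_MIB = 32 * 1024  # the 32 GiB cap from the text


class PauseLedger:
    """Hypothetical ledger of host RAM held by paused consumers."""

    def __init__(self, cap_mib: int = RAM_LEDGER_CAP_MIB):
        self.cap_mib = cap_mib
        self.parked: dict[int, int] = {}  # pid -> MiB of host RAM held while paused

    def can_park(self, mib: int) -> bool:
        return sum(self.parked.values()) + mib <= self.cap_mib


def two_phase_pause(
    pid: int,
    request_yield: Callable[[int], None],  # hypothetical: tell the consumer to yield
    vram_used_mib: Callable[[int], int],   # hypothetical: per-pid VRAM reading in MiB
    ledger: PauseLedger,
    expected_host_mib: int,
    threshold_mib: int = 256,              # assumed: "dropped" means at or below this
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
) -> bool:
    # Refuse up front if parking this consumer would overflow the RAM ledger.
    if not ledger.can_park(expected_host_mib):
        return False
    # Phase 1: ask the consumer to yield.
    request_yield(pid)
    # Phase 2: trust the measurement, not the consumer. The process is only
    # recorded as paused once its VRAM reading has actually fallen.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if vram_used_mib(pid) <= threshold_mib:
            ledger.parked[pid] = expected_host_mib
            return True
        time.sleep(poll_s)
    return False  # it acknowledged the request, but VRAM never dropped
```

A consumer that acknowledges the request but keeps its memory simply times out and is never recorded as paused, which is the "lies about yielding" case in miniature.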
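The tree-aware half of the enforcer's audit can be approximated by walking parent pids from /proc until a lease holder is found. This sketch covers only the /proc ancestry side (not kernel process accounting), is Linux-specific, and assumes the kill that follows the 300-second grace period is SIGKILL; `read_ppid`, `covered_by_lease`, and `next_action` are hypothetical names.

```python
from typing import Callable, Optional

GRACE_S = 300  # grace period from the text


def read_ppid(pid: int) -> Optional[int]:
    """Parent pid from /proc/<pid>/stat (Linux); None if the process is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # comm (field 2) is wrapped in parentheses and may contain spaces or ')',
    # so split after the last ')': the remaining fields start with state, ppid.
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[1])


def covered_by_lease(
    pid: int,
    leased: set[int],
    parent_of: Callable[[int], Optional[int]] = read_ppid,
) -> bool:
    # Walk up the process tree: a worker spawned by a lease holder is approved.
    seen: set[int] = set()
    while pid is not None and pid > 1 and pid not in seen:
        if pid in leased:
            return True
        seen.add(pid)
        pid = parent_of(pid)
    return False


def next_action(termed_at: Optional[float], now: float) -> str:
    # Unapproved process: SIGTERM first, then the hard kill once the grace expires.
    if termed_at is None:
        return "sigterm"
    return "sigkill" if now - termed_at >= GRACE_S else "wait"
```

Passing `parent_of` as a parameter keeps the ancestry walk testable with a plain dictionary instead of a live process tree.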
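For the "leases that outlive their owners" case, one plausible reaper checks owner liveness with signal 0 and drops orphaned leases, and a drill can provoke the failure with a child that exits without releasing its lease. The function names and the pid-keyed lease table are assumptions. A production version would also compare process start times to guard against pid reuse.

```python
import os
import subprocess
import sys
from typing import Callable


def pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def reap_orphaned_leases(
    leases: dict[str, int], alive: Callable[[int], bool] = pid_alive
) -> list[str]:
    """Drop leases whose owner pid is gone; return the reaped lease ids."""
    dead = [lease_id for lease_id, pid in leases.items() if not alive(pid)]
    for lease_id in dead:
        del leases[lease_id]
    return dead


def drill_dead_owner() -> bool:
    # Failure injection: the owner exits without releasing its lease.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    leases = {"orphan": child.pid}
    child.wait()  # owner exited and was reaped; its lease is still in the table
    return reap_orphaned_leases(leases) == ["orphan"] and not leases
```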

Why it matters

This is distributed-systems thinking applied where most people would write a lock file: admission control, reversible preemption, verified state transitions, adversarial enforcement, and measured cost models. It generalizes directly to any shared-resource problem — and it's the project I'd extract first as open source, because every homelab with one good GPU has this exact fight.

Status & limits — in production arbitrating three live consumers daily. Single-node, single-GPU by scope; multi-GPU placement and contention-trend reporting are roadmapped, not built.

Stack