5 memory regions. Autonomous dispatch chains. Single vkQueueSubmit. The GPU runs your program while the CPU handles I/O and nothing else.
Each VM instance has 5 GPU-resident memory regions mapped as Vulkan SSBOs. Data stays in VRAM between dispatches.
General-purpose working memory. Holds intermediate results, loop variables, computation state.
Accumulator for reductions. Sum, min, max, count operations write here for CPU readback.
Shared constants and model weights. Loaded once, read by all dispatches in the chain.
Program counter, dispatch parameters, status flags. Drives indirect dispatch and self-scheduling.
Dynamic allocation region. Large tensors, KV caches, variable-length data structures.
A complete GPU VM session in a dozen lines. Data stays on the GPU between kernel dispatches.
// Boot a VM with 3 memory region sizes
let vm = vm_boot(1.0, 8.0, 16.0)

// Load data into registers (SSBO 0)
let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
let _w = vm_write_register(vm, 0.0, 0.0, data)

// Dispatch a compute kernel — data stays in VRAM
let pc = [0.0, 3.0, 8.0]
let _d = vm_dispatch(vm, "stdlib/gpu/kernels/vm_scale.spv", pc, 1.0)

// Build and execute the command buffer (single submit)
let prog = vm_build(vm)
let _e = vm_execute(prog)

// Read results back — only now does data cross PCIe
let result = vm_read_register(vm, 0.0, 0.0, 8.0)
print("Result: {result}")  // [3, 6, 9, 12, 15, 18, 21, 24]
Not a shader wrapper. Not a GPGPU library. A complete virtual machine that lives on the GPU.
Chain multiple kernel dispatches into one Vulkan command buffer. One fence, one submit, zero CPU round-trips between stages.
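Chaining is just repeated vm_dispatch calls before a single vm_build/vm_execute pair. A minimal sketch building on the session above; the vm_reduce.spv path and its pc layout are assumptions modeled on vm_scale.spv (reduce is in the stdlib kernel list):

```
// Stage 1 scales, stage 2 reduces; both are recorded into one command buffer
let _d1 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_scale.spv", [0.0, 3.0, 8.0], 1.0)
let _d2 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_reduce.spv", [0.0, 0.0, 8.0], 1.0)

// One build, one submit, one fence. No CPU round-trip between stages.
let prog = vm_build(vm)
let _e = vm_execute(prog)
```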
CPU reads GPU status in ~1 microsecond via zero-copy mapped memory. No download, no fence wait, no overhead.
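Assuming the first numeric argument to vm_read_register selects the SSBO region and the Control region is SSBO 3 (both assumptions; this document only shows region 0 being read), a status poll is an ordinary read against the mapped memory:

```
// Assumption: region 3.0 = Control (program counter, dispatch params, status flags)
// Zero-copy mapped, so this read completes in ~1 microsecond with no fence wait
let status = vm_read_register(vm, 3.0, 0.0, 4.0)
```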
The GPU decides its own workgroup count. Self-scheduling without CPU intervention. True GPU autonomy.
Pre-built command buffers with indirect dispatch. Over-provisioned and ready to fire. Wake, dispatch, sleep.
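If the program returned by vm_build is re-submittable (an assumption; the session above executes it only once), the wake/dispatch/sleep cycle is just repeated vm_execute calls on the same pre-built buffer:

```
// Build once: the command buffer is recorded with indirect dispatch baked in
let prog = vm_build(vm)

// Fire repeatedly: wake, dispatch, sleep. Nothing is re-recorded on the CPU.
let _e1 = vm_execute(prog)
let _e2 = vm_execute(prog)
```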
GPU self-regulates via maxnorm + regulator kernels. Activation magnitudes stay bounded without CPU monitoring.
Scale, affine, matvec, reduce, WHERE, delta encode/decode, dictionary lookup, sort, scan, histogram, and more.
The VM model enables workloads that previously required custom GPU programming.
Layer-by-layer transformer execution. Weights in Globals, KV cache in Heap, activations in Registers. Entire forward pass as one dispatch chain.
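A sketch of the layer loop, assuming a vm_matvec.spv kernel (matvec is in the kernel list, but this path and pc layout mirror vm_scale.spv and are assumptions), with weights preloaded into Globals:

```
// One dispatch per layer; the whole forward pass goes into one command buffer
let _l0 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_matvec.spv", [0.0, 0.0, 0.0], 1.0)
let _l1 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_matvec.spv", [1.0, 0.0, 0.0], 1.0)
let _l2 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_matvec.spv", [2.0, 0.0, 0.0], 1.0)

// Single submit: activations never leave VRAM between layers
let prog = vm_build(vm)
let _e = vm_execute(prog)
```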
Chain decode, transform, filter, and encode operations. Pixel data never leaves VRAM between stages. Process entire video streams on GPU.
Agent loop runs on GPU. Observe, decide, act — all in VRAM. CPU only handles I/O. Zero orchestration overhead per agent step.
ETL pipelines, columnar analytics, GPU-accelerated queries. Load CSV, filter, aggregate, join — all as GPU dispatch chains.
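A filter-then-aggregate sketch using the WHERE and reduce kernels from the stdlib list. The kernel paths and pc layouts follow the vm_scale.spv pattern, and the accumulator living in region 1.0 is an assumption; all of these are illustrative:

```
// Filter rows in Registers, then reduce the survivors: one submit, zero CPU hops
let _f = vm_dispatch(vm, "stdlib/gpu/kernels/vm_where.spv", [0.0, 0.0, 8.0], 1.0)
let _a = vm_dispatch(vm, "stdlib/gpu/kernels/vm_reduce.spv", [0.0, 0.0, 8.0], 1.0)
let prog = vm_build(vm)
let _e = vm_execute(prog)

// Only the aggregate crosses PCIe (assumes region 1.0 holds the accumulator)
let total = vm_read_register(vm, 1.0, 0.0, 1.0)
```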