5 memory regions. Autonomous dispatch chains. Single vkQueueSubmit. The GPU runs your program while the CPU handles I/O and nothing else.
Each VM instance has 5 GPU-resident memory regions mapped as Vulkan SSBOs. Data stays in VRAM between dispatches.
General-purpose working memory. Holds intermediate results, loop variables, computation state.
Accumulator for reductions. Sum, min, max, count operations write here for CPU readback.
Shared constants and model weights. Loaded once, read by all dispatches in the chain.
Program counter, dispatch parameters, status flags. Drives indirect dispatch and self-scheduling.
Dynamic allocation region. Large tensors, KV caches, variable-length data structures.
A complete GPU VM session in a dozen lines. Data stays on the GPU between kernel dispatches.
// Boot a VM with 3 memory region sizes
let vm = vm_boot(1.0, 8.0, 16.0)

// Load data into registers (SSBO 0)
let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
let _w = vm_write_register(vm, 0.0, 0.0, data)

// Dispatch a compute kernel — data stays in VRAM
let pc = [0.0, 3.0, 8.0]
let _d = vm_dispatch(vm, "stdlib/gpu/kernels/vm_scale.spv", pc, 1.0)

// Build and execute the command buffer (single submit)
let prog = vm_build(vm)
let _e = vm_execute(prog)

// Read results back — only now does data cross PCIe
let result = vm_read_register(vm, 0.0, 0.0, 8.0)
print("Result: {result}")  // [3, 6, 9, 12, 15, 18, 21, 24]
Not a shader wrapper. Not a GPGPU library. A complete virtual machine that lives on the GPU.
Chain multiple kernel dispatches into one Vulkan command buffer. One fence, one submit, zero CPU round-trips between stages.
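Chaining is just repeated vm_dispatch calls before a single vm_build/vm_execute pair. A minimal sketch building on the session above; the vm_reduce.spv path and its pc layout are assumptions modeled on vm_scale.spv (reduce is in the stdlib kernel list):

```
// Stage 1 scales, stage 2 reduces; both are recorded into one command buffer
let _d1 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_scale.spv", [0.0, 3.0, 8.0], 1.0)
let _d2 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_reduce.spv", [0.0, 0.0, 8.0], 1.0)

// One build, one submit, one fence. No CPU round-trip between stages.
let prog = vm_build(vm)
let _e = vm_execute(prog)
```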
CPU reads GPU status in ~1 microsecond via zero-copy mapped memory. No download, no fence wait, no overhead.
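Assuming the first numeric argument to vm_read_register selects the SSBO region and the Control region is SSBO 3 (both assumptions; this document only shows region 0 being read), a status poll is an ordinary read against the mapped memory:

```
// Assumption: region 3.0 = Control (program counter, dispatch params, status flags)
// Zero-copy mapped, so this read completes in ~1 microsecond with no fence wait
let status = vm_read_register(vm, 3.0, 0.0, 4.0)
```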
The GPU decides its own workgroup count. Self-scheduling without CPU intervention. True GPU autonomy.
Pre-built command buffers with indirect dispatch. Over-provisioned and ready to fire. Wake, dispatch, sleep.
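If the program returned by vm_build is re-submittable (an assumption; the session above executes it only once), the wake/dispatch/sleep cycle is just repeated vm_execute calls on the same pre-built buffer:

```
// Build once: the command buffer is recorded with indirect dispatch baked in
let prog = vm_build(vm)

// Fire repeatedly: wake, dispatch, sleep. Nothing is re-recorded on the CPU.
let _e1 = vm_execute(prog)
let _e2 = vm_execute(prog)
```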
GPU self-regulates via maxnorm + regulator kernels. Activation magnitudes stay bounded without CPU monitoring.
Scale, affine, matvec, reduce, WHERE, delta encode/decode, dictionary lookup, sort, scan, histogram, and more.
The VM model enables workloads that previously required custom GPU programming.
Layer-by-layer transformer execution. Weights in Globals, KV cache in Heap, activations in Registers. Entire forward pass as one dispatch chain.
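A sketch of the layer loop, assuming a vm_matvec.spv kernel (matvec is in the kernel list, but this path and pc layout mirror vm_scale.spv and are assumptions), with weights preloaded into Globals:

```
// One dispatch per layer; the whole forward pass goes into one command buffer
let _l0 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_matvec.spv", [0.0, 0.0, 0.0], 1.0)
let _l1 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_matvec.spv", [1.0, 0.0, 0.0], 1.0)
let _l2 = vm_dispatch(vm, "stdlib/gpu/kernels/vm_matvec.spv", [2.0, 0.0, 0.0], 1.0)

// Single submit: activations never leave VRAM between layers
let prog = vm_build(vm)
let _e = vm_execute(prog)
```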
Chain decode, transform, filter, and encode operations. Pixel data never leaves VRAM between stages. Process entire video streams on GPU.
Agent loop runs on GPU. Observe, decide, act — all in VRAM. CPU only handles I/O. Zero orchestration overhead per agent step.
ETL pipelines, columnar analytics, GPU-accelerated queries. Load CSV, filter, aggregate, join — all as GPU dispatch chains.
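A filter-then-aggregate sketch using the WHERE and reduce kernels from the stdlib list. The kernel paths and pc layouts follow the vm_scale.spv pattern, and the accumulator living in region 1.0 is an assumption; all of these are illustrative:

```
// Filter rows in Registers, then reduce the survivors: one submit, zero CPU hops
let _f = vm_dispatch(vm, "stdlib/gpu/kernels/vm_where.spv", [0.0, 0.0, 8.0], 1.0)
let _a = vm_dispatch(vm, "stdlib/gpu/kernels/vm_reduce.spv", [0.0, 0.0, 8.0], 1.0)
let prog = vm_build(vm)
let _e = vm_execute(prog)

// Only the aggregate crosses PCIe (assumes region 1.0 holds the accumulator)
let total = vm_read_register(vm, 1.0, 0.0, 1.0)
```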