
Real-time generative video as a live instrument. 24 FPS autoregressive synthesis on KREA Real-Time 14B + VACE, driven from an 8×8 MIDI grid, 9 latent-space faders, and synchronized AI music — rebuilt from scratch after losing access to the GPUs that made Round 1 fast.
The first round built the engine: WebRTC streaming, prompt transitions, LoRA blending, depth puppeteering. This round built the instrument.
I integrated a hardware control surface, trained a sparse autoencoder to discover the geometry of my prompt library, mapped those learned directions to physical faders, added AI-generated music that follows the visual narrative, and then lost access to the GPUs that made it all fast — so I had to rebuild the performance stack from scratch on different hardware.
The Akai APC Mini MK2 is an 8×8 RGB pad grid with 9 faders — designed for Ableton, repurposed here as a video performance controller.

42 cinematic worlds mapped to an 8×8 MIDI grid. Top rows select palettes, bottom rows handle transport, transitions, and VACE control modes. Faders steer latent-space parameters at chunk rate.
Each pad maps to a prompt. Each prompt is a full scene description — camera, lighting, action, materials — written specifically for the KREA pipeline's autoregressive generation. I built **42 palettes** of 8 prompts each, organized as cinematic worlds:
Blade Runner. Akira. Spirited Away. The Thing. Fury Road. Nosferatu. Metropolis. Space Odyssey. Stalker. Suspiria. Kaiju Rampage. Rear Window. Good Bad Ugly. And 29 more — from Bodega Showdown to Cosmic Horror to Block Party Weather.
The prompts aren't simple labels. They're dense, multi-sentence scene descriptions that specify shot framing, material texture, lighting color, and temporal pacing — all tuned to how the model actually reads text:
> *Deckard, a felt-and-clay figure with a trench coat of heavy canvas, sits hunched at a miniature noodle stand made of weathered balsa wood. A sweeping medium shot captures the dense street scene where fiber-optic neon signs in magenta and teal flicker rhythmically...*
Press a pad → the prompt fires → the stream transitions. The LED grid reflects state: green for active, colors for palette membership. It feels like playing a launchpad, except each clip is a generative scene that never repeats.
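The dispatch logic is simple enough to sketch. This is a minimal, hypothetical version — the row-major 0–63 note layout matches the APC Mini's pad grid, but `send_prompt` and `set_led` are stand-ins for the bridge's real calls, not the project's actual API:

```python
# Minimal sketch of pad -> prompt dispatch. send_prompt / set_led are
# hypothetical stand-ins for the MIDI bridge's real calls.
GREEN = 0x15  # placeholder LED velocity for "active"

def pad_to_cell(note):
    """Convert a pad note number (0-63, row-major) into (row, col) on the 8x8 grid."""
    return divmod(note, 8)

def on_pad_press(note, palette, send_prompt, set_led):
    """Fire the prompt behind a pad and update the LED grid."""
    row, col = pad_to_cell(note)
    send_prompt(palette[row][col])  # palette: 8x8 nested list of scene prompts
    set_led(note, GREEN)            # LED feedback: green = active scene
```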
The faders needed to do something more interesting than tweak `noise_scale`. I wanted continuous, interpretable control over *what the model is thinking about* — not just how it renders.
The approach: I trained a k-sparse autoencoder (Top-K SAE, k=64) on pooled UMT5-XXL embeddings extracted from ~16,000 prompts and captions. The SAE decomposes the entangled 4096-dimensional conditioning space into 16,384 sparse features, of which exactly 64 are active for any given input.
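The Top-K mechanism itself is compact. A minimal PyTorch sketch with the dimensions described above (4096-dim input, 16,384 features, k=64); the pre-encoder bias and untied decoder are my assumptions, not confirmed details of the training run:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """k-sparse autoencoder: keep only the k largest latents per input."""
    def __init__(self, d_model=4096, n_features=16384, k=64):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)
        self.b_dec = nn.Parameter(torch.zeros(d_model))  # assumed pre-encoder bias

    def encode(self, x):
        z = self.encoder(x - self.b_dec)
        top = torch.topk(z, self.k, dim=-1)  # at most k survivors per input
        return torch.zeros_like(z).scatter_(-1, top.indices, top.values.relu())

    def decode(self, z):
        return self.decoder(z) + self.b_dec

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z
```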
What came out: a t-SNE visualization of the sparse activations revealed clean cluster structure — cinematic genres, material types, mood categories, character archetypes.

13.5k prompts projected through k-SAE feature space (k=64). Each island is a cinematic genre the autoencoder discovered on its own — Akira, cel animation, stop-motion, graffiti, kaiju. The faders navigate between them.
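The projection step behind that map can be sketched with scikit-learn; the perplexity and PCA initialization here are my choices for illustration, not the project's settings:

```python
import numpy as np
from sklearn.manifold import TSNE

def project_activations(z, perplexity=30, seed=0):
    """Project sparse SAE activations (n_prompts, n_features) to 2-D for plotting."""
    return TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed).fit_transform(z)
```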
The features are interpretable:

9 of 16,384 sparse features mapped to physical faders. Each direction was learned unsupervised — the autoencoder named its own axes. Push a fader and the embedding shifts along one interpretable dimension without disturbing the rest.
How it works live: The faders apply identity-preserving deltas — `output = input + (decode(z + Δ) - decode(z)) * σ`. Push a fader up and you're steering the active embedding along one learned direction without destroying the rest of the prompt's meaning. The deltas are applied at chunk boundaries (~150ms latency), so fader movements land on the next generated frame group.
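That delta formula translates directly to code. A sketch, assuming `sae` is any module exposing `encode`/`decode` and that each fader is bound to one feature index by config:

```python
import torch

def apply_fader_delta(emb, sae, feature_idx, sigma):
    """Steer a conditioning embedding along one learned SAE direction.

    Implements output = emb + (decode(z + delta) - decode(z)) * sigma,
    where delta is a unit push on a single sparse feature.
    """
    z = sae.encode(emb)
    delta = torch.zeros_like(z)
    delta[..., feature_idx] = 1.0
    steered = sae.decode(z + delta) - sae.decode(z)  # identity-preserving direction
    return emb + steered * sigma
```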
This is not prompt interpolation. It's continuous navigation through a learned manifold of visual concepts, grounded in what the model's text encoder actually represents.
The idea: audio as a co-steered output channel, not a soundtrack layered on top. When you press a pad, both the visual stream and the music shift together — same gesture, synchronized output.
Google's Lyria RealTime generates music that follows the narrative tone of whatever scene is active on the pad grid. Each palette gets a scoring identity — generated by Gemini Flash from the palette's visual brief — that defines genre, key, instruments, BPM range, and reference soundtracks. Escape from New York gets "John Carpenter Analog Synth — E Minor, LinnDrum, Prophet-5, Arp Odyssey." Blade Runner gets "Mechanical Chamber Folk-Noir — C Minor / F# Minor, detuned piano, glass harmonica." Each cell's music prompt is then derived from that identity, so every pad press launches both a visual scene and a thematically matched musical cue.
The integration runs through the APC Mini bridge: pad press sends the visual prompt to Scope and the music prompt to Lyria simultaneously. The blend fader crossfades both streams — visual embedding SLERP and weighted music prompts move from the same gesture. Faders control BPM, density, brightness, and guidance in real time. Track buttons handle play/pause, mute drums, mute bass, and context resets.
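The visual half of that crossfade — SLERP between two conditioning embeddings — can be sketched as follows (the exact blend semantics inside Scope are an assumption on my part):

```python
import torch

def slerp(a, b, t, eps=1e-7):
    """Spherical interpolation between two embedding tensors (last dim = features)."""
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    dot = (a_n * b_n).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    omega = torch.acos(dot)  # angle between the two embeddings
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
```

At `t=0` this returns the first embedding, at `t=1` the second, and in between it traverses the arc rather than the chord — which tends to stay on-manifold better than linear interpolation for normalized text embeddings.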
Audio currently plays out of the bridge's local speaker rather than through WebRTC — server-side audio routing is the remaining gap.

Tensor parallelism made it worse before it made it better. The fix: compile-friendly collectives that eliminate graph breaks entirely.
The B200 and B300 cards that powered Round 1 — 23 FPS at 448×448, comfortable margin above the live target — are gone. Moved to H200s. Still fast GPUs, but the pipeline that flew on Blackwell silicon doesn't hit the same numbers here.
Stage 1: Single H200 — ~20 FPS. Ported everything from the B300 stack: BF16 weights, `torch.compile`, and a Flash Attention 4 score-mod path we'd developed for Blackwell that turned out to work on Hopper too. Twenty frames per second is almost performable. But "almost" means dropped frames during transitions, visible hitching when puppeteering, no headroom for the faders or audio loop. Not good enough.
Stage 2: Naive TP=2 — 16 FPS. The obvious move: shard the 40-block transformer across two GPUs using Megatron-style column/row parallelism. Column-parallel on QKV projections, row-parallel on output, all-reduce after each. First inference worked — real panda on screen, correct image, no artifacts. But 16 FPS. Two H200s losing to one. The all-reduce overhead in eager mode ate the parallelism gain whole.
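The sharding arithmetic — why a column split feeding a row split needs exactly one all-reduce — can be checked single-process. A sketch of the linear pair only; the attention op that sits between the QKV and output projections is elided:

```python
import torch

def tp_linear_pair(x, w_up, w_down, world_size=2):
    """Column-parallel matmul feeding a row-parallel matmul.

    Each rank holds one column shard of w_up and the matching row shard of
    w_down; summing the per-rank partials (what the all-reduce does across
    GPUs) reproduces the unsharded result exactly.
    """
    col_shards = w_up.chunk(world_size, dim=1)    # split the output features
    row_shards = w_down.chunk(world_size, dim=0)  # split the input features
    partials = [(x @ c) @ r for c, r in zip(col_shards, row_shards)]
    return sum(partials)                          # stand-in for dist.all_reduce
```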
Stage 3: TP + compile — 9.6 FPS. Surely `torch.compile` would fix it — Inductor fuses kernels, overlaps compute and communication. Instead: half the speed of eager. The root cause was 160+ graph breaks. Every all-reduce call was wrapped in `dynamo.disable()`, which forced the compiler to shatter the graph into fragments — two collectives per block across 40 blocks, each disabled region breaking the graph on entry and exit. More dispatch overhead than actual compute. Two GPUs at less than half the speed of one. This was the low point.
Stage 4: The funcol fix — 24.5 FPS. Replaced the collective ops with `torch.distributed._functional_collectives` — a compile-friendly all-reduce that traces cleanly through Inductor without graph breaks. Zero graph breaks from collectives. The entire transformer block compiles into a single fused graph. **24.5 FPS on 2×H200.** Past the live target for the first time on this hardware. The fix was one import and a conditional code path — the debugging was the hard part.
Stage 5: TP=4 — 27 FPS. Scaled to four GPUs. Same funcol path, wider sharding, NVLink keeping the all-reduces fast. Comfortable headroom above the 24 FPS live target.
Stage 6: Pipeline parallelism (in progress). The next step: split the pipeline so rank 0 handles VAE decode and text encoding *while the TP mesh is already denoising the next chunk*. The two stages run concurrently instead of sequentially. Current status: real pixels flowing through a 3-rank pipeline on 4×A100, with dynamic prompts as the next milestone.
What worked: The k-SAE faders produce real, controllable effects on the generation — not random noise, not mode collapse, but directional shifts along interpretable axes. The 42-palette prompt library gives the pad grid genuine cinematic range. And the funcol TP fix turned a regression (16 FPS on two GPUs, worse than one) into 27 FPS on four.
What's rough: Lyria audio plays locally from the bridge process — not yet routed through the server or embedded in the WebRTC stream. The k-SAE deltas are visible but not yet dramatic at conservative sigma values. Pipeline parallelism is functional but not yet overlapping stages for real speedup. Only 2 of 42 palettes have been enriched with music prompts so far.
What's next: Dynamic prompts through the PP pipeline (text encoder on rank 0, live prompt changes mid-stream). Overlap decode with mesh inference for true stage hiding. And sequencing the 64-pad launch scene — curating prompt adjacency across the pads into a performable, rehearsable set.
The foundation from Round 1 ("Video Conductor") remains: prompt transitions (soft cut, hard cut, embedding interpolation), LoRA blending, depth puppeteering, infinite loops, record/replay/re-render, and the full WebRTC streaming stack. See the Round 1 page for details and video demos.