Video Conductor

KREA Real-Time 14B at 23 fps. 4× faster than baseline. Fast enough to perform with.

Soft cuts, style morphs, depth puppeteering. Record the performance, replay it, re-render it at higher quality.

The Core Idea

Generative video has been a rendering problem. You prompt, you wait, you get a clip. Broadcast is different. You watch the output, feel when it's gone stale, react: new prompt, new seed, different style. The system has to be fast enough that instinct works. This is that. ~24 fps autoregressive synthesis you can pilot live and puppeteer with your body. Not generation as post-production. Generation as instrument.

Features

Prompt Transitions — Crossfade between scenes and concepts. No hard cuts, no re-renders. The model interpolates between prompts, so you steer through latent space, not between clips.
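
A minimal sketch of what a prompt crossfade can look like, assuming the generator is conditioned on a text-embedding tensor each frame; `encode_prompt`, `generate_frame`, and the window length are illustrative placeholders, not the actual Scope API.

```python
# Sketch: crossfade between prompts by interpolating their text embeddings.
# `encode_prompt` and `generate_frame` are placeholder callables.
import torch

def prompt_transition(encode_prompt, generate_frame, old_prompt, new_prompt,
                      transition_frames=48):
    """Yield frames while blending from old_prompt to new_prompt."""
    e_old = encode_prompt(old_prompt)       # e.g. [seq_len, dim] text embedding
    e_new = encode_prompt(new_prompt)
    for i in range(transition_frames):
        t = (i + 1) / transition_frames     # 0 -> 1 over the window
        cond = torch.lerp(e_old, e_new, t)  # interpolate in embedding space
        yield generate_frame(cond)          # steer through latent space, no cut
```

An easing curve instead of a straight lerp gives softer in and out points, but the core move is the same: the conditioning drifts while the stream never cuts.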

Infinite Loops — Hold any prompt indefinitely. A dog walking up the street keeps generating: new variations, new trucks passing, new light. A single prompt can stay interesting for a surprisingly long time. Or use it as a stable backdrop for compositing.

A moment stretched between two beats
An explosion that never completes

LoRA Blending — Mix trained styles in real-time. Dial between aesthetics on the fly, treating LoRAs as faders.
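
A hedged sketch of the fader idea for a single linear layer: each LoRA contributes a low-rank delta scaled by its fader value. The weight layout and names are illustrative, not the project's actual LoRA handling.

```python
# Sketch: treat LoRAs as faders. Each style is a low-rank delta (B @ A)
# scaled by its fader value and added to the base weight.
import torch

def blended_weight(w_base, loras, faders):
    """
    w_base: [out, in] base weight
    loras:  dict name -> (A: [r, in], B: [out, r])
    faders: dict name -> float in [0, 1]
    """
    w = w_base.clone()
    for name, (A, B) in loras.items():
        scale = faders.get(name, 0.0)
        if scale != 0.0:
            w = w + scale * (B @ A)   # low-rank style delta, dialed by the fader
    return w
```

Recomputing the merged weight only when a fader actually moves, rather than re-applying deltas every frame, is the kind of in-place merge strategy described under Engineering Highlights below.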

Same scene, style blended live
Rankin-Bass to squash-and-stretch to hand-carved wood

Live Puppeteering — Drive the generation with your body via webcam input: depth maps, OpenPose, face tracking. The model responds to you, not just to text. Performance becomes the prompt.

Upper body controls character
Interacting with virtual props

Record → replay → re-render — Capture the stream (WebM) and the control timeline (JSON). Hard cuts reset state, so from any hard cut forward, the same seed and settings mean the same output. Record a performance, then replay or re-render it.
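
A sketch of the control-timeline half of this, assuming control changes (prompts, fader moves, hard cuts) arrive as small dict events; the event schema and the `apply_event` hook are invented for illustration.

```python
# Sketch: record timestamped control events to JSON, then replay them at
# their recorded offsets. Event schema is illustrative.
import json, time

class ControlTimeline:
    def __init__(self):
        self.events = []
        self.t0 = time.monotonic()

    def record(self, event: dict):
        self.events.append({"t": time.monotonic() - self.t0, **event})

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.events, f, indent=2)

    @staticmethod
    def replay(path, apply_event):
        """Re-apply events at their recorded offsets; with the same seed and
        settings, everything after a hard cut re-renders identically."""
        with open(path) as f:
            events = json.load(f)
        t0 = time.monotonic()
        for ev in events:
            while time.monotonic() - t0 < ev["t"]:
                time.sleep(0.001)
            apply_event(ev)
```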

Control Inputs

  • Webcam (WebRTC, browser, OBS pass-through)
  • Phone camera (WebRTC pairing; your phone becomes a roving depth sensor)
  • NDI sources (OBS, Unity, game engines; local only, WAN needs WebRTC bridge)
  • Synthetic (procedural depth for testing/effects)

Desk as Virtual Set

Depth conditioning — Video Depth Anything runs on the same GPU, feeding depth maps as a control signal. Point a phone at your desk and the model treats it like a miniature set. Characters instantiate on a little stage defined by the contours of furniture and objects.
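
Roughly what the depth control path reduces to, with a generic `depth_model` callable standing in for Video Depth Anything (whose real interface differs): normalize the raw depth per frame and resize it to the generator's control resolution.

```python
# Sketch: per-frame depth -> normalized control map at the generator's
# control resolution. `depth_model` is a placeholder callable.
import torch
import torch.nn.functional as F

def depth_control(frame_rgb, depth_model, size=(448, 448)):
    """frame_rgb: [3, H, W] float tensor -> [1, 1, h, w] normalized depth map."""
    depth = depth_model(frame_rgb)                   # [H, W] raw depth
    d = depth.unsqueeze(0).unsqueeze(0).float()
    d = (d - d.min()) / (d.max() - d.min() + 1e-6)   # normalize to [0, 1]
    return F.interpolate(d, size=size, mode="bilinear", align_corners=False)
```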

Desk as miniature set

Gaussian Splat → OBS → Scene Control

TikTok → OpenPose → OBS → Dancing Kaiju

Control Surface

Stream Deck mapped to 5 custom-trained LoRAs. Toggle or blend styles.

Control Map Modes

Engineering Struggles & Triumphs

The 30fps Mystery. The desktop was streaming at 30 fps, but the cloud GPU received only 4 fps, triggering constant "stale input" resets. The culprit: NDI over Tailscale has receive-rate variance, averaging 12 fps with 3–90 ms jitter. The fix was counterintuitive: add OBS as a frame stabilizer in the middle. WebRTC handles WAN transport; NDI stays local. Sometimes the solution is another layer, not fewer.
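
The receiver-side instrumentation that surfaces this kind of problem is simple: log inter-arrival times and summarize effective FPS and jitter. A minimal sketch (names and window size are illustrative):

```python
# Sketch: track inter-frame arrival intervals and report received FPS + jitter.
import time
from collections import deque

class FrameStats:
    def __init__(self, window=120):
        self.intervals = deque(maxlen=window)
        self.last = None

    def on_frame(self):
        now = time.monotonic()
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def summary(self):
        if not self.intervals:
            return "no frames yet"
        mean = sum(self.intervals) / len(self.intervals)
        return (f"recv {1.0 / mean:.1f} fps, "
                f"jitter {min(self.intervals)*1000:.0f}-{max(self.intervals)*1000:.0f} ms")
```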

System topology: Control from MacBook, preprocessing on desktop GPU, generation on B300 in Finland, all connected via Tailscale mesh

System topology: Control from MacBook, preprocessing on desktop GPU, generation on B300 in Finland, all connected via Tailscale mesh

Plastic Windows. The breakthrough was realizing we already had the mechanism. During prompt transitions, we lower the cache attention bias—creating a "plastic window" where the model accepts new inputs. Then we restore it and the new state becomes stable. The same trick that enables smooth prompt transitions now enables real-time control responsiveness. One mechanism, two capabilities.
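
A sketch of the plastic-window shape as a context manager, assuming the cache attention bias is exposed as a single scalar; the `cache_bias` attribute and values are placeholders, not Scope internals.

```python
# Sketch: temporarily lower the cache attention bias so new inputs can take
# hold, then restore it so the new state becomes stable.
from contextlib import contextmanager

@contextmanager
def plastic_window(model, open_bias=0.05, closed_bias=0.3):
    previous = getattr(model, "cache_bias", closed_bias)
    model.cache_bias = open_bias       # window open: past frames weigh less
    try:
        yield model
    finally:
        model.cache_bias = previous    # window closed: new state settles

# usage: run a prompt transition or a control burst inside the window, e.g.
# with plastic_window(model):
#     stream_transition(model, old_prompt, new_prompt)
```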

Start with Reliable Channels. We tried engineering flow fields and other exotic control signals—it was premature. We hadn't thought it through. What actually worked: start with media that already has the structure you want, and use proven channels first. Depth from Video Depth Anything has been solid. The fancy signals can wait until the simple ones hit their limits.

Engineering Highlights

1. Real-time is multiple clocks: input FPS, pipeline FPS, output FPS. We made staleness/hold-last/resume explicit so generation doesn't block on control (see the sketch after this list).

2. Performance bring-up is systems work: avoid silent fallbacks, remove hidden bottlenecks (buffering/queues/copies/sync), and use selective compilation where it’s stable.

3. Fast runtime style swapping: avoid per-frame LoRA overhead by doing work only when style scales change (in-place merge strategy), while keeping replayability and stability in mind.
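
A minimal sketch of the staleness/hold-last/resume policy from highlight 1, with invented names and thresholds: the control clock writes whenever input arrives, and the pipeline clock reads whatever is latest and keeps going.

```python
# Sketch: the generation clock never blocks on the control clock.
import time

class ControlLatch:
    def __init__(self, stale_after=0.5):
        self.stale_after = stale_after
        self.value = None           # last good control frame (e.g. a depth map)
        self.updated = 0.0

    def push(self, control):        # called at input FPS
        self.value, self.updated = control, time.monotonic()

    def sample(self):               # called at pipeline FPS
        stale = (time.monotonic() - self.updated) > self.stale_after
        # hold-last: keep generating with the previous control rather than
        # blocking; fresh pushes resume normal tracking automatically.
        return self.value, stale
```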

Opportunities / Problems (What We're Wrestling With)

Mouth Drift (Problem). The autoregressive model pays attention to its own outputs. Characters start talking and emoting even when the prompt doesn't ask for it. We didn't build this; we discovered it. Ideally we'd have mouth control via input injection on top of depth, but for now it's an unsolved problem.

Camera Motion is Hard. When the camera moves, every pixel changes and overwhelms the KV cache. The model "devolves to depth map," literally outputting the control signal. Slow motion works; fast pans don't. This is a known research frontier (CameraCtrl, EPiC, Motion Prompting all tackle it).

Outlook (Next Steps)

  • Hardware Control Surfaces. Stream Deck for style presets, MIDI knobs for continuous parameters.
  • Agent Director. A process that watches the stream, tracks world state, and decides when to transition or cut.
  • Music reactivity. Map audio features to visual parameters. Beat triggers hard cuts, bass drives structure.

Performances

First Performance (2h 30 min live session)

Two and a half hours of continuous live generation

Technical Deep Dive: The B200/B300 Optimization Journey

From ~8–11 FPS baseline to ~32–34 FPS benchmark (and ~23 FPS live at 448×448 with control overhead). Six days of profiling, layout fixes, and backend routing, Dec 25–30, 2025. Canonical benchmark: 320×576, 4 steps, BF16, bias=0.3.

4× faster per frame: Decode dropped from 65ms to 5.4ms. Total: 117ms → 30ms/frame. Canonical benchmark: 320×576, 4 steps, BF16, bias=0.3, after warmup.

About half the project was performance bring-up, and it's what made the realtime features usable.

We started at ~8 FPS on a B300 (11 FPS on a B200). The goal was to cross the interaction threshold where you can perform with the video, not just render it. We ended at ~32–34 FPS on a canonical benchmark (after warmup) and ~23 FPS live at 448×448 with realtime control overhead. That headroom is what made live style swaps + conditioning feel stable instead of fragile.

How I worked: research + profiling → one-change experiments → write down the “textbook” so progress compounds. (I ended up writing an 18‑part explainer series on FA4/Blackwell internals + a runbook/logbook so results were repeatable.)

The 6-day experiment-card sprint (profiling continued afterwards):

12-25: Routed KV-bias attention to FA4/CuTe score_mod (avoid slow backend paths)

12-26: Patch-embed fastpath (Conv3d→Conv2d) deleted a copy/fill storm (see the sketch after this list)

12-27: Fixed decode layout/contiguity + selective torch.compile → hit ~30+ FPS

12-28: VACE wrapper fastpaths removed a VACE-only regression

12-29: Encode channels-last in VACE path → halved encode cost at 640

12-30: Verified with Nsight; on our pipeline shape FA4 was ~4× faster than FA2
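
For the 12-26 patch-embed fastpath, the underlying identity is that a Conv3d with a temporal kernel of 1 is just a Conv2d applied per frame. A hedged sketch under that assumption (shapes and names are illustrative, not the actual model code):

```python
# Sketch: route a (1, p, p) Conv3d patch embed through an equivalent Conv2d
# over flattened frames, skipping the 5-D path.
import torch
import torch.nn.functional as F

def patch_embed_fastpath(x, conv3d):
    """x: [B, C, T, H, W]; conv3d: nn.Conv3d with kernel/stride (1, p, p)."""
    assert conv3d.kernel_size[0] == 1 and conv3d.stride[0] == 1 and conv3d.groups == 1
    B, C, T, H, W = x.shape
    w2d = conv3d.weight.squeeze(2)                      # [out, in, p, p]
    x2d = x.transpose(1, 2).reshape(B * T, C, H, W)     # fold time into batch
    y = F.conv2d(x2d, w2d, conv3d.bias,
                 stride=conv3d.stride[1:], padding=conv3d.padding[1:])
    out, Hp, Wp = y.shape[1], y.shape[2], y.shape[3]
    return y.reshape(B, T, out, Hp, Wp).transpose(1, 2)  # back to [B, out, T, Hp, Wp]
```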

What moved the needle:

▸ Kill silent fallbacks (new GPU stacks "work" while running slow paths)

▸ Layout contracts beat cleverness (contiguous/channels-last in the right spots)

▸ Backend routing is product work (prove which kernel ran)

▸ Selective compilation (compile stable subgraphs; avoid known-bad SM103 modes)
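
Selective compilation in this spirit is roughly a one-liner per stable subgraph; a sketch with placeholder module names (the real pipeline's structure may differ):

```python
# Sketch: compile only shape-stable subgraphs; keep dynamic control paths eager.
import torch

def selectively_compile(pipeline):
    # Stable shapes + no data-dependent branching -> safe to compile.
    pipeline.decoder = torch.compile(pipeline.decoder, dynamic=False)
    # Control preprocessing changes shape/content per session -> stays eager.
    return pipeline
```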

What didn't:

▸ FP8 as default (slower and/or lower quality than BF16 in our realtime stack)

▸ Chasing "mysterious copy overhead" (most was already fused; wins were specific layout slowpaths)

GitHub

https://github.com/davidrd123/scope/tree/competition-vace