What if you could simply speak a world into being? That's the question driving The Mirror's Echo, an interactive AI projection installation I'm developing that transforms spoken language into living visual landscapes. A viewer steps up to a microphone, says "crystalline forest under a blood moon," and within seconds the projection shifts — trees crystallize, the sky bleeds red, light scatters through impossible geometries.
This article describes how I built the real-time voice-to-visual pipeline powering this work, using Daydream Scope's StreamDiffusionV2 pipeline and a custom audio-transcription preprocessor plugin.
The system chains together several components in a real-time loop:
Microphone → Whisper AI → spaCy NLP → StreamDiffusionV2 → Projection
The key insight: people don't speak in prompts. They say "oh wow, that's like a, um, stained glass butterfly or something." The NLP layer distills that into stained glass butterfly — exactly what the diffusion model needs.
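That distillation step can be sketched with a simple stopword filter. The real pipeline uses spaCy's noun_chunks, which handles this far more robustly; treat the filler list and function below as a minimal stand-in, not the actual NLP layer:

```python
import re

# Filler words that Whisper faithfully transcribes but a diffusion model
# can't use. In the real pipeline, spaCy's noun_chunks does this job;
# this stopword heuristic is only a simplified stand-in.
FILLERS = {
    "oh", "wow", "um", "uh", "so", "like", "maybe", "a", "an", "the",
    "that's", "it's", "or", "something", "and", "you", "know",
}

def distill_prompt(transcript: str) -> str:
    """Reduce rambling speech to the content words a diffusion model needs."""
    words = re.findall(r"[a-z']+", transcript.lower())
    kept = [w for w in words if w not in FILLERS]
    return " ".join(kept)

# distill_prompt("Oh wow, that's like a, um, stained glass butterfly or something.")
# -> "stained glass butterfly"
```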
I explored several approaches before landing on Scope. I've worked extensively with TouchDesigner and StreamDiffusion's TD plugin, but Scope's preprocessor architecture solved a fundamental problem: how do you inject prompts from an external source into a running diffusion pipeline?
Scope's preprocessor system lets you intercept the pipeline at the frame level. My audio-transcription plugin sits between the input and StreamDiffusionV2, passing video frames through untouched while injecting voice-derived prompts into the generation parameters. The pipeline doesn't know or care that its prompts are coming from a microphone — it just receives text and generates.
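As a rough sketch of that contract (the class and method names here are hypothetical, not Scope's actual plugin API): frames pass through untouched, and the newest voice-derived prompt is merged into the generation parameters.

```python
# Hypothetical sketch of the preprocessor's role. Scope's real plugin
# interface differs, but the contract is the same: video frames pass
# through unchanged while prompts ride along in the parameters.
class AudioTranscriptionPreprocessor:
    def __init__(self):
        self.latest_prompt = None  # updated by the transcription thread

    def on_transcription(self, prompt: str):
        self.latest_prompt = prompt

    def process(self, frame, params: dict):
        # Pass the frame through untouched; inject the voice prompt if any.
        if self.latest_prompt is not None:
            params = {**params, "prompt": self.latest_prompt}
        return frame, params
```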
The input_mode: "text" override was critical. StreamDiffusionV2 normally expects video input for img2img generation. By forcing text-only mode, the model generates purely from the prompt, creating imagery that responds to speech rather than transforming a camera feed.
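Schematically, the override amounts to a small pipeline configuration; field names other than input_mode are illustrative, not Scope's exact schema:

```json
{
  "pipeline": "StreamDiffusionV2",
  "input_mode": "text",
  "preprocessors": ["audio-transcription"]
}
```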
The installation needs two modes:
Voice Mode (Green): When someone is actively speaking and nouns are detected, their words drive the visuals. "Ocean waves crashing" produces ocean imagery. "Cathedral ceiling" shifts to architecture. The transition between prompts uses Scope's cache reset for hard cuts — each new noun phrase gets a fresh generation.
Text Box Fallback (Yellow): When no one is speaking (10 seconds of silence), the system falls back to whatever prompt is set in Scope's UI. This serves as an ambient visual state — a default aesthetic that plays between interactions. Gallery staff can change this by typing in the prompt box without touching code.
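The two modes and the silence timeout can be modeled as a small state machine. This is a sketch of the stated behavior (noun phrases switch to voice mode, 10 seconds of silence falls back to the UI prompt), with an injectable clock for testing; it is not the installation's actual code:

```python
import time

SILENCE_TIMEOUT = 10.0  # seconds of silence before falling back to the UI prompt

class ModeController:
    """Tracks which prompt source drives the visuals: green voice mode
    or yellow text-box fallback."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_voice_time = -SILENCE_TIMEOUT  # start in fallback mode
        self.voice_prompt = None

    def on_nouns(self, prompt: str):
        # A detected noun phrase switches (or keeps) the system in voice mode.
        self.voice_prompt = prompt
        self.last_voice_time = self.clock()

    def active_prompt(self, ui_prompt: str):
        # Returns (mode, prompt): voice-driven, or fallback after silence.
        if self.voice_prompt and self.clock() - self.last_voice_time < SILENCE_TIMEOUT:
            return "VOICE", self.voice_prompt
        return "FALLBACK", ui_prompt
```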
A prompt monitor overlay (a small tkinter window) shows the current state in real-time: which mode is active, what nouns were extracted, the microphone amplitude, and whether transcription is happening. This is essential for debugging during installation and for gallery staff to understand what the system is doing.
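A minimal version of such an overlay, assuming a simple polling design (the widget layout and names here are illustrative; the status_line formatter is the part that matters):

```python
def status_line(mode: str, nouns: str, amplitude: float, transcribing: bool) -> str:
    """Build the one-line status string shown in the overlay."""
    state = "transcribing" if transcribing else "idle"
    return f"{mode}: {nouns} | mic {amplitude:.2f} | {state}"

def run_monitor(get_status):
    """Minimal always-on-top tkinter overlay; polls get_status() every 100 ms.
    get_status returns (mode, nouns, amplitude, transcribing)."""
    import tkinter as tk  # imported here so headless test runs don't need a display

    root = tk.Tk()
    root.title("Prompt Monitor")
    root.attributes("-topmost", True)
    label = tk.Label(root, font=("Courier", 14), fg="white", bg="black")
    label.pack(padx=8, pady=8)

    def refresh():
        mode, nouns, amp, busy = get_status()
        label.config(text=status_line(mode, nouns, amp, busy),
                     fg="green" if mode == "VOICE" else "yellow")
        root.after(100, refresh)  # schedule the next poll

    refresh()
    root.mainloop()
```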
My development machine has an 8GB GPU — far from the 5090s in Ryan's VACE demos. Making this work required aggressive optimization.
The result is approximately 3 fps of AI-generated imagery driven by voice. Not silky smooth, but for a projection installation where the visual shifts are the spectacle, it works. The dreamy, slightly stuttered quality actually reinforces the feeling that you're watching something being imagined in real-time.
Nouns are everything. Early versions sent the full transcription to StreamDiffusion. The results were incoherent — diffusion models don't know what to do with "um, so like, maybe a." spaCy's noun extraction was the breakthrough. It turns rambling speech into clean, generative prompts.
Queue architecture matters. Scope's parameter queue can flood when a preprocessor sends updates too frequently. The solution was a bypass that merges prompt parameters directly, skipping the queue entirely. Without this, voice prompts would get dropped in favor of the UI prompt that the frontend sends every frame.
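The bypass idea resembles a single-slot, latest-wins mailbox rather than a FIFO queue: new prompts overwrite the slot, and the render loop drains it at its own pace, so a flood of updates can never back up. This is a sketch of the pattern, not Scope's internals:

```python
import threading

class PromptMailbox:
    """Single-slot, latest-wins prompt holder. Unlike a FIFO queue,
    which floods when the frontend re-sends its prompt every frame,
    new prompts simply overwrite the slot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._prompt = None

    def put(self, prompt: str):
        with self._lock:
            self._prompt = prompt  # overwrite: only the newest prompt matters

    def drain(self):
        """Take the pending prompt (or None), clearing the slot."""
        with self._lock:
            prompt, self._prompt = self._prompt, None
            return prompt
```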
The fallback needs to be graceful. Hard-cutting from voice-driven imagery to a static prompt looks jarring. The cache reset smooths transitions, and the 10-second timeout gives speakers natural breathing room without immediately snapping to the fallback.
Monitor everything. You cannot debug a real-time audio-visual pipeline by reading logs after the fact. The prompt monitor overlay was an afterthought that became essential. Seeing "VOICE: crystalline forest" flash green while the projection shifts gives you immediate confirmation that the whole chain is working.
The Mirror's Echo is being developed for exhibition at the Columbus Museum of Art's Wonderball 2026, alongside baroque-themed projection pieces. The voice pipeline will be the centerpiece — an interactive station where guests speak and watch their words become visual worlds.
I'm exploring several extensions.
The audio-transcription preprocessor is built as a Scope plugin.
The plugin architecture means you can drop this into any Scope pipeline — not just StreamDiffusionV2. As new real-time models land in Scope (LongLive, VACE, MemFlow), the voice input layer stays the same.
Speaking to machines and having them dream back at you — it's the most natural interface I've ever built. No keyboard. No touchscreen. Just your voice and an AI that listens.
Krista Faist is a VR/AI/moving image artist represented by Chaos Contemporary Craft gallery, a 2024 Fuse Factory Artist-in-Residence, and founding board member of Mural ReMix. Her work explores perception, wonder, and technological mediation through interactive installations and projection mapping. She splits her time between Columbus, Ohio and Sarasota, Florida.
Find her on Daydream: @Eicos73