Scope Audio Transcription

Real-time voice-to-visual preprocessor. Speak into your mic → Whisper AI transcribes → spaCy extracts nouns → prompts are injected into StreamDiffusionV2/LongLive, with a 10 s fallback to the text-box prompt.

Created: 02.14.26

🎤 Scope Audio Transcription Plugin

Voice-controlled real-time AI video generation for Daydream Scope

By Krista Faist | GitHub: kfaist/scope-audio-transcription

What It Does

Speak into your microphone and watch AI-generated visuals respond to your words in real-time.

This is a preprocessor plugin for Daydream Scope that captures live audio, transcribes speech using Whisper AI, extracts visual nouns with spaCy NLP, and injects them as prompts into any downstream pipeline (StreamDiffusionV2, LongLive, etc.).

Say "crystalline forest under a blood moon" — the plugin extracts "crystalline forest" and "blood moon" and feeds them to the diffusion model. The video output shifts accordingly within seconds.

Install

In Scope, install with:

```
git+https://github.com/kfaist/scope-audio-transcription.git
```

Then select audio-transcription as your preprocessor in the pipeline settings.

Requirements

  • Daydream Scope (latest)
  • A microphone
  • NVIDIA GPU (8GB minimum)
  • The plugin auto-installs: faster-whisper, spaCy, sounddevice

How It Works

🎙️ Microphone
  ↓
📊 Amplitude check (threshold: 0.008)
  ↓
🗣️ Whisper AI (tiny model, CPU int8 — no GPU overhead)
  ↓
🔍 spaCy NLP extracts nouns ("um that's like a stained glass butterfly" → "stained glass butterfly")
  ↓
🎨 Prompt injected into StreamDiffusionV2 / LongLive / etc.
  ↓
📺 Real-time AI video output
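The amplitude gate at the top of the pipeline can be sketched as a simple RMS check against the 0.008 threshold. A minimal sketch — the function and buffer names here are illustrative, not the plugin's actual API:

```python
import numpy as np

AMPLITUDE_THRESHOLD = 0.008  # mic level below this is treated as silence

def is_speech(samples: np.ndarray, threshold: float = AMPLITUDE_THRESHOLD) -> bool:
    """Return True if the audio buffer is loud enough to send to Whisper."""
    rms = np.sqrt(np.mean(np.square(samples, dtype=np.float64)))
    return rms > threshold

quiet = np.full(16000, 0.001)  # 1 s of near-silence at 16 kHz
loud = np.full(16000, 0.05)    # 1 s of speech-level audio
print(is_speech(quiet), is_speech(loud))  # → False True
```

Gating on amplitude before transcription means Whisper never runs on silence, which keeps the CPU cost of the tiny model negligible.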

The key insight: people don't speak in prompts. They ramble, use filler words, trail off mid-sentence. The NLP layer distills messy human speech into clean, concrete nouns that diffusion models understand.
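The plugin does this distillation with spaCy's noun chunks; a simplified stand-in using a hand-written filler list shows the idea (the word list and function name are illustrative only — the real implementation relies on spaCy's parser, not a stopword filter):

```python
import re

# Filler and function words to drop. The actual plugin uses spaCy
# noun_chunks instead of a fixed list like this one.
FILLER = {"um", "uh", "yeah", "so", "like", "okay", "that's", "a", "an", "the", "is"}

def extract_visual_phrase(utterance: str) -> str:
    """Distill rambling speech into the concrete words a diffusion model wants."""
    words = re.findall(r"[a-z']+", utterance.lower())
    kept = [w for w in words if w not in FILLER]
    return " ".join(kept)

print(extract_visual_phrase("um that's like a stained glass butterfly"))
# → stained glass butterfly
```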

The Dual-Prompt System

The plugin manages two prompt sources with automatic switching:

🟢 Voice Mode

When you speak and nouns are detected, voice drives the output. Each new noun phrase triggers a cache reset for a hard visual cut — the image snaps to the new concept immediately.

🟡 Text Box Mode

Whatever's typed in Scope's prompt box serves as the ambient fallback. Gallery staff can set the mood without touching code.

🟠 Fallback

After 10 seconds of silence (no audio above threshold), the plugin gracefully reverts to the text box prompt. The transition is automatic.

Switching Rules

  • Voice always wins while you're speaking and nouns are detected
  • Typing a new prompt in the text box immediately overrides voice
  • Filler speech ("um, yeah, okay") with no nouns keeps the current voice prompt alive without changing it
  • Silence (10s) triggers fallback to text box
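The rules above amount to a small state machine over two prompt sources. A sketch under those rules — class and method names are hypothetical, not the plugin's actual code:

```python
class PromptSource:
    """Tracks which prompt drives the output: voice or text box."""

    def __init__(self, voice_timeout: float = 10.0):
        self.voice_timeout = voice_timeout
        self.text_prompt = ""
        self.voice_prompt = None
        self.last_noun_time = 0.0

    def on_text_change(self, prompt: str) -> None:
        # A new text-box prompt immediately overrides voice.
        self.text_prompt = prompt
        self.voice_prompt = None

    def on_transcription(self, nouns: str, now: float) -> None:
        # Filler speech (no nouns) keeps the current voice prompt alive.
        if nouns:
            self.voice_prompt = nouns
            self.last_noun_time = now

    def active_prompt(self, now: float) -> str:
        # Voice wins until the timeout passes with no new noun injection.
        if self.voice_prompt and now - self.last_noun_time < self.voice_timeout:
            return self.voice_prompt
        return self.text_prompt

src = PromptSource()
src.on_text_change("misty ocean at dawn")
src.on_transcription("crystalline forest", now=0.0)
print(src.active_prompt(now=5.0))   # → crystalline forest
print(src.active_prompt(now=15.0))  # → misty ocean at dawn
```

Note that the timeout is measured from the last *noun injection*, not the last audio event — the reason for that choice is covered under "Room noise" below.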

Prompt Monitor

The plugin includes a standalone prompt monitor — a small always-on-top overlay window that shows exactly what's happening in real-time:

```
python tools/scope-prompt-monitor.pyw
```

Color Codes

| Color | Status | Meaning |
| --- | --- | --- |
| 🟢 Green | VOICE (active) | Voice nouns are driving the video output |
| 🟡 Yellow | UI PROMPT | User typed a new prompt in the text box |
| 🟠 Orange | FALLBACK | Voice timed out, reverted to text box |
| ⚪ Gray | Waiting | No activity yet |

What It Shows

  • Top line: Current source (VOICE / TEXT BOX / FALLBACK)
  • Middle line: The actual prompt driving the image right now
  • Bottom line: Detail — amplitude levels, extracted nouns, raw transcriptions, skipped filler

The monitor tails %APPDATA%\Daydream Scope\logs\main.log in real time. Essential for debugging during setup and for understanding what the system is doing at a glance.
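Tailing that log takes only a few lines of Python. A generic sketch of the idea — `follow` is a hypothetical helper, not the monitor's actual code:

```python
import os
import time

def follow(path: str, poll: float = 0.25):
    """Yield lines appended to a log file after this call, like `tail -f`."""
    f = open(path, "r", encoding="utf-8", errors="replace")
    f.seek(0, os.SEEK_END)  # skip history; only show new activity

    def new_lines():
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(poll)  # wait for the plugin to write more

    return new_lines()
```

The monitor can then color each yielded line according to whether it matches a VOICE, UI PROMPT, or FALLBACK marker.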

Configuration

Scope UI Settings

  • Preprocessor: audio-transcription
  • Pipeline: StreamDiffusionV2 (or any supported pipeline)
  • VAE: LightVAE recommended for 8GB GPUs
  • Input Mode: Video (the plugin overrides to text-only internally)
  • Manage Cache: ON

Plugin Parameters (in pipeline.py)

| Parameter | Default | Description |
| --- | --- | --- |
| Amplitude threshold | 0.008 | Minimum mic level to trigger transcription |
| Process interval | 3.0 s | How often audio is transcribed |
| Voice timeout | 10.0 s | Seconds of silence before fallback |
| Whisper model | tiny | Speech recognition model (tiny/base/small) |
| Sample rate | 16000 | Audio sample rate for Whisper |
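Grouped together, the defaults above look roughly like this (a sketch only — the actual pipeline.py may declare them differently, and the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class AudioTranscriptionConfig:
    """Defaults mirroring the parameter table above."""
    amplitude_threshold: float = 0.008  # minimum mic level to trigger transcription
    process_interval: float = 3.0       # seconds between transcription passes
    voice_timeout: float = 10.0         # seconds of silence before fallback
    whisper_model: str = "tiny"         # tiny / base / small
    sample_rate: int = 16000            # Hz, the rate Whisper expects

cfg = AudioTranscriptionConfig()
print(cfg.voice_timeout)  # → 10.0
```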

Optimizations for 8GB VRAM

  • LightVAE — 75% pruned, faster inference
  • 144×144 or 208×208 resolution — small but sufficient for projection
  • Denoising steps [47, 23] — two-step generation
  • Whisper on CPU (int8 quantized) — zero GPU overhead for transcription

The Build Story

This plugin was built for The Mirror's Echo, an interactive AI projection installation where gallery visitors speak and watch their words become visual landscapes. It's been through dozens of iterations — from NLTK keyword extraction to spaCy noun chunks, from OpenAI Whisper to faster-whisper, from 5-second processing intervals to 3-second, from voice-persists-forever to 10-second timeout with graceful fallback.

Key challenges solved:

Queue flooding: Scope's parameter queue gets overwhelmed when a preprocessor sends updates every frame. Solution: a bypass in pipeline_processor.py that merges prompt parameters directly, skipping the queue.

UI prompt wars: The frontend sends the text box prompt every frame. Without careful debouncing, it overwrites voice prompts instantly. Solution: track UI prompt initialization separately from user changes — only clear voice when the prompt actually changes, not on first load.

Filler speech: "Um, yeah, so like, okay" produces no useful visual nouns. Solution: spaCy filters to noun chunks only. If no nouns are found, the current prompt persists unchanged. The monitor shows "skipped" so you know the system heard you but chose not to change.

Room noise: Background audio above threshold keeps resetting the voice timeout timer forever. Solution: timeout based on last noun injection, not last audio detection.

Use Cases

  • Live performances — Speak, sing, or narrate while AI visuals respond
  • Interactive gallery installations — Visitors become co-creators
  • Storytelling — Narration drives visual evolution
  • Accessibility — Hands-free video generation for artists with mobility limitations
  • VJing — MC drives visuals with voice while DJ handles music

Links