
Picture this: you have a powerful text-to-video model that can generate stunning motion from words alone. But what if you want to start with an image? What if you want to take a photo and say "make this come alive"? That's the dream of Image-to-Video (I2V), and it's exactly what we set out to build.
StreamDiffusion V2 runs on the `Wan2.1-T2V-1.3B` checkpoint—a text-to-video model that's incredibly good at what it does. But like a chef who's only ever cooked from recipes, it had never learned to work with images as input. We had two promising ideas that seemed like they should work. Spoiler alert: they didn't. But the third time? That's where the magic happened.
Our first attempt was elegant in theory: use CLIP (Contrastive Language-Image Pre-training) to extract semantic meaning from the input image, then inject those features directly into the model's attention mechanism. Think of it like showing the model a picture and whispering "here's what this image contains" at every decision point.
The architecture was straightforward: project the CLIP image tokens into the model's hidden space and concatenate them with the text tokens that feed every cross-attention layer.
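A minimal sketch of that idea, where every name, dimension, and token count is an illustrative assumption rather than the actual StreamDiffusion V2 code:

```python
import torch

# Hypothetical sketch of Attempt #1: project CLIP image tokens into the
# DiT's hidden space and concatenate them with the text tokens that feed
# cross-attention. All names and shapes here are assumed.
d_model = 1536                                 # assumed DiT hidden size
text_tokens = torch.randn(1, 77, d_model)      # encoded text prompt
clip_features = torch.randn(1, 257, 1280)      # assumed CLIP vision tokens

# The bridge layer that only exists in dedicated I2V checkpoints.
# In a T2V checkpoint it would have to be randomly initialized,
# which is exactly why this approach cannot work without retraining.
clip_img_emb = torch.nn.Linear(1280, d_model)

image_tokens = clip_img_emb(clip_features)               # (1, 257, 1536)
context = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 334, 1536)
# `context` would be handed to every cross-attention layer; with 257
# image tokens against 77 text tokens, attention mass skews to the image.
```

Note the token imbalance in the final shape: it previews exactly the failure mode described next.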
It was beautiful. It was logical. It was... completely broken.
The Transformer Rebellion
Even with perfect tensor formatting and correct concatenation order, something strange happened: the CLIP tokens took over. They dominated the attention mechanism so completely that the text prompt became irrelevant. The model, never having been trained to balance CLIP features with text in this specific way, essentially ignored your carefully crafted prompts.
The result? Grid-like, scattered outputs that looked nothing like what you asked for. It was like trying to have a conversation where one person keeps shouting over everyone else—the signal gets lost in the noise.
The Missing Weights Problem
But there was a deeper issue. The `Wan2.1-T2V-1.3B` checkpoint we're using is a text-to-video model. It was never trained to understand CLIP image features. The crucial `clip_img_emb` projection layer—the bridge that maps CLIP features into the model's internal language—only exists in dedicated I2V models.
We investigated other implementations like Matrix-Game-2 and MotionStream, but they use SkyReels V2 I2V 1.3B, a completely different base model that was trained from the ground up for I2V. To make CLIP work with our T2V checkpoint, we'd need to retrain the entire model. That's not a feature addition; that's a new model.
Sometimes the architecture looks right, the code is correct, but the fundamental mismatch between what the model was trained for and what you're asking it to do makes success impossible. The UI sliders for CLIP control remain in the interface, but they're functionally disabled for T2V pipelines to prevent confusion.
Our second approach was inspired by how native I2V models work: concatenate the VAE-encoded reference image directly into the input channels. Instead of injecting features through attention, we'd give the model the image as part of its input—like handing a painter both the canvas and a reference photo at the same time.
We tried multiple concatenation designs, from stacking the reference latent on the channel axis to prepending it as an extra frame in time.
It seemed like the most direct way to mimic native I2V behavior. What could go wrong?
The Rolling KV Cache Disaster
StreamDiffusion V2 uses a clever optimization called a "rolling KV cache" to maintain temporal consistency across video chunks. Think of it like a memory system that keeps track of what happened in previous frames so the model can generate smooth, coherent motion.
When we prepended an extra image frame to the temporal dimension, we broke this system completely. The cache indexing logic expects a consistent temporal window—it's like a conveyor belt that expects items at specific intervals. Inserting a static "past" frame desynchronized everything.
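To see why a fixed temporal window matters, here is a toy illustration of the rolling idea. The real cache stores attention keys and values per layer; this sketch only models the indexing behavior:

```python
from collections import deque

# Toy illustration of a rolling window over frames (the real KV cache
# operates on per-layer attention keys/values; this shows only indexing).
class RollingWindow:
    def __init__(self, window_frames):
        self.frames = deque(maxlen=window_frames)

    def append_chunk(self, chunk):
        # The window slides forward by exactly one chunk each step;
        # every stage of the pipeline must agree on the frame cadence.
        self.frames.extend(chunk)

cache = RollingWindow(window_frames=4)
cache.append_chunk(["f0", "f1", "f2"])
cache.append_chunk(["f3", "f4"])
print(list(cache.frames))  # ['f1', 'f2', 'f3', 'f4'] -- f0 evicted

# Prepending a static reference frame to one chunk shifts every
# subsequent position, desynchronizing the window from the generator.
```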
We tried several workarounds, but nothing worked. The fundamental conflict between the streaming architecture and our modification was irreconcilable.
The Channel Count Mismatch
Even if we could fix the cache issue, there was another problem: channel count. Native I2V models accept 32 input channels (16 for video + 16 for the reference). Our T2V checkpoint expects exactly 16. Without the projection weights to map those extra 16 channels into the model's internal dimension, the DiT (Diffusion Transformer) would receive random or zeroed values for half of its input channels. The reference image would be useless noise.
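The mismatch is visible at the shape level alone. A sketch with assumed latent shapes, using the usual (batch, channels, frames, height, width) layout:

```python
import torch

# Shape-level sketch of the mismatch (all shapes are assumptions).
video_latent = torch.randn(1, 16, 9, 60, 104)  # noisy video latents
ref_latent = torch.randn(1, 16, 1, 60, 104)    # VAE-encoded reference image

# Broadcast the reference across time, then stack on the channel axis.
ref_expanded = ref_latent.expand(-1, -1, video_latent.shape[2], -1, -1)
x = torch.cat([video_latent, ref_expanded], dim=1)  # now 32 channels

# A native I2V patch embedding accepts 32 input channels; the T2V
# checkpoint's accepts 16, so the concatenated tensor is rejected.
t2v_patch_embed = torch.nn.Conv3d(16, 1536, kernel_size=(1, 2, 2), stride=(1, 2, 2))
mismatch = False
try:
    t2v_patch_embed(x)
except RuntimeError:
    mismatch = True  # "expected input to have 16 channels, but got 32"
print("channel mismatch:", mismatch)
```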
Sometimes the streaming architecture itself becomes a constraint. What works in a batch processing context (like native I2V) breaks catastrophically when you need to maintain state across chunks in real-time. Channel concatenation was formally disabled for streaming and documented as incompatible with T2V-based models.
After two failures, we took a step back. What if instead of trying to *condition* the model with the image, we just... *replaced* the video feed with the image? What if we turned the whole pipeline into a streaming Img2Img system?
It sounds simple, but it required a crucial insight: motion-aware noise control.
When you upload an image, the frame processor does something clever: it repeats the image into an ordinary chunk of identical frames and sends it down the normal video path. The system doesn't know it's processing an image; it just sees a normal video chunk. This avoids all the cache synchronization issues we hit before.
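A sketch of that trick, with a hypothetical helper name and a NumPy frame format (the real frame processor differs):

```python
import numpy as np

# Hypothetical sketch: an uploaded image is repeated into an ordinary
# video chunk so the downstream pipeline never sees a special case.
def image_to_chunk(image: np.ndarray, chunk_size: int) -> np.ndarray:
    """(H, W, C) image -> (chunk_size, H, W, C) chunk of identical frames."""
    return np.repeat(image[None, ...], chunk_size, axis=0)

img = np.zeros((480, 832, 3), dtype=np.uint8)
chunk = image_to_chunk(img, chunk_size=4)
print(chunk.shape)  # (4, 480, 832, 3) -- downstream code sees plain video
```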
Here's where it gets interesting. If you just feed static image frames through the pipeline, the model will get "stuck"—it will faithfully reproduce the static image with no motion. We needed a way to tell the model: "this is a static image, generate some motion!"
Enter the Motion-Aware Noise Controller. It's like an adaptive volume knob that adjusts based on how much motion it detects:
- Static input (image): motion ≈ 0 → higher noise scale (~0.8). The high noise disrupts the static latent, allowing the model to "hallucinate" motion and change.
- Active input (video): motion is high → lower noise scale (~0.6). The low noise preserves the structure and motion of the input video.
The controller calculates the L2 distance (a proxy for motion) between consecutive frames and dynamically adjusts the noise. It's like having an AI assistant that knows when to be creative (high noise for static images) and when to be faithful (low noise for dynamic video).
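A sketch of such a controller. The post only gives the ~0.8 and ~0.6 endpoints; the motion threshold and the linear mapping between them are our assumptions:

```python
import numpy as np

# Sketch of a motion-aware noise controller. Endpoints come from the
# text (~0.8 static, ~0.6 active); the mapping in between is assumed.
NOISE_STATIC = 0.8   # high noise: lets the model hallucinate motion
NOISE_ACTIVE = 0.6   # low noise: preserves input structure and motion

def motion_score(prev_frame: np.ndarray, cur_frame: np.ndarray) -> float:
    """Root-mean-square (L2-style) per-pixel distance between frames."""
    diff = cur_frame.astype(np.float32) - prev_frame.astype(np.float32)
    return float(np.sqrt((diff ** 2).mean()))

def noise_scale(score: float, max_motion: float = 20.0) -> float:
    """Interpolate from NOISE_STATIC (no motion) down to NOISE_ACTIVE."""
    t = min(score / max_motion, 1.0)
    return NOISE_STATIC + t * (NOISE_ACTIVE - NOISE_STATIC)

static = np.zeros((8, 8, 3), dtype=np.uint8)
moving = np.full((8, 8, 3), 255, dtype=np.uint8)
print(noise_scale(motion_score(static, static)))  # 0.8 for a static image
print(noise_scale(motion_score(static, moving)))  # ~0.6 for strong motion
```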
This approach isn't perfect, but it's practical.
The cleanest code in the world won't help if you're trying to use a model for something it wasn't designed for. The T2V checkpoint simply doesn't have the weights needed for CLIP-based I2V, and no amount of clever engineering can fix that without retraining.
What works in batch processing (like native I2V channel concatenation) can break catastrophically when you need to maintain state across chunks. The rolling KV cache isn't just an optimization—it's a fundamental part of the architecture that can't be easily modified.
Instead of fighting the architecture, we worked with it. By treating images as video chunks and using adaptive noise control, we achieved I2V behavior without needing to retrain models or rebuild architectures.
One of our most valuable lessons came from reading research papers directly rather than relying on AI-generated summaries. The summaries often made incorrect assumptions—like assuming that because cross-attention exists in T2V models, they must support CLIP image features. The reality is more nuanced: the CLIP projection weights are trained only into I2V checkpoints, not T2V ones.
Today, StreamDiffusion V2 supports Image-to-Video through the VAE-latent approach with motion-aware noise control. Users can upload an image, adjust noise scales and prompts in real-time, and watch as static images come to life. It's not the I2V system we originally envisioned, but it's one that actually works—and sometimes that's even better.
Want to try it yourself? Check out the I2V Branch in Scope and experiment with different noise scales and prompts. The journey from failure to success is often more educational than getting it right the first time.