
I’ve been digging into real-time upscaling solutions for AI video, specifically looking for a way to break past the "High Fidelity vs. High Speed" trade-off.
Standard upscalers like Real-ESRGAN are single-image (SISR) models: they treat every frame individually, which leads to that notorious "flickering" and instability. Meanwhile, most open-source VSR solutions that do produce great quality are not auto-regressive and are also heavy.
Enter FlashVSR. ⚡
FlashVSR is a one-step streaming diffusion framework that achieves real-time, high-quality video super-resolution. It introduces One-Step Streaming Distillation, Locality-Constrained Sparse Attention (LCSA), and a Tiny Conditional Decoder to deliver upscaling with extreme efficiency.
I’ve written a full technical report comparing various upscalers (SISR vs. VSR), have a read here. But in this post I wanted to share a demo of FlashVSR running directly inside Daydream Scope.
Why FlashVSR? Unlike standard image upscalers, FlashVSR uses Video Super Resolution (VSR). It utilizes temporal information across multiple frames to maintain consistency.
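To make the SISR vs. VSR distinction concrete, here is a toy numeric sketch (plain Python stand-ins, not FlashVSR's actual attention mechanism): the SISR path maps each frame independently, while the VSR path conditions each frame on a sliding window of past frames, which is what keeps outputs temporally consistent.

```python
def upscale_one(frame):
    # stand-in for a single-image upscaler: sees no temporal information
    return 2 * frame

def upscale_with_context(frame, ctx):
    # stand-in for VSR: blends the frame with its recent temporal context
    return 2 * (0.5 * frame + 0.5 * sum(ctx) / len(ctx))

def sisr(frames):
    # SISR: every frame is processed independently -> flicker between frames
    return [upscale_one(f) for f in frames]

def vsr(frames, window=2):
    # VSR: each frame is conditioned on a sliding window of past frames
    out = []
    for i, frame in enumerate(frames):
        ctx = frames[max(0, i - window): i + 1]
        out.append(upscale_with_context(frame, ctx))
    return out

print(sisr([1, 2, 3]))  # [2, 4, 6]
print(vsr([1, 2, 3]))   # [2.0, 3.5, 5.0] -- context pulls outputs together
```

The point of the toy: in the VSR path, neighbouring frames influence each other's outputs, so abrupt per-frame differences get smoothed instead of amplified.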
Once you have Scope up and running, you can install this right now into your running Scope instance with a single command:
uv run daydream-scope install git+https://github.com/varshith15/FlashVSR-Pro.git
In the demo video, you'll see the output running at around 15 FPS on an H100 SXM.
You can run FlashVSR not only on an H100 SXM but on any machine with at least 15 GB of VRAM. On an RTX 5090 we get about 20 FPS as a post-processor and about 14-15 FPS end-to-end as a standalone pipeline.
The Math: Why Post-Processing is the Unlock
While the demo shows FlashVSR running as a standalone plugin, the real efficiency unlock comes from chaining it directly as a post-processor after your generation pipeline (e.g., LongLive). By keeping the tensors on the GPU, we avoid the expensive encode/decode roundtrips and get a massive performance boost compared to native high-res generation.
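The chaining pattern can be sketched structurally (toy Python, not the actual Scope plugin API; `generate` and `upscale` are hypothetical stand-ins): frames stream from the generator straight into the upscaler as in-memory objects, with no encode-to-video / decode-from-video roundtrip between the stages.

```python
def chain(generate, upscale, prompt):
    # Frames flow from generator to upscaler as in-memory objects
    # (on a real pipeline: GPU-resident tensors), so there is no
    # encode/decode roundtrip between the two stages.
    for frame in generate(prompt):
        yield upscale(frame)

# Toy stand-ins for the two stages
low_res_gen = lambda prompt: iter([1, 2, 3])  # "generates" low-res frames
toy_upscaler = lambda frame: frame * 4        # "upscales" each frame

print(list(chain(low_res_gen, toy_upscaler, "a cat")))  # [4, 8, 12]
```

Because the chain is a generator, each frame is upscaled as soon as it is produced, which is what makes the post-processor usable in a streaming setting.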
Here is the math on why this approach wins:
By generating at lower resolution and using FlashVSR as a post-processor, we effectively get >2x the performance (13.9 FPS vs 6 FPS) for 1024px output, without sacrificing the stability or quality of the final video.
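As a sanity check on those numbers: when the two stages run back-to-back on each frame, the per-frame cost is the sum of the stage latencies, so the effective rate is the harmonic combination 1/(1/f_gen + 1/f_up). The stage rates below are illustrative assumptions, not measured values from this post:

```python
def chained_fps(gen_fps, upscale_fps):
    # Per-frame cost is the sum of the two stage latencies,
    # so effective throughput is the harmonic combination.
    return 1.0 / (1.0 / gen_fps + 1.0 / upscale_fps)

# Assumed stage rates: low-res generation ~30 FPS, FlashVSR ~26 FPS
print(round(chained_fps(30, 26), 1))  # ~13.9 FPS, vs ~6 FPS native 1024px
```

Even with the chaining overhead, the combined pipeline comfortably beats generating natively at high resolution.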
For more details and quality comparisons between the different upscalers, check out the detailed report.
Updates:
- Pushed a fix recently; the end-to-end FPS should now be about 20-22 on an H100 SXM (values updated in the report).
- Switched to Sparse Sage Attention instead of Block Sparse Attention; we now get about 22 FPS inference throughput on an RTX 5090, almost 1.5x faster than before.