Würstchen V2 Model Wins Over Stable Diffusion XL with Impressive Speed for Generating High-Resolution Images
A recent tweet by the author of an article titled “Würstchen” (German for “Sausage”) has captured the attention of enthusiasts and experts alike. The tweet shared the intriguing results of generating images using the new Würstchen V2 model.
Würstchen is fast and efficient, generating images faster than models like Stable Diffusion XL while using less memory. It also has reduced training costs: Würstchen v1 required only 9,000 GPU hours of training at 512×512 resolution, compared to the 150,000 GPU hours spent on Stable Diffusion 1.4. This 16x reduction in cost not only benefits researchers conducting new experiments but also opens the door for more organizations to train such models. Würstchen v2 used 24,602 GPU hours, still roughly 6x cheaper than SD 1.4, which was trained only at 512×512.
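The reduction factors quoted above follow directly from the GPU-hour figures. A quick sanity check (the hours are the article's numbers, not measured here):

```python
# GPU-hour figures as quoted in the article, not independently measured.
SD14_HOURS = 150_000          # Stable Diffusion 1.4 training cost
WUERSTCHEN_V1_HOURS = 9_000   # Würstchen v1 (512x512)
WUERSTCHEN_V2_HOURS = 24_602  # Würstchen v2

v1_reduction = SD14_HOURS / WUERSTCHEN_V1_HOURS  # ~16.7x
v2_reduction = SD14_HOURS / WUERSTCHEN_V2_HOURS  # ~6.1x

print(f"v1: {v1_reduction:.1f}x cheaper, v2: {v2_reduction:.1f}x cheaper")
```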
Würstchen V2 is a diffusion model that works in a highly compressed latent space of images, reducing computational costs for training and inference by orders of magnitude. It employs a novel design that achieves a 42x spatial compression, a feat not previously seen. Würstchen employs a two-stage compression, Stage A and Stage B, which decode compressed images back into pixel space. A third model, Stage C, is learned in the highly compressed latent space, requiring fractions of the compute used for current top-performing models while allowing cheaper and faster inference.
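To see where the 42x figure comes from, assume a 1024×1024 output image (an assumed resolution for illustration) mapped down to a 24×24 latent per spatial dimension:

```python
# Illustrate the ~42x spatial compression described above, assuming a
# 1024x1024 output image compressed to a 24x24 latent per side.
image_side = 1024
latent_side = 24
spatial_compression = image_side / latent_side  # ~42.7x per dimension
print(f"~{spatial_compression:.1f}x compression per spatial dimension")
```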
Würstchen V2's pipeline comprises three stages, two of them diffusion models:
- Stage C: the text-conditioned diffusion model, with a staggering 1 billion parameters. The acceleration here comes from ultra-high compression: instead of the 128×128×4 latent used by SDXL, Würstchen V2 operates on a 24×24×16 latent. Far fewer spatial positions, albeit with more channels, translate into a significant speed boost.
- Stage B: a diffusion model with 600 million parameters, responsible for decompressing the latent from 24×24 to a resolution of 128×128.
- Stage A: completing the process, a decoder with 20 million parameters that transforms the latent code into the rendered image.
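A back-of-the-envelope comparison of the two latent sizes mentioned above shows why the compressed space is so much cheaper to work in:

```python
# Compare the latent tensor sizes from the text: SDXL's 128x128x4
# versus Würstchen V2's 24x24x16.
sdxl_elems = 128 * 128 * 4       # 65,536 latent values
wuerstchen_elems = 24 * 24 * 16  # 9,216 latent values

spatial_ratio = (128 * 128) / (24 * 24)      # ~28.4x fewer positions
total_ratio = sdxl_elems / wuerstchen_elems  # ~7.1x fewer values

print(f"{spatial_ratio:.1f}x fewer spatial positions, "
      f"{total_ratio:.1f}x fewer latent values overall")
```

Since self-attention cost grows quadratically with the number of spatial positions, the roughly 28x reduction in positions accounts for much of the speed advantage.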
The practical benefit that immediately stands out is Würstchen V2's remarkable speed: it runs 2-2.5 times faster than SDXL, a noteworthy advancement in the field of AI image generation.
As with any technological innovation, there may be trade-offs. In terms of image quality, some experts suggest a slight loss, although a comprehensive and honest comparison is still awaited to provide concrete evidence.
Generated text-to-image examples are below: