Quantizing Image Generation Models to 3 Bits: Shrinking Models, Keeping the Magic!
Mar 26, 2024

Louis Leconte
ML Research Engineer

Bertrand Charpentier
Cofounder, President & Chief Scientist
Quantization is a fundamental technique in machine learning (ML) that involves reducing the precision of numbers used to represent a model’s parameters and computations. By using lower precision (e.g., converting 32-bit floating-point numbers to 8-bit integers), quantization reduces the size of models and accelerates inference, making them more suitable for resource-constrained devices or large-scale applications.
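As a toy illustration of the idea (deliberately simpler than any of the methods below), here is what symmetric 8-bit weight quantization looks like in PyTorch:

```python
import torch

# Toy symmetric 8-bit quantization of a weight tensor.
w = torch.randn(4, 4)                     # full-precision (float32) weights
scale = w.abs().max() / 127               # one scale shared by the whole tensor
w_q = (w / scale).round().clamp(-127, 127).to(torch.int8)  # 8-bit storage
w_hat = w_q.float() * scale               # dequantized approximation
print((w - w_hat).abs().max())            # round-off error stays small
```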

✨ Worried About Quality Loss? Think of JPEG Compression!
It's natural to worry about quality degradation after quantization, but in practice, quality is not a severe limitation. Consider how we share and store images today — most people use compressed formats like JPEG instead of RAW files. Why? Because for the human eye, the difference is barely noticeable, while the file size is drastically reduced.
Similarly, quantization for image generation models works the same way: it compresses model parameters while maintaining perceptual quality. Just like JPEG became the gold standard for sharing images, quantized models are becoming the standard for efficient AI deployment.
At Pruna, we specialize in frictionless quantization algorithms that compress diffusion models while maintaining human-level quality. That means faster inference, smaller models, and no trade-off in visual fidelity — so you can generate stunning images without worrying about losing their magic. 🚀🎨
In the world of image generation models, quantization plays a crucial role in reducing memory consumption while keeping the visual quality intact. Below is a comparison of images generated with different quantization techniques applied to the Flux-8B diffuser model from Freepik.
| | Original Image | HQQ Quantized | BitsandBytes Quantized | HIGGS Quantized | TorchAO Quantized |
|---|---|---|---|---|---|
| #bits | 16 | 4 | 4 | 3 | 8 |
| PSNR* (dB) | - | 23.33 | 20.38 | 21.18 | 33.19 |

*PSNR gives an intuition of how much two images differ, but it does not measure the absolute quality of either image. In other words, two images can differ from each other while both being high quality.
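For reference, PSNR is computed from the mean squared error between two images; a minimal version for 8-bit images:

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two same-shaped images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(peak**2 / mse))
```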
🔍 What We Empirically Found Out
Overall, while we can make the model ~4x smaller, the generated images still have very high quality, sometimes with only subtle differences from the image generated by the base model.
Not all weights are equal in a diffusion pipeline. Some weights are more sensitive to quantization and can induce larger drops in image quality. While it is standard in the LLM community not to quantize the last layer (i.e., the "lm_head" layer), there is (yet) no gold standard for diffusion models. In our experiments, we carefully selected some weights (always < 20% of the total model weights) to keep in full precision, as sketched below. This way, we keep a good balance between memory savings and image generation quality!
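The following is a hypothetical illustration of such selective quantization, not Pruna's actual selection logic: it uses TorchAO's `quantize_` with a `filter_fn`, and the "sensitive" layer name is made up for the example.

```python
import torch
from torch import nn
from torchao.quantization import quantize_, int8_weight_only

class Block(nn.Module):
    """Stand-in for one transformer block of a diffusion model."""
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(64, 64)      # fine to quantize
        self.proj_out = nn.Linear(64, 64)  # assumed sensitive: keep full precision

    def forward(self, x):
        return self.proj_out(self.attn(x))

model = Block()

# Quantize every Linear layer except those we flag as sensitive.
def filter_fn(module: nn.Module, fqn: str) -> bool:
    return isinstance(module, nn.Linear) and "proj_out" not in fqn

quantize_(model, int8_weight_only(), filter_fn=filter_fn)
```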
Now, let's explore four quantization techniques and how to use them with image generation models. 🎨✨
🚀HQQ (Half-Quadratic Quantization)
Fast, calibration-free, and highly efficient!
HQQ is a cutting-edge quantization method that removes the need for calibration data, making it significantly faster than traditional techniques. Initially designed for LLM quantization, HQQ is now available for diffuser models in Pruna, bringing its efficiency and speed to image generation. 🐘
🔹 Code example
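A minimal sketch using Pruna's `SmashConfig`/`smash` API. The quantizer identifier `hqq_diffusers` and the `Freepik/flux.1-lite-8B-alpha` checkpoint id are assumptions here; check the Pruna documentation for the exact names and the keys controlling bit width.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

# Load Freepik's Flux-8B distillation (checkpoint id assumed).
pipe = FluxPipeline.from_pretrained(
    "Freepik/flux.1-lite-8B-alpha", torch_dtype=torch.bfloat16
).to("cuda")

# HQQ needs no calibration data: declaring the quantizer is enough.
smash_config = SmashConfig()
smash_config["quantizer"] = "hqq_diffusers"  # assumed identifier

# Compress the pipeline and generate as usual.
pipe = smash(model=pipe, smash_config=smash_config)
image = pipe("a wizard's library at golden hour").images[0]
image.save("flux_hqq.png")
```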
🏋️‍♂️BitsandBytes
Popular for lightweight 4-bit & 8-bit quantization!
BitsandBytes is widely used for efficient low-bit quantization, drastically reducing memory footprint.
🔹 Code example
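The pattern is the same; only the quantizer changes. The identifier below is an assumption for Pruna's bitsandbytes-backed quantizer; the docs list the exact name and the options selecting 4-bit vs. 8-bit weights.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

pipe = FluxPipeline.from_pretrained(
    "Freepik/flux.1-lite-8B-alpha", torch_dtype=torch.bfloat16  # id assumed
).to("cuda")

# bitsandbytes-backed low-bit weight quantization.
smash_config = SmashConfig()
smash_config["quantizer"] = "diffusers_int8"  # assumed identifier

pipe = smash(model=pipe, smash_config=smash_config)
image = pipe("a paper crane on a rainy windowsill").images[0]
image.save("flux_bnb.png")
```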
⚡HIGGS (Hadamard Incoherence with Gaussian MSE-optimal GridS)
Optimized for batch inference!
HIGGS is an innovative data-free quantization method, specifically optimized for both single-image and batch inference. Leveraging vector quantization, it eliminates the need for calibration data, making it highly efficient for fast and scalable image generation.
🔹 Code example:
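Again a sketch under the same assumptions (quantizer identifier and checkpoint id unverified); since HIGGS is data-free, no calibration set appears anywhere.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

pipe = FluxPipeline.from_pretrained(
    "Freepik/flux.1-lite-8B-alpha", torch_dtype=torch.bfloat16  # id assumed
).to("cuda")

# HIGGS: data-free vector quantization, tuned for batched generation.
smash_config = SmashConfig()
smash_config["quantizer"] = "higgs"  # assumed identifier

pipe = smash(model=pipe, smash_config=smash_config)

# Batch inference is where HIGGS shines: several prompts in one call.
images = pipe(["a koi pond in the rain"] * 4).images
```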
🔥TorchAO (Torch Accelerated Optimization)
Native PyTorch quantization with advanced inference boosts!
TorchAO is a robust PyTorch library for quantization, sparsity, and model optimization. Pruna seamlessly integrates TorchAO's AutoQuant feature for image generation models, enabling 8-bit quantization and significant inference speed-ups. 🚀
🔹 Code example:
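A final sketch with the same caveats on names; here the assumed `torchao` quantizer would drive TorchAO's AutoQuant to pick an 8-bit scheme per layer.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

pipe = FluxPipeline.from_pretrained(
    "Freepik/flux.1-lite-8B-alpha", torch_dtype=torch.bfloat16  # id assumed
).to("cuda")

# TorchAO AutoQuant: benchmarks candidate 8-bit kernels per layer.
smash_config = SmashConfig()
smash_config["quantizer"] = "torchao"  # assumed identifier

pipe = smash(model=pipe, smash_config=smash_config)
image = pipe("a lighthouse in a storm, oil painting").images[0]
image.save("flux_torchao.png")
```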
🎯 Conclusion: Make Your Models Lighter, Greener, and Just as Magical
Quantization isn't just a technical optimization: it's a game-changer for image generation. With the right technique, you can shrink your models by up to 4x, speed up inference, and still generate images with stunning, human-level quality. At Pruna, we make this transformation effortless with frictionless compression tools tailored for diffusion models.
Whether you're deploying on edge devices or scaling up in the cloud, our tools help you keep the magic of image generation alive, without the memory bottleneck. ✨🎨
🚀 Ready to unlock faster, greener image generation?
👉 Try Pruna OSS today for free, open-source quantization, and experience the future of efficient generative AI: speed, simplicity, and no compromise in quality.
👌Upgrade to Pruna Pro for advanced features, premium support, and the best combinations of compression algorithms.


