
Quantization for Image Generation Models to 3 bits: Shrinking Models, Keeping the Magic!

Mar 26, 2024

Louis Leconte
ML Research Engineer

Bertrand Charpentier
Cofounder, President & Chief Scientist

Quantization is a fundamental technique in machine learning (ML) that involves reducing the precision of numbers used to represent a model’s parameters and computations. By using lower precision (e.g., converting 32-bit floating-point numbers to 8-bit integers), quantization reduces the size of models and accelerates inference, making them more suitable for resource-constrained devices or large-scale applications.
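
To make the idea concrete, here is a minimal sketch of symmetric linear quantization of a weight tensor to 8-bit integers and back, in plain PyTorch. It illustrates the principle only; it is not the exact scheme used by the libraries discussed below.

import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric linear quantization: map the float range onto [-127, 127].
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original weights for computation.
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)     # fake weight matrix in float32 (4 bytes/param)
q, scale = quantize_int8(w)     # stored as int8 (1 byte/param), ~4x smaller
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())  # small reconstruction error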

Worried About Quality Loss? Think of JPEG Compression!

It's natural to worry about quality degradation after quantization, but in practice, quality is not a severe limitation. Consider how we share and store images today — most people use compressed formats like JPEG instead of RAW files. Why? Because for the human eye, the difference is barely noticeable, while the file size is drastically reduced.

Similarly, quantization for image generation models works the same way: it compresses model parameters while maintaining perceptual quality. Just like JPEG became the gold standard for sharing images, quantized models are becoming the standard for efficient AI deployment.

At Pruna, we specialize in frictionless quantization algorithms that compress diffusion models while maintaining human-level quality. That means faster inference, smaller models, and no trade-off in visual fidelity — so you can generate stunning images without worrying about losing their magic. 🚀🎨

In the world of image generation models, quantization plays a crucial role in reducing memory consumption while keeping the visual quality intact. Below is a comparison of images generated using different quantization techniques on the Flux.1 Lite 8B diffusion model from Freepik. ;)


|       | Original Image | HQQ Quantized | BitsandBytes Quantized | HIGGS Quantized | TorchAO Quantized |
|-------|----------------|---------------|------------------------|-----------------|-------------------|
| #bits | 16             | 4             | 4                      | 3               | 8                 |
| PSNR* | -              | 23.33         | 20.38                  | 21.18           | 33.19             |

*PSNR gives an intuition of how different two images are, but does not indicate the absolute quality of an image. In other words, two images can differ and both still be of high quality.
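
For reference, PSNR between two generated images can be computed in a few lines. This is a minimal sketch using NumPy and PIL; the file names are placeholders for the base and quantized outputs.

import numpy as np
from PIL import Image

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    # Peak signal-to-noise ratio in dB; higher means the images are more similar.
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

# Hypothetical file names for the base-model and quantized-model outputs.
a = np.asarray(Image.open("image_base.png").convert("RGB"))
b = np.asarray(Image.open("image_quantized.png").convert("RGB"))
print(f"PSNR: {psnr(a, b):.2f} dB")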

🔍 What We Empirically Found Out

Overall, while we can make the model roughly 4x smaller, the generated images still have very high quality, with only subtle differences from the images generated by the base model.

Not all weights are equal in a diffusion pipeline. Some weights are more sensitive to quantization and can induce larger degradations in image quality. While it is standard in the LLM community not to quantize the last layer (the “lm_head”), there is (yet) no gold standard for diffusion models. In our experiments we carefully selected some weights (always < 20% of the model’s weights) to keep in full precision, as sketched below. This preserves a good balance between memory savings and image generation quality!
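
As a rough illustration of how such sensitive weights could be identified, one can rank linear layers by the error a naive quantization would introduce and keep the most sensitive fraction in full precision. This is a generic sketch, not Pruna’s actual selection criterion; layer_sensitivity and select_fp16_layers are hypothetical helpers.

import torch

def layer_sensitivity(weight: torch.Tensor, bits: int = 4) -> float:
    # Relative error introduced by naive symmetric quantization of this layer.
    levels = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / levels
    w_hat = torch.clamp(torch.round(weight / scale), -levels, levels) * scale
    return ((weight - w_hat).norm() / weight.norm()).item()

def select_fp16_layers(model: torch.nn.Module, keep_fraction: float = 0.2):
    # Rank linear layers by sensitivity and keep the top fraction in full precision.
    scores = {
        name: layer_sensitivity(module.weight.detach())
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

# Hypothetical usage on the Flux transformer: the returned layer names would
# then be excluded from quantization by whichever backend is used.
# skip_layers = select_fp16_layers(pipeline.transformer)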

Now, let's explore four quantization techniques and how to use them with image generation models. 🎨✨

🚀HQQ (Half-Quadratic Quantization)

Fast, calibration-free, and highly efficient!

HQQ is a cutting-edge quantization method that removes the need for calibration data, making it significantly faster than traditional techniques. Initially designed for LLM quantization, HQQ is now available for diffuser models in Pruna, bringing its efficiency and speed to image generation. 🐘

🔹 Code example

import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained("Freepik/flux.1-lite-8B", torch_dtype=torch.float16).to("cuda")

from pruna_pro.smash import smash
from pruna.config.smash_config import SmashConfig

config = SmashConfig()
config['quantizer'] = 'hqq_diffusers'
config['hqq_diffusers_weight_bits'] = 4  # also 2, 8 available

smashed_pipeline = smash(pipeline, config)

with torch.inference_mode():
    image = smashed_pipeline(
        "a smiling cat dancing on a table. Miyazaki style",
        guidance_scale=3.5,
        num_inference_steps=50
    ).images[0]

🏋️‍♂️BitsandBytes

Popular for lightweight 4-bit & 8-bit quantization!

BitsandBytes is widely used for efficient low-bit quantization, drastically reducing memory footprint.

🔹 Code example

import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained("Freepik/flux.1-lite-8B", torch_dtype=torch.float16).to("cuda")

from pruna_pro.smash import smash
from pruna.config.smash_config import SmashConfig

config = SmashConfig()
config['quantizer'] = 'diffusers_int8'
config['diffusers_int8_weight_bits'] = 4  #8 is also available

smashed_pipeline = smash(pipeline, config)

with torch.inference_mode():
    image = smashed_pipeline(
        "a smiling cat dancing on a table. Miyazaki style",
        guidance_scale=3.5,
        num_inference_steps=50
    ).images[0]

⚡HIGGS (Hadamard Incoherence and Gaussian MSE-optimal GridS)

Optimized for batch inference!

HIGGS is an innovative data-free quantization method, specifically optimized for both single-image and batch inference. Leveraging vector quantization, it eliminates the need for calibration data, making it highly efficient for fast and scalable image generation.

🔹 Code example:

import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained("Freepik/flux.1-lite-8B", torch_dtype=torch.float16).to("cuda")

from pruna_pro.smash import smash
from pruna.config.smash_config import SmashConfig

config = SmashConfig()
config['quantizer'] = 'diffusers_higgs'
config['diffusers_higgs_weight_bits'] = 3  #2 or 4 also available 

smashed_pipeline = smash(pipeline, config)

with torch.inference_mode():
    image = smashed_pipeline(
        "a smiling cat dancing on a table. Miyazaki style",
        guidance_scale=3.5,
        num_inference_steps=50
    ).images[0]

🔥TorchAO (Torch Accelerated Optimization)

Native PyTorch quantization with advanced inference boosts!

TorchAO is a robust PyTorch library designed for quantization, sparsity, and model optimization. Pruna seamlessly integrates TorchAO’s AutoQuant feature for image generation models, enabling 8-bit quantization and significant inference speedups. 🚀

🔹 Code example:

import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained("Freepik/flux.1-lite-8B", torch_dtype=torch.float16).to("cuda")

from pruna_pro.smash import smash
from pruna.config.smash_config import SmashConfig

config = SmashConfig()
config['quantizer'] = 'torchao_autoquant'

smashed_pipeline = smash(pipeline, config)

with torch.inference_mode():
    image = smashed_pipeline(
        "a smiling cat dancing on a table. Miyazaki style",
        guidance_scale=3.5,
        num_inference_steps=50
    ).images[0]

🎯 Conclusion: Make Your Models Lighter, Greener, and Just as Magical

Quantization isn’t just a technical optimization—it’s a game-changer for image generation. With the right technique, you can shrink your models up to 4x, speed up inference, and still generate images with stunning, human-level quality. At Pruna, we make this transformation effortless with frictionless compression tools tailored for diffusion models.

Whether you're deploying on edge devices or scaling up in the cloud, our tools help you keep the magic of image generation alive—without the memory bottleneck. ✨🎨

🚀 Ready to unlock faster, greener image generation?

👉 Try Pruna OSS today for free, open-source quantization, and experience the future of efficient generative AI: speed, simplicity, and no compromise in quality.

👌Upgrade to Pruna Pro for advanced features, premium support, and the best combinations of compression algorithms.


© 2025 Pruna AI - Built with Pretzels & Croissants 🥨 🥐