Auto-Caching: Pushing the Limits of Caching for FLUX
Mar 19, 2025

Nils Fleischmann
ML Research Engineer

Bertrand Charpentier
Cofounder, President & Chief Scientist
There exist many compression methods to accelerate image generation with models like FLUX. These methods often suffer from one of two problems: (1) the compressed model does not deliver meaningful speed gains, or (2) the images generated by the compressed model deviate in quality from the original. Unfortunately, the methods that yield the highest speed gains tend to produce outputs that deviate most from those of the original model. Because the quality of image-generation models is inherently difficult to evaluate, one practical solution is to check that the images produced by the compressed model closely match the original outputs.
With this goal in mind, we introduce Auto Caching:
High speed-up at no cost: it decreases the latency of FLUX.1[dev] by a factor of up to 3, while the resulting images remain virtually unchanged to the human eye.
SOTA Performance for FLUX: it outperforms existing caching methods such as TeaCache and FORA for FLUX.1[dev].
Simple to use: it has a single parameter that directly controls the image-generation latency of the compressed model. With the Pruna package, you can conveniently use it in fewer than 10 lines of code.

How does Auto Caching work?
Diffusion transformer (DiT) models generate images by starting with pure noise and gradually removing it over multiple inference steps until the final image emerges. At each step, a blurry image is fed into a transformer, which predicts the noise that should be subtracted from the image. Because the generation of a single image involves several passes through the transformer, models like FLUX.1[dev] can take up to half a minute per image.
Recent papers have shown that consecutive passes through the transformer share many similarities. In particular, the outputs of expensive operations within the transformer tend to remain almost the same from one inference step to the next. This finding motivates the use of caching: if these outputs only differ slightly, we can compute them once and reuse them in subsequent steps.
FORA proposes a simple but effective caching mechanism: it recomputes the expensive operations every n steps and reuses the previously computed outputs for the steps in between. At Pruna, we really like FORA. In fact, we offer a refined implementation (called Flux Caching) as part of our library. However, we asked ourselves whether we could do better than FORA with a less regular caching schedule.
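To make the idea concrete, here is a heavily simplified sketch of what interval ("every n steps") caching looks like inside a denoising loop. All names are placeholders; this is not FORA's or Pruna's actual implementation, and for brevity it caches the transformer's full output, whereas FORA caches the attention and MLP outputs inside each block.

```python
# Heavily simplified sketch of interval ("every n steps") caching in a denoising loop.
# Placeholder names throughout; not FORA's or Pruna's actual implementation.

def denoise_with_interval_caching(latents, timesteps, transformer, interval=2):
    cached_output = None
    for step, t in enumerate(timesteps):
        if step % interval == 0 or cached_output is None:
            # Expensive: full pass through the transformer.
            cached_output = transformer(latents, t)
        # Cheap: reuse the cached prediction for the in-between steps.
        noise_pred = cached_output
        latents = latents - noise_pred  # stand-in for the real scheduler update
    return latents
```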
This question led to the development of Auto Caching, which caches steps in irregular patterns. Given a desired latency, it automatically determines the optimal steps at which to reuse the cached outputs. The latency is controlled with the parameter speed_factor, which takes values between 0 and 1. For a given speed_factor, the latency of Auto Caching is approximately speed_factor × latency of the base model.
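As a rough illustration of what this parameter means in practice, here is a back-of-the-envelope sketch (our own illustration, not Pruna's internal logic): since the latency is dominated by the transformer passes, a given speed_factor roughly translates into speed_factor × num_inference_steps full passes.

```python
# Illustrative back-of-the-envelope estimate, not Pruna's internal logic.

def estimate_budget(speed_factor: float, num_inference_steps: int, base_latency_s: float):
    full_passes = round(speed_factor * num_inference_steps)
    expected_latency_s = speed_factor * base_latency_s
    return full_passes, expected_latency_s

# Example: a 50-step FLUX.1[dev] run that takes ~30 s without caching.
print(estimate_budget(speed_factor=0.5, num_inference_steps=50, base_latency_s=30.0))
# -> (25, 15.0): roughly 25 full transformer passes and ~15 s of expected latency
```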

How well does Auto Caching perform?
As mentioned earlier, our objective is to generate images that closely resemble those produced by the original model. To evaluate how effectively Auto Caching maintains image quality, we compare the images of our compressed model with the original images using several metrics:
LPIPS compares images within a learned feature space, capturing high-level perceptual differences that closely mirror human vision. Lower scores reflect a higher similarity, with 0.0 being the perfect score. Paper | Code
SSIM assesses the similarity between two images by comparing their luminance, contrast, and structural information. Higher scores are better, with 1.0 being the optimal score. Wikipedia
PSNR compares two images pixel by pixel. Higher values reflect closer similarity. Wikipedia
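As a sketch of how such pairwise comparisons can be computed, the snippet below uses torchmetrics on two images assumed to be loaded as (1, 3, H, W) tensors with values in [0, 1]; it is illustrative and not our exact evaluation code.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Illustrative only: random tensors stand in for the original and compressed-model images,
# both with shape (1, 3, H, W) and values in [0, 1].
original = torch.rand(1, 3, 1024, 1024)
compressed = torch.rand(1, 3, 1024, 1024)

lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
psnr = PeakSignalNoiseRatio(data_range=1.0)

print("LPIPS:", lpips(compressed, original).item())  # lower is better, 0.0 is a perfect match
print("SSIM :", ssim(compressed, original).item())   # higher is better, 1.0 is a perfect match
print("PSNR :", psnr(compressed, original).item())   # higher is better, measured in dB
```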
We compute these metrics on a 65-prompt subset of the PartiPrompts (P2) dataset by Google. We consider the FLUX.1[dev] model with the common choices of 15, 28, and 50 inference steps.
We compare the following caching algorithms:
Auto Caching with speed_factor=[0.95, 0.9, ..., 0.15].
TeaCache with l1_threshold=[0.1, 0.2, ..., 1.0]. Paper | Code | Website
Although the parameters are named differently, they all effectively control the tradeoff between quality and latency for the respective caching method.



Key Takeaways:
Across all settings, Auto Caching outperforms the other caching algorithms, with the largest performance gap observed at 50 inference steps.
Every caching algorithm exhibits a tradeoff between latency and quality. Auto Caching makes it easy to control this tradeoff. Unlike FORA, it also allows for less aggressive caching, resulting in nearly perfect quality.
How to use Auto Caching?
If we got you interested in trying out this new caching algorithm with your own prompts, we have great news: you can do it in fewer than 10 lines of code. As a premium user of Pruna, you can use the following code snippet to play around with Auto Caching:
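Below is a minimal sketch of how this could look with the Pruna package. The cacher name and the speed_factor key are assumptions on our part, so please double-check the exact identifiers in the Pruna documentation.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash  # premium users: the import may come from the pruna_pro package instead

# Load the base FLUX.1[dev] pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Configure Auto Caching. The cacher name and parameter key below are assumptions;
# check the Pruna documentation for the exact identifiers.
smash_config = SmashConfig()
smash_config["cacher"] = "auto"
smash_config["auto_speed_factor"] = 0.5  # target roughly 0.5x the latency of the base model

# Compress the pipeline and generate an image as usual.
pipe = smash(model=pipe, smash_config=smash_config)
image = pipe("a cat holding a sign that says hello world", num_inference_steps=28).images[0]
image.save("flux_auto_caching.png")
```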
Conclusion
Reducing the latency of image-generation models does not necessarily compromise output quality. With Auto Caching, we demonstrate that FLUX can be accelerated by a factor of three while preserving image fidelity. Our new algorithm pushes the boundaries of caching methods on FLUX, and not only for FLUX: Auto Caching is available for nearly all diffusers pipelines, including text-to-video models such as Tencent's HunyuanVideo. We look forward to sharing further benchmarks and updates as we continue to push caching forward. Stay tuned!


