Auto-Caching: Pushing the Limits of Caching for FLUX
Mar 19, 2025

Nils Fleischmann
ML Research Engineer

Bertrand Charpentier
Cofounder, President & Chief Scientist
There exist many compression methods to accelerate image generation with models like FLUX. These methods often suffer from one of two problems: (1) the compressed model does not deliver meaningful speed gains, or (2) the images generated by the compressed model deviate in quality from the original. Unfortunately, the methods that yield the highest speed gains tend to produce outputs that deviate most from those of the original model. Because the quality of image-generation models is inherently difficult to evaluate, one practical solution is to check that the images produced by the compressed model closely match the original outputs.
With this goal in mind, we introduce Auto Caching:
High speed-up at no cost: it decreases the latency of FLUX.1[dev] by a factor of up to 3, while the resulting images remain virtually unchanged to the human eye.
SOTA Performance for FLUX: it outperforms existing caching methods such as TeaCache and FORA for FLUX.1[dev].
Simple to use: it has a single parameter that directly controls the image-generation latency of the compressed model. With the Pruna package, you can conveniently use it in fewer than 10 lines of code.

How does Auto Caching work?
Diffusion transformer (DiT) models generate images by starting with pure noise and gradually removing it over multiple inference steps until the final image emerges. At each step, a blurry image is fed into a transformer, which predicts the noise that should be subtracted from the image. Because the generation of a single image involves several passes through the transformer, models like FLUX.1[dev] can take up to half a minute per image.
Recent papers have shown that consecutive passes through the transformer share many similarities. In particular, the outputs of expensive operations within the transformer tend to remain almost the same from one inference step to the next. This finding motivates the use of caching: if these outputs only differ slightly, we can compute them once and reuse them in subsequent steps.
FORA proposes a simple but effective caching mechanism: it recomputes the expensive operations every n steps and reuses the previously computed outputs for the steps in between. At Pruna, we really like FORA. In fact, we offer a refined implementation (called Flux Caching) as part of our library. However, we asked ourselves whether we could do better than FORA with a less regular caching schedule.
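To make the idea concrete, here is a heavily simplified sketch of what interval ("every n steps") caching looks like inside a denoising loop. All names are placeholders; this is not FORA's or Pruna's actual implementation, and for brevity it caches the transformer's full output, whereas FORA caches the attention and MLP outputs inside each block.

```python
# Heavily simplified sketch of interval ("every n steps") caching in a denoising loop.
# Placeholder names throughout; not FORA's or Pruna's actual implementation.

def denoise_with_interval_caching(latents, timesteps, transformer, interval=2):
    cached_output = None
    for step, t in enumerate(timesteps):
        if step % interval == 0 or cached_output is None:
            # Expensive: full pass through the transformer.
            cached_output = transformer(latents, t)
        # Cheap: reuse the cached prediction for the in-between steps.
        noise_pred = cached_output
        latents = latents - noise_pred  # stand-in for the real scheduler update
    return latents
```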
This question led to the development of Auto Caching, which caches steps in irregular patterns. Given a desired latency, it automatically determines the optimal steps at which to reuse the cached outputs. The latency is controlled with the parameter speed_factor, which takes values between 0 and 1. For a given speed_factor, the latency of Auto Caching is approximately speed_factor × latency of the base model.
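As a rough illustration of what this parameter means in practice, here is a back-of-the-envelope sketch (our own illustration, not Pruna's internal logic): since the latency is dominated by the transformer passes, a given speed_factor roughly translates into speed_factor × num_inference_steps full passes.

```python
# Illustrative back-of-the-envelope estimate, not Pruna's internal logic.

def estimate_budget(speed_factor: float, num_inference_steps: int, base_latency_s: float):
    full_passes = round(speed_factor * num_inference_steps)
    expected_latency_s = speed_factor * base_latency_s
    return full_passes, expected_latency_s

# Example: a 50-step FLUX.1[dev] run that takes ~30 s without caching.
print(estimate_budget(speed_factor=0.5, num_inference_steps=50, base_latency_s=30.0))
# -> (25, 15.0): roughly 25 full transformer passes and ~15 s of expected latency
```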

How well does Auto Caching perform?
As mentioned earlier, our objective is to generate images that closely resemble those produced by the original model. To evaluate how effectively Auto Caching maintains image quality, we compare the images of our compressed model with the original images using several metrics:
LPIPS compares images within a learned feature space, capturing high-level perceptual differences that closely mirror human vision. Lower scores reflect a higher similarity, with 0.0 being the perfect score. Paper | Code
SSIM assesses the similarity between two images by comparing their luminance, contrast, and structural information. Higher scores are better, with 1.0 being the optimal score. Wikipedia
PSNR compares two images pixel by pixel. Higher values reflect closer similarity. Wikipedia
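As a sketch of how such pairwise comparisons can be computed, the snippet below uses torchmetrics on two images assumed to be loaded as (1, 3, H, W) tensors with values in [0, 1]; it is illustrative and not our exact evaluation code.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Illustrative only: random tensors stand in for the original and compressed-model images,
# both with shape (1, 3, H, W) and values in [0, 1].
original = torch.rand(1, 3, 1024, 1024)
compressed = torch.rand(1, 3, 1024, 1024)

lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
psnr = PeakSignalNoiseRatio(data_range=1.0)

print("LPIPS:", lpips(compressed, original).item())  # lower is better, 0.0 is a perfect match
print("SSIM :", ssim(compressed, original).item())   # higher is better, 1.0 is a perfect match
print("PSNR :", psnr(compressed, original).item())   # higher is better, measured in dB
```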
We compute these metrics on a 65-prompt subset of the PartiPrompts (P2) dataset by Google. We consider the FLUX.1[dev] model with the common choices of 15, 28, and 50 inference steps.
We compare the following caching algorithms:
Auto Caching with speed_factor=[0.95, 0.9, ..., 0.15].
TeaCache with l1_threshold=[0.1, 0.2, ..., 1.0]. Paper | Code | Website
Although the parameters are named differently, they all effectively control the tradeoff between quality and latency for the respective caching method.



Key Takeaways:
Across all settings, Auto Caching outperforms the other caching algorithms, with the largest performance gap observed at 50 inference steps.
Every caching algorithm exhibits a tradeoff between latency and quality. Auto Caching makes it easy to control this tradeoff. Unlike FORA, it also allows for less aggressive caching, resulting in nearly perfect quality.
How to use Auto Caching?
If we got you interested in trying out this new caching algorithm with your own prompts, we have great news: you can do it in fewer than 10 lines of code. As a premium user of Pruna, you can use the following code snippet to play around with Auto Caching:
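Below is a minimal sketch of how this could look with the Pruna package. The cacher name and the speed_factor key are assumptions on our part, so please double-check the exact identifiers in the Pruna documentation.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash  # premium users: the import may come from the pruna_pro package instead

# Load the base FLUX.1[dev] pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Configure Auto Caching. The cacher name and parameter key below are assumptions;
# check the Pruna documentation for the exact identifiers.
smash_config = SmashConfig()
smash_config["cacher"] = "auto"
smash_config["auto_speed_factor"] = 0.5  # target roughly 0.5x the latency of the base model

# Compress the pipeline and generate an image as usual.
pipe = smash(model=pipe, smash_config=smash_config)
image = pipe("a cat holding a sign that says hello world", num_inference_steps=28).images[0]
image.save("flux_auto_caching.png")
```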
Conclusion
Reducing the latency of image-generation models does not necessarily compromise output quality. With Auto Caching, we demonstrate that FLUX can be accelerated by a factor of three while preserving image fidelity. Our new algorithm pushes the boundaries of caching methods on FLUX, and not only for FLUX: Auto Caching is available for nearly all diffusers pipelines, including text-to-video models such as Tencent's HunyuanVideo. We look forward to sharing further benchmarks and updates as we continue to push caching forward. Stay tuned!


