FLUX-Juiced: The Fastest Image Generation Endpoint (2.6x Faster!)
Apr 23, 2025

David Berenstein
ML & DevRel

John Rachwan
Cofounder & CTO

Nils Fleischmann
ML Research Engineer

Bertrand Charpentier
Cofounder, President & Chief Scientist
Over the past few years, image generation models have become incredibly powerful, but also incredibly slow. Models like FLUX.1, with their massive architectures, regularly take 6+ seconds per image, even on cutting-edge GPUs like the H100. Widely applied optimization techniques can reduce inference time, but their impact on quality often remains unclear, so we decided to do two things.
We built FLUX-juiced—the fastest FLUX.1 endpoint available today. And it’s now live on Replicate. 🚀
We created InferBench by running a comprehensive benchmark comparing FLUX-juiced with the “FLUX.1 [dev]” endpoints offered by different inference providers like Replicate, Fal, Fireworks, and Together! It’s now live on Hugging Face 🤗.
⚡ What’s FLUX-juiced?
FLUX-juiced is our optimized version of FLUX.1, delivering up to 2.6x faster inference than the official Replicate API, without sacrificing image quality.
Under the hood, it uses a custom combination of:
Graph compilation for optimized execution paths
Inference-time caching for repeated operations
We won’t go deep into the internals here, but here’s the gist:
We combine compiler-level execution graph optimization with selective caching of heavy operations (like attention layers), allowing inference to skip redundant computations without any loss in fidelity.
These techniques are generalized and plug-and-play via the Pruna Pro pipeline, and they can be applied to nearly any diffusion-based image model, not just FLUX. You can also use our open-source solution to get a free, but still very juicy, model.
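To make the two ingredients a bit more concrete, here is a minimal sketch using plain PyTorch and 🤗 Diffusers. It only illustrates the generic `torch.compile` route and stands in for, rather than reproduces, Pruna's own compiler and cacher; the model id and settings are simply the public FLUX.1 [dev] defaults.

```python
# pip install diffusers torch  (and `pip install pruna` for the open-source optimizer)
import torch
from diffusers import FluxPipeline

# Load the public FLUX.1 [dev] pipeline
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Graph compilation: trace the denoiser once and let the compiler fuse kernels
# into an optimized execution graph (first call is slow, subsequent calls are fast).
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# Inference-time caching (DeepCache/TeaCache-style) would additionally reuse heavy
# intermediate results, e.g. attention outputs, across neighbouring denoising steps;
# Pruna Pro wires compilation and caching together behind a single call (see the
# snippet further below).
```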
🧪 Try FLUX-juiced now → replicate.com/prunaai/flux.1-juiced
📊 InferBench: Comparing Speed, Cost, and Quality
To back it up, we ran a comprehensive benchmark comparing FLUX-juiced with the “FLUX.1 [dev]” endpoints offered by:
Replicate (the official FLUX.1 [dev] endpoint, our baseline)
Fal
Fireworks AI: https://fireworks.ai/models/fireworks/flux-1-dev-fp8
Together AI: https://www.together.ai/models/flux-1-dev
All of these providers offer FLUX.1 [dev] implementations, but they don’t always disclose the optimization methods they apply behind the scenes, and their endpoints differ in response time and performance.
For comparison purposes, we used the same generation configuration and hardware across the different providers:
28 inference steps
1024×1024 resolution
Guidance scale of 3.5
H100 GPU (80GB)—only reported by Replicate
Although we tested with this specific configuration and hardware, the applied compression methods work with other configurations and hardware too!
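For reference, this is roughly what the benchmark configuration looks like as a plain 🤗 Diffusers call against the baseline FLUX.1 [dev] weights; the prompt and output file name are ours, not taken from the benchmark.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse on Mars",  # example prompt
    num_inference_steps=28,  # 28 inference steps
    height=1024,             # 1024x1024 resolution
    width=1024,
    guidance_scale=3.5,      # guidance scale of 3.5
).images[0]
image.save("flux_dev_sample.png")
```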
The full results of this benchmark are published in the InferBench Space on Hugging Face 🤗; this post walks you through the key findings.
🗂️ Datasets & Benchmarks
To make a fair comparison, we decided to use datasets and benchmarks containing prompts, images, and human annotations designed to evaluate the capabilities of text-to-image generation models in a standardized way.
| Name | Description |
|---|---|
| DrawBench | A comprehensive set of prompts designed to evaluate text-to-image models across various capabilities, including rendering colors, object counts, spatial relations, and scene text. |
| HPSv2 | A large-scale dataset capturing human preferences on images from diverse sources, aimed at evaluating the alignment of generated images with human judgments. |
| GenAI-Bench | A benchmark designed to assess multimodal large language models' ability to judge AI-generated content quality by comparing model evaluations with human preferences. |
| GenEval | An object-focused framework for evaluating text-to-image alignment using existing object detection methods to produce fine-grained instance-level analysis. |
| PartiPrompts | A rich set of over 1,600 English prompts designed to measure model capabilities across various categories and challenge aspects. |
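All of these prompt sets are easy to pull from the Hugging Face Hub. As an example, here is how one might load PartiPrompts via the community mirror `nateraw/parti-prompts`; the dataset id and column names are assumptions about that mirror, not necessarily the exact copies we used for InferBench.

```python
from datasets import load_dataset

# PartiPrompts: ~1,600 English prompts grouped by category and challenge aspect.
parti = load_dataset("nateraw/parti-prompts", split="train")

print(len(parti))            # roughly 1,600 prompts
print(parti[0]["Prompt"])    # prompt text (column name in this mirror)
print(parti[0]["Category"])  # e.g. "Animals", "Artifacts", ...
```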
📐 Metrics and scoring
Along with the datasets, we used various scoring measures to quantitatively and objectively assess the performance of text-to-image generation models across aspects such as image quality, relevance to the prompt, and alignment with human preferences.
| Name | Description |
|---|---|
| ImageReward | A reward model trained on 137k human preference comparisons, used as an automatic metric for assessing text-to-image synthesis quality. |
| VQA-Score (model='clip-flant5-xxl') | A vision-language generative model fine-tuned for image-text retrieval tasks, providing scores that reflect the alignment between images and textual descriptions. |
| CLIP-IQA | An image quality assessment metric based on the CLIP model, measuring visual content quality by calculating the cosine similarity between images and predefined prompts. |
| CLIP-Score | A reference-free metric based on the CLIP model that evaluates the correlation between generated images (or captions) and their text prompts. |
| CMMD | CLIP Maximum Mean Discrepancy, a metric for evaluating image generation models that aims to mitigate the longstanding issues of FID. |
| ARNIQA | A no-reference image quality assessment metric, available in the TorchMetrics library, that predicts the technical quality of an image with high correlation to human judgments. |
| Sharpness (Variance of Laplacian) | A method to quantify image sharpness by calculating the variance of the Laplacian; higher variance indicates a sharper image. |
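Most of these metrics are available off the shelf (ImageReward, TorchMetrics, and so on). The simplest one, sharpness, takes only a few lines; the snippet below is a minimal sketch using OpenCV and is not the exact scoring code behind InferBench.

```python
import cv2

def sharpness_score(image_path: str) -> float:
    """Sharpness as the variance of the Laplacian: higher variance = sharper image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

print(sharpness_score("flux_dev_sample.png"))  # image generated in the earlier sketch
```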
🏆 Results
🕷️ Comparison overview
As discussed above, we evaluated the endpoints across various benchmarks and metrics. The figure below gives an overview of our findings.

FLUX-juiced clearly leads the pack! While it achieves comparable quality, it is much more efficient: it generates about 180 images per dollar and takes only 2.5 seconds per image, while the base model needs around 6 seconds.
🏎️ Speed Comparison
Most compression techniques trade off inference speed against quality. To find out where the FLUX-juiced endpoint lies within this tradeoff, we plotted different quality dimensions against the measured inference speed.

The various FLUX-juiced versions form the Pareto front, meaning that, at any given quality level, no other API delivers the same or lower latency.
FLUX-extra-juiced only takes 2.5 seconds per image compared to the baseline’s 7 seconds, a speedup that becomes significant at scale. Generating 1 million images translates to an approximate saving of 18 hours in compute runtime.
💸 Cost Comparison
Generating images at scale using these APIs is not cheap: most require about $25,000 to produce 1M images. Therefore, we also considered quality versus cost.

The result? FLUX-juiced consistently sits on the Pareto front, delivering top-tier quality at top-tier speed for a top-tier price! When generating 1M images, you could be saving $20,000 in costs and tens of hours in compute time.
🖼️ Side-by-Side Comparison
We’ve also created a website that compares the outputs of FLUX-juiced with those of the baseline across 600 prompts. Take a look for yourself.

🧃 Get a Juiced Version of Your Model
Using our Pruna Pro engine, we created FLUX-juiced with just 10 lines of code. You can do it too, as the snippet works for almost every 🤗 Diffusers pipeline.
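The sketch below shows what that workflow looks like with the open-source `pruna` package; Pruna Pro exposes the same `SmashConfig`/`smash` interface plus the proprietary juicing algorithms. The specific compiler and cacher names here are illustrative and version-dependent, so check the Pruna documentation for the options available for your model.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash  # pip install pruna

# 1. Load any 🤗 Diffusers pipeline
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# 2. Pick the optimizations to apply (algorithm names are illustrative)
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["cacher"] = "deepcache"

# 3. Smash: returns an optimized, drop-in replacement for the original pipeline
smashed_pipe = smash(model=pipe, smash_config=smash_config)

image = smashed_pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("flux_juiced_sample.png")
```

Because the smashed pipeline is a drop-in replacement, the same generation settings from the benchmark section apply unchanged.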
⏭️ What’s Next?
We have already been releasing public models on Hugging Face, and we plan to release many more across inference providers like Replicate! We will not only focus on image generation but will likely tackle other modalities as well, implementing the latest and greatest optimization techniques!
Ready to make your own models go faster?!
🔧 Install Pruna SDK and optimize your own models
Let us know what models you want to see juiced next. And if you’re building with diffusion models, we would love to hear from you.
Stay juicy! We know we will.


