
FLUX-Juiced: The Fastest Image Generation Endpoint (2.6x Faster!)

Apr 23, 2025

David Berenstein

ML & DevRel

John Rachwan

Cofounder & CTO

Nils Fleischmann

ML Research Engineer

Bertrand Charpentier

Cofounder, President & Chief Scientist

flux-juiced, the best FLUX optimization out there

Over the past few years, image generation models have become incredibly powerful, but also incredibly slow. Models like FLUX.1, with their massive architectures, regularly take 6+ seconds per image, even on cutting-edge GPUs like the H100. Widely applied optimization techniques can reduce inference time, but their impact on quality often remains unclear, so we decided to do two things.

  1. We built FLUX-juiced—the fastest FLUX.1 endpoint available today. And it’s now live on Replicate. 🚀

  2. We created InferBench by running a comprehensive benchmark comparing FLUX-juiced with the “FLUX.1 [dev]” endpoints offered by different inference providers like Replicate, Fal, Fireworks, and Together! It’s now live on Hugging Face 🤗.

⚡ What’s FLUX-juiced?

FLUX-juiced is our optimized version of FLUX.1, delivering up to 2.6x faster inference than the official Replicate API, without sacrificing image quality.

Under the hood, it uses a custom combination of:

  • Graph compilation for optimized execution paths

  • Inference-time caching for repeated operations

We won’t go deep into the internals here, but here’s the gist:

We combine compiler-level execution graph optimization with selective caching of heavy operations (like attention layers), allowing inference to skip redundant computations without any loss in fidelity.

These techniques are generalized and plug-and-play via the Pruna Pro pipeline, and they can be applied to nearly any diffusion-based image model, not just FLUX. You can also use our open-source solution for a free, but still very juicy, model.
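We won't reproduce the actual Pruna Pro internals here, but the caching idea is easy to illustrate. The toy `CachedBlock` below is invented for this sketch (it is not Pruna code): an expensive block reuses its previous output while its input drifts less than a tolerance, much like successive denoising steps produce similar intermediate activations.

```python
class CachedBlock:
    """Toy inference-time cache: wraps an expensive function and reuses
    the last output when the new input is close enough to the last one."""

    def __init__(self, fn, tol=0.035):
        self.fn = fn
        self.tol = tol
        self.last_in = None
        self.last_out = None
        self.calls = 0  # how often we actually recomputed

    def __call__(self, x):
        # Cache hit: the input barely changed, so skip the heavy compute.
        if self.last_in is not None and abs(x - self.last_in) < self.tol:
            return self.last_out
        # Cache miss: recompute and remember the result.
        self.calls += 1
        self.last_in, self.last_out = x, self.fn(x)
        return self.last_out


# Inputs drift slowly across ten "denoising steps"; only a few trigger
# an actual recomputation, the rest reuse the cached output.
heavy = CachedBlock(lambda x: x * x)
outputs = [heavy(0.50 + 0.01 * step) for step in range(10)]
```

In a real diffusion pipeline the "input" would be a tensor and the closeness test a learned or calibrated criterion, but the speed/fidelity trade-off works the same way: looser tolerances skip more work.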

🧪 Try FLUX-juiced now → replicate.com/prunaai/flux.1-juiced

📊 InferBench: Comparing Speed, Cost, and Quality

To back it up, we ran a comprehensive benchmark comparing FLUX-juiced with the “FLUX.1 [dev]” endpoints offered by:

  • Replicate

  • Fal

  • Fireworks

  • Together

All of these inference providers offer FLUX.1 [dev] implementations, but they don’t always communicate about the optimization methods used in the background, and most endpoints have different response times and performance measures.

For comparability, we used the same generation configuration and hardware across all providers:

  • 28 inference steps

  • 1024×1024 resolution

  • Guidance scale of 3.5

  • H100 GPU (80GB)—only reported by Replicate

Although we tested with this configuration and hardware, the applied compression methods work with other configurations and hardware too!
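For reference, those settings map directly onto the standard Diffusers text-to-image call signature. A minimal sketch (the commented pipeline call is illustrative, not our benchmark harness):

```python
# Generation settings used for every provider in the benchmark.
GEN_CONFIG = dict(
    num_inference_steps=28,  # 28 inference steps
    height=1024,             # 1024×1024 resolution
    width=1024,
    guidance_scale=3.5,      # guidance scale of 3.5
)

# Illustrative use with a Diffusers FluxPipeline (requires a GPU):
# image = pipe("a benchmark prompt", **GEN_CONFIG).images[0]
```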

The full results of this benchmark are published in the InferBench Space on Hugging Face 🤗; this post covers the key findings.

🗂️ Datasets & Benchmarks

To make a fair comparison, we decided to use datasets and benchmarks containing prompts, images, and human annotations designed to evaluate the capabilities of text-to-image generation models in a standardized way.

  • DrawBench: A comprehensive set of prompts designed to evaluate text-to-image models across various capabilities, including rendering colors, object counts, spatial relations, and scene text. (Dataset | Paper)

  • HPSv2: A large-scale dataset capturing human preferences on images from diverse sources, aimed at evaluating the alignment of generated images with human judgments. (GitHub | Paper)

  • GenAI-Bench: A benchmark designed to assess multimodal large language models’ ability to judge AI-generated content quality by comparing model evaluations with human preferences. (Dataset | GitHub | Website | Paper)

  • GenEval: An object-focused framework for evaluating text-to-image alignment, using existing object detection methods to produce fine-grained instance-level analysis. (GitHub | Paper)

  • PartiPrompts: A rich set of over 1,600 English prompts designed to measure model capabilities across various categories and challenge aspects. (Dataset | GitHub | Website | Paper)

📐 Metrics and scoring

Along with the datasets, we used various scoring measures to quantitatively and objectively assess the performance of text-to-image generation models across aspects such as image quality, relevance to the prompt, and alignment with human preferences.

  • ImageReward: A reward model trained on 137k human preference comparisons, serving as an automatic metric for assessing text-to-image synthesis quality. (Paper | GitHub)

  • VQA-Score (model='clip-flant5-xxl'): A vision-language generative model fine-tuned for image-text retrieval tasks, providing scores that reflect the alignment between images and textual descriptions. (GitHub | Paper)

  • CLIP-IQA: An image quality assessment metric based on the CLIP model, measuring visual quality via the cosine similarity between images and predefined prompts. (Paper | TorchMetrics)

  • CLIP-Score: A reference-free metric based on the CLIP model that evaluates how well generated images match their prompts, as well as the similarity between texts or images. (Paper | TorchMetrics)

  • CMMD: CLIP Maximum Mean Discrepancy, an evaluation metric for image generation models designed to mitigate the longstanding issues of FID. (GitHub | Paper)

  • ARNIQA: A no-reference image quality assessment metric, included in the TorchMetrics library, that predicts the technical quality of an image with high correlation to human judgments. (Paper | TorchMetrics)

  • Sharpness (Variance of Laplacian): Quantifies image sharpness by computing the variance of the Laplacian; higher variance indicates a sharper image. (Blogpost | Blogpost 2)
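The last metric is simple enough to sketch in a few lines. This is the common variance-of-Laplacian sharpness measure, here implemented with a plain NumPy valid convolution instead of an image library:

```python
import numpy as np

# 3x3 discrete Laplacian kernel.
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=np.float64)

def sharpness(gray: np.ndarray) -> float:
    """Variance of the Laplacian of a 2-D grayscale image.
    More high-frequency detail gives higher variance, i.e. a sharper image."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):          # 'valid' convolution, no padding
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return float(out.var())

# A checkerboard (lots of edges) scores far higher than a flat image,
# whose Laplacian is zero everywhere.
checker = (np.indices((64, 64)).sum(axis=0) % 2).astype(np.float64)
flat = np.full((64, 64), 0.5)
```

On real photos this is typically computed on the grayscale image after decoding; blurring an image lowers the score because it removes high-frequency content.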

🏆 Results

🕷️ Comparison overview

As discussed above, we evaluated the endpoints across various benchmarks and metrics. The figure below gives an overview of our findings.

FLUX-juiced clearly leads the pack! While it achieves quality comparable to the baseline, it is much more efficient: it generates 180 images per dollar and takes only 2.5 seconds per image, while the base model needs about 6 seconds.

🏎️ Speed Comparison

Most compression techniques trade off inference speed against quality. To see where the FLUX-juiced endpoints lie within this tradeoff, we plotted different quality dimensions against the measured inference speed.

The various FLUX-juiced versions form the Pareto front, meaning that no other API delivers the same quality at equal or lower latency, or better quality at the same latency.
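To make "forming the Pareto front" concrete, here is a minimal sketch of the dominance test behind such a plot, with made-up (latency, quality) pairs rather than the benchmark's actual numbers:

```python
def pareto_front(points):
    """Return the (latency, quality) points not dominated by any other.
    A point is dominated if some other point has latency <= and quality >=,
    with at least one of the two strictly better."""
    front = []
    for i, (lat, qual) in enumerate(points):
        dominated = any(
            (l2 <= lat and q2 >= qual) and (l2 < lat or q2 > qual)
            for j, (l2, q2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((lat, qual))
    return front

# Hypothetical endpoints: (seconds per image, quality score).
endpoints = [(2.5, 0.90), (6.0, 0.91), (4.0, 0.85), (7.0, 0.88)]
front = pareto_front(endpoints)  # the 2.5s and 6.0s points survive
```

An endpoint on the front is the best available choice for at least one speed/quality preference; every point off the front is strictly worse than some point on it.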

FLUX-extra-juiced only takes 2.5 seconds per image compared to the baseline’s 7 seconds, a speedup that becomes significant at scale. Generating 1 million images translates to an approximate saving of 18 hours in compute runtime.

💸 Cost Comparison

Generating images at scale using these APIs is not cheap: most require about $25,000 to produce 1M images. Therefore, we also consider quality versus cost.

The result? FLUX-juiced consistently sits on the Pareto front, delivering top-tier quality, at top-tier speed, for a top-tier price! When generating 1M images, you could be saving $20,000 in costs and tens of hours in compute duration.
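The dollar figure follows directly from the throughput numbers quoted above. A quick back-of-the-envelope check (these are the round numbers from the benchmark, not a price quote, and provider pricing varies):

```python
# Cost to generate 1M images, from the figures quoted in this post.
n_images = 1_000_000
juiced_cost = n_images / 180        # FLUX-juiced: ~180 images per dollar
baseline_cost = 25_000              # typical FLUX.1 [dev] endpoint: ~$25k per 1M
savings = baseline_cost - juiced_cost
print(f"~${savings:,.0f} saved per 1M images")  # roughly $19,444, i.e. ~$20,000
```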

🖼️ Side-by-Side Comparison

We’ve also created a website that compares the outputs of FLUX-juiced with those of the baseline across 600 prompts. Take a look for yourself.

🧃 Get a Juiced Version of Your Model

Using our Pruna Pro engine, we created FLUX-juiced with just 10 lines of code. You can do it, too, as this snippet works for almost every 🤗 Diffusers pipeline.

import torch
from diffusers import FluxPipeline
from pruna_pro import SmashConfig, smash

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["cacher"] = "taylor_auto"
# lightly juiced (0.6), juiced (0.5), extra juiced (0.4)
smash_config["taylor_auto_speed_factor"] = 0.4
smash_token = "<your-token>"
smashed_pipe = smash(
    model=pipe,
    token=smash_token,
    smash_config=smash_config,
)

smashed_pipe("A cute, round, knitted purple prune.").images[0]

⏭️ What’s Next?

We have already been releasing public models on Hugging Face, and we plan to release many more across inference providers like Replicate! We will not only focus on image generation but will likely tackle other modalities as well, implementing the latest and greatest optimization techniques!

Ready to make your own models go faster?!

Let us know what models you want to see juiced next. And if you’re building with diffusion models, we would love to hear from you.

Stay juicy! We know we will.

Subscribe to Pruna's Newsletter

Speed Up Your Models With Pruna AI.

Inefficient models drive up costs, slow down your productivity and increase carbon emissions. Make your AI more accessible and sustainable with Pruna.


© 2025 Pruna AI - Built with Pretzels & Croissants 🥨 🥐
