Case Study
Solving Evaluation for 15 AI Models at a German Automotive Giant
Jan 13, 2025
Quentin Sinig
Go-To-Market Lead
A Benchmarking Challenge with High Stakes
For several months, Pruna AI partnered with a major German car manufacturer to deliver an in-depth performance evaluation of multiple AI models. Our product was positively received, but before committing to a multi-year contract, they wanted to battle-test our expertise through an advanced benchmarking mission.
And that's something we're used to. As we've mentioned in previous blog posts, it's the number one request we receive. For large enterprises, however, the scale is completely different: this is far more than a mere PoC.
Evaluation is a complex topic: numerous variables have to be accounted for before it becomes a repeatable and scalable process. Many of the companies we spoke with, whether SMBs or enterprises, acknowledged that they're in a weak position when it comes to evaluation.
This article provides a synthesized overview of the benefits and gains from using our inference optimization and various method combinations across multiple model types. While many metrics were benchmarked, not all are included here. We’ve prioritized a few key metrics for clarity and relevance.
Disclaimer: This case study is subject to a strict NDA. While we all agree that stories are better with names, we weren’t able to reveal the customer’s identity. All results shared here were obtained in our own environments using open-source models and public datasets. A special shoutout to our ML team for the incredible effort they put into this implementation.
The Main Challenges with Evaluation
"Do you have benchmarks?" — It’s the question we hear most often at Pruna. But the truth is, model evaluation lacks a universal standard. Every team seems to approach it differently, which makes consistent benchmarking a challenge.
Performance evaluation often begins with a simple question: does the base model replicate the accuracy reported in the original paper? From there, the complexities multiply, starting with alignment on what “inference time” actually means. Is it end-to-end generation time? Are you including warm-up time? And what about encoder/decoder—are they accounted for?
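To make that concrete, here is a minimal timing sketch (illustrative only, assuming PyTorch and a CUDA-capable setup) of the convention we usually align on: warm-up runs are discarded, asynchronous GPU work is synchronized before the clock stops, and the measured span covers the full end-to-end call.

```python
import time
import torch

def benchmark_latency(model_fn, inputs, warmup=5, iters=50):
    """Time end-to-end calls, excluding warm-up and forcing GPU sync."""
    # Warm-up runs: trigger lazy initialization, kernel autotuning, caching.
    for _ in range(warmup):
        model_fn(inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure all warm-up work has finished

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        model_fn(inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for asynchronous GPU kernels
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```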
Equally important is ensuring the evaluation environment is consistent. Details like GPU configurations, input preprocessing, and even Python package versions can impact results and must be aligned across tests.
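A small habit that helps here is recording an environment fingerprint alongside every result. A sketch of what that could look like (the field names and package list are our own assumptions, not the customer's setup):

```python
import json
import platform
from importlib import metadata

import torch

def environment_fingerprint(packages=("torch", "transformers", "numpy")):
    """Record the details that must match across benchmark runs."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
        "cuda": torch.version.cuda,
        "packages": {p: metadata.version(p) for p in packages},
    }
    return json.dumps(info, indent=2)
```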
Finally, there’s the issue of the metrics themselves. Sometimes, even commonly used metrics don’t have unanimous scientific agreement on how they should be calculated.
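Word Error Rate, which appears later in this post, is a good illustration: the score depends heavily on how reference and hypothesis are normalized before scoring. A toy example (using the open-source jiwer package; the sentences are made up):

```python
from jiwer import wer  # pip install jiwer

reference = "Hello, the car's ETA is 5 p.m."
hypothesis = "hello the cars eta is 5 pm"

# With default settings, punctuation and casing differences count as word errors.
print(wer(reference, hypothesis))

# After a simple normalization, the same hypothesis scores perfectly.
def normalize(text):
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace())

print(wer(normalize(reference), normalize(hypothesis)))
```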
Now, add in the complexity of testing across multiple target hardware, plus the need to define a clear threshold for when further optimization is unnecessary… and you've got yourself a process that can easily spiral into chaos. That's why even large, experienced teams often struggle to get evaluation right. It's hard to set up, harder to repeat, and even harder to maintain over time.
And here’s a surprising twist: sometimes, you don’t even need to dive into inference optimization to see massive gains. Some open-source projects suffer from terrible code quality. Just cleaning up and refactoring the code can lead to significant improvements—no fancy tricks required.
Take compiling models, for example. Compilation requires access to the computation graph, but in the case of the monocular depth model we worked on, the graph was broken due to older practices at the time the code was written. On top of that, the codebase was unfamiliar to us, which made things even trickier. We had two options:
1. Integrate our work into the massive, tangled structure that we didn't fully understand, or
2. Extract and isolate the relevant pieces of code, such as data preprocessing, postprocessing, and metrics, and build on that.
We went with option two. Why? It gave us the flexibility we needed and made handling the benchmark tasks far more efficient. No extra baggage, just clean and streamlined execution.
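For readers wondering what a "broken graph" means in practice, here is an illustrative sketch (assuming PyTorch 2.x; the actual model code is under NDA). Compiling with fullgraph=True surfaces graph breaks instead of silently falling back to eager execution, which is exactly the kind of signal that pushed us toward a clean, isolated harness:

```python
import torch

def check_compilability(model, example_input):
    """Try to capture the whole model as one graph; report where it breaks."""
    try:
        # fullgraph=True raises instead of silently splitting the graph,
        # so legacy constructs (data-dependent Python control flow,
        # unsupported ops, .item() calls, ...) show up immediately.
        compiled = torch.compile(model, fullgraph=True)
        compiled(example_input)
        print("Model captured as a single graph; compilation is an option.")
        return compiled
    except Exception as err:  # torch._dynamo raises its own exception types
        print(f"Graph capture failed: {err}")
        return model  # fall back to the original, uncompiled model
```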
How Pruna Simplifies AI Optimization for Enterprises
Enterprises are partnering with Pruna AI for model optimization because we don’t just provide tooling—we deliver a solution to one of the most overlooked pain points in machine learning: evaluation.
While we're still building features to make these capabilities scalable and repeatable, integrating our optimization engine as an additional layer in your ML pipelines already gives you a consistent framework for benchmarking across diverse hardware and model types.
In the future, we plan to make evaluation even more reliable, automated, and convenient. This includes offering custom evaluation metrics on-demand, enabling teams to measure what truly matters to them, and introducing metric-based evaluations to streamline comparisons and enhance transparency. We’re also exploring human-in-the-loop evaluations, particularly for image generation use cases—a space where we’ve already begun alpha development. To tie everything together, we envision dashboards that let you monitor each deployment, track before-and-after performance, and stay on top of your models’ performance as they evolve.
This not only ensures your models are faster, more efficient, and cost-effective, but also provides a unified, automated approach to tracking performance over time, enabling your teams to focus on innovation rather than troubleshooting. If you're interested in learning more, feel free to contact us!
The Results: Key Metrics Across 15 Model Variations
Psst, why say 15 models? Well, technically, it’s 5 model types, but each one was deployed on 3 different hardware configurations. Depending on the target hardware, you can’t use the exact same methods, hyperparameters, or backends—each setup comes with its own constraints. When you factor in the additional workload and fine-tuning required for each configuration, it’s fair to say we benchmarked 15 models, not just 5. See what we did there?
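If you want to picture the resulting benchmark matrix, here is a simplified sketch (the model names are the open-source equivalents used in this post; the recipes themselves are purely illustrative, not our actual configurations):

```python
# Each (model, hardware) pair gets its own recipe: the methods, backends, and
# hyperparameters that work on one target rarely transfer to another.
MODELS = ["whisper-v3-small", "Metric3D-v2-S", "Llama-3.2-1B", "DDRNet23", "SegFormer-MiTB2"]
HARDWARE = ["NVIDIA Tesla T4", "NVIDIA A10G", "Intel Xeon CL"]

def pick_recipe(model: str, hardware: str) -> dict:
    """Purely illustrative recipe selection; real choices also depend on the model."""
    if "Xeon" in hardware:  # CPU target: CUDA-only backends are off the table
        return {"precision": "int8", "backend": "cpu-friendly compiler"}
    return {"precision": "fp16", "backend": "GPU graph compiler"}

configs = [(m, h, pick_recipe(m, h)) for m in MODELS for h in HARDWARE]
print(len(configs))  # 5 model types x 3 hardware targets = 15 configurations
```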
Automatic Speech Recognition (ASR)
Efficiency Gains: up to 2x speed improvements.
No Compromise on Accuracy: Across all methods tested, Word Error Rate (WER) did not degrade, ensuring that optimization efforts do not impact transcription accuracy.
Substantial Speed Gains: The methods delivered an average speed-up of 2x, a marked improvement in processing efficiency.
Encouraging Real-Time Performance: While the audio files tested were short, the Real-Time Factor (RTF) results indicate strong potential for handling longer recordings or real-time discussions. This ensures a seamless and responsive user experience, even under dynamic conditions.
Conditions:
Base model, dataset: whisper-v3-small, LibriSpeech.
Metrics evaluated: RTF, WER, throughput_async.
Hardware tested: NVIDIA Tesla T4, NVIDIA A10G, Intel Xeon CL.
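For reference, the Real-Time Factor mentioned above is simply the ratio of processing time to audio duration; values below 1 mean the model keeps up with live audio. A minimal sketch (numbers illustrative):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1: transcription finishes faster than the audio plays back."""
    return processing_seconds / audio_seconds

# e.g. a 30-second clip transcribed in 6 seconds gives an RTF of 0.2
print(real_time_factor(6.0, 30.0))
```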
Monocular Depth Estimation
Efficiency Gains: up to 80x speed improvements on GPUs, 10x on CPUs.
No Accuracy Loss: the smashed models delivered identical performance to the original models across all metrics, ensuring no compromise on reliability.
Outstanding Speed Gains: we achieved up to 80x speed improvements on GPUs and at least 10x on CPUs, highlighting substantial efficiency across hardware configurations.
Energy Efficiency: we observed significant reductions in energy consumption, varying by device, demonstrating improved sustainability for optimized deployments.
Conditions:
Base model & dataset: Metric3D-v2-S, KITTI Eigen.
Metrics evaluated: FPS, abs_rel, sq_rel, rmse, rmse_log, energy consumption.
Hardware tested: NVIDIA Tesla T4, NVIDIA A10G, Intel Xeon CL.
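For readers less familiar with the depth metrics listed above, here is a minimal NumPy sketch of the standard definitions used in the KITTI Eigen protocol (our own restatement, not the customer's evaluation code):

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> dict:
    """Standard monocular-depth error metrics over valid ground-truth pixels."""
    pred, gt = pred[mask], gt[mask]
    return {
        "abs_rel": np.mean(np.abs(pred - gt) / gt),   # mean absolute relative error
        "sq_rel": np.mean((pred - gt) ** 2 / gt),     # mean squared relative error
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),   # root mean squared error
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
    }
```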
Natural Language Processing (NLP)
Efficiency Gains: up to 2x speed improvements on GPUs, 5x on CPUs.
Minimal Accuracy Trade-Offs: For every method tested, we observed only acceptable decreases in performance, staying well within the 5% maximum accuracy drop guideline. These decreases were minor and within the expected noise levels of the benchmark.
Impressive Speed Gains: The optimization methods delivered significant speed-ups:
On GPUs, text generation was 1.5–2x faster, showcasing strong efficiency gains for high-performance use cases.
On CPUs, the results were even more striking, achieving up to 5x faster speeds.
Human-Like Token Generation: Benchmarks were conducted on a smaller model (Llama-3.2-1B, with 1 billion parameters instead of 8 billion), yet we achieved CPU token generation speeds matching the pace of average human reading (around 8 tokens/s). This highlights the practical viability of these optimizations even in constrained environments.
Conditions:
Base model, dataset: Llama-3.2-1B, HellaSwag.
Metrics evaluated: HellaSwag accuracy, token/s.
Hardware tested: NVIDIA Tesla T4, NVIDIA A10G, Intel Xeon CL.
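To put the token/s figures in context, here is a minimal sketch of how decode throughput can be measured with Hugging Face transformers (the model ID and generation settings are our own illustrative choices; access to the gated Llama weights is assumed):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # gated model; assumes access is granted
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

inputs = tokenizer("The key challenges in model evaluation are", return_tensors="pt")
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")  # ~8 tokens/s is average reading pace
```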
Semantic Segmentation
Transformer - Efficiency Gains: up to 30x speed improvements on GPUs.
CNN - Efficiency Gains: up to 60x speed improvements on GPUs.
Accuracy Maintained: All tested methods showed no compromise on accuracy, ensuring the reliability of the results.
Conditions:
Base models, dataset: DDRNet23 and SegFormer-MiTB2, Cityscapes.
Metrics evaluated: FPS, mIoU, memory, energy, MACs.
Hardware tested: NVIDIA Tesla T4, NVIDIA A10G, Intel Xeon CL.
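And for completeness, the mIoU reported for segmentation is the mean, over classes, of the intersection over union between predicted and ground-truth masks. A minimal sketch (our own restatement of the standard definition):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union across classes, skipping absent classes."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious))
```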
Scaling Beyond Benchmarks: What’s Next?
While we may eventually go public with this case study (some day?), everything we’ve developed so far is already functional and seamlessly integrated for our customer—off-the-shelf and ready to deliver results. But the real question is: how do partnerships like these help enterprises scale AI effectively? Here’s what we focus on:
Expertise in Model Selection: Benchmarks are useful, but they’re just the start. The hard part is figuring out what works for your specific challenge. That means understanding your context, picking the right methods, and adapting them to real-world problems. It’s not a one-time thing either—this process involves constant testing and improvement to keep your models efficient and ready for the next challenge.
Collaborative Trade-Off Management: There’s no such thing as perfect optimization. It’s always about finding the right balance between speed, accuracy, and efficiency. Every use case is different, and we work with you to weigh the trade-offs and decide what makes the most sense for your goals.
Scenario Testing and Fine-Tuning: It’s not enough to just run the numbers. We experiment with batch sizes, deployment conditions, and every other variable that might affect performance. The goal is to figure out what works best, not just in theory but in real-world operations.
Iterative Improvement: Optimization isn't done after the first try. Testing, feedback, and stress scenarios usually reveal gaps or areas for improvement. The process is simple: find what's missing, fix it, and make it better. The result? Models that pass the toughest QA and stress tests without breaking a sweat.
Hardware-Aware Optimizations: Different hardware means different constraints. What works on a high-end GPU might not fly on an edge device or custom chip. That’s why we tailor optimization techniques to fit the hardware you’re using, instead of trying to force one-size-fits-all solutions.
Pruna’s approach is hands-on. It’s not about sitting back, waiting for tickets, or handing over generic instructions. It’s about digging into the challenges, providing practical solutions, and delivering results that integrate seamlessly into real-world pipelines. Thanks for reading—your journey toward optimized AI starts here.