Technical Article, Integration

Achieve 2x to 4x Efficiency Gains on Databricks DBRX with Pruna AI

Oct 28, 2024

Quentin Sinig

Go-To-Market Lead

Johanna Sommer

ML Research Engineer

Are You a Databricks Customer? Supercharge DBRX x4 Today!

Since its launch on March 27th, 2024, Databricks' DBRX, an open, general-purpose LLM, has made waves in the AI community. With billions invested in developing this new AI standard, it is imperative to maximize the efficiency and impact of those resources. Yes, Jonathan! You can be proud that DBRX “surpassed everything” in terms of accuracy and innovation. Yet, is it still true? In the relentless race for AI dominance, performance isn't just about accuracy; it's also about efficiency. That’s where Pruna steps in. Check how Pruna's 4-bit quantized versions of DBRX-Base & DBRX-Instruct can help the Databricks community save time and money!

DBRX Minimum Requirements

DBRX (aka “Databricks” without its vowels 😉) is a large language model (LLM) built entirely from scratch. DBRX is licensed under the Databricks Open Model License (meaning sublicensing is disabled) and is readily available for developers to explore and use. The DBRX repository provides essential code examples for running inference tasks, along with helpful resources and links.

DBRX is built on the MegaBlocks research and open-source project (OSS FTW ♥️).

When you read the README, you’ll notice it states that 'to run inference with 16-bit precision, a minimum of a 4 x 80GB multi-GPU system is required,' and it has only been tested on A100 and H100 GPUs. While TensorRT and vLLM are mentioned as optimization options, this presents a somewhat limited view of what's possible when aiming for deep optimization. There’s much more that can be done to achieve truly significant improvements.
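The 4 x 80GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts only the memory needed to store the weights and assumes a total of 132B parameters, a number taken from DBRX's public model card rather than from this article. It ignores activations, the KV cache, and framework overhead, so treat the results as lower bounds.

```python
# Rough weight-only memory estimate for DBRX at different precisions.
# Assumption: 132e9 total parameters (from the public DBRX model card).
TOTAL_PARAMS = 132e9


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
int4_gb = weight_memory_gb(TOTAL_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.0f} GB")  # 264 GB
print(f"4-bit weights:  {int4_gb:.0f} GB")  # 66 GB
```

Under these assumptions, 16-bit weights alone exceed what three 80GB GPUs can hold, which is consistent with the README's four-GPU minimum, while 4-bit weights come in under the capacity of a single 80GB card. Real 4-bit checkpoints also store quantization metadata, so the actual footprint will be somewhat higher than this estimate.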

Supercharge Your DBRX Models With Pruna AI

Unlocking the true potential of DBRX open LLMs lies in quantization. This technique streamlines these models, dramatically reducing their size. The result? A triple win: a smaller model, significant cost savings on hardware and infrastructure, and a greener approach to AI development.
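To make the idea concrete, here is a toy illustration of 4-bit quantization in plain Python. It uses simple absmax rounding with a single shared scale, which is an assumption chosen for readability and not the scheme behind the Pruna checkpoints; libraries such as bitsandbytes use more elaborate blockwise schemes. The function names are hypothetical.

```python
# Toy absmax quantization: store each weight as a small integer plus one shared scale.
# This is an illustrative sketch, not the method used by any specific library.

def quantize_absmax_4bit(weights):
    """Map floats to signed integers in [-7, 7] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight now needs only 4 bits instead of 16, at the cost of a small rounding error bounded by half the scale.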

Consider this: DBRX already outperforms the likes of LLaMA2-70B, Mixtral, Grok-1, and GPT-3.5 in core areas like language comprehension, programming, mathematical problem solving, and logical reasoning. While it might seem like 'old news,' especially with the latest GPT-4.0 benchmarks, DBRX is still highly integrated into a product now used by over 10,000 companies. With an initial investment of $10M, it's unlikely they'll phase it out anytime soon. So, there’s still a strong case for smashing your DBRX deployment.

Now, imagine this: What if you could already achieve a staggering x2 to x4 efficiency boost? Or fit your model into a SINGLE A100? That's the power of Pruna AI. With a single line of code, we empower organizations to tailor DBRX to their specific industry needs – propelling you ahead of the competition. Don't just use DBRX, optimize it with Pruna and unlock its full potential to gain a significant competitive edge.


Getting Started with DBRX and Pruna AI

Getting started with DBRX models is easy. First, make sure you have the following packages installed:

pip install "torch==2.4.0" "transformers>=4.39.2" "tiktoken>=0.6.0" "bitsandbytes"

You can then download and run the model with the following simple code snippet. Make sure to supply your HuggingFace token by replacing “hf_YOUR_TOKEN” with your own token.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "PrunaAI/dbrx-instruct-bnb-4bit"
HF_TOKEN = "hf_YOUR_TOKEN"  # replace with your own token

# Load the tokenizer and the 4-bit quantized model; device_map="auto" places the weights on the available GPU(s)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, token=HF_TOKEN)

# Build a chat-formatted prompt and move the tokenized inputs to the GPU
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
inputs = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

# Generate up to 200 new tokens and decode them back to text
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
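If you would rather not paste a token into your script, one option is to read it from an environment variable. The helper below is a minimal sketch: `get_hf_token` is a hypothetical name, and it assumes you have exported your token as `HF_TOKEN` in your shell beforehand.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in `env_var`, or raise if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token
```

You could then pass `token=get_hf_token()` to both `from_pretrained` calls instead of a hard-coded string.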

Since the DBRX model is rather large and takes time to download, it might be worth it to speed up download time with:

pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
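In a notebook, where there is no shell session to `export` from, the same flag can be set from Python. This is a sketch under one assumption: that the flag is read when `huggingface_hub` is first imported, so it has to be set before importing `transformers`. The `hf_transfer` package still needs to be installed, and `enable_hf_transfer` is a hypothetical helper name.

```python
import os


def enable_hf_transfer() -> None:
    """Set the same flag as `export HF_HUB_ENABLE_HF_TRANSFER=1`, for this process only."""
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


enable_hf_transfer()
# Import transformers / huggingface_hub only after this point,
# so that the flag is already in place when they load.
```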


Conclusion

In the fast-paced world of AI, staying ahead isn't just about choosing the best models—it’s about making them work smarter. With DBRX, Databricks has given the community an open, high-performance LLM to build on. But why settle for standard performance when you can push the boundaries further?

By leveraging Pruna AI’s quantization and optimization techniques, you not only unlock more efficient deployments but also take a step toward reducing infrastructure costs and embracing a more sustainable AI strategy.

So, whether you're running DBRX or other LLMs, there's no reason not to make them leaner, faster, and more efficient with Pruna! Ready to start smashing? Contact Us for a Demo or Join the Discord Community!


—————————————————

About Databricks

Databricks is the Data and AI company. More than 10,000 organizations worldwide, including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500, rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe, and was founded by the original creators of Lakehouse, Apache Spark™, Delta Lake and MLflow. To learn more, follow Databricks on LinkedIn, X, and Facebook.

Speed Up Your Models With Pruna AI.

Inefficient models drive up costs, slow down your productivity and increase carbon emissions. Make your AI more accessible and sustainable with Pruna.

© 2024 Pruna AI - Built with Pretzels & Croissants 🥨 🥐
