Hugging Face & Pruna AI: SmolLM2, Now 7x Smaller and 2x Faster

Jan 6, 2025

Begüm Çığ

Machine Learning Research Engineer

At Pruna, our mission is to make models smaller, greener, faster, and cheaper, with reliability at the core of everything we do. We believe that true efficiency is meaningless without trust and the ability to consistently deliver dependable solutions. Hugging Face’s SmolLM family has set a high standard for compact language models across benchmarks, combining performance with resource efficiency. In this post, we focus on their latest iteration, SmolLM2, and explore how we collaborated with Hugging Face to push the boundaries of model efficiency even further.

What is SmolLM? 🤗

SmolLM is Hugging Face’s lineup of compact language models, built to deliver solid performance while staying lightweight. Available in 135M, 360M, and 1.7B parameter sizes, these models are versatile for a range of applications.

The latest version, SmolLM2, builds on its predecessor with notable enhancements. It’s trained on 11 trillion tokens and uses the SmolTalk dataset, which includes sources like FineWeb-Edu, Cosmopedia v2, and The Stack. SmolLM2 also incorporates additional data focused on instruction-following, reasoning, and math, enabling better performance in tasks like summarization, code generation, and complex question answering.

Their largest base model, SmolLM2-1.7B, delivers strong results on commonsense reasoning benchmarks such as HellaSwag (68.7%) and PIQA (77.6%), as well as on science and reasoning benchmarks like ARC Average (60.5%). These scores are competitive with similarly sized models like Llama-1B and Qwen2.5-1.5B.

Our Pre-Smash Prep 💅

To unlock even greater efficiency from SmolLM2, we turned to quantization, a process of converting the model’s weights and activations from high-precision formats like float32 to lower-precision formats such as int8. This significantly reduces the model's memory footprint and computational overhead, making deployment in resource-constrained environments more feasible.
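To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is only an illustration of the float-to-int8 mapping described above, not how Pruna or any of the quantizers in this post work internally; the function names and the toy weights are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # The largest magnitude maps to 127; guard against an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


# Toy weights, purely illustrative.
weights = [0.12, -0.5, 0.33, 0.0, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a single shared scale, which is where the memory savings come from. Real quantizers typically use per-channel or per-group scales and smarter rounding to keep the error low.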

Optimizing language models often involves experimenting with various quantization methods, as each comes with its own strengths, limitations, and suitability for specific applications. Luckily, the Pruna library supports a wide range of these methods, making it easier to test and apply the right approach for any use case!

from pruna import SmashConfig

smash_config = SmashConfig()
smash_config["quantizers"] = ["awq"]  # swap in your choice of quantizer

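For SmolLM2, we experimented with several methods: GPTQ, which quantizes weights layer by layer and uses approximate second-order information from calibration data to limit the resulting error; AWQ, which uses activation statistics to protect the most important weights during weight-only quantization; HQQ, a calibration-free method that balances compression and performance; Quanto, offering ultra-low-bit formats; and LLM-Int8 by BitsAndBytes, which relies on mixed precision to handle outlier features in large models.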

We experimented with various weight precision configurations on two Instruct models: 135M-Instruct and 1.7B-Instruct. For the smaller 135M model, we explored 2-bit, 4-bit, and 8-bit quantization across all available configurations to observe performance under diverse conditions. For the larger 1.7B model, testing was primarily focused on 4-bit quantization, with additional evaluations of specific 2-bit and 8-bit configurations.

smash_config["quant_awq_weight_bits"] = 4  # again, replace 'awq' with your quantizer of choice!
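As a rough guide to what each bit width buys you, the sketch below estimates the storage needed for the weights alone. The helper name and the rounded parameter counts are assumptions for illustration; the estimate ignores quantization metadata (scales, zero points), any layers left unquantized, and activation memory, so measured inference memory will generally be higher.

```python
def weight_memory_mb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Rounded parameter counts, used here only for a back-of-the-envelope estimate.
models = {"135M": 135e6, "1.7B": 1.7e9}

for name, num_params in models.items():
    for bits in (32, 16, 8, 4, 2):
        print(f"{name} @ {bits:>2}-bit: {weight_memory_mb(num_params, bits):8.1f} MB")
```

Halving the bit width halves the weight storage, which is why the step from 8-bit to 4-bit is so attractive as long as quality holds up.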

For evaluation, we used the smol-smoltalk dataset, a subset of the smoltalk dataset designed for models with fewer than 1B parameters. The dataset includes shorter conversations and less task-specific data compared to smoltalk, making it more suitable for smaller models like SmolLM2.

We used 1,000 test samples as the evaluation dataset. For quantization methods requiring calibration, we used 100 training samples for tuning. This setup provided an efficient and consistent way to benchmark and calibrate the models without overloading them with unnecessary complexity.

And here’s some exciting news: we now support this dataset with a dedicated DataModule! 🎉 (A little insider tip: it’s not live just yet, but stay tuned. 😉)

from pruna.data import get_dataset

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # or any model you'd prefer

# "SmolTalk_2048": change the suffix to your ideal sequence length
smoltalk_datamodule = get_dataset(
    dataset_name="SmolTalk_2048",
    directory_dataset=smash_config.cache_dir,
    tokenizer_name=model_id,
)

smash_config.add_data(smoltalk_datamodule.val_dataloader())

Now we are ready to “smash” our model:

from transformers import AutoModelForCausalLM
from pruna import smash

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
smashed_model = smash(
    model=model,
    smash_config=smash_config,
)

It’s that easy!

Our Results ✨

Now, with the setup out of the way, let’s move on to the exciting part. We evaluated and compared our quantized models using several metrics: perplexity, inference memory usage, synchronous and asynchronous latency for both inference and token generation, and CO2 emissions along with energy consumption during these tasks.
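For readers less familiar with perplexity, here is a minimal sketch of how it is usually computed: the exponential of the average negative log-likelihood per token. The function name and the toy log-probabilities are assumptions for illustration, and this is not necessarily the exact evaluation code behind the numbers in this post.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities a model assigned to the correct next tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(log_probs)
print(f"perplexity: {ppl:.3f}")
```

Lower is better: a model that always gave the correct token a probability of 0.25 would score a perplexity of 4, as if it were choosing uniformly among four options.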

Results for SmolLM2-135M-Instruct

2-bit Quantization:

  • At 2-bit precision, memory usage is dramatically reduced, with gptq achieving as low as 94 MB, a reduction of more than 8x.

  • Perplexity suffers significantly at this precision, which is expected given the trade-offs of ultra-low-bit quantization. However, the savings in memory still make these configurations worth exploring. With additional fine-tuning, such as parameter-efficient fine-tuning (PEFT), we could work towards recovering perplexity in future work.

4-bit Quantization:

  • 4-bit precision hits the sweet spot between accuracy and efficiency. For instance, awq retains a perplexity close to the base model (3.82 vs 3.38) while reducing memory usage to 106 MB, a 7x reduction.

  • Latencies also show significant improvements, with asynchronous latency as low as 57.84 ms, making this configuration suitable for real-time applications.

  • The environmental impact of the models (e.g., energy consumption and CO2 emissions) is also reduced, reflecting more sustainable operation 💚 🌍.

8-bit Quantization:

  • At 8 bits, hqq reaches a perplexity of 3.25, improving on the base model, while reducing memory usage to 230 MB, 3x smaller than the base configuration.

  • Synchronous latency is reduced to 72.47 ms, demonstrating that higher-precision quantization can still yield efficiency gains without compromising accuracy significantly.

8-bit methods tend to preserve perplexity, with some even improving it. At 4 bits, there is some perplexity loss, but memory savings also increase.

The base model achieves the best performance in asynchronous inference timing, though there are close contenders at 4-bit and 8-bit settings.

For synchronous inference timing, there are smashed models that outperform the base model across all bit settings, with hqq consistently delivering better results in every case.
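If you want to time your own models, a generic latency helper might look like the sketch below. The helper and the dummy workload are hypothetical and this is not the measurement code used for the results in this post. One assumption worth stating: on a GPU, work is launched asynchronously, so the timed callable has to wait for the device to finish (for example via a synchronization call) if you want end-to-end latency rather than launch time.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, runs=10):
    """Return the mean wall-clock latency of fn() in milliseconds."""
    # Warm-up calls absorb one-time costs such as caching or lazy initialization.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


def dummy_forward():
    # Stand-in for a model call; replace with your own inference function.
    return sum(i * i for i in range(10_000))


latency = measure_latency_ms(dummy_forward)
print(f"mean latency: {latency:.3f} ms")
```

Averaging over several runs after a warm-up makes the comparison between a base and a smashed model far less noisy than a single timed call.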

Results for SmolLM2-1.7B-Instruct

4-bit Quantization:

  • In the 4-bit configuration, awq retains a perplexity of 2.29, closely resembling the base model (2.21), while reducing memory usage to 1194 MB—a 6x reduction.

  • Latencies are significantly reduced, with asynchronous latency dropping to 41.46 ms, making this a strong candidate for latency-critical tasks.

  • Energy consumption and CO2 emissions show improvements for half of the quantization methods.

8-bit Quantization:

  • At 8 bits, hqq achieves perfect perplexity retention (2.21) while reducing memory usage to 2354 MB, offering a 3x reduction compared to the base model.

  • Asynchronous latency improves to 121.85 ms, reflecting enhanced responsiveness.

  • The metrics indicate that 8-bit quantization is an ideal choice for applications requiring high accuracy with moderate resource constraints.
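To put the perplexity figures quoted in this post on a common scale, the sketch below converts them into relative changes against the corresponding base model. The numbers are the ones reported above; the helper name is made up, and the model labels assume that the first set of results refers to the 135M model and the second to the 1.7B model, which is what the memory figures suggest.

```python
def relative_change_pct(base, quantized):
    """Relative perplexity change in percent; positive means worse than base."""
    return (quantized - base) / base * 100.0


# (label, base perplexity, quantized perplexity), taken from the results above.
reported = [
    ("135M, awq 4-bit", 3.38, 3.82),
    ("135M, hqq 8-bit", 3.38, 3.25),
    ("1.7B, awq 4-bit", 2.21, 2.29),
    ("1.7B, hqq 8-bit", 2.21, 2.21),
]

changes = {
    label: round(relative_change_pct(base, quantized), 1)
    for label, base, quantized in reported
}
for label, pct in changes.items():
    print(f"{label}: {pct:+.1f}%")
```

Seen this way, the larger model loses noticeably less perplexity at 4 bits than the smaller one, at least for the awq configurations quoted here.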

Carbon emissions are lower than the base model's for almost every smashed model.

Overall, our experiments demonstrate that quantization maintains performance in key areas and opens new possibilities for resource-constrained deployments.

Try It Out for Yourself! 🚀

You can find our full collection of SmolLM2 quantized models on Hugging Face here! Whether you're experimenting with compact models or seeking to optimize performance and efficiency, our smashed models are ready for action. Check them out, experiment with them, and let us know what you think. 🌟

Speed Up Your Models With Pruna AI.

Inefficient models drive up costs, slow down your productivity and increase carbon emissions. Make your AI more accessible and sustainable with Pruna.

© 2024 Pruna AI - Built with Pretzels & Croissants 🥨 🥐
