Introduction

The sizes of large language models (LLMs) have been increasing steadily over the last few years. OpenAI’s GPT-1 started off with 117 million parameters, which grew to 1.5 billion in GPT-2 and then to 175 billion in GPT-3. The most recent GPT-4 is rumoured to have over a trillion parameters. The natural language abilities of LLMs have undeniably skyrocketed along with their parameter counts, but this is clearly an unsustainable path. Training GPT-3 reportedly ran up on the order of 100 million dollars in electricity bills alone, and the cost of training these models is expected to double roughly every ten months.

Several compression methods have previously been proposed for LLMs, such as quantization and knowledge distillation, which we will briefly introduce in this article. However, they have met with mixed success, since it is difficult to control the error that compression introduces. A novel technique, CompactifAI, has recently been proposed with impressive results, which we will review here.

Large Language Model Compression

Some of the compression techniques proposed over the years are listed here:

  • Quantization: Quantization maps a larger set of values (such as continuous floating-point numbers) to a smaller set of values (such as discrete integers). In LLMs, parameters stored as 32-bit floating-point (`float32`) numbers are typically reduced to `float16` or `int8`. Hugging Face has a good explanation of quantization if you are interested; a minimal sketch is also given after this list.
  • Knowledge Distillation: Knowledge distillation trains a smaller, simpler model (the student) to mimic the predictions of a larger, more complex model (the teacher). This approach is useful when we want to approach the performance of a larger model without its computational or memory costs.
  • Pruning: Unimportant connections within neural networks are identified and pruned to reduce the number of parameters.
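
To make the quantization idea concrete, here is a minimal sketch of symmetric `int8` quantization in NumPy. This is an illustrative toy rather than the scheme used by any particular library; the per-tensor scale factor and the function names are my own assumptions.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights onto the int8 range [-127, 127] with one scale per tensor."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

# Quantize a random weight matrix and measure the error this introduces
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print(f"storage: {w.nbytes} bytes -> {q.nbytes} bytes")
print(f"mean absolute error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```

Storage drops by 4x (32 bits down to 8 bits per weight), but every weight picks up a small rounding error, and it is exactly this error that becomes hard to control as quantization gets more aggressive.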

These methods can be somewhat brute-force, cutting parameters or their precision at the cost of model performance. The method proposed in CompactifAI instead focuses on controlled and interpretable compression.

CompactifAI

Inspired by the Tensor Networks commonly used in quantum physics and the successful decomposition of weight matrices into Matrix Product Operators (MPOs) in deep learning architectures, CompactifAI similarly decomposes weight matrices in LLMs into MPOs.

If these terms seem completely alien to you, don’t worry. They originate from quantum physics, which is why they can seem convoluted. Tensors and MPOs are mathematical constructs used in both deep learning and quantum computing. We provide a brief introduction to them in the following section, but feel free to skip ahead to the Methods or Results sections.

Tensors and Matrix Product Operators (MPOs)

The easiest way to understand tensors is through an image, like the one below. Tensors can be visualised as stacks of matrices. RGB images are represented as 3-dimensional tensors, and a series of RGB images (like frames in a video) form a 4-dimensional tensor.

Matrix Product Operators (MPOs) are used in quantum mechanics to express entities like Hamiltonians. They are capable of representing any operator or function. Think of them as a matrix containing matrices. By using MPOs, a larger matrix can be decomposed into a series of smaller matrices, and the level of compression can be controlled explicitly via the bond dimension χ.

An explanation of tensors.
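
As a rough illustration of the idea (my own sketch, not the paper’s code), the snippet below splits a single weight matrix into two smaller factors with a truncated SVD; the number of singular values kept plays the role of the bond dimension χ. A full MPO decomposition reshapes the matrix into a higher-order tensor and applies this kind of splitting site by site.

```python
import numpy as np

def decompose(W: np.ndarray, chi: int) -> tuple[np.ndarray, np.ndarray]:
    """Factor W into A (out_dim x chi) and B (chi x in_dim), keeping chi singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :chi] * S[:chi]   # absorb the singular values into the left factor
    B = Vt[:chi, :]
    return A, B

W = np.random.randn(1024, 4096)   # a small stand-in for an LLM weight matrix
A, B = decompose(W, chi=128)

kept = (A.size + B.size) / W.size
print(f"parameters kept: {kept:.1%}")   # ~16% of the original for chi = 128
print(f"relative error: {np.linalg.norm(W - A @ B) / np.linalg.norm(W):.3f}")
```

For a random matrix the truncation error is large, because its singular values are all of similar size. Real LLM weight matrices carry far more structure, which is what makes aggressive truncation, controlled explicitly through χ, viable in practice.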

The CompactifAI Method

To compress LLMs, CompactifAI applies a series of steps:

  • Layer sensitivity profiling: Viable self-attention (SA) and multi-layer perceptron (MLP) layers were identified for compression.
  • Compression: The weights of identified layers were replaced with tensor networks, where the level of compression was controlled via the bond dimension χ.
  • Healing: A brief retraining phase, needed because each layer was compressed without accounting for its interactions with the other layers.

Thanks to the smaller number of parameters, the retraining in the “healing” phase is quick, as we will see in the next section, and could be completed in less than one epoch without compromising performance.
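
To sketch what the compression and healing steps might look like in code, here is a simplified PyTorch stand-in that swaps a dense linear layer for two low-rank factors (a two-site stand-in for a full MPO). The class name, the choice of χ = 256, and the optimiser settings are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class TensorizedLinear(nn.Module):
    """A dense nn.Linear replaced by two low-rank factors with bond dimension chi."""

    def __init__(self, linear: nn.Linear, chi: int):
        super().__init__()
        W = linear.weight.data                          # (out_features, in_features)
        U, S, Vt = torch.linalg.svd(W, full_matrices=False)
        self.A = nn.Parameter(U[:, :chi] * S[:chi])     # (out_features, chi)
        self.B = nn.Parameter(Vt[:chi, :])              # (chi, in_features)
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Project into the chi-dimensional "bond" space, then back out: x @ (A @ B)^T
        out = (x @ self.B.T) @ self.A.T
        return out if self.bias is None else out + self.bias

# Compression: swap a selected SA/MLP projection for its tensorized counterpart
dense = nn.Linear(4096, 4096)
compressed = TensorizedLinear(dense, chi=256)

# Healing: briefly retrain the compressed parameters so the layers re-adapt to each other
optimizer = torch.optim.AdamW(compressed.parameters(), lr=1e-4)
x = torch.randn(8, 4096)
loss = compressed(x).pow(2).mean()   # placeholder loss; real healing uses the LM objective
loss.backward()
optimizer.step()
```

In this toy example the layer shrinks from roughly 16.8 million weights to about 2.1 million, and only the compressed factors are retrained during healing.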

The Results

The CompactifAI method was evaluated on the Llama-2 7B model (available on Hugging Face). The original model was compared to 8-bit and 4-bit quantized versions of Llama-2, a float16 version with CompactifAI applied (the 88% compressed version), and a mixed version (the 93% compressed version), in which 4-bit quantization was applied to the non-tensorized layers of the 88% version.

The original Llama-2 7B model was compared with both quantized and compressed versions of Llama-2. It is evident that the compressed models (88% and 93%) achieve a higher level of compression than the quantized models (8-bit and 4-bit). Source: Tomut et al. (2024), https://doi.org/10.48550/ARXIV.2401.14109

The compressed versions were compared with the original Llama-2 model on three factors:

1. Performance Accuracy

Comparison of performance accuracy of the original Llama-2 model with quantized (8-bit and 4-bit) and compressed (88% compressed and 93% compressed) versions. Source: Tomut et al. (2024), https://doi.org/10.48550/ARXIV.2401.14109

Despite having 70% fewer parameters, both compressed models experienced only a 2% to 3% drop in performance accuracy compared to the original model. This was observed across five benchmarks for language understanding (MMLU), commonsense reasoning (HellaSwag), reading comprehension (BoolQ), world knowledge (TriviaQA), and maths (GSM8K).

2. Training Time

As mentioned earlier, the compressed models also train twice as fast as both the original and quantized models on the same amount of data.

Comparison of training time of the original Llama-2 model with quantized (8-bit and 4-bit) and compressed (88% compressed and 93% compressed) versions. Source: Tomut et al. (2024), https://doi.org/10.48550/ARXIV.2401.14109

3. Inference Time

The compressed versions were also faster at inference, completing inference tasks in about 75% of the time taken by the other models.

Comparison of inference time of the original Llama-2 model with quantized (8-bit and 4-bit) and compressed (88% compressed and 93% compressed) versions. Inference times were normalized with respect to the original model. Source: Tomut et al. (2024), https://doi.org/10.48550/ARXIV.2401.14109

Conclusion

The results of this novel technique from Tomut et al. (2024) challenge what we have previously believed: LLMs do not have to get larger to get better. Instead, they can perform nearly as well with just a fraction of the parameters. Moreover, the paper presents a compression technique that, unlike previous methods, is controllable and interpretable.

With the growing prevalence of LLMs across applications and domains, there is no doubt there will be more research into this topic as the world moves toward developing more sustainable LLMs.

References

Tomut, A., et al. (2024). CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks. https://doi.org/10.48550/ARXIV.2401.14109