Quantizing Large Language Models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been directly integrated into the transformers library.
ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it’s optimized for (blazingly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored.
In this article, we will see how to quantize base models in the EXL2 format and how to run them. As usual, the code is available on GitHub and Google Colab.
To start our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, which is why we will install it from source as follows:
git clone https://github.com/turboderp/exllamav2
pip install exllamav2
Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let’s use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space.
We download zephyr-7B-beta using the following command (this can take a while since the model is about 15 GB):
git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
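Alternatively, if you prefer to skip git-lfs, you can download the model with the huggingface_hub library. This is a minimal sketch, assuming a recent version of the library that supports the local_dir argument:
from huggingface_hub import snapshot_download

# Download the full repository into ./base_model, the path we pass to convert.py later
snapshot_download(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    local_dir="base_model",
)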
GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows:
wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
Once it’s done, we can leverage the convert.py script provided by the ExLlamaV2 library. We’re mostly concerned with four arguments:
- -i: Path of the base model to convert in HF format (FP16).
- -o: Path of the working directory with temporary files and final output.
- -c: Path of the calibration dataset (in Parquet format).
- -b: Target average number of bits per weight (bpw). For example, 4.0 bpw will store weights in 4-bit precision.
The complete list of arguments is available on this page. Let’s start the quantization process using the convert.py script with the following arguments:
mkdir quant
python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0
Note that you will need a GPU to quantize this model. The official documentation specifies that you need approximately 8 GB of VRAM for a 7B model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.
Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. You can find more details about the GPTQ algorithm in this article.
So why are we using the “EXL2” format instead of the regular GPTQ format? EXL2 comes with a few new features:
- It supports different levels of quantization: it’s not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.
- It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.
ExLlamaV2 uses this additional flexibility during quantization. It tries different quantization parameters and measures the error they introduce. On top of trying to minimize the error, ExLlamaV2 also has to achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5 for example.
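To make this concrete, here is a deliberately simplified Python sketch of how a quantizer could pick one option per layer so that the average bpw stays within a target budget while keeping the measured error low. It is only an illustration with made-up option lists, not ExLlamaV2’s actual algorithm, which also weights each layer by its number of parameters:
# Toy illustration (not ExLlamaV2's real code): start from the cheapest option
# per layer, then spend the remaining bit budget where it removes the most error.
layers = {
    "layer_0": [{"bpw": 2.19, "err": 0.0112}, {"bpw": 4.16, "err": 0.0012}],
    "layer_1": [{"bpw": 2.19, "err": 0.0208}, {"bpw": 4.16, "err": 0.0021}],
}
target_bpw = 3.5

choice = {name: 0 for name in layers}  # index of the selected option per layer
budget = target_bpw * len(layers) - sum(opts[0]["bpw"] for opts in layers.values())

while True:
    best, best_gain, best_cost = None, 0.0, 0.0
    for name, idx in choice.items():
        if idx + 1 < len(layers[name]):  # an upgrade is still available
            cur, nxt = layers[name][idx], layers[name][idx + 1]
            cost = nxt["bpw"] - cur["bpw"]
            gain = (cur["err"] - nxt["err"]) / cost  # error removed per extra bit
            if cost <= budget and gain > best_gain:
                best, best_gain, best_cost = name, gain, cost
    if best is None:
        break
    choice[best] += 1
    budget -= best_cost

print(choice)  # with these made-up numbers: {'layer_0': 0, 'layer_1': 1}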
The benchmark of different parameters it creates is saved in the measurement.json file. The following JSON shows the measurement for one layer:
"key": "mannequin.layers.0.self_attn.q_proj",
"numel": 16777216,
"choices": [
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": [
3,
2
],
"bits_prop": [
0.05,
0.95
],
"scale_bits": 4
}
},
In this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error that is taken into account to select the best parameters.
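As a quick sanity check, the reported bpw is simply the total number of bits divided by the number of weights in the tensor, with the group scales accounting for the gap between the raw 3-bit/2-bit mix and the final value. The numbers below are copied from the JSON above:
# Values copied from the measurement entry shown above
numel = 16777216
total_bits = 36706304.0
bits, bits_prop = [3, 2], [0.05, 0.95]

print(total_bits / numel)                            # 2.1878662109375, matches "bpw"
print(sum(b * p for b, p in zip(bits, bits_prop)))   # ~2.05 from the quantized weights alone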
Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden (.*) or a safetensors file. Additionally, we do not need the out_tensor directory that was created by ExLlamaV2 during quantization.
In bash, you can implement this as follows:
!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
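If you prefer to stay in Python instead of bash, here is a small equivalent using only the standard library (a sketch, assuming the same base_model and quant directory names as above):
import shutil
from pathlib import Path

src, dst = Path("base_model"), Path("quant")

# Remove the out_tensor directory created during quantization
shutil.rmtree(dst / "out_tensor", ignore_errors=True)

# Copy every file that is neither hidden nor a .safetensors shard
for f in src.iterdir():
    if f.is_file() and not f.name.startswith(".") and f.suffix != ".safetensors":
        shutil.copy2(f, dst / f.name)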
Our EXL2 model is ready and we have several options to run it. The most straightforward method consists of using the test_inference.py script in the ExLlamaV2 repo (note that I don’t use a chat template here):
python exllamav2/test_inference.py -m quant/ -p "I have a dream"
The generation is very fast (56.44 tokens/second on a T4 GPU), even compared to other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You can find an in-depth comparison between different solutions in this excellent article from oobabooga.
In my case, the LLM returned the following output:
-- Model: quant/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...

I have a dream. <|user|>
Wow, that is an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable!

Absolutely! Here's your updated speech:

Dear fellow citizens,

Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors

-- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)
Alternatively, you can use a chat version with the chatcode.py script for more flexibility:
python exllamav2/examples/chatcode.py -m quant -mode llama
If you’re planning to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga’s text generation web UI. Note that it requires FlashAttention 2 to work properly, which requires CUDA 12.1 on Windows at the moment (something you can configure during the installation process).
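You can also run the quantized model directly from Python. The snippet below is a minimal sketch based on the example scripts shipped with the ExLlamaV2 repo at the time of writing; class names and arguments may differ between versions, so treat it as a starting point rather than a definitive reference:
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Point the config at the quantized model directory
config = ExLlamaV2Config()
config.model_dir = "quant"
config.prepare()

# Load the model and allocate the cache across available GPUs
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Basic sampling settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

print(generator.generate_simple("I have a dream", settings, num_tokens=128))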
Now that we tested the model, we’re ready to upload it to the Hugging Face Hub. You can change the name of your repo in the following code snippet and simply run it.
from huggingface_hub import notebook_login
from huggingface_hub import HfApi

notebook_login()
api = HfApi()
api.create_repo(
repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
repo_type="model"
)
api.upload_folder(
repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
folder_path="quant",
)
Great, the model can now be found on the Hugging Face Hub. The code in the notebook is quite general and allows you to quantize different models, using different values of bpw. This is ideal for creating models dedicated to your hardware.
In this article, we presented ExLlamaV2, a powerful library to quantize LLMs. It is also a fantastic tool to run them, since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr-7B-beta model to create a 5.0 bpw version of it, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here.
If you’re interested in more technical content around LLMs, follow me on Medium.