To be honest, I'm not very familiar with quantization, but since I'm tight on VRAM I decided to try FP8. Ada-generation GPUs also support FP8, so I measured how it performs. I ran the tests in Japanese, and since I wasn't too concerned about accuracy, my motivation was simply to squeeze more tokens per second out of limited resources.
I didn't want to use any particularly difficult techniques, so I referred to the following:
[Intern Report] Measuring the Effect of Lightweighting Large Language Models through Quantization
https://engineering.linecorp.com/ja/blog/quantization-lightweighting-llms
The quantization process didn't go smoothly at first, mainly due to some hiccups during library installation, but I eventually managed to get it running.
I used the Mistral 7B Instruct v0.2 model. An Ada-generation card like the RTX 4060 Ti with 16GB of VRAM doesn't have enough memory to run it in FP16, but switching to FP8 just barely made it work.
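As a rough back-of-the-envelope calculation (assuming about 7.2B parameters, which is approximately what Mistral 7B has), the FP16 weights alone already occupy most of a 16GB card:

params = 7.24e9                     # approximate parameter count of Mistral 7B (my assumption)
weights_gib = params * 2 / 2**30    # 2 bytes per parameter in FP16
print(f"~{weights_gib:.1f} GiB for the weights alone")  # about 13.5 GiB, before the KV cache and activations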
The various settings are as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer
# from accelerate import Accelerator  # needed if the FP8 lines below are uncommented
import torch
import time

device = "cuda"
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# Load the weights in FP16; device_map="auto" already places the model on the GPU,
# so an extra model.to(device) call is not needed.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Uncomment these two lines (and the import above) to run with FP8 mixed precision.
#accelerator = Accelerator(mixed_precision="fp8")
#model = accelerator.prepare_model(model)

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
By uncommenting the two Accelerator lines marked with #, I was able to run it with FP8 mixed precision.
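The snippet above imports time but doesn't show the measurement itself, so here is a minimal sketch of how token generation speed can be timed; the generation arguments (such as max_new_tokens=256) are placeholders rather than my exact settings:

start = time.time()
generated_ids = model.generate(model_inputs, max_new_tokens=256, do_sample=True)
elapsed = time.time() - start

# Count only the newly generated tokens, not the prompt.
new_tokens = generated_ids.shape[-1] - model_inputs.shape[-1]
print(tokenizer.batch_decode(generated_ids)[0])
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")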
Power consumption was 110W out of 165W, and VRAM usage was 15533MiB / 16380MiB.
That is very close to the limit, so if I also tried to use the GPU for another process at the same time, such as vector search for RAG, it would hit a VRAM out-of-memory error.
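To see how much headroom is actually left before adding another workload, something like this can be used:

import torch

# Returns (free, total) device memory in bytes for the current CUDA device.
free_b, total_b = torch.cuda.mem_get_info()
print(f"free: {free_b / 2**20:.0f} MiB / total: {total_b / 2**20:.0f} MiB")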
Even on a low-cost GPU like the 4060 Ti, running an LLM at the relatively high precision of FP8 felt reasonable, but the token generation speed seemed slow.
That's all.