
Super Beginner's Guide to Local LLM

Yuichiro Minato

2024/04/11 05:33

#Featured

LLMs, or Large Language Models, are AI models capable of understanding text and responding in a sophisticated way. Services like ChatGPT and Claude are well-known examples. Recently, it has become increasingly feasible to install and run these models on a personal computer. Running an LLM locally is particularly useful in cases like the following:

- When you don't want to share information with other services

- For use in robotics

- When you want to use it without any usage restrictions

Using a local LLM has become remarkably straightforward. The mainstream approach is to download model files from a repository, most often Hugging Face. Surprisingly few people have actually set up or used a local LLM, so this article outlines a simple way to get started. The biggest barrier is probably installing a GPU, so this is written as a super beginner's guide with that in mind.

First, the most crucial component of an LLM is its "model" files, which contain the core weight parameters. When discussing models, you will hear sizes like 7B, 13B, or 70B. For example, Mistral-7B by mistralai has a parameter size of 7B.

Mistral-7B-Instruct-v0.2 on Hugging Face

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

The "B" in 7B stands for billion, indicating that the model has 7 billion parameters. The local LLMs that are commonly used recently range from 7B to about 70B models. Downloading these parameters to your local PC is where it all begins.

While it's possible to use a CPU (though incredibly time-consuming), it's more common to use a GPU, so check whether your PC has an NVIDIA GPU. Here, the GPU's VRAM (video RAM) becomes crucial: the more VRAM you have, the larger the models you can handle. For a 7B model, about 16GB of VRAM is suitable, but that is already on the larger end for GPUs commonly available in stores. Consumer GPUs with that much VRAM are essentially limited to the RTX 3090 and RTX 4090, each with 24GB. Professional workstation GPUs come in 16GB or 48GB configurations, but the GPU alone can cost from several thousand up to tens of thousands of dollars.
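As a rough rule of thumb (my own back-of-the-envelope estimate, not a figure from any model card), you can estimate the memory needed for the weights alone by multiplying the parameter count by the bytes per parameter:

# Rough estimate of memory for the weights alone, assuming float16 (2 bytes per parameter).
# Activations and the KV cache need additional memory on top of this.
params = 7e9            # a 7B model
bytes_per_param = 2     # float16
print(round(params * bytes_per_param / 1024**3, 1), "GiB")  # about 13 GiB of weights

That is why roughly 16GB of VRAM is a comfortable target for running a 7B model in float16, and why quantized variants (8-bit or 4-bit) fit into far less.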

Given what's actually on the market, the realistic choices come down to a 24GB card or a 12GB card. The RTX 3060, with 12GB of VRAM, is a more reasonably priced option worth considering.

Image source: https://amzn.asia/d/4UwL4N6

There was also a question about how to physically install a GPU. Normally, desktop PCs, especially custom-built ones, have a motherboard with PCIe slots, and the GPU is inserted into one of them.

Image source: https://amzn.asia/d/2IEo6Zs

This is the motherboard, the foundation of the system; the PCIe slots are visible in the lower left. PCIe comes in versions such as 3.0, 4.0, and 5.0, with higher numbers indicating newer, faster standards. GPUs also support specific PCIe versions, so if you buy a motherboard with PCIe 5.0, check whether the GPU supports it as well.

In the lower left, the silver slot with a lever on its right is a PCIe slot, and below it is another slot with a similar lever. The lever releases the GPU when you need to remove it. Modern GPUs are quite thick, so if you want to install more than one, PCIe slots placed too close together can be a problem; the spacing between slots also matters.

Motherboards that can accommodate two or more GPUs are quite common, and even if a single GPU has only 12GB of VRAM, installing two effectively doubles the available VRAM to 24GB. It's worth checking whether your system supports multiple GPUs, since spending a bit more here can be an effective way to increase VRAM.
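As a quick way to see how much VRAM your machine actually exposes (a minimal sketch that assumes PyTorch with CUDA support is already installed; installation is covered below), you can list the visible GPUs from Python:

import torch

# Print each visible CUDA device and its total VRAM.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024**3, 1), "GiB")

With device_map="auto", as used in the example later in this article, the accelerate library can spread a large model across all of the devices listed here.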

As for the CPU, there are no particular restrictions; either AMD or Intel is fine, and you simply choose one compatible with your motherboard. Main memory is the system's working area, and if you have a lot of VRAM it's advisable to have plenty of main memory as well. Memory modules come in sizes like 16GB or 32GB, so buy the number of modules you need. Consumer PCs usually have up to four memory slots, so install memory within that limit. Finally, purchase storage for installing the OS.

Regarding the OS (this also came up in the workshop), the general recommendation is Ubuntu, a Linux distribution. Ubuntu Desktop provides a GUI (graphical user interface), allowing easy operation with a mouse or trackpad, much like Windows or Mac. After installing Ubuntu, you'll need to install the drivers, the software that lets the system recognize and use the GPU.

Once all of this setup is complete, you're ready to use local LLMs. Python is typically used to operate them, and depending on preference you can also work in Jupyter Notebook, an interactive Python environment.

If Python is installed, you can easily set up an LLM using the package manager pip, which installs packages from PyPI.

The model containing the weights can be easily downloaded from Hugging Face. Typically, the library developed by Hugging Face called "transformers" is used.

pip install transformers accelerate

You can run this from the console, or from a Jupyter Notebook by prefixing it with an exclamation mark (!pip install transformers accelerate). Note that the example below also imports torch, so if PyTorch is not already installed, install it as well (for example, pip install torch). Once everything is installed, you can run the model as follows.

First, specify the model you want to use. This time, I chose Mistral-7B by mistralai and entered its Hugging Face repository ID in model_id.

That's nearly all there is to it. The prompt I want the model to answer goes in messages; this time, I'll ask for the capital of Japan.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Hugging Face repository ID of the model to download.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in float16; device_map="auto" places them on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Chat-style prompt. "日本の首都は?" means "What is the capital of Japan?"
messages = [
    {"role": "user", "content": "日本の首都は?"}
]

# Apply the model's chat template and move the token IDs to the GPU.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

# Generate up to 20 new tokens and decode them back into text.
outputs = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running this automatically starts the download. For this model, the weights are split into three files of roughly 5GB each, which are downloaded and then loaded. This time, I asked the question in Japanese.

The answer is,

[INST] 日本の首都は? [/INST] 日本の首都は東京です。

It responded correctly to the instruction. This isn't running on a cloud system over the internet but on my own PC.

Using a GPU is ideal, but it's also possible to run without one: reducing the model size allows inference to run on a CPU. In addition, there is a global surge in chips designed to run trained models (that is, perform inference) at high speed, and the recently introduced Groq has become particularly well known for its speed.
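Coming back to running without a big GPU, here is a minimal sketch (my own illustration, not the setup used above) of two common options with the same transformers API: loading the model on the CPU alone, or loading it in 4-bit quantized form, which needs the extra bitsandbytes package but fits a 7B model into a few GB of VRAM.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Option 1: CPU only. Without device_map the model stays on the CPU; float32 weights
# take roughly 28GB of RAM for a 7B model and generation is slow. Keep the inputs on
# the CPU as well (no .to("cuda")).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# Option 2: 4-bit quantization (requires a CUDA GPU and the bitsandbytes package);
# a 7B model then fits in roughly 4-5GB of VRAM.
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=BitsAndBytesConfig(load_in_4bit=True),
#     device_map="auto",
# )

The tokenizer and generate calls are the same as in the example above; only where the weights live changes.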

This has been a super beginner's guide to getting started with local LLMs. The performance of local models has improved significantly of late, and being able to run them freely on your own machine, without the constraints of paid cloud-based models, is quite appealing. I encourage you to try it. If you don't have a GPU, using a free service like Google Colab to test things out is also a convenient option.

© 2025, blueqat Inc. All rights reserved