Hello! So you want to build a machine for running an LLM cost-effectively? I'll share a recommended configuration with you.
LLM stands for Large Language Model, a type of AI that generates text, like ChatGPT. Services like ChatGPT can be used for free, but for professional applications there are significant concerns, including data leaks. So here are my recommended components for building a local LLM setup.
First, to build a machine for running an LLM, you need a motherboard, CPU, memory, SSD, power supply, case, and GPU. Don't worry about the software or OS; all of it can be obtained for free.
Start by assembling the motherboard, CPU, memory, SSD, and power supply in the case. A standard configuration is fine for these components.
The crucial part is the GPU, since the amount of VRAM largely determines which LLMs you can run. A commonly used model size is around 7 billion parameters, which requires about 14GB of VRAM when loaded in half precision (fp16), since each parameter takes 2 bytes.
Therefore, a popular and cost-effective choice is likely the NVIDIA RTX 3060 12GB, which sells for about 40,000-50,000 yen each in Japan. Buying two of these provides a total of 24GB of VRAM for about 80,000-100,000 yen. When running LLMs, the model can be split across multiple cards so the combined VRAM is effectively utilized, so this setup should work. Machines with a 3090 or 4090 can cost about 200,000-300,000 yen, so this is a significant saving.
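As a quick sanity check on that 14GB figure, here is a rough back-of-the-envelope calculation (weights only; activations and the KV cache need extra headroom, which is part of why the 24GB from two cards is comfortable):
# Rough VRAM estimate for a 7B-parameter model in half precision (fp16 = 2 bytes per parameter).
# Weights only; activations and the KV cache need additional headroom on top of this.
params = 7_000_000_000
bytes_per_param = 2
print(f"Weights alone: ~{params * bytes_per_param / 1024**3:.1f} GiB")  # ~13.0 GiB, i.e. roughly 14 GB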
For the OS, install Ubuntu, which is free.
All of the software is also free and is managed through Python; you'll install the necessary packages as you go.
I think you can put this entire setup together for quite a low price.
To get started, installing 'transformers' and 'accelerate' should be sufficient.
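For reference, something like the following should be enough to get going (assuming a CUDA-enabled build of PyTorch has already been installed following the official instructions):
pip install transformers accelerate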
Here is the script I ran:
from transformers import AutoTokenizer, pipeline
import torch
import time

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate split the model across both GPUs;
# float16 keeps the 7B weights at roughly 14GB.
pipe = pipeline("text-generation", model=model_id, tokenizer=tokenizer, device_map="auto", max_new_tokens=300, torch_dtype=torch.float16)

query = '量子コンピュータとは?'  # "What is a quantum computer?"
start = time.time()
answer = pipe(query)
print(time.time() - start)  # total generation time in seconds
print(answer)
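Incidentally, if you want to confirm that the model really is being split across both 3060s, you can inspect the device map that accelerate chose and how much memory each card is holding (a quick check, assuming the pipe object above has already been created):
print(torch.cuda.device_count())                 # should report 2 on a dual-3060 machine
print(pipe.model.hf_device_map)                  # which layers landed on which GPU
print(torch.cuda.memory_allocated(0) / 1024**3)  # GiB allocated on GPU 0
print(torch.cuda.memory_allocated(1) / 1024**3)  # GiB allocated on GPU 1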
I was curious, so I benchmarked the script on three different GPU configurations:
- Two RTX 3060 cards: 16 seconds
- One RTX 3090 card: 7 seconds
- One RTX 4090 card: 6 seconds
All of these GPUs are readily available in stores, and among them the most cost-effective option is likely the pair of RTX 3060 cards. Generation does take noticeably longer on them, though: roughly 2-3 times as long as on a 3090 or 4090, which sounds like a significant slowdown. Even so, I felt it stayed just within the bearable range.
In other words, two RTX 3060 cards make for a remarkably affordable setup for running LLMs that would typically be run on more powerful GPUs like the RTX 3090 or 4090. (Of course, performance will vary depending on the rest of the machine's configuration.)
If the speed bothers you, there is a technique in generative AI called 'text streaming', which outputs text as it is generated instead of waiting for the full response.
The benchmarks above measured the time until all generation had completed, but people read text sequentially from the beginning anyway.
By printing each chunk of text as soon as it becomes available, you can significantly improve the perceived speed of the system.
from transformers import AutoTokenizer, pipeline, TextStreamer
import torch
import time

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# TextStreamer prints tokens to stdout as they are generated;
# skip_prompt=True avoids echoing the input prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)

pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # split the model across both GPUs
    max_new_tokens=512,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    streamer=streamer,          # stream tokens as they are produced
    eos_token_id=tokenizer.eos_token_id,
)

query = '量子コンピュータとは?'  # "What is a quantum computer?"
start = time.time()
answer = pipe(query)
print(time.time() - start)  # total time; the streamed text starts appearing much sooner
print(answer)
Even with streaming, the dual 3060 setup is clearly slower than a 3090 or 4090, but to me it felt just on the edge of being bearable.
Personally, I think that if you're constrained by budget but want to work with LLMs, a dual 3060 setup might just be viable. I would love to hear from anyone who tries this setup.