I have just run Llama-3.2-90B-Vision-Instruct on a server with the latest AMD MI300X using only one GPU

Yuichiro Minato

2024/09/26 14:42

Hello!

I had the opportunity to work with the latest GPUs for a project between Supermicro and one of our shareholders, Shinden Hightex!

Supermicro
https://www.supermicro.com/ja/

Shinden Hightex
https://www.shinden.co.jp/

And the star of the show this time is...

AMD Instinct™ MI300X
https://www.amd.com/ja/products/accelerators/instinct/mi300/mi300x.html

This accelerator from AMD packs an incredible 192GB of VRAM. What's more, the configurations that Supermicro and AMD offer for the Japanese market come with 8 GPUs, for a total of 1.5TB of VRAM in a single machine!

Since there are very few companies in Japan with applications that utilize AMD GPUs, we were approached to run LLM, RAG, and quantum computing software on this machine.

First, let's talk about AMD's MI300X, which is their latest GPU release. It has been generating buzz in certain circles, as it is said to be about 1.3 times faster than NVIDIA's H100 and boasts an overwhelming 192GB of VRAM.

Personally, I'm more used to working with NVIDIA machines, so AMD is still unfamiliar territory for me. However, PyTorch has recently added support for AMD's ROCm (the counterpart to NVIDIA's CUDA), and it runs relatively smoothly. This time, I tested whether LLM, RAG, and quantum computing workloads would all run properly.
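
As a quick sanity check (my own minimal sketch, not part of the original setup notes), the ROCm build of PyTorch exposes AMD GPUs through the familiar torch.cuda API, so the usual device checks carry over unchanged:

import torch

# On the ROCm build of PyTorch, AMD GPUs show up through the torch.cuda API.
print(torch.cuda.is_available())      # True if the ROCm runtime sees a GPU
print(torch.cuda.device_count())      # e.g. 8 on the MI300X machine
print(torch.cuda.get_device_name(0))  # should report the MI300X
print(torch.version.hip)              # ROCm/HIP version string (None on CUDA builds)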

Normally, I use models like Mistral, Llama3, and Gemma2, but while I was on the train on the way here, I saw the news that Llama 3.2 had been released!

Meta releases Llama 3.2 with improved image recognition performance and a small version optimized for smartphones
https://gigazine.net/news/20240926-llama-3-2/#google_vignette

Without thinking too much, I first set up Mistral 7B, used Faiss as the vector DB for RAG, and picked an embedding model more or less at random. Everything ran smoothly and quickly, with no issues. I plan to run some benchmark comparisons, but I haven't gotten to that yet.
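
For context, the RAG part had the usual shape: embed documents, index them in Faiss, and search by vector similarity. Here is a minimal sketch of that pipeline; the embedding model below is just an example, not necessarily the one used in this test:

import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "The AMD Instinct MI300X has 192GB of VRAM per GPU.",
    "PyTorch supports AMD GPUs through ROCm.",
]

# Placeholder embedding model; any sentence-embedding model works the same way.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(docs, normalize_embeddings=True)

# Inner-product index over normalized vectors = cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = embedder.encode(["How much VRAM does the MI300X have?"], normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(docs[ids[0][0]], scores[0][0])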

That said, I couldn't resist trying out Llama 3.2.

Given that the AMD MI300X has 192GB of VRAM, I thought it might be possible to fit the 90B model onto a single GPU, so I decided to give it a shot with the following model:

meta-llama/Llama-3.2-90B-Vision-Instruct
https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct

Note that pre-approval is required to access this model, so it's a good idea to apply early. It has 90 billion parameters, roughly 10 times larger than the 7-9B models I usually work with. But! Since it seemed like it could fit on a single GPU, I gave it a try. Be prepared for a long download: the files total around 180GB, so it took a while. It's hard to believe this could all fit into VRAM at once...
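
If you want to get that download out of the way ahead of time, pre-fetching the weights with huggingface_hub is one option (a rough sketch, assuming your access request has already been approved):

from huggingface_hub import login, snapshot_download

# Authenticate with a Hugging Face token that has access to the gated repo
# (or run `huggingface-cli login` once beforehand).
login(token="hf_...")  # placeholder token

# Downloads the full ~180GB snapshot into the local Hugging Face cache,
# so later from_pretrained() calls load from disk instead of the network.
snapshot_download("meta-llama/Llama-3.2-90B-Vision-Instruct")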

After updating Transformers and downloading the model from HuggingFace, I ran the following sample code:

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

# Load the 90B vision model in bfloat16; device_map="auto" lets Accelerate
# place the weights on the available GPU(s).
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Sample image from the Hugging Face documentation dataset.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# A chat-style prompt that pairs the image with a text instruction.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

# Generate up to 30 new tokens and print the result.
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))

It seems that the vision model has been built on top of the Llama 3.1 text model, and the example code generates a haiku based on an image of Peter Rabbit.

When running this on AMD GPUs, the only adjustment needed is to install the ROCm-compatible version of PyTorch:

Installing PyTorch for ROCm
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html

After that, the same code used for NVIDIA runs just fine with no errors.
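
For reference, that install step generally takes the following shape; the ROCm version in the wheel index is a guess on my part and should match whatever the machine actually runs:

# Replace the default CUDA wheels with the ROCm ones (adjust rocm6.0 to your ROCm version).
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0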

When I ran it, there was plenty of VRAM to spare, so the model worked without any quantization or other memory-saving tricks. However, you do need to make sure the device is set correctly, or parts of the model will be offloaded to the CPU.
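
Building on the sample code above, a couple of quick checks make it obvious whether anything was silently offloaded (a sketch of what I mean, not an exact transcript of the session):

# Where did device_map="auto" actually put the layers?
print(model.hf_device_map)

# The parameters should report a GPU device, not "cpu".
print(next(model.parameters()).device)

# And the inputs have to live on the same device as the model.
inputs = inputs.to(model.device)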

It produced some kind of haiku in English that reminded me of something out of Harry Potter, though I'm not entirely sure what it was about. Execution time varied from around 3 seconds up to about 20 seconds, depending on the max_new_tokens setting.

The sample code uses a fairly low max_new_tokens value, so I raised it a bit to experiment. Even when I asked for a haiku in Japanese, it worked without any noticeable issues. (Unfortunately, since I was testing on a borrowed machine, I can't paste the exact output here, so you'll have to take my word for the general feel of it.)
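
For the curious, that variation looks roughly like this, reusing processor, image, and model from the sample above; the Japanese prompt text here is my own stand-in, not the exact one typed at the time:

# Ask for a haiku in Japanese and allow a larger generation budget than the sample's 30 tokens.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "この画像を題材に俳句を詠んでください。"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))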

In this case the machine had 8 GPUs, but even the 90B model ran on a single one. Device handling is the same as with CUDA (e.g. cuda:0), and with device_map set to "auto" placement happened automatically, although that made it harder to tell exactly which GPU was actually in use. The speed fluctuated slightly, but only by a few seconds, so it wasn't a major difference overall.
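
If you do want to pin everything to one specific GPU rather than leaving it to device_map="auto", two approaches work (a sketch reusing model_id and the imports from the sample above):

# Option 1: restrict visibility before launching Python. HIP_VISIBLE_DEVICES is
# ROCm's counterpart to CUDA_VISIBLE_DEVICES:
#   HIP_VISIBLE_DEVICES=0 python run_llama.py   # run_llama.py is a hypothetical script name
#
# Option 2: place every module explicitly on GPU 0 (possible here because the
# whole 90B model fits in a single MI300X's 192GB).
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)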

In conclusion, downloading the weights over the network takes a lot of time, and fully utilizing the 1.5TB of VRAM on an 8-GPU AMD MI300X system is quite a challenge. In fields like HPC, where scaling the problem size and computation load is more straightforward, you might be able to fully utilize it. But for things like LLMs, where the model size is fixed and loaded into VRAM, 192GB per GPU is already an incredibly high spec, close to the best performance available right now. Utilizing it to its fullest potential would likely require advanced use cases like training, or it might be something meant to be shared by multiple users.

Given the relatively reasonable price (in industry terms), I think it could be worth considering for companies looking to catch up in the GPU space (after thorough testing).

For inquiries, the contact point is Shinden Hightex:
https://www.shinden.co.jp/

It was an amazing experience getting to work with cutting-edge hardware, and I'll keep pushing to fully utilize its potential.
