I tried running the new language model Grok-1. NVIDIA H100 benchmark

Yuichiro Minato

2024/03/22 01:53

This is the model released by Elon Musk's xAI that everyone has been talking about.

https://grok.x.ai/

It seems to have 314 billion parameters.

The installation and setup were all done by following the instructions on GitHub.

Since it is much larger than Llama 2's 70B, getting it running may be somewhat challenging.

https://github.com/xai-org/grok-1

Clone the repository, move to the folder, and download the tensors from Hugging Face.

git clone https://github.com/xai-org/grok-1.git && cd grok-1
pip install huggingface_hub[hf_transfer]
huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir checkpoints --local-dir-use-symlinks False
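The same download can also be kicked off from Python via the huggingface_hub API, which is convenient inside a notebook. This is just a sketch of my own; the repo ID and file pattern are the same ones used by the CLI command above.

from huggingface_hub import snapshot_download

# Download only the ckpt-0 weights into ./checkpoints, same as the CLI command above.
snapshot_download(
    repo_id="xai-org/grok-1",
    repo_type="model",
    allow_patterns="ckpt-0/*",
    local_dir="checkpoints",
)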

The tensor files are quite large and it took me several tries. On a slow connection the download alone took about 45 minutes; on a faster connection I was eventually able to download everything in about 10 minutes. The checkpoint consumes roughly 300 GB, so you need to allocate plenty of disk space. I secured 1 TB, though 500 GB would probably be sufficient.
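As a small sanity check of my own (not part of the repo's instructions), you can confirm from Python that the target disk has enough free space before starting the download:

import shutil

# Make sure there is comfortably more than ~300 GB free for the ckpt-0 weights.
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free space: {free_gb:.0f} GB")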

Once the download was finished, I created a Python virtual environment and installed the necessary tools.

It seems to be using JAX.

python3 -m venv virt
source virt/bin/activate

Install the necessary packages.

pip install -r requirements.txt

And then execute.

When I first ran it with 4 GPUs, I immediately encountered an error related to the number of devices.

https://github.com/xai-org/grok-1/issues/38

It seems that many people have encountered the same issue. Although the code uses JAX, if you want to run on GPUs you need to install the CUDA-enabled JAX build; otherwise, setting the device mesh to (1, 1) makes it fall back to the CPU.

Since I have GPUs, I would like to run it on the GPU.

pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

So, I installed the CUDA 12 compatible jaxlib.
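To double-check that the CUDA build is actually being picked up (a verification step of my own, not from the repo), you can ask JAX which backend and devices it sees:

import jax

print(jax.default_backend())  # should print "gpu" with the CUDA 12 build
print(jax.device_count())     # e.g. 8 on an 8x H100 node
print(jax.devices())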

Initially I set the device mesh to (1, 1), which runs on the CPU and takes an incredibly long time. It was taking so long that I switched to the GPUs.

Running on the CPU does not require VRAM, but it does require a sufficient amount of system RAM: since the tensor files total about 300 GB, you need more than that.

Similarly, for GPU execution you would need more than 300 GB of VRAM in total.

Therefore, running Grok comfortably requires considerable hardware. The A100 80G and the H100 each have 80 GB of VRAM, so eight of them give a total of 640 GB, which should definitely be enough.
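A rough back-of-the-envelope check (my own numbers, assuming roughly one byte per parameter for the 8-bit quantized weights and ignoring activations and the KV cache):

# 314B parameters at ~1 byte each for 8-bit quantized weights.
params = 314e9
weights_gb = params / 1e9          # ~314 GB for the weights alone
total_vram_gb = 8 * 80             # eight A100 80G / H100 cards
print(weights_gb, total_vram_gb)   # ~314 GB needed vs. 640 GB available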

It appears that the standard assumption is to use eight GPUs. However, you can specify the number of GPUs in the device settings. It's probably a good idea to refer to the discussion board for more details.
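The relevant knob is the local_mesh_config argument in run.py (it also appears in the full configuration listed further down). How it should be set for other GPU counts is my assumption based on the linked issue, not something I verified:

# Device mesh settings from run.py (single host).
local_mesh_config = (1, 8)       # shard the model over the 8 local GPUs
between_hosts_config = (1, 1)    # no multi-host sharding
# With 4 GPUs, (1, 4) is presumably the setting; (1, 1) falls back to CPU.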

Once jaxlib is installed and the GPUs are visible, you can simply run run.py. For convenience, however, I converted it to a Jupyter notebook and ran it from there.

After trying various things, it clearly requires considerable machine resources, so I stopped agonizing over it and just ran it on eight H100 GPUs.

import logging

from model import LanguageModelConfig, TransformerConfig, QuantizedWeight8bit as QW8Bit
from runners import InferenceRunner, ModelRunner, sample_from_model

CKPT_PATH = "./checkpoints/"

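# The configuration below mirrors run.py from the grok-1 repo:
# a 64-layer MoE transformer with 8 experts (2 selected per token),
# 48 query heads / 8 KV heads, an 8192-token context, and 8-bit quantized weights.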
grok_1_model = LanguageModelConfig(
    vocab_size=128 * 1024,
    pad_token=0,
    eos_token=2,
    sequence_len=8192,
    embedding_init_scale=1.0,
    output_multiplier_scale=0.5773502691896257,
    embedding_multiplier_scale=78.38367176906169,
    model=TransformerConfig(
      emb_size=48 * 128,
      widening_factor=8,
      key_size=128,
      num_q_heads=48,
      num_kv_heads=8,
      num_layers=64,
      attn_output_multiplier=0.08838834764831845,
      shard_activations=True,
      # MoE.
      num_experts=8,
      num_selected_experts=2,
      # Activation sharding.
      data_axis="data",
      model_axis="model",
    ),
)

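# The runner shards the model across a local mesh of (1, 8), i.e. the 8 GPUs
# on this host; bs_per_device=0.125 presumably corresponds to a total batch
# size of 1 spread over the 8 devices.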
inference_runner = InferenceRunner(
    pad_sizes=(1024,),
    runner=ModelRunner(
      model=grok_1_model,
      bs_per_device=0.125,
      checkpoint_path=CKPT_PATH,
    ),
    name="local",
    load=CKPT_PATH,
    tokenizer_path="./tokenizer.model",
    local_mesh_config=(1, 8),
    between_hosts_config=(1, 1),
)

First, we need to initialize the runner.

inference_runner.initialize()

This apparently takes a long time in general, but the H100s made it fast:

1min 37s

gen = inference_runner.run()

65.6 µs

Once you've reached this point, the next step is to process a prompt. I used the one included by default.

inp = "The answer to life the universe and everything is of course"
print(f"Output for prompt: {inp}", sample_from_model(gen, inp, max_len=100, temperature=0.01))

Output for prompt: The answer to life the universe and everything is of course 42.

The answer to the question of how to get a job in the games industry is not so simple.

I have been asked this question many times over the years and I have always struggled to give a good answer.

I have been in the games industry for over 20 years and I have seen many people come and go. I have seen people with no experience get jobs and I have seen people with years of experience get passed over.

There is no one answer

1min 52s

The first run took some time, but subsequent runs were quite fast.

inp = "what is quantum computer?"
print(f"Output for prompt: {inp}", sample_from_model(gen, inp, max_len=100, temperature=0.01))

Output for prompt: what is quantum computer?

Quantum computers are machines that use the properties of quantum physics to store data and perform computations. This can be extremely advantageous for certain tasks where they could vastly outperform even our best supercomputers.

Classical computers, which include smartphones and laptops, encode information in binary “bits” that can either be 0s or 1s. In a quantum computer, the basic unit of memory is a quantum bit or qubit.

Qubits are made using physical systems, such as the spin of an electron

It took 10.7s

Next, I tried some questions in Japanese.

inp = "日本の首都は?"
print(sample_from_model(gen, inp, max_len=100, temperature=0.01))

⇒ 東京 ⇒ 答えは「東京」です。 ※正解の答えが複数ある場合は、そのうちの一つを答えとして採用しています。 ※正解の答えが複数ある場合は、そのうちの一つを答えとして採用しています。 ※正解の答

(The prompt asks "What is the capital of Japan?"; the model answers "⇒ Tokyo ⇒ The answer is 'Tokyo'." and then keeps repeating a note, "* If there are multiple correct answers, one of them is adopted as the answer," until the output is cut off.)

inp = "トマトソースパスタのレシピを教えて?"
print(sample_from_model(gen, inp, max_len=100, temperature=0.01))

材料はトマトとパスタとオリーブオイルと塩と砂糖と黒胡椒とバジルとにんにくとオニオンとチーズと赤ワインとオリーブと鶏肉とベーコンと牛肉と豚肉と魚介…

(The prompt asks for a tomato sauce pasta recipe; the model answers "The ingredients are tomatoes, pasta, olive oil, salt, sugar, black pepper, basil, garlic, onion, cheese, red wine, olives, chicken, bacon, beef, pork, seafood…" before the output is cut off.)

Then I also asked it to write quantum computing code.

inp = "Write a quantum teleportation on qiskit."
print(sample_from_model(gen, inp, max_len=100, temperature=0.01))

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import edward2 as ed

tfd = tfp.distributions
tfb = tfp.bijectors

from math import pi, sqrt
from edward2.interceptor import get_interceptor

Overall, the training doesn't seem to have been refined in detail yet, so it feels like the model was released as-is for now. The English output could potentially turn out well. I'm looking forward to seeing other people's trials. That's all.
