Gigazine has detailed information.
Mistral AI suddenly announces new large-scale language model '8x22B MOE', with a context length of 65k and a parameter size of up to 176 billion
https://gigazine.net/gsc_news/en/20240410-mistral-8x22b-moe/
This time it is a Mixture of Experts (MoE) model that combines eight 22B experts: the model holds multiple expert sub-networks internally and routes each input to the appropriate experts for computation. The context length is 65k tokens, and the total parameter count is 176 billion.
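To illustrate the idea, here is a toy sketch of sparse MoE routing I wrote purely for explanation (not Mixtral's actual implementation; the layer sizes here are made up, and top-2 routing is what Mixtral reportedly uses):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    # Toy sparse MoE layer: a small router scores 8 expert MLPs and sends each
    # token to its top-2 experts, mixing their outputs by the router weights.
    def __init__(self, dim=32, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, dim)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

print(ToyMoE()(torch.randn(5, 32)).shape)  # torch.Size([5, 32])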
Officially, the files were distributed over BitTorrent (P2P), but it seems community members also uploaded them to Hugging Face, so I tried using that.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Community re-upload of the weights on Hugging Face
model_id = "mistral-community/Mixtral-8x22B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # loaded as-is, no quantization or dtype override

# Generate 20 new tokens from a short prompt
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
There were 59 files in total, about 260 GB. As usual, it's heavy... Lately the competition among openly released (OSS) models seems to be heating up, with Cohere's Command R+ and others. xAI's Grok needed 8 H100 GPUs, whereas this Mistral MoE model can apparently run on 4 H100s... Needing a stack of high-end GPUs, 8 or so, seems to have become the norm. I went ahead without setting up quantization or a lower-precision dtype because it's a hassle.
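For reference, if I had wanted to squeeze it into less memory, loading it in 4-bit with bitsandbytes would look roughly like this (a sketch I did not actually run here; it assumes bitsandbytes and accelerate are installed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistral-community/Mixtral-8x22B-v0.1"

# 4-bit NF4 quantization via bitsandbytes to cut memory use
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread the layers across whatever GPUs are available
)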
Its performance is reportedly above GPT-3.5, which is remarkable: like Cohere's model, it means we can run a sufficiently capable model locally. That said, it still feels mysterious that models of this size are routinely served for inference in the cloud, and I wonder what kind of operational setup that actually takes...
By default, the sample code sets max_new_tokens=20.
Hello my name is Katie and I am a 20 year old student at the University of Wisconsin-Madison
With that setting the output felt a bit short, so I increased the number of output tokens and asked in Japanese. It might just be my machine, but generating output takes a tremendous amount of time... 20 tokens was unsatisfying, yet I was hesitant to raise it by much.
I set max_new_tokens to 50 and asked. Since this is a base model rather than an instruct model, I phrased the prompt as a sentence to be continued:
text = "日本でおすすめの観光地は、"  # "Recommended tourist spots in Japan are,"
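For reference, the generation call is unchanged apart from the longer output length, reusing the model and tokenizer loaded above:

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The decoded output (prompt included) was: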
日本でおすすめの観光地は、京都です。京都は、日本の歴史的な都市で、多くの寺院や神社があります。京都では、桜の季節が最も
(Translation: "The recommended tourist spot in Japan is Kyoto. Kyoto is a historic Japanese city with many temples and shrines. In Kyoto, the cherry blossom season is the most...")
That looks pretty good. Still, it's heavy to run.
text = "おすすめのタイ料理は"  # "Recommended Thai dishes are"
おすすめのタイ料理は? ("Recommended Thai dishes are?")
What is the best Thai food?
What is the best Thai food?
Thai food is one of the most popular cuisines in the world. It is known for its spicy, sour
Maybe I'm not prompting it the right way... I clearly need to get more used to working with high-performance language models. I tried asking in English one last time.
text = "I am going to visit Japan. Where you recommend to visit in Japan during my stay?"
The place recommended during my stay in Japan is the “Kyoto International Manga Museum”.
The Kyoto International Manga Museum is a museum that collects and exhibits manga from all over the world.
The museum is located in the center of Kyoto, and
It seemed pretty good.