Among large language models, there is an architecture that bundles together a multitude of "experts", known as MoE, or Mixture of Experts. Instead of tackling every problem with a single model, it delegates each question to the models that specialize in it and integrates their responses, which is somewhat akin to how humans divide up work. This time, we'll explore MoE.
In the previous post, we looked at building LLMs that run locally, so-called local LLMs. Once you can handle models from Python, you can experiment with and build all sorts of things. This time, the focus is MoE.
MoE stands for Mixture of Experts, a type of machine learning architecture also referred to as a mixed expert model. This model combines multiple 'expert' models (small neural networks or machine learning models) to more effectively address specific tasks or datasets.
The fundamental concept of MoE involves using a "gate" mechanism to determine which "expert" is best suited to handle a given input. The gate assigns weights to the outputs of each expert and combines them to make the final prediction. This allows each expert to specialize in learning particular aspects or patterns of the data, thereby enhancing the overall performance of the model.
MoE models can outperform traditional neural network models, especially on large datasets or complex tasks. However, they are more complex to design and implement and can incur higher computational costs. Their application is currently being explored in various fields, such as natural language processing (NLP) and computer vision.
That is the explanation according to GPT-4.
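To make the gating idea concrete, here is a minimal sketch of an MoE layer in PyTorch. The sizes, the number of experts, and the top-k routing below are arbitrary illustrative choices rather than any particular model's configuration: a linear gate scores the experts for each input, the best-scoring experts run, and their outputs are mixed by the softmaxed gate weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    # A minimal mixture-of-experts layer: a gate picks the top-k experts per input
    # and mixes their outputs with softmax-normalised weights.
    def __init__(self, dim=32, num_experts=4, top_k=2):
        super().__init__()
        # Each "expert" is just a small feed-forward network here
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # one score per expert
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, dim)
        scores = self.gate(x)                              # (batch, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1) # keep only the best-scoring experts
        weights = F.softmax(weights, dim=-1)               # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                         # naive loops for clarity, not efficiency
            for k in range(self.top_k):
                expert = self.experts[int(indices[b, k])]
                out[b] += weights[b, k] * expert(x[b])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(3, 32)).shape)  # torch.Size([3, 32])

In a real MoE Transformer the routing happens per token, and only the selected experts' weights take part in the computation, which is what keeps the compute per token low even though the total parameter count is large.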
Let's search for MoE models on Hugging Face, the foundation of local LLM work. MistralAI from France seems to be well known for MoE, and they have an 8x7B model here:
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
It is an MoE model built by combining eight 7B-scale experts. The mechanism replaces part of each Transformer block (the feed-forward layers) with several experts; a router decides which experts process each incoming token, and their outputs are then combined. Because the rest of the model is shared between experts, the total comes to about 47B parameters rather than a full 8 × 7B, but all of those parameters still have to be loaded into VRAM, so it needs considerably more memory than a single 7B model.
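You don't have to download the full weights to confirm the expert layout; pulling just the configuration is enough. The attribute names below are the ones I believe the Mixtral configuration in transformers exposes, so treat them as an assumption and check them against your installed version.

from transformers import AutoConfig

# Only config.json is downloaded here, not the ~47B parameters
config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
print(config.num_local_experts)    # number of experts per MoE block (expected: 8)
print(config.num_experts_per_tok)  # experts routed to per token (expected: 2)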
Using it is the same as with any other model:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Load the tokenizer and model; device_map="auto" spreads the weights across the available GPUs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# An example conversation for the model to continue
messages = [
{"role": "user", "content": "What is your favourite condiment?"},
{"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
{"role": "user", "content": "Do you have mayonnaise recipes?"}
]
# Format the conversation with the chat template, generate, and decode the reply
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
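If the roughly 47B parameters don't fit into your VRAM at full precision, quantized loading is one workaround. Here is a rough sketch using 4-bit quantization via bitsandbytes; the settings are my own illustrative choices, not a recommendation from the model card.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 4-bit weights cut the memory footprint to roughly a quarter (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

Generation then works exactly as in the snippet above; only the loading step changes.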
This time I won't go into the finer technical details such as the gate or token routing, but given the relatively moderate VRAM consumption and the generation speed, it seems easier to work with and train than larger dense models, so it's worth keeping in mind as an option. That's all.