Super Beginner's Guide to Local Image Generation AI

Yuichiro Minato

2024/04/11 05:07

#Featured

What is Image Generation AI?

Image generation AI lets you create the images you want from text or from other images. Given base material or keywords, it generates the desired output through processes such as txt2img (text-to-image) or img2img (image-to-image).

There are already several famous services for image generation AI, which typically run computations on the cloud rather than locally, making them easy to use. For example, OpenAI's DALL-E, integrated with ChatGPT, can create images from text.

DALL-E-3

https://openai.com/dall-e-3

Midjourney, primarily utilizing a chat service known as Discord, allows you to create images by inputting text and offers some ability to adjust images post-creation.

Midjourney

https://www.midjourney.com/

However, while these services can easily create images from text, fine control is limited, and only broad adjustments are possible. For more detailed image control and creation, programs can be installed and utilized on local or cloud machines.

A prime example is Stable Diffusion, which can be easily used by installing it on a local computer.

Why Quantum Computing and Generative AI?

Our company originally developed software for quantum computers, so why are we involved in generative AI? Let me briefly introduce our journey. We initially used GPUs for quantum computer simulations, building machine-learning software on top of cuQuantum. cuQuantum, launched by NVIDIA in 2021, has become a de facto global standard framework that allows many software packages to run quickly.

cuQuantum

https://developer.nvidia.com/cuquantum-sdk

Our software is listed among cuQuantum's official frameworks, but what has developed rapidly in recent years is the tensor network technology inside cuQuantum. Tensor network simulation, an advance over traditional quantum computer simulators, exploits similarities to neural networks. Google's TensorFlow, for instance, performs its calculations on elements called tensors. Quantum computing has begun to forge new paths using tensors, and this turns out to be highly compatible with generative AI.

Several quantum companies have been focusing on generative AI built on tensor network technology, including Terra Quantum and Multiverse in Europe and Zapata Computing in the United States. Zapata Computing even pivoted into a generative AI company and went public in the United States. A part of quantum computing and generative AI therefore share excellent compatibility. Representing Japan, we are advancing tensor network technology, with engineers from physics backgrounds around the world developing models that make use of tensors.

The Mathematical Model of Image Generation AI

Image generation AI is built on diffusion models. These models learn to generate images efficiently by gradually turning noise back into an image. The denoising process primarily uses a neural network architecture known as U-Net. Recently, alternatives such as the Diffusion Transformer, which replaces the U-Net with a Transformer, have emerged, but U-Net remains widely used.

U-Net features a U-shaped structure, progressing through downsampling and upsampling steps using convolution.
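To make the U-shape concrete, here is a minimal toy sketch in PyTorch. This is an illustration only: a real Stable Diffusion U-Net also has attention blocks, time-step embeddings, and text conditioning, and it works on latent representations rather than raw pixels.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    # A toy U-Net: convolutional downsampling, upsampling back, and a skip connection
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Conv2d(ch * 2, 3, 3, padding=1)  # predicts the noise to remove

    def forward(self, x):
        d1 = self.down1(x)   # high-resolution features
        d2 = self.down2(d1)  # downsampled features (the bottom of the "U")
        u1 = self.up1(d2)    # upsample back to the original resolution
        return self.out(torch.cat([u1, d1], dim=1))  # skip connection joins the two paths

noise_pred = TinyUNet()(torch.randn(1, 3, 64, 64))  # (batch, channels, height, width)
print(noise_pred.shape)  # torch.Size([1, 3, 64, 64])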

Extensions of diffusion models known as consistency models have also appeared recently, but they rely on the same fundamental principles, so this introductory guide won't go into them.

Relation to Video Models

The models discussed here are still-image models, designed to create a still image from text. Video models are built on top of these still-image models, with much of the effort going into keeping the individual frames consistent and continuous. This came up as a question in our study group, so I'm adding it briefly here.

Model Developments

Earlier models struggled both with establishing a workflow and with achieving quality. Recent advances have secured the quality side, making it relatively easy to generate the images you want.

A crucial element in image generation AI is the checkpoint of the original model, which consists of blocks of numeric data called tensors. Checkpoints are stored in files with extensions such as .ckpt or, more recently, .safetensors, and contain the numeric weights used to turn noise back into an image. There are many creators worldwide, and checkpoints are easily obtained from several platforms. Notable sites for obtaining these checkpoint models include:

Hugging Face

https://huggingface.co/

CivitAI

https://civitai.com/

For instance, details about Stable Diffusion v1.5, distributed by runwayml, can be found at:

https://huggingface.co/runwayml/stable-diffusion-v1-5

The checkpoint model files, such as v1-5-pruned-emaonly.safetensors, can be viewed under the files tab of the repository:

https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main

In image generation AI, these checkpoint files are the single most important files.
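As a small illustration of what such a file contains, the safetensors package (installed separately; this is just a sketch, and the path below is a placeholder for a checkpoint you have downloaded) can open one and list the tensors stored inside:

from safetensors import safe_open

# Placeholder path: point this at a downloaded checkpoint such as v1-5-pruned-emaonly.safetensors
path = "v1-5-pruned-emaonly.safetensors"

with safe_open(path, framework="pt") as f:
    for name in list(f.keys())[:5]:   # show only the first few tensor names
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)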

Using Image Generation AI in a Python Jupyter Notebook

With a checkpoint file, you can freely create images using image generation AI. The file size, at around 7GB, is relatively small compared to large language models, which means less machine power is required.

Image generation AI, dealing with images, works well with a GUI (Graphical User Interface), which allows for interaction through a browser using a mouse or trackpad. Recently, there have been advancements in professional-level adjustments, but we'll first explore using pure Python code for image generation.

A framework called "diffusers" is commonly used for loading image generation models and generating images. The code begins by installing "diffusers", which is easily done from PyPI using pip.

(Advertisement: in addition to quantum computing, blueqat offers a paid research and development platform for businesses that makes it easy to use generative AI.)

pip install diffusers transformers

That completes the installation. Usage is simple:

from diffusers import AutoPipelineForText2Image
import torch

# Load the Stable Diffusion v1.5 weights from Hugging Face and move the pipeline to the GPU
pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
# Generate one image from a text prompt using 25 denoising (inference) steps
image = pipeline("cat and cherry blossom", num_inference_steps=25).images[0]
image

The calculation can be completed in just a few lines. The initial noise used to restore the image affects the outcome, so the images generated can vary. To produce the same image every time, it's necessary to fix the random values, known as the seed, used to create the noise.
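For example, with diffusers the seed can be fixed by passing a torch.Generator to the pipeline call. This is a short sketch that reuses the pipeline created above; the seed value 42 is arbitrary.

import torch

# A fixed seed makes the initial noise, and therefore the generated image, reproducible
generator = torch.Generator("cuda").manual_seed(42)
image = pipeline("cat and cherry blossom", num_inference_steps=25, generator=generator).images[0]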

Creating a pipeline takes care of loading the tensor weights and preparing everything for image generation. The `.from_pretrained` method points to the model's location on Hugging Face, using the name of the model's repository there. This time, I specified

runwayml/stable-diffusion-v1-5

Afterward, I entered a piece of text called a prompt and set the number of inference steps to 25. This is the number of steps used to turn the noise back into an image; more steps generally improve the rendering and the quality of the image.

Since it's spring, I tried entering "a cat and cherry blossoms."

The time it took was:

25/25 [00:10<00:00, 2.36it/s]

Using a GPU, it took about 10 seconds for 25 steps. These 25 steps correspond to the diffusion steps required to restore the image from noise.

To save the image:

image.save('neko.png')

That's all it takes. Everything ran easily in a Jupyter Notebook and can be put to use for research and development right away. After the workshop, someone pointed out that being able to run inference on a CPU, even without a GPU, is an advantage.

The use of `.to("cuda")` specifies the use of a GPU, as utilizing a GPU is faster. However, calculations can also be performed on a CPU. In that case, it would take slightly less than 10 minutes, which is considerably longer compared to the roughly 10 seconds on a GPU. If a GPU is not available, using a CPU for calculations is still considered viable.

When setting up the pipeline, removing the cuda specification switches the computation to the CPU:

pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5")

With `.to("cuda")` removed, the same calculation runs on the CPU.

Using a GUI

While implementation in Python, centered around Jupyter, is feasible, dealing with images often makes a GUI more convenient. There are several GUIs available for Stable Diffusion that allow for operation via a screen interface. The most famous is called Stable Diffusion Web UI, with the Automatic1111 version being particularly well-known.

stable-diffusion-webui

https://github.com/AUTOMATIC1111/stable-diffusion-webui

The UI is quite complex, but in practice I believe it is mainly used through either the txt2img or img2img tab.

In the upper left, you can select a checkpoint. Uploading or downloading checkpoints from this page isn't possible, so it needs to be done separately, such as via SSH. There's a folder named "models," within which there is a folder for Stable Diffusion; that's where the checkpoints are stored. They are about 7GB in size.
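If you prefer doing this from Python rather than over SSH, one possible way is to download a checkpoint straight into that folder. This is a sketch, assuming the huggingface_hub package is installed and that the Web UI lives in a stable-diffusion-webui directory:

from huggingface_hub import hf_hub_download

# Adjust local_dir to wherever your stable-diffusion-webui installation actually lives
hf_hub_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    filename="v1-5-pruned-emaonly.safetensors",
    local_dir="stable-diffusion-webui/models/Stable-diffusion",
)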

You can use the pre-installed Stable Diffusion file, v1-5-pruned-emaonly.ckpt.

By entering the prompt in the text area and clicking the Generate button, the image will appear in the lower right.

In the lower left there are various settings, but the important ones are the width, the height, and the number of sampling steps; more sampling steps generally give higher quality.

Setting a fixed Seed value at the bottom lets you reproduce the same image, as long as the model and the image size stay the same.

This time, I won't go into the details of each value, but generally, specifying the prompt, image size, sampling steps (varies by model but around 5-20), CFG Scale (1-7 depending on the model), and Seed should be sufficient. Setting the Seed to -1 will choose a random seed for you.
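For reference, these same settings map onto arguments of the diffusers pipeline call when working from Python. This is a rough sketch; the values below are just examples.

import torch

# Rough diffusers counterparts of the Web UI settings discussed above
image = pipeline(
    "cat and cherry blossom",                              # prompt
    width=512, height=512,                                 # image size
    num_inference_steps=20,                                # sampling steps
    guidance_scale=7.0,                                    # CFG Scale
    generator=torch.Generator("cuda").manual_seed(1234),   # fixed Seed
).images[0]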

The Sampling Method may also enhance quality depending on the model. Check the settings for models downloaded from Hugging Face or CivitAI and match those recommendations.

What do you think? The setup is quite simple. Indeed, using a GUI simplifies the process.

In a future workshop, we may cover additional topics such as adjusting the drawing style by attaching something called a LoRA to the checkpoint, as well as how to create your own checkpoints.

Using diffusers or stable-diffusion-webui allows for easy image generation through CLI or GUI, so please give it a try.

That's all for now.
