Internship lecture 1 : GBDT / PyTorch NN

Hello. blueqat occasionally takes on interns to help people acquire skills, find jobs, and start businesses. We develop quantum algorithms to solve social problems, and we often compare them with calculations on classical computers. Although the share of quantum work has grown considerably in recent years, we are still in a transitional period in which we need to compare quantum and classical approaches and master quantum-classical hybrids, so skills in both are required. This is the first lecture in the series.
1st lecture : GBDT / PyTorch NN
2nd lecture : Discrete Optimization (GA) / Continuous Optimization (scipy optimize / torch optim) / Bayesian Optimization (optuna)
3rd lecture : Quantum Annealing and QUBO / QAOA
4th lecture : Linear Regression and Qboost on QAOA
5th lecture : Quantum Neural Network
GBDT
GBDT stands for Gradient Boosting Decision Tree, an algorithm that combines gradient descent, boosting, and decision trees. Many references on classical models are available on the Internet, so I will not explain the theory in depth and will instead review how to use it.
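
To make the "gradient descent + boosting + decision tree" combination concrete, here is a minimal sketch of gradient boosting for squared error, written with scikit-learn's `DecisionTreeRegressor`. The toy data, the learning rate, and the number of rounds are assumptions chosen only for illustration. xgboost adds regularization and other refinements on top of this basic idea, so this is not how xgboost is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical, for illustration only): y = x^2
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2

learning_rate = 0.3  # assumed value
n_rounds = 50        # assumed value

# Start from a constant prediction (the mean of the targets).
pred = np.full_like(y, y.mean())
initial_mse = np.mean((y - pred) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the negative gradient with respect to the
    # current prediction is proportional to the residual.
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)
    # Boosting step: add a damped correction from the new small tree.
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - pred) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}")
print(f"MSE after {n_rounds} rounds: {final_mse:.4f}")
```

Each round fits a shallow tree to what the current ensemble still gets wrong, so the training error shrinks step by step.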

The following data summarizes properties near Tama Plaza Station on the Tokyu Denentoshi Line: walking time to the station (minutes), floor, area (square meters), age of the building (years), and rent (in units of 10,000 yen).
This time, we will use xgboost to predict the rent from the other parameters. numpy is a commonly used numerical library, and pandas is a commonly used data analysis library.

First, we will install the libraries we will use this time.

!pip install numpy pandas scikit-learn xgboost

import numpy as np
import pandas as pd

# data: each row is [walk (min), floor, area (m^2), age (years), rent (10,000 yen)]
data = np.array([
    [8, 3, 83.74, 18, 22], [5, 2, 75.72, 19, 19], [7, 2, 31.47, 46, 7.8],
    [18, 2, 46.62, 36, 8], [8, 3, 41.02, 49, 8], [8, 1, 70.43, 25, 15.8],
    [8, 1, 70.43, 25, 15.8], [12, 1, 48.02, 3, 12.5], [10, 4, 58.57, 36, 11.8],
])
df = pd.DataFrame(data, columns=['walk', 'floor', 'area', 'age', 'rent'])
df

   walk  floor   area   age  rent
0   8.0    3.0  83.74  18.0  22.0
1   5.0    2.0  75.72  19.0  19.0
2   7.0    2.0  31.47  46.0   7.8
3  18.0    2.0  46.62  36.0   8.0
4   8.0    3.0  41.02  49.0   8.0
5   8.0    1.0  70.43  25.0  15.8
6   8.0    1.0  70.43  25.0  15.8
7  12.0    1.0  48.02   3.0  12.5
8  10.0    4.0  58.57  36.0  11.8

We will train a model on this data to predict rents. First, we split the data into the rent column (the target) and all the other columns (the features).

# target: rent
y = df["rent"]

# features: every column except rent
X = df.drop(columns=["rent"])

Next, the data is further divided into training data and evaluation data. A model naturally scores well on the data it was trained on, so the data used for training and the data used to evaluate accuracy are usually kept separate. Here, we use scikit-learn, a machine learning library, to split the data into two parts.

test_size=0.2 means that 20% of the data is reserved for evaluation.

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
X_train

   walk  floor   area   age
6   8.0    1.0  70.43  25.0
7  12.0    1.0  48.02   3.0
1   5.0    2.0  75.72  19.0
0   8.0    3.0  83.74  18.0
4   8.0    3.0  41.02  49.0
3  18.0    2.0  46.62  36.0
5   8.0    1.0  70.43  25.0

This shows the features of the training rows only. The corresponding rents are in y_train, and the remaining rows are held out in X_val and y_val.
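
The following sketch illustrates why the split matters. It rebuilds the same nine rows as a plain numpy array, checks the sizes produced by test_size=0.2, and then fits a fully grown decision tree as a stand-in model (scikit-learn's `DecisionTreeRegressor`, not xgboost; this substitution is an assumption made to keep the example small). The tree reproduces its own training rents exactly, which is why the training error says little about how the model behaves on unseen properties.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same rows as above: walk, floor, area, age, rent (10,000 yen)
data = np.array([
    [8, 3, 83.74, 18, 22], [5, 2, 75.72, 19, 19], [7, 2, 31.47, 46, 7.8],
    [18, 2, 46.62, 36, 8], [8, 3, 41.02, 49, 8], [8, 1, 70.43, 25, 15.8],
    [8, 1, 70.43, 25, 15.8], [12, 1, 48.02, 3, 12.5], [10, 4, 58.57, 36, 11.8],
])
X, y = data[:, :4], data[:, 4]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(X_train.shape, X_val.shape)  # 7 training rows, 2 evaluation rows

# Stand-in model: a fully grown tree memorizes the training rows.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_mae = mean_absolute_error(y_train, tree.predict(X_train))
val_mae = mean_absolute_error(y_val, tree.predict(X_val))
print(f"training MAE:   {train_mae:.3f}")
print(f"evaluation MAE: {val_mae:.3f}")
```

With nine rows, 20% rounds up to two evaluation rows, so the evaluation error here is only a rough indication.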

Now that we have the data, we can start using xgboost. Import the library and create an untrained model object. Then call fit to train it on X_train and y_train.

# xgboost
import xgboost as xgb

model = xgb.XGBRegressor()
model.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

The model has now been trained. Let's try to make a prediction with some new data.
This time, we will try to predict the rent for a property that is a 9-minute walk from the station, on the 5th floor, 58.3 square meters in size, and 34 years old.

Let's store these values in X_test and pass them to predict.

X_test = np.array([[9,5,58.3,34]])

predictions = model.predict(X_test)
predictions[0]

12.527947

The prediction is about 12.5, that is, roughly 125,000 yen, since rent is expressed in units of 10,000 yen. The actual rent was 115,000 yen, so the estimate is not far off. The model was able to learn even from a small amount of data.

Next, let's look at which parameters were effective in the prediction.

xgb.plot_importance(model)

[Figure: feature importance plot; x-axis: F score, y-axis: Features]

The plot shows that features such as area, walking time, and floor contributed most to the prediction. The accuracy can be improved by adjusting the model's hyperparameters.
PyTorch NN
PyTorch is a well-known machine learning library, used mainly for neural networks, and it has recently started to appear often in quantum computing libraries. Here, we will perform a regression on the same data. First, install PyTorch.

!pip install torch

Next, we create the model. The neural network has four input values and one output value to predict, and this time the hidden layer has 12 nodes, giving a 4x12x1 network structure.
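
To show what a 4x12x1 structure computes, here is a small numpy sketch of the forward pass with a ReLU between the two layers. The weights are random placeholders rather than trained values, and the names `W1`, `b1`, `W2`, `b2`, and `forward` are introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized parameters (placeholders, not trained weights).
W1 = rng.normal(size=(4, 12))   # input layer -> hidden layer
b1 = np.zeros(12)
W2 = rng.normal(size=(12, 1))   # hidden layer -> output layer
b2 = np.zeros(1)

def forward(x):
    """Linear layer, ReLU, then a second linear layer."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return hidden @ W2 + b2

x = np.array([[9.0, 5.0, 58.3, 34.0]])  # walk, floor, area, age
out = forward(x)

n_params = W1.size + b1.size + W2.size + b2.size
print(out.shape)  # one row, one predicted value
print(n_params)   # 4*12 + 12 + 12*1 + 1 = 73
```

Training adjusts these 73 numbers so that the single output value approaches the rent.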

We use Adam as the algorithm that optimizes this model, and we specify a loss function, which measures the error between the values predicted from the input data and the actual target values. The data is converted from the DataFrame described earlier into tensors that PyTorch can read. The target data is a one-dimensional array, so we reshape it into a column with one value per row to match the shape of the model's output.

We run 100 training iterations.

import torch

model2 = torch.nn.Sequential()
model2.add_module('fc1', torch.nn.Linear(4, 12))
model2.add_module('relu', torch.nn.ReLU())
model2.add_module('fc2', torch.nn.Linear(12, 1))

#select optimizer
optimizer = torch.optim.Adam(model2.parameters(), lr=0.1)

#select loss function
lossf = torch.nn.MSELoss()

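# input data and target data as float tensors
inputs = torch.Tensor(X_train.values)
target = torch.Tensor([[i] for i in y_train.values])  # shape (7, 1), one rent per row

# train for 100 iterations
for _ in range(100):
    loss = lossf(model2(inputs), target)  # forward pass and MSE loss
    optimizer.zero_grad()                 # clear gradients from the previous step
    loss.backward()                       # backpropagate
    optimizer.step()                      # update the weights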
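
The reshaping of the target into a column is worth a closer look. The sketch below uses the seven training rents in the order of the training rows and a stand-in tensor of zeros in place of the model output (an assumption, so that no trained model is needed). It shows that a flat target of shape (7,) broadcasts against a (7, 1) prediction into a 7x7 grid, which would make the loss compare every prediction with every target.

```python
import torch

# Training rents in the order of the training rows (units of 10,000 yen).
y_values = [15.8, 12.5, 19.0, 22.0, 8.0, 8.0, 15.8]

flat = torch.tensor(y_values, dtype=torch.float32)  # shape (7,)
target = flat.unsqueeze(1)                          # shape (7, 1)

# Stand-in for the model output: one prediction per row, shape (7, 1).
pred = torch.zeros(7, 1)

matched = pred - target   # shapes agree, result stays (7, 1)
broadcast = pred - flat   # (7, 1) against (7,) broadcasts to (7, 7)

print(tuple(matched.shape))
print(tuple(broadcast.shape))
```

Keeping the target at shape (7, 1) ensures that each prediction is compared only with its own rent.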

Now that the model is trained, let's make an estimate. We will use the same input as before.

pred = model2(torch.Tensor(X_test))
pred

tensor([[11.0694]], grad_fn=<AddmmBackward0>)

The prediction is about 11.07, or roughly 111,000 yen. The actual rent is 115,000 yen, so the estimate is relatively close.
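
To put the two estimates side by side, the short calculation below computes the absolute and relative error of each prediction against the actual rent of 115,000 yen, using the values printed earlier in this article. The helper name `report` is introduced only for this illustration.

```python
actual = 11.5          # actual rent in units of 10,000 yen (115,000 yen)
xgb_pred = 12.527947   # xgboost prediction shown above
nn_pred = 11.0694      # neural network prediction shown above

def report(name, pred):
    """Print and return the absolute and relative error of one prediction."""
    abs_err = abs(pred - actual)
    rel_err = abs_err / actual
    print(f"{name}: off by {abs_err * 10000:,.0f} yen ({rel_err:.1%})")
    return abs_err, rel_err

xgb_abs, xgb_rel = report("xgboost", xgb_pred)
nn_abs, nn_rel = report("neural network", nn_pred)
```

A single property is not enough to rank the two models, and the neural network result will vary between runs because its weights are initialized randomly.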

In this lecture, we estimated rent using xgboost and a PyTorch neural network. Both tools can be used in many ways, so try applying them to a variety of problems. That is all.

Yuichiro Minato