In the last article we were introduced to Boosting Algorithms, Ensemble Methods, and a detailed overview of one of the most talked-about boosting algorithms: AdaBoost, or Adaptive Boosting. That gave us an understanding of how AdaBoost works, where we can use it, and how it takes a linear ensemble of weak classifiers (base classifiers) and creates a more powerful final classifier that progressively learns from the wrongly classified points.
So we have an idea of how it works, its pseudocode, and a diagrammatic representation of how the weights are updated at each round. But it is still not clear why it works. It is difficult to get a complete picture of why AdaBoost works so well, but we will look at one analysis that gives an intuitive idea: a greedy optimization perspective, or a loss function view. We can represent AdaBoost as the optimization of the exponential loss of the function below.
We start with a function:
f(x) = (1/2) ∑_{m=1}^{M} α_m f_m(x)
We saw in the previous article that the final classifier can be written as sign(f(x)).
The exponential loss can be written as:
L_exp(x, y) = e^(-y f(x))
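One property worth noting (not part of the derivation below, but it explains why this loss is a sensible target): the exponential loss is an upper bound on the 0/1 misclassification loss, so driving it down also drives down the training error. A quick numeric check with a few hypothetical margins:

```python
import numpy as np

# A few hypothetical values of f(x) and true labels y in {-1, +1}
f = np.array([2.0, 0.5, -0.3, -1.5])
y = np.array([1, -1, 1, -1])

exp_loss = np.exp(-y * f)                   # exponential loss per point
zero_one = (np.sign(f) != y).astype(float)  # 0/1 misclassification loss
print(exp_loss >= zero_one)                 # True at every point
```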
Given the training set {(x_i, y_i)}_{i=1}^{N}, we can write the classifier objective function as:

L = ∑_{i=1}^{N} e^(-(1/2) y_i ∑_{m=1}^{M} α_m f_m(x_i))
We try to do the optimization in a greedy and sequential way, adding one weak classifier at each step. Adding the classifier f_m at step m, the objective function can be written as:

L = ∑_{i=1}^{N} e^(-(1/2) y_i ∑_{j=1}^{m-1} α_j f_j(x_i)) · e^(-(1/2) y_i α_m f_m(x_i))

Since we are optimizing only the classifier added at step m, the first m-1 terms form a constant factor that does not affect the gradient at this step. Writing this factor off as a weight constant, the weight constant for the m-th classifier is:

w_i^(m) = e^(-(1/2) y_i ∑_{j=1}^{m-1} α_j f_j(x_i))
As we saw in the last article, this weight constant resembles the AdaBoost weight, which was computed there by recursion.
Thus we can write the loss as:

L = ∑_i w_i^(m) e^(-(1/2) y_i α_m f_m(x_i))
Splitting the sum into the points correctly classified by f_m and the misclassified ones:

L = e^(-α_m/2) ∑_{y_i = f_m(x_i)} w_i^(m) + e^(α_m/2) ∑_{y_i ≠ f_m(x_i)} w_i^(m)

Here we write the weighted error rate as ε_m = ∑_i w_i^(m) I(f_m(x_i) ≠ y_i) / ∑_i w_i^(m). Now, to get the optimal value of the parameter α_m, we take the derivative of the loss function with respect to α_m and set it to zero, which gives α_m = ln((1 - ε_m)/ε_m), exactly the AdaBoost update we saw in the last article.
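As a quick numeric sanity check (a sketch with a hypothetical error rate, not part of the derivation), we can minimize the split loss over a grid of α values and compare against the closed form α_m = ln((1 - ε_m)/ε_m):

```python
import numpy as np

# Hypothetical weighted error rate for illustration
eps = 0.2

# Per-round loss with normalized weights:
# (correct mass) * e^(-a/2) + (misclassified mass) * e^(a/2)
alphas = np.linspace(0.01, 5.0, 100000)
loss = (1 - eps) * np.exp(-alphas / 2) + eps * np.exp(alphas / 2)

best = alphas[np.argmin(loss)]          # grid-search minimizer
closed_form = np.log((1 - eps) / eps)   # the derived AdaBoost update
print(best, closed_form)                # the two agree closely
```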
The exponential loss does not give the complete picture of AdaBoost, because optimizing this loss jointly over all the parameters of the classifier gives poor performance; still, it is one way to understand AdaBoost.
Next we will try to implement the AdaBoost classifier using scikit-learn.
First we will import the necessary packages.
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split
from sklearn import metrics
```
We import the Hastie (Example 10.2) dataset from scikit-learn's datasets module, and we will be using the train_test_split method to split the data into training and test sets.
```python
x, y = make_hastie_10_2()
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)
```
This gives us an 80% training set and a 20% test set.
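As a quick sanity check on the data (assuming the default make_hastie_10_2 size of 12,000 samples with 10 Gaussian features, and a fixed random_state for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split

x, y = make_hastie_10_2(random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print(x.shape)        # (12000, 10)
print(np.unique(y))   # labels are -1 and +1
print(X_train.shape[0], X_test.shape[0])  # 9600 train, 2400 test
```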
```python
a = AdaBoostClassifier(n_estimators=50, learning_rate=1)
```
We make an AdaBoost classifier object (a) with the given parameters, where n_estimators is the number of weak classifiers we are using and learning_rate shrinks the contribution (the weight α_m) of each weak classifier in the ensemble.
```python
model = a.fit(X_train, Y_train)
```
```python
Y_pred = model.predict(X_test)
```

This gives the predicted labels for our test set.
```python
print(metrics.accuracy_score(Y_test, Y_pred))
```

0.8791666666666667
Here we compute the accuracy of our model on the test set, which comes out to around 88%.
Here we used scikit-learn's library methods to train and evaluate the model. In my next article we will write the AdaBoost implementation from scratch, based on the pseudocode we wrote in the last article.