In the last article we were introduced to Boosting Algorithms, Ensemble Methods, and a detailed overview of one of the most talked-about boosting algorithms: AdaBoost, or Adaptive Boosting. That gave us an understanding of how AdaBoost works, where we can use it, and how it takes a linear ensemble of weak classifiers (base classifiers) and creates a more powerful final classifier that progressively learns from the wrongly classified points.
So we have an idea of how it works, its pseudocode, and a diagrammatic representation of how the weights are updated at each round. But it is still not clear why it works. It is difficult to get a complete picture of why AdaBoost works so well, but we will look at one analysis that gives an intuitive idea: a greedy optimization perspective, or a loss function view. We can represent AdaBoost as the optimization of the exponential loss of the function below.
We start with a function:
f(x) = (1/2) ∑_{m=1}^{M} α_m f_m(x)
We saw in the previous article that the final classifier can be written as sign(f(x)).
The exponential loss can be written as:
L_exp(x, y) = e^(-y f(x))
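One property worth noting (not part of the derivation below, but it explains why this loss is a sensible target): the exponential loss is an upper bound on the 0/1 misclassification loss, so driving it down also drives down the training error. A quick numeric check with a few hypothetical margins:

```python
import numpy as np

# A few hypothetical values of f(x) and true labels y in {-1, +1}
f = np.array([2.0, 0.5, -0.3, -1.5])
y = np.array([1, -1, 1, -1])

exp_loss = np.exp(-y * f)                   # exponential loss per point
zero_one = (np.sign(f) != y).astype(float)  # 0/1 misclassification loss
print(exp_loss >= zero_one)                 # True at every point
```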
Given the training set {(x_i, y_i)}_{i=1}^{N}, we can write the classifier objective function as:

L = ∑_{i=1}^{N} e^(-(1/2) y_i ∑_{m=1}^{M} α_m f_m(x_i))
We try to do the optimization in a greedy and sequential way, adding one weak classifier at each step. Adding the classifier f_m at step m, the objective function can be written as:

L = ∑_{i=1}^{N} e^(-(1/2) y_i ∑_{j=1}^{m-1} α_j f_j(x_i)) · e^(-(1/2) y_i α_m f_m(x_i))

Since we are optimizing only the classifier added at step m, the first m-1 terms form a constant factor that does not affect the gradient at this step. Writing this factor off as a weight constant, the weight constant for the m-th classifier is:

w_i^(m) = e^(-(1/2) y_i ∑_{j=1}^{m-1} α_j f_j(x_i))
As we saw in the last article, this weight constant resembles the AdaBoost weight, which was computed there by recursion.
Thus we can write the loss as:

L = ∑_i w_i^(m) e^(-(1/2) y_i α_m f_m(x_i))
Splitting the sum into the points correctly classified by f_m and the misclassified ones:

L = e^(-α_m/2) ∑_{y_i = f_m(x_i)} w_i^(m) + e^(α_m/2) ∑_{y_i ≠ f_m(x_i)} w_i^(m)

Here we write the weighted error rate as ε_m = ∑_i w_i^(m) I(f_m(x_i) ≠ y_i) / ∑_i w_i^(m). Now, to get the optimal value of the parameter α_m, we take the derivative of the loss function with respect to α_m and set it to zero, which gives α_m = ln((1 - ε_m)/ε_m), exactly the AdaBoost update we saw in the last article.
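As a quick numeric sanity check (a sketch with a hypothetical error rate, not part of the derivation), we can minimize the split loss over a grid of α values and compare against the closed form α_m = ln((1 - ε_m)/ε_m):

```python
import numpy as np

# Hypothetical weighted error rate for illustration
eps = 0.2

# Per-round loss with normalized weights:
# (correct mass) * e^(-a/2) + (misclassified mass) * e^(a/2)
alphas = np.linspace(0.01, 5.0, 100000)
loss = (1 - eps) * np.exp(-alphas / 2) + eps * np.exp(alphas / 2)

best = alphas[np.argmin(loss)]          # grid-search minimizer
closed_form = np.log((1 - eps) / eps)   # the derived AdaBoost update
print(best, closed_form)                # the two agree closely
```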
The exponential loss does not give the complete picture of AdaBoost, because optimizing this loss jointly over all the parameters of the classifier gives poor performance; still, it is one way to understand AdaBoost.
Next we will try to implement the AdaBoost classifier using scikit-learn.
First we will import the necessary packages.
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split
from sklearn import metrics
```
We import the Hastie (Example 10.2) dataset from scikit-learn's datasets module, and we will be using the train_test_split method to split the data into training and test sets.
```python
x, y = make_hastie_10_2()
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)
```
This gives us an 80% training set and a 20% test set.
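As a quick sanity check on the data (assuming the default make_hastie_10_2 size of 12,000 samples with 10 Gaussian features, and a fixed random_state for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split

x, y = make_hastie_10_2(random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print(x.shape)        # (12000, 10)
print(np.unique(y))   # labels are -1 and +1
print(X_train.shape[0], X_test.shape[0])  # 9600 train, 2400 test
```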
```python
a = AdaBoostClassifier(n_estimators=50, learning_rate=1)
```
We make an AdaBoost classifier object (a) with the given parameters, where n_estimators is the number of weak classifiers we are using and learning_rate shrinks the contribution (the weight α_m) of each weak classifier in the ensemble.
```python
model = a.fit(X_train, Y_train)
```
```python
Y_pred = model.predict(X_test)
```

This gives the predicted labels for our test set.
```python
print(metrics.accuracy_score(Y_test, Y_pred))
```

0.8791666666666667
Here we compute the accuracy of our model on the test set, which comes out to around 88%.
Here we used scikit-learn's library methods to train and evaluate the model. In my next article we will write the AdaBoost implementation from scratch, based on the pseudocode we wrote in the last article.