Boosting algorithms are a general strategy for improving the accuracy of weak classifiers by combining them. We take a weak classifier that does just better than chance and combine many such classifiers into a much stronger one, thus boosting the performance of the learners.

One of the most famous of these algorithms is AdaBoost, also known as Adaptive Boosting. It is an ensemble learning method that improves the performance of binary classifiers. AdaBoost uses sequential learning, where each model learns from the mistakes of the previous ones. Ensemble methods can be divided into:

1. Sequential learning (e.g. AdaBoost)
2. Parallel learning (e.g. Random Forest)

Some of the boosting algorithms are AdaBoost, gradient boosting and XGBoost. Here we will be looking into AdaBoost.

A single classifier often fails to separate the points well. Instead, we build a strong model by combining multiple weak classifiers, each of which progressively learns from the wrongly classified points.

Let's start with a training set $\{(x_i, y_i)\}_{i=1}^{N}$ where $x_i \in \mathbb{R}^K$ and $y_i \in \{-1, 1\}$. We can define a weak classifier as $g_m(x) \in \{-1, 1\}$.

Thus we can define a loss function $I(g_m(x_i), y_i)$ which is 0 when $g_m(x_i) = y_i$ and 1 otherwise.
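As a tiny illustration, this loss is just an indicator function (the function name here is illustrative):

```python
def zero_one_loss(prediction, label):
    # 0 when the weak classifier agrees with the label, 1 otherwise
    return 0 if prediction == label else 1
```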

The pseudocode for the AdaBoost algorithm is as follows:

for i from 1 to N, $w_i^{(1)} = 1/N$

for m = 1 to M do

$\epsilon_m = \frac{\sum_{i=1}^{N} w_i^{(m)} I(g_m(x_i) \neq y_i)}{\sum_{i=1}^{N} w_i^{(m)}}$ # the weighted error, i.e. the weighted fraction of misclassifications

where $I(g_m(x_i) \neq y_i) = 1$ when the prediction is wrong and 0 otherwise,

$\alpha_m = \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$

for all i do

$w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m I(g_m(x_i) \neq y_i)}$

end

end
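The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name, the idea of passing a pool of candidate weak learners, and the assumption that every round's best error satisfies $0 < \epsilon_m < 1$ are all mine.

```python
import numpy as np

def adaboost_train(X, y, weak_learners, M):
    """Run M rounds of the AdaBoost loop.
    weak_learners: candidate functions mapping X -> predictions in {-1, +1}.
    Assumes the best weighted error each round is strictly between 0 and 1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # all points start equally weighted
    chosen, alphas = [], []
    for _ in range(M):
        # weighted error of every candidate; keep the best one this round
        errs = [np.sum(w * (g(X) != y)) / np.sum(w) for g in weak_learners]
        m = int(np.argmin(errs))
        eps = errs[m]
        alpha = np.log((1.0 - eps) / eps)   # alpha_m = ln((1 - eps_m) / eps_m)
        # up-weight the points this round's learner got wrong
        w = w * np.exp(alpha * (weak_learners[m](X) != y))
        chosen.append(weak_learners[m])
        alphas.append(alpha)
    return chosen, alphas
```

Note how the weight update only touches misclassified points: where the indicator is 0, $e^{0} = 1$ and the weight is unchanged.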

The diagram below shows the decision boundaries and the updating of the weights at each round.

Let's break down the pseudocode for easy understanding:

Taking the total number of data points to be $N$, we start by weighting all the points equally, so $w_{initial} = 1/N \in [0, 1]$ and the weights sum to 1. $\alpha_m$ is the influence a single stump has on the final classification (we will learn about decision stumps later on). Plotting $\alpha$ against the error rate makes its behaviour clear.

The value of $\alpha$ ranges from $\infty$ to $-\infty$: it is 0 when the classifier is right exactly half the time (no better than chance), strongly positive when there are only a few misclassifications, and strongly negative when the classifier gets almost everything wrong.
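A quick numeric check of this behaviour (the error rates chosen are just examples):

```python
import math

# alpha = ln((1 - eps) / eps) for a few weighted error rates eps
for eps in (0.1, 0.5, 0.9):
    alpha = math.log((1 - eps) / eps)
    print(f"eps = {eps}: alpha = {alpha:+.3f}")
# few mistakes (eps = 0.1) -> large positive alpha,
# chance level (eps = 0.5) -> alpha = 0,
# mostly wrong (eps = 0.9) -> large negative alpha
```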

The final classifier is a linear combination of the weak classifiers (base classifiers):

$f(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m g_m(x)\right)$
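In code, the final prediction is just the sign of the $\alpha$-weighted vote. A sketch, where `learners` and `alphas` are assumed to come from a training loop:

```python
import numpy as np

def adaboost_predict(X, learners, alphas):
    # f(x) = sign(sum_m alpha_m * g_m(x)): an alpha-weighted majority vote
    scores = sum(a * g(X) for g, a in zip(learners, alphas))
    return np.sign(scores)
```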

The diagram below shows how the weak classifiers are linearly combined into one final classifier.

#### Decision stumps:

These can be considered a form of weak classifier. A decision stump's representation is:

$f(x) = s(x_k > c)$

where the value of the function is either 1 or -1: the stump outputs $s$ if $x_k > c$ and $-s$ otherwise. Here $x_k$ is the $k^{th}$ element of the vector $x$.

The number of parameters of a decision stump is relatively small. They are:

1. $c \in \mathbb{R}$
2. $k \in \{1, \dots, K\}$, where $K$ is the dimensionality of $x$.
3. $s \in \{-1, 1\}$

As the number of parameters is small, we can use brute force: take candidate thresholds from the lowest to the highest value in the training set, sequentially enumerate all possible classifiers, and pick the one with the lowest training error.
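Since there are only three parameters, this brute-force search fits in a few lines. A sketch, using the training values themselves as the candidate thresholds (the function name and interface are illustrative):

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively search (k, c, s) for the lowest weighted training error.
    X: (N, K) feature matrix, y: labels in {-1, +1}, w: sample weights."""
    best_err, best_params = np.inf, None
    for k in range(X.shape[1]):           # feature index k
        for c in np.unique(X[:, k]):      # thresholds taken from the training set
            for s in (-1, 1):             # sign/orientation of the stump
                pred = np.where(X[:, k] > c, s, -s)   # f(x) = s(x_k > c)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best_params = err, (k, c, s)
    return best_params
```

The search is $O(N K)$ candidate stumps, each evaluated against all $N$ points, which is cheap enough for a weak learner that gets retrained every boosting round.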

We will analyze the workings of AdaBoost in more depth, with a Python implementation, in the next blog post.
