Ensemble Learning

Pratima Rathore
Jun 24, 2020


Please refer to my article on Decision Trees before starting this one, for a better understanding.

  • A single decision tree will rarely generalize well to data it wasn’t trained on. However, we can combine the predictions of a large number of decision trees to make very accurate predictions.
  • Mathematically speaking, a decision tree has low bias and high variance. Averaging the result of many decision trees reduces the variance while maintaining that low bias.
  • Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner.
  • When we try to predict the target variable using any machine learning technique, the main causes of the difference between the actual and predicted values are noise, variance, and bias. Ensembling helps to reduce these factors (except noise, which is irreducible error).
  • Ensembles of models are somewhat analogous to teams of individual players. If you were to choose a football team, you would take the following two steps:

Choose people with different skill sets, such as defenders, attackers and a goalkeeper, to ensure diversity.

Choose good players, i.e., ensure that all players are acceptable from the point of view of their skill sets (and at least better than a regular person).

Diversity ensures that the models serve complementary purposes, which means that the individual models make predictions independently of each other. Acceptability implies that each model is at least better than a random model, i.e., its probability of being correct should be more than 0.5, making it better than a random guess.

Ensemble learning, in general, is a model that makes predictions based on a number of different models.

Some commonly used ensemble learning techniques:

1. Voting

Voting combines the output of different algorithms by taking a vote. In the case of a classification model, if the majority of the classifiers predict a particular class, then the output of the model would be the same class. In the case of a regression problem, the model output is the average of all the predictions made by the individual models. In this way, every classifier/regressor has an equal say in the final prediction.
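
Below is a minimal sketch of hard voting using scikit-learn's VotingClassifier; the three base estimators and the toy dataset are arbitrary placeholders chosen only to make the example runnable.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy data just to make the sketch runnable
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting: each classifier gets one vote and the majority class wins
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)
print(voter.score(X_test, y_test))

Setting voting="soft" would average the predicted class probabilities instead of counting votes, which is closer to the regression-style averaging described above.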

2. Weighted Average

This is an extension of the averaging method. All models are assigned different weights defining the importance of each model for the prediction. For example:

# Each base model outputs class probabilities for the test set
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

# Weighted average of the probabilities; the weights should sum to 1
finalpred = (pred1 * 0.3 + pred2 * 0.3 + pred3 * 0.4)
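
Assuming model1, model2 and model3 are fitted classifiers, predict_proba returns one probability per class for each row, so the final class labels can be taken as the argmax of finalpred along the class axis, e.g. finalpred.argmax(axis=1). Note that the weights (0.3, 0.3, 0.4) here are chosen by hand.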

3. Stacking and Blending

Stacking is a method that uses k-fold cross-validation to train base models, which then make predictions on the left-out fold. These so-called out-of-fold predictions are then used to train another model, the meta-model, which can use the information produced by the base models to make the final predictions. To get even better results, we can train the meta-model not only on the predictions of all the base models but also on the initial features.

Blending is a term introduced by the Netflix competition winners. Blending is very similar to stacking, the only difference being that instead of creating out-of-fold predictions using k-fold, you create a small holdout dataset which is then used to train the meta-model. This is a technique where we can do a weighted averaging of the final result by using another classifier.

Stacking and blending: one approach to carrying out manual ensembling is to pass the outputs of the individual models to a level-2 classifier/regressor as derived meta-features, which decides what weight should be given to each model's output in the final prediction. In this way, the outputs of the individual models are combined with different weightages in the final prediction. This is the high-level approach behind stacking and blending.
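
Here is a minimal stacking sketch using scikit-learn's StackingClassifier, which builds the out-of-fold predictions internally; the base models, the logistic-regression meta-model and the toy dataset are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Level-1 (base) models; their out-of-fold predictions become
# the training features for the meta-model
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),
]

# Level-2 meta-model learns how much weight to give each base model's output;
# passthrough=True also feeds it the initial features, as mentioned above
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # k-fold used to create the out-of-fold predictions
    passthrough=True,
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))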

4. Bagging

  • It is a parallel ensemble learning method, meaning a bunch of individual models are trained in parallel. Each model is trained on a random subset of the data.
  • It is a method that helps decrease the variance of the classifier and reduce overfitting.
  • Bagging creates different training subsets from the sample training data with replacement, and an algorithm with the same set of hyperparameters is built on these different subsets of data. In this way, the same algorithm with a similar set of hyperparameters is exposed to different parts of data, resulting in a slight difference between the individual models. The predictions of these individual models are combined by taking the average of all the values for regression or a majority vote for a classification problem.
  • Note that using larger subsamples is not guaranteed to improve your results. In bagging there is a trade-off between base model accuracy and the gain you get through bagging. The aggregation from bagging may improve the ensemble greatly when you have an unstable model, yet when your base models are more stable, having been trained on larger subsamples with higher accuracy, the improvements from bagging reduce.
  • Examples: Random Forest, Bagged Decision Trees, Extra Trees (see the sketch below).
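
Below is a minimal bagging sketch with scikit-learn's BaggingClassifier around a decision tree; the dataset and the hyperparameters (number of trees, sample fraction) are illustrative assumptions, not tuned values.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample
# (a random subset drawn with replacement) of the data
bagger = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,  # fraction of the data used per bootstrap sample
    bootstrap=True,   # sample with replacement
    random_state=42,
)
print(cross_val_score(bagger, X, y, cv=5).mean())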

Disadvantages of Bagging -

  • In this approach, you cannot really see the individual trees one by one and figure out what is going on behind the ensemble as it is a combination of n number of trees working together. This leads to a loss of interpretability.
  • It does not work well when one of the features dominates, because then all the trees look similar and the property of diversity in ensembles is lost.
  • Bagging can be computationally expensive, so whether to apply it depends on the case.

5. Boosting

Before we go further, here’s another question for you: if a data point is incorrectly predicted by the first model, and then by the next (and probably by all models), will combining the predictions provide better results? Such situations are taken care of by boosting.

  • Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model. In other words, we fit consecutive trees (often on random samples), and at every step the goal is to solve for the net error from the prior tree.
  • Boosting is an ensemble meta-algorithm for principally reducing bias, and also variance.
  • In boosting, a subset is created from the original dataset and all observations are given equal weights. A model is trained on this subset and used to make predictions on the whole dataset, and errors are calculated from the actual and predicted values. The observations which are incorrectly predicted are given higher weights. Another model is then created and makes predictions on the dataset, trying to correct the errors of the previous model. Similarly, multiple models are created, each correcting the errors of the previous one. The final model (strong learner) is the weighted mean of all the models (weak learners).
  • Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but they work well for some part of the dataset. Thus, each model actually boosts the performance of the ensemble.
  • Examples: AdaBoost, Stochastic Gradient Boosting, GBM, XGBoost, LightGBM, CatBoost (an AdaBoost sketch follows below).
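
As a concrete example, here is a minimal AdaBoost sketch with scikit-learn; the decision stumps, the number of estimators and the learning rate are illustrative assumptions, not tuned values.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Shallow trees (decision stumps) act as the weak learners; each new stump
# focuses on the observations the previous stumps misclassified
booster = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
print(cross_val_score(booster, X, y, cv=5).mean())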

Disadvantages of Boosting -

  • Boosting is sensitive to outliers, since every classifier is obliged to fix the errors of its predecessors. Thus, the method is too dependent on outliers.
  • The method is hard to scale up, because every estimator bases its corrections on the previous predictors, so training is inherently sequential and difficult to parallelize.

Both boosting and bagging are homogeneous ensemble methods (dealing with a single base algorithm), while voting/averaging and stacking are heterogeneous methods (dealing with different base algorithms).

Stacking also deals with different types of algorithms as base models, but unlike voting/averaging, the weight of each base model in the final prediction is determined by another algorithm/model. Thus they create a strong ensemble model.

That’s all for now…

If you liked the article, show your support by clapping for it. This article is basically a compilation of many articles from Medium, Analytics Vidhya, upGrad material, etc.

If you are also learning machine learning like me, follow me for more articles. Let’s go on this trip together :)

You can also follow me on LinkedIn.
