In the first part of this series we set the context for the Random Forest algorithm by introducing tree-based algorithms for classification problems. In this post we will look at some of the limitations of the tree-based model and how they were overcome, paving the way to a powerful model: Random Forest. Two major methods employed to overcome those pitfalls are bootstrapping and bagging. We will discuss them first before delving into random forest.
Bootstrapping and Bagging
When we discussed the tree-based model we saw that such models are very intuitive, i.e. they are easy to interpret. However, such models suffer from a major drawback: high variance. Let us understand what high variance means in this context. Suppose we have a data set that we divide into three parts, and we fit a separate tree model on each part. If we then use the three models to predict the result for the same new observation, the three predictions can be very different. In statistical jargon this is called a “model with high variance”. High variance is bad because it compromises the reliability of the results. One effective way to overcome high variance is averaging. This means taking multiple data sets, fitting a tree-based model on each of them, predicting on new observations and then averaging the results from the individual tree models to get a more reliable result. This seems a very plausible solution. However, we have a major problem here. Averaging requires multiple data sets. But what if the data we have is quite limited and obtaining additional data is prohibitively expensive?
Lo and behold, we have a powerful method to help us out of this predicament, and it is called bootstrapping.
The word bootstrapping comes from the phrase “pulling oneself up by one's bootstraps”. In essence it means accomplishing a task considered impossible. In statistics, the bootstrapping procedure entails sampling from the available data set with replacement. Let me elaborate with an example. Suppose our data set has 10 observations (rows 1:10). From this data set we randomly pick an observation, say row 6. We then put row 6 back into the data set and randomly pick another observation. Say this time we get row 8. We again put this observation back and repeat the process until we have 10 observations. Let us assume that the first set of observations we picked looks like this: 6,8,4,8,5,6,9,1,2,5. You might have noticed that some observations repeat within this set. That is perfectly all right in bootstrapping. We repeat the whole procedure until we have a collection of bootstrapped samples of 10 observations each. Once we have this collection, or bag, of bootstrapped data sets, we fit a tree model on each of them, carry out predictions and then average the results (or, for classification, take the most common predicted class). This whole process is called bagging. Bagging helps us get over our original problem of high variance and gives more reliable results.
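The sampling and averaging steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `bootstrap_sample` and `bagged_average` are made up for this post, the seed is arbitrary, and the "model" is a deliberately trivial stand-in (the mean row number of each sample) rather than a real tree.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) observations from rows, with replacement."""
    return [rng.choice(rows) for _ in rows]

def bagged_average(predictions):
    """Combine per-model predictions by averaging (the regression case)."""
    return sum(predictions) / len(predictions)

rng = random.Random(42)      # fixed seed: an arbitrary choice for repeatability
rows = list(range(1, 11))    # rows 1..10, as in the example above

# A "bag" of 5 bootstrapped data sets, each with 10 observations.
bag = [bootstrap_sample(rows, rng) for _ in range(5)]

# Stand-in for "fit a tree and predict": each model just predicts the mean
# row number of its own sample. A real tree learner would go here.
predictions = [sum(sample) / len(sample) for sample in bag]

for sample, pred in zip(bag, predictions):
    print(sample, "->", pred)
print("bagged prediction:", bagged_average(predictions))
```

Each printed sample will typically contain repeated rows and miss a few others, which is exactly the behaviour described in the example.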
Now that we have discussed bootstrapping and bagging, we are in a position to get into the nuances of random forest. The Random Forest algorithm improves on bagging by de-correlating the trees. Let me elaborate on the de-correlating part. When we discussed tree-based methods in the last post, we talked about splitting the data set based on the best features. When we grow our trees on the bootstrapped samples, more often than not it is the same set of best features that gets picked to do the splits and thereby grow the trees. This results in a bunch of trees which look almost the same or, in statistical terms, are “correlated”. We have also discussed that the final result is obtained by averaging the results from all the tree models grown on the bootstrapped samples. It turns out that averaging predictions from correlated trees reduces variance much less than averaging predictions from uncorrelated trees, so the predictions are sub-optimal.
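A short worked formula may make the cost of correlation concrete. It rests on simplifying assumptions that are not part of the discussion above: each of the B trees gives a prediction T_b with the same variance σ², and every pair of trees has the same correlation ρ ≥ 0. Under those assumptions the variance of the averaged prediction is:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

With ρ = 0 the variance shrinks as σ²/B when more trees are added. With ρ > 0 the first term stays put no matter how many trees we average, so lowering the correlation between trees is what pays off.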
To overcome this, the Random Forest algorithm randomly picks a smaller subset of features as candidates each time it makes a split. If there are “P” features in the data set, the subset picked has approximately √P features. The idea of considering only a random subset of features at each split is to avoid being biased towards the best predictors. In this setting the strongest predictors are not even available at many splits, so the other predictors get a chance to be picked and the tree models become more “representative”. Averaging the results from these less correlated trees generally gives more accurate predictions. In effect, the combination of bootstrapping, bagging and random picking of features provides the robustness inherent in the random forest model.
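The random picking of features can be sketched as below. This is a minimal illustration, not a full tree learner: the helper name `candidate_features` is hypothetical, features are represented only by their index, rounding √P to the nearest integer is one possible convention, and the seed is arbitrary.

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick about sqrt(P) distinct feature indices to consider for one split."""
    k = max(1, round(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)   # fixed seed: an arbitrary choice for repeatability
P = 16                   # total number of features, so about 4 are considered

# Every split draws a fresh random subset of candidate features.
for split in range(3):
    print("split", split, "considers features", candidate_features(P, rng))
```

Because each split sees a different handful of features, a single dominant predictor cannot shape every tree in the same way.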
Out of Bag Error Estimation
There is a very straightforward method to estimate the error in a bagged model, and it is called “Out of Bag” (OOB) error estimation. In the example we discussed on bootstrapping, we had 10 observations in our first sample (6,8,4,8,5,6,9,1,2,5). We can see that observations 3, 7 and 10 have not been picked in the first bootstrapped sample. These are called “Out of Bag” observations. In general, only about 2/3rd of the distinct observations end up in a given bootstrapped sample. That means about 1/3rd of the observations are OOB for each bootstrapped sample. OOB observations serve an important purpose in the overall scheme of things: they act as test beds for estimating the error of the model. Let me illustrate this idea with an example. Take the case of observation 3. As seen, it is an OOB observation for the first bootstrapped sample. Let us assume that the same observation ends up as OOB for the 6th and 12th bootstrapped data sets too. Once tree models are fit on the first, sixth and twelfth bootstrapped sets, observation 3 is used as a test case for each of them, giving three predictions, one per model. For regression, the three results for observation 3 are averaged to get a single prediction. For classification, the most prevalent class out of the three is taken. Once we have this single prediction, the error is estimated by comparing it against the true value or class of observation 3. The same is done for every observation, using only the trees for which it was OOB, and the errors are aggregated into an overall error estimate. This method of error estimation can stand in for cross validation, which can be cumbersome for large data sets.
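The OOB bookkeeping from this example can be checked with a small Python sketch. The helper names (`out_of_bag`, `oob_majority_vote`, `expected_oob_fraction`) and the class labels are made up for illustration. The last helper relies on a standard probability fact: a given row is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e (roughly 0.37) as n grows, and that is where the "about 1/3rd" figure comes from.

```python
import math
from collections import Counter

def out_of_bag(rows, sample):
    """Return the rows that never appear in a bootstrapped sample."""
    in_bag = set(sample)
    return [r for r in rows if r not in in_bag]

def oob_majority_vote(votes):
    """Classification case: most common class among the OOB predictions."""
    return Counter(votes).most_common(1)[0][0]

def expected_oob_fraction(n):
    """Chance that one given row is missed by all n draws with replacement."""
    return (1 - 1 / n) ** n

rows = list(range(1, 11))
first_sample = [6, 8, 4, 8, 5, 6, 9, 1, 2, 5]

print(out_of_bag(rows, first_sample))            # [3, 7, 10]
print(oob_majority_vote(["yes", "no", "yes"]))   # yes (hypothetical votes)
print(expected_oob_fraction(10), math.exp(-1))   # small-n value vs. large-n limit
```

The three votes passed to `oob_majority_vote` play the role of the predictions for observation 3 from the first, sixth and twelfth trees.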
The ideas behind the random forest model, i.e. bootstrapping, bagging and random feature selection, have together produced a very powerful algorithm. However, random forest is not bereft of pitfalls. One major pitfall of the model is that it cannot be interpreted as easily as a single tree. However, the positives of this model far outweigh the negatives, and because of this random forest is one of the most powerful algorithms, providing reliable results.
It is time to wrap up our discussion on tree-based algorithms and random forest in particular. From the next post onward we start a new series called the “Mind of a Data Scientist”. In this series we take an exploratory walk through the thought process of a data scientist in enabling data-driven, informed decision making. Watch this space for more.