Classification Algorithms: Random Forest – Part I, Setting the Context

In the last few posts, where we covered logistic regression, we discussed classification problems at some length. Classification problems are among the most prevalent ones we encounter in real world machine learning, and it is important to deal with the various facets of this problem. In the next few posts, we will decipher some of the popular algorithms used within the classification context. The first of those algorithms is called the Random Forest. It is one of the most popular and powerful algorithms currently used in the classification setting. In addition to deciphering the dynamics of Random Forest, we will also look at a practical application powered by the Random Forest algorithm: a movie sentiment analyser built on a Random Forest model. Outlined below is the path we will traverse in our discussions:

• Random Forest: setting the context – An introduction to Tree based methods
• Introduction to bootstrapping and deciphering the dynamics of Random Forest
• Random Forest in action – Introduction of the movie sentiment analyser and feature selection approaches
• Decoding the movie sentiment analyser with the Random Forest model

Setting the stage with Tree based Methods

The Random Forest algorithm falls under a genre of prediction methods called tree based methods. Understanding the logic behind tree based methods will help immensely in getting a better handle on Random Forest. So let us start our discussion with tree based methods.

Tree based methods closely resemble the way humans take decisions. Let me start off with a small example of how I decide whether I should take an umbrella when I want to go for a walk. The decision tree is as depicted below.

Figure 1

Looking at the above decision tree, let us try to get some intuition on the key components. As seen from the tree, there are three decision gates which split my decision space, thereby helping me take a more informed decision. In machine learning parlance these decision gates are called “predictors” or “features”, and the set of values they can take is called the “predictor space” or “feature space”. At each gate there is also a value which splits the predictor space. For example, for the first predictor, “Did it rain in the past few days”, the value is binary, i.e. Yes/No (or 1/0 numerically). It is the combination of a predictor and its corresponding split value which helps in finally arriving at a decision. The values need not be binary in nature. They can be continuous numbers or ordinal values. Taking cues from the above toy example, let us look at a real data set and try to understand the tree based method in more detail.

The data set we are going to analyse is called the “Heart” data set. It is available in the UCI Machine Learning Repository. A snapshot of the data set is given below.

`(Source: UCI Machine Learning Repository)`

This data set consists of records of 269 patients, along the rows, with symptoms of heart ailments. Columns 1 to 13 are the predictors/features, which are essentially various attributes that help in the detection of heart disease. The 14th column (“Pred”) is the outcome variable, which indicates whether that patient has heart disease or not: a value of “2” indicates the presence of heart disease and “1” its absence. Our aim is to understand how a tree based algorithm learns from a training set like the above and helps in predicting the presence of heart disease in a patient.

The first step in a tree based algorithm is to find a predictor (from the 13 predictors) and a value to split the data set into two distinct regions or groups. In our toy example, the predictor we used for the first split was the one which indicated the presence of rain in the previous days. For the time being let us assume that the best predictor to do our split is the 13th predictor, “Thal”. This predictor is the result of the thallium stress test and takes 3 types of values: 3, 6 and 7. A value of 3 indicates normal behavior, 6 indicates a fixed defect and 7 a reversible defect. Let us not worry too much about what fixed and reversible defects are, as they are medical jargon. We will only be concerned with the values 3, 6 and 7. Let us assume that the value at which we do the split is 3. In essence, any patient whose thallium stress test result is 3 will be in group 1 and the ones with a result of more than 3 will be in another group. The first split is shown in the figure below.

Figure 2

As seen above, the first split divides the total of 269 patients into two groups of 151 and 118 respectively. In group 1, 119 of the 151 patients have the outcome variable as 1 (absence of heart disease) and the other 32 have an outcome of 2 (presence of heart disease). The corresponding values for group 2 are 31 and 87 respectively.
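To make the mechanics of a split concrete, here is a minimal sketch in Python. The function names and the six toy records are hypothetical and are not taken from the Heart data set. Records with a predictor value at or below the split value go to one group and the rest go to the other.

```python
from collections import Counter

def split_on_value(rows, predictor, value):
    """Partition rows into two groups: predictor <= value, and predictor > value."""
    left = [r for r in rows if r[predictor] <= value]
    right = [r for r in rows if r[predictor] > value]
    return left, right

def outcome_counts(rows, outcome="Pred"):
    """Count how many rows fall under each outcome label."""
    return Counter(r[outcome] for r in rows)

# Hypothetical toy records, only for illustration
toy_rows = [
    {"Thal": 3, "Pred": 1},
    {"Thal": 3, "Pred": 1},
    {"Thal": 3, "Pred": 2},
    {"Thal": 6, "Pred": 2},
    {"Thal": 7, "Pred": 2},
    {"Thal": 7, "Pred": 1},
]

group1, group2 = split_on_value(toy_rows, "Thal", 3)
print(len(group1), dict(outcome_counts(group1)))
print(len(group2), dict(outcome_counts(group2)))
```

On the real data set the same kind of partition would give the 151 and 118 patients described above.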

The next step is to grow the tree deeper by splitting the two branches we got after the first split with some other predictor. The choice of predictor for each branch can be the same or different. I will come to the criteria for the choice of predictor later on. For the time being let us assume the same predictor is used to split both branches further. Let the split happen with the 10th predictor (ST) and a value of 0.5. Any record with a value less than 0.5 will be in the group to the left and any record with more than 0.5 will belong to the branch on the right. The split is depicted below.

Figure 3

This process of splitting the nodes continues till a certain threshold is reached. The threshold can be, say, that no region has more than 5 observations. The nodes in the last layer we get after all the splits are called the terminal nodes.
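The growing process described above can be sketched as a small recursive function. This is only an illustration under stated assumptions: the splits are handed in as a fixed list, one (predictor, value) pair per level, mirroring the "Thal" then "ST" example, rather than being chosen by the algorithm. The names `grow` and `terminal_sizes`, the toy records and the threshold of 2 are all hypothetical.

```python
def grow(rows, splits, max_region_size=5, depth=0):
    """Split rows level by level until a region is small enough.

    splits is an assumed, pre-decided list of (predictor, value) pairs,
    one per level. A region becomes a terminal node when it holds at most
    max_region_size rows or when no further split is available.
    """
    if len(rows) <= max_region_size or depth >= len(splits):
        return {"terminal": True, "rows": rows}
    predictor, value = splits[depth]
    left = [r for r in rows if r[predictor] <= value]
    right = [r for r in rows if r[predictor] > value]
    if not left or not right:
        # The split separates nothing, so stop here
        return {"terminal": True, "rows": rows}
    return {
        "terminal": False,
        "predictor": predictor,
        "value": value,
        "left": grow(left, splits, max_region_size, depth + 1),
        "right": grow(right, splits, max_region_size, depth + 1),
    }

def terminal_sizes(node):
    """List the number of rows in each terminal node, left to right."""
    if node["terminal"]:
        return [len(node["rows"])]
    return terminal_sizes(node["left"]) + terminal_sizes(node["right"])

# Hypothetical toy records, only for illustration
toy_rows = [
    {"Thal": t, "ST": s}
    for t, s in [(3, 0.0), (3, 0.2), (3, 0.8), (3, 1.5),
                 (6, 0.1), (7, 0.3), (7, 1.0), (7, 2.4)]
]

tree = grow(toy_rows, [("Thal", 3), ("ST", 0.5)], max_region_size=2)
print(terminal_sizes(tree))
```

With a threshold of 2, the eight toy records end up in four terminal nodes of two records each.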

Now that we have seen the process of splitting (growing a tree), let us deal with two important aspects we did not explain in detail:

1. How do we decide on the predictor/feature and values for making a split?
2. How do we do the predictions for any test set observations?

Selecting the predictors and their values

The process of selecting a predictor is an iterative one. We pick one predictor at a time, along with an arbitrary value within its predictor space, and assess whether that predictor and value make the best possible split. This assessment has parallels with the way we find the right set of parameters through overall cost minimization in logistic regression. In tree based methods we do the assessment through a method called minimization of the classification error rate. The name might look intimidating, but the idea behind it is quite simple.

When we carry out the node splitting process, the labels or outcomes of all the observations in a branch are approximated to the most prevalent outcome in that branch. In the first split we did (Figure 2), we split the observations into two branches. In the left branch, 119 of the 151 observations have outcome 1, so the most prevalent outcome in the left branch is outcome 1. Therefore all observations which fall in the left branch will be approximated as outcome 1, irrespective of their true outcomes.

In the process of approximating to the most prevalent outcome, we also make some classification errors. In the left branch there were 32 observations which belonged to outcome 2 and which we are approximating as outcome 1. This is the error we inherit because of our approximation. This error, as a percentage of the total number of observations in that branch, is called the classification error rate (i.e. 32/151, or about 21%). The right predictor and value for a split are the ones which yield the lowest classification error rate. In a nutshell, the overall process for selecting the best predictor and split value is as follows:

1. Pick a predictor and a value within the predictor space for doing a split.
2. Calculate the classification error rate.
3. Repeat the process with another predictor and value noting the classification error rate obtained in each selection.
4. Pick the predictor-value combination which yields the lowest classification error rate to do the split of that particular node.
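The four steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation. It assumes that a split is scored by pooling the misclassified observations of both branches into a single rate, and that every distinct value of a predictor (except the largest) is tried as a split value. The function names and the toy records are hypothetical; only the Figure 2 counts come from the discussion above.

```python
from collections import Counter

def misclassified(labels):
    """Count the labels that differ from the most prevalent label."""
    if not labels:
        return 0
    most_common_count = Counter(labels).most_common(1)[0][1]
    return len(labels) - most_common_count

def split_error_rate(rows, predictor, value, outcome="Pred"):
    """Pooled classification error rate of a split (an assumed scoring rule)."""
    left = [r[outcome] for r in rows if r[predictor] <= value]
    right = [r[outcome] for r in rows if r[predictor] > value]
    return (misclassified(left) + misclassified(right)) / len(rows)

def best_split(rows, predictors, outcome="Pred"):
    """Try every predictor-value pair and keep the one with the lowest error rate."""
    best = None
    for predictor in predictors:
        # Skip the largest value: splitting there leaves the right branch empty
        candidates = sorted({r[predictor] for r in rows})[:-1]
        for value in candidates:
            rate = split_error_rate(rows, predictor, value, outcome)
            if best is None or rate < best[2]:
                best = (predictor, value, rate)
    return best

# Counts from Figure 2: left branch 119 vs 32, right branch 31 vs 87
left_labels = [1] * 119 + [2] * 32
right_labels = [1] * 31 + [2] * 87
left_rate = misclassified(left_labels) / len(left_labels)      # 32 / 151
right_rate = misclassified(right_labels) / len(right_labels)   # 31 / 118
pooled_rate = (misclassified(left_labels) + misclassified(right_labels)) / 269  # 63 / 269
print(round(left_rate, 4), round(right_rate, 4), round(pooled_rate, 4))

# Hypothetical toy records, only for illustration
toy_rows = [
    {"Thal": 3, "ST": 0.0, "Pred": 1},
    {"Thal": 3, "ST": 0.2, "Pred": 1},
    {"Thal": 3, "ST": 1.2, "Pred": 1},
    {"Thal": 6, "ST": 0.1, "Pred": 2},
    {"Thal": 7, "ST": 0.4, "Pred": 2},
    {"Thal": 7, "ST": 2.0, "Pred": 2},
]
chosen = best_split(toy_rows, ["Thal", "ST"])
print(chosen)
```

On the toy records the search settles on "Thal" with a split value of 3, because that split separates the two outcomes without any misclassification.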

Predictions for any test set observations

So far, what we have discussed is the training process. At the end of the training process, what we learn is the best predictor and its corresponding split value at each node. We have also discussed the approximation of outcomes to the most prevalent outcome for calculating the classification error rate. This approximation has another utility: it determines the class of the terminal nodes. After the tree is grown to its full depth, each terminal node is categorized to the outcome which is most prevalent in that node. This categorization is required when we have to do predictions for any test set.

Once the learning is done on the training set, the way we do prediction on a test set is quite straightforward. The test set examples are split according to the splits we learned during the training process. After the splits, each test set example finally ends up in one of the many terminal nodes. The prediction for the test set example will be the same as the outcome of the terminal node where it ends up.
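The prediction step can be sketched as a walk down the tree. The tree below is hand written for illustration: it reuses the "Thal" and "ST" splits from the example, but the nested dictionary layout, the function names and the outcome counts in the terminal nodes are all hypothetical and are not results from the Heart data set.

```python
def terminal_class(counts):
    """Most prevalent outcome among the training observations in a terminal node."""
    return max(counts, key=counts.get)

def predict(node, example):
    """Follow the learned splits until a terminal node is reached."""
    while "counts" not in node:
        if example[node["predictor"]] <= node["value"]:
            node = node["left"]
        else:
            node = node["right"]
    return terminal_class(node["counts"])

# Hypothetical learned tree; the terminal counts are made-up training tallies
tree = {
    "predictor": "Thal", "value": 3,
    "left": {
        "predictor": "ST", "value": 0.5,
        "left": {"counts": {1: 8, 2: 1}},
        "right": {"counts": {1: 4, 2: 3}},
    },
    "right": {
        "predictor": "ST", "value": 0.5,
        "left": {"counts": {1: 2, 2: 5}},
        "right": {"counts": {1: 1, 2: 9}},
    },
}

test_examples = [
    {"Thal": 3, "ST": 0.2},
    {"Thal": 7, "ST": 1.4},
    {"Thal": 6, "ST": 0.1},
]
predictions = [predict(tree, example) for example in test_examples]
print(predictions)
```

Each test example simply inherits the most prevalent training outcome of the terminal node it lands in.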

Wrapping Up

As mentioned earlier, tree based methods are the basis for understanding advanced algorithms like Random Forest. Now that we have seen how tree based methods work, we are well equipped to decipher Random Forest. We will do that in the next post. Watch this space for your safari through Random Forest.