Classification Algorithms: Random Forest – Part II



In the first part of this series we set the context for Random Forest algorithm by introducing the tree based algorithm for classification problems. In this post we will look at some of the limitations of the tree based model and how they were overcome paving the way to a powerful model – Random Forest. Two major methods that were employed to overcome those pitfalls are Bootstrapping and Bagging. We will discuss them first before delving into random forest.

Bootstrapping and Bagging

When we discussed the tree based model we saw that such models are very intuitive i.e. they are easy to interpret. However such models suffer from a major drawback i.e high variance.Let us understand what high variance means in this context. Suppose we were to have a data set which we divide it into three parts. If three different tree models were fit on these data sets and we were to predict the result of a new observation based on these three models. The result we might get from each of these three models for the same observation can be very different. This is what we call in statistical jargon as ” Model with high variance”. High variance obviously is bad as the reliability of the results we get is compromised. One effective way to overcome high variance is to do averaging. This would mean taking multiple data sets, fitting a tree based model on each of these data sets, do predictions on new observations and then averaging the results got from each of the tree model to get a more reliable result. This seems a very plausible solution. However we have a major problem here. Doing averaging would require having multiple data sets. But what if the data we have is quite limited and obtaining additional data is prohibitively expensive ?

……….. Lo and Behold, we have a powerful method to help us out of this predicament and it is called Bootstrapping.

The etymological meaning of the word Bootstrapping is “Pulling oneself up by ones bootstrap”.In essence it means doing some task considered impossible. In statistics bootstrapping procedure entails sampling from the available data set with replacement. Let me elaborate with an example. Suppose our data set were to have 10 observations ( rows 1:10). From this data set we were to randomly pick an observation, say row 6. After that we replace the row 6 into the data set and we randomly pick another number. Say this time we got row 8. We again put this observation back and repeat the process till we get around 10 observations. Let us assume that the first set of observations we picked looks like this : 6,8,4,8,5,6,9,1,2,5. You might have noticed that there are observations which repeat within the above set. That is perfectly all-right in bootstrapping. We continue this process till we get a collection of bootstrapped samples of 10 observations each. Once we get a collection or a bag of bootstrapped data sets, we fit a tree model for each of these sets, carry out predictions and then average the results. This whole process is called bagging. Bagging helps us get over our original problem of high variance and the results mirror more closely to reality.

Random Forest

Now that we have discussed bootstrapping and bagging we are in a position to get into the nuances of random forest. Random Forest algorithm provides an improvement over bagging in terms of de-correlating the trees. Let me elaborate the de-correlating part. When we were discussing the tree based methods in the last post, we talked about splitting the data set based on the best features.When we grow our trees on the bootstrapped samples , more often than not it is those set of best features which gets picked, to do the split and thereby grow trees. This will result in getting a bunch of trees which look almost the same or in statistical terms “co-related”. We also have discussed that the final result will be obtained by averaging results from all the tree models grown on the bootstrapped samples. It works out that averaging predictions from co-related trees will result in sub-optimal predictions.

To overcome this, Random Forest algorithm randomly picks a smaller subset of features to do split. If there were “P” features in the data set, the subset picked is approximately √P.  The idea of randomly picking a subset of features for each tree is to avoid being biased towards the best predictors. In the new setting, all the predictors have equal chance of being picked and the tree models will be more “representative”. Averaging the results from these representative trees will provide more accurate predictions. In effect the combination of bootstrapping, bagging and random picking of features provides the robustness inherent in the random forest model.

Out of Bag Error Estimation

There is a very straight forward method to estimate the error in a bagged model and it is called “Out of Bag”(OOB) error estimation. In the example we discussed on bootstrapping ,we had 10 observations in our first sample, (6,8,4,8,5,6,9,1,2,5). We can see that the following observations ( 3,7,10) have not been picked in the first bootstrapped sample. These elements are called “Out of Bag” observations. In general it is seen that in the bootstrapping process approximately only 2/3rd of the observations are generally picked. That means about 1/3rd of the observations are OOB in each bootstrapped sample. OOBs have some very important purpose in the overall scheme of things i.e. they act as test beds for estimating error in the model. Let me emphasize this idea with an example. Let us take the case of observation 3. As seen, it is an OOB observation for the first bootstrapped sample. Let us assume that the same observation ends up as OOB for the 6th and 12th bootstrapped data set too. When a tree model is fit on the first, sixth and the twelfth bootstrapped set, the observation 3 will be used as a test set to predict three distinct results corresponding to each model. The three results for observation 3 will thereby be averaged(for regression) to get a single prediction. In case of classification problems the most prevalent class out of the three will be taken. Once we get one single prediction by averaging, the error is estimated by comparing against the true class the observation 3 fall into. Similarly the error estimation is done for all the OOB elements to get an overall aggregation of error. This method of error estimation eliminates the need for cross validation which can be cumbersome for large data sets.

Wrapping Up

The ideas behind random forest model i.e bootstrapping, bagging, random feature selection etc has aided the making of a very powerful algorithm. However random forest is not bereft of pitfalls. One major pitfalls of the model is that it cant be interpreted easily. However the positives of this model far outweighs the negative and because of this random forest is one of the most powerful algorithms providing realistic results.

It is time to wrap up our discussion on tree based algorithms and random forest in particular. From the next post onward we start a new series called the “Mind of a Data Scientist”. In this series we do an exploratory walk, through the thought process of a data scientist in enabling, data driven informed decision making. Watch out this space for more

Classification Algorithms : Random Forest – Part I, Setting the Context

TreesIn the last few posts, where we discussed  Logistic regression, we had a fair bit of discussions on classification problems. Classification problems are the most prevalent ones we encounter in the real world machine learning setting and it is important to deal with various facets of this problem.In the next few posts, we will decipher some of the popular algorithms used within the classification context. The first of those algorithms which we are discussing is called the Random Forest. It is one of the most popular and powerful algorithms, which is currently used in the classification setting. In addition to deciphering the dynamics of Random Forest, we will also be looking at a practical applications powered by Random Forest algorithm. The practical application we will be dealing with is a movie sentiment analyser using a Random Forest Model. Outlined below is the path we will traverse in our discussions,

  • Random Forest :setting the context – An introduction to Tree based methods
  • Introduction to bootstrapping and deciphering the dynamics of Random Forest
  • Random Forest in action – Introduction of the movie sentiment analyser and Feature selection approaches
  • Decoding the movie sentiment analyser with the Random Forest model.

Setting the stage with Tree based Methods

Random Forest algorithm falls under a genre of prediction method called the Tree Based Method. Understanding the logic behind Tree based methods will help immensely in getting a better handle on Random Forest. So let us start our discussion on Tree based methods.

Tree based methods resonates very closely to the way humans take decisions . Let me start off with a small example of how I take decision on whether I should take an umbrella when I want to go for a walk. The decision tree is as depicted below.


Figure 1

Looking at the above decision tree let us try to get some intuition on the key components. As seen from the tree, there are three decision gates which splits my decision space thereby helping me to take a more informed decision. In machine learning parlance these decision gates are called “Predictor space” or “Feature Space”. At each predictor space, there is also a value which splits the predictor space. For example for the first feature space, “Did it rain in the past few days”, the value is a binary values i.e Yes/No ( or 1/0 numerically). It is the combination of a predictor space and its corresponding value which will help in finally arriving at a decision. The values need not be binary in nature. It can be a continuous numbers or some ordinal value. Taking cues from the above toy example, let us look at a real data set and try to under stand the tree based method in more detail.

The data set we are going to analyse is called the “Heart” data set. It is available in the UCI Machine Learning Repository.  A snapshot of the data set is as given below.


(Source: UCI Machine Learning Repository)

This data set consists of  records of 269 patients,along the rows, with symptoms of heart ailments. The columns 1 to 13  are predictors/features which necessarily are various attributes helping in detection of heart disease. The 14th column (“Pred”) is the outcome variable, which indicates whether that patient has heart disease or not, a value of “2” indicates the prevalence of heart disease and “1” absence of heart disease. Our aim is to understand how a tree based algorithm learns from a training set like the above and helps in predicting  the prevalence of heart disease in a patient.

The first step in a tree based algorithm is to find a predictor(from the 13 predictors) and a value to split the data set into two distinct regions or groups. In our toy example, the predictor we used for the first split was the one which indicated the presence of rain in the previous days. For the time being let us assume that the best predictor to do our split is the 13th predictor “Thal”. This predictor is the measure of Thalium stress test and contains 3 types of values , 3,6 & 7. A value of 3 indicates normal behavior, 6 indicates a fixed defect and 7 a reversible defect. Let us not worry too much about what fixed defect and reversible defects are as they are medical jargon. However we will only be concerned about the values 3,6 & 7. Let us assume that the value at which we do the split is  3. In essence any patient who has the result of Thalium stress test as 3 will be in group1 and the ones who have more than 3 will be another group. The first split is show in figure below


Figure 2

As seen above, the first split divides the total of 269 patients into two group of 151 and 118 respectively. In group 1 out of the 151 patients 119 of them have the outcome variable as 1( absence of heart disease) and the other 32 have outcome of 2 ( presence of heart disease). The corresponding value for group 2 are 31 and 87 respectively.

The next step is to grow the trees deeper by splitting the two branches we got after the first split with some other predictor. The choice of predictor for each branch can be same or different. I will come to the criteria for the choice of predictor later on. For the time being let us assume the same predictor is used to split the two branches further. Let the split happen with the 10th predictor (ST) and a value of 0.5. Any record with value less than 0.5 will be in a group to the left and more than 0.5 will belong to the branch on the right. The split is depicted as below.


Figure 3

This process of splitting the nodes continues till a certain threshold is reached. The threshold can be, say ,till no region has more than 5 observations. The last layer we get after all the splits are called the terminal nodes.

Now that we have seen the process of splitting(growing a tree) let us deal with two important aspects we did not explain in detail,

  1. How do we decide on the predictor/feature and values for making a split ?
  2. How do we do the predictions for any test set observations?

Selecting the predictors and its values

The process of selecting a predictor is an iterative one.We pick one predictor at a time and an arbitrary value within the predictor space and carry out an assessment if the picked predictor and value is the best one possible for carrying out the split.The way we do assessment of whether the predictor is the best one has its paralells with the way we find the right set of parameters by overall cost minimization in logistic regression. In tree based method we do the assessment through a method called minimization of classification error rate. The name might look intimidating, however the idea behind it is quite simple. When we carry out the node splitting process the labels or outcomes of all the observations are approximated to the most prevalent outcome. In the first split we did (figure 2) we split the observations in two branches. In the left branch out of the total 151 observations 119 of them have outcome as 1. So the most prevalent outcome in the left branch is outcome 1. Therefore all observations which fall in the left branch will be approximated as outcome 1, irrespective of what its true outcome is. In the process of approximating to the most prevalent outcome, we also make some classification errors. For the case of the left branch, there were 32 observations which belonged to outcome 2 which we are approximating as outcome 1. This obviously is the error we will inherit because of our approximation. This error as percentage of the total number of observations in that branch is called the classification error rate ( i.e 32/151). The criteria for the selecting the right predictor and the right value to do the split is the one which yields the lowest classification error rate. In a nutshell the overall process  for selecting the best  predictor and split value is as follows

  1. Pick a predictor and a value within the predictor space for doing a split
  2. Calculate the classification error rate.
  3. Repeat the process with another predictor and value noting the classification error rate obtained in each selection.
  4. Pick the predictor-value combination which yields the lowest classification error rate to do the split of that particular node.

Predictions for any test set observations

So far, what we have discussed is the training process. At the end of the training process what we learn are a set of best predictors and its corresponding values to do splits ,at each node. We have also discussed about the approximation of outcomes to the most prevalent outcome for calculation of classification error rate. The approximation of outcome  has another utility, i.e to determine the class of the terminal nodes. After the tree is grown to its full depth the terminal nodes will be categorized to an outcome which is most prevalent in that node. This categorization is required when we have to do predictions for any test set.

Once the learning is done on the training set, the way we do prediction on a test set is quite straight forward. The test set examples are split according to the splits we learned during the training process. After the splits, the test set examples finally ends up in one of the many terminal nodes. The prediction for the test set example will be the same as the outcome of the terminal node where it ends up.

Wrapping Up

As mentioned earlier tree based methods is the basis for understanding advanced algorithms like Random Forest. Now that we have seen the tree based methods we are well equipped to decipher Random Forest. We will do that in the next post. Watch out this space for your safari of Random Forest.

Logic of Logistic Regression – Part III



In our previous post on logistic regression we defined the concept of parameters and had a first hand glimpse on the dynamics between the data set and the parameters to obtain our first set of predictions. In this part we will go further into how we optimize the parameters in order to improve the accuracy of our predictions. We will be dealing with the following concepts

  1. Deciphering the prediction errors
  2. Minimizing errors through gradient descent and finding optimized parameters
  3. Prediction with the optimized set of parameters.

Deciphering Prediction Errors

Let us revisit the toy example we discussed in our last post and dissect the below table which represented the dynamics of prediction.


To recap, let us list down our discussions in  the last post on the dynamics involved in the above table.

  • We first assumed an initial set of parameters
  • Multiplied the parameters with the respective features ( columns 2,3 &4) to get the weighted sum.
  • Converted the weighted sum into predictions ( column 6) by applying the activation function (sigmoid function).

Let us take a moment to reflect on what the predictions really mean ? The predictions are in fact the probabilities of the customer  buying the insurance policy. For example, for the first customer, we are predicting that the probability that the customer will buy the insurance policy is almost 17.9%.

However when we talk about predictions the first thing which comes to our mind is the veracity of those predictions. How close to reality are the first set of predictions which we made ? If we recall, in our last discussion on the training set, we introduced a new column called the labels. The labels in fact is the reality !! For example looking at the labels column we know that the first two customers did not buy the insurance policy ( label of ‘0’) and the next two bought the insurance policy. The veracity of our predictions can be realized by comparing our predictions with the reality manifested in the labels. By comparing we can see that the first and last customer predictions are somewhat close to reality and the middle ones are pretty off target. In ideal state, we want the first two predictions to be close to zero and the last two pretty close to ‘1’. However, what we predicted have obviously deviated from the reality. Such deviations are the errors we have inherited in our predictions.  However we need to note that the calculation of error for a classification problem like ours is a little mathematically oriented and is not as straight forward as subtracting the probability from the labels. For the sake of simplicity let us not get into those mathematical calculations and stick to our understanding that there  some errors inherited for each example. From the errors of each example we  can find the average error by summing up errors of all examples and dividing it by the number of examples ( 4 in our case). In machine learning parlance the average error so obtained can also be called the ‘Cost’.

Now that we know that there are ‘Cost’ involved in our predictions, our aim should be to minimize the cost so that our predictions are as close to the reality as possible.However the million dollar question is how do we minimize the cost ? What are the levers we have to reduce our costs ? Going back to our toy example, the two entities we have played around to get the predictions are the ‘data’ and the ‘parameters’ . We cannot change the given data because it is fixed. So all we have got to play around with is the parameters which we assumed. We have to try to change our parameters systematically so that we minimize the costs and get our predictions as close to the reality as possible. One of the ways we do this is by a procedure called gradient descent.

Gradient Descent

To understand the concept of gradient descent let us look at some graphical representations.


A pictorial representation of the cost function will look as the above. In the ‘X’ axis we have our parameters and in the ‘Y’ axis we have the cost. From the figure we can see that there are some set of parameters,’P’ with which we can get to the minimum cost ‘C_min’. Our aim is to find those parameters which will give us the minimum cost.

Let us represent the initial parameters we assumed as P_initial. For this set of parameters let us denote the  cost we derived as C1, as given in the figure. We can see from the figure that by moving the P value to the left ( decreasing the parameters ) by some value we can get to the minimum value of cost. Alternatively, if our initial ‘P’ value were to be on the left side of the graph, we would have to move to the right ( increase the value of parameters ) to get to the minimum cost. The procedure for achieving this is called the gradient descent.

The idea behind gradient descent is represented pictorially as below.


We decrease the parameters by small steps in an iterative fashion so as to get to the minimum cost. To find out  the “small steps” which I mentioned in the previous line we use a trick we learned in high school calculus called partial derivative. By taking the partial derivative at each point of the cost curve we get a value by which we have to reduce the parameters. With the new set of reduced parameters we find the new cost. Again we find the partial derivative at the new cost level to get the next steps which we have to take, and this process continues till we reach the minimum cost. An analogy to this process is like this. Suppose we are on top of a hill, blindfolded, and we want to find our way down the hill. The way we can do this is by feeling the ground with our foot to find those spots which are lower than the ones where we are currently and then move to the new spot. From the new spot we repeat the process till we finally reach the bottom of the hill. Gradient descent works somewhat similar to this.


Summarizing our discussions on gradient descent, these are the steps we take to get the optimum parameters.

  1. First start of with the assumed random parameters.
  2. Find the cost ( errors ) associated with the assumed parameters.
  3. Find the small steps we have to take to alter our parameters, by taking partial derivative of the cost.
  4. Reduce the parameters by the small steps and get a new set of parameters
  5. Find the new cost associated with the new parameters.
  6. Repeat the processes 3,4 & 5 till we get the most optimized cost.

The optimized parameters which we finally get are called the learned parameters.Getting to this optimized parameters is the most involved part of machine learning. Once we learn the parameters using, the training set, we are all set to do predictions which is the objective of any machine learning process.

Doing Predictions

Having learned our set of optimized parameters from the training set, we are now equipped with enough ammunition to do predictions. For doing predictions we take a new set of data called the test set. However there is a difference between the training set and test set. The test set will not have any labels. Our job is to predict the labels from the parameters we have learned. So in the insurance company example, the test set would be the new set of leads which the sales team generated. We have to predict the likelihood of these leads, buying an insurance policy. The way we do the prediction is as follows.

  • We take the optimized set of parameters learned from the training set
  • Multiply the parameters with the respective features ( columns 2,3 &4) to get the weighted sum.
  • Convert the weighted sum into probabilities ( column 6) by applying the activation function (sigmoid function).
  • We take a threshold point ( say 0.5). So any probability less than the threshold point is predicted as ‘0’ ( Will not buy) and anything greater that the threshold point is predicted as ‘1’.

The threshold point which we take to make a decision on our predictions is called the decision boundary.Needless to say, the logistic regression is the basic model among a vast set of powerful classification algorithms. The significance of logistic regression is that it is the building block for the development of powerful algorithms like Support Vector machines, Neural Networks etc. Having said that there are many problem areas where we have to go for simple algorithms like logistic regression. Having dealt with the basic building blocks of classification problems we will have further discussions on some of the most powerful algorithms in future posts. Until then watch out this space for more.

Logic of Logistic Regression – Part II


In the first part of this series on Logistic Regression, we set the stage for unveiling the logic behind logistic regression. We stopped our discussion by identifying three dynamic forces at play which determines the quality of predictions,

  1. Weights or parameters which we learn
  2. The activation function, and
  3. The decision boundary

In this second, part of the series we will look deeper into the first two of those dynamic forces.

Concept of Parameters

In the first part of this series when we were discussing the example we assumed a set of parameters i.e W(age) = 8 ; W(income) = 3 and W(propensity) = 10. Quite naturally, a  question lot of people asked me was, where did we get those values from ? Well, as far as that example was concerned, it was just some assumed values. However in the world of machine learning, the parameters is its Holy Grail. The cardinal purpose of the algorithms and theorems of machine learning is to enable the pursuit of the right set of parameters. But why is it that the parameters, so important ? To answer this let us look at what the parameters help us achieve.

Let us revisit the toy data set which we used in the first part. Let us first understand this data set before we get into understanding the parameters.

As can be seen, this data set consists of rows and columns. The data along the columns ( Age, Income & Propensity) are called its  features and the ones along the rows are the examples. In short each customer record in this data set is an example.

Now that we have seen the data set, let us now see the dynamics between the parameters and the data.

The role of the parameter is to act as a weighting factor for each of the features. In other words each feature will have a unique parameter playing the role of a weight. Our example data set has three features and therefore the number of parameters we will have is also three. In general if there are ‘n’ features there should be at least ‘n’ parameters ( However, in practice we will have n+1 parameters where the additional parameter is called the bias term. We will ignore that for the time being).  Please note here that the number of parameters does not depend on the number of examples.

Having looked at the anatomy of the data set and parameters, let us look at how the parameters are learned from a given data set.

Learning Parameters from data

The data set which is used for learning parameters is called a training set. There is a subtle difference between a training set and the one shown above. For the training set we will have an additional column and this additional column is for the labels or dependent variables.


The above data set is an example for a training set. The ‘labels’ column represent the results or outcome for each record. The records with ‘0’ are negative examples and those with ‘1’ are the positive examples. In this context the negative example would mean those customers who did not buy an insurance policy and the positive examples are the ones who bought them. The labels can also be interpreted from the perspective of probability of buying. So all the negative examples are the ones where the probability of sales is low i.e near 0% and the positive ones are those with high probability i.e near 100%. In real life a training set can be made from the historical data of customers in the organisation i.e who are the customers ? How many of them bought a policy ? How many did not ? etc.

The way, we go about the task of learning the parameters from the training set is as follows

  • Random Assumption of Parameters: To start off, we randomly select some arbitrary values for the parameters. For eg. let us assume the following values for the parameters ; W(age) = 1 ; W(income) = 1 and W(propensity) = 1
  • Scaling of the data : Once that we have assumed the parameters let us do some modification on the training data setIf we note the values for each features, the scale of values for each feature vary quite a bit. The values of feature ‘Age’ are all two digit numbers, the values of ‘Income’ are four digit numbers etc. In machine learning, when the values falls within different scales, the accuracy of prediction gets affected. So it is a good practice to normalize the data. One popular way is to subtract each value with the average of the feature and then divide by the range( difference between the maximum value and minimum value). Let us see this in action,with the feature ‘Age’                                                                                                                                           Average value of ‘Age’ = (28+32+36+ 46)/ 4 = 35.5                                                                         Range of ‘Age’ = 46 – 28 = 18                                                                                                                Scaled value for the first data (28) = 28 – 35.5 / 18 = -0.4167                                                  Similarly we do it for the complete data set. The scaled data set is as represented below.    Please note that we do not scale the labels.                                                                                                                                                          scale
  • Prediction with initial parameters : Once the data is scaled,  we go to the next step of using the assumed parameters for prediction. As mentioned earlier, the parameters are like weights which needs to be applied on each feature of the data. Therefore the first step in arriving at a prediction is to multiply the parameters with the corresponding feature and adding up the weighted features for each example. The same is carried out as below. Please note that the labels are not involved in any of these operations.   Weight   Let us study the above column closely. The weighted sum column which is got by applying the parameter on each feature and adding them up, is the value which finally determines the prediction. However for a classification problem the most intuitive way of representing the prediction is in terms of probabilities. As you know, when you represent a value as a probability it has to be within the range of ‘0’ and ‘1’. However if you note our weighted sum column, most of the values are outside the range of 0 & 1. So our challenge would be to apply some mathematical operation to represent them as a probability. The mathematical operation we use for this purpose is called the Activation Function.  One of the most common activation function used in classification problems is the  Sigmoid function . By applying this function on the weighted sum column we convert it into numbers which can be interpreted as probabilities. activation The new data set after applying the activation function is as represented above. Note that the probabilities column is our actual prediction and it can be interpreted as the probability that the  customer will buy the insurance policy. So for the first customer there is only 17.88% chance for buying the policy and for the last customer there is a high chance ( 81.4 %) for him/her to buy the policy.                                                                                                                                                                                                                                   Now that we have seen how we apply the activation function to get the prediction, we are a step closer to our final goal of learning the right parameters which gives the most accurate prediction. This all important step called the gradient descent will be explained in the next part of the post. Please watch out this space for the most important part of our logistic regression problem.

The Logic of Logistic Regression

At the onset let me take this opportunity to wish each one of you a very happy and prosperous New Year. In this post I will start the discussion around one of the most frequent type of problems encountered in a machine learning context – classification problem. I will also introduce one of the basic algorithms used in the classification context called the logistic regression.


In one of my earlier posts on machine learning I mentioned that the essence of machine learning is prediction. When we talk about prediction there are basically two types of predictions  we encounter in a machine learning context. In the first type, given some data your aim is to estimate a real scalar value. For example, predicting the amount of rainfall  from meteorological data or predicting the stock prices based on the current economic environment or predicting sales based on the past market data are all valid use cases of the first type of prediction context. This genre of prediction problems is called the regression problem. The second type of problems deal with predicting the category or class the observed examples fall into. For example, classifying whether a given mail is spam or not , predicting whether a prospective lead will buy an insurance policy or not, or processing images of handwritten digits and classifying the images under the correct digit etc fall under this gamut of problem. The second type of problem is called the classification problem. As mentioned earlier classification problems are the most widely encountered ones in the machine learning domain and therefore I will devout considerable space to give an intuitive sense of the classification problem. In this post I will define the basic settings for classification problems.

Classification Problems Unplugged – Setting the context

In a machine learning setting we work around with two major components. One is the data we have at hand and the second are the parameters of the data. The dynamics between the data and the parameters provides us the results which we want i.e the correct prediction. Of these two components, the one which is available readily to us is the data. The parameters are something which we have to learn or derive from the available data. Our ability to learn the correct set of parameters determines the efficacy of our prediction. Let me elaborate with a toy example.

Suppose you are part of an insurance organisation and you have a large set of customer data and you would like to predict which of these customers are likely to buy a health insurance in the future.

For simplicity let us assume that each customers data consists of three variables

  • Age of the customer
  • Income of the customer and
  • A propensity factor based on the interest the customer shows for health insurance products.

Let the data for 3 of our leads look like the below

Customer                Age                 Income                Propensity
Cust-1                                   22                      1000                           1
Cust-2                                   36                     5000                           6
Cust-3                                   62                     4500                            8

Suppose, we also have a set of parameters which were derived from our historical data on past leads and the conversion rate(i.e how many of the leads actually bought the insurance product).

Let the parameters be denoted by ‘W’ suffixed by the name of the variable, i.e

W(age) = 8 ; W(income) = 3 ; W(propensity) = 10

Once we have the data and the parameters, our next task is to use these two data points and arrive at some relative scoring for the leads so that we can make predictions. For this, let us multiply the parameters with the corresponding variables and find a weighted score for each customer.

Customer           Age                 Income             Propensity           Total Score
Cust-1                  22 x 8         +     1000 x 3     +    1 x 10                  3186
Cust-2                 36 x 8         +    5000 x 3     +     6 x 10                  15,348
Cust-3                  62 x 8          +   4500 x 3     +    8 x 10                 14,076

Now that we have the weighted score for each customer, its time to arrive at some decisions. From our past experience we have also observed that any lead, obtaining a score of  more than 14,000 tend to buy an insurance policy. So based on this knowledge we can comfortably make prediction that customer 1 will not buy the insurance policy and that there is very high chance that customer 2 will buy the policy. Customer 3 is in the borderline and with little efforts one can convert this customer too. Equipped with this predictive knowledge, the sales force can then focus their attention to customer 2 & 3 so that they get more “bang for their buck”.

In the above toy example, we can observe some interesting dynamics at play,

  1. The derivation of the parameters for each variable – In machine learning, the quality of the results we obtain depend to a large extend on the parameters or weights we learn.
  2. The derivation of the total score – In this example we multiplied the weights with the data and summed the results to get a score. In effect we applied a function(multiplication and addition) to get a score. In machine learning parlance such functions are called activation functions.The activation functions converts the parameters and data into a composite measure aiding the final decision.
  3. The decision boundary – The score(14,000) used to demarcate the examples as to whether the lead can be converted or not.

The efficacy of our prediction  is dependent on how well we are able to represent the interplay between all these dynamic forces. This in effect is the big picture on what we try to achieve through machine learning.

Now that we have set our context, I will delve deeper into these dynamics in the next part of this post. In the next part I will primarily be dealing with the dynamics of parameter learning. Watch out this space for more on that.