Logic of Logistic Regression – Part III

 


In our previous post on logistic regression we defined the concept of parameters and got a first-hand glimpse of the dynamics between the data set and the parameters that produced our first set of predictions. In this part we will go further into how we optimize the parameters in order to improve the accuracy of our predictions. We will be dealing with the following concepts:

  1. Deciphering the prediction errors
  2. Minimizing errors through gradient descent and finding optimized parameters
  3. Prediction with the optimized set of parameters.

Deciphering Prediction Errors

Let us revisit the toy example we discussed in our last post and dissect the table below, which represents the dynamics of prediction.

[Table: features, weighted sum and predicted probabilities after applying the activation function]

To recap, let us list the dynamics behind the above table, as discussed in the last post.

  • We first assumed an initial set of parameters.
  • We multiplied the parameters with the respective features (columns 2, 3 and 4) to get the weighted sum.
  • We converted the weighted sum into predictions (column 6) by applying the activation function (the sigmoid function).

Let us take a moment to reflect on what the predictions really mean. The predictions are in fact the probabilities of the customer buying the insurance policy. For example, for the first customer, we are predicting that the probability of buying the insurance policy is about 17.9%.

However, when we talk about predictions, the first thing that comes to mind is the veracity of those predictions. How close to reality is the first set of predictions we made? If we recall, in our last discussion on the training set, we introduced a new column called the labels. The labels are in fact the reality! For example, looking at the labels column we know that the first two customers did not buy the insurance policy (label of ‘0’) and the next two bought it (label of ‘1’). We can judge the veracity of our predictions by comparing them with the reality manifested in the labels. Doing so, we can see that the predictions for the first and last customers are somewhat close to reality and the middle ones are pretty off target. Ideally, we want the first two predictions to be close to ‘0’ and the last two close to ‘1’. What we predicted has obviously deviated from reality, and such deviations are the errors in our predictions.

We need to note, however, that the calculation of error for a classification problem like ours is a little mathematically involved and is not as straightforward as subtracting the probability from the label. For the sake of simplicity let us not get into those calculations and stick to our understanding that there is some error associated with each example. From the errors of the individual examples we can find the average error by summing up the errors of all examples and dividing by the number of examples (4 in our case). In machine learning parlance the average error so obtained is called the ‘Cost’.
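For readers who do want a peek at the calculation, one common choice of per-example error for this kind of classification problem is the log loss (cross-entropy). The sketch below is only an illustration and is not necessarily the exact formula behind the table above. The labels and the first and last predictions come from our toy example, while the two middle predictions are hypothetical stand-ins chosen to be "off target".

```python
import math

def log_loss(label, prediction):
    """Error for one example: small when the prediction agrees with the label."""
    if label == 1:
        return -math.log(prediction)
    return -math.log(1.0 - prediction)

def cost(labels, predictions):
    """Average error over all examples."""
    errors = [log_loss(y, p) for y, p in zip(labels, predictions)]
    return sum(errors) / len(errors)

labels = [0, 0, 1, 1]  # from the training set
# 0.1788 and 0.814 come from the toy example; 0.55 and 0.45 are
# hypothetical placeholders for the two middle predictions.
predictions = [0.1788, 0.55, 0.45, 0.814]

for y, p in zip(labels, predictions):
    print(f"label={y} prediction={p:.4f} error={log_loss(y, p):.4f}")
print(f"cost = {cost(labels, predictions):.4f}")
```

Notice that a prediction far from its label is penalized much more heavily than one close to it, which is the behaviour we want from an error measure.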

Now that we know that there is a ‘Cost’ involved in our predictions, our aim should be to minimize the cost so that our predictions are as close to reality as possible. However, the million dollar question is: how do we minimize the cost? What levers do we have to reduce it? Going back to our toy example, the two entities we have played around with to get the predictions are the ‘data’ and the ‘parameters’. We cannot change the given data because it is fixed. So all we have to play around with are the parameters which we assumed. We have to change our parameters systematically so that we minimize the cost and get our predictions as close to reality as possible. One of the ways we do this is by a procedure called gradient descent.

Gradient Descent

To understand the concept of gradient descent let us look at some graphical representations.

[Figure: cost plotted against the parameters, with the minimum cost C_min at parameters P]

A pictorial representation of the cost function looks like the above. On the ‘X’ axis we have our parameters and on the ‘Y’ axis we have the cost. From the figure we can see that there is some set of parameters, ‘P’, with which we can get to the minimum cost ‘C_min’. Our aim is to find those parameters which will give us the minimum cost.

Let us represent the initial parameters we assumed as P_initial. For this set of parameters let us denote the cost we derived as C1, as given in the figure. We can see from the figure that by moving the P value to the left (decreasing the parameters) by some amount we can get to the minimum value of cost. Alternatively, if our initial ‘P’ value were on the left side of the graph, we would have to move to the right (increase the value of the parameters) to get to the minimum cost. The procedure for achieving this is called gradient descent.

The idea behind gradient descent is represented pictorially as below.

[Figure: gradient descent, moving in small steps from P_initial towards the minimum cost]

We change the parameters by small steps in an iterative fashion so as to get to the minimum cost. To find the “small steps” mentioned in the previous line we use a trick from calculus called the partial derivative. Taking the partial derivative of the cost at our current point on the cost curve gives us the slope at that point, which tells us in which direction, and by how much, to change the parameters. With the new set of parameters we find the new cost. Again we find the partial derivative at the new cost level to get the next step which we have to take, and this process continues till we reach the minimum cost.

An analogy to this process is as follows. Suppose we are on top of a hill, blindfolded, and we want to find our way down. The way we can do this is by feeling the ground with our foot to find a spot which is lower than the one where we currently stand and then moving to that spot. From the new spot we repeat the process till we finally reach the bottom of the hill. Gradient descent works in a somewhat similar way.
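The blindfolded walk can be sketched in a few lines. The bowl-shaped cost function `(p - 3)**2` and the step size of 0.1 below are assumptions picked purely for illustration; they are not the cost of our insurance example.

```python
def cost(p):
    # Hypothetical bowl-shaped cost with its minimum at p = 3.
    return (p - 3.0) ** 2

def slope(p):
    # Derivative of the cost above with respect to p.
    return 2.0 * (p - 3.0)

def gradient_descent(p_initial, step_size=0.1, iterations=50):
    p = p_initial
    for _ in range(iterations):
        # Move against the slope: to the left when the slope is positive,
        # to the right when it is negative.
        p = p - step_size * slope(p)
    return p

p_from_right = gradient_descent(8.0)   # starts to the right of the minimum
p_from_left = gradient_descent(-2.0)   # starts to the left of the minimum
print(p_from_right, p_from_left)
```

Whichever side we start from, the same update rule walks the parameter towards the bottom of the bowl, because the sign of the slope decides the direction of each step.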

 

Summarizing our discussions on gradient descent, these are the steps we take to get the optimum parameters.

  1. First, start off with the assumed random parameters.
  2. Find the cost (errors) associated with the assumed parameters.
  3. Find the small steps we have to take to alter our parameters, by taking the partial derivative of the cost.
  4. Adjust the parameters by the small steps to get a new set of parameters.
  5. Find the new cost associated with the new parameters.
  6. Repeat steps 3, 4 and 5 till we reach the minimum cost.
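The six steps above can be sketched as a small training loop. This is a minimal illustration under several assumptions, not a definitive implementation: it uses only the scaled ‘Age’ feature from our toy data (assuming the ages 28, 32, 36 and 46 are listed in the same order as the labels 0, 0, 1, 1), it assumes the log loss as the cost, and the step-size multiplier `learning_rate` and the fixed count of 200 iterations are hypothetical choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weight, features):
    return [sigmoid(weight * x) for x in features]

def cost(weight, features, labels):
    # Average log loss (the assumed cost function).
    total = 0.0
    for p, y in zip(predict(weight, features), labels):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)

def gradient(weight, features, labels):
    # Partial derivative of the average log loss with respect to the weight.
    preds = predict(weight, features)
    return sum((p - y) * x for p, y, x in zip(preds, labels, features)) / len(labels)

# Scaled 'Age' values of the four customers and their labels.
ages = [28, 32, 36, 46]
mean_age = sum(ages) / len(ages)
age_range = max(ages) - min(ages)
features = [(a - mean_age) / age_range for a in ages]
labels = [0, 0, 1, 1]

weight = 1.0          # step 1: assumed starting parameter
learning_rate = 1.0   # hypothetical step-size multiplier
initial_cost = cost(weight, features, labels)        # step 2
current_cost = initial_cost

for _ in range(200):                                 # step 6: repeat
    step = learning_rate * gradient(weight, features, labels)  # step 3
    weight = weight - step                           # step 4
    current_cost = cost(weight, features, labels)    # step 5

print(f"cost before: {initial_cost:.4f}, cost after: {current_cost:.4f}")
print(f"learned weight: {weight:.4f}")
```

In practice one would usually stop when the cost no longer improves noticeably rather than after a fixed number of iterations.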

The optimized parameters which we finally get are called the learned parameters. Getting to these optimized parameters is the most involved part of machine learning. Once we learn the parameters using the training set, we are all set to do predictions, which is the objective of any machine learning process.

Doing Predictions

Having learned our set of optimized parameters from the training set, we are now equipped with enough ammunition to do predictions. For doing predictions we take a new set of data called the test set. However, there is a difference between the training set and the test set: the test set will not have any labels. Our job is to predict the labels using the parameters we have learned. So in the insurance company example, the test set would be the new set of leads which the sales team generated. We have to predict the likelihood of these leads buying an insurance policy. The way we do the prediction is as follows.

  • We take the optimized set of parameters learned from the training set.
  • We multiply the parameters with the respective features (columns 2, 3 and 4) to get the weighted sum.
  • We convert the weighted sum into probabilities (column 6) by applying the activation function (the sigmoid function).
  • We take a threshold point (say 0.5). Any probability less than the threshold is predicted as ‘0’ (will not buy) and anything greater than the threshold is predicted as ‘1’ (will buy).
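The prediction steps above can be sketched as follows. The learned parameters and the two scaled leads are hypothetical numbers invented for this illustration; none of them come from the toy example. Treating a probability exactly equal to the threshold as ‘0’ is also an arbitrary choice made here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(parameters, features):
    # Weighted sum of the features, converted into a probability.
    weighted_sum = sum(w * x for w, x in zip(parameters, features))
    return sigmoid(weighted_sum)

def predict_label(parameters, features, threshold=0.5):
    # Above the threshold -> '1' (will buy); otherwise '0' (will not buy).
    return 1 if predict_probability(parameters, features) > threshold else 0

# Hypothetical learned parameters for (age, income, propensity) and two
# hypothetical, already scaled leads from the test set.
learned_parameters = [2.0, 1.5, 3.0]
new_leads = [
    [-0.40, -0.30, -0.20],
    [0.35, 0.10, 0.45],
]

for lead in new_leads:
    probability = predict_probability(learned_parameters, lead)
    print(lead, round(probability, 3), predict_label(learned_parameters, lead))
```

Note that the new leads must be scaled in the same way as the training data before the learned parameters are applied to them.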

The threshold point which we take to make a decision on our predictions is called the decision boundary.

Needless to say, logistic regression is the basic model among a vast set of powerful classification algorithms. The significance of logistic regression is that it is a building block for the development of powerful algorithms like Support Vector Machines, Neural Networks etc. Having said that, there are many problem areas where a simple algorithm like logistic regression is the right choice. Having dealt with the basic building blocks of classification problems, we will have further discussions on some of the most powerful algorithms in future posts. Until then, watch this space for more.

Logic of Logistic Regression – Part II


In the first part of this series on Logistic Regression, we set the stage for unveiling the logic behind logistic regression. We stopped our discussion by identifying three dynamic forces at play which determine the quality of predictions:

  1. Weights or parameters which we learn
  2. The activation function, and
  3. The decision boundary

In this second part of the series we will look deeper into the first two of those dynamic forces.

Concept of Parameters

In the first part of this series, when we were discussing the example, we assumed a set of parameters, i.e. W(age) = 8; W(income) = 3 and W(propensity) = 10. Quite naturally, a question a lot of people asked me was: where did we get those values from? Well, as far as that example was concerned, they were just assumed values. However, in the world of machine learning, the parameters are the Holy Grail. The cardinal purpose of the algorithms and theorems of machine learning is to enable the pursuit of the right set of parameters. But why are the parameters so important? To answer this, let us look at what the parameters help us achieve.

Let us revisit the toy data set which we used in the first part. Let us first understand this data set before we get into understanding the parameters.

As can be seen, this data set consists of rows and columns. The data along the columns (Age, Income and Propensity) are called its features and the ones along the rows are the examples. In short, each customer record in this data set is an example.

Now that we have seen the data set, let us now see the dynamics between the parameters and the data.

The role of a parameter is to act as a weighting factor for a feature. In other words, each feature will have its own parameter playing the role of a weight. Our example data set has three features and therefore the number of parameters we will have is also three. In general, if there are ‘n’ features there should be at least ‘n’ parameters. (In practice we will have n+1 parameters, where the additional parameter is called the bias term. We will ignore that for the time being.) Please note here that the number of parameters does not depend on the number of examples.

Having looked at the anatomy of the data set and parameters, let us look at how the parameters are learned from a given data set.

Learning Parameters from data

The data set which is used for learning parameters is called a training set. There is a subtle difference between a training set and the one shown above. For the training set we will have an additional column and this additional column is for the labels or dependent variables.

[Table: training set with the features and an additional labels column]

The above data set is an example of a training set. The ‘labels’ column represents the result or outcome for each record. The records with ‘0’ are negative examples and those with ‘1’ are positive examples. In this context the negative examples are those customers who did not buy an insurance policy and the positive examples are the ones who bought one. The labels can also be interpreted from the perspective of the probability of buying. So the negative examples are the ones where the probability of a sale is low, i.e. near 0%, and the positive ones are those with a high probability, i.e. near 100%. In real life a training set can be made from the historical data of customers in the organisation, i.e. who are the customers? How many of them bought a policy? How many did not?

The way we go about the task of learning the parameters from the training set is as follows.

  • Random assumption of parameters: To start off, we select some arbitrary values for the parameters. For example, let us assume the following values: W(age) = 1; W(income) = 1 and W(propensity) = 1.
  • Scaling of the data: Once we have assumed the parameters, let us make some modifications to the training data set. If we note the values of each feature, the scale of values varies quite a bit between features. The values of the feature ‘Age’ are all two digit numbers, the values of ‘Income’ are four digit numbers, and so on. In machine learning, when the values fall within different scales, the accuracy of prediction gets affected. So it is good practice to normalize the data. One popular way is to subtract the average of the feature from each value and then divide by the range (the difference between the maximum and minimum values). Let us see this in action with the feature ‘Age’.

    Average value of ‘Age’ = (28 + 32 + 36 + 46) / 4 = 35.5
    Range of ‘Age’ = 46 - 28 = 18
    Scaled value for the first data point (28) = (28 - 35.5) / 18 = -0.4167

    Similarly we do it for the complete data set. The scaled data set is as represented below. Please note that we do not scale the labels.

    [Table: scaled training set]
  • Prediction with initial parameters: Once the data is scaled, we go to the next step of using the assumed parameters for prediction. As mentioned earlier, the parameters are like weights which need to be applied to each feature of the data. Therefore the first step in arriving at a prediction is to multiply each parameter with the corresponding feature and add up the weighted features for each example. The result is shown below. Please note that the labels are not involved in any of these operations.

    [Table: weighted sum for each example]

    Let us study the above table closely. The weighted sum column, which we get by applying the parameters to the features and adding them up, is the value which finally determines the prediction. However, for a classification problem the most intuitive way of representing the prediction is in terms of probabilities. As you know, a probability has to be within the range of ‘0’ to ‘1’. If you look at our weighted sum column, most of the values are outside that range. So our challenge is to apply some mathematical operation to represent them as probabilities. The mathematical operation we use for this purpose is called the activation function. One of the most common activation functions used in classification problems is the sigmoid function. By applying this function to the weighted sum column we convert it into numbers which can be interpreted as probabilities.

    [Table: predicted probabilities after applying the activation function]

    The new data set after applying the activation function is as represented above. Note that the probabilities column is our actual prediction and it can be interpreted as the probability that the customer will buy the insurance policy. So for the first customer there is only a 17.88% chance of buying the policy, and for the last customer there is a high chance (81.4%) of him/her buying the policy.
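The scaling and activation steps above can be sketched in a few lines. To stay with numbers that actually appear in this post, the sketch uses only the ‘Age’ feature with W(age) = 1, so the probabilities it prints are for illustration only and will not match the three-feature table above.

```python
import math

def scale(values):
    """Subtract the average, then divide by the range (max - min)."""
    average = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - average) / value_range for v in values]

def sigmoid(z):
    """Squash any number into the interval between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

ages = [28, 32, 36, 46]
scaled_ages = scale(ages)
print([round(a, 4) for a in scaled_ages])

# Illustration only: weighted sum using the single feature 'Age'.
w_age = 1.0
for scaled_age in scaled_ages:
    weighted_sum = w_age * scaled_age
    print(round(weighted_sum, 4), round(sigmoid(weighted_sum), 4))
```

With all three features, the weighted sum would simply be the sum of each scaled feature multiplied by its own parameter before the sigmoid is applied.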
Now that we have seen how we apply the activation function to get the prediction, we are a step closer to our final goal of learning the right parameters which give the most accurate prediction. This all-important step, called gradient descent, will be explained in the next part of the post. Please watch this space for the most important part of our logistic regression problem.

Bayesian Inference – A naive perspective

Many people have been asking me about the unusual name I have given this blog, “Bayesian Quest”. Well, the name is inspired by one of the important theorems in statistics, Bayes’ Theorem. There is also a branch of statistics called Bayesian Inference whose foundation is Bayes’ Theorem. Bayesian Inference has shot into prominence in this age of ‘Big Data’ and is widely used in machine learning. This week, I will give a perspective on Bayes’ Theorem.

The essence of statistics is to draw inferences about an unknown population from samples. Let me elaborate on this with an example. Suppose you are part of an agency specializing in predicting the outcomes of general elections. To publish the most accurate predictions, the ideal method would be to ask all the eligible voters within your country which party they are going to vote for. Obviously this is not possible, as the cost and time required to conduct such a survey would be prohibitive. So what do you, as a psephologist, do? That is where statistics and statistical inference methods come in handy. What you would do in such a scenario is select representative samples of people from across the country and ask them questions on their voting preferences. In statistical parlance this is called sampling. The idea behind sampling is that the samples so selected (if selected carefully) will reflect the mood and voting preferences of the general population. This act of inferring the unknown parameters of the population from the known parameters of the sample is the essence of statistics.

There are predominantly two philosophical approaches to statistical inference. The first one, which is the more classical of the two, is called the Frequentist approach and the second the Bayesian approach.

Let us first see how a frequentist will approach the problem of predictions. For the sake of simplicity let us assume that there are only two political parties, party A and party B. Any party which gets more than 50% of the popular vote wins the election. A frequentist will start the inference by first defining a set of hypotheses. The first hypothesis, which is called the null hypothesis, will assert that party A will get more than 50% of the vote. The other hypothesis, called the alternate hypothesis, will state the contrary, i.e. party A will not get more than 50% of the vote. Given these hypotheses, the next task is to test their validity using the sample data. Please note here that the two hypotheses are defined with respect to the population (all the eligible voters in the country) and not the sample.

Let our sample consist of 100 people who were interviewed. Out of this sample, 46 people said they will vote for party A, 38 people said that they will vote for party B and the remaining 16 people were undecided. The task at hand is to predict whether party A will get more than 50% in the general election, given the numbers we have observed in the sample. To do the inference, the frequentist will calculate a probability called the ‘P’ statistic (more commonly known as the p-value). The ‘P’ statistic in this case can be defined as follows: it is the probability of observing a sample at least as unfavourable to party A as ours, i.e. 46 or fewer people out of 100 saying they would vote for party A, assuming 50% or more of the population will vote for party A. Confused? Let me simplify this a bit more. Suppose there is a definite mood among the public in favor of party A. Then there is a high chance of seeing a sample where 40, 50 or even 60 people out of the 100 say that they will vote for party A. However, there is a very low chance of seeing a sample with only 10 people out of 100 saying that they will vote for party A. Please remember that these chances are with respect to our hypothesis that party A is very popular. On the contrary, if party A were very unpopular, then seeing 10 people out of 100 saying they will vote for party A would be quite plausible. The chance or probability of seeing the number we saw in our sample, under the condition that our hypothesis is true, is the ‘P’ statistic. Once the ‘P’ statistic is calculated, it is compared to a threshold value, usually 5%. If the ‘P’ value is less than the threshold value we will junk our null hypothesis that 50% or more people will vote for party A and go with the alternate hypothesis. On the contrary, if the ‘P’ value is more than 5% we will stick with our null hypothesis. This, in short, is how a frequentist will approach the problem.
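As a rough sketch of this calculation, the snippet below computes the probability of seeing 46 or fewer supporters of party A in a sample of 100. It rests on assumptions that are not in the discussion above: it uses an exact binomial model, it takes the boundary case of exactly 50% support to represent the null hypothesis, and it counts the 16 undecided people as not voting for party A.

```python
from math import comb

def binomial_pmf(k, n, p):
    # Probability of exactly k supporters in a sample of n
    # when the true share of supporters is p.
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def lower_tail_p_value(observed, n, p_null):
    # Probability of seeing `observed` or fewer supporters
    # if the true share of supporters is p_null.
    return sum(binomial_pmf(k, n, p_null) for k in range(observed + 1))

p_value = lower_tail_p_value(observed=46, n=100, p_null=0.5)
threshold = 0.05

print(f"P value = {p_value:.4f}")
if p_value < threshold:
    print("Junk the null hypothesis")
else:
    print("Stick with the null hypothesis")
```

Under these assumptions the ‘P’ value comes out well above the 5% threshold, so a sample of 46 out of 100 is not surprising enough to junk the null hypothesis.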

A Bayesian will approach this problem in a different way. A Bayesian will take into account historical data of past elections and then assume a probability of party A getting more than 50% of the popular vote. This assumption is called the prior probability. Suppose that, looking at the historical data of the past 10 elections, we find that party A got more than 50% of the votes in only 4 of them. In that scenario we will assume the prior probability of party A getting more than 50% of the votes to be 0.4 (4 out of 10). Once we have assumed a prior probability, we then look at our observed sample data (46 out of 100 saying they will vote for party A) and determine the possibility of seeing such data under each scenario. This possibility is called the likelihood. The likelihood and the prior are multiplied together (and then normalized so that the probabilities add up to 1) to get the final probability, called the posterior probability. The posterior probability is our updated belief based on the data we observed and also the historical prior we assumed. So if party A has a higher posterior probability than party B, we will assume that party A has a higher chance of getting more than 50% of the votes than party B. This is admittedly a very naive explanation of the Bayesian approach.
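The update from prior to posterior can be sketched as below. The prior of 0.4 and the sample of 46 out of 100 come from the example above. Everything else is an assumption made for illustration: the two scenarios are reduced to single hypothetical vote shares (55% if party A wins, 45% if it does not), and the likelihood is computed with a binomial model.

```python
from math import comb

def binomial_likelihood(k, n, share):
    # Chance of seeing k supporters in a sample of n if the
    # true share of supporters is `share`.
    return comb(n, k) * share**k * (1 - share) ** (n - k)

prior_win = 0.4               # 4 of the last 10 elections
prior_lose = 1 - prior_win

# Hypothetical vote shares standing in for the two scenarios.
share_if_win = 0.55
share_if_lose = 0.45

likelihood_win = binomial_likelihood(46, 100, share_if_win)
likelihood_lose = binomial_likelihood(46, 100, share_if_lose)

# Prior times likelihood, then normalize so the two posteriors add up to 1.
unnormalized_win = prior_win * likelihood_win
unnormalized_lose = prior_lose * likelihood_lose
posterior_win = unnormalized_win / (unnormalized_win + unnormalized_lose)

print(f"prior = {prior_win:.3f}, posterior = {posterior_win:.3f}")
```

Because a sample with 46 supporters is more likely under the lower vote share, the posterior for party A winning ends up below the prior of 0.4. A different pair of assumed vote shares would give a different posterior.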

Now that you have seen both the Bayesian and Frequentist approaches, you might be tempted to ask which is the better of the two. Well, this debate has been going on for many years and there is no right answer. It all depends on the context and the problem at hand. However, in the recent past Bayesian inference has gained ground on the Frequentist methods due to its ability to update prior beliefs as more data is observed. In addition, computing power is getting cheaper and faster, making Bayesian inference much more practical than it used to be. I will get into more examples of Bayesian inference in a future post.