Logic of Logistic Regression – Part II

In the first part of this series on Logistic Regression, we set the stage for unveiling the logic behind logistic regression. We stopped our discussion by identifying three dynamic forces at play which determines the quality of predictions,

1. Weights or parameters which we learn
2. The activation function, and
3. The decision boundary

In this second, part of the series we will look deeper into the first two of those dynamic forces.

Concept of Parameters

In the first part of this series when we were discussing the example we assumed a set of parameters i.e W(age) = 8 ; W(income) = 3 and W(propensity) = 10. Quite naturally, a  question lot of people asked me was, where did we get those values from ? Well, as far as that example was concerned, it was just some assumed values. However in the world of machine learning, the parameters is its Holy Grail. The cardinal purpose of the algorithms and theorems of machine learning is to enable the pursuit of the right set of parameters. But why is it that the parameters, so important ? To answer this let us look at what the parameters help us achieve.

Let us revisit the toy data set which we used in the first part. Let us first understand this data set before we get into understanding the parameters.

As can be seen, this data set consists of rows and columns. The data along the columns ( Age, Income & Propensity) are called its  features and the ones along the rows are the examples. In short each customer record in this data set is an example.

Now that we have seen the data set, let us now see the dynamics between the parameters and the data.

The role of the parameter is to act as a weighting factor for each of the features. In other words each feature will have a unique parameter playing the role of a weight. Our example data set has three features and therefore the number of parameters we will have is also three. In general if there are ‘n’ features there should be at least ‘n’ parameters ( However, in practice we will have n+1 parameters where the additional parameter is called the bias term. We will ignore that for the time being).  Please note here that the number of parameters does not depend on the number of examples.

Having looked at the anatomy of the data set and parameters, let us look at how the parameters are learned from a given data set.

Learning Parameters from data

The data set which is used for learning parameters is called a training set. There is a subtle difference between a training set and the one shown above. For the training set we will have an additional column and this additional column is for the labels or dependent variables.

The above data set is an example for a training set. The ‘labels’ column represent the results or outcome for each record. The records with ‘0’ are negative examples and those with ‘1’ are the positive examples. In this context the negative example would mean those customers who did not buy an insurance policy and the positive examples are the ones who bought them. The labels can also be interpreted from the perspective of probability of buying. So all the negative examples are the ones where the probability of sales is low i.e near 0% and the positive ones are those with high probability i.e near 100%. In real life a training set can be made from the historical data of customers in the organisation i.e who are the customers ? How many of them bought a policy ? How many did not ? etc.

The way, we go about the task of learning the parameters from the training set is as follows

• Random Assumption of Parameters: To start off, we randomly select some arbitrary values for the parameters. For eg. let us assume the following values for the parameters ; W(age) = 1 ; W(income) = 1 and W(propensity) = 1
• Scaling of the data : Once that we have assumed the parameters let us do some modification on the training data setIf we note the values for each features, the scale of values for each feature vary quite a bit. The values of feature ‘Age’ are all two digit numbers, the values of ‘Income’ are four digit numbers etc. In machine learning, when the values falls within different scales, the accuracy of prediction gets affected. So it is a good practice to normalize the data. One popular way is to subtract each value with the average of the feature and then divide by the range( difference between the maximum value and minimum value). Let us see this in action,with the feature ‘Age’                                                                                                                                           Average value of ‘Age’ = (28+32+36+ 46)/ 4 = 35.5                                                                         Range of ‘Age’ = 46 – 28 = 18                                                                                                                Scaled value for the first data (28) = 28 – 35.5 / 18 = -0.4167                                                  Similarly we do it for the complete data set. The scaled data set is as represented below.    Please note that we do not scale the labels.
• Prediction with initial parameters : Once the data is scaled,  we go to the next step of using the assumed parameters for prediction. As mentioned earlier, the parameters are like weights which needs to be applied on each feature of the data. Therefore the first step in arriving at a prediction is to multiply the parameters with the corresponding feature and adding up the weighted features for each example. The same is carried out as below. Please note that the labels are not involved in any of these operations.      Let us study the above column closely. The weighted sum column which is got by applying the parameter on each feature and adding them up, is the value which finally determines the prediction. However for a classification problem the most intuitive way of representing the prediction is in terms of probabilities. As you know, when you represent a value as a probability it has to be within the range of ‘0’ and ‘1’. However if you note our weighted sum column, most of the values are outside the range of 0 & 1. So our challenge would be to apply some mathematical operation to represent them as a probability. The mathematical operation we use for this purpose is called the Activation Function.  One of the most common activation function used in classification problems is the  Sigmoid function . By applying this function on the weighted sum column we convert it into numbers which can be interpreted as probabilities.  The new data set after applying the activation function is as represented above. Note that the probabilities column is our actual prediction and it can be interpreted as the probability that the  customer will buy the insurance policy. So for the first customer there is only 17.88% chance for buying the policy and for the last customer there is a high chance ( 81.4 %) for him/her to buy the policy.                                                                                                                                                                                                                                   Now that we have seen how we apply the activation function to get the prediction, we are a step closer to our final goal of learning the right parameters which gives the most accurate prediction. This all important step called the gradient descent will be explained in the next part of the post. Please watch out this space for the most important part of our logistic regression problem.

The Logic of Logistic Regression

At the onset let me take this opportunity to wish each one of you a very happy and prosperous New Year. In this post I will start the discussion around one of the most frequent type of problems encountered in a machine learning context – classification problem. I will also introduce one of the basic algorithms used in the classification context called the logistic regression.

In one of my earlier posts on machine learning I mentioned that the essence of machine learning is prediction. When we talk about prediction there are basically two types of predictions  we encounter in a machine learning context. In the first type, given some data your aim is to estimate a real scalar value. For example, predicting the amount of rainfall  from meteorological data or predicting the stock prices based on the current economic environment or predicting sales based on the past market data are all valid use cases of the first type of prediction context. This genre of prediction problems is called the regression problem. The second type of problems deal with predicting the category or class the observed examples fall into. For example, classifying whether a given mail is spam or not , predicting whether a prospective lead will buy an insurance policy or not, or processing images of handwritten digits and classifying the images under the correct digit etc fall under this gamut of problem. The second type of problem is called the classification problem. As mentioned earlier classification problems are the most widely encountered ones in the machine learning domain and therefore I will devout considerable space to give an intuitive sense of the classification problem. In this post I will define the basic settings for classification problems.

Classification Problems Unplugged – Setting the context

In a machine learning setting we work around with two major components. One is the data we have at hand and the second are the parameters of the data. The dynamics between the data and the parameters provides us the results which we want i.e the correct prediction. Of these two components, the one which is available readily to us is the data. The parameters are something which we have to learn or derive from the available data. Our ability to learn the correct set of parameters determines the efficacy of our prediction. Let me elaborate with a toy example.

Suppose you are part of an insurance organisation and you have a large set of customer data and you would like to predict which of these customers are likely to buy a health insurance in the future.

For simplicity let us assume that each customers data consists of three variables

• Age of the customer
• Income of the customer and
• A propensity factor based on the interest the customer shows for health insurance products.

Let the data for 3 of our leads look like the below

Customer                Age                 Income                Propensity
Cust-3                                   62                     4500                            8

Suppose, we also have a set of parameters which were derived from our historical data on past leads and the conversion rate(i.e how many of the leads actually bought the insurance product).

Let the parameters be denoted by ‘W’ suffixed by the name of the variable, i.e

W(age) = 8 ; W(income) = 3 ; W(propensity) = 10

Once we have the data and the parameters, our next task is to use these two data points and arrive at some relative scoring for the leads so that we can make predictions. For this, let us multiply the parameters with the corresponding variables and find a weighted score for each customer.

Cust-3                  62 x 8          +   4500 x 3     +    8 x 10                 14,076

Now that we have the weighted score for each customer, its time to arrive at some decisions. From our past experience we have also observed that any lead, obtaining a score of  more than 14,000 tend to buy an insurance policy. So based on this knowledge we can comfortably make prediction that customer 1 will not buy the insurance policy and that there is very high chance that customer 2 will buy the policy. Customer 3 is in the borderline and with little efforts one can convert this customer too. Equipped with this predictive knowledge, the sales force can then focus their attention to customer 2 & 3 so that they get more “bang for their buck”.

In the above toy example, we can observe some interesting dynamics at play,

1. The derivation of the parameters for each variable – In machine learning, the quality of the results we obtain depend to a large extend on the parameters or weights we learn.
2. The derivation of the total score – In this example we multiplied the weights with the data and summed the results to get a score. In effect we applied a function(multiplication and addition) to get a score. In machine learning parlance such functions are called activation functions.The activation functions converts the parameters and data into a composite measure aiding the final decision.
3. The decision boundary – The score(14,000) used to demarcate the examples as to whether the lead can be converted or not.

The efficacy of our prediction  is dependent on how well we are able to represent the interplay between all these dynamic forces. This in effect is the big picture on what we try to achieve through machine learning.

Now that we have set our context, I will delve deeper into these dynamics in the next part of this post. In the next part I will primarily be dealing with the dynamics of parameter learning. Watch out this space for more on that.