In the first part of this series on Logistic Regression, we set the stage for unveiling the logic behind logistic regression. We stopped our discussion by identifying three dynamic forces at play which determines the quality of predictions,
- Weights or parameters which we learn
- The activation function, and
- The decision boundary
In this second, part of the series we will look deeper into the first two of those dynamic forces.
Concept of Parameters
In the first part of this series when we were discussing the example we assumed a set of parameters i.e W(age) = 8 ; W(income) = 3 and W(propensity) = 10. Quite naturally, a question lot of people asked me was, where did we get those values from ? Well, as far as that example was concerned, it was just some assumed values. However in the world of machine learning, the parameters is its Holy Grail. The cardinal purpose of the algorithms and theorems of machine learning is to enable the pursuit of the right set of parameters. But why is it that the parameters, so important ? To answer this let us look at what the parameters help us achieve.
Let us revisit the toy data set which we used in the first part. Let us first understand this data set before we get into understanding the parameters.
As can be seen, this data set consists of rows and columns. The data along the columns ( Age, Income & Propensity) are called its features and the ones along the rows are the examples. In short each customer record in this data set is an example.
Now that we have seen the data set, let us now see the dynamics between the parameters and the data.
The role of the parameter is to act as a weighting factor for each of the features. In other words each feature will have a unique parameter playing the role of a weight. Our example data set has three features and therefore the number of parameters we will have is also three. In general if there are ‘n’ features there should be at least ‘n’ parameters ( However, in practice we will have n+1 parameters where the additional parameter is called the bias term. We will ignore that for the time being). Please note here that the number of parameters does not depend on the number of examples.
Having looked at the anatomy of the data set and parameters, let us look at how the parameters are learned from a given data set.
Learning Parameters from data
The data set which is used for learning parameters is called a training set. There is a subtle difference between a training set and the one shown above. For the training set we will have an additional column and this additional column is for the labels or dependent variables.
The above data set is an example for a training set. The ‘labels’ column represent the results or outcome for each record. The records with ‘0’ are negative examples and those with ‘1’ are the positive examples. In this context the negative example would mean those customers who did not buy an insurance policy and the positive examples are the ones who bought them. The labels can also be interpreted from the perspective of probability of buying. So all the negative examples are the ones where the probability of sales is low i.e near 0% and the positive ones are those with high probability i.e near 100%. In real life a training set can be made from the historical data of customers in the organisation i.e who are the customers ? How many of them bought a policy ? How many did not ? etc.
The way, we go about the task of learning the parameters from the training set is as follows
- Random Assumption of Parameters: To start off, we randomly select some arbitrary values for the parameters. For eg. let us assume the following values for the parameters ; W(age) = 1 ; W(income) = 1 and W(propensity) = 1
- Scaling of the data : Once that we have assumed the parameters let us do some modification on the training data set. If we note the values for each features, the scale of values for each feature vary quite a bit. The values of feature ‘Age’ are all two digit numbers, the values of ‘Income’ are four digit numbers etc. In machine learning, when the values falls within different scales, the accuracy of prediction gets affected. So it is a good practice to normalize the data. One popular way is to subtract each value with the average of the feature and then divide by the range( difference between the maximum value and minimum value). Let us see this in action,with the feature ‘Age’ Average value of ‘Age’ = (28+32+36+ 46)/ 4 = 35.5 Range of ‘Age’ = 46 – 28 = 18 Scaled value for the first data (28) = 28 – 35.5 / 18 = -0.4167 Similarly we do it for the complete data set. The scaled data set is as represented below. Please note that we do not scale the labels.
- Prediction with initial parameters : Once the data is scaled, we go to the next step of using the assumed parameters for prediction. As mentioned earlier, the parameters are like weights which needs to be applied on each feature of the data. Therefore the first step in arriving at a prediction is to multiply the parameters with the corresponding feature and adding up the weighted features for each example. The same is carried out as below. Please note that the labels are not involved in any of these operations. Let us study the above column closely. The weighted sum column which is got by applying the parameter on each feature and adding them up, is the value which finally determines the prediction. However for a classification problem the most intuitive way of representing the prediction is in terms of probabilities. As you know, when you represent a value as a probability it has to be within the range of ‘0’ and ‘1’. However if you note our weighted sum column, most of the values are outside the range of 0 & 1. So our challenge would be to apply some mathematical operation to represent them as a probability. The mathematical operation we use for this purpose is called the Activation Function. One of the most common activation function used in classification problems is the Sigmoid function . By applying this function on the weighted sum column we convert it into numbers which can be interpreted as probabilities. The new data set after applying the activation function is as represented above. Note that the probabilities column is our actual prediction and it can be interpreted as the probability that the customer will buy the insurance policy. So for the first customer there is only 17.88% chance for buying the policy and for the last customer there is a high chance ( 81.4 %) for him/her to buy the policy. Now that we have seen how we apply the activation function to get the prediction, we are a step closer to our final goal of learning the right parameters which gives the most accurate prediction. This all important step called the gradient descent will be explained in the next part of the post. Please watch out this space for the most important part of our logistic regression problem.