# Mind of a Data Scientist – Part II

In the last post of this series we had a glimpse into the nuances of the business discovery and data engineering phases. These phases dealt with breaking down a business problem into the factors which influence it and collating data points related to it. In this post, we will look at how the data we collected is analysed further to give us insights for our modelling process. This phase is called the data discovery phase.

Data Discovery Phase

This phase is one of the most critical in the whole life cycle, as this is where one becomes familiar with the structure of the data and the interrelationships between the variables. There are two perspectives from which we approach the data discovery phase: the business perspective and the statistical perspective. Both these perspectives can be depicted as follows.

The business perspective deals with the relationships between the variables from the standpoint of the business domain. In contrast, the statistical perspective looks at the statistical characteristics of the data at hand, like its distributions, normality, skew etc. To help us elucidate these concepts let us take a case study.

Let us assume that a client of ours, who operates various cell sites, approaches us with a problem they are grappling with. They would like to know in advance the state of health of the batteries which power their cell sites, and they want our help in predicting when those batteries will fail. For this they have given us historical data on the measurements they have taken over time. Some of the key variables involved are readings of conductance, voltage, current and temperature, along with the cell site location. Our client has also given us some clues as to what might constitute the failure of a battery: they have asked us to look at trends where the conductance values show a precipitous fall over time, which might be an indicator of failing batteries. Equipped with this information, let us see how we can go about our task of data discovery. Let us first look at it from the business perspective.

Data Discovery – Business Perspective

The best way to embark on the data discovery phase is to think from the perspective of our business problem. Our business problem was to predict the impending failure of batteries. The obvious question which comes to mind is: what constitutes failure of a battery? We might not have a clear-cut recipe for failure at this point in time; however, what we have is a trail to follow. The trail is that of batteries which show a trend of dropping conductance over time. To follow this trail we first need to separate the batteries with a falling trend from those which do not show that trend. The next question is: how do we separate them out? The best way to do that is to go for some aggregating metric for the basic unit connected with our business problem. Let me elaborate on the last sentence with a pictorial representation of our data set.

Let the sample of the data we have at hand be as shown in the figure above. We have a number of batteries, say around 20,000 of them. For each battery we have readings of conductance over a time period of around 2 to 3 years. Each battery is associated with a plant (cell location). A plant may have multiple batteries; however, a battery will be associated with only one plant.

Now that we have seen the structure of our data set, let us come back to the earlier statement, i.e. “aggregating metric for the basic unit connected with the business problem”. There are two important terms in this statement:

1. Basic unit
2. Aggregating metric

In our case the basic unit connected with the business problem is the individual battery. If our business problem were to predict plant sites which can potentially fail, then our basic unit would be each plant site. The second term, the aggregating metric, is an aggregated measure of a variable associated with the basic unit under consideration. In our case it would be some aggregation of the conductance of each battery. Again, the type of aggregating metric depends on the business problem. So let us take a step back to the problem we set out for ourselves. We were concerned with identifying the batteries which had a falling trend. The more pronounced the falling trend, the more likely it is to be a failing battery. A battery whose conductance falls steeply covers a wide range of values over time, so when we think about an aggregating metric we should think about one which accentuates the spread of the data. A very handy metric to represent the spread of data is the standard deviation. So if we aggregate the values of each battery by taking the standard deviation of its conductance, we have a very effective method to identify the set of batteries we want. The same is represented in the plot below.

The above figure is a plot of the batteries along the x axis and the standard deviation of conductance along the y axis. Using our aggregating metric we can clearly see two groups of batteries, one with a standard deviation of less than 100 and the other with more than 300. The second group, i.e. batteries A and C, whose standard deviation is way above the rest, are potentially the cases we are looking for. Let us also plot the actual conductance values of these batteries over time to corroborate our hypothesis.

We can clearly see from the above plot that batteries A and C show a dropping trend, which was indicated by the high standard deviation for these batteries. So taking an aggregating metric like this helps us in zeroing in on the cases we want to dig into further.
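The aggregation step can be sketched in a few lines of Python. This is only a minimal illustration: the readings, the battery names B and D, the function names and the cut-off of 200 are all assumptions made up for the sketch, with the cut-off chosen to sit between the two groups (below 100 and above 300) described above.

```python
from statistics import pstdev

# Hypothetical conductance readings per battery, ordered by time.
# These numbers are invented for illustration; they are not the client's data.
readings = {
    "A": [1500, 1400, 1200, 900, 600],    # steep fall
    "B": [1500, 1490, 1510, 1495, 1505],  # stable
    "C": [1600, 1450, 1250, 1000, 700],   # steep fall
    "D": [1400, 1410, 1395, 1405, 1390],  # stable
}

def spread_by_battery(readings):
    """Aggregating metric: standard deviation of conductance for each battery."""
    return {battery: pstdev(values) for battery, values in readings.items()}

def flag_batteries(spreads, threshold):
    """Return the batteries whose spread is above the chosen threshold."""
    return sorted(b for b, spread in spreads.items() if spread > threshold)

spreads = spread_by_battery(readings)
# The threshold of 200 is an assumed cut-off between the two groups.
suspects = flag_batteries(spreads, threshold=200)
print(suspects)  # ['A', 'C']
```

Standard deviation measures spread and not direction, so a battery whose conductance rises sharply would be flagged too. That is why plotting the flagged batteries over time, as done above, remains a necessary check.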

Deep Diving

Now that we have identified our set of batteries which could potentially be problematic, the next step is to dive deep into those cases and try to identify other indicators which are associated with falling conductance. We need to look closely at some pictorial representation of the data and then ask further questions:

1. Is there any particular period of time when such trends occur?
2. Are there any specific patterns which we can unearth before the falling trend in conductance?
3. Is there anything special about the slope of the curve which shows a falling trend? … etc.

We need to look at all discernible patterns within that variable and build our intuitions on them. Once we have built our intuitions on one variable, it is time to move further and bring in other variables. We can bring in variables like voltage, current, temperature etc. and see how they behave with respect to the specific trends which we saw when we analysed only one variable (conductance). Some of the questions we can look at are the following:

1. How have voltage, current or temperature behaved during the period when we saw a drop in conductance?
2. Are there any specific trends in these variables before we saw the fall in conductance?
3. How have these variables behaved after the fall in conductance values?
4. Are there any prospects for more variables other than the ones we have? … etc.

These are the kinds of questions we have to ask to unearth the various relationships which exist between the variables in our data set. Asking all these questions and slicing and dicing each of the variables does the following:

1. Helps in determining the relative importance of variables
2. Provides a rough idea about relationships between variables
3. Gives insights into any variables that need to be derived from the existing variables
4. Gives us intuitions on any new variables which need to be brought in

All insights we unearth by asking such questions will help us immensely when we get into the downstream modelling activities.

Summing Up

Now that we have seen the business perspective of the data discovery phase, let us encapsulate the main steps in the process:

1. Identify a variable which potentially gives an indication of the problem we are trying to solve.
2. Derive some aggregating metric for the identified variable to help us split the basic units related to our problem.
3. Dive deep into the cases we have earmarked and look for trends in the variable we identified.
4. Introduce other variables and look for associations between the newly introduced variables and the trends we saw in the first variable.
5. Look for relationships between variables which give clues to the problem statement.
6. Build intuitions on any new variables that can be introduced to help in solving the problem.

The above is a set of broad guidelines as to how we can structure our thought process for the business perspective of the data discovery phase. In the next post we will deal with the statistical perspective of data discovery and how we can connect the dots between both these perspectives so as to give us intuitions for feature engineering and modelling. Watch this space for more.

# Mind of the Data Scientist – Part I

Over the past few months various people have been asking me to give them an end-to-end view of what it entails to be a data scientist. When I was contemplating this request I thought, rather than just providing an end-to-end process, let’s go a little deeper into how she or he thinks when confronted with an analytic problem. So from this week we are starting a new series called “The Mind of a Data Scientist”. The name of the series might ring a bell for many of you due to its similarity with Kenichi Ohmae’s famous book “The Mind of the Strategist”. Well, the name of the series is inspired by Kenichi Ohmae’s book. However, the similarity ends with the name. The path we will tread when trying to unravel the thinking process of a data scientist is as depicted below.

The above depiction is a bird’s-eye view of the maze a data scientist has to traverse in trying to address a problem. So let us tread this path and embark on a safari through the mind of a data scientist.

Business Discovery : In the Beginning……

As always, in the beginning there was some business challenge or problem which paved the way for a data science initiative. To be more contextual let us take an example. Let’s assume Eggs Incorporated, an agro products company, approached us to help them in predicting the yield of eggs. To help them solve this business problem they gave us historic data available in their internal systems.

So where do you think we will start in our quest to solve the problem at hand? The best way to start is by building our intuitions and hypotheses on the factors which influence the variable we are going to predict. We can call this variable the response variable, which in our case is the yield of egg production. To gain intuitions on the key factors which affect our response variable we have to embark on some secondary research and also engage with the business folks of Eggs Inc. We can call this phase of our safari the business discovery phase. The key factors we identify during this phase are called the independent variables or features. Through our business discovery phase we find that the key features which affect the yield of egg production are temperature, availability of electricity, good water, nutrients, quality of chicken feed, prevalence of diseases, vaccinations etc. In addition to identifying the key features, we also build our intuitions on the relationships between the features and the response variable, like ….

What kind of relationship exists between temperature and the yield of eggs?

Does the kind of chicken feed affect the yield?

Is there an association between availability of electricity and the yield?

…… etc.

The intuitions we build in the beginning will help us when we do our discovery of the data in later phases. After gaining intuitions on the variables that come into play and the relationships that exist between them, the next task is to validate our intuitions and hypotheses. Let us see how we do that.

The Grind …… : Getting the data ready to test our intuitions and hypothesis

To validate our hypotheses and intuitions we need to have data points related to the problem we are trying to solve. Aggregating these data points in the format we want is the most tedious part of our journey. Many of these data points might be available in various forms and modes within the organisation. There may also be a need to supplement the data available within the organisation with what is available outside, for example social media data or open data available in the public domain. Our aim is to get all the relevant data points into a neat form and shape so that we can work our way through them. There are no set rules as to how we do it. The only guide for us in getting this task accomplished is the problem statement we have set out to solve. However, this is one of the most time-consuming tasks in our whole journey.

When we talk about getting the data ready, we have to do an assessment of the four V’s connected with data:

1. Volume of data
2. Variety of data
3. Velocity of data
4. Veracity of data

Volume deals with the quantum of data we have at our disposal to play with. In most cases, the larger the volume, the better it is for creating a more representative model. However, bigger volumes also pose challenges in terms of speed and the ability of the resources we have at hand to process this data. Volume assessment will help us decide on adopting suitable parallel processing technologies so as to reduce the processing time.

Variety refers to the disparate forms in which our data points are generated at the source. Data might reside in many forms, e.g. traditional RDBMS tables, text, images, videos, log files etc. The more disparate the data sets are, the more complex our aggregation process is. The variety of data points will give clues on the adoption of the right data aggregating technologies.

The third ‘V’, i.e. velocity, deals with the frequency with which data points are generated. There could be data points which are generated continuously, like web stream data, whereas there could also be data which is generated intermittently. The velocity of data is an important consideration in feature engineering and also in the adoption of the right data aggregation technologies.

The last ‘V’, i.e. veracity, concerns how trustworthy each data point is and how much value it provides in the overall context of the problem. If we are not judicious in selecting variables based on their veracity, we will be inundated by a deluge of noisy variables, making it difficult to extract signals from the data we have.

All the above factors have to be borne in mind when we set about our task of molding the data points into a form which will make later analysis easy. The complexity and importance of the whole process have given rise to a stream called Data Engineering. In short, Data Engineering is all about extracting, collecting and processing the myriad data points so that they become amenable to downstream value realization processes.

Wrapping up the first part…

So far we have seen the formulation of the business problem and the engineering of the data points to give shape and direction to our subsequent steps in the data science journey. In the next post we will deal with two other critical elements in our life cycle, namely exploratory data analysis and feature engineering. These processes are instrumental in the formulation of the right model for the problem. Watch this space for more as we take our safari through the mind of the data scientist.

# Classification Algorithms: Random Forest – Part II

In the first part of this series we set the context for the Random Forest algorithm by introducing tree based algorithms for classification problems. In this post we will look at some of the limitations of the tree based model and how they were overcome, paving the way to a powerful model: Random Forest. Two major methods that were employed to overcome those pitfalls are bootstrapping and bagging. We will discuss them first before delving into random forest.

Bootstrapping and Bagging

When we discussed the tree based model we saw that such models are very intuitive, i.e. they are easy to interpret. However, such models suffer from a major drawback, i.e. high variance. Let us understand what high variance means in this context. Suppose we have a data set which we divide into three parts, fit a separate tree model on each part, and then use the three models to predict the result for the same new observation. The results we get from the three models can be very different. This is what we call, in statistical jargon, a “model with high variance”. High variance is obviously bad, as the reliability of the results we get is compromised. One effective way to overcome high variance is averaging. This would mean taking multiple data sets, fitting a tree based model on each of them, making predictions on new observations and then averaging the results from the individual tree models to get a more reliable result. This seems a very plausible solution. However, we have a major problem here. Averaging requires multiple data sets. But what if the data we have is quite limited and obtaining additional data is prohibitively expensive?

Lo and behold, we have a powerful method to help us out of this predicament, and it is called bootstrapping.

The word bootstrapping comes from the phrase “pulling oneself up by one’s bootstraps”. In essence it means doing some task considered impossible. In statistics, the bootstrapping procedure entails sampling from the available data set with replacement. Let me elaborate with an example. Suppose our data set has 10 observations (rows 1 to 10). From this data set we randomly pick an observation, say row 6. After that we put row 6 back into the data set and randomly pick another observation. Say this time we get row 8. We again put this observation back and repeat the process till we have 10 observations. Let us assume that the first set of observations we picked looks like this: 6, 8, 4, 8, 5, 6, 9, 1, 2, 5. You might have noticed that some observations repeat within the above set. That is perfectly all right in bootstrapping. We repeat this whole process till we have a collection of bootstrapped samples of 10 observations each. Once we have a collection, or a bag, of bootstrapped data sets, we fit a tree model on each of these sets, carry out predictions and then average the results. This whole process is called bagging (short for bootstrap aggregating). Bagging helps us get over our original problem of high variance, and the results mirror reality more closely.
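To make the sampling step concrete, here is a minimal sketch using only Python's standard library. The function names, the seed, the choice of five samples and the three example predictions are assumptions made for illustration. Fitting a tree on each sample is left out; the sketch only shows how the bag of samples is drawn and how per-model results would be averaged.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) observations from rows, with replacement."""
    return rng.choices(rows, k=len(rows))

def bagged_prediction(per_model_predictions):
    """Average the predictions made by the individual models (regression case)."""
    return sum(per_model_predictions) / len(per_model_predictions)

rng = random.Random(0)      # fixed seed, chosen only to make the run repeatable
rows = list(range(1, 11))   # rows 1 to 10, as in the example above

# A "bag" of bootstrapped samples; 5 samples is an arbitrary choice.
bag = [bootstrap_sample(rows, rng) for _ in range(5)]
for sample in bag:
    print(sample)           # repeats within a sample are expected

# Hypothetical predictions from three tree models for one new observation.
print(bagged_prediction([0.2, 0.4, 0.6]))
```

Each sample has the same size as the original data set, and because rows are put back after each draw, some rows appear more than once while others do not appear at all.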

Random Forest

Now that we have discussed bootstrapping and bagging, we are in a position to get into the nuances of random forest. The Random Forest algorithm provides an improvement over bagging by de-correlating the trees. Let me elaborate on the de-correlating part. When we discussed tree based methods in the last post, we talked about splitting the data set based on the best features. When we grow our trees on the bootstrapped samples, more often than not it is the same set of best features which gets picked to do the splits and thereby grow the trees. This results in a bunch of trees which look almost the same or, in statistical terms, are “correlated”. We have also discussed that the final result is obtained by averaging the results from all the tree models grown on the bootstrapped samples. It turns out that averaging predictions from correlated trees reduces variance far less than averaging predictions from uncorrelated trees, which results in sub-optimal predictions.

To overcome this, the Random Forest algorithm randomly picks a smaller subset of features as candidates each time a split is made. If there are “P” features in the data set, the size of the subset picked is approximately √P. The idea of randomly picking a subset of features at each split is to avoid being biased towards the best predictors. In the new setting, every predictor gets a chance of being picked, and the tree models will be more “representative”. Averaging the results from these representative trees provides more accurate predictions. In effect, the combination of bootstrapping, bagging and random picking of features provides the robustness inherent in the random forest model.
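The random picking of features can be sketched as follows. The feature names and the function name are hypothetical, and rounding √P to the nearest whole number is an assumption of this sketch; implementations differ on how they round.

```python
import math
import random

def pick_feature_subset(features, rng):
    """Pick roughly sqrt(P) candidate features at random, without repeats."""
    subset_size = max(1, round(math.sqrt(len(features))))
    return rng.sample(features, subset_size)

# 13 hypothetical feature names, so P = 13 and sqrt(P) is about 3.6.
features = [f"x{i}" for i in range(1, 14)]

rng = random.Random(0)  # fixed seed, chosen only to make the run repeatable
subset = pick_feature_subset(features, rng)
print(len(subset), subset)  # 4 candidate features
```

In a random forest this draw would be repeated for every split, so the strongest predictor is left out of the candidate set for many splits and the trees end up looking less alike.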

Out of Bag Error Estimation

There is a very straightforward method to estimate the error in a bagged model, and it is called “Out of Bag” (OOB) error estimation. In the example we discussed on bootstrapping, we had 10 observations in our first sample (6, 8, 4, 8, 5, 6, 9, 1, 2, 5). We can see that observations 3, 7 and 10 have not been picked in the first bootstrapped sample. These are called “Out of Bag” observations. In general, each bootstrapped sample contains only about two-thirds of the distinct observations, which means about one-third of the observations are OOB for each bootstrapped sample. OOB observations serve a very important purpose in the overall scheme of things, i.e. they act as test beds for estimating the error in the model. Let me illustrate this idea with an example. Let us take the case of observation 3. As seen, it is an OOB observation for the first bootstrapped sample. Let us assume that the same observation ends up as OOB for the 6th and 12th bootstrapped data sets too. When tree models are fit on the first, sixth and twelfth bootstrapped sets, observation 3 can be used as a test observation, giving three distinct predictions, one from each model. For regression, the three results for observation 3 are averaged to get a single prediction. For classification problems, the most prevalent class out of the three is taken. Once we have this single prediction, the error is estimated by comparing it against the true value or class of observation 3. The same error estimation is done for every observation using the trees for which it was OOB, to get an overall aggregation of error. This method of error estimation can stand in for cross validation, which can be cumbersome for large data sets.
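The OOB bookkeeping for the example above can be sketched as below. The three class predictions for observation 3 are invented for illustration, and the function names are assumptions. The last lines show where the "about one-third" figure comes from: the chance that a given row is never drawn in n draws with replacement is (1 - 1/n)^n.

```python
from collections import Counter

def out_of_bag(rows, sample):
    """Observations that never made it into the bootstrapped sample."""
    return sorted(set(rows) - set(sample))

def oob_vote(predictions):
    """Most prevalent class among the predictions (classification case)."""
    return Counter(predictions).most_common(1)[0][0]

rows = list(range(1, 11))
first_sample = [6, 8, 4, 8, 5, 6, 9, 1, 2, 5]  # the sample from the example above

oob = out_of_bag(rows, first_sample)
print(oob)  # [3, 7, 10]

# Hypothetical predictions for observation 3 from the trees fit on the
# 1st, 6th and 12th bootstrapped sets, where it was out of bag.
predicted_class = oob_vote([2, 1, 2])
print(predicted_class)  # 2

# Chance that a given row is left out of one bootstrapped sample of size 10.
share_left_out = (1 - 1 / 10) ** 10
print(round(share_left_out, 3))  # 0.349
```

As the number of observations grows, this share approaches 1/e, or about 0.368, which is why roughly one-third of the observations are out of bag for any given sample.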

Wrapping Up

The ideas behind the random forest model, i.e. bootstrapping, bagging, random feature selection etc., have aided the making of a very powerful algorithm. However, random forest is not bereft of pitfalls. One major pitfall of the model is that it can’t be interpreted easily. However, the positives of this model far outweigh the negatives, and because of this random forest remains one of the most popular algorithms for classification problems.

It is time to wrap up our discussion on tree based algorithms and random forest in particular. From the next post onward we start a new series called the “Mind of a Data Scientist”. In this series we take an exploratory walk through the thought process of a data scientist in enabling data-driven, informed decision making. Watch this space for more.

# Classification Algorithms : Random Forest – Part I, Setting the Context

In the last few posts, where we discussed logistic regression, we had a fair bit of discussion on classification problems. Classification problems are the most prevalent ones we encounter in real-world machine learning settings, and it is important to deal with the various facets of this problem. In the next few posts, we will decipher some of the popular algorithms used in the classification context. The first of those algorithms is called Random Forest. It is one of the most popular and powerful algorithms currently used in the classification setting. In addition to deciphering the dynamics of Random Forest, we will also look at a practical application powered by the Random Forest algorithm: a movie sentiment analyser built with a Random Forest model. Outlined below is the path we will traverse in our discussions:

• Random Forest: setting the context – An introduction to Tree based methods
• Introduction to bootstrapping and deciphering the dynamics of Random Forest
• Random Forest in action – Introduction of the movie sentiment analyser and Feature selection approaches
• Decoding the movie sentiment analyser with the Random Forest model.

Setting the stage with Tree based Methods

The Random Forest algorithm falls under a genre of prediction methods called tree based methods. Understanding the logic behind tree based methods will help immensely in getting a better handle on Random Forest. So let us start our discussion on tree based methods.

Tree based methods closely resemble the way humans take decisions. Let me start off with a small example of how I decide whether I should take an umbrella when I want to go for a walk. The decision tree is as depicted below.

###### Figure 1

Looking at the above decision tree, let us try to get some intuition on the key components. As seen from the tree, there are three decision gates which split my decision space, thereby helping me to take a more informed decision. In machine learning parlance these decision gates are called “predictors” or “features”, and the set of values a predictor can take is called the “predictor space” or “feature space”. For each predictor there is also a value which splits its predictor space. For example, for the first feature, “Did it rain in the past few days”, the value is binary, i.e. Yes/No (or 1/0 numerically). It is the combination of a predictor and its corresponding value which helps in finally arriving at a decision. The values need not be binary in nature. They can be continuous numbers or ordinal values. Taking cues from the above toy example, let us look at a real data set and try to understand the tree based method in more detail.

The data set we are going to analyse is called the “Heart” data set. It is available in the UCI Machine Learning Repository.  A snapshot of the data set is as given below.

`(Source: UCI Machine Learning Repository)`

This data set consists of records of 269 patients, along the rows, with symptoms of heart ailments. Columns 1 to 13 are the predictors/features, which are essentially various attributes that help in the detection of heart disease. The 14th column (“Pred”) is the outcome variable, which indicates whether the patient has heart disease or not: a value of “2” indicates the presence of heart disease and “1” its absence. Our aim is to understand how a tree based algorithm learns from a training set like the above and helps in predicting the presence of heart disease in a patient.

The first step in a tree based algorithm is to find a predictor (from the 13 predictors) and a value to split the data set into two distinct regions or groups. In our toy example, the predictor we used for the first split was the one which indicated the presence of rain in the previous days. For the time being let us assume that the best predictor to do our split is the 13th predictor, “Thal”. This predictor is the result of a thallium stress test and takes 3 values: 3, 6 and 7. A value of 3 indicates normal behavior, 6 indicates a fixed defect and 7 a reversible defect. Let us not worry too much about what fixed and reversible defects are, as they are medical jargon. We will only be concerned with the values 3, 6 and 7. Let us assume that the value at which we do the split is 3. In essence, any patient whose thallium stress test result is 3 will be in group 1, and those with a result greater than 3 will be in group 2. The first split is shown in the figure below.

###### Figure 2

As seen above, the first split divides the total of 269 patients into two groups of 151 and 118 respectively. In group 1, out of the 151 patients, 119 have the outcome variable as 1 (absence of heart disease) and the other 32 have an outcome of 2 (presence of heart disease). The corresponding values for group 2 are 31 and 87 respectively.

The next step is to grow the tree deeper by splitting the two branches we got after the first split with some other predictor. The choice of predictor for each branch can be the same or different. I will come to the criteria for the choice of predictor later on. For the time being let us assume the same predictor is used to split both branches further. Let the split happen with the 10th predictor (ST) and a value of 0.5. Any record with a value less than 0.5 goes to the branch on the left and any record with a value more than 0.5 goes to the branch on the right. The split is depicted below.

###### Figure 3

This process of splitting the nodes continues till a certain threshold is reached. The threshold can be, say, till no region has more than 5 observations. The nodes in the last layer we get after all the splits are called the terminal nodes.

Now that we have seen the process of splitting (growing a tree), let us deal with two important aspects we did not explain in detail:

1. How do we decide on the predictor/feature and value for making a split?
2. How do we do the predictions for any test set observations?

Selecting the predictors and their values

The process of selecting a predictor is an iterative one. We pick one predictor at a time, along with a candidate value within its predictor space, and assess whether that predictor and value are the best possible for carrying out the split. The way we do this assessment has its parallels with the way we find the right set of parameters by overall cost minimization in logistic regression. In tree based methods we do the assessment through a method called minimization of the classification error rate. The name might look intimidating; however, the idea behind it is quite simple. When we carry out the node splitting process, the labels or outcomes of all the observations in a branch are approximated to the most prevalent outcome in that branch. In the first split we did (figure 2) we split the observations into two branches. In the left branch, out of the total of 151 observations, 119 have an outcome of 1. So the most prevalent outcome in the left branch is outcome 1. Therefore all observations which fall in the left branch will be approximated as outcome 1, irrespective of their true outcome. In the process of approximating to the most prevalent outcome, we also make some classification errors. In the left branch, there were 32 observations which belonged to outcome 2 which we are approximating as outcome 1. This is the error we inherit because of our approximation. This error as a fraction of the total number of observations in that branch is called the classification error rate (i.e. 32/151, or about 21%). The criterion for selecting the right predictor and the right value to do the split is that the combination should yield the lowest classification error rate. In a nutshell, the overall process for selecting the best predictor and split value is as follows:

1. Pick a predictor and a value within the predictor space for doing a split
2. Calculate the classification error rate.
3. Repeat the process with another predictor and value noting the classification error rate obtained in each selection.
4. Pick the predictor-value combination which yields the lowest classification error rate to do the split of that particular node.
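The error rate calculation in the steps above can be sketched with the counts from figure 2. The function names are assumptions, and so is the way the two branches are combined into a single number for the split (misclassified observations across both branches divided by all observations), since the text only defines the rate for one branch.

```python
def classification_error_rate(class_counts):
    """Share of observations in a branch that are not the most prevalent outcome."""
    total = sum(class_counts.values())
    return (total - max(class_counts.values())) / total

def split_error_rate(branches):
    """Assumed way to score a whole split: misclassified observations over all observations."""
    total = sum(sum(counts.values()) for counts in branches)
    misclassified = sum(sum(c.values()) - max(c.values()) for c in branches)
    return misclassified / total

# Outcome counts after the first split on Thal (from figure 2).
left = {1: 119, 2: 32}    # group 1: 151 patients
right = {1: 31, 2: 87}    # group 2: 118 patients

print(round(classification_error_rate(left), 3))    # 0.212, i.e. 32/151
print(round(classification_error_rate(right), 3))   # 0.263, i.e. 31/118
print(round(split_error_rate([left, right]), 3))    # 0.234, i.e. 63/269
```

Repeating this calculation for every candidate predictor and value, and keeping the combination with the lowest score, is the search described in steps 1 to 4.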

Predictions for any test set observations

So far, what we have discussed is the training process. At the end of the training process what we have learned is a set of best predictors and their corresponding values to do the splits at each node. We have also discussed the approximation of outcomes to the most prevalent outcome for the calculation of the classification error rate. The approximation of outcomes has another utility, i.e. to determine the class of the terminal nodes. After the tree is grown to its full depth, each terminal node is assigned the outcome which is most prevalent in that node. This categorization is required when we have to do predictions for any test set.

Once the learning is done on the training set, the way we do prediction on a test set is quite straightforward. The test set examples are split according to the splits we learned during the training process. After the splits, each test set example finally ends up in one of the terminal nodes. The prediction for the test set example is the outcome of the terminal node where it ends up.
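The prediction step can be sketched as a walk down a small tree. Only the split values (Thal at 3, ST at 0.5) come from the example above. The shape of the tree, the outcomes assigned to the terminal nodes, the dictionary layout and the convention that values equal to the split value go left are all assumptions made for illustration.

```python
# Internal nodes hold a feature and a split value; terminal nodes hold an outcome.
# The terminal node outcomes below are hypothetical, not learned from the Heart data.
tree = {
    "feature": "Thal", "threshold": 3,
    "left": {
        "feature": "ST", "threshold": 0.5,
        "left": {"outcome": 1},
        "right": {"outcome": 2},
    },
    "right": {"outcome": 2},
}

def predict(node, observation):
    """Follow the learned splits until a terminal node is reached."""
    while "outcome" not in node:
        goes_left = observation[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["outcome"]

# Three hypothetical test set examples.
print(predict(tree, {"Thal": 3, "ST": 0.2}))  # 1
print(predict(tree, {"Thal": 3, "ST": 1.4}))  # 2
print(predict(tree, {"Thal": 7, "ST": 0.0}))  # 2
```

Each test example simply inherits the outcome of the terminal node it lands in, which is why assigning an outcome to every terminal node during training is required.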

Wrapping Up

As mentioned earlier, tree based methods are the basis for understanding advanced algorithms like Random Forest. Now that we have seen the tree based methods, we are well equipped to decipher Random Forest. We will do that in the next post. Watch this space for your safari of Random Forest.

# Logic of Logistic Regression – Part III

In our previous post on logistic regression we defined the concept of parameters and had a first hand glimpse on the dynamics between the data set and the parameters to obtain our first set of predictions. In this part we will go further into how we optimize the parameters in order to improve the accuracy of our predictions. We will be dealing with the following concepts

1. Deciphering the prediction errors
2. Minimizing errors through gradient descent and finding optimized parameters
3. Prediction with the optimized set of parameters.

Deciphering Prediction Errors

Let us revisit the toy example we discussed in our last post and dissect the below table which represented the dynamics of prediction.

To recap, let us list down our discussions in  the last post on the dynamics involved in the above table.

• We first assumed an initial set of parameters
• Multiplied the parameters with the respective features ( columns 2,3 &4) to get the weighted sum.
• Converted the weighted sum into predictions ( column 6) by applying the activation function (sigmoid function).

Let us take a moment to reflect on what the predictions really mean. The predictions are in fact the probabilities of the customer buying the insurance policy. For example, for the first customer, we are predicting that the probability that the customer will buy the insurance policy is about 17.9%.

However, when we talk about predictions, the first thing which comes to mind is the veracity of those predictions. How close to reality are the first set of predictions which we made? If we recall, in our last discussion on the training set, we introduced a new column called the labels. The labels are in fact the reality! For example, looking at the labels column we know that the first two customers did not buy the insurance policy (label of ‘0’) and the next two bought the insurance policy. The veracity of our predictions can be assessed by comparing our predictions with the reality manifested in the labels. By comparing, we can see that the first and last customer predictions are somewhat close to reality and the middle ones are pretty off target. In an ideal state, we want the first two predictions to be close to ‘0’ and the last two close to ‘1’. However, what we predicted has obviously deviated from reality. Such deviations are the errors we have inherited in our predictions. We need to note that the calculation of error for a classification problem like ours is a little mathematically involved and is not as straightforward as subtracting the probability from the labels. For the sake of simplicity, let us not get into those mathematical calculations and stick to our understanding that there are some errors inherited for each example. From the errors of each example we can find the average error by summing up the errors of all examples and dividing by the number of examples (4 in our case). In machine learning parlance the average error so obtained is called the ‘Cost’.
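For readers who would still like a peek at the calculation, the sketch below uses the log loss (cross-entropy), which is the error measure commonly paired with logistic regression. Treat it as an assumption about which measure is meant here. The first and last probabilities are the ones quoted in this series, while the two middle values are made up purely for illustration.

```python
import math

def log_loss(labels, probabilities):
    """Average cross-entropy error over all examples (the 'Cost')."""
    errors = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(labels, probabilities)
    ]
    return sum(errors) / len(errors)

labels = [0, 0, 1, 1]
# 0.179 and 0.814 come from the text; 0.60 and 0.40 are hypothetical
# stand-ins for the two "off target" middle predictions.
predictions = [0.179, 0.60, 0.40, 0.814]

cost = log_loss(labels, predictions)
print(cost)
```

The error for an example grows quickly as the predicted probability moves away from its label, so confident but wrong predictions are penalised the most.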

Now that we know that there is a ‘Cost’ involved in our predictions, our aim should be to minimize the cost so that our predictions are as close to reality as possible. However, the million dollar question is: how do we minimize the cost? What are the levers we have to reduce our costs? Going back to our toy example, the two entities we have played around with to get the predictions are the ‘data’ and the ‘parameters’. We cannot change the given data because it is fixed. So all we have got to play around with is the parameters which we assumed. We have to change our parameters systematically so that we minimize the cost and get our predictions as close to reality as possible. One of the ways we do this is by a procedure called gradient descent.

To understand the concept of gradient descent let us look at some graphical representations.

A pictorial representation of the cost function will look like the above. On the ‘X’ axis we have our parameters and on the ‘Y’ axis we have the cost. From the figure we can see that there is some set of parameters, ‘P’, with which we can get to the minimum cost ‘C_min’. Our aim is to find those parameters which will give us the minimum cost.

Let us represent the initial parameters we assumed as P_initial. For this set of parameters let us denote the  cost we derived as C1, as given in the figure. We can see from the figure that by moving the P value to the left ( decreasing the parameters ) by some value we can get to the minimum value of cost. Alternatively, if our initial ‘P’ value were to be on the left side of the graph, we would have to move to the right ( increase the value of parameters ) to get to the minimum cost. The procedure for achieving this is called the gradient descent.

The idea behind gradient descent is represented pictorially as below.

We change the parameters by small steps in an iterative fashion so as to get to the minimum cost. To find the “small steps” mentioned in the previous line, we use a trick from calculus called the partial derivative. By taking the partial derivative at each point of the cost curve, we get a value by which we have to adjust the parameters. With the new set of adjusted parameters we find the new cost. Again we find the partial derivative at the new cost level to get the next step which we have to take, and this process continues till we reach the minimum cost. An analogy to this process is as follows. Suppose we are on top of a hill, blindfolded, and we want to find our way down the hill. The way we can do this is by feeling the ground with our foot to find a spot which is lower than the one where we currently are and then moving to the new spot. From the new spot we repeat the process till we finally reach the bottom of the hill. Gradient descent works somewhat like this.

Summarizing our discussions on gradient descent, these are the steps we take to get the optimum parameters.

1. First, start off with the assumed random parameters.
2. Find the cost (errors) associated with the assumed parameters.
3. Find the small steps we have to take to alter our parameters, by taking the partial derivative of the cost.
4. Adjust the parameters by the small steps to get a new set of parameters.
5. Find the new cost associated with the new parameters.
6. Repeat steps 3, 4 & 5 till we reach the minimum cost.
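The steps listed above can be sketched in plain Python as below. This is a minimal sketch under stated assumptions: the cost is taken to be the log loss, the learning rate and number of steps are arbitrary choices, and the feature values are hypothetical scaled numbers (only the first column matches the scaled ‘Age’ values worked out in this series).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(params, example):
    return sigmoid(sum(w * x for w, x in zip(params, example)))

def cost(params, features, labels):
    """Average log loss over all examples (assumed cost function)."""
    total = 0.0
    for example, y in zip(features, labels):
        p = predict_probability(params, example)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def gradient(params, features, labels):
    """Partial derivative of the cost with respect to each parameter."""
    grads = [0.0] * len(params)
    for example, y in zip(features, labels):
        p = predict_probability(params, example)
        for j, x in enumerate(example):
            grads[j] += (p - y) * x
    return [g / len(labels) for g in grads]

def gradient_descent(features, labels, params, learning_rate=0.5, steps=200):
    history = [cost(params, features, labels)]           # step 2
    for _ in range(steps):                                # step 6
        grads = gradient(params, features, labels)        # step 3
        params = [w - learning_rate * g                   # step 4
                  for w, g in zip(params, grads)]
        history.append(cost(params, features, labels))    # step 5
    return params, history

# Hypothetical scaled training set: columns are Age, Income, Propensity.
features = [
    [-0.4167, -0.40, -0.50],
    [-0.1944, 0.30, -0.10],
    [0.0278, -0.20, 0.20],
    [0.5833, 0.30, 0.40],
]
labels = [0, 0, 1, 1]

learned_params, history = gradient_descent(features, labels, [1.0, 1.0, 1.0])  # step 1
print(history[0], history[-1])  # the cost should have gone down
```

The learning rate controls how big each “small step” is. If it is too large the cost can overshoot the minimum, and if it is too small the descent takes many more iterations.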

The optimized parameters which we finally get are called the learned parameters. Getting to these optimized parameters is the most involved part of machine learning. Once we learn the parameters using the training set, we are all set to do predictions, which is the objective of any machine learning process.

Doing Predictions

Having learned our set of optimized parameters from the training set, we are now equipped with enough ammunition to do predictions. For doing predictions we take a new set of data called the test set. However, there is a difference between the training set and the test set: the test set will not have any labels. Our job is to predict the labels from the parameters we have learned. So in the insurance company example, the test set would be the new set of leads which the sales team generated. We have to predict the likelihood of these leads buying an insurance policy. The way we do the prediction is as follows.

• We take the optimized set of parameters learned from the training set
• Multiply the parameters with the respective features ( columns 2,3 &4) to get the weighted sum.
• Convert the weighted sum into probabilities ( column 6) by applying the activation function (sigmoid function).
• We take a threshold point (say 0.5). Any probability less than the threshold point is predicted as ‘0’ (will not buy) and anything greater than the threshold point is predicted as ‘1’ (will buy).
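The prediction steps above could look like this in Python. It is only a sketch: the learned parameters and the scaled feature values of the new leads are hypothetical, and a probability exactly equal to the threshold is assigned to ‘1’, which is a choice the steps above leave open.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(params, example, threshold=0.5):
    """Return (probability, predicted label) for one test set example."""
    weighted_sum = sum(w * x for w, x in zip(params, example))
    probability = sigmoid(weighted_sum)
    label = 1 if probability >= threshold else 0
    return probability, label

# Hypothetical learned parameters for Age, Income and Propensity.
learned_params = [2.0, 1.5, 3.0]

# Hypothetical new leads, already scaled the same way as the training set.
new_leads = [
    [-0.30, -0.20, -0.40],
    [0.25, 0.10, 0.35],
]

for lead in new_leads:
    probability, label = predict(learned_params, lead)
    print(round(probability, 3), label)
```

Note that the test set has to be scaled with the same average and range values that were used for the training set, otherwise the learned parameters would be applied to numbers on a different scale.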

The threshold point which we take to make a decision on our predictions is called the decision boundary. Needless to say, logistic regression is the basic model among a vast set of powerful classification algorithms. The significance of logistic regression is that it is the building block for the development of powerful algorithms like Support Vector Machines, Neural Networks etc. Having said that, there are many problem areas where we have to go for simple algorithms like logistic regression. Having dealt with the basic building blocks of classification problems, we will have further discussions on some of the most powerful algorithms in future posts. Until then, watch this space for more.

# Logic of Logistic Regression – Part II

In the first part of this series on Logistic Regression, we set the stage for unveiling the logic behind logistic regression. We stopped our discussion by identifying three dynamic forces at play which determine the quality of predictions:

1. Weights or parameters which we learn
2. The activation function, and
3. The decision boundary

In this second part of the series we will look deeper into the first two of those dynamic forces.

#### Concept of Parameters

In the first part of this series, when we were discussing the example, we assumed a set of parameters, i.e. W(age) = 8; W(income) = 3 and W(propensity) = 10. Quite naturally, a question a lot of people asked me was: where did we get those values from? Well, as far as that example was concerned, they were just assumed values. However, in the world of machine learning, the parameters are the Holy Grail. The cardinal purpose of the algorithms and theorems of machine learning is to enable the pursuit of the right set of parameters. But why are the parameters so important? To answer this, let us look at what the parameters help us achieve.

Let us revisit the toy data set which we used in the first part. Let us first understand this data set before we get into understanding the parameters.

As can be seen, this data set consists of rows and columns. The data along the columns ( Age, Income & Propensity) are called its  features and the ones along the rows are the examples. In short each customer record in this data set is an example.

Now that we have seen the data set, let us now see the dynamics between the parameters and the data.

The role of the parameter is to act as a weighting factor for each of the features. In other words each feature will have a unique parameter playing the role of a weight. Our example data set has three features and therefore the number of parameters we will have is also three. In general if there are ‘n’ features there should be at least ‘n’ parameters ( However, in practice we will have n+1 parameters where the additional parameter is called the bias term. We will ignore that for the time being).  Please note here that the number of parameters does not depend on the number of examples.

Having looked at the anatomy of the data set and parameters, let us look at how the parameters are learned from a given data set.

#### Learning Parameters from data

The data set which is used for learning parameters is called a training set. There is a subtle difference between a training set and the one shown above. For the training set we will have an additional column and this additional column is for the labels or dependent variables.

The above data set is an example of a training set. The ‘labels’ column represents the result or outcome for each record. The records with ‘0’ are negative examples and those with ‘1’ are positive examples. In this context the negative examples are those customers who did not buy an insurance policy and the positive examples are the ones who bought one. The labels can also be interpreted from the perspective of probability of buying. So all the negative examples are the ones where the probability of sales is low, i.e. near 0%, and the positive ones are those with high probability, i.e. near 100%. In real life a training set can be made from the historical data of customers in the organisation, i.e. who are the customers? How many of them bought a policy? How many did not? etc.

The way we go about the task of learning the parameters from the training set is as follows:

• Random assumption of parameters: To start off, we select some arbitrary values for the parameters. For example, let us assume the following values for the parameters: W(age) = 1; W(income) = 1 and W(propensity) = 1
• Scaling of the data: Once we have assumed the parameters, let us make some modifications to the training data set. If we look at the values of each feature, the scale of the values varies quite a bit from feature to feature. The values of the feature ‘Age’ are all two digit numbers, the values of ‘Income’ are four digit numbers etc. In machine learning, when the values fall within different scales, the accuracy of prediction gets affected. So it is a good practice to normalize the data. One popular way is to subtract the average of the feature from each value and then divide by the range (the difference between the maximum value and the minimum value). Let us see this in action with the feature ‘Age’:
  Average value of ‘Age’ = (28 + 32 + 36 + 46) / 4 = 35.5
  Range of ‘Age’ = 46 - 28 = 18
  Scaled value for the first data point (28) = (28 - 35.5) / 18 = -0.4167
  Similarly, we do it for the complete data set. The scaled data set is as represented below. Please note that we do not scale the labels.
• Prediction with initial parameters: Once the data is scaled, we go to the next step of using the assumed parameters for prediction. As mentioned earlier, the parameters are like weights which need to be applied to each feature of the data. Therefore the first step in arriving at a prediction is to multiply the parameters with the corresponding features and add up the weighted features for each example. The same is carried out as below. Please note that the labels are not involved in any of these operations.
  Let us study the above column closely. The weighted sum column, which is obtained by applying the parameters to each feature and adding them up, is the value which finally determines the prediction. However, for a classification problem the most intuitive way of representing the prediction is in terms of probabilities. As you know, when you represent a value as a probability it has to be within the range of ‘0’ and ‘1’. However, if you look at our weighted sum column, most of the values are outside the range of 0 & 1. So our challenge is to apply some mathematical operation to represent them as probabilities. The mathematical operation we use for this purpose is called the activation function. One of the most common activation functions used in classification problems is the sigmoid function. By applying this function to the weighted sum column we convert it into numbers which can be interpreted as probabilities.
  The new data set after applying the activation function is as represented above. Note that the probabilities column is our actual prediction and it can be interpreted as the probability that the customer will buy the insurance policy. So for the first customer there is only a 17.88% chance of buying the policy and for the last customer there is a high chance (81.4%) of buying the policy.
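The scaling and activation steps described above can be sketched in Python as follows. The ‘Age’ values are the ones from the worked example; the function names are hypothetical, and the other features are left out because their raw values are not listed in the text.

```python
import math

def mean_normalize(values):
    """Subtract the average and divide by the range (max - min)."""
    average = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - average) / value_range for v in values]

def sigmoid(z):
    """Squash any weighted sum into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

ages = [28, 32, 36, 46]
scaled_ages = mean_normalize(ages)
print([round(v, 4) for v in scaled_ages])  # [-0.4167, -0.1944, 0.0278, 0.5833]

# A weighted sum of zero maps to a probability of exactly 0.5; negative sums
# map below 0.5 and positive sums above it.
print(sigmoid(0))  # 0.5
```

Because the sigmoid always returns a number between 0 and 1, its output can be read as a probability whatever the size of the weighted sum.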
Now that we have seen how we apply the activation function to get the prediction, we are a step closer to our final goal of learning the right parameters which give the most accurate prediction. This all important step, called gradient descent, will be explained in the next part of the post. Please watch this space for the most important part of our logistic regression problem.

# The Logic of Logistic Regression

At the outset, let me take this opportunity to wish each one of you a very happy and prosperous New Year. In this post I will start the discussion around one of the most frequent types of problems encountered in a machine learning context: the classification problem. I will also introduce one of the basic algorithms used in the classification context, called logistic regression.

In one of my earlier posts on machine learning I mentioned that the essence of machine learning is prediction. When we talk about prediction, there are basically two types of predictions we encounter in a machine learning context. In the first type, given some data, your aim is to estimate a real scalar value. For example, predicting the amount of rainfall from meteorological data, predicting stock prices based on the current economic environment, or predicting sales based on past market data are all valid use cases of the first type of prediction context. This genre of prediction problems is called the regression problem. The second type of problem deals with predicting the category or class the observed examples fall into. For example, classifying whether a given mail is spam or not, predicting whether a prospective lead will buy an insurance policy or not, or processing images of handwritten digits and classifying the images under the correct digit all fall under this gamut of problems. The second type of problem is called the classification problem. As mentioned earlier, classification problems are the most widely encountered ones in the machine learning domain and therefore I will devote considerable space to giving an intuitive sense of the classification problem. In this post I will define the basic settings for classification problems.

Classification Problems Unplugged – Setting the context

In a machine learning setting we work with two major components. One is the data we have at hand and the second is the set of parameters of the data. The dynamics between the data and the parameters provide us the results which we want, i.e. the correct prediction. Of these two components, the one which is readily available to us is the data. The parameters are something which we have to learn or derive from the available data. Our ability to learn the correct set of parameters determines the efficacy of our prediction. Let me elaborate with a toy example.

Suppose you are part of an insurance organisation and you have a large set of customer data and you would like to predict which of these customers are likely to buy a health insurance in the future.

For simplicity let us assume that each customer's data consists of three variables:

• Age of the customer
• Income of the customer and
• A propensity factor based on the interest the customer shows for health insurance products.

Let the data for 3 of our leads look like the below

| Customer | Age | Income | Propensity |
|----------|-----|--------|------------|
| Cust-3   | 62  | 4500   | 8          |

Suppose we also have a set of parameters which were derived from our historical data on past leads and the conversion rate (i.e. how many of the leads actually bought the insurance product).

Let the parameters be denoted by ‘W’ suffixed by the name of the variable, i.e

W(age) = 8 ; W(income) = 3 ; W(propensity) = 10

Once we have the data and the parameters, our next task is to use these two data points and arrive at some relative scoring for the leads so that we can make predictions. For this, let us multiply the parameters with the corresponding variables and find a weighted score for each customer.

Cust-3: 62 x 8 + 4500 x 3 + 8 x 10 = 14,076

Now that we have the weighted score for each customer, it's time to arrive at some decisions. From our past experience we have also observed that any lead obtaining a score of more than 14,000 tends to buy an insurance policy. So based on this knowledge we can comfortably predict that customer 1 will not buy the insurance policy and that there is a very high chance that customer 2 will buy the policy. Customer 3 is on the borderline and with a little effort one can convert this customer too. Equipped with this predictive knowledge, the sales force can then focus their attention on customers 2 & 3 so that they get more “bang for their buck”.
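The scoring and decision for customer 3 can be reproduced with a few lines of Python. This is just an illustrative sketch using the numbers given above; the dictionary layout and the function name are assumptions made for the example.

```python
# Parameters (weights) and the score threshold quoted in the text.
weights = {"age": 8, "income": 3, "propensity": 10}
score_threshold = 14000

# Data for customer 3 as given in the example.
cust_3 = {"age": 62, "income": 4500, "propensity": 8}

def weighted_score(customer, weights):
    """Multiply each variable by its parameter and add up the results."""
    return sum(weights[name] * customer[name] for name in weights)

score = weighted_score(cust_3, weights)
print(score)                    # 14076
print(score > score_threshold)  # True, but only just above the threshold
```

The score clears the threshold by a small margin, which is why customer 3 is described as a borderline case.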

In the above toy example, we can observe some interesting dynamics at play,

1. The derivation of the parameters for each variable – In machine learning, the quality of the results we obtain depends to a large extent on the parameters or weights we learn.
2. The derivation of the total score – In this example we multiplied the weights with the data and summed the results to get a score. In effect we applied a function (multiplication and addition) to get a score. In machine learning parlance such functions are called activation functions. The activation function converts the parameters and data into a composite measure which aids the final decision.
3. The decision boundary – The score (14,000) used to demarcate the examples as to whether the lead can be converted or not.

The efficacy of our prediction  is dependent on how well we are able to represent the interplay between all these dynamic forces. This in effect is the big picture on what we try to achieve through machine learning.

Now that we have set our context, I will delve deeper into these dynamics in the next part of this post. In the next part I will primarily be dealing with the dynamics of parameter learning. Watch this space for more on that.