At the onset let me take this opportunity to wish each one of you a very happy and prosperous New Year. In this post I will start the discussion around one of the most frequent type of problems encountered in a machine learning context – classification problem. I will also introduce one of the basic algorithms used in the classification context called the logistic regression.

In one of my earlier posts on machine learning I mentioned that the essence of machine learning is prediction. When we talk about prediction there are basically two types of predictions we encounter in a machine learning context. In the first type, given some data your aim is to estimate a real scalar value. For example, predicting the amount of rainfall from meteorological data or predicting the stock prices based on the current economic environment or predicting sales based on the past market data are all valid use cases of the first type of prediction context. This genre of prediction problems is called the regression problem. The second type of problems deal with predicting the category or class the observed examples fall into. For example, classifying whether a given mail is spam or not , predicting whether a prospective lead will buy an insurance policy or not, or processing images of handwritten digits and classifying the images under the correct digit etc fall under this gamut of problem. The second type of problem is called the classification problem. As mentioned earlier classification problems are the most widely encountered ones in the machine learning domain and therefore I will devout considerable space to give an intuitive sense of the classification problem. In this post I will define the basic settings for classification problems.

**Classification Problems Unplugged – Setting the context**

In a machine learning setting we work around with two major components. One is the data we have at hand and the second are the parameters of the data. The dynamics between the data and the parameters provides us the results which we want i.e the correct prediction. Of these two components, the one which is available readily to us is the data. The parameters are something which we have to learn or derive from the available data. Our ability to learn the correct set of parameters determines the efficacy of our prediction. Let me elaborate with a toy example.

Suppose you are part of an insurance organisation and you have a large set of customer data and you would like to predict which of these customers are likely to buy a health insurance in the future.

For simplicity let us assume that each customers data consists of three variables

- Age of the customer
- Income of the customer and
- A propensity factor based on the interest the customer shows for health insurance products.

Let the data for 3 of our leads look like the below

##### Customer Age Income Propensity

###### Cust-1 22 1000 1

###### Cust-2 36 5000 6

###### Cust-3 62 4500 8

Suppose, we also have a set of parameters which were derived from our historical data on past leads and the conversion rate(i.e how many of the leads actually bought the insurance product).

Let the parameters be denoted by ** ‘W’ **suffixed by the name of the variable, i.e

W(age) = 8 ; W(income) = 3 ; W(propensity) = 10

Once we have the data and the parameters, our next task is to use these two data points and arrive at some relative scoring for the leads so that we can make predictions. For this, let us multiply the parameters with the corresponding variables and find a weighted score for each customer.

###### Customer Age Income Propensity Total Score

###### Cust-1 22 x 8 + 1000 x 3 + 1 x 10 3186

###### Cust-2 36 x 8 + 5000 x 3 + 6 x 10 15,348

###### Cust-3 62 x 8 + 4500 x 3 + 8 x 10 14,076

Now that we have the weighted score for each customer, its time to arrive at some decisions. From our past experience we have also observed that any lead, obtaining a score of more than 14,000 tend to buy an insurance policy. So based on this knowledge we can comfortably make prediction that customer 1 will not buy the insurance policy and that there is very high chance that customer 2 will buy the policy. Customer 3 is in the borderline and with little efforts one can convert this customer too. Equipped with this predictive knowledge, the sales force can then focus their attention to customer 2 & 3 so that they get more “bang for their buck”.

In the above toy example, we can observe some interesting dynamics at play,

**The derivation of the parameters for each variable**– In machine learning, the quality of the results we obtain depend to a large extend on the parameters or weights we learn.**The derivation of the total score –**In this example we multiplied the weights with the data and summed the results to get a score. In effect we applied a function(multiplication and addition) to get a score. In machine learning parlance such functions are called activation functions.The activation functions converts the parameters and data into a composite measure aiding the final decision.**The decision boundary –**The score(14,000) used to demarcate the examples as to whether the lead can be converted or not.

The efficacy of our prediction is dependent on how well we are able to represent the interplay between all these dynamic forces. This in effect is the big picture on what we try to achieve through machine learning.

Now that we have set our context, I will delve deeper into these dynamics in the next part of this post. In the next part I will primarily be dealing with the dynamics of parameter learning. Watch out this space for more on that.