Many people have been asking me on the unusual name I have given for this Blog – “Bayesian Quest”. Well, the name is inspired from one of the important theorems in statistics ‘The Bayes Theorem’. There is also a branch in statistics called Bayesian Inference whose foundation is the Bayes Theorem. Bayesian Inference has shot into prominence in this age of ‘Big Data’ and is therefore widely used in machine learning. This week, I will give a perspective on Bayes Theorem.
The essence of Statistics is to draw inference on an unknown population, from samples. Let me elaborate this with an example. Suppose you are part of an agency specializing in predicting poll outcomes of general elections. To publish the most accurate predictions, the ideal method would be to ask all the eligible voters within your country which party they are going to vote. Obviously we all know that this is not possible as the cost and time required to conduct such a survey will be prohibitively expensive. So what do you, as a Psephologist do ? That’s where statistics and statistical inference methods comes in handy. What you would do in such a scenario is to select representative samples of people from across the country and ask them questions on their voting preferences. In statistical parlance this is called sampling. The idea behind sampling is that, the sample sizes so selected( if selected carefully) will reflect the mood and voting preferences of the general population. This act of inferring the unknown parameters of the population from the known parameters of the sample is the essence of statistics.There are predominantly two philosophical approaches for doing statistical inference. The first one, which is the more classical of the two is called the Frequentist approach and the second the Bayesian approach.
Let us first see how a frequentist will approach the problem of predictions. For the sake of simplicity let us assume that there are only two political parties, party A and party B.Any party which gets more than 50% of popular votes wins in the election. A frequentist will start their inference by first defining a set of hypothesis. The first hypothesis, which is called the null hypothesis, will ascertain that party A will get more than 50% vote. The other hypothesis, called the alternate hypothesis, will state the contrary i.e. party A will not get more than 50% vote. Given these hypothesis, the next task is to test the validity of these hypothesis from the sample data. Please note here, that the two hypothesis are defined with respect to population(all the eligible voters in the country) and not the sample.
Let our sample size consist of 100 people who were interviewed. Out of this sample 46 people said they will vote for party A, 38 people said that they will vote for party B and the balance 16 people were undecided. The task at hand is to predict whether party A will get more than 50% in the general election given the numbers we have observed in the sample. To do the inference the frequentist will calculate a probability statistic called the ‘P’ statistic. The ‘P’ statistic in this case can be defined as follows – It is the probability of observing 46 people from a sample of 100 people who would vote for party A, assuming 50% or more of the population will vote for party A. Confused ????? ………….. Let me simplify this a bit more. Suppose there is a definite mood among the public in favor of party A, then there is a high chance of seeing a sample where 40 people or 50 or even 60 people out of the 100 saying that they will vote for party A. However there is very low chance to see a sample with only 10 people out of 100 saying that they will vote for party A. Please remember that these chances are with respect to our hypothesis that party A is very popular. On the contrary if party A were very unpopular, then the chance of seeing 10 people out of 100 saying they will vote for party A, is very plausible. The chance or probability of seeing the number we saw in our sample under the condition that our hypothesis is true is the ‘P’ statistic. Once the ‘P’ statistic is calculated , it is then compared to a threshold value usually 5%. If the ‘P’ value is less than the threshold value we will junk our null hypothesis that 50% or more people will vote for party A and will go with the alternate hypothesis. On the contrary if the P value is more than 5% we will stick with our null hypothesis. This in short is how a frequentist will approach the problem.
A Bayesian will approach this problem in a different way. A Bayesian will take into account historical data of past elections and then assume the probability of party A getting more than 50% of popular vote. This assumption is called the Prior probability.Looking at the historical data of the past 10 elections, we find that only in 4 of them party A has got more than 50% of votes. In that scenario we will assume the prior probability of party A getting more than 50% of votes as .4( 4 out of 10). Once we have assumed a prior probability, we then look at our observed sample data ( 46 out of 100 saying they will vote for party A) and determine the possibility of seeing such data under the assumed prior. This possibility is called the Likelihood. The likelihood and the prior is multiplied together to get the final probability called the posterior probability. The posterior probability is our updated belief based on the data we observed and also the historical prior we assumed. So if party A has higher posterior probability than party B, we will assume that Party A has higher chance of getting more than 50% of votes than party B. This is rather a very naive explanation to the Bayesian approach.
Now that you have seen both Bayesian and Frequentist approaches you might be tempted to ask which is the better among the two. Well this debate has been going on for many years and there is no right answer. It all depends on the context and the problem which is at hand. However, in the recent past Bayesian inference has gained a definite edge over the Frequentist methods due to its ability to update prior beliefs through observation of more data. In addition, computing power is also getting cheaper and faster making Bayesian inference much more fulfilling than Frequentist methods. I will get into more examples of Bayesian inference in a future post.