In my previous post on machine learning, I explained the science behind how a machine learns from its parameters. This week, I will delve into a very common application that we use in our day-to-day lives: next word prediction.
When we text on our smartphones, all of us have appreciated how our phones make typing easier by predicting or suggesting the word we have in mind. Many of us have also noticed that our phones predict the words we tend to use regularly in our personal lexicon. Our phones have learned from our pattern of usage and are giving us a personalized offering. This genre of machine learning falls under a very potent field called Natural Language Processing (NLP).
Natural Language Processing deals with the ways in which machines derive their learning from human languages. The basic input in the NLP world is something called a corpus (plural: corpora), which is essentially a collection of words, or groups of words, in the language. Some of the most prominent corpora for English are the Brown Corpus and the American National Corpus. Even Google has its own linguistic corpora, with which it achieves many of the amazing features in its products. Deriving learning out of a corpus is the essence of NLP. In the context we are discussing, i.e. word prediction, it is about learning from the corpus to do prediction. Let us now see how we do it.
The way we learn from the corpus is through some simple rules of probability. It all starts with calculating the frequencies of words, or groups of words, within the corpus. For finding the frequencies, we use something called an n-gram model, where the “n” stands for the number of words that are grouped together. The most common n-gram models are the trigram and the bigram models. For example, the sentence “the quick red fox jumps over the lazy brown dog” has the following word-level trigrams (source: Wikipedia):
the quick red
quick red fox
red fox jumps
fox jumps over
jumps over the
over the lazy
the lazy brown
lazy brown dog
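To make the splitting concrete, here is a minimal Python sketch of word-level n-gram extraction. The function name word_ngrams is my own, chosen for illustration, and the sketch assumes that lowercasing and splitting on whitespace is good enough as tokenization, which a real keyboard app would not settle for.

```python
def word_ngrams(sentence, n):
    """Return the word-level n-grams of a sentence as a list of tuples."""
    # Assumption: whitespace tokenization, case-insensitive.
    words = sentence.lower().split()
    # Slide a window of n words across the sentence, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "the quick red fox jumps over the lazy brown dog"
trigrams = word_ngrams(sentence, 3)
bigrams = word_ngrams(sentence, 2)

for gram in trigrams:
    print(" ".join(gram))
```

A sentence of 10 words yields 8 trigrams and 9 bigrams, since each window needs n consecutive words to fit.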
Similarly, a bigram model will split a given sentence into groups of two consecutive words. These groups of trigrams or bigrams form the basic building blocks for calculating the frequencies of word combinations. The idea behind calculating the frequencies of word groups goes like this. Suppose we want to calculate the frequency of the trigram “the quick red”. What we look for is how often the words “the” and “quick” are followed by “red” within the whole corpus. Suppose there are 5 instances in our corpus where the words “the” and “quick” are followed by the word “red”; then the frequency of this trigram is 5.
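The counting step can be sketched in a few lines of Python. The three-sentence toy_corpus below is made up purely for illustration (a real corpus would hold millions of words), and count_ngrams is a hypothetical helper name, not a library function.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count how often each word-level n-gram occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # assumption: whitespace tokenization
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Made-up toy corpus, for illustration only.
toy_corpus = [
    "the quick red fox jumps over the lazy brown dog",
    "the quick red car passed the quick brown fox",
    "the quick red fox sleeps",
]

trigram_counts = count_ngrams(toy_corpus, 3)
print(trigram_counts[("the", "quick", "red")])
print(trigram_counts[("the", "quick", "brown")])
```

In this toy corpus “the quick” is followed by “red” three times and by “brown” once, so those are the frequencies of the two trigrams.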
Once the frequencies of the trigrams are found, the next step is to calculate their probabilities. The probability of a trigram is just its frequency divided by the total number of trigrams within the corpus. Suppose there are around 500,000 trigrams in our corpus; then the probability of our trigram “the quick red” will be 5/500,000. For prediction, what we really need is a conditional probability, and the model built on such probabilities is called a Markov model (n-gram models are often loosely grouped with the Hidden Markov Model (HMM), although no hidden states are involved here). By the term conditional probability, we mean the probability of an event happening given that something else has already happened. In our trigram context, it means the probability of seeing the word “red” given that it is preceded by the words “the” and “quick”. It is estimated by dividing the frequency of the trigram “the quick red” by the frequency of its first two words, “the quick”. Extending the same concept to bigrams, it means the probability of seeing the second word given that we have seen the first word. So if “My God” is a bigram, the conditional probability is the probability of seeing the word “God” given that the preceding word is “My”.
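Here is a small sketch that computes both quantities side by side: the share of a trigram among all trigrams, and the conditional probability of the third word given the first two. The toy corpus and the names trigram_counts and conditional_probability are my own assumptions for illustration. The conditional probability is estimated as count(w1 w2 w3) divided by the number of trigrams that start with w1 w2, which is the usual maximum-likelihood estimate and involves no smoothing.

```python
from collections import Counter


def trigram_counts(sentences):
    """Count word-level trigrams across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # assumption: whitespace tokenization
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts


# Made-up toy corpus, for illustration only.
toy_corpus = [
    "the quick red fox jumps over the lazy brown dog",
    "the quick red car passed the quick brown fox",
    "the quick red fox sleeps",
]

counts = trigram_counts(toy_corpus)

# Share of one trigram among all trigrams in the corpus.
total_trigrams = sum(counts.values())
relative_frequency = counts[("the", "quick", "red")] / total_trigrams

# How often each two-word context is followed by any third word.
context_counts = Counter()
for (w1, w2, _w3), c in counts.items():
    context_counts[(w1, w2)] += c


def conditional_probability(w1, w2, w3):
    """Estimate P(w3 | w1 w2) from raw counts; 0.0 for an unseen context."""
    seen = context_counts[(w1, w2)]
    return counts[(w1, w2, w3)] / seen if seen else 0.0


print(relative_frequency)
print(conditional_probability("the", "quick", "red"))
```

In this toy corpus there are 18 trigrams in total, 3 of which are “the quick red”, while “the quick” is followed by some word 4 times. So the trigram's share is 3/18, but the chance of “red” after “the quick” is 3/4. The second number is the one a predictor cares about.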
The trigrams and bigrams, along with their calculated probabilities, are arranged in a huge table that forms the basis of the word prediction algorithm. The mechanism of prediction works like this. Suppose you were planning to type “Oh my God” and you have typed the first word, “Oh”. The algorithm will quickly go through the n-gram table and identify the n-grams starting with the word “Oh”, in order of their probabilities. So if the top entries in the n-gram table starting with “Oh” are “Oh come on”, “Oh my God” and “Oh dear Lord”, in decreasing order of probability, the algorithm will predict the words “come”, “my” and “dear” as your three choices as soon as you type the first word “Oh”. If, after typing “Oh”, you also type “my”, the algorithm reworks the prediction and looks for the most probable n-grams that begin with the words “Oh” and “my”. In this case the word “God” might be the most probable choice, and that is what is predicted. The algorithm keeps giving predictions as you keep typing more words. At every point in your texting, the algorithm looks at the last two words you have typed to predict the next word, and the process continues.
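The lookup described above can be sketched as follows. This is a minimal illustration under my own assumptions, not how any particular phone keyboard works: the toy corpus is made up, build_table and suggest are hypothetical names, and falling back from a two-word context to a one-word context when the pair has never been seen is a simplification I added. Ranking by raw counts within a context gives the same order as ranking by conditional probability, because the denominator is the same for every candidate.

```python
from collections import Counter, defaultdict


def build_table(sentences, max_context=2):
    """Map each context of 1..max_context words to counts of the next word."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()  # assumption: whitespace tokenization
        for size in range(1, max_context + 1):
            for i in range(len(words) - size):
                context = tuple(words[i:i + size])
                table[context][words[i + size]] += 1
    return table


def suggest(table, typed, k=3, max_context=2):
    """Suggest up to k next words based on the last words typed so far."""
    words = typed.lower().split()
    # Try the longest available context first, then fall back to a shorter one.
    for size in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-size:])
        if context in table:
            return [word for word, _ in table[context].most_common(k)]
    return []  # nothing typed yet, or the context was never seen


# Made-up toy corpus, for illustration only.
toy_corpus = (
    ["oh come on"] * 4
    + ["oh my god"] * 2
    + ["oh my word", "oh dear lord"]
)

table = build_table(toy_corpus)
print(suggest(table, "Oh"))     # ['come', 'my', 'dear']
print(suggest(table, "Oh my"))  # ['god', 'word']
```

With one word typed, only the bigram rows are consulted. Once two words are available, the trigram rows take over, which mirrors the “last two words” behaviour described above.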
The algorithm I have explained here is a very simple one, involving n-grams and a Markov model. Needless to say, there are more sophisticated approaches that rely on more complex models such as neural networks. I will explain neural networks and their applications in a future post.