Over the past few months, various people have been asking me for an end-to-end view of what it entails to be a data scientist. While contemplating this request I thought, rather than just laying out an end-to-end process, let us go a little deeper into how he or she thinks when confronted with an analytic problem. So this week we are starting a new series called "The Mind of a Data Scientist". The name of the series might ring a bell for many of you due to its similarity to Kenichi Ohmae's famous book "The Mind of the Strategist". Indeed, the name of the series is inspired by Ohmae's book; however, the similarity ends with the name. The path we will tread in trying to unravel the thinking process of a data scientist is depicted below.
The above depiction is a bird's-eye view of the maze a data scientist has to traverse in trying to address a problem. So let us tread this path and embark on a safari through the mind of a data scientist.
Business Discovery : In the Beginning……
As always, in the beginning there was some business challenge or problem which paved the way for a data science initiative. To make this more concrete, let us take an example. Assume Eggs Incorporated, an agro products company, approached us to help them predict the yield of eggs. To help us solve this business problem, they gave us historic data available in their internal systems.
So where do we start in our quest to solve the problem at hand? The best way to start is by building our intuitions and hypotheses about the factors which drive the variable we are going to predict. We call this variable the response variable, which in our case is the yield of egg production. To gain intuitions about the key factors which affect our response variable, we embark on some secondary research and also engage with the business folks at Eggs Inc. We can call this phase of our safari the business discovery phase. During this phase we build our intuitions about the key factors which affect our response variable. These key factors are called the independent variables, or features. Through our business discovery phase we find that the key features which affect the yield of egg production are temperature, availability of electricity, water quality, nutrients, quality of chicken feed, prevalence of diseases, vaccinations, etc. In addition to identifying the key features, we also build intuitions about the relationships between the features and the response variable, like...
What kind of relationship exists between temperature and the yield of eggs?
Does the kind of chicken feed affect the yield?
Is there an association between the availability of electricity and the yield?
The intuitions we build at the beginning will help us when we do our discovery of the data in later phases. After gaining intuitions about the variables at play and the relationships that exist between them, the next task is to validate our intuitions and hypotheses. Let us see how we do that.
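To make this concrete, here is a minimal sketch of turning one of the intuitions above into a testable check: "does the kind of chicken feed affect the yield?" All figures below are invented purely for illustration; real validation would of course use the historic data Eggs Inc. provides.

```python
# Sketch: comparing average yield across feed types to probe one hypothesis.
# The observations are hypothetical numbers, not real Eggs Inc. data.
from collections import defaultdict

# Hypothetical observations: (feed_type, eggs laid per day)
observations = [
    ("standard", 380), ("standard", 390), ("standard", 385),
    ("enriched", 420), ("enriched", 415), ("enriched", 425),
]

groups = defaultdict(list)
for feed, eggs in observations:
    groups[feed].append(eggs)

# A clear gap between group means lends support to the intuition
means = {feed: sum(vals) / len(vals) for feed, vals in groups.items()}
print(means)
```

This is only a first-pass sanity check; a gap in group means would still need a proper statistical test on real data before we treat the hypothesis as validated.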
The Grind …… : Getting the data ready to test our intuitions and hypotheses
To validate our hypotheses and intuitions we need data points related to the problem we are trying to solve. Aggregating these data points in the format we want is the most tedious part of our journey. Many of these data points might be available in various forms and modes within the organisation. There may also be a need to supplement the data available within the organisation with what is available outside, for example social media data or open data in the public domain. Our aim is to get all the relevant data points into a neat form and shape so that we can work our way through them. There are no set rules for how we do this; the only guide in accomplishing this task is the problem statement we have set out to solve. It is, however, one of the most time-consuming tasks in our whole journey.
When we talk about getting the data ready, we have to assess the four V's of data:
- Volume of data
- Variety of data
- Velocity of data and
- Veracity of data.
Volume deals with the quantum of data we have at our disposal to play with. In most cases, the larger the volume, the better our chances of building a representative model. However, bigger volumes also pose challenges to the speed and capacity of the resources we have at hand to process the data. A volume assessment helps us decide whether to adopt parallel processing technologies to speed up processing time.
Variety refers to the disparate forms in which our data points are generated at the source. Data might reside in many forms, i.e. traditional RDBMS tables, text, images, videos, log files, etc. The more disparate the data sets are, the more complex our aggregation process becomes. The variety of data points gives clues for adopting the right data aggregation technologies.
The third 'V', velocity, deals with the frequency at which data points are generated. Some data points are generated very regularly, like web stream data, whereas others are generated only intermittently. The velocity of data is an important consideration in feature engineering and in adopting the right data aggregation technologies.
The last 'V', veracity, concerns the value each data point provides in the overall context of the problem. If we are not judicious in selecting variables based on their veracity, we will be inundated in a deluge of noisy variables, making it difficult to extract signal from the data we have.
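One simple way to picture such a veracity check is to drop candidate variables whose relationship with the response is too weak to carry signal. The data, variable names, and the 0.5 cutoff below are illustrative assumptions, not a fixed rule.

```python
# Sketch: screening candidate variables by the strength of their
# correlation with the response. Figures and threshold are illustrative.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

egg_yield = [410, 400, 380, 360, 330]        # response variable
candidates = {
    "temp_c":    [24, 28, 32, 36, 40],       # plausible signal
    "dice_roll": [3, 1, 6, 2, 5],            # pure noise
}

# Keep only variables with a reasonably strong (absolute) correlation
kept = {name for name, vals in candidates.items()
        if abs(pearson(vals, egg_yield)) > 0.5}
print(kept)
```

In practice, veracity screening would involve domain judgment and more robust methods than a single correlation cutoff, but the idea of separating signal-bearing variables from noise is the same.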
All the above factors have to be borne in mind when we set about molding the data points into a form which makes later analysis easy. The complexity and importance of this whole process has given rise to a discipline called Data Engineering. In short, Data Engineering is all about extracting, collecting and processing myriad data points so that they become congenial for downstream value-realization processes.
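The extract-collect-process loop above can be sketched in miniature: pulling data points from two disparate sources into one tidy record per day. The source formats, field names, and values here are all hypothetical, standing in for whatever Eggs Inc.'s internal systems actually expose.

```python
# Minimal sketch of the data-engineering grind: merging a CSV export
# from an internal system with a JSON sensor log, keyed by date.
# All sources, fields, and values are hypothetical.
import csv
import io
import json

# Source 1: daily egg yield from the farm's internal system (CSV export)
yield_csv = """date,eggs
2024-01-01,410
2024-01-02,405
"""

# Source 2: shed temperature readings from an IoT sensor (JSON log)
temp_log = ('[{"date": "2024-01-01", "temp_c": 24.5},'
            ' {"date": "2024-01-02", "temp_c": 26.1}]')

# Aggregate into one flat record per date, ready for downstream analysis
records = {row["date"]: {"eggs": int(row["eggs"])}
           for row in csv.DictReader(io.StringIO(yield_csv))}
for entry in json.loads(temp_log):
    records[entry["date"]]["temp_c"] = entry["temp_c"]

print(records["2024-01-01"])
```

Real pipelines add schema checks, handling of missing dates, and incremental loads, but the essence is the same: disparate sources reduced to one congenial shape.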
Wrapping up the first part…
So far we have seen the formulation of the business problem and the engineering of the data points that give shape and direction to our subsequent steps in the data science journey. In the next post we will deal with two other critical elements of the life-cycle, namely exploratory data analysis and feature engineering. These processes are instrumental in formulating the right model for the problem. Watch this space for more as we continue our safari through the mind of the data scientist.