Data Science Process
- Frame the problem: Who is your client? What problem of theirs needs to be solved? Translate an ambiguous request into a concrete, well-defined and well-understood problem.
- Collect the raw data needed to solve the problem: Which parts of the data are useful? If more data is needed to solve the customer's problem, what resources (time, money and infrastructure) will that take?
- Process the raw/collected data: Collected data is error prone (missing values, corrupt values, and so on). Convert or clean the data.
- Explore the data: Once the collected/raw data has been cleaned, understand it and gather all the information it conveys in order to identify trends or correlations.
- Perform in-depth analysis (machine learning, algorithms, models) to provide high-value insight to the customer.
- Present the results to the stakeholders.
Data Science Question Inquiry
Given a high-level problem, it's important to frame it as concrete questions that may help to solve it. For most problems, these questions include:
- Who is the target audience?
- What is the current process?
- What type of data is collected from the target audience?
Does this data provide enough information about the customer to translate into sales, or to show how the sales team can be organized to focus on the customers most likely to convert? If so, the sales team can be organized accordingly.
Once you have framed the problem and know your goal at a high level, it's time to play with the collected data.
Most customer data is stored in CRM software managed by a specific responsible organization. The backend is a SQL database with several linked tables, and there are several ways to extract data from it.
The database holds customers' personal information, so make sure the tool used or written to extract the data doesn't extract PII (Personally Identifiable Information).
In most data science projects, you will use data that already exists. It's very rare for new data to be collected, since that requires a huge effort from engineering and other business entities, and it takes a while to bear fruit.
The data retrieved from the database can be converted to CSV or JSON format and should be anonymized so that it can't be traced back to a specific customer.
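As a rough illustration of extracting data without PII, here is a minimal sketch using only Python's standard library. The `customers` table, its columns, the `SALT` value and the helper names are all hypothetical, and an in-memory SQLite database stands in for the real CRM backend. Note that replacing an ID with a salted hash is only pseudonymization; whether the remaining columns are safe to share still depends on your data.

```python
import csv
import hashlib
import io
import sqlite3

# Hypothetical schema: a single "customers" table standing in for the CRM backend.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
    "age INTEGER, channel TEXT, converted INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Alice Example", "alice@example.com", 27, "social media", 1),
        (2, "Bob Example", "bob@example.com", 45, "email", 0),
    ],
)

SALT = "replace-with-a-secret-value"  # assumption: kept outside the exported data


def pseudonymize(customer_id: int) -> str:
    """Return a salted SHA-256 digest so the raw CRM id never leaves the database."""
    digest = hashlib.sha256(f"{SALT}:{customer_id}".encode("utf-8")).hexdigest()
    return digest[:16]


def export_without_pii(connection: sqlite3.Connection) -> str:
    """Select only non-PII columns (no name, no email) and return them as CSV text."""
    rows = connection.execute(
        "SELECT id, age, channel, converted FROM customers ORDER BY id"
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["customer_key", "age", "channel", "converted"])
    for customer_id, age, channel, converted in rows:
        writer.writerow([pseudonymize(customer_id), age, channel, converted])
    return buffer.getvalue()


csv_text = export_without_pii(conn)
print(csv_text)
```

The key design choice is that the PII columns are never selected in the first place, rather than being filtered out after extraction.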
Before you dive into the raw data, it's worthwhile to put it through data wrangling, as raw data may contain mistakes and corrupt values.
Data Wrangling
Data wrangling is one of the most time-consuming steps, but it's worth the investment. During this process, a data scientist eliminates corrupted data and makes sense of the data that is available. One example is a customer engagement timestamp (01/19/1970) with your organization: if your organization didn't exist in 1970, the validity of that whole customer record is in question. The data scientist has to decide whether to eliminate the record entirely or to substitute some reasonable value. Additional fields, such as the customer conversion date and a converted flag, can also be included; they provide further insight, such as the time required to convert a potential customer into a consumer.
Finally, after a lot of data wrangling, you have a clean dataset from which to start drawing insights.
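The wrangling step described above might look something like the following sketch. The founding date, field names and records are all made up for illustration, and the sketch takes the "eliminate the record" option rather than imputing a value.

```python
from datetime import date

# Assumption: the organization was founded on this date; anything earlier is invalid.
FOUNDED = date(2005, 1, 1)

# Hypothetical records; field names are illustrative only.
records = [
    {"customer_key": "a1", "engaged": date(2019, 3, 1), "converted_on": date(2019, 4, 15)},
    {"customer_key": "b2", "engaged": date(1970, 1, 19), "converted_on": date(2019, 6, 1)},
    {"customer_key": "c3", "engaged": date(2020, 7, 10), "converted_on": None},
]


def clean(rows):
    """Drop rows with an impossible engagement date and derive conversion fields."""
    cleaned = []
    for row in rows:
        if row["engaged"] < FOUNDED:
            continue  # eliminate the record entirely instead of guessing a value
        converted = row["converted_on"] is not None
        days = (row["converted_on"] - row["engaged"]).days if converted else None
        cleaned.append({**row, "converted": converted, "days_to_convert": days})
    return cleaned


clean_rows = clean(records)
for row in clean_rows:
    print(row["customer_key"], row["converted"], row["days_to_convert"])
```

Here the record with the 1970 timestamp is dropped, and the remaining rows gain a `converted` flag and a `days_to_convert` duration.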
Exploratory Data Analysis (EDA)
In this process, a data scientist starts making more sense of the available data, eagerly exploring it to find out what information it provides to answer your questions (see Data Science Question Inquiry). It's very important to avoid pitfalls when doing EDA, and this is a good time to revisit your questions. You can start right away by plotting how many potential customers converted and how many didn't. You can plot a histogram of how much effort it took to convert potential customers, along with the average conversion time. You can submit a preliminary report and meanwhile generate further reports on conversion by age and sex group, which will give further insight. Further plots can break conversions down by the communication method used, i.e. email, social media or phone. These findings from EDA will help to utilize the sales team effectively.
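Before plotting anything, the same breakdowns can be computed as plain numbers. The sketch below uses invented records and hypothetical helper names (`conversion_rate_by`, `age_band`) to show conversion rate per communication method and per age band, plus the average conversion time.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cleaned records; values are made up purely for illustration.
customers = [
    {"age": 24, "channel": "social media", "converted": True, "days_to_convert": 12},
    {"age": 28, "channel": "email", "converted": False, "days_to_convert": None},
    {"age": 35, "channel": "email", "converted": True, "days_to_convert": 30},
    {"age": 41, "channel": "phone", "converted": False, "days_to_convert": None},
    {"age": 52, "channel": "email", "converted": True, "days_to_convert": 18},
]


def conversion_rate_by(rows, key_func):
    """Return {group: fraction converted} for the grouping given by key_func."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for row in rows:
        group = key_func(row)
        totals[group] += 1
        wins[group] += int(row["converted"])
    return {group: wins[group] / totals[group] for group in totals}


def age_band(row):
    """Bucket ages into decades, e.g. 24 -> '20-29'."""
    low = (row["age"] // 10) * 10
    return f"{low}-{low + 9}"


by_channel = conversion_rate_by(customers, lambda row: row["channel"])
by_age = conversion_rate_by(customers, age_band)
avg_days = mean(r["days_to_convert"] for r in customers if r["converted"])

print(by_channel)
print(by_age)
print(avg_days)
```

These dictionaries can then be handed to whatever plotting tool you prefer to produce the bar charts and histograms for the report.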
Feature Vector & Label
So far, we have collected enough information to create a predictive model. In machine learning, each data point used to build a predictive model is expressed as a feature vector. During the EDA phase we identified several factors that can be used to predict customer conversion, e.g. age and marketing method (email, phone, social media).
Besides the feature vector, we also need a label that tells the model what you want to predict for each data point, i.e. converted or not converted. Once the label is determined, a predictive model can be generated using a simple machine learning classification algorithm called logistic regression. This simple technique learns a model from the labeled data and reports not only a binary prediction but also the probability of conversion.
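To make the feature vector, label and logistic regression concrete, here is a small from-scratch sketch in pure Python. The training data is invented, the feature encoding (scaled age plus one-hot channel flags) is just one possible choice, and the learning rate and epoch count are arbitrary assumptions. In practice a library implementation such as scikit-learn's `LogisticRegression` would normally be used instead.

```python
import math

# Hypothetical training data: (age, channel, converted). Values are invented.
DATA = [
    (22, "social media", 1), (25, "social media", 1), (29, "social media", 1),
    (27, "email", 0), (45, "email", 1), (48, "email", 1), (51, "email", 0),
    (38, "phone", 0), (60, "phone", 0),
]


def features(age, channel):
    """Feature vector: bias, scaled age, and one-hot flags (phone is the baseline)."""
    return [1.0, age / 100.0, float(channel == "email"), float(channel == "social media")]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, x):
    """Probability of conversion for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def log_loss(weights, X, y):
    """Mean negative log-likelihood of the labels under the model."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(weights, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)


def fit(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log loss."""
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            error = predict_proba(weights, xi) - yi
            for j, xij in enumerate(xi):
                grad[j] += error * xij
        weights = [w - lr * g / len(y) for w, g in zip(weights, grad)]
    return weights


X = [features(age, channel) for age, channel, _ in DATA]
y = [label for _, _, label in DATA]
weights = fit(X, y)

p_social = predict_proba(weights, features(30, "social media"))
p_phone = predict_proba(weights, features(30, "phone"))
print(f"P(convert | 30, social media) = {p_social:.2f}, label = {int(p_social >= 0.5)}")
print(f"P(convert | 30, phone)        = {p_phone:.2f}, label = {int(p_phone >= 0.5)}")
```

The binary prediction is obtained by thresholding the probability at 0.5; the probability itself is often more useful to a sales team because it lets them rank prospects.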
Communicating Results
You have done the predictive modeling, but one of the last and most critical steps is conveying the model in a comprehensive and compelling way. Communicating your work effectively and for maximum impact is called 'data storytelling'. The story will include the important conclusions for the most important questions raised during EDA and the other phases. The main points our model should convey in the scenario described here are:
- Age: Which age groups are being converted?
- Marketing Method: Social media works for people under 30, and email campaigns work for people over 30.