Real World Data Science Case Studies, Projects With Python
So how do we make algorithms find useful patterns in data? The main difference between machine learning and conventionally programmed algorithms is that a machine learning algorithm can process data without being explicitly programmed for every case. In practice, this means an engineer isn’t required to give the machine elaborate instructions on how to treat each type of data record. Instead, the machine derives these rules itself from the input data.
Whatever the machine learning application, the general workflow remains the same, and it repeats whenever the results become dated or higher accuracy is needed. This section introduces the basic concepts that make up the machine learning workflow.
The core artifact of any machine learning project is a mathematical model, which describes how an algorithm processes new data after being trained on a subset of historic data. The goal of training is to develop a model capable of predicting a target value (attribute), that is, a value that is unknown for each new data object. While this sounds complicated, it really isn’t.
For example, suppose you need to predict whether customers of your eCommerce store will make a purchase or leave. These predictions, buy or leave, are the target attributes that we are looking for. To train a model to make this type of prediction, you “feed” an algorithm a dataset that stores records of customer behavior together with the results (whether customers left or made a purchase). By learning from this historic data, the model will be able to make predictions on future data.
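To make the buy-or-leave example concrete, here is a minimal sketch in plain Python. It assumes a single hypothetical feature (pages viewed per visit) and a deliberately simple model: one threshold learned from historic records. The data, the feature, and the function names are illustrative assumptions, not part of any particular library or real dataset.

```python
from typing import List, Tuple

# Hypothetical historic records: (pages_viewed, made_purchase).
HISTORY: List[Tuple[int, bool]] = [
    (1, False), (2, False), (3, False), (4, False),
    (5, True), (6, False), (7, True), (8, True), (9, True),
]


def train_threshold(records: List[Tuple[int, bool]]) -> int:
    """Return the pages_viewed threshold that misclassifies the fewest records.

    The rule "predict a purchase when pages_viewed >= threshold" is not
    written by hand; the threshold is derived from the data.
    """
    best_threshold, best_errors = 0, len(records) + 1
    for threshold in sorted({pages for pages, _ in records}):
        errors = sum((pages >= threshold) != bought for pages, bought in records)
        if errors < best_errors:  # keep the first threshold with the fewest errors
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold: int, pages_viewed: int) -> str:
    """Apply the learned rule to a new, unseen visit."""
    return "buy" if pages_viewed >= threshold else "leave"


threshold = train_threshold(HISTORY)
print(threshold)               # 5
print(predict(threshold, 10))  # buy
print(predict(threshold, 2))   # leave
```

The historic data contains one noisy record (six pages viewed, no purchase), so the learned rule is not perfect on the training data. Real models use many features and more capable algorithms, but the idea is the same: the rule comes from the records rather than from the engineer.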
Machine Learning Workflow
Generally, the workflow follows these simple steps:
Collect data. Use your digital infrastructure and other sources to gather as many useful records as possible and unite them into a dataset.
Prepare data. Clean and transform the data so that the algorithm can process it reliably. Data preprocessing and cleaning procedures can be quite sophisticated, but they usually aim at filling in missing values and correcting other flaws in the data, such as different representations of the same value in a column (e.g. December 14, 2016 and 12.14.2016 won’t be treated as the same date by the algorithm).
Split data. Set aside separate subsets of the data: one to train the model and others to evaluate how it performs on data it has not seen.
Train a model. Use a subset of historic data to let the algorithm recognize the patterns in it.
Test and validate a model. Evaluate the performance of the model on the testing and validation subsets of historic data to measure how accurate its predictions are.
Deploy a model. Embed the tested model into your decision-making framework as a part of an analytics solution or let users leverage its capabilities (e.g. better target your product recommendations).
Iterate. Collect new data after using the model to incrementally improve it.
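The prepare, split, train, and test steps above can be sketched end to end with the Python standard library alone. This is an illustrative sketch under stated assumptions: the records, the pages-viewed feature, the mean-based filling of missing values, the 75/25 split, and the single-threshold model are all hypothetical choices made for the example, not a recommended production setup.

```python
import random
from datetime import datetime
from statistics import mean

# Hypothetical raw records: (visit_date, pages_viewed, made_purchase).
# Dates use two representations and one pages_viewed value is missing.
RAW = [
    ("December 14, 2016", 1, False),
    ("12.14.2016", 8, True),
    ("12.15.2016", None, True),
    ("December 15, 2016", 2, False),
    ("12.16.2016", 7, True),
    ("12.16.2016", 3, False),
    ("December 17, 2016", 9, True),
    ("12.17.2016", 2, False),
]

DATE_FORMATS = ("%B %d, %Y", "%m.%d.%Y")  # the two formats from the text


def normalize_date(raw):
    """Prepare: map every known date representation to one ISO string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def fill_missing(values):
    """Prepare: replace missing values (None) with the mean of the known ones."""
    fallback = mean(v for v in values if v is not None)
    return [fallback if v is None else v for v in values]


def split(records, test_fraction=0.25, seed=0):
    """Split: shuffle reproducibly, then hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def train(records):
    """Train: pick the pages_viewed threshold with the fewest training errors."""
    candidates = sorted({pages for pages, _ in records})
    return min(candidates, key=lambda t: sum((p >= t) != y for p, y in records))


def accuracy(threshold, records):
    """Test: share of records where the thresholded prediction is correct."""
    return sum((p >= threshold) == y for p, y in records) / len(records)


# Prepare. The dates are normalized to show the step; the model ignores them.
dates = [normalize_date(d) for d, _, _ in RAW]
pages = fill_missing([p for _, p, _ in RAW])
records = list(zip(pages, (bought for _, _, bought in RAW)))

# Split, train, and test.
train_set, test_set = split(records)
threshold = train(train_set)
print("distinct dates:", sorted(set(dates)))
print("learned threshold:", threshold)
print("test accuracy:", accuracy(threshold, test_set))
```

With only eight records the test accuracy says very little; the point is the shape of the workflow. The model is evaluated only on records it did not see during training, which is why the split happens before training.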