The Machine Learning Process
Machine learning is highly process-oriented. Data scientists and machine learning engineers follow the same core steps every time they are presented with a problem.
Once a problem has been identified, you'll proceed to the first step in the process. The machine learning process is also called the machine learning pipeline, and it can be distilled into four core steps.
Source Data – Most applied machine learning is supervised machine learning. Therefore, we need some data before we start modeling. Data can be sourced in many ways. Currently, much of the data used to build machine learning models comes from relational databases. The machine learning engineer or data scientist will often be responsible for authoring the queries needed to massage the data into a single modellable entity, the array.
Data Wrangling – Once the data is in an array, the next step is massaging it into the best possible state for modeling. This includes, but is not limited to, removing unneeded attributes, replacing missing values, and converting textual attributes to numbers. The model's performance is only as good as the data we feed it. The axiom often used here is garbage in, garbage out.
Modeling – Building and tuning our models is the next step in the process. Tuning models is often referred to as hyperparameter tuning. At a high level, this means passing parameters into the model that affect the model's performance. For example, XGBoost builds an ensemble of decision trees, so altering the number of trees used in the building process can affect the outcome of our models.
Production – Once the model has been built, trained, tested, and tuned, it's ready to be used with fresh data. The true test for a model is its success on data it has never seen before. Once the model is in production, the data scientist or machine learning engineer will need to monitor the model's performance to ensure its predictions are similar to those achieved in the training and testing phases.
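The four steps above can be sketched end to end as a toy pipeline. Everything here is illustrative: the function names, the hard-coded rows (standing in for a database query), and the trivial threshold "model" are assumptions made for the sketch, not a real modeling workflow.

```python
# A minimal, illustrative sketch of the four pipeline steps.
# The data and the threshold "model" are made up for demonstration.

def source_data():
    # Step 1 (Source Data): in practice, a SQL query; here, hard-coded
    # rows of (hours_studied, passed label as "yes"/"no").
    return [(1, "no"), (2, "no"), (3, "no"), (6, "yes"), (7, "yes"), (8, "yes")]

def wrangle(rows):
    # Step 2 (Data Wrangling): convert the textual label to a number.
    return [(x, 1 if label == "yes" else 0) for x, label in rows]

def train(data):
    # Step 3 (Modeling): the "model" is just the midpoint between
    # the class means of the single feature.
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    # Step 4 (Production): apply the trained threshold to fresh data.
    return 1 if x >= threshold else 0

threshold = train(wrangle(source_data()))
print(predict(threshold, 5))
```

A real pipeline swaps the toy threshold for a trained model such as XGBoost, but the shape of the process (source, wrangle, train, predict on unseen data) stays the same.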
In supervised machine learning, the first step is sourcing your data. For example, we often use a gradient booster called XGBoost, a supervised learning algorithm that works on highly structured datasets, so your data will need to be in the shape of an array. Additionally, most machine learning models only accept numerical inputs, and converting the data from text to numbers will be the responsibility of the data scientist or machine learning engineer.
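The text-to-numbers conversion can be as simple as assigning each distinct string an integer code. A minimal sketch, with made-up column values:

```python
# Convert a textual attribute to integer codes so the rows can form
# a purely numeric array; the column values here are illustrative.
def encode_column(values):
    # Map each distinct string to an integer, in order of first appearance.
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

colors = ["red", "blue", "red", "green"]
print(encode_column(colors))  # [0, 1, 0, 2]
```

In practice, libraries such as pandas and scikit-learn provide encoders for this (label encoding and one-hot encoding), but the underlying idea is the same mapping shown here.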
Sourcing data in the applied space means sitting down with others in the organization and asking questions about the data you'll use to build your models.
Companies have amassed tons of data in relational databases. That data is already structured; however, it will often be the responsibility of the machine learning engineer to author SQL queries to extract that data and export it for modeling.
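As a hedged sketch of that extraction step, the snippet below uses Python's built-in sqlite3 module with an in-memory table standing in for a company database; the schema, table name, and query are all assumptions made for illustration.

```python
import sqlite3

# Illustrative only: an in-memory SQLite table standing in for the
# company's relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, age INTEGER, churned INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 34, 0), (2, 51, 1), (3, 29, 0)])

# The engineer authors a query that extracts just the modellable
# attributes -- here, rows of (age, churned) -- for export.
rows = conn.execute("SELECT age, churned FROM customers").fetchall()
print(rows)  # [(34, 0), (51, 1), (29, 0)]
```

Against a production database the connection and SQL dialect would differ, but the pattern of authoring a query and fetching the results as rows for modeling is the same.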
Once the data is sourced, it will often be in raw form. The term raw means that additional steps will need to be taken before it can be modeled successfully. The data may have missing values, primary keys, erroneous values, and other artifacts you'll need to remove prior to modeling.
This part of the process is referred to as data wrangling. Data wrangling refers to the process of cleaning, restructuring and enriching the raw data available into a more usable format.
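Two of the most common wrangling chores, dropping a primary key and replacing missing values, can be sketched in a few lines. The raw rows and column names below are made up for the example:

```python
# A small wrangling sketch on made-up raw rows: drop the primary
# key, then replace each missing value with its column's mean.
raw = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 29, "income": None},
]

# Drop the unneeded primary key; it carries no predictive signal.
rows = [{k: v for k, v in r.items() if k != "id"} for r in raw]

# Mean-impute the missing values, column by column.
for col in ("age", "income"):
    known = [r[col] for r in rows if r[col] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[col] is None:
            r[col] = mean

print(rows[1]["age"])  # 31.5, the mean of the known ages
```

Mean imputation is only one strategy; depending on the data, dropping the row or imputing the median or mode may be more appropriate.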
Countless surveys and studies have found that this is the part of the process machine learning engineers and data scientists spend most of their time on. Data wrangling is a process-oriented endeavor.
In conclusion, there are four core steps in the machine learning process: sourcing data; wrangling and cleansing data; building and tuning your models; and lastly, putting those models into production.