Once your data is in an array, the next step is massaging it into the best possible state for modeling. This includes, but is not limited to, removing unneeded attributes, replacing missing values, and converting textual attributes to numbers. This process is called data wrangling.
Let's walk through data wrangling at a high level. The first step is attribute selection. Which attributes, or columns, are really needed for the problem? Which can be easily removed? For example, most relational database tables have an object called a primary key: a monotonically increasing numeric field used to uniquely identify each row. This key is often exported with the dataset. Most primary keys are surrogate keys, meaning they carry no value relative to the dataset, so they can be safely removed.
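A minimal sketch of dropping a surrogate key with pandas. The dataset and the `customer_id` column name are hypothetical, invented for illustration:

```python
import pandas as pd

# Hypothetical dataset: "customer_id" is a surrogate primary key
# exported from the source database and carries no predictive value.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 45, 23, 51],
    "spend": [120.0, 80.5, 230.1, 55.0],
})

# Drop the surrogate key before modeling.
df = df.drop(columns=["customer_id"])
print(list(df.columns))  # → ['age', 'spend']
```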
The second step is handling missing values. Data in the applied space is often dirty and incomplete, and your dataset will frequently have attributes where the data isn't complete. Handling these missing values is critical to the outcome of your model. For example, if an attribute is critical to your dataset and 20% of its values are missing, what steps do you take to correct it? If you have a hundred million observations, removing that 20% may be a good option. If you only have a few hundred observations, it probably isn't.
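A sketch of measuring missingness and dropping incomplete rows with pandas. The `income` attribute and the values are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with a partially missing "income" attribute.
df = pd.DataFrame({
    "age": [34, 45, 23, 51, 29],
    "income": [52000.0, np.nan, 61000.0, np.nan, 48000.0],
})

# Fraction of missing values per attribute.
missing = df.isna().mean()
print(missing["income"])  # → 0.4

# With a very large dataset, simply dropping incomplete rows can be acceptable.
df_dropped = df.dropna(subset=["income"])
print(len(df_dropped))  # → 3
```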
The third step is imputing missing values.
In statistics, imputation is the process of replacing missing data with substituted values. Instead of removing observations with missing values from a dataset, replacing those values with the mean, median, or mode removes the nulls from your data.
For example, suppose an attribute in your dataset is critical and 10% of its values are missing. Instead of removing that 10%, you replace each of those observations with the mean value.
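Mean imputation can be sketched in a few lines of pandas. The `income` column is a hypothetical example:

```python
import pandas as pd
import numpy as np

# Hypothetical attribute with missing entries.
df = pd.DataFrame({"income": [40000.0, np.nan, 60000.0, np.nan, 50000.0]})

# Compute the mean of the observed values (NaNs are excluded by default)
# and fill every missing entry with it.
mean_income = df["income"].mean()  # 50000.0
df["income"] = df["income"].fillna(mean_income)

print(df["income"].isna().sum())  # → 0
```

Median (`.median()`) or mode (`.mode()[0]`) imputation follows the same pattern.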
The fourth step is noise removal. Noise is data that has no meaning; a few examples include invalid values, outliers, and corrupted values.
Machine learning models like nicely cleansed, noise-free data.
For example, suppose an attribute in your dataset has a numeric range between 1 and 100, but two rows, or observations, have values of 2001 and 2002. Those two entries are likely outliers, and you'll need to remove or correct them prior to modeling.
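Filtering such out-of-range observations can be sketched with pandas; the `score` column and its values are invented to match the example above:

```python
import pandas as pd

# Hypothetical attribute with a documented valid range of 1-100,
# plus two likely outliers (2001 and 2002).
df = pd.DataFrame({"score": [12, 87, 2001, 45, 2002, 99]})

# Keep only observations inside the valid range.
df_clean = df[df["score"].between(1, 100)]
print(len(df_clean))  # → 4
```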
The fifth step is numeric transformation. Most models only accept numeric inputs, so transforming categorical data into numeric data is often the final step prior to modeling.
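One common transformation is one-hot encoding, where each category becomes a numeric indicator column. A sketch with pandas, using a hypothetical `color` attribute:

```python
import pandas as pd

# Hypothetical categorical attribute.
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# One-hot encode the categorical column into numeric indicator columns.
encoded = pd.get_dummies(df, columns=["color"])
print(sorted(encoded.columns))
# → ['color_blue', 'color_green', 'color_red']
```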
Once your data has been completely cleansed, it's on to modeling. Machine learning models are often built in stages, and the building, or modeling, process can be broken down into two separate stages.
The first stage is to train the model on your highly cleansed dataset. Specific to structured data, the model uses all the attributes in the dataset to learn the best approach to predicting the target variable. This process is called fitting the model to the data. The fitting process differs depending on the algorithm being used. With gradient boosters like XGBoost, the model learns how to use the attributes in the data to create groups of data points with similar values. Additionally, the results within each group are adjusted to arrive at a final answer.
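The fitting step can be sketched as follows. The text names XGBoost; here scikit-learn's `GradientBoostingRegressor` stands in as an assumption, since both expose the same fit/predict pattern, and the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for a cleansed dataset: three numeric attributes
# and a target variable derived from them.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X[:, 0] * 2 + X[:, 1]

# "Fitting the model to the data": the booster learns how to use the
# attributes to group similar data points and predict the target.
model = GradientBoostingRegressor(n_estimators=50)
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)  # → (200,)
```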
In conclusion, most of applied machine learning is data wrangling. In the real world, data is dirty, and massaging it into a state fit for modeling often isn't easy.