Mike

# Statistics in Machine Learning

Updated: Feb 28, 2020

Statistics and machine learning are two very closely related fields. The line between the two can be very blurry at times. Nevertheless, there are methods that clearly belong to the field of statistics that are not only useful, but invaluable when working on a machine learning project. It would be fair to say that statistical methods are required to effectively work through a machine learning predictive modeling project.

## Understanding the Problem

Perhaps the point of biggest leverage in predictive modeling is understanding the problem. This is the selection of the type of problem, e.g. regression or classification, and perhaps the structure and types of the inputs and outputs for the problem. Understanding the problem is not always straightforward. For newcomers to this space, it may require significant exploration of the observations in the domain.

Statistical methods can aid in exploring the data while a problem is being framed. Summarization and visualization are used to explore ad hoc views of the data, and pattern analysis is used to discover structured relationships and patterns in it.

## Understanding the Data

Understanding the data means having an intimate grasp of both the distributions of variables and the relationships between them. Some of this knowledge may come from domain expertise, or may require domain expertise to interpret. Either way, both experts and novices in a field of study will benefit from actually handling real observations from the domain. Two areas of statistics come into play here. The first is summary statistics: methods that summarize the distribution of, and relationships between, variables using statistical quantities.

The second is data visualization: methods that summarize the distribution of, and relationships between, variables using charts, plots, and graphs.
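As a rough illustration of summary statistics, the sketch below computes a few common quantities with Python's standard library. The data values and the helper `pearson_correlation` are made up for this example and are not from any particular project.

```python
import statistics

# Hypothetical observations for two variables (made-up illustrative data).
heights_cm = [150.0, 160.0, 165.0, 170.0, 180.0]
weights_kg = [50.0, 58.0, 63.0, 70.0, 79.0]

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

summary = {
    # Quantities describing the distribution of one variable.
    "mean": statistics.fmean(heights_cm),
    "median": statistics.median(heights_cm),
    "stdev": statistics.stdev(heights_cm),  # sample standard deviation
    # A quantity describing the relationship between two variables.
    "correlation": pearson_correlation(heights_cm, weights_kg),
}
print(summary)
```

The first three entries describe the distribution of a single variable, while the correlation describes a relationship between two variables.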

## Data Cleaning

In the real world, data is dirty. Although the data is digital, it may be subjected to processes that damage its fidelity, and in turn the fidelity of any downstream processes or models that use it. Data loss and data corruption are two examples. Statistical methods are also used to identify and correct such problems; two examples are outlier detection and imputation. An outlier is a data value far outside the norm, and imputation is the replacement of missing values.
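One possible way to sketch both ideas uses the interquartile range (IQR) rule for outliers and median imputation for missing values. These are only two of many options, chosen here for brevity; the readings and the helper `iqr_outlier_bounds` are hypothetical.

```python
import statistics

# Hypothetical sensor readings; None marks a missing value (made-up data).
readings = [10.2, 9.8, None, 10.5, 55.0, 10.1, None, 9.9, 10.0]

def iqr_outlier_bounds(values, k=1.5):
    """Return (low, high) fences using the interquartile range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

observed = [v for v in readings if v is not None]
low, high = iqr_outlier_bounds(observed)

# Outlier detection: flag values outside the fences.
outliers = [v for v in observed if v < low or v > high]

# Imputation: fill missing values with the median of the non-outlier values.
inliers = [v for v in observed if low <= v <= high]
fill_value = statistics.median(inliers)
imputed = [fill_value if v is None else v for v in readings]

print("outliers:", outliers)
print("imputed:", imputed)
```

Note that this sketch only flags the outlier and leaves it in place; whether to remove, cap, or keep such values depends on the problem.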

## Data Preparation

Data is rarely, if ever, in a model-ready state. Transformation is often required to change the shape or structure of the data so that it better suits the chosen framing of the problem or the learning algorithms. Data preparation is performed using statistical methods, of which scaling and encoding are two examples. Scaling includes standardization and normalization, while an often-used encoding approach is one-hot encoding.
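The three transformations named above can be sketched in plain Python as follows. The function names and sample values are illustrative assumptions, and the scaling functions assume the input is not constant (otherwise they would divide by zero).

```python
import statistics

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def normalize(values):
    """Rescale linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_encode(labels):
    """Map each label to a binary vector containing a single 1."""
    categories = sorted(set(labels))
    position = {category: i for i, category in enumerate(categories)}
    vectors = [
        [1 if position[label] == i else 0 for i in range(len(categories))]
        for label in labels
    ]
    return categories, vectors

# Hypothetical numeric and categorical features (made-up data).
ages = [20.0, 30.0, 40.0, 50.0]
colors = ["red", "green", "blue", "green"]

standardized = standardize(ages)
normalized = normalize(ages)
categories, encoded = one_hot_encode(colors)

print(standardized)
print(normalized)
print(categories, encoded)
```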

## Evaluating the Model

An important aspect of a predictive modeling problem is evaluating a learning method. This often requires estimating the skill of the model when it makes predictions on data not seen during training. Generally, the planning of this process of training and evaluating a predictive model is called experimental design. When implementing an experimental design, resampling methods are used to make economical use of the available data when estimating the skill of the model. These methods systematically split a dataset into subsets for the purposes of training and evaluating a predictive model.
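k-fold cross-validation is one widely used resampling method. The sketch below only generates the index splits; the function name `k_fold_indices` and its parameters are assumptions made for this example, and fitting and scoring a model on each split is left out.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled samples as evenly as possible across k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Each sample appears in exactly one test fold, so every observation is used for both training and evaluation across the k rounds.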

## Model Tuning

A machine learning model often has a set of hyperparameters that allow the learning method to be tailored to a specific problem. The configuration of the hyperparameters is often empirical in nature, rather than analytical, requiring large suites of experiments to evaluate the effect of different hyperparameter values on the skill of the model. The results for different hyperparameter configurations are interpreted and compared using one of two subfields of statistics: statistical hypothesis testing and estimation statistics.
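As one example of hypothesis testing in this setting, the sketch below runs a paired permutation (sign-flip) test on per-fold scores from two hyperparameter configurations. The test choice, the function name, and the scores are all assumptions for illustration; other tests may suit a given experiment better.

```python
import random
import statistics

def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided p-value for the mean paired difference via random sign flips."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.fmean(diffs))
    count = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (count + 1) / (n_permutations + 1)

# Hypothetical accuracy per cross-validation fold for two configurations.
config_a_scores = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.80, 0.82]
config_b_scores = [0.76, 0.75, 0.80, 0.77, 0.78, 0.79, 0.74, 0.80, 0.77, 0.78]

p_value = paired_permutation_test(config_a_scores, config_b_scores)
print("p-value:", p_value)
```

A small p-value suggests the difference between configurations is unlikely to be due to chance alone, under the assumptions of the test.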

## Choosing the Right Model

There may be many models appropriate for a given problem. The process of choosing one method as the solution is called model selection. This may involve a suite of criteria both from stakeholders in the project and from the careful interpretation of the estimated skill of the methods evaluated for the problem. As with model tuning, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection: statistical hypothesis testing and estimation statistics.
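On the estimation statistics side, one option is to report the size of the difference between two models together with an interval around it. The sketch below uses a percentile bootstrap on paired per-fold scores; the function name, parameters, and scores are hypothetical.

```python
import random
import statistics

def bootstrap_mean_diff_ci(scores_a, scores_b, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference (a - b)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper

# Hypothetical accuracy per cross-validation fold for two candidate models.
model_a_scores = [0.88, 0.85, 0.90, 0.87, 0.86, 0.89, 0.84, 0.91, 0.87, 0.88]
model_b_scores = [0.85, 0.84, 0.86, 0.85, 0.83, 0.86, 0.83, 0.87, 0.85, 0.84]

mean_diff, ci_lower, ci_upper = bootstrap_mean_diff_ci(model_a_scores, model_b_scores)
print(f"mean difference {mean_diff:.3f}, 95% interval [{ci_lower:.3f}, {ci_upper:.3f}]")
```

An interval that excludes zero points the same way as a significant hypothesis test, but it also conveys how large the difference is likely to be.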

## Scoring the Model

Once a final model has been trained, it can be shared with other team members prior to being deployed to make actual predictions on real data. Part of presenting a final model is presenting its estimated skill. Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model, for example through tolerance intervals and confidence intervals.
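For a classification model, a confidence interval on accuracy can be sketched with the normal approximation to the binomial distribution. This is a simplification that works best with reasonably large test sets; the function name and the counts are assumptions for the example.

```python
from statistics import NormalDist

def accuracy_confidence_interval(n_correct, n_total, confidence=0.95):
    """Normal-approximation interval for classification accuracy."""
    accuracy = n_correct / n_total
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * (accuracy * (1 - accuracy) / n_total) ** 0.5
    # Clip to the valid range for a proportion.
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

# Hypothetical result: 88 correct predictions on 100 held-out examples.
lower, upper = accuracy_confidence_interval(n_correct=88, n_total=100)
print(f"accuracy 0.88, 95% interval [{lower:.3f}, {upper:.3f}]")
```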

## Model Prediction

Lastly, it is time to use the final model to make predictions for new data where we do not know the real outcome. As part of making predictions, it is important to quantify the confidence of each prediction. As when scoring the model, we can use methods from the field of estimation statistics to quantify this uncertainty, such as confidence intervals and prediction intervals.
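A prediction interval can be sketched for the simplest case, a one-variable linear regression. The example below assumes roughly Gaussian residuals and uses a normal quantile for brevity, whereas a t quantile would be more exact for small samples. The data and function names are made up.

```python
from statistics import NormalDist, fmean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def prediction_interval(xs, ys, x_new, confidence=0.95):
    """Approximate interval for a single new observation at x_new."""
    intercept, slope = fit_line(xs, ys)
    n = len(xs)
    mean_x = fmean(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard error with n - 2 degrees of freedom.
    s = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5
    # Combines noise in a new observation with uncertainty in the fitted line.
    se = s * (1 + 1 / n + (x_new - mean_x) ** 2 / sxx) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # normal approximation
    y_hat = intercept + slope * x_new
    return y_hat - z * se, y_hat, y_hat + z * se

# Hypothetical training data following roughly y = 2x + 1 (made-up values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]

pi_lower, y_hat, pi_upper = prediction_interval(xs, ys, x_new=9.0)
print(f"prediction {y_hat:.2f}, 95% interval [{pi_lower:.2f}, {pi_upper:.2f}]")
```

Unlike a confidence interval on the model's skill, this interval describes the range in which a single new outcome is expected to fall.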