Mike

# Gradient Boosting - Part 3

**Bagging** is used when the goal is to reduce variance. The idea is to create several subsets of data from training samples chosen randomly. Each collection of subset data is used to train the decision trees. As a result, we end up with an ensemble of different models. When bagging with decision trees, we are less concerned about individual trees overfitting the training data. The individual decision trees can grow deeply, and *the trees are not pruned*. These trees will have both high variance and low bias. In the example below, several base models execute on different bootstrap samples and build an ensemble model that *averages* the results of these weak learners.

**Boosting** is another ensemble technique to create a collection of models. In this technique, models are learned *sequentially* with early models fitting simple models to the data and then analyzing the data for errors. Recall that bagging had each model run independently and then aggregate the outputs at the end without preference to any model. In other words, with boosting, we fit consecutive trees and at *every step*. The goal is to solve for the net error from the prior tree.

Unlike bagging, boosting is all about teamwork. Each decision tree, the weak learner, dictates what features the next model will focus on.

When an input is misclassified, its weight is increased so that next decision is more likely to classify it correctly.

In the example below a weak learner, in our case a decision tree, is trained on the dataset. On the model’s first iteration, the weak learner makes a prediction; however, its accuracy isn’t that good. Recall that’s the very definition of a weak learner. Once the first iteration is completed the training dataset is updated based on the previous results.

Let’s examine the picture below to see how the first iteration in our boosting ensemble model might have happened. In the Titanic dataset the root node was split on sex. More specifically, was the passenger a male or not? If we were using a booster, that split might have produced a shallow tree. These shallow trees are often referred as decision stumps.

**Gradient boosting** is a machine learning technique for regression and classification problems, which produces a model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

Gradient boosting involves three main steps. The first step that is required is that a **loss function** be optimized. The loss function must be diﬀerentiable. A loss function measures how well a machine learning model fits the data of a certain phenomenon. Different loss function may be used depending on the type of problem. Different loss function can be used on speech or image recognition, predicting the price of real estate, and describing user behavior on a web site. The loss function depends on the type of problem. For example, regression may use a squared error and classiﬁcation may use logarithmic loss.

The second step is the use of a **weak learner**.

In gradient boosters, the weak learner is a decision tree. Speciﬁcally *regression trees* are used that output real values for splits and whose output can be added together, allowing subsequent models outputs to be added to correct the residuals in the predictions of the previous iteration. The algorithms for classification problems and for regression problems use a different algorithm, however, they both use the same approach for splitting the data into groups. That approach is *regression decision trees*.

Even classification problems use regression decision trees.

In regression decision trees, the final answer is a range of real numbers, this makes it’s relatively simple to split the data based on the remaining error at each step.

Steps are taken to ensure the weak learner remain weak yet is still constructed in a greedy fashion. It is common to constrain the weak learners in sundry ways. Often, weak learners can be constrained using a maximum number of layers, nodes, splits or leaf nodes.

The third step is combing many weak learners in an **additive fashion**. Decision trees are added one at a time. A gradient descent procedure is used to minimize the loss when adding trees. That’s the *gradient* part of gradient boosters.

Gradient descent optimization in the machine learning world is typically used to find the parameters associated with a single model that optimizes some loss function.

In contrast, gradient boosters are meta-models consisting of multiple weak models whose output is added together to get an overall prediction. The gradient descent optimization occurs on the output of the model and not the parameters of the weak models.

Let’s look at this process pictorially. Below we can see that gradient boosting adds sub-models incrementally to minimize a loss function. Earlier we said that gradient boosting involved three main steps. In our example below the weak learner being used is a *decision tree*. Secondly, the trees are *added sequentially*. Lastly, the *error *of the model is being reduced.