• Mike

XGBoost. A Short Introduction


XGBoost is a supervised machine learning algorithm. That means every model we build is trained on an existing, labeled dataset. XGBoost only accepts numerical inputs. Therefore, it is up to us to ensure the array-like structure we pass to the model is numerical and as clean as possible.
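
One quick way to check the numerical requirement is to list the columns whose dtype is not numeric before handing the data to the model. The snippet below is a minimal sketch on a small made-up DataFrame (the sample values are illustrative assumptions, not rows from the Titanic file). It uses the Pandas select_dtypes method to find columns that still need encoding.

```python
import pandas as pd

# Small made-up sample (illustrative values only, not the real dataset).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Columns whose dtype is not numeric still need to be encoded.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Any column that shows up in this list has to be converted to numbers before the model will accept it.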


In our first example we are going to use the famous Titanic dataset. There are a few reasons we are going to start with this problem.


  • It’s part of an ongoing Kaggle modeling competition. Scoring high on this problem can provide us with a resume bullet point.

  • It’s a binary classification problem. These are very common in the applied space and one of the easiest problems to model.

  • There are tons of resources online that can help you with every facet, from data wrangling and exploratory data analysis to modeling.

Machine learning is very process oriented. Even the modeling process can be distilled into several high-level steps.


  • Import Libraries – The libraries you choose will depend on the problem you face. You’ll use several core libraries for almost every project and, in this book, XGBoost for modeling.

  • Load and Wrangle Data – XGBoost is a supervised learning model and that means we will always need a dataset to work with.

  • Separate Data – Overfitting is a big problem in machine learning, and we need approaches to detect and prevent it. There are different approaches to segmenting our data. One of the most common is to divide your data into separate training and testing datasets, so the model is evaluated on rows it has never seen.

  • Define and Fit the Model – The model will depend on the problem. All our problems are either classification or regression, and our model of choice is XGBoost. Once the model has been defined, you’ll fit it to your training data. Once the training process has completed, you’ll test your model on the testing data.

  • Make Predictions – Once the model is trained, you’ll use it to predict on the testing data. There are different metrics for different problems. However, you’ll always need an approach to evaluate the performance of your models.


Import Libraries


In our first cell let’s import some libraries. The first library we will import is Pandas. Next we will bring in train_test_split from SciKit-Learn, and lastly let’s import XGBClassifier from the xgboost package. We will use Pandas to massage our data and train_test_split, a utility from SciKit-Learn, to slice our dataset into a training set and a testing set. Take note that sklearn is the import name for SciKit-Learn.


import pandas as pd

from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier


Load and Prepare Data


In the next cell let’s use Pandas to import our data. The name data is a variable that will house our dataset. We are using the read_csv function to load our dataset into the data variable. The file is named titanic.csv. Take note that a path is not specified because the titanic.csv file is in the Python working directory.


data = pd.read_csv("titanic.csv")


The Titanic dataset has quite a few attributes. We are only interested in a few of them so let’s specify only the attributes we want to view from our data variable. Take note that we used the same variable called data. When we did that, we altered the contents of that variable to only hold the attributes we want. If we wanted to preserve the contents of the original data variable, we would use another name for our new variable.


data = data[['Pclass', 'Sex', 'Age', 'Survived', 'Parch', 'SibSp']]


The next thing we want to do is look at our data. We call the head function to view the first five rows of the data variable.


data.head()


In the output, the far-left column is the index and it’s not part of the dataset. Additionally, Survived is our target variable, so XGBoost will only be using the other five attributes to build our model.

In our dataset we have one column that isn’t numerical. We are going to use a label encoder to transform the Sex column to numbers. Label encoding refers to converting categorical text values into integer codes. In the first line of code we are importing preprocessing from SciKit-Learn. In the second line we are creating a variable lc to hold our label encoder. The last line of code is transforming the Sex attribute to numbers.


from sklearn import preprocessing

lc = preprocessing.LabelEncoder()

data['Sex'] = lc.fit_transform(data['Sex'])
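
To see what the label encoder actually produces, here is a minimal sketch on a short made-up list standing in for the Sex column (the values and the variable names sex, encoder, encoded and decoded are illustrative assumptions). LabelEncoder stores its classes in sorted order, so "female" becomes 0 and "male" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values standing in for the Sex column.
sex = ["male", "female", "female", "male"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sex)

# Classes are kept in sorted order: index 0 is "female", index 1 is "male".
print(list(encoder.classes_))
print(list(encoded))

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around is useful because inverse_transform lets you translate the codes back into readable labels later.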


In the next cell we are defining our features X and our target y. We drop the target variable from the dataset to form X (axis=1 tells Pandas to drop a column rather than a row), and we assign the Survived attribute to y.


X = data.drop('Survived', axis=1)

y = data['Survived']
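
As a small illustration of the feature and target separation, the sketch below uses a made-up four-row DataFrame (the values are assumptions for demonstration only). It shows that drop returns a new DataFrame without the target column and leaves the original untouched.

```python
import pandas as pd

# Made-up rows with already-encoded values.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": [1, 0, 0, 1],
    "Survived": [0, 1, 1, 0],
})

# axis=1 drops a column; the result is a new DataFrame of features.
X = sample.drop("Survived", axis=1)

# The target is the single column we removed from the features.
y = sample["Survived"]

print(list(X.columns))
print(X.shape, y.shape)
```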


Separate Data


Next, we are using the train_test_split function to separate our data into two sections. One section will be for training the model and the other will be for testing the trained model. Because we don’t pass test_size, train_test_split holds out 25% of the rows for testing by default. We are using the parameter random_state and passing in a value of 1. The random_state parameter fixes the random shuffle so we can reproduce the model’s results time after time.


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
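
The effect of the split and of random_state can be checked on synthetic data. The following is a minimal sketch, assuming 100 made-up rows built with NumPy (the arrays here are placeholders, not the Titanic data). It shows the default 75/25 split and that repeating the call with the same random_state gives the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with two features each, plus a made-up binary target.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
X_train_again, X_test_again, _, _ = train_test_split(X, y, random_state=1)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```

If you want a different proportion, train_test_split also accepts a test_size argument, for example test_size=0.2.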


Define and Fit Model


In the next cell we’ve created a new variable model to hold our classifier. XGBoost ships with a SciKit-Learn-compatible wrapper, and the one for classification models is called XGBClassifier (there is also XGBRegressor for regression). Because the wrapper follows the SciKit-Learn estimator interface, we can use SciKit-Learn utilities with XGBoost models. On the second line of code we are fitting the model to our training data.


model = XGBClassifier()

model.fit(X_train, y_train)


When you fit your classifier in a Jupyter Notebook, the cell output displays the model along with its parameters.


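Make Predictions


In the next cell we are creating a variable y_pred by running our model against the testing data. Additionally, a variable predictions is created to round the results of our model. The predict method of XGBClassifier already returns class labels (0 or 1), so the rounding step is a safeguard rather than a requirement.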


y_pred = model.predict(X_test)

predictions = [round(value) for value in y_pred]


In our last cell we are importing accuracy_score from SciKit-Learn. We are creating a variable accuracy to hold the fraction of test-set predictions that match the true labels. Lastly, we are printing out the accuracy of the model as a percentage.


from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)

print("Accuracy: %.2f%%" % (accuracy * 100.0))
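
To make the metric concrete, here is a minimal sketch with four made-up labels and predictions (the lists y_true and y_hat are illustrative assumptions). It computes accuracy by hand as the share of matching labels and compares it with accuracy_score.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions; three of the four match.
y_true = [0, 1, 1, 0]
y_hat = [0, 1, 0, 0]

# Accuracy by hand: number of matches divided by number of samples.
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# The same quantity computed by SciKit-Learn.
accuracy = accuracy_score(y_true, y_hat)

message = "Accuracy: %.2f%%" % (accuracy * 100.0)
print(message)
```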


The accuracy of our model should be somewhere around 77%. That was straightforward, and in fewer than 20 lines of code we just beat 90% of all those who participated in the Titanic Kaggle Competition. Not bad. Note that this accuracy is measured on our own held-out test split rather than on the Kaggle leaderboard.


The complete code is below.


import pandas as pd

from sklearn import preprocessing

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier

data = pd.read_csv("titanic.csv")

data = data[['Pclass', 'Sex', 'Age', 'Survived', 'Parch', 'SibSp']]

data.head()

lc = preprocessing.LabelEncoder()

data['Sex'] = lc.fit_transform(data['Sex'])

X = data.drop('Survived', axis=1)

y = data['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = XGBClassifier()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

predictions = [round(value) for value in y_pred]

accuracy = accuracy_score(y_test, predictions)

print("Accuracy: %.2f%%" % (accuracy * 100.0))
