Data Wrangling and XGBoost
In the real-world data is dirty. Massaging data into the best state possible for modeling is called data wrangling. Let’s wrangle the Titanic dataset to see if feature selection and data cleanliness really do affect the model’s outcome.
The dataset we used to build our binary classification model with was already cleansed but that’s not how we sourced it. The picture below is the raw Titanic dataset. The term raw simply means we haven’t done anything to it in order to prepare it for modeling.
Let’s use an abbreviated approach to wrangle the Titanic dataset. Those steps are below.
Remove the unneeded attributes
Transform textual values to numeric
Find and replace the missing values
The first column on the left-hand side of our CSV file, PassengerID, appears to be a monotonically increasing key from a relational database. Most primary keys in relational databases are surrogate keys. That simply means they are there to enforce uniqueness and have no business value. Let’s remove the PassengerID attribute. The second attribute that can be easily removed is the Name column. A textual unique attribute isn’t going to help us find patterns; so, let’s get rid of it. Removing those attributes didn’t take very much domain knowledge. However, moving forward we are going to need to know more about our dataset. Let’s look at the dataset after PassengerID and Name were removed.
With some research we find out that Embarked is where the passenger boarded the boat. That’s a very random attribute so we can remove it. We also learned that the tickets were randomly generated for each passenger and that means we can safely remove that attribute. Since we have the class of the passenger in the Pclass attribute the Cabin attribute can be removed. The fare attribute is the amount the passenger paid for the ticket. If we didn’t have the class of the passenger in the Pclass attribute, we could tease out the class using the Fare column. However, since we have the class the Fare attribute can be removed.
Let’s visualize our dataset after those attributes have been removed. The first part of our data wrangling exercise has been completed. All the unneeded attributes have been removed.
At this juncture almost all our attributes are numeric. Recall the second step in our abbreviated data wrangling process is transforming textual values into numeric values. The sex of the passenger is textual so we will need to convert that column to numbers. We can convert the string values to integer values using the LabelEncoder. Label Encoding refers to converting the labels into numbers. Label encoding in Python can be achieved using SciKit-Learn Library. In the code below we need to import preprocessing from SciKit-learn. Next, we create a variable label_encoder to hold the encoder. Lastly, we fit our variable, label_encoder to the Sex attribute.
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
After you’ve applied the label encoder to the Sex attribute all the features are numeric. The second step of our data wrangling process is complete. Let’s visualize our dataset after we’ve label encoded the Sex attribute.
The final step in our abbreviated data wrangling process is to replace any missing values. Machine learning models don’t like missing values and that means we need to come up with the best approach to replacing the holes in your data. The first thing we need to do is to see what attributes have NAN values and how many exist for each. The code below will sum the null values for each attribute in our dataset. The term NAN is an acronym for Not a Number. This is the nomenclature Pandas uses for identifying empty or missing values.
After we execute our code, we can see that the age attribute has 177 missing values. That is a lot of missing values. While many models can’t handle missing values XGBoost was designed to handle a spare matrix. A sparse matrix is a dataset with missing values. XGBoost can model this dataset without making any adjustments to those values. If we build our model leaving the NAN values as is our accuracy is 78.03%.
A common practice when filling in missing values is to replace those vales with the average value in the attribute. The code below will replace all our NAN values with the mean value for that attribute. When inplace=True is passed, the data is renamed in place, nothing is returned as output. When inplace=False is passed (this is the default value) it performs the operation and returns a copy of the object.
Using this approach and rebuilding the model after making this change our accuracy has dropped to 77.13%. Let’s try another approach to handling our missing values. Let’s drop only the missing values or NAN values in the Age attribute. We can do that by calling dropna on that column.
data.dropna(subset=['Age'], how='all', inplace = True)
Using this approach increases our accuracy to 83.80%. That puts us in the top 3% in the Kaggle Titanic competition. That’s a worthy resume bullet point.