Kaggle Titanic survival - data preprocessing
Contents
Kaggle Titanic survival - data preprocessing#
Can we predict which passengers would survive the sinking of the Titanic?
Original kaggle page:https://www.kaggle.com/c/titanic
Subsequent machine learning notebooks using Titanic survival also provide links to load preprocessed data directly, so this notebook is not strictly needed before using other notebooks, but processing data into a useable form is often a key stage of any machine learning project, and so all practitioners will want to get to grips with common methods.
This Notebook introduces the following:
Using Pandas to load and process data (though some familiarity with Pandas is assumed)
Looking at data types
Listing feature headings
Showing data
Showing a statistical summary of data
Filling in (imputing) missing data
Encoding non-numerical fields
Removing unwanted columns
Saving processed data
The data includes.
Variable |
Definition |
---|---|
survival |
Survival (0 = No, 1 = Yes) |
pclass |
Ticket class |
sex |
Sex |
Age |
Age in years |
sibsp |
# of siblings / spouses aboard the Titanic |
parch |
# of parents / children aboard the Titanic |
ticket |
Ticket number |
fare |
Passenger fare |
cabin |
Cabin number |
embarked |
Port of Embarkation(C=Cherbourg, Q=Queenstown, S=Southampton) |
Load modules#
import pandas as pd
import numpy as np
Load data#
Data should be in a sub folder named data.
It may be downloaded from:
https://gitlab.com/michaelallen1966/1908_coding_club_kaggle_titanic/tree/master/data
Usually the first thing we will do is split data in training and test (usually with randomisation first), and we hold back the test data until model building is complete. In the case of this kaggle data a separate test data set is supplied, so we do not need to hold back and of the data.
We will load the kaggle data and make a copy we will work on (so we can always refer back to the original data if we wish).
download_required = False
if download_required:
# Download processed data:
address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
'1804_python_healthcare/master/titanic/data/train.csv'
data = pd.read_csv(address)
# Create a data subfolder if one does not already exist
import os
data_directory ='./data/'
if not os.path.exists(data_directory):
os.makedirs(data_directory)
# Save data
data.to_csv(data_directory + 'train.csv', index=False)
original_data = pd.read_csv('./data/train.csv')
data = original_data.copy()
Let’s have a look at some general information on the table.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
At this point we can note we have 891 passengers, but that ‘Age’, ‘Cabin’ and ‘Embarked’ have some data missing.
Let’s list the data fields:
list(data)
['PassengerId',
'Survived',
'Pclass',
'Name',
'Sex',
'Age',
'SibSp',
'Parch',
'Ticket',
'Fare',
'Cabin',
'Embarked']
Let’s look at the top of our data.
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
We can count the number of empty values. We can see that we will need to deal with ‘age’, ‘cabin’, and ‘embarked’.
data.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Showing a summary of the data#
We can use the pandas describe()
method to show a summary of the data. Note that this only shows numerical data.
data.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Of most likely useful fields we are missing sex and whether a passenger embarked or not. So let’s code those numerically.
Filling in (imputing) missing data#
For numerical data we may commonly choose to impute missing values with zero, mean or median. We will use the median for age.
We will also create a new column showing which values were imputed (this may be useful information in a machine learning model).
def impute_missing_with_median(_series):
"""
Replace missing values in a Pandas series with median,
Returns a comppleted series, and a series shwoing which values are imputed
"""
# Copy the series to avoid change to the original series.
series = _series.copy()
median = series.median()
missing = series.isna()
series[missing] = median
return series, missing
age, imputed = impute_missing_with_median(data['Age'])
data['Age'] = age
data['AgeImputed'] = imputed
We will impute missing embarked text with a ‘missing’ label.
def impute_missing_with_missing_label(_series):
"""Replace missing values in a Pandas series with the text 'missing'"""
# Copy the series to avoid change to the original series.
series = _series.copy()
missing = series.isna()
series[missing] = 'missing'
return series, missing
embarked, imputed = impute_missing_with_missing_label(data['Embarked'])
data['Embarked'] = embarked
data['EmbarkedImputed'] = imputed
Sorting out cabin data#
Cabin data is messy! Some passengers have more than one cabin (in which case we will split out the multiple cabins and just use the first one). Cabin numbers are a letter followed by a number. We will separate out the letter and the number.
# Get cabin data from dataframe
cabin = data['Cabin']
# Set up strings to add each passenger data to
CabinLetter = []
CabinLetterImputed = []
CabinNumber = []
CabinNumberImputed = []
# Convert all cabin data to string (empty cells are current stored as 'float')
cabin = cabin.astype(str)
# Iterate through rows
for index, value in cabin.items():
# If cabin info is missing (string is 'nan' then add imputed data)
if value == 'nan':
CabinLetter.append('missing')
CabinLetterImputed.append(True)
CabinNumber.append(0)
CabinNumberImputed.append(True)
# Otherwise split string by spaces where there are multiple cabins
else:
# Split multiple cabins
cabins = value.split(' ')
# Take first cabin
use_cabin = cabins[0]
letter = use_cabin[0] # First letter
CabinLetter.append(letter)
CabinLetterImputed.append(False)
if len(use_cabin) > 1:
number = use_cabin[1:]
CabinNumber.append(number)
CabinNumberImputed.append(False)
else:
CabinNumber.append(0)
CabinNumberImputed.append(True)
data['CabinLetter'] = CabinLetter
data['CabinLetterImputed'] = CabinLetterImputed
data['CabinNumber'] = CabinNumber
data['CabinNumberImputed'] = CabinNumberImputed
data.drop('Cabin', axis=1, inplace=True)
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | AgeImputed | EmbarkedImputed | CabinLetter | CabinLetterImputed | CabinNumber | CabinNumberImputed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | False | False | missing | True | 0 | True |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | False | False | C | False | 85 | False |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | False | False | missing | True | 0 | True |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | False | False | C | False | 123 | False |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | False | False | missing | True | 0 | True |
Let’s check our missing numbers totals again
data.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
AgeImputed 0
EmbarkedImputed 0
CabinLetter 0
CabinLetterImputed 0
CabinNumber 0
CabinNumberImputed 0
dtype: int64
Encoding non-numerical fields.#
There are three types of non-numerical field:
Dichotomous, which have two, and only two, possibilities (e.g. male/female, alive/dead). These may be recoded as 0 or 1.
Categorical, which have any number of possibilties that cannot be ordered in any sensible way (e.g. colour of car’). Each possibility is coded seperately as 0/1 (e.g red = 0 or 1, green = 0 or 1, blue = 0 or 1). This is called ‘one-hot encoding’ as there will be one ‘1’ (hot) in a set of columns (with all other values being zero).
Ordinal, which have any number of possibilties but which may be ordered in a sensible way and coded by order of list. For example the zise of shirts may be xs, s, m, l and xl. These may be re-coded as size 0, 1, 2, 3, 4 (or scalled in another way if appropriate).
We’ll look at sex first. Let’s pull that out as a separate ‘series’
sex = data['Sex']
sex.head()
0 male
1 female
2 female
3 female
4 male
Name: Sex, dtype: object
From looking at the data it appears passengers are either male or female, but data can contain missing values or spelling mistakes, so let’s check all the values present. An easy way to do this is to use Python’s set
command which only allows one instance of each value.
set(sex)
{'female', 'male'}
That’s good. We have just ‘female’ and ‘male’. Let’s code a new ‘male’ column manually, and check the mean (the proportion of passengers who are male).
male = data['Sex'] == 'male'
male.mean()
0.6475869809203143
That’s looks reasonable. We’ll add our new column to our dataframe, and remove the old ‘sex’ column.
To remove a column we use the pandas drop()
method. To show it is a column we specigy axis=1
. To instruct removal from the data itself we use inplace=True
. This is the equivalent of saying data = data.drop()
.
data['male'] = male
data.drop(['Sex'], axis=1, inplace=True)
Let’s look at our table now.
data.head()
PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked | AgeImputed | EmbarkedImputed | CabinLetter | CabinLetterImputed | CabinNumber | CabinNumberImputed | male | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | False | False | missing | True | 0 | True | True |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | False | False | C | False | 85 | False | False |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | False | False | missing | True | 0 | True | False |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | S | False | False | C | False | 123 | False | False |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | S | False | False | missing | True | 0 | True | True |
Let’s do the same with ‘embarked’.
embarked = data['Embarked']
set(embarked)
{'C', 'Q', 'S', 'missing'}
Ah, we have four possibilties!
We could frame this as a series of if/elif/esle statements. That is reasonable for a few possibilties, but what if have have many? We could write our own function to ‘one-hot’ encode this column, but pandas can already do this for us with the get_dummies
method.
Note that we pass a couple of useful arguments: prefix
allows us to add some text to each label, and dummy_na=True
allows us to specifically code missing values (though we have already given them the label ‘missing’).
As ever, it is often useful to look at the help for these methods (help(pd.get_dummies
).
embarked_coded = pd.get_dummies(embarked, prefix='Embarked')
embarked_coded.head()
Embarked_C | Embarked_Q | Embarked_S | Embarked_missing | |
---|---|---|---|---|
0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 1 | 0 |
Nice! We’ll add our new table to the data table and drop the original ‘Embarked’ column. Pandas concat
method will join our dataframes.
Pandas has concat
, merge
and join
methods for combining dataframes
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
data = pd.concat([data, embarked_coded], axis=1)
data.drop(['Embarked'], axis=1, inplace=True)
data.head()
PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | AgeImputed | EmbarkedImputed | CabinLetter | CabinLetterImputed | CabinNumber | CabinNumberImputed | male | Embarked_C | Embarked_Q | Embarked_S | Embarked_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | False | False | missing | True | 0 | True | True | 0 | 0 | 1 | 0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | False | False | C | False | 85 | False | False | 1 | 0 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | False | False | missing | True | 0 | True | False | 0 | 0 | 1 | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | False | False | C | False | 123 | False | False | 0 | 0 | 1 | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | False | False | missing | True | 0 | True | True | 0 | 0 | 1 | 0 |
cabin_coded = pd.get_dummies(CabinLetter, prefix='CabinLetter')
cabin_coded.head()
CabinLetter_A | CabinLetter_B | CabinLetter_C | CabinLetter_D | CabinLetter_E | CabinLetter_F | CabinLetter_G | CabinLetter_T | CabinLetter_missing | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Now let’s add those back to the table
data = pd.concat([data, cabin_coded], axis=1)
data.drop(['CabinLetter'], axis=1, inplace=True)
data.head()
PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | AgeImputed | ... | Embarked_missing | CabinLetter_A | CabinLetter_B | CabinLetter_C | CabinLetter_D | CabinLetter_E | CabinLetter_F | CabinLetter_G | CabinLetter_T | CabinLetter_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | False | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | False | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 28 columns
Now we will drop the Name and Ticket column (they may perhaps be useful in some way, but we’ll simplify things by remiving them)
Drop columns#
cols_to_drop = ['Name', 'Ticket']
data.drop(cols_to_drop, axis=1, inplace=True)
data.head()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | AgeImputed | EmbarkedImputed | CabinLetterImputed | ... | Embarked_missing | CabinLetter_A | CabinLetter_B | CabinLetter_C | CabinLetter_D | CabinLetter_E | CabinLetter_F | CabinLetter_G | CabinLetter_T | CabinLetter_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | False | False | True | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | False | False | False | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | False | False | True | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | False | False | False | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | False | False | True | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 26 columns
Having a quick look at differences between survived and non-survived passengers#
Phew, the data-preprocessing is done! This is often a tedious and time-consuming stage with few ‘endorphin rush’ rewards to be had.
Let’s split our data into survived and non-survived and have a quick look to see anything obvious.
mask = data['Survived'] == 1 # mask for survived passengers
survived = data[mask]
# Invert mask (for passengers who died
mask = mask == False
died = data[mask]
Now let’s have a quick look at mean values for our two groups. We’ll put them side by side in a new DataFrame.
summary = pd.DataFrame()
summary['survived'] = survived.mean()
summary['died'] = died.mean()
summary
survived | died | |
---|---|---|
PassengerId | 444.368421 | 447.016393 |
Survived | 1.000000 | 0.000000 |
Pclass | 1.950292 | 2.531876 |
Age | 28.291433 | 30.028233 |
SibSp | 0.473684 | 0.553734 |
Parch | 0.464912 | 0.329690 |
Fare | 48.395408 | 22.117887 |
AgeImputed | 0.152047 | 0.227687 |
EmbarkedImputed | 0.005848 | 0.000000 |
CabinLetterImputed | 0.602339 | 0.876138 |
CabinNumberImputed | 0.611111 | 0.885246 |
male | 0.318713 | 0.852459 |
Embarked_C | 0.271930 | 0.136612 |
Embarked_Q | 0.087719 | 0.085610 |
Embarked_S | 0.634503 | 0.777778 |
Embarked_missing | 0.005848 | 0.000000 |
CabinLetter_A | 0.020468 | 0.014572 |
CabinLetter_B | 0.102339 | 0.021858 |
CabinLetter_C | 0.102339 | 0.043716 |
CabinLetter_D | 0.073099 | 0.014572 |
CabinLetter_E | 0.070175 | 0.014572 |
CabinLetter_F | 0.023392 | 0.009107 |
CabinLetter_G | 0.005848 | 0.003643 |
CabinLetter_T | 0.000000 | 0.001821 |
CabinLetter_missing | 0.602339 | 0.876138 |
Save processed data#
data.to_csv('./data/processed_data.csv', index=False)