Kaggle Titanic survival - data preprocessing#

Can we predict which passengers would survive the sinking of the Titanic?

Original Kaggle page: https://www.kaggle.com/c/titanic

Subsequent machine learning notebooks using Titanic survival also provide links to load preprocessed data directly, so this notebook is not strictly needed before using the others. However, processing data into a usable form is often a key stage of any machine learning project, so all practitioners will want to get to grips with these common methods.

This Notebook introduces the following:

  • Using Pandas to load and process data (though some familiarity with Pandas is assumed)

  • Looking at data types

  • Listing feature headings

  • Showing data

  • Showing a statistical summary of data

  • Filling in (imputing) missing data

  • Encoding non-numerical fields

  • Removing unwanted columns

  • Saving processed data

The data includes:

| Variable | Definition |
|----------|------------|
| survival | Survival (0 = No, 1 = Yes) |
| pclass | Ticket class |
| sex | Sex |
| age | Age in years |
| sibsp | # of siblings / spouses aboard the Titanic |
| parch | # of parents / children aboard the Titanic |
| ticket | Ticket number |
| fare | Passenger fare |
| cabin | Cabin number |
| embarked | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |

Load modules#

import pandas as pd
import numpy as np

Load data#

Data should be in a subfolder named data.

It may be downloaded from:

https://gitlab.com/michaelallen1966/1908_coding_club_kaggle_titanic/tree/master/data

Usually the first thing we do is split the data into training and test sets (usually with randomisation first), and we hold back the test data until model building is complete. In the case of this Kaggle data a separate test data set is supplied, so we do not need to hold back any of the data.

We will load the kaggle data and make a copy we will work on (so we can always refer back to the original data if we wish).
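For a dataset without a ready-made test set, a random split might look like the sketch below (illustrative only, using scikit-learn on a hypothetical dataframe; it is not needed for this notebook).

# Illustrative only: a random train/test split with scikit-learn
# (not needed here, as Kaggle supplies a separate test set).
from sklearn.model_selection import train_test_split
import pandas as pd

example = pd.DataFrame({'feature': range(10), 'label': [0, 1] * 5})
train, test = train_test_split(example, test_size=0.25, random_state=42)
print(train.shape, test.shape)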

download_required = False

if download_required:
    
    # Download data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/train.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'train.csv', index=False)
original_data = pd.read_csv('./data/train.csv')
data = original_data.copy()

Let’s have a look at some general information on the table.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

At this point we can note we have 891 passengers, but that ‘Age’, ‘Cabin’ and ‘Embarked’ have some data missing.
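If we wanted to pick out the columns with missing values programmatically, one way (a quick sketch) is:

# Columns containing at least one missing value
cols_with_missing = data.columns[data.isna().any()]
print(list(cols_with_missing))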

Let’s list the data fields:

list(data)
['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

Let’s look at the top of our data.

data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

We can count the number of empty values. We can see that we will need to deal with ‘Age’, ‘Cabin’, and ‘Embarked’.

data.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Showing a summary of the data#

We can use the pandas describe() method to show a summary of the data. Note that this only shows numerical data.

data.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Of the fields most likely to be useful, ‘Sex’ and ‘Embarked’ do not appear in this summary because they are non-numerical; we will encode them numerically later.
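If we also want a summary of the non-numerical columns, describe() can be asked to include them (a quick sketch; the exact output layout may vary with pandas version):

# Summary of the non-numerical (object) columns only
data.describe(include='object')

# Or summarise every column, numerical and non-numerical together:
# data.describe(include='all')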

Filling in (imputing) missing data#

For numerical data we may commonly choose to impute missing values with zero, mean or median. We will use the median for age.

We will also create a new column showing which values were imputed (this may be useful information in a machine learning model).

def impute_missing_with_median(_series):
    """
    Replace missing values in a Pandas series with the median.
    Returns the completed series, and a series showing which values were imputed.
    """
    # Copy the series to avoid changing the original series.
    series = _series.copy()
    median = series.median()
    missing = series.isna()
    series[missing] = median

    return series, missing
age, imputed = impute_missing_with_median(data['Age'])
data['Age'] = age
data['AgeImputed'] = imputed
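As an aside, scikit-learn provides the same idea as a reusable transformer. A minimal sketch (illustrative only; this notebook keeps the hand-rolled function above):

# Illustrative alternative using scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(strategy='median')
age_filled = imputer.fit_transform(original_data[['Age']])
print(np.isnan(age_filled).sum())  # 0 - no missing values remain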

We will impute missing embarked text with a ‘missing’ label.

def impute_missing_with_missing_label(_series):
    """
    Replace missing values in a Pandas series with the text 'missing'.
    Returns the completed series, and a series showing which values were imputed.
    """
    # Copy the series to avoid changing the original series.
    series = _series.copy()
    missing = series.isna()
    series[missing] = 'missing'

    return series, missing
embarked, imputed = impute_missing_with_missing_label(data['Embarked'])
data['Embarked'] = embarked
data['EmbarkedImputed'] = imputed
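A quick optional check that the two previously missing values are now labelled ‘missing’:

# Count the values in the Embarked column (should include 2 x 'missing')
data['Embarked'].value_counts()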

Sorting out cabin data#

Cabin data is messy! Some passengers have more than one cabin (in which case we will split out the multiple cabins and just use the first one). Cabin numbers are a letter followed by a number. We will separate out the letter and the number.

# Get cabin data from dataframe
cabin = data['Cabin']

# Set up lists to add each passenger's data to
CabinLetter = []
CabinLetterImputed = []
CabinNumber = []
CabinNumberImputed = []

# Convert all cabin data to string (empty cells are currently stored as float NaN)
cabin = cabin.astype(str)

# Iterate through rows
for index, value in cabin.items():
    # If cabin info is missing (string is 'nan') then add imputed data
    if value == 'nan':
        CabinLetter.append('missing')
        CabinLetterImputed.append(True)
        CabinNumber.append(0)
        CabinNumberImputed.append(True)
    # Otherwise split string by spaces where there are multiple cabins
    else:
        # Split multiple cabins
        cabins = value.split(' ')
        # Take first cabin
        use_cabin = cabins[0]
        letter = use_cabin[0] # First letter 
        CabinLetter.append(letter)
        CabinLetterImputed.append(False)
        if len(use_cabin) > 1:
            number = use_cabin[1:]
            CabinNumber.append(number)
            CabinNumberImputed.append(False)
        else:
            CabinNumber.append(0)
            CabinNumberImputed.append(True)

data['CabinLetter'] = CabinLetter
data['CabinLetterImputed'] = CabinLetterImputed
data['CabinNumber'] = CabinNumber
data['CabinNumberImputed'] = CabinNumberImputed

data.drop('Cabin', axis=1, inplace=True)       
    
data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked AgeImputed EmbarkedImputed CabinLetter CabinLetterImputed CabinNumber CabinNumberImputed
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S False False missing True 0 True
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C False False C False 85 False
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S False False missing True 0 True
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S False False C False 123 False
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S False False missing True 0 True
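As an aside, the same cabin parsing could be written with pandas’ vectorised string methods. A rough sketch (illustrative only, not the method used above, and without the ‘imputed’ flag columns):

# Rough vectorised alternative to the loop above (illustrative only)
first_cabin = original_data['Cabin'].str.split(' ').str[0]  # first cabin where several are listed
cabin_letter = first_cabin.str[0].fillna('missing')         # leading letter, 'missing' if absent
cabin_number = (first_cabin.str[1:]                         # digits after the letter
                .replace('', '0')
                .fillna('0')
                .astype(int))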

Let’s check our missing value totals again.

data.isna().sum()
PassengerId           0
Survived              0
Pclass                0
Name                  0
Sex                   0
Age                   0
SibSp                 0
Parch                 0
Ticket                0
Fare                  0
Embarked              0
AgeImputed            0
EmbarkedImputed       0
CabinLetter           0
CabinLetterImputed    0
CabinNumber           0
CabinNumberImputed    0
dtype: int64

Encoding non-numerical fields#

There are three types of non-numerical field:

  • Dichotomous, which have two, and only two, possibilities (e.g. male/female, alive/dead). These may be recoded as 0 or 1.

  • Categorical, which have any number of possibilities that cannot be ordered in any sensible way (e.g. colour of car). Each possibility is coded separately as 0/1 (e.g. red = 0 or 1, green = 0 or 1, blue = 0 or 1). This is called ‘one-hot encoding’ as there will be one ‘1’ (hot) in a set of columns (with all other values being zero).

  • Ordinal, which have any number of possibilities but which may be ordered in a sensible way and coded by order of list. For example the size of shirts may be xs, s, m, l and xl. These may be re-coded as 0, 1, 2, 3, 4 (or scaled in another way if appropriate), as in the sketch after this list.
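For example, ordinal encoding of the shirt sizes mentioned above might look like the following sketch (a hypothetical column, not part of the Titanic data):

# Illustrative ordinal encoding of a hypothetical shirt-size column
sizes = pd.Series(['m', 'xs', 'l', 's', 'xl'])
size_order = {'xs': 0, 's': 1, 'm': 2, 'l': 3, 'xl': 4}
sizes.map(size_order)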

We’ll look at sex first. Let’s pull that out as a separate ‘series’.

sex = data['Sex']
sex.head()
0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object

From looking at the data it appears passengers are either male or female, but data can contain missing values or spelling mistakes, so let’s check all the values present. An easy way to do this is to use Python’s set command which only allows one instance of each value.

set(sex)
{'female', 'male'}

That’s good. We have just ‘female’ and ‘male’. Let’s code a new ‘male’ column manually, and check the mean (the proportion of passengers who are male).

male = data['Sex'] == 'male'
male.mean()
0.6475869809203143

That looks reasonable. We’ll add our new column to our dataframe, and remove the old ‘Sex’ column.

To remove a column we use the pandas drop() method. To show it is a column we specify axis=1. Using inplace=True makes the change in the dataframe itself; it has the same effect as assigning the result back, i.e. data = data.drop().

data['male'] = male
data.drop(['Sex'], axis=1, inplace=True)

Let’s look at our table now.

data.head()
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Embarked AgeImputed EmbarkedImputed CabinLetter CabinLetterImputed CabinNumber CabinNumberImputed male
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 S False False missing True 0 True True
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 C False False C False 85 False False
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 S False False missing True 0 True False
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 S False False C False 123 False False
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 S False False missing True 0 True True

Let’s do the same with ‘embarked’.

embarked = data['Embarked']
set(embarked)
{'C', 'Q', 'S', 'missing'}

Ah, we have four possibilities!

We could frame this as a series of if/elif/else statements. That is reasonable for a few possibilities, but what if we have many? We could write our own function to ‘one-hot’ encode this column, but pandas can already do this for us with the get_dummies method.

Note that we pass a couple of useful arguments: prefix allows us to add some text to each label, and dummy_na=True allows us to specifically code missing values (though we have already given them the label ‘missing’).

As ever, it is often useful to look at the help for these methods (help(pd.get_dummies)).

embarked_coded = pd.get_dummies(embarked, prefix='Embarked')
embarked_coded.head()
Embarked_C Embarked_Q Embarked_S Embarked_missing
0 0 0 1 0
1 1 0 0 0
2 0 0 1 0
3 0 0 1 0
4 0 0 1 0
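For reference, dummy_na=True would have achieved much the same result directly on the original column, without first labelling the missing values (a sketch):

# Illustrative: one-hot encode the original column with a NaN indicator column
alt_coded = pd.get_dummies(original_data['Embarked'], prefix='Embarked', dummy_na=True)
alt_coded.head()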

Nice! We’ll add our new table to the data table and drop the original ‘Embarked’ column. Pandas concat method will join our dataframes.

Pandas has concat, merge and join methods for combining dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

data = pd.concat([data, embarked_coded], axis=1)
data.drop(['Embarked'], axis=1, inplace=True)
data.head()
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare AgeImputed EmbarkedImputed CabinLetter CabinLetterImputed CabinNumber CabinNumberImputed male Embarked_C Embarked_Q Embarked_S Embarked_missing
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 False False missing True 0 True True 0 0 1 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 False False C False 85 False False 1 0 0 0
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 False False missing True 0 True False 0 0 1 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 False False C False 123 False False 0 0 1 0
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 False False missing True 0 True True 0 0 1 0
We’ll one-hot encode the cabin letters in the same way, using the CabinLetter list we built earlier.

cabin_coded = pd.get_dummies(CabinLetter, prefix='CabinLetter')
cabin_coded.head()
CabinLetter_A CabinLetter_B CabinLetter_C CabinLetter_D CabinLetter_E CabinLetter_F CabinLetter_G CabinLetter_T CabinLetter_missing
0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1
3 0 0 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1

Now let’s add those back to the table.

data = pd.concat([data, cabin_coded], axis=1)
data.drop(['CabinLetter'], axis=1, inplace=True)
data.head()
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare AgeImputed ... Embarked_missing CabinLetter_A CabinLetter_B CabinLetter_C CabinLetter_D CabinLetter_E CabinLetter_F CabinLetter_G CabinLetter_T CabinLetter_missing
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 False ... 0 0 0 0 0 0 0 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 False ... 0 0 0 1 0 0 0 0 0 0
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 False ... 0 0 0 0 0 0 0 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 False ... 0 0 0 1 0 0 0 0 0 0
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 False ... 0 0 0 0 0 0 0 0 0 1

5 rows × 28 columns

Now we will drop the Name and Ticket columns (they may perhaps be useful in some way, but we’ll simplify things by removing them).

Drop columns#

cols_to_drop = ['Name', 'Ticket']
data.drop(cols_to_drop, axis=1, inplace=True)
data.head()
PassengerId Survived Pclass Age SibSp Parch Fare AgeImputed EmbarkedImputed CabinLetterImputed ... Embarked_missing CabinLetter_A CabinLetter_B CabinLetter_C CabinLetter_D CabinLetter_E CabinLetter_F CabinLetter_G CabinLetter_T CabinLetter_missing
0 1 0 3 22.0 1 0 7.2500 False False True ... 0 0 0 0 0 0 0 0 0 1
1 2 1 1 38.0 1 0 71.2833 False False False ... 0 0 0 1 0 0 0 0 0 0
2 3 1 3 26.0 0 0 7.9250 False False True ... 0 0 0 0 0 0 0 0 0 1
3 4 1 1 35.0 1 0 53.1000 False False False ... 0 0 0 1 0 0 0 0 0 0
4 5 0 3 35.0 0 0 8.0500 False False True ... 0 0 0 0 0 0 0 0 0 1

5 rows × 26 columns

Having a quick look at differences between survived and non-survived passengers#

Phew, the data-preprocessing is done! This is often a tedious and time-consuming stage with few ‘endorphin rush’ rewards to be had.

Let’s split our data into survived and non-survived passengers and have a quick look to see whether anything obvious stands out.

mask = data['Survived'] == 1 # mask for survived passengers
survived = data[mask]

# Invert mask (for passengers who died)
mask = ~mask
died = data[mask]

Now let’s have a quick look at mean values for our two groups. We’ll put them side by side in a new DataFrame.

summary = pd.DataFrame()
summary['survived'] = survived.mean()
summary['died'] = died.mean()
summary
survived died
PassengerId 444.368421 447.016393
Survived 1.000000 0.000000
Pclass 1.950292 2.531876
Age 28.291433 30.028233
SibSp 0.473684 0.553734
Parch 0.464912 0.329690
Fare 48.395408 22.117887
AgeImputed 0.152047 0.227687
EmbarkedImputed 0.005848 0.000000
CabinLetterImputed 0.602339 0.876138
CabinNumberImputed 0.611111 0.885246
male 0.318713 0.852459
Embarked_C 0.271930 0.136612
Embarked_Q 0.087719 0.085610
Embarked_S 0.634503 0.777778
Embarked_missing 0.005848 0.000000
CabinLetter_A 0.020468 0.014572
CabinLetter_B 0.102339 0.021858
CabinLetter_C 0.102339 0.043716
CabinLetter_D 0.073099 0.014572
CabinLetter_E 0.070175 0.014572
CabinLetter_F 0.023392 0.009107
CabinLetter_G 0.005848 0.003643
CabinLetter_T 0.000000 0.001821
CabinLetter_missing 0.602339 0.876138
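The same comparison could be produced more concisely with groupby (a sketch giving equivalent numbers):

# Equivalent comparison: mean of each column, grouped by survival
data.groupby('Survived').mean().T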

Save processed data#

data.to_csv('./data/processed_data.csv', index=False)
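As an optional final check, we could read the file back in and confirm its shape:

# Optional check that the processed file reads back correctly
check = pd.read_csv('./data/processed_data.csv')
print(check.shape)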