Logistic regression model#
The data includes:
| Variable | Definition |
|---|---|
| survival | Survival (0 = No, 1 = Yes) |
| pclass | Ticket class |
| sex | Sex |
| age | Age in years |
| sibsp | # of siblings / spouses aboard the Titanic |
| parch | # of parents / children aboard the Titanic |
| ticket | Ticket number |
| fare | Passenger fare |
| cabin | Cabin number |
| embarked | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
Logistic regression#
In this example we will use logistic regression (see https://en.wikipedia.org/wiki/Logistic_regression).
For an introductory video on logistic regression see: https://www.youtube.com/watch?v=yIYKR4sgzI8
Logistic regression takes a range of features (which we will normalise/standardise to put on the same scale) and returns a probability that a certain classification (survival in this case) is true.
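Under the hood, the model learns a weight for each feature plus an intercept, and passes the weighted sum of the features through the logistic (sigmoid) function to turn it into a probability between 0 and 1. A minimal sketch of that calculation is shown below; the feature values, weights and intercept are made up purely for illustration.

import numpy as np

def logistic_probability(features, weights, intercept):
    # Weighted sum of features, passed through the logistic (sigmoid) function
    z = np.dot(features, weights) + intercept
    return 1 / (1 + np.exp(-z))

# Illustrative values only: two standardised features, their weights, and an intercept
print(logistic_probability(np.array([0.5, -1.2]), np.array([1.1, -0.8]), 0.2))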
We will go through the following steps:
Download and save pre-processed data
Split data into features (X) and label (y)
Split data into training and test sets (we will test on data that has not been used to fit the model)
Standardise data
Fit a logistic regression model (from sklearn)
Predict survival of the test set, and assess accuracy
Review model coefficients (weights) to see importance of features
Show probability of survival for passengers
Load modules#
A standard Anaconda install of Python (https://www.anaconda.com/distribution/) contains all the necessary modules.
import numpy as np
import pandas as pd
# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Load data#
The section below downloads the pre-processed data and saves it to a subfolder (relative to where this code is run). If the data has already been downloaded, this cell may be skipped.
Code that was used to pre-process the data ready for machine learning may be found at: https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/01_preprocessing.ipynb
download_required = True

if download_required:
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
              '1804_python_healthcare/master/titanic/data/processed_data.csv'
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)
data = pd.read_csv('data/processed_data.csv')
# Make all data 'float' type
data = data.astype(float)
Examine loaded data#
The data is in the form of a Pandas DataFrame, so we have column headers providing information about what is contained in each column.
We will use the DataFrame `.head()` method to show the first few rows of the imported DataFrame. By default this shows the first 5 rows. Here we will look at the first 10.
data.head(10)
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | AgeImputed | EmbarkedImputed | CabinLetterImputed | ... | Embarked_missing | CabinLetter_A | CabinLetter_B | CabinLetter_C | CabinLetter_D | CabinLetter_E | CabinLetter_F | CabinLetter_G | CabinLetter_T | CabinLetter_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 3.0 | 22.0 | 1.0 | 0.0 | 7.2500 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 2.0 | 1.0 | 1.0 | 38.0 | 1.0 | 0.0 | 71.2833 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 3.0 | 1.0 | 3.0 | 26.0 | 0.0 | 0.0 | 7.9250 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
3 | 4.0 | 1.0 | 1.0 | 35.0 | 1.0 | 0.0 | 53.1000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 5.0 | 0.0 | 3.0 | 35.0 | 0.0 | 0.0 | 8.0500 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 | 6.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 8.4583 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
6 | 7.0 | 0.0 | 1.0 | 54.0 | 0.0 | 0.0 | 51.8625 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 8.0 | 0.0 | 3.0 | 2.0 | 3.0 | 1.0 | 21.0750 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
8 | 9.0 | 1.0 | 3.0 | 27.0 | 0.0 | 2.0 | 11.1333 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
9 | 10.0 | 1.0 | 2.0 | 14.0 | 1.0 | 0.0 | 30.0708 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
10 rows × 26 columns
We can also show a summary of the data with the `.describe()` method.
data.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | AgeImputed | EmbarkedImputed | CabinLetterImputed | ... | Embarked_missing | CabinLetter_A | CabinLetter_B | CabinLetter_C | CabinLetter_D | CabinLetter_E | CabinLetter_F | CabinLetter_G | CabinLetter_T | CabinLetter_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | ... | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.361582 | 0.523008 | 0.381594 | 32.204208 | 0.198653 | 0.002245 | 0.771044 | ... | 0.002245 | 0.016835 | 0.052750 | 0.066218 | 0.037037 | 0.035915 | 0.014590 | 0.004489 | 0.001122 | 0.771044 |
std | 257.353842 | 0.486592 | 0.836071 | 13.019697 | 1.102743 | 0.806057 | 49.693429 | 0.399210 | 0.047351 | 0.420397 | ... | 0.047351 | 0.128725 | 0.223659 | 0.248802 | 0.188959 | 0.186182 | 0.119973 | 0.066890 | 0.033501 | 0.420397 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 22.000000 | 0.000000 | 0.000000 | 7.910400 | 0.000000 | 0.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 0.000000 | 0.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
75% | 668.500000 | 1.000000 | 3.000000 | 35.000000 | 1.000000 | 0.000000 | 31.000000 | 0.000000 | 0.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 26 columns
The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.
# Drop PassengerId (axis=1 indicates we are removing a column rather than a row)
# We drop PassengerId as it is not original passenger data
data.drop('PassengerId', inplace=True, axis=1)
Looking at a summary of passengers who survived or did not survive#
Before running machine learning models, it is always good to have a look at your data. Here we will separate passengers who survived from those who died, and we will have a look at differences in features.
We will use a mask to select and filter passengers.
mask = data['Survived'] == 1 # Mask for passengers who survive
survived = data[mask] # filter using mask
mask = data['Survived'] == 0 # Mask for passengers who died
died = data[mask] # filter using mask
Now let's look at average (mean) values for `survived` and `died`.
survived.mean()
Survived 1.000000
Pclass 1.950292
Age 28.291433
SibSp 0.473684
Parch 0.464912
Fare 48.395408
AgeImputed 0.152047
EmbarkedImputed 0.005848
CabinLetterImputed 0.602339
CabinNumber 18.961988
CabinNumberImputed 0.611111
male 0.318713
Embarked_C 0.271930
Embarked_Q 0.087719
Embarked_S 0.634503
Embarked_missing 0.005848
CabinLetter_A 0.020468
CabinLetter_B 0.102339
CabinLetter_C 0.102339
CabinLetter_D 0.073099
CabinLetter_E 0.070175
CabinLetter_F 0.023392
CabinLetter_G 0.005848
CabinLetter_T 0.000000
CabinLetter_missing 0.602339
dtype: float64
died.mean()
Survived 0.000000
Pclass 2.531876
Age 30.028233
SibSp 0.553734
Parch 0.329690
Fare 22.117887
AgeImputed 0.227687
EmbarkedImputed 0.000000
CabinLetterImputed 0.876138
CabinNumber 6.074681
CabinNumberImputed 0.885246
male 0.852459
Embarked_C 0.136612
Embarked_Q 0.085610
Embarked_S 0.777778
Embarked_missing 0.000000
CabinLetter_A 0.014572
CabinLetter_B 0.021858
CabinLetter_C 0.043716
CabinLetter_D 0.014572
CabinLetter_E 0.014572
CabinLetter_F 0.009107
CabinLetter_G 0.003643
CabinLetter_T 0.001821
CabinLetter_missing 0.876138
dtype: float64
We can compare survivors and non-survivors more easily by putting these mean values side by side in a new DataFrame.
summary = pd.DataFrame() # New empty DataFrame
summary['survived'] = survived.mean()
summary['died'] = died.mean()
Now let’s look at them side by side. See if you can spot what features you think might have influenced survival.
summary
survived | died | |
---|---|---|
Survived | 1.000000 | 0.000000 |
Pclass | 1.950292 | 2.531876 |
Age | 28.291433 | 30.028233 |
SibSp | 0.473684 | 0.553734 |
Parch | 0.464912 | 0.329690 |
Fare | 48.395408 | 22.117887 |
AgeImputed | 0.152047 | 0.227687 |
EmbarkedImputed | 0.005848 | 0.000000 |
CabinLetterImputed | 0.602339 | 0.876138 |
CabinNumber | 18.961988 | 6.074681 |
CabinNumberImputed | 0.611111 | 0.885246 |
male | 0.318713 | 0.852459 |
Embarked_C | 0.271930 | 0.136612 |
Embarked_Q | 0.087719 | 0.085610 |
Embarked_S | 0.634503 | 0.777778 |
Embarked_missing | 0.005848 | 0.000000 |
CabinLetter_A | 0.020468 | 0.014572 |
CabinLetter_B | 0.102339 | 0.021858 |
CabinLetter_C | 0.102339 | 0.043716 |
CabinLetter_D | 0.073099 | 0.014572 |
CabinLetter_E | 0.070175 | 0.014572 |
CabinLetter_F | 0.023392 | 0.009107 |
CabinLetter_G | 0.005848 | 0.003643 |
CabinLetter_T | 0.000000 | 0.001821 |
CabinLetter_missing | 0.602339 | 0.876138 |
Divide into X (features) and y (labels)#
We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict).
By convention our features are called `X` (usually upper case to denote multiple features), and the label (survived or not) is called `y`.
X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'
Divide into training and test sets#
When we test a machine learning model we should always test it on data that has not been used to train the model.
We will use sklearn's `train_test_split` method to randomly split the data: 75% for training, and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# Show the spread (standard deviation) and mean of the training set features
X_train.std(), X_train.mean()
(Pclass 0.840603
Age 12.965456
SibSp 1.045440
Parch 0.829373
Fare 53.168865
AgeImputed 0.390305
EmbarkedImputed 0.054677
CabinLetterImputed 0.424335
CabinNumber 28.664011
CabinNumberImputed 0.419561
male 0.481018
Embarked_C 0.395037
Embarked_Q 0.279581
Embarked_S 0.450037
Embarked_missing 0.054677
CabinLetter_A 0.115375
CabinLetter_B 0.228910
CabinLetter_C 0.265752
CabinLetter_D 0.174627
CabinLetter_E 0.197088
CabinLetter_F 0.121524
CabinLetter_G 0.038691
CabinLetter_T 0.038691
CabinLetter_missing 0.424335
dtype: float64, Pclass 2.296407
Age 29.466826
SibSp 0.497006
Parch 0.389222
Fare 33.320958
AgeImputed 0.187126
EmbarkedImputed 0.002994
CabinLetterImputed 0.764970
CabinNumber 12.332335
CabinNumberImputed 0.772455
male 0.637725
Embarked_C 0.193114
Embarked_Q 0.085329
Embarked_S 0.718563
Embarked_missing 0.002994
CabinLetter_A 0.013473
CabinLetter_B 0.055389
CabinLetter_C 0.076347
CabinLetter_D 0.031437
CabinLetter_E 0.040419
CabinLetter_F 0.014970
CabinLetter_G 0.001497
CabinLetter_T 0.001497
CabinLetter_missing 0.764970
dtype: float64)
Standardise data#
We want all of our features to be on roughly the same scale. This generally leads to a better model, and also allows us to more easily compare the importance of different features.
One simple method is to scale all features to the range 0-1 (by subtracting the minimum value of each feature, and dividing by the remaining maximum value).
A more common method, used in many machine learning methods, is standardisation, where we use the mean and standard deviation of the training set to normalise the data. We subtract the mean of the training set values, and divide by the standard deviation of the training data. Note that the mean and standard deviation of the training data are used to standardise the test set data as well.
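For illustration only, this standardisation could be written by hand with pandas. This is a rough sketch, assuming `X_train` and `X_test` are the DataFrames created above; the sklearn approach below is what we actually use.

# Hand-rolled standardisation sketch: training set statistics applied to both sets
train_means = X_train.mean()
train_stds = X_train.std()
X_train_manual = (X_train - train_means) / train_stds
X_test_manual = (X_test - train_means) / train_stds
# Note: this naive version would divide by zero if a feature had zero standard
# deviation in the training set, which is one reason to prefer StandardScaler below.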
Here we will use sklearn's `StandardScaler` method. This method also copes with problems we might otherwise have (such as if one feature has zero standard deviation in the training set).
X_train.astype(float)
Pclass | Age | SibSp | Parch | Fare | AgeImputed | EmbarkedImputed | CabinLetterImputed | CabinNumber | CabinNumberImputed | ... | Embarked_missing | CabinLetter_A | CabinLetter_B | CabinLetter_C | CabinLetter_D | CabinLetter_E | CabinLetter_F | CabinLetter_G | CabinLetter_T | CabinLetter_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
335 | 3.0 | 28.0 | 0.0 | 0.0 | 7.8958 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
827 | 2.0 | 1.0 | 0.0 | 2.0 | 37.0042 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
270 | 1.0 | 28.0 | 0.0 | 0.0 | 31.0000 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
826 | 3.0 | 28.0 | 0.0 | 0.0 | 56.4958 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
544 | 1.0 | 50.0 | 1.0 | 0.0 | 106.4250 | 0.0 | 0.0 | 0.0 | 86.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
166 | 1.0 | 28.0 | 0.0 | 1.0 | 55.0000 | 1.0 | 0.0 | 0.0 | 33.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
813 | 3.0 | 6.0 | 4.0 | 2.0 | 31.2750 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
679 | 1.0 | 36.0 | 0.0 | 1.0 | 512.3292 | 0.0 | 0.0 | 0.0 | 51.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
687 | 3.0 | 19.0 | 0.0 | 0.0 | 10.1708 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
558 | 1.0 | 39.0 | 1.0 | 1.0 | 79.6500 | 0.0 | 0.0 | 0.0 | 67.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
668 rows × 24 columns
def standardise_data(X_train, X_test):
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler()
    # Set up the scaler just on the training set
    sc.fit(X_train)
    # Apply the scaler to the training and test sets
    train_std = sc.transform(X_train)
    test_std = sc.transform(X_test)
    return train_std, test_std
X_train_std, X_test_std = standardise_data(X_train, X_test)
Fit logistic regression model#
Now we will fit a logistic regression model, using sklearn's `LogisticRegression` method. Our machine learning model fitting is only two lines of code!
By using the generic name `model` for our logistic regression model, we make it easier to swap in other model types later on.
model = LogisticRegression()
model.fit(X_train_std,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
Predict values#
Now we can use the trained model to predict survival. We will test the accuracy of both the training and test data sets.
# Predict training and test set labels
y_pred_train = model.predict(X_train_std)
y_pred_test = model.predict(X_test_std)
Calculate accuracy#
In this example we will measure accuracy simply as the proportion of passengers where we make the correct prediction. In a later notebook we will look at other measures of accuracy which explore false positives and false negatives in more detail.
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
print ('Accuracy of predicting training data =', accuracy_train)
print ('Accuracy of predicting test data =', accuracy_test)
Accuracy of predicting training data = 0.8278443113772455
Accuracy of predicting test data = 0.7623318385650224
Not bad - about 80% accuracy. You will probably see that accuracy of predicting the training set is usually higher than that of the test set. Because we are only testing one random train/test split, you may occasionally see otherwise. In later notebooks we will look at the best way to repeat multiple tests, and at what to do if the accuracy of the training set is significantly higher than that of the test set (a problem called ‘over-fitting’).
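If you would like a quick feel now for how much the test accuracy varies between random splits (a fuller treatment is left to later notebooks), one rough approach is to repeat the split, standardisation and fit several times and look at the spread of accuracies. A minimal sketch, reusing X, y and standardise_data from above:

# Rough check of variability: repeat the random split and model fit several times
repeat_accuracies = []
for i in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)
    X_tr_std, X_te_std = standardise_data(X_tr, X_te)
    repeat_model = LogisticRegression()
    repeat_model.fit(X_tr_std, y_tr)
    repeat_accuracies.append(np.mean(repeat_model.predict(X_te_std) == y_te))
print('Mean test accuracy:', np.mean(repeat_accuracies))
print('Standard deviation of test accuracy:', np.std(repeat_accuracies))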
Examining the model coefficients (weights)#
Not all features are equally important. And some may be of little or no use at all, unnecessarily increasing the complexity of the model. In a later notebook we will look at selecting features which add value to the model (or removing features that don’t).
Here we will look at the importance of features – how they affect our estimation of survival. These are known as the model coefficients (if you come from a traditional statistics background), or model weights (if you come from a machine learning background).
Because we have standardised our input data, the magnitudes of the weights may be compared as an indicator of their influence in the model. Weights with larger negative values mean that the feature correlates with a reduced chance of survival, and weights with larger positive values mean that the feature correlates with an increased chance of survival. Weights with values closer to zero (either positive or negative) have less influence in the model.
We access the model weights by examining the model `coef_` attribute. The model may predict more than one outcome label, in which case we have weights for each label. Because we are predicting a single label (survived or not), the weights are found in the first element (`[0]`) of the `coef_` attribute.
co_eff = model.coef_[0]
co_eff
array([-0.73418553, -0.40986181, -0.35544658, -0.19006937, 0.14628888,
-0.11822452, 0.09578928, 0.09296143, 0.06552385, -0.55844984,
-1.40560834, 0.11123844, 0.03435464, -0.13062397, 0.09578928,
-0.05538374, -0.11890361, -0.23386705, 0.09861583, 0.13506331,
0.09892766, 0.15361326, -0.14199379, 0.09296143])
So we have an array of model weights.
Not very readable for us mere humans is it?!
We will transfer the weights array to a Pandas DataFrame. The array is in the same order as the list of features in X, so we will put those feature names into the DataFrame as well. And we will sort by influence in the model. Because both large negative and large positive values are influential, we will take the absolute value of each weight (that is, remove any negative sign) and then sort by that absolute value. That gives us a more readable table of the most influential features in the model.
co_eff_df = pd.DataFrame() # create empty DataFrame
co_eff_df['feature'] = list(X) # Get feature names from X
co_eff_df['co_eff'] = co_eff
co_eff_df['abs_co_eff'] = np.abs(co_eff)
co_eff_df.sort_values(by='abs_co_eff', ascending=False, inplace=True)
co_eff_df
feature | co_eff | abs_co_eff | |
---|---|---|---|
10 | male | -1.405608 | 1.405608 |
0 | Pclass | -0.734186 | 0.734186 |
9 | CabinNumberImputed | -0.558450 | 0.558450 |
1 | Age | -0.409862 | 0.409862 |
2 | SibSp | -0.355447 | 0.355447 |
17 | CabinLetter_C | -0.233867 | 0.233867 |
3 | Parch | -0.190069 | 0.190069 |
21 | CabinLetter_G | 0.153613 | 0.153613 |
4 | Fare | 0.146289 | 0.146289 |
22 | CabinLetter_T | -0.141994 | 0.141994 |
19 | CabinLetter_E | 0.135063 | 0.135063 |
13 | Embarked_S | -0.130624 | 0.130624 |
16 | CabinLetter_B | -0.118904 | 0.118904 |
5 | AgeImputed | -0.118225 | 0.118225 |
11 | Embarked_C | 0.111238 | 0.111238 |
20 | CabinLetter_F | 0.098928 | 0.098928 |
18 | CabinLetter_D | 0.098616 | 0.098616 |
6 | EmbarkedImputed | 0.095789 | 0.095789 |
14 | Embarked_missing | 0.095789 | 0.095789 |
23 | CabinLetter_missing | 0.092961 | 0.092961 |
7 | CabinLetterImputed | 0.092961 | 0.092961 |
8 | CabinNumber | 0.065524 | 0.065524 |
15 | CabinLetter_A | -0.055384 | 0.055384 |
12 | Embarked_Q | 0.034355 | 0.034355 |
So our three most influential features are:
male (being male reduces probability of survival)
Pclass (being in a lower class, which has a higher class number, reduces probability of survival)
Age (being older reduces probability of survival)
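As an optional aside (not part of the workflow above), logistic regression weights may also be read as odds ratios by exponentiating them: exp(weight) is the multiplicative change in the odds of survival for a one standard deviation increase in that standardised feature. A minimal sketch, reusing co_eff_df from above:

# Convert each weight to an odds ratio (odds ratios below 1 reduce the odds of survival)
co_eff_df['odds_ratio'] = np.exp(co_eff_df['co_eff'])
co_eff_df[['feature', 'co_eff', 'odds_ratio']].head()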
Show predicted probabilities#
The predicted probabilities are for the two alternative classes 0 (does not survive) or 1 (survive).
Ordinarily we do not see these probabilities - the `predict` method used above applies a cut-off of 0.5 to classify passengers into survived or not, but we can see the individual probabilities for each passenger.
Later we will use these to adjust sensitivity of our model to detecting survivors or non-survivors.
Each passenger has two values. These are the probability of not surviving (first value) and of surviving (second value). Because we only have two possible classes we only need to look at one of them. The separate values become more important when more than two classes are being predicted.
# Show first ten predicted classes
classes = model.predict(X_test_std)
classes[0:10]
array([1., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
# Show first ten predicted probabilities
# (note how the values relate to the classes predicted above)
probabilities = model.predict_proba(X_test_std)
probabilities[0:10]
array([[0.34607729, 0.65392271],
[0.93813559, 0.06186441],
[0.46852979, 0.53147021],
[0.84549935, 0.15450065],
[0.95341715, 0.04658285],
[0.88266921, 0.11733079],
[0.86399098, 0.13600902],
[0.90629119, 0.09370881],
[0.99670517, 0.00329483],
[0.81884463, 0.18115537]])
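As a small taster of adjusting sensitivity (covered properly later), we could classify passengers by comparing the probability of surviving (the second column) against our own cut-off rather than the default 0.5. A minimal sketch, reusing the probabilities array above; the threshold value is purely illustrative:

# Classify survivors using a custom probability cut-off instead of the default 0.5
threshold = 0.3  # illustrative value; a lower cut-off labels more passengers as survivors
custom_classes = (probabilities[:, 1] >= threshold).astype(float)
custom_classes[0:10]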