A Random Forest model#

Our previous work has all been based on logistic regression, the most common ‘standard model’ against which other models are compared.

In this notebook we swap out the logistic regression model for a Random Forest model. Random Forests are often chosen for classification based on structured data (i.e. when we have specific features of data, rather than unstructured data like a picture or a sound file).

Random Forests work by constructing many decision trees, each trained on a bootstrap sample of the cases and considering only a random subset of the features at each split; the trees’ individual predictions are then combined by majority vote. Because each tree sees only part of the data, Random Forests tend to be less prone to over-fitting than a single decision tree. For more on the basis of Random Forests see:

https://en.wikipedia.org/wiki/Random_forest
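In scikit-learn’s RandomForestClassifier the number of trees and the randomness of each tree are controlled mainly by n_estimators, max_features and max_depth. The settings below are purely illustrative, not tuned recommendations:

from sklearn.ensemble import RandomForestClassifier

# Illustrative settings for the main Random Forest controls (not tuned values)
example_model = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    max_features='sqrt',  # random subset of features considered at each split
    max_depth=None,       # let each tree grow until its leaves are pure
    random_state=42)      # fix the random seed for reproducibility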

Note in this example how similar the code is to our previous logistic regression model. A couple of notable changes are:

  • Data for Random Forest models do not need standardisation; we use the raw data.

  • Rather than having coefficients, the model reports feature ‘importances’, which reflect how influential each feature is in deciding classification. These are accessed through model.feature_importances_ (a brief sketch is shown below).

Here we will again use stratified K-fold validation to test the model performance. We will use default settings for the Random Forest model.
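For illustration, here is a minimal sketch of pairing importances with feature names (model and X_df are assumed to be the fitted model and feature DataFrame defined later in this notebook):

# Pair feature importances with column names and sort (assumes a fitted
# Random Forest `model` and the feature DataFrame `X_df` defined below)
feature_importance = pd.Series(
    model.feature_importances_, index=X_df.columns)
feature_importance.sort_values(ascending=False, inplace=True)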

Load modules#

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

Download data#

Run the following code if data for Titanic survival has not been previously downloaded.

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data#

# Load data & drop passenger ID
data = pd.read_csv('data/processed_data.csv')

# Make all data 'float' type
data = data.astype(float)

data.drop('PassengerId', inplace=True, axis=1)

# Split data into two DataFrames
X_df = data.drop('Survived',axis=1)
y_df = data['Survived']

# Convert DataFrames to NumPy arrays
X = X_df.values
y = y_df.values

Define function to calculate accuracy#

import numpy as np

def calculate_accuracy(observed, predicted):
    
    """
    Calculates a range of accuracy scores from observed and predicted classes.
    
    Takes two lists or NumPy arrays (observed class values, and predicted class 
    values), and returns a dictionary of results.
    
     1) observed positive rate: proportion of observed cases that are +ve
     2) Predicted positive rate: proportion of predicted cases that are +ve
     3) observed negative rate: proportion of observed cases that are -ve
     4) Predicted negative rate: proportion of predicted cases that are -ve  
     5) accuracy: proportion of predicted results that are correct    
     6) precision: proportion of predicted +ve that are correct
     7) recall: proportion of true +ve correctly identified
     8) f1: harmonic mean of precision and recall
     9) sensitivity: Same as recall
    10) specificity: proportion of true -ve correctly identified
    11) positive likelihood: increased probability of true +ve if test +ve
    12) negative likelihood: reduced probability of true +ve if test -ve
    13) false positive rate: proportion of false +ves in true -ve patients
    14) false negative rate: proportion of false -ves in true +ve patients
    15) true positive rate: Same as recall
    16) true negative rate: Same as specificity
    17) positive predictive value: chance of true +ve if test +ve
    18) negative predictive value: chance of true -ve if test -ve
    
    """
    
    # Convert lists to NumPy arrays if needed
    if type(observed) == list:
        observed = np.array(observed)
    if type(predicted) == list:
        predicted = np.array(predicted)
    
    # Calculate accuracy scores
    observed_positives = observed == 1
    observed_negatives = observed == 0
    predicted_positives = predicted == 1
    predicted_negatives = predicted == 0
    
    true_positives = predicted_positives & observed_positives
    
    false_positives = predicted_positives & observed_negatives
    
    true_negatives = predicted_negatives & observed_negatives
    
    false_negatives = predicted_negatives & observed_positives
    
    accuracy = np.mean(predicted == observed)
    
    precision = (np.sum(true_positives) /
                 (np.sum(true_positives) + np.sum(false_positives)))
        
    recall = np.sum(true_positives) / np.sum(observed_positives)
    
    sensitivity = recall
    
    f1 = 2 * ((precision * recall) / (precision + recall))
    
    specificity = np.sum(true_negatives) / np.sum(observed_negatives)
    
    positive_likelihood = sensitivity / (1 - specificity)
    
    negative_likelihood = (1 - sensitivity) / specificity
    
    false_positive_rate = 1 - specificity
    
    false_negative_rate = 1 - sensitivity
    
    true_positive_rate = sensitivity
    
    true_negative_rate = specificity
    
    positive_predictive_value = (np.sum(true_positives) / 
                            (np.sum(true_positives) + np.sum(false_positives)))
    
    negative_predictive_value = (np.sum(true_negatives) / 
                            (np.sum(true_negatives) + np.sum(false_negatives)))
    
    # Create dictionary for results, and add results
    results = dict()
    
    results['observed_positive_rate'] = np.mean(observed_positives)
    results['observed_negative_rate'] = np.mean(observed_negatives)
    results['predicted_positive_rate'] = np.mean(predicted_positives)
    results['predicted_negative_rate'] = np.mean(predicted_negatives)
    results['accuracy'] = accuracy
    results['precision'] = precision
    results['recall'] = recall
    results['f1'] = f1
    results['sensitivity'] = sensitivity
    results['specificity'] = specificity
    results['positive_likelihood'] = positive_likelihood
    results['negative_likelihood'] = negative_likelihood
    results['false_positive_rate'] = false_positive_rate
    results['false_negative_rate'] = false_negative_rate
    results['true_positive_rate'] = true_positive_rate
    results['true_negative_rate'] = true_negative_rate
    results['positive_predictive_value'] = positive_predictive_value
    results['negative_predictive_value'] = negative_predictive_value
    
    return results

Run the model with k-fold validation#

# Set up lists to hold results for each k-fold run
replicate_accuracy = []
replicate_precision = []
replicate_recall = []
replicate_f1 = []
replicate_predicted_positive_rate = []
replicate_observed_positive_rate = []

# Set up DataFrame for feature importances
importances = pd.DataFrame(index = list(X_df))

# Convert DataFrames to NumPy arrays
X = X_df.values
y = y_df.values

# Set up splits
number_of_splits = 10
skf = StratifiedKFold(n_splits = number_of_splits)
skf.get_n_splits(X, y)

# Loop through the k-fold splits
k_fold_count = 0

for train_index, test_index in skf.split(X, y):
    
    # Get X and Y train/test
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Set up and fit model (n_jobs=-1 uses all cores on a computer)
    model = RandomForestClassifier(n_jobs=-1)
    model.fit(X_train,y_train)
    
    # Predict test set labels and get accuracy scores
    y_pred_test = model.predict(X_test)
    accuracy_scores = calculate_accuracy(y_test, y_pred_test)
    replicate_accuracy.append(accuracy_scores['accuracy'])
    replicate_precision.append(accuracy_scores['precision'])
    replicate_recall.append(accuracy_scores['recall'])
    replicate_f1.append(accuracy_scores['f1'])
    replicate_predicted_positive_rate.append(
        accuracy_scores['predicted_positive_rate'])
    replicate_observed_positive_rate.append(
        accuracy_scores['observed_positive_rate'])
    
    # Record feature importances
    col_title = 'split_' + str(k_fold_count)
    importances[col_title] = model.feature_importances_
    k_fold_count +=1
    
# Transfer summary results to a pandas Series
results = pd.Series(dtype='float64')
results['accuracy'] = np.mean(replicate_accuracy)
results['precision'] = np.mean(replicate_precision)
results['recall'] = np.mean(replicate_recall)
results['f1'] = np.mean(replicate_f1)
results['predicted positive rate'] = np.mean(replicate_predicted_positive_rate)
results['observed positive rate'] = np.mean(replicate_observed_positive_rate)

# Get average of feature importances, and sort
importance_mean = importances.mean(axis=1)
importance_mean.sort_values(ascending=False, inplace=True)

Show results and feature importance#

results
accuracy                   0.804782
precision                  0.760550
recall                     0.722269
f1                         0.737268
predicted positive rate    0.365868
observed positive rate     0.383833
dtype: float64
importance_mean
male                   0.234202
Fare                   0.216091
Age                    0.215611
Pclass                 0.060542
CabinNumber            0.056673
SibSp                  0.046100
Parch                  0.039414
CabinNumberImputed     0.019187
CabinLetter_missing    0.016389
AgeImputed             0.016190
Embarked_S             0.015698
CabinLetterImputed     0.014914
Embarked_C             0.012849
Embarked_Q             0.008345
CabinLetter_B          0.005826
CabinLetter_E          0.005712
CabinLetter_C          0.005278
CabinLetter_D          0.004320
CabinLetter_A          0.003729
CabinLetter_F          0.001549
CabinLetter_G          0.000845
CabinLetter_T          0.000247
EmbarkedImputed        0.000177
Embarked_missing       0.000113
dtype: float64

Observations#

  • Without any optimisation we observe accuracy similar to, or a little higher than, logistic regression.

  • Unlike logistic regression we see a good balance between precision and recall, rather than a bias towards the majority class.

  • Performance of the model may be tested and optimised as we previously did for logistic regression (e.g. constructing ROC curves, performing feature selection, considering over-sampling or under-sampling as required, and examining learning curves); a sketch of ROC curve construction is shown after this list.
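As a hedged sketch (not part of the analysis above), a ROC curve for a single train/test split could be constructed from the model’s predicted probabilities as follows; in practice this would be repeated across the k-fold splits as in our earlier logistic regression work:

from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Single illustrative train/test split (the main code uses stratified k-fold)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = RandomForestClassifier(n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

# Probability of the positive class is used to sweep the decision threshold
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print('ROC AUC:', auc(fpr, tpr))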

For further notes on the sklearn Random Forest model (and which parameters may be fine-tuned, e.g. with random search or grid search as we did with the logistic regression model), please see the help pages for the Random Forest model, or refer to:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
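As a minimal sketch of such tuning with scikit-learn’s RandomizedSearchCV (the parameter ranges below are illustrative assumptions rather than recommendations):

from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space - ranges are assumptions, not tuned recommendations
param_distributions = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 5, 10, 20],
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': [2, 5, 10]}

random_search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions,
    n_iter=20,            # number of random parameter combinations to try
    cv=5,                 # 5-fold cross-validation for each combination
    scoring='accuracy',
    random_state=42)

random_search.fit(X, y)
print(random_search.best_params_)
print(random_search.best_score_)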