Building a Cox Proportional Hazard Model

Introduction  

Proportional hazards models are a class of survival models that relate the time that passes before an event occurs to one or more covariates. In this post, I’ll guide you through building a Cox proportional hazards regression model using the lifelines library. The model works naturally with quantitative variables, and it can also handle categorical ones once they are encoded as numbers.
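
For reference, the model is usually written as below. The symbols are standard textbook notation introduced here for illustration (they do not appear elsewhere in this post): h_0(t) is the baseline hazard, x the covariates, and β the coefficients the fit estimates. The second expression shows why the model is called "proportional": the ratio of two patients' hazards does not depend on time.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right),
\qquad
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^\top (x_a - x_b)\right)
```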

Required Packages

import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index as cindex
from sklearn.model_selection import train_test_split

The Dataset

For illustration, I’ve used the “Mayo Clinic Primary Biliary Cirrhosis Data” dataset from Kaggle. You can find this dataset here.

A Bit of Pre-processing

Cox proportional hazards models can also be used with categorical values like those present in this dataset, as long as they are encoded numerically. In this dataset, female patients are represented by “f” and male patients by “m”. With a bit of pre-processing, I changed “f” to 0 and “m” to 1 so that the column can be used in the regression.

# df is assumed to already hold the dataset as a pandas DataFrame
# Encode sex numerically: "f" -> 0, anything else ("m") -> 1
df['sex'] = (df['sex'] != "f").astype(int)

Splitting the Dataset into Training, Validation, and Test Sets

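np.random.seed(0)
# Hold out 20% of the rows for testing
df_dev, df_test = train_test_split(df, test_size = 0.2)
# Split the remaining 80% into training (75% of it) and validation (25% of it)
df_train, df_val = train_test_split(df_dev, test_size = 0.25)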
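
The two nested splits work out to roughly 60% training, 20% validation, and 20% test, since 25% of the remaining 80% is 20% of the whole. A minimal sketch that checks this arithmetic, assuming 100 placeholder rows in place of the real dataset and using `random_state` (rather than the global seed used in this post) so the snippet is reproducible on its own:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder row indices standing in for the real dataset
# (an assumption made purely to keep the arithmetic easy to read).
rows = np.arange(100)

# First split: 20% test, 80% development.
dev, test = train_test_split(rows, test_size=0.2, random_state=0)
# Second split: 25% of the development rows become the validation set.
train, val = train_test_split(dev, test_size=0.25, random_state=0)

print(len(train), len(val), len(test))  # 60 20 20
```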

Normalizing the Data

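If you view the dataset, you’ll observe that the continuous columns are on very different scales. Standardizing them (subtracting the mean and dividing by the standard deviation, both computed on the training set) helps the fitting procedure converge and makes the coefficients easier to compare, so we shall now normalize the data.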

continuous_columns = ['age', 'bili', 'chol', 'albumin', 'copper', 'alk.phos', 'ast', 'trig', 'platelet', 'protime']
# Compute the statistics on the training set only, then reuse them for every split
mean = df_train.loc[:, continuous_columns].mean()
std = df_train.loc[:, continuous_columns].std()
# Standardize each split with the training mean and standard deviation
df_train.loc[:, continuous_columns] = (df_train.loc[:, continuous_columns] - mean) / std
df_val.loc[:, continuous_columns] = (df_val.loc[:, continuous_columns] - mean) / std
df_test.loc[:, continuous_columns] = (df_test.loc[:, continuous_columns] - mean) / std
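
To see what this step does, here is a small sketch on made-up numbers (the toy values below are invented for illustration and are not taken from the dataset). It shows that the training columns end up with mean 0 and standard deviation 1, while validation rows are simply expressed in "training standard deviations from the training mean":

```python
import numpy as np
import pandas as pd

# Made-up toy values, for illustration only.
toy_train = pd.DataFrame({"age": [40.0, 50.0, 60.0], "bili": [1.0, 2.0, 6.0]})
toy_val = pd.DataFrame({"age": [55.0], "bili": [3.0]})

# Statistics come from the training rows only.
mean = toy_train.mean()
std = toy_train.std()  # pandas uses the sample standard deviation (ddof=1)

toy_train_scaled = (toy_train - mean) / std
toy_val_scaled = (toy_val - mean) / std

# After scaling, every training column has mean 0 and standard deviation 1.
print(np.allclose(toy_train_scaled.mean(), 0.0))  # True
print(np.allclose(toy_train_scaled.std(), 1.0))   # True
# A 55-year-old is half a training standard deviation above the training mean.
print(toy_val_scaled.loc[0, "age"])  # 0.5
```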

One-Hot Encoding the Values

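def one_hot_encode(dataframe, columns):
    # drop_first=True drops one level per column so the dummy columns are not redundant
    # np.float was removed from recent NumPy releases; the built-in float works everywhere
    return pd.get_dummies(dataframe, columns = columns, drop_first=True, dtype = float)
to_encode = ["edema", "stage"]
one_hot_train = one_hot_encode(df_train, to_encode)
one_hot_test = one_hot_encode(df_test, to_encode)
one_hot_val = one_hot_encode(df_val, to_encode)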
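
One caveat with encoding each split separately: if a category happens to be missing from one split, `pd.get_dummies` produces different columns for it. A minimal sketch of one way around this, assuming the full list of levels is known in advance (`stage_levels` and the toy frames below are made up for illustration):

```python
import pandas as pd

def one_hot_encode(dataframe, columns):
    return pd.get_dummies(dataframe, columns=columns, drop_first=True, dtype=float)

# Made-up toy splits: the validation split is missing stages 1 and 3.
toy_train = pd.DataFrame({"stage": [1, 2, 3, 4]})
toy_val = pd.DataFrame({"stage": [2, 4]})

# Declaring the levels up front makes every split produce the same columns,
# even for levels that do not occur in that split.
stage_levels = [1, 2, 3, 4]  # hypothetical list of all possible levels
for frame in (toy_train, toy_val):
    frame["stage"] = pd.Categorical(frame["stage"], categories=stage_levels)

enc_train = one_hot_encode(toy_train, ["stage"])
enc_val = one_hot_encode(toy_val, ["stage"])

print(list(enc_train.columns))  # ['stage_2', 'stage_3', 'stage_4']
print(list(enc_val.columns))    # ['stage_2', 'stage_3', 'stage_4']
```

Without the categorical step, the validation frame here would only get a `stage_4` column, because stage 2 would be treated as the first level and dropped.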

Removing the NaN Values

one_hot_train.dropna(inplace=True)

Fitting the Model

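cph = CoxPHFitter()
# Recent lifelines releases take step_size through fit_options;
# older releases accepted step_size = 0.1 directly in fit()
cph.fit(one_hot_train, duration_col = 'time', event_col = 'status', fit_options = dict(step_size = 0.1))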

Analysing the Results

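# Print the fitted coefficients, hazard ratios, and confidence intervals
cph.print_summary()
# plot_covariate_groups was renamed plot_partial_effects_on_outcome in newer lifelines releases
cph.plot_partial_effects_on_outcome('trt', values = [0, 1])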
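
The imports at the top bring in lifelines’ `concordance_index` as `cindex`, a common way to score a survival model, although it is never called in this post. To show what that number measures, here is a naive pure-Python sketch of Harrell’s concordance index on made-up data. The function name `harrell_c_index` and the toy values are my own illustration, not the lifelines implementation, and pairs with tied times are simply ignored:

```python
def harrell_c_index(times, risk_scores, events):
    """Naive O(n^2) concordance index; pairs with tied times are ignored."""
    concordant = 0.0
    permissible = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable only if subject i had the event, and had it
            # before subject j's (event or censoring) time.
            if events[i] == 1 and times[i] < times[j]:
                permissible += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # higher risk failed first: concordant
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied risk scores count as half
    return concordant / permissible

# Made-up toy data: event = 1 means the event was observed, 0 means censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]

c_index = harrell_c_index(toy_times, toy_risk, toy_events)
print(c_index)  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking. When using the lifelines function on a fitted model, note that it expects scores that increase with survival time, so predicted hazards are typically negated before being passed in.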
