Getting Started

About Me

  • Third-year Ph.D. student in Statistics, UConn.
  • Research Interests:
    • Bayesian biostatistics
    • Modelling informative dropout process
    • Machine learning and deep learning
    • Microbiome data analysis
  • Future goals:
    • Contribute to the broader field of biostatistics and statistical learning.

Assumptions about the audience

  • Have a good understanding of time-to-event data.
  • Are familiar with well-known survival analysis methods.

Aims of the lecture

  • By the end of this lecture, the participants should have a basic understanding of
    • the use of machine learning and deep learning techniques
    • the possible impacts of these techniques on the field of survival analysis

Contents

  • Basics of machine learning and deep learning
  • Illustration of machine learning and deep learning using R
  • Applications of machine learning in survival analysis
  • Recent developments
  • Future directions

Machine Learning (ML)

  • Humans learn from past experiences, whereas machines follow instructions given by humans.
  • What if humans could train machines?

Traditional vs ML algorithm

Basic paradigm of ML algorithm

An example of ML: ‘Get the cake’

An example of ML: ‘Get the cake’ cont…

An example of ML: ‘Get the cake’ cont…

Popularly used ML techniques

  • Classification trees:
    • bagging (Breiman and others 1998; Breiman 1996),
    • random forest (RF) (Breiman 2001).
  • Support vector machine
  • Neural network: shallow or deep neural network (DNN)
  • Others

An example of Bagging and RF using R

  • R Packages required: randomForestSRC, ipred, MASS, survival
data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
names(breast)[1:10] # Displaying only ten variable names
##  [1] "status"             "mean_radius"        "mean_texture"      
##  [4] "mean_perimeter"     "mean_area"          "mean_smoothness"   
##  [7] "mean_compactness"   "mean_concavity"     "mean_concavepoints"
## [10] "mean_symmetry"
  • The breast dataset is from randomForestSRC R package.
  • For more details about the data: visit this link

An example of Bagging and RF using R cont…

  • The goal is to classify the status using decision trees
library(randomForestSRC)  # provides rfsrc()
library(ipred)            # provides bagging()
mod1 <- rfsrc(status ~ ., data = breast, nsplit = 10)
mod2 <- bagging(status ~ ., data = breast, coob=TRUE)
res <- as.data.frame(c(mean(mod1$err.rate[, 1], na.rm = TRUE), 
            mod2$err))
colnames(res) <- "Error Rate"
rownames(res) <- c("RSF", "Bagging")
  • The misclassification error for the two approaches
##         Error Rate
## RSF      0.2371134
## Bagging  0.2731959

Notes on Bagging and RF

  • Both are based on resampling (bootstrap) techniques, as sketched below.
  • Both are widely used for classification.
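  • A minimal sketch of the shared resampling idea, assuming the breast data from the previous slides is still loaded: each bootstrap sample draws the rows with replacement, and one classifier would be grown per sample.
B <- 5  # number of bootstrap resamples (illustrative)
boot_samples <- lapply(1:B, function(b) {
  idx <- sample(nrow(breast), replace = TRUE)  # rows drawn with replacement
  breast[idx, ]
})
sapply(boot_samples, nrow)  # each resample keeps the original sample size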

Basics of Deep Learning

  • Deep Learning is inspired by the human neural system.
    • It is a special kind of Machine Learning (Goodfellow, Bengio, and Courville 2016).
  • It is especially useful for generalizing complicated functions in high-dimensional spaces.
  • Synonymous terms:
    • deep neural network (DNN)
    • deep feed-forward networks
    • feed-forward neural networks
    • multi-layer perceptrons (MLPs).

Understanding neural system

DNN example 1

DNN example 1 cont…

DNN example 1 cont…

DNN example 2

Visualization of one-layer DNN

Visualization of two-layer DNN

Visualization of multi-layer DNN

An example of DNN

  • R Packages required: neuralnet, MASS.
  • Dataset: Boston from MASS package in R.
set.seed(500)
data(Boston, package = "MASS")
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
index <- sample(1:nrow(Boston), round(0.75*nrow(Boston)))
train <- Boston[index, ]
test <- Boston[-index, ]
lm.fit <- glm(medv~., data=train)
pr.lm <- predict(lm.fit, test)
MSE.lm <- sum((pr.lm - test$medv)^2)/nrow(test)

An example of DNN cont…

  • For more details about the data: see ?Boston
  • Fitting a neural network model to the Boston data
library(neuralnet)  # provides neuralnet() and compute()
maxs <- apply(Boston, 2, max)
mins <- apply(Boston, 2, min)
scaled <- as.data.frame(scale(Boston, center = mins, scale = maxs - mins))
train_ <- scaled[index, ]
test_ <- scaled[-index, ]
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
nn <- neuralnet(f, data=train_, hidden=c(5,3), linear.output=T)

An example of DNN cont…

An example of DNN cont…

pr.nn <- compute(nn,test_[,1:13])
pr.nn_ <- pr.nn$net.result*(max(Boston$medv)
          -min(Boston$medv))+min(Boston$medv)
test.r <- (test_$medv)*(max(Boston$medv)
              -min(Boston$medv))+min(Boston$medv)
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
print(paste(MSE.lm, MSE.nn))
## [1] "31.2630222372615 16.4595537665717"

An example of DNN cont…

Applications of ML

  • \(\color{blue}{\text{Data mining:}}\) web click data, medical records, Google search.
  • \(\color{blue}{\text{Signal recognition:}}\) autonomous helicopters, handwriting recognition, voice recognition, machine translation, anomaly detection.
  • \(\color{blue}{\text{Self-customizing programs:}}\) Amazon, Netflix, Bixby (Samsung), Siri (iPhone).
  • \(\color{blue}{\text{On survival analysis:}}\)
    • Predicting patients’ survival
    • Classifying competing risks for an event
    • Personalized treatment recommender system

Survival data analysis

Survival data

  • Notations: \[ \begin{align*} i &: \text{index for subject } (i=1, \ldots, n)\\ T_i^* &: \text{time for event for subject } i \\ C_i &: \text{censoring time for subject } i \\ T_i &= \min (T_i^*, C_i), \text{observed time for subject } i \\ \delta_i &= I(T_i^* \le C_i), \text{censoring indicator for subject } i\\ \mathbf{x}_i &: \text{vector of covariates for subject } i \end{align*} \]
  • Observed survival outcome: \(\{ (T_i, \delta_i), i=1, \ldots, n \}\).
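  • A small hypothetical simulation (not from the lecture data) that makes the notation concrete:
set.seed(1)
n <- 8
t_star <- rexp(n, rate = 0.10)        # event times T_i^*
cens   <- rexp(n, rate = 0.05)        # censoring times C_i
t_obs  <- pmin(t_star, cens)          # observed times T_i = min(T_i^*, C_i)
delta  <- as.numeric(t_star <= cens)  # indicators delta_i = I(T_i^* <= C_i)
data.frame(T = round(t_obs, 2), delta = delta)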

An illustration of time-to-event data

Some interesting questions

  • Might analyzing patients 1, 4, 8, and 9 give us insight into which features contribute to longer survival?
  • For patients who have survived to 12 months, what is the probability that the event occurs after time \(t\)?
  • A doctor might want to know the chance of re-hospitalization after a patient is discharged.
  • Can we predict the sub-types of the event (when the sub-types are missing) by learning from the training data?

Some real-life examples of survival data

Applications in healthcare

Applications in healthcare cont…

  • Event of interest: Rehospitalization; Disease recurrence; Cancer survival.
  • Outcome: Likelihood of hospitalization within \(t\) days of discharge.
  • Figure source: Wang, Li, and Reddy (2019)

Applications in education

Applications in education cont…

  • Event of interest: Student dropout.
  • Outcome: Likelihood of a student dropping out within \(t\) days.
  • Figure source: Wang, Li, and Reddy (2019)

Applications in Crowdfunding

Applications in Crowdfunding cont…

  • Event of interest: Project success
  • Outcome: Likelihood of a project being successful within \(t\) days.
  • Figure source: Wang, Li, and Reddy (2019)

Traditional approaches to analyse survival data

  • Non-parametric: Kaplan-Meier, Nelson-Aalen, Life table
  • Semi-parametric: Cox proportional hazards (PH) model
  • Parametric: Accelerated failure time model
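  • For instance, a minimal Kaplan-Meier sketch on the veteran data used later in these slides (assumes the survival and randomForestSRC packages are installed):
library(survival)
data(veteran, package = "randomForestSRC")
km <- survfit(Surv(time, status) ~ 1, data = veteran)  # Kaplan-Meier estimate
summary(km, times = c(30, 90, 180))  # estimated survival at selected days
# plot(km) draws the estimated survival curve with confidence bands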

Cox PH model

  • The Cox PH model (Cox 1972) is \[ \begin{equation} h_i(t | \mathbf{x}_i) = h_0(t) \exp{(\mathbf{x}'_{i} \boldsymbol{\beta})} \end{equation} \] where \(\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})\) denotes the vector of covariates and \(\boldsymbol{\beta}\) denotes the corresponding regression coefficients.
  • The Cox PH partial likelihood function is given by \[ \begin{equation*} pl(\boldsymbol{\beta}) = \prod_{i=1}^{D} \Big[ \dfrac{\exp(\mathbf{x}_i^T \boldsymbol{\beta})}{\sum_{j \in \mathcal{R}(t_i)} \exp(\mathbf{x}_j^T \boldsymbol{\beta})} \Big] \end{equation*} \] where \(t_1, t_2, \ldots, t_D\) denote the ordered event times, \(\mathbf{x}_i\) here denotes the covariates of the subject with an event at \(t_i\), and \(\mathcal{R}(t_i)\) denotes the risk set at time \(t_i\).

Cox PH model cont…

  • The corresponding log partial likelihood function is \[ \begin{equation} \ell(\boldsymbol{\beta}) = \sum_{i=1}^{D} \Big( \mathbf{x}_i^T \boldsymbol{\beta} - \log \sum_{j \in \mathcal{R}(t_i)} \exp(\mathbf{x}_j^T \boldsymbol{\beta}) \Big) \end{equation} \]
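  • As a sanity check, the log partial likelihood can be coded directly (a sketch assuming no tied event times; coxph applies the Efron tie correction by default, so the values may differ slightly):
log_pl <- function(beta, time, status, x) {
  eta <- as.vector(x %*% beta)
  sum(sapply(which(status == 1), function(i) {
    eta[i] - log(sum(exp(eta[time >= time[i]])))  # sum over the risk set R(t_i)
  }))
}
fit <- coxph(Surv(time, status) ~ karno + age, data = veteran)
log_pl(coef(fit), veteran$time, veteran$status,
       as.matrix(veteran[, c("karno", "age")]))
fit$loglik[2]  # maximized log partial likelihood reported by coxph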

Estimation

  • Maximizing the log partial likelihood function gives the estimates of the model parameters.
  • The observed information matrix is obtained by evaluating the negative second derivative with respect to the model parameters.
  • Standard errors of the parameter estimates are the square roots of the diagonal of the inverse of the information matrix, as illustrated below.
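  • Continuing the coxph fit from the previous sketch, the reported standard errors come from the inverse information matrix:
sqrt(diag(vcov(fit)))                     # from the inverse information matrix
summary(fit)$coefficients[, "se(coef)"]   # matches coxph's reported SEs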

Why ML algorithms?

  • Increasing data size
  • Increasing model size
  • Increasing
    • accuracy,
    • complexity, and
    • real-world impact
  • The need for supervised learning

ML for Survival Analysis

  • Survival tree: similar to a decision tree, built by recursive splitting of tree nodes.
    • Bagging Survival Trees
    • Random Survival Forest (RSF)
  • Let us demonstrate one example in the following slides.

An example of RSF and Cox PH using R

  • R Packages required: randomForestSRC, ipred, MASS, survival
data(veteran, package = "randomForestSRC")
names(veteran) # Displaying variable names
## [1] "trt"      "celltype" "time"     "status"   "karno"    "diagtime"
## [7] "age"      "prior"
  • The veteran dataset is from randomForestSRC R package.
  • For more details about the data: visit this link

An example of RSF and Cox PH using R cont…

mod3 <- coxph(Surv(time, status)~., data=veteran,x=TRUE,y=TRUE)
mod4 <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)
cindex <- as.data.frame(c(concordance(mod3)$concordance, 
            1-mean(mod4$err.rate, na.rm=T)))
colnames(cindex) <- "C-index"
rownames(cindex) <- c("Cox PH", "RSF")
  • The C-index for the two models
##          C-index
## Cox PH 0.7053612
## RSF    0.7094312

Notes on Cox PH and RSF

  • The RSF:
    • is advantageous for classification, e.g., classifying a new patient whose ‘status’ is unknown.
    • However, RSF cannot predict time-to-event or perform regression analysis.
  • The Cox PH model:
    • is preferred for predicting time-to-event and performing regression analysis.
    • However, the proportionality assumption and the linearity of the log-risk function might not be appropriate for complex data structures.

ML for Survival Analysis cont…

  • DNN: uses deep hidden layers to extract the output from the features.
    • Bayesian DNN (Polson, Sokolov, and others 2017; Ranganath et al. 2016)
  • \(\color{red}{\text{Note:}}\)
    • could be useful for both classification and regression
    • robust to the violation of the proportionality assumption

Previous works

  • Feed-forward non-linear proportional hazards model (Faraggi and Simon 1995).
    • Extended the log-risk from a linear combination of the covariates to a non-linear relationship.
    • In particular, used the logistic function with some hyper-parameters.
  • Bayesian version of the feed-forward non-linear proportional hazards model (Faraggi et al. 1997).
    • This model placed a normal prior on the parameters and derived their posterior distribution.

Cox Non-proportional Neural Network Model

  • The \(\exp(\mathbf{x}'_{i} \boldsymbol{\beta})\) term in the Cox PH model is replaced by a more general function \(\color{red}{g_{\boldsymbol{\theta}}(\mathbf{x}_i)}\) to accommodate a non-linear relationship \[ \begin{equation} h_i(t | \mathbf{x}_i) = h_0(t) \exp{(\color{red}{g_{\boldsymbol{\theta}}(\mathbf{x}_i)})} \end{equation} \]
  • The partial likelihood function \[ \begin{equation*} pl(\boldsymbol{\theta}) = \prod_{i=1}^{D} \Big[ \dfrac{\exp(\color{red}{g_{\boldsymbol{\theta}}(\mathbf{x}_i)})}{\sum_{j \in \mathcal{R}(t_i)} \exp(\color{red}{g_{\boldsymbol{\theta}}(\mathbf{x}_j)})} \Big] \end{equation*} \]

Cox Non-proportional Neural Network Model cont…

  • The following loss function is optimized \[ \begin{equation} -\dfrac{1}{N_{\delta=1}} \sum_{i=1}^{D} \Big( g_{\boldsymbol{\theta}}(\mathbf{x}_i) - \log \sum_{j \in \mathcal{R}(t_i)} \exp(g_{\boldsymbol{\theta}}(\mathbf{x}_j)) \Big) + \lambda ||\boldsymbol{\theta}||^2_2 \label{loss} \end{equation} \] where \(N_{\delta=1}\) is the number of patients with an observed event and \(\lambda\) is the \(\ell_2\) regularization parameter.
  • Gradient descent optimization is used to minimize this loss; a small sketch follows.
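  • A minimal R sketch of this loss, assuming the network outputs \(g_{\boldsymbol{\theta}}(\mathbf{x}_i)\) are supplied as a vector g and the weights as theta (placeholders for illustration; a real implementation would backpropagate through the network):
# Sketch of the regularized negative log partial likelihood above
# (no tied event times assumed); g = network outputs, theta = weights
cox_nn_loss <- function(g, time, status, theta, lambda) {
  events <- which(status == 1)
  nll <- -sum(sapply(events, function(i) {
    g[i] - log(sum(exp(g[time >= time[i]])))  # sum over the risk set R(t_i)
  })) / length(events)
  nll + lambda * sum(theta^2)  # l2 penalty on the network weights
}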

Details on layer mechanism

Details on layer mechanism cont…

Computing Gradient: Backpropagation

Computing Gradient: Backpropagation cont…

  • How does a small change in one weight (e.g. \(w^{(2)}_1\)) affect the final loss \(J(\mathbf{w})\)? \[ \begin{equation*} \dfrac{\partial J}{\partial w^{(2)}_1} = \dfrac{\partial J}{\partial \hat{y}} \dfrac{\partial \hat{y}}{\partial w^{(2)}_1} \end{equation*} \]
  • How does a small change in one weight (e.g. \(w^{(1)}_1\)) affect the final loss \(J(\mathbf{w})\)? \[ \begin{equation*} \dfrac{\partial J}{\partial w^{(1)}_1} = \dfrac{\partial J}{\partial \hat{y}} \dfrac{\partial \hat{y}}{\partial z} \dfrac{\partial z}{\partial w^{(1)}_1} \end{equation*} \]
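  • A numeric sketch of these two chain-rule products for a toy network \(z = w^{(1)}_1 x\), \(\hat{y} = w^{(2)}_1 z\) with squared-error loss \(J = (\hat{y} - y)^2/2\) (linear activations assumed for simplicity):
x <- 2; y <- 1; w1 <- 0.5; w2 <- -0.3
z    <- w1 * x        # hidden unit
yhat <- w2 * z        # network output
dJ_dyhat <- yhat - y  # dJ/dyhat for J = (yhat - y)^2 / 2
c(dJ_dw2 = dJ_dyhat * z,       # dJ/dyhat * dyhat/dw2
  dJ_dw1 = dJ_dyhat * w2 * x)  # dJ/dyhat * dyhat/dz * dz/dw1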

Gradient descent optimization

  • Initialize weights randomly \(\sim N(0, \sigma^2)\)
  • Loop until convergence
    • Compute gradient, \(\dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\)
    • Update weights, \(\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta \dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\), where \(\eta\) is called the learning rate.
  • Return weights.
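  • A minimal sketch of this loop for the toy loss \(J(w) = (w - 3)^2\) (illustrative only; a DNN would use backpropagated gradients):
set.seed(1)
w   <- rnorm(1, mean = 0, sd = 1)  # initialize weight randomly ~ N(0, 1)
eta <- 0.1                         # learning rate
for (t in 1:100) {
  grad <- 2 * (w - 3)              # gradient of J(w) = (w - 3)^2
  w <- w - eta * grad              # weight update
}
w  # converges to the minimizer w = 3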

Complex Loss Function

Neural network for survival

Deep neural network for survival

Performance of DeepSurv

  • Evaluation Metric: Concordance (C) Index
  • It is a rank-order statistic for predictions against true outcomes.
  • The index is calculated as the ratio of concordant pairs to the total number of comparable pairs.
  • Given a comparable instance pair \((i, j)\), where \(t_i\) and \(t_j\) are the actual observed times and \(S(t_i)\) and \(S(t_j)\) are the predicted survival probabilities,
    • the pair \((i, j)\) is concordant if \(t_i > t_j\) and \(S(t_i) > S(t_j)\);
    • the pair \((i, j)\) is discordant if \(t_i > t_j\) and \(S(t_i) < S(t_j)\).
  • The concordance probability \(=Pr(\hat{T}_i < \hat{T}_j | T_i < T_j)\) measures the concordance between the rankings of actual values and predicted values.
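  • A brute-force sketch of the C-index, using risk scores from the Cox fit mod3 above (pairs with tied times or tied risk scores are skipped here, so the value can differ slightly from survival::concordance()):
cindex_pairs <- function(time, status, risk) {
  conc <- comp <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (time[i] == time[j] || risk[i] == risk[j]) next
    ii <- if (time[i] < time[j]) i else j  # subject with the earlier time
    jj <- if (time[i] < time[j]) j else i
    if (status[ii] == 1) {  # comparable only if the earlier time is an event
      comp <- comp + 1
      if (risk[ii] > risk[jj]) conc <- conc + 1  # higher risk, shorter survival
    }
  }
  conc / comp
}
cindex_pairs(veteran$time, veteran$status, predict(mod3, type = "risk"))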

Performance of DeepSurv cont…

More recent developments

  • SurvELM: An R package for high dimensional survival analysis with extreme learning machine (Wang and Zhou 2018).
    • comes with an interactive shiny app: link.
  • DeepHit: A deep learning approach to survival analysis with competing risks (Lee et al. 2018).
    • can be applied in presence of competing risks data
  • Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data (Ching, Zhu, and Garmire 2018).
  • RNN-SURV: A deep recurrent model for survival analysis (Giunchiglia, Nemchenko, and Schaar 2018)

Future direction

  • Testing the effect of a covariate on the response.
  • The backpropagation method is computationally challenging.
  • Non-convexity of the loss surface can cause convergence problems.
  • How to choose the learning rate: \(\color{blue}{\text{fixed}}\) or \(\color{blue}{\text{adaptive}}\)?
  • Which optimization method should be used: \(\color{blue}{\text{gradient descent}}\), \(\color{blue}{\text{stochastic gradient descent}}\), or others?
  • How to handle overfitting: \(\color{blue}{\text{dropout method}}\), \(\color{blue}{\text{early stopping}}\)?
  • The design of hidden units.
  • The design of the architecture: how many units it should have and how they should be connected.
  • The distribution of hyperparameters.
  • Most of the newly developed techniques are not consistent!

Acknowledgement

  • Professor Ming-Hui Chen, Statistics, UConn.
  • STAT 5645 (Fall 2019) class audience.

Some additional resources

Thanks

References

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40.

———. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.

Breiman, Leo, and others. 1998. “Arcing Classifier (with Discussion and a Rejoinder by the Author).” The Annals of Statistics 26 (3): 801–49.

Ching, Travers, Xun Zhu, and Lana X Garmire. 2018. “Cox-Nnet: An Artificial Neural Network Method for Prognosis Prediction of High-Throughput Omics Data.” PLoS Computational Biology 14 (4): e1006076.

Cox, David R. 1972. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society: Series B (Methodological) 34 (2): 187–202.

Faraggi, David, and Richard Simon. 1995. “A Neural Network Model for Survival Data.” Statistics in Medicine 14 (1): 73–82.

Faraggi, David, R Simon, E Yaskil, and A Kramar. 1997. “Bayesian Neural Network Models for Censored Data.” Biometrical Journal 39 (5): 519–32.

Giunchiglia, Eleonora, Anton Nemchenko, and Mihaela van der Schaar. 2018. “RNN-Surv: A Deep Recurrent Model for Survival Analysis.” In International Conference on Artificial Neural Networks, 23–32. Springer.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Lee, Changhee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. 2018. “DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks.” In Thirty-Second AAAI Conference on Artificial Intelligence.

Polson, Nicholas G, Vadim Sokolov, and others. 2017. “Deep Learning: A Bayesian Perspective.” Bayesian Analysis 12 (4): 1275–1304.

Ranganath, Rajesh, Adler Perotte, Noémie Elhadad, and David Blei. 2016. “Deep Survival Analysis.” arXiv Preprint arXiv:1608.02158.

Wang, Hong, and Lifeng Zhou. 2018. “SurvELM: An R Package for High Dimensional Survival Analysis with Extreme Learning Machine.” Knowledge-Based Systems 160: 28–33.

Wang, Ping, Yan Li, and Chandan K Reddy. 2019. “Machine Learning for Survival Analysis: A Survey.” ACM Computing Surveys (CSUR) 51 (6): 110.