R学习 / 生存分析

2023-11-26 22:39:46

文章目录

1. 概述
- 1.1 基本概念
- 1.2 主要研究内容
2. Cox模型 | R
3. 参考资料

1. 概述

1.1 基本概念

生存时间（survival time）：一般指从起始事件到终止事件所经历的时间，例如患某疾病的患者从发病到死亡的时间。
失效事件（failure event）与起始事件：失效事件一般指死亡事件或终点事件。起始事件是反映生存时间起始特征的事件，如疾病的确诊、治疗开始等。两者均需在设计时明确规定。
删失数据（censored data）：指在随访过程中，由于某种原因未能观察到患者的明确结局（终点事件），或称截尾。可能是失访、退出或终止等，其生存时间一般以“+”表示。
生存时间资料的分布特征：一般通过随访获得，因观察时间长且难以控制混杂因素，再加上存在截尾数据，规律难以估计，一般为正偏态分布。

1.2 主要研究内容

描述生存过程：估计生存率及平均存活时间，绘制生存曲线描述生存时间的分布特点。常用方法为Kaplan-Meier法（乘积极限法）、寿命法等非参数法（无需假定生存时间的分布类型）。
比较生存过程：比较各样本生存率及其标准误，探讨各总体的生存过程是否有差别。常用方法：log-rank检验，比较两组或多组的生存曲线。
影响生存时间的因素分析：通常以生存时间和结局为应变量，影响因素作为自变量，拟合生存分析模型，探讨影响生存时间的因素。常用方法：cox模型（半参数法），对数logistic回归分析、Weibull分布法等参数法（需假定生存时间的的参数分布类型）。
竞争风险模型（有空再补充）

2. Cox模型 | R

2.1 R包

生存分析的R包一般用survival包和survminer包。survival包用于分析，survminer包用于绘图。

survival包核心函数：

Surv()：创建生存对象
survfit()：使用公式或以构建的cox模型拟合生存曲线
coxph()：拟合cox比例风险回归模型
survdiff()：用log-rank或mantel-Haenszel test 检验生存差异
cox.zph()：检验cox模型的比例风险假设

data$surv <- Surv(time, status)
Surv(time, time2, event, type=c('right', 'left', 'interval', 'counting', 'interval2', 'mstate'),origin=0)
# time:对于右删失，即指随访时间。若为区间数据，则为区间数据的开始时间
# time2: 区间数据，则为区间数据的结束时间
# event: 结局变量。默认0为删失，1为出现终点事件。可以自己指定终点事件。data$status == 2表示认为值为2时终点事件发生
# type:制定删失类型。但是type在不特别指定的情况下，一般会自动根据数据进行默认选择
lung$surv <- Surv(time = lung$time, lung$status == 2)

2.2 实例分析

绘制生存曲线，比较生存率

ggsurvplot()：绘制生存曲线

library("survival") # cox
library("survminer") # data visualization
head(lung)
# 拟合Kaplan-Meier曲线
sfit <- survfit(formula = Surv(time, status)~1,data = lung) #总体
summary(sfit)
sfit <- survfit(formula = Surv(time, status)~sex, data = lung) # 分组
summary(sfit)
summary(sfit, times=seq(0, 1000, 100)) # 设定时间参数来选定时间区间
ggsurvplot(sfit, data = lung)

res.sum <- surv_summary(sfit)
head(res.sum[res.sum$sex==1,])
head(res.sum[res.sum$sex==2,])
ggsurvplot(sfit,pval = TRUE, conf.int = TRUE,
           risk.table = TRUE) #可视化K-M生存曲线

surv_diff <- survdiff(Surv(time, status) ~ sex, data = lung) # log-rank test
surv_diff
p.val = 1 - pchisq(surv_diff$chisq, length(surv_diff$n) - 1)
p.val

lung$age_cut <- cut(lung$age, breaks = c(0, 70, Inf), labels = c("young", "old"))
fit <- survfit(Surv(time, status)~age_cut, data = lung)
## 直接用survfit()的话，它会把所有的年龄当作一个factor
summary(fit)
ggsurvplot(fit, data = lung)

在新的survminer 0.2.4版本中，新增了可以一次确定一个或多个连续变量最佳分割点的函数surv_cutpoint()与surv_categorize()

Cox回归分析

cox回归模型：coxph(Surv(time, status) ~ x1 + x2 + ..., data = )
单因素分析：可以用lapply(), sapply(), function()等批量分析和展示结果
多因素分析：ggforest()以森林图展示各因素的HR

fit <- coxph(Surv(time, status)~sex, data=lung) #univariable 
summary(fit)
cox.zph(fit)
plot(cox.zph(fit))
#可视化每一个变量在cox模型下的表现。当数据出现明显分离时，则说明表现不是很理想

# univariable 
covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,function(x) as.formula(paste('Surv(time, status)~', x)))
univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)})
hr_results <- function(x){
		x <- summary(x)
        p.value<-signif(x$wald["pvalue"], digits=2)
        wald.test<-signif(x$wald["test"], digits=2)
        beta<-signif(x$coef[1], digits=2) # coeficient beta
        HR <-round(x$coef[2], 2) # exp(beta)
        HR.confint.lower <- round(x$conf.int[,"lower .95"], 2)
        HR.confint.upper <- round(x$conf.int[,"upper .95"],2)
        HR.with.CI <- paste0(HR, " (", HR.confint.lower, "-", HR.confint.upper, ")")
        res<-c(beta, HR.with.CI, wald.test, p.value)
        #names(res)<-c("beta", "HR (95% CI for HR)", "wald.test", "p.value")
        return(res)}
univ_results <- lapply(univ_models,hr_results)
res <- t(as.data.frame(univ_results, check.names = FALSE)) # 转置
res1 <- as.data.frame(res)

# multi-vars
res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
summary(res.cox)
ggforest(res.cox, data = lung, 
         main = "Hazard ratio",
         cpositions = c(0.10, 0.22, 0.4),
         fontsize = 1.0)

Cox回归的重要假设：变量对于hazard rate的影响不随时间的变化而变化（等比例），可通过cox.zph(fit)检验

多重插补法

sum(!complete.cases(lung))

md.pattern(lung, 5)
lung_cmplt <- mice(lung, 5) # 5种可选插补
lung_cmplt_3 <- complete(lung_cmplt, 3) # 选取其中一个完整的子集进行后续分析？
lung_cmplt_3$surv <- Surv(lung_cmplt_3$time, lung_cmplt_3$status == 2)
fit <- survfit(surv~age, data = lung_cmplt)
ggsurvplot(fit, pval = TRUE)

使用多重插补法与直接删除缺失变量后进行cox回归，作敏感性分析。

2.3 生存曲线的进阶版

ggsurvplot() is a generic function to plot survival curves.
ggsurvplot_list() 绘制多个对象
ggsurvplot_facet() 分面到多个panels
ggsurvplot_group_by() 一幅图中多个分组
ggsurvplot_add_all() 总合所有的情况
ggsurvplot_combine() 一个图中结合多个survfit对象

p1 <- ggsurvplot(fit,
           pval = TRUE, conf.int = TRUE,
           risk.table = TRUE, # Add risk table
           risk.table.col = "strata", # Change risk table color by groups
           linetype = "strata", # Change line type by groups
           surv.median.line = "hv", # Specify median survival
           ggtheme = theme_bw(), # Change ggplot2 theme
           palette = c("#E7B800", "#2E9FDF"))

p2 <- ggsurvplot(fit, 
                 pval = TRUE,               # show p-value of log-rank test.
                 conf.int = TRUE,           # show confidence intervals for point estimaes of survival curves
                 conf.int.style = "step",   # customize style of confidence intervals
                 xlab = "Time in days",     # customize X axis label.
                 break.time.by = 200,       # break X axis in time intervals by 200.
                 ggtheme = theme_light(),   # customize plot and risk table with a theme.
                 risk.table = "abs_pct",    # absolute number and percentage at risk.
                 risk.table.y.text.col = T, # colour risk table text annotations.
                 risk.table.y.text = FALSE, # show bars instead of names in text annotations in legend of risk table
                 ncensor.plot = TRUE,       # plot the number of censored subjects at time t
                 surv.median.line = "hv",   # add the median survival pointer.
                 legend.labs = c("Male", "Female"),    # change legend labels.
                 palette = c("#E7B800", "#2E9FDF")     # custom color palettes.
)
arrange_ggsurvplots(list(p1,p2))

参数fun="event"，表示cumulative event 累计事件发生率
参数fun="cumhaz"，表示cummulative hazard累计风险水平（在时刻t，事件发生的可能性）

p3 <- ggsurvplot(fit,
           conf.int = TRUE,
           risk.table.col = "strata", # Change risk table color by groups
           ggtheme = theme_bw(), # Change ggplot2 theme
           palette = c("#E7B800", "#2E9FDF"),
           fun = "event")
p4 <- ggsurvplot(fit,
              conf.int = TRUE,
              risk.table.col = "strata", # Change risk table color by groups
              ggtheme = theme_bw(), # Change ggplot2 theme
              palette = c("#E7B800", "#2E9FDF"),
              fun = "cumhaz") 
arrange_ggsurvplots(list(p3,p4))

ggsurvplot()和ggforest()的使用还需实战提高，可以使用R出矢量图后在AI中调试各种图例参数和图形的比例等。

码农公寓

文章目录

1. 概述

1.1 基本概念

1.2 主要研究内容

2. Cox模型 | R

2.1 R包

2.2 实例分析

2.3 生存曲线的进阶版

3. 参考资料

相关文章