本文主要学习和介绍:

  1. 比较两组或者多组数据的平均值

  2. 然后自动添加P值或者显著性水平到ggplot的各种图形中(such as box plots, dot plots, bar plots and line plots …)

一、比较均值的方法

The standard methods to compare the means of two or more groups in R, have been largely described at: comparing means in R.

常用的几种比较方法如下:

Methods R function Description
T-test t.test() Compare two groups (parametric)
Wilcoxon test wilcox.test() Compare two groups (non-parametric)
ANOVA aov() or anova() Compare multiple groups (parametric)
Kruskal-Wallis kruskal.test() Compare multiple groups (non-parametric)

如何使用这些方法进行计算和对结果进行解释,可以参考以下链接:

二、R中添加P值的函数介绍

这里主要使用的是ggpubr包的的两个函数:

  • compare_means(): 比较一组或多组均值,包括P值和显著性水平
  • stat_compare_means(): 自动添加P值或显著性水平到ggplot的图形中

1. compare_means函数

compare_means函数用于比较输出一组或者多组的均值,使用不同的方法

compare_means(formula, data, method = "wilcox.test", paired = FALSE,
group.by = NULL, ref.group = NULL, ...)
  • formula: 画图的公式y ~ x。
    • a formula of the form x ~ group, where x is a numeric variable and group is a factor with one or multiple levels. For example, formula = TP53 ~ cancer_group. It’s also possible to perform the test for multiple response variables at the same time. For example, formula = c(TP53, PTEN) ~ cancer_group.
  • data: 使用的数据。
    • a data.frame containing the variables in the formula.
  • method: 使用的检验P值的方法。the type of test. Default is “wilcox.test”. Allowed values include:
    • “t.test” (parametric) and “wilcox.test”" (non-parametric). Perform comparison between two groups of samples. If the grouping variable contains more than two levels, then a pairwise comparison is performed.
    • “anova” (parametric) and “kruskal.test” (non-parametric). Perform one-way ANOVA test comparing multiple groups.
  • paired: 是否做配对检验。
    • a logical indicating whether you want a paired test. Used only in t.test and in wilcox.test.
  • group.by: 指定分组做P值检验。variables used to group the data set before applying the test. When specified the mean comparisons will be performed in each subset of the data formed by the different levels of the group.by variables.
  • ref.group: 比较均值时指定参考的组,就像比较case-control的数据时,指定control组为对照一样,指定一个组做为参照。
    • a character string specifying the reference group. If specified, for a given grouping variable, each of the group levels will be compared to the reference group (i.e. control group). ref.group can be also “.all.”. In this case, each of the grouping variable levels is compared to all (i.e. base-mean). 这里的base-mean指的就是所有组的平均值

2.stat_compare_means函数

功能与上一个函数相似,只是这个函数用于在ggplot图形中添加平均值比较的P值或者显著性水平

stat_compare_means(mapping = NULL, comparisons = NULL hide.ns = FALSE,
label = NULL, label.x = NULL, label.y = NULL, ...)
  • mapping: 和ggplot函数一样指定aes()映射。
    • Set of aesthetic mappings created by aes().
  • comparisons: 由两个向量组成的列表,这两个向量就是要对比组的名字或者index值。
    • A list of length-2 vectors. The entries in the vector are either the names of 2 values on the x-axis or the 2 integers that correspond to the index of the groups of interest, to be compared.
  • hide.ns: 逻辑值,默认为FALSE。是否显示不显著时的符号ns。
    • logical value. If TRUE, hide ns symbol when displaying significance levels.
  • label: 指定标签的类型,可以是P值(“p.format”)或者是显著性水平(“p.signif”)(用*号和ns表示)。
    • character string specifying label type. Allowed values include “p.signif” (shows the significance levels), “p.format” (shows the formatted p value).
  • label.x,label.y: 指定label的坐标位置。
    • numeric values. coordinates (in data units) to be used for absolute positioning of the label. If too short they will be recycled.
  • : 其他的参数。
    • other arguments passed to the function compare_means() such as method, paired, ref.group.

使用boxplot演示添加P值和显著性水平

1. 准备

# 加载包
# library(tidyverse) #非必须
library(ggpubr)

# 示例数据:ToothGrowth
data("ToothGrowth")
head(ToothGrowth)

echo
head(ToothGrowth) %>% knitr::kable(format = "markdown")
len supp dose
4.2 VC 0.5
11.5 VC 0.5
7.3 VC 0.5
5.8 VC 0.5
6.4 VC 0.5
10.0 VC 0.5

2. 独立双样本比较

compare_means(len ~ supp, data = ToothGrowth) %>% knitr::kable()
.y. group1 group2 p p.adj p.format p.signif method
len OJ VC 0.0644907 0.064 0.064 ns Wilcoxon

Returned value is a data frame with the following columns:

.y.: the y variable used in the test.
p: the p-value
p.adj: the adjusted p-value. Default value for p.adjust.method = “holm”
p.format: the formatted p-value
p.signif: the significance level.
method: the statistical test used to compare groups.

Create a box plot with p-values:

p <- ggboxplot(data = ToothGrowth, 
x = "supp", y = "len",
color = "supp", palette = "jco",
add = "jitter")
# add = "jitter"参数添加箱线图的数值点
# 添加p值
p + stat_compare_means()
# 再画一幅,这次改变P值的统计方法
p + stat_compare_means(method = "t.test")

显示P值的标签位置可以通过如下参数来调整:label.x, label.y。

显示P值的标签默认为 compare_means() 返回值中的 method 和 p 的组合。也可以通过 aes() 函数指定为其他显示形式。例如:

For example,

aes(label = …p.format…) or aes(label = paste0(“p =”, …p.format…)): display only the formatted p-value (without the method name)
aes(label = …p.signif…): display only the significance level.
aes(label = paste0(…method…, “\n”, “p =”, …p.format…)): Use line break (“\n”) between the method name and the p-value.

如:

p + stat_compare_means(aes(label = paste0(..method..,": ","p=",..p.format..)),
label.x = 1.5, label.y = 40, method = "t.test")

# 单独显示字符型向量的显著水平
p + stat_compare_means(label = "p.signif", label.x = 1.5, label.y = 40)

## 单独显示P值
p + stat_compare_means(label = "p.format",
label.x = 1.5, label.y = 40, method = "t.test")

# 使用aes映射也可以达到同样的效果,此时只有P值,需要连接P=
p + stat_compare_means(aes(label = paste("p = ",..p.format..)),
label.x = 1.5, label.y = 40, method = "t.test")

3. 比较两个配对样本

compare_means(len ~ supp, data = ToothGrowth, paired = TRUE)
.y. group1 group2 p p.adj p.format p.signif method
len OJ VC 0.0043126 0.0043 0.0043 ** Wilcoxon
# 使用ggpaired函数可视化配对数据
ggpaired(data = ToothGrowth,
x = "supp", y = "len",
color = "supp", line.color = "gray",
line.size = 0.4, palette = "jco") +
stat_compare_means(paired = T)

  • palette: 选择不同的调色板

4.比较多组样本

1. 全局比较

# 使用anova的方法, 全局比较
compare_means(len ~ dose, data = ToothGrowth, method = "anova")
.y. p p.adj p.format p.signif method
len 0 0 9.5e-16 **** Anova
# Default method = "kruskal.test" for multiple groups
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means()
# Change method to anova
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(method = "anova")

2. 成对比较

如果分组超过两个,默认将所有组两两比较, 默认方法为wilcox.test, 可以改为t.test

compare_means(len ~ dose,  data = ToothGrowth)
.y. group1 group2 p p.adj p.format p.signif method
len 0.5 1 0.0000070 1.4e-05 7.0e-06 **** Wilcoxon
len 0.5 2 0.0000001 2.0e-07 8.4e-08 **** Wilcoxon
len 1 2 0.0001772 1.8e-04 0.00018 *** Wilcoxon
# 可视化所指定的比较
# 指定比较的组,以下3组都指定
my_comparisons <- list( c("0.5", "1"), c("1", "2"), c("0.5", "2") )

ggboxplot(data = ToothGrowth,
x = "dose", y = "len",
color = "dose", palette = "jco") +
stat_compare_means(comparisons = my_comparisons) + # 指定成对比较, 并指定P值位置
stat_compare_means(label.y = 50) # 指定P值位置的这句代码不能和上面的同一个函数的代码合并,合并之后P值比价会显示在一条线段上

# 将分组比较的P值label使用label.y指定位置
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(comparisons = my_comparisons, label.y = c(29, 35, 40))
# 去掉全局比较的数值

3. 指定一个参考组的两两比较

# Pairwise comparison against reference
compare_means(len ~ dose, data = ToothGrowth,
# 指定dose中0.5的组作为参考组
ref.group = "0.5",
method = "t.test") %>%
knitr::kable(format = "markdown")
.y. group1 group2 p p.adj p.format p.signif method
len 0.5 1 1e-07 1e-07 1.3e-07 **** T-test
len 0.5 2 0e+00 0e+00 4.4e-14 **** T-test
# Visualize
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(method = "anova", label.y = 40)+ # 全局比较,使用anova方法
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = "0.5") # 指定配对比价的refer group,比较对象为0.5的组

4. 相对于所有组的均值(base-mean)的多重两两比较的检验

# Comparison of each group against base-mean
compare_means(len ~ dose, data = ToothGrowth, ref.group = ".all.",
method = "t.test")
.y. group1 group2 p p.adj p.format p.signif method
len .all. 0.5 0.0000003 9.0e-07 2.9e-07 **** T-test
len .all. 1 0.5118773 5.1e-01 0.51 ns T-test
len .all. 2 0.0000004 9.0e-07 4.3e-07 **** T-test
# .all.参数比较其中一组与所有的组的均值(base-mean),在组比较多时,比较方便
ggboxplot(data = ToothGrowth,
x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(method = "kruskal.test", label.y = 40)+
stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

下面使用骨髓瘤的数据来说明和所有样本的均值(.all.)比较在一些典型的情况中是很有用的。

将依据患者molecular进行分组,绘制各个组别DEPDC1基因的表达水平。目的是比较各个组别之间是否存在差别,如果有差别,差别又在哪里。

要回答以上问题,可以在所有7个组之间进行两两比较(pairwise comparison)。 由于组别较多,将导致很多种组别组合,这将很难解释。

一个简单的解决办法是将7组中的每一组与“.all.”(如base-mean)比较。当检验结果显著时,可以得出结论:与所有组相比,xxx组的DEPDC1的表达水平显著降低或显著升高。

# Load myeloma data from GitHub
myeloma <- read.delim("https://raw.githubusercontent.com/kassambara/data/master/myeloma.txt")

# Perform the test
compare_means(DEPDC1 ~ molecular_group, data = myeloma,
ref.group = ".all.", method = "t.test")
.y. group1 group2 p p.adj p.format p.signif method
DEPDC1 .all. Cyclin D-1 0.2877529 1.0e+00 0.29 ns T-test
DEPDC1 .all. Cyclin D-2 0.4244240 1.0e+00 0.42 ns T-test
DEPDC1 .all. MMSET 0.5784193 1.0e+00 0.58 ns T-test
DEPDC1 .all. MAF 0.2538126 1.0e+00 0.25 ns T-test
DEPDC1 .all. Hyperdiploid 0.0000000 2.0e-07 2.7e-08 **** T-test
DEPDC1 .all. Proliferation 0.0000239 1.2e-04 2.4e-05 **** T-test
DEPDC1 .all. Low bone disease 0.0000053 3.2e-05 5.3e-06 **** T-test
head(myeloma)
X molecular_group chr1q21_status treatment event time CCND1 CRIM1 DEPDC1 IRF4 TP53 WHSC1 group
GSM50986 Cyclin D-1 3 copies TT2 0 69.24 9908.4 420.9 523.5 16156.5 10.0 261.9 cyclin.d.1
GSM50988 Cyclin D-2 2 copies TT2 0 66.43 16698.8 52.0 21.1 16946.2 1056.9 363.8 cyclin.d.2
GSM50989 MMSET 2 copies TT2 0 66.50 294.5 617.9 192.9 8903.9 1762.8 10042.9 mmset
GSM50990 MMSET 3 copies TT2 1 42.67 241.9 11.9 184.7 11894.7 946.8 4931.0 mmset
GSM50991 MAF NA TT2 0 65.00 472.6 38.8 212.0 7563.1 361.4 165.0 maf
GSM50992 Hyperdiploid 2 copies TT2 0 65.20 664.1 16.9 341.6 16023.4 2096.3 569.2 hyperdiploid
# 可视化
# legend = "none"参数去掉图形上方显示的每种颜色的box代表哪种分组的信息
#
ggboxplot(myeloma, x = "molecular_group", y = "DEPDC1", color = "molecular_group",
add = "jitter", legend = "none") +
rotate_x_text(angle = 45)+
geom_hline(yintercept = mean(myeloma$DEPDC1), linetype = 2)+ # Add horizontal line at base mean
stat_compare_means(method = "anova", label.y = 1600)+ # Add global annova p-value
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = ".all.") # Pairwise comparison against all, 此处添加hide.ns = TRUE参数,将隐藏不显著的ns符号

From the plot above, we can conclude that DEPDC1 is significantly overexpressed in proliferation group and, it’s significantly downexpressed in Hyperdiploid and Low bone disease compared to all.

5. 多个分组变量

  • 在使用另外一个变量分组后,两个独立样本之间的比较
compare_means(len ~ supp, data = ToothGrowth, 
group.by = "dose")
dose .y. group1 group2 p p.adj p.format p.signif method
0.5 len OJ VC 0.0231864 0.046 0.023 * Wilcoxon
1.0 len OJ VC 0.0040304 0.012 0.004 ** Wilcoxon
2.0 len OJ VC 1.0000000 1.000 1.000 ns Wilcoxon

Visualize (1/2). Create a multi-panel box plots facetted by group (here, “dose”):

# 多个小图facet
p <- ggboxplot(data = ToothGrowth,
x = "supp", y = "len",
color = "supp", palette = "jco",
add = "jitter",
facet.by = "dose",
short.panel.labs = F) # short.panel.labs默认为T,改为F显示分组dose:

# 仅显示P值,去除方法
p + stat_compare_means(label = "p.format")

# 或者显示显著性符号,而不是直接显示P值
p + stat_compare_means(label = "p.signif", label.x = 1.5)

# To hide the ‘ns’ symbol, use the argument hide.ns = TRUE.

Visualize (2/2). Create one single panel with all box plots. Plot y = “len” by x = “dose” and color by “supp”:

# 将所有小图组合放在一张图里
p <- ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "supp", palette = "jco",
add = "jitter") # 去掉facet.by参数
p + stat_compare_means(aes(group = supp))
# 使用aes映射时,列名不能加双引号

# 只显示P值
p + stat_compare_means(aes(group = supp),label = "p.format")

显示P=数值

# 单纯的只显示P的数值
p + stat_compare_means(aes(group = supp, label = paste(..p.format..)))

只显示P值

# 只显示显著性水平
p + stat_compare_means(aes(group = supp), label = "p.signif")

  • 将数据按另一个变量分组后进行配对样本比较
compare_means(len ~ supp, data = ToothGrowth, 
group.by = "dose", paired = TRUE) # 增加paired参数
dose .y. group1 group2 p p.adj p.format p.signif method
0.5 len OJ VC 0.0329694 0.066 0.033 * Wilcoxon
1.0 len OJ VC 0.0136719 0.041 0.014 * Wilcoxon
2.0 len OJ VC 1.0000000 1.000 1.000 ns Wilcoxon

相比于非配对P检验,配对P检验的P值更大

# Box plot facetted by "dose"
p <- ggpaired(ToothGrowth, x = "supp", y = "len",
color = "supp", palette = "jco",
line.color = "gray", line.size = 0.4,
facet.by = "dose", short.panel.labs = FALSE)
# Use only p.format as label. Remove method name.
p + stat_compare_means(label = "p.format", paired = TRUE)

6. 其他类型的图表

6.1 不分组的图

  • 条形图
# 条形图
ggbarplot(data = ToothGrowth, x = "dose", y = "len", add = "mean_se",
color = "dose",palette="jco", legend="none")+
stat_compare_means()+ # 总体比较
stat_compare_means(ref.group = "0.5", label = "p.signif", label.y = c(22,28))

  • 折线图
# Line plot of mean +/-se
ggline(ToothGrowth, x = "dose", y = "len", add = "mean_se",
color = "black", linetype = 2)+
stat_compare_means() + # Global p-value
stat_compare_means(ref.group = "0.5", label = "p.signif",
label.y = c(22, 29))

6.2 分组的图

# 分组条形图
# position = position_dodge()调节supp分组的OJ组和VC组两组样本的柱状图的距离,默认值为重合(0)
ggbarplot(ToothGrowth, x = "dose", y = "len", add = "mean_se",
color = "supp", palette = "jco",
position = position_dodge(0.8))+
stat_compare_means(aes(group = supp), label = "p.signif", label.y = 29)

# 分组折线图
ggline(ToothGrowth, x = "dose", y = "len", add = "mean_se",
color = "supp", palette = "jco")+
stat_compare_means(aes(group = supp), label = "p.signif",
label.y = c(16, 25, 29))

参考:

  1. 数据可视化——R语言为ggplot图形添加P值和显著性水平

  2. ggpubr: Publication Ready Plots Add P-values and Significance Levels to ggplots