没想到一个富集分析涉及这么多的东西,今天没有完全的搞懂,先记录一下学习的代码, 我打算先把R学一下,再来更新这篇文章

「生信下游分析」如何使用clusterProfiler进行富集分析

安装clusterProfiler等包

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("clusterProfiler")
browseVignettes("clusterProfiler") # documentation
if (!requireNamespace("edgeR")){
BiocManager::install("edgeR")
}
# 物种包到:http://www.bioconductor.org/packages/release/BiocViews.html#___Organism 查找,这里的表示humans。
if (!requireNamespace("org.Hs.eg.db")){
BiocManager::install("org.Hs.eg.db")
}

不同的参考物种,下载不同的数据:
BIoconductor物种包

加载R包

library(edgeR)
library(clusterProfiler)
library(org.Hs.eg.db)

数据导入

数据信息:

  • 来源: TCGA Ovarian Serous Cystadenocarcinoma(OSC)
  • 日期: 2017-06-14
  • 下载工具: TCGABiolinks®
  • 类型: RNA-seq(RSEM)
  • 样本大小: 296
    • 79 Immunoreactive(免疫反应物);
    • 71 Mesenchymal(间充质);
    • 67 Differentiated(分化);
    • 79 Proliferative(增生);

表达量数据数据导入:

raw_count <- read.table("D:/baiduyundownload/富集分析/enrichment_analysis/TCGA_RNASeq_rawcounts.txt",
sep = "\t",
stringsAsFactors = FALSE,
header = TRUE)
dim(raw_count)

查看部分数据

raw_count[1:5,1:5]

分组信息导入

groups_df <- read.table("./RNASeq_classdefinitions.txt",
header = TRUE,
sep = "\t", stringsAsFactors = FALSE)
head(groups_df)

差异表达分析

数据预过滤

根据CPM值,过滤低表达的基因. 标准是基因至少在50个样本量的表达量超过1

cpms <- cpm(raw_count)
keep <- rowSums(cpms > 1) >= 50
counts <- raw_count[keep,]
dim(counts)

数据标准化和离散度(Dispersion)分析

创建DGEList类,用于存放表达量和样本信息

# create data structure to hold counts and subtype information for each sample.
d <- DGEList(counts=counts, group=groups_df$SUBTYPE)

对原始文库数据计算标准化因子

#Normalize the data
d <- calcNormFactors(d)

样本间距离

#create multidimensional scaling(MDS) plot. The command below will automatically
# generate the plot containing all samples where each subtype is a different color.
#Ideally there should be a good separation between the different classes.
mds_output <- plotMDS(d, labels=NULL, pch = 1,
col= c("darkgreen","blue","red","orange")[factor(groups_df$SUBTYPE)],
xlim = c(-2.5,4), ylim = c(-2.5,4))
legend("topright",
legend=levels(factor(groups_df$SUBTYPE)),
pch=c(1), col= c("darkgreen","blue","red", "orange"),title="Class",
bty = 'n', cex = 0.75)

计算离散度

#calculate dispersion
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)

差异表达分析

两样本使用exactTest即可.

多样本需要用model.matrix, makeContrasts, glmFit,glmLRT等函数。

de <- exactTest(d, pair=c("Immunoreactive","Mesenchymal"))
tt_exact_test <- topTags(de,n=nrow(d))

保存结果:

output_df <- cbind(gene=row.names(tt_exact_test),
as.data.frame(tt_exact_test))
write.csv(output_df, file = "edgeR_DEG.csv",
row.names = FALSE)

富集分析

读取数据

deg_df <- read.table("./edgeR_DEG.csv",
header = TRUE,
sep = ",",
stringsAsFactors = FALSE)

数据预处理: 将基因列拆分

deg_df$symbol <- do.call(rbind, strsplit(deg_df$gene, "|", fixed = TRUE))[,1]
deg_df$geneID <- do.call(rbind, strsplit(deg_df$gene, "|", fixed = TRUE))[,2]
head(deg_df)

GO分析

第一步: 筛选目标基因集。例如logFC >1, FDR < 0.05 认为是显著性上调

flt <- deg_df[deg_df$logFC > 1 & deg_df$FDR < 0.05,]
up_gene <- flt$geneID

第二步(1): GO富集分析

ego <- enrichGO(up_gene,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP")

第二步(2): KEGG富集分析

ekegg <- enrichKEGG(up_gene,
organism = "hsa",
keyType = "kegg")

第三步: 可视化

柱形图

barplot(ego, 
showCategory = 20)

气泡图

dotplot(ego)
dotplot(ekegg)
View(as.data.frame(ekegg))
browseKEGG(ekegg, "hsa05224")

GSEA分析

第一步: 构建排序基因表

rank_list <- deg_df$logFC
names(rank_list) <- deg_df$geneID
rank_list <- sort(rank_list, decreasing = TRUE)

第二步: GSEA分析

gseago <- gseGO(rank_list,
ont = "BP",
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID")

gseakegg <- gseKEGG(rank_list,
organism = "hsa",
keyType = "kegg")

第三步: 可视化

冒泡图

dotplot(gseakegg)

GSEA图

View(as.data.frame(gseakegg))
enrichplot::gseaplot2(gseakegg, "hsa05224")

Reference:

https://yulab-smu.github.io/clusterProfiler-book/index.html

https://guangchuangyu.github.io/software/clusterProfiler/documentation/

http://www.bioconductor.org/packages/release/BiocViews.html#___Organism