0
点赞
收藏
分享

微信扫一扫

DNA 2. SCI 文章中基因组变异分析神器之 maftools

点击关注,桓峰基因

桓峰基因

生物信息分析,SCI文章撰写及生物信息基础知识学习:R语言学习,perl基础编程,linux系统命令,Python遇见更好的你

78篇原创内容

公众号

前 言

随着癌症基因组学的进步,突变注释格式(MAF)正在被广泛接受并用于存储检测到的体细胞变异。癌症基因组图谱项目已经对30多种不同的癌症进行了测序,每种癌症的样本量都超过了200个。由体细胞变体组成的结果数据以突变注释格式的形式存储。只要数据是MAF格式,本包试图从TCGA来源或任何内部研究中以有效的方式总结、分析、注释和可视化MAF文件。

图片

实例解析

1. 软件安装

在安装这个软件maftools时,需要先安装BioManager,然后再安装maftools,如下:

if (!require("BiocManager")) {
    install.packages("BiocManager")
}
if (!require(maftools)) {
    BiocManager::install("maftools")
}
library(maftools)

2. 数据读取

maftools工具需要读入两个文件,如下:

1.MAF文件-可以是gz压缩。必需的;

2.与MAF中每个样本/肿瘤样本条码相关的可选推荐的临床数据;

3.一个可选的拷贝数数据:可以是GISTIC输出或自定义表。

1.maf文件格式

MAF文件包含许多字段,从染色体名称到cosmic注释。然而,maftools中的大多数分析使用以下列如下:1.Hugo_Symbol;

2.Chromosome;

3.Start_Position;

4.End_Position;

5.Reference_Allele;

6.Tumor_Seq_Allele2;

7.Variant_Classification;

8.Variant_Type;

9.Tumor_Sample_Barcode.

同时读取maf文件和临床信息文件,看下结果,如下:

# path to TCGA LAML MAF file
laml.maf = system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
# clinical information containing survival information and histology. This is
# optional
laml.clin = system.file("extdata", "tcga_laml_annot.tsv", package = "maftools")
laml = read.maf(maf = laml.maf, clinicalData = laml.clin)
## -Reading
## -Validating
## -Silent variants: 475 
## -Summarizing
## -Processing clinical data
## -Finished in 3.790s elapsed (0.550s cpu)
print(laml@data[1:5, ])
##    Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
## 1:      ABCA10          10349 genome.wustl.edu         37         17
## 2:       ABCA4             24 genome.wustl.edu         37          1
## 3:      ABCB11           8647 genome.wustl.edu         37          2
## 4:       ABCC3           8714 genome.wustl.edu         37         17
## 5:       ABCF1             23 genome.wustl.edu         37          6
##    Start_Position End_Position Strand Variant_Classification Variant_Type
## 1:       67170917     67170917      +            Splice_Site          SNP
## 2:       94490594     94490594      +      Missense_Mutation          SNP
## 3:      169780250    169780250      +      Missense_Mutation          SNP
## 4:       48760974     48760974      +      Missense_Mutation          SNP
## 5:       30554429     30554429      +      Missense_Mutation          SNP
##    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
## 1:                T                 T                 C         TCGA-AB-2988
## 2:                C                 C                 T         TCGA-AB-2869
## 3:                G                 G                 A         TCGA-AB-3009
## 4:                C                 C                 T         TCGA-AB-2887
## 5:                G                 G                 A         TCGA-AB-2920
##    Protein_Change i_TumorVAF_WU i_transcript_name
## 1:        p.K960R      45.66000       NM_080282.3
## 2:       p.R1517H      38.12000       NM_000350.2
## 3:       p.A1283V      46.97218       NM_003742.2
## 4:       p.P1271S      56.41000       NM_003786.1
## 5:        p.G658S      40.95000    NM_001025091.1

2.临床信息格式

临床数据格式包括:每个样本/肿瘤样本条码,和对应的临床数据,如下:

print(laml@clinical.data[1:5, ])
##    Tumor_Sample_Barcode FAB_classification days_to_last_followup
## 1:         TCGA-AB-2802                 M4                   365
## 2:         TCGA-AB-2803                 M3                   792
## 3:         TCGA-AB-2804                 M3                  2557
## 4:         TCGA-AB-2805                 M0                   577
## 5:         TCGA-AB-2806                 M1                   945
##    Overall_Survival_Status
## 1:                       1
## 2:                       1
## 3:                       0
## 4:                       1
## 5:                       1

3.拷贝数变异格式

拷贝数数据包含样本名称,基因名称和拷贝数状态(Amp或Del)。

all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")

laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes,
    gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
laml.gistic@data[1:5, ]
##    Hugo_Symbol Tumor_Sample_Barcode Variant_Classification Variant_Type
## 1:       KMT2A         TCGA-AB-2805                    Amp          CNV
## 2:   LINC00689         TCGA-AB-2805                    Del          CNV
## 3:      MIR595         TCGA-AB-2805                    Del          CNV
## 4:   RN7SL142P         TCGA-AB-2805                    Del          CNV
## 5:         SHH         TCGA-AB-2805                    Del          CNV
##        Cytoband
## 1: AP_2:11q23.3
## 2:  DP_4:7q32.3
## 3:  DP_4:7q32.3
## 4:  DP_4:7q32.3
## 5:  DP_4:7q32.3

3.数据统计

对整体数据情况进行汇总,统计样本量,突变基因量,以及突变类型,最后整理到表格中,如下:

# Shows sample summry.
getSampleSummary(laml)
##      Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
##   1:         TCGA-AB-3009               0               5            0
##   2:         TCGA-AB-2807               1               0            1
##   3:         TCGA-AB-2959               0               0            0
##   4:         TCGA-AB-3002               0               0            0
##   5:         TCGA-AB-2849               0               1            0
##  ---                                                                  
## 189:         TCGA-AB-2942               0               0            0
## 190:         TCGA-AB-2946               0               0            0
## 191:         TCGA-AB-2954               0               0            0
## 192:         TCGA-AB-2982               0               0            0
## 193:         TCGA-AB-2903               0               0            0
##      In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
##   1:            1                25                 2           1    34
##   2:            0                16                 3           4    25
##   3:            0                22                 0           1    23
##   4:            0                15                 1           5    21
##   5:            0                16                 1           2    20
##  ---                                                                   
## 189:            1                 0                 0           0     1
## 190:            0                 1                 0           0     1
## 191:            0                 1                 0           0     1
## 192:            0                 1                 0           0     1
## 193:            0                 0                 0           0     0
# Shows gene summary.
getGeneSummary(laml)
##       Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
##    1:        FLT3               0               0            1           33
##    2:      DNMT3A               4               0            0            0
##    3:        NPM1               0              33            0            0
##    4:        IDH2               0               0            0            0
##    5:        IDH1               0               0            0            0
##   ---                                                                      
## 1237:      ZNF689               0               0            0            0
## 1238:      ZNF75D               0               0            0            0
## 1239:      ZNF827               1               0            0            0
## 1240:       ZNF99               0               0            0            0
## 1241:        ZPBP               0               0            0            0
##       Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
##    1:                15                 0           3    52             52
##    2:                39                 5           6    54             48
##    3:                 1                 0           0    34             33
##    4:                20                 0           0    20             20
##    5:                18                 0           0    18             18
##   ---                                                                     
## 1237:                 1                 0           0     1              1
## 1238:                 1                 0           0     1              1
## 1239:                 0                 0           0     1              1
## 1240:                 1                 0           0     1              1
## 1241:                 1                 0           0     1              1
##       AlteredSamples
##    1:             52
##    2:             48
##    3:             33
##    4:             20
##    5:             18
##   ---               
## 1237:              1
## 1238:              1
## 1239:              1
## 1240:              1
## 1241:              1
# shows clinical data associated with samples
getClinicalData(laml)
##      Tumor_Sample_Barcode FAB_classification days_to_last_followup
##   1:         TCGA-AB-2802                 M4                   365
##   2:         TCGA-AB-2803                 M3                   792
##   3:         TCGA-AB-2804                 M3                  2557
##   4:         TCGA-AB-2805                 M0                   577
##   5:         TCGA-AB-2806                 M1                   945
##  ---                                                              
## 189:         TCGA-AB-3007                 M3                  1581
## 190:         TCGA-AB-3008                 M1                   822
## 191:         TCGA-AB-3009                 M4                   577
## 192:         TCGA-AB-3011                 M1                  1885
## 193:         TCGA-AB-3012                 M3                  1887
##      Overall_Survival_Status
##   1:                       1
##   2:                       1
##   3:                       0
##   4:                       1
##   5:                       1
##  ---                        
## 189:                       0
## 190:                       1
## 191:                       1
## 192:                       0
## 193:                       0
# Shows all fields in MAF
getFields(laml)
##  [1] "Hugo_Symbol"            "Entrez_Gene_Id"         "Center"                
##  [4] "NCBI_Build"             "Chromosome"             "Start_Position"        
##  [7] "End_Position"           "Strand"                 "Variant_Classification"
## [10] "Variant_Type"           "Reference_Allele"       "Tumor_Seq_Allele1"     
## [13] "Tumor_Seq_Allele2"      "Tumor_Sample_Barcode"   "Protein_Change"        
## [16] "i_TumorVAF_WU"          "i_transcript_name"
# Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = "laml")

数据汇总统计绘图,将每个样本中的变量数量显示为堆叠的barplot,将变量类型显示为Variant_Classification汇总的箱线图。如下:

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = "median", dashboard = TRUE,
    titvRaw = FALSE)

图片

4.突变数据可视化

瀑布图

# oncoplot for top ten mutated genes.
oncoplot(maf = laml, top = 10)

图片

顺反式图

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
# plot titv summary
plotTiTv(res = laml.titv)

图片

棒棒糖图(氨基酸)

# lollipop plot for DNMT3A, which is one of the most frequent mutated gene in
# Leukemia.
lollipopPlot(maf = laml, gene = "DNMT3A", AACol = "Protein_Change", showMutationRate = TRUE,
    labelPos = 882)
##      HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_022552  NP_072046       912
## 2: DNMT3A NM_153759  NP_715640       723
## 3: DNMT3A NM_175629  NP_783328       912

图片

棒棒糖图(蛋白)

plotProtein(gene = "TP53", refSeqID = "NM_000546")

图片

降雨图

brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)
rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.4)
##    Chromosome Start_Position End_Position nMuts Avg_intermutation_dist Size
## 1:          8       98129348     98133560     7               702.0000 4212
## 2:          8       98398549     98403536     9               623.3750 4987
## 3:          8       98453076     98456466     9               423.7500 3390
## 4:          8      124090377    124096810    22               306.3333 6433
## 5:         12       97436055     97439705     7               608.3333 3650
## 6:         17       29332072     29336153     8               583.0000 4081
##    Tumor_Sample_Barcode C>G C>T
## 1:         TCGA-A8-A08B   4   3
## 2:         TCGA-A8-A08B   1   8
## 3:         TCGA-A8-A08B   1   8
## 4:         TCGA-A8-A08B   1  21
## 5:         TCGA-A8-A08B   4   3
## 6:         TCGA-A8-A08B   4   4

图片

突变负荷与TCGA比较

laml.mutload = tcgaCompare(maf = laml, cohortName = "Example-LAML", logscale = TRUE,
    capture_size = 50)

图片

VAF分布图

plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')

图片

CNV可视化

基因组分布

gisticChromPlot(gistic = laml.gistic, markBands = "all")

图片

气泡图

gisticBubblePlot(gistic = laml.gistic)

图片

瀑布图

gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = "FAB_classification",
    sortByAnnotation = TRUE, top = 10)

图片

CBS 可视化

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg)
## NULL

图片

这期主要说说单纯的可视化数据,下期说说利用maftools分析一些实质性的操作,敬请期待,想学习生信的可以到B站,每天上传教程!

B站直播课,肿瘤克隆进化生信分析培训课程,没有录播,有需要这方面分析内容的老师可以过来交流一下!

图片

我们在桓峰基因公众号也推出克隆进化的系列文章,从组织到单细胞的克隆进化分析都有介绍,供大家参考:

Topic 1.克隆进化之sciClone
Topic 2.克隆进化之ClonEvol
Topic 3. 克隆进化之 fishplot
Topic 4. 克隆进化之 Pyclone
Topic 5. 克隆进化之 CITUP
Topic 6. 克隆进化之 Canopy
Topic 7. 克隆进化之 Cardelino
Topic 8. 克隆进化之 RobustClone
Topic 9. 克隆进化之 TimeScape
Clone 1. 肿瘤克隆进化之前世今生
Clone 2. 肿瘤克隆进化之不同进化模

有做肿瘤的转移,癌症的耐药性亚克隆,靶向用药等课题的老师都可以直接联系我,保证最好的服务,最顶级的科研方法,助力高分!

References:

  1. Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. 2018. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Resarch. PMID: 30341162
举报

相关推荐

0 条评论