(9:22)早上来的时候,检查发现,昨天是(8:23)的时候,运行完成。昨天挂机的时间是(19:00),所以,整体上,运行这个程序的时间为2个小时。
与昨天的问题一样,我的somatic_mutations_hg19.txt这个文件还是没有跑出来。
我把过程性的文档调出来,我们再来细致的分析一下。
mutation calling 程序的运行情况:
(1)出错1: speedseq中的freebayes executable not found. Please set path in /home/xxzhang/workplace/software/speedseq/bin/speedseq.config file./home/xxzhang/workplace/software/speedseq/bin/speedseq somatic -F 0.01 -q 10 -t 20 -T ./output_RNA/sptmp -o ./output_RNA/speedseq ./geneome/hg19/hg19.fa /home/xxzhang/workplace/QBRC/output_RNA/normal/normal.bam /home/xxzhang/workplace/QBRC/output_RNA/tumor/tumor.bam
Sourcing executables from /home/xxzhang/workplace/software/speedseq/bin/speedseq.config …
usage: speedseq somatic [options] <reference.fa> <normal.bam> <tumor.bam>
positional args:
reference.fa
genome reference fasta file
normal.bam
germline BAM file(s) (comma separated BAMs from multiple libraries).
Must have readgroup information, and the SM readgroup tag will
be the VCF column header
tumor.bam
tumor BAM file(s) (comma separated BAMs for multiple libraries).
Must have readgroup information, and the SM readgroup tag will
be the VCF column header
options:
-o STR output prefix [tumor.bam]
-w FILE BED file of windowed genomic intervals
-t INT threads [1]
-F FLOAT Require at least this fraction of observations supporting
an alternate allele within a single individual in order
to evaluate the position [0.05]
-C INT Require at least this count of observations supporting
an alternate allele within a single individual in order
to evaluate the position [2]
-S FLOAT minimum somatic score (SSC) for PASS [18]
-q FLOAT minimum QUAL score to output non-passing somatic variants [1e-5]
-T DIR temp directory [./output_prefix.XXXXXXXXXXXX]
-A annotate the vcf with VEP
-a VEP assembly to use [GRCh37]
-K FILE path to speedseq.config file (default: same directory as speedseq)
-v verbose
-k keep tempory files
-h show this message
Error: freebayes executable not found. Please set path in /home/xxzhang/workplace/software/speedseq/bin/speedseq.config file
打开那个speedseq.config文件,我终于明白我的错误在什么地方了。因为当初安装speedseq的时候, 是直接“移植”到运行环境中。所以,一些依赖性的路径依然是参考之前的,所以,在调用的时候,就无法合理的运行。
SPEEDSEQ_HOME=/home/zxx/workplace/speedseq/
general
SAMTOOLS=/usr/local/bin/samtools
SAMBAMBA=/home/zxx/workplace/speedseq//bin/sambamba
BGZIP=/home/zxx/workplace/speedseq//bin/bgzip
TABIX=/home/zxx/workplace/speedseq//bin/tabix
VAWK=/home/zxx/workplace/speedseq//bin/vawk
PARALLEL=/home/zxx/workplace/speedseq//bin/parallel
PYTHON=/usr/bin/python2.7
HEXDUMP=/usr/bin/hexdump
align
BWA=/home/zxx/workplace/speedseq//bin/bwa
SAMBLASTER=/home/zxx/workplace/speedseq//bin/samblaster
var/somatic
FREEBAYES=/home/zxx/workplace/speedseq//bin/freebayes
VEP=/home/zxx/workplace/speedseq//bin/variant_effect_predictor.pl
VEP_CACHE_DIR=/home/zxx/workplace/speedseq//annotations/vep_cache
所以,我现在的核心任务就是,修改路径。或者重新安装一下speedseq。
用home/xxzhang/workplace/software/speedseq
替换/home/zxx/workplace/speedseq/
。
还有剩余三行,在服务器中的地址需要重新确认一下。
SAMTOOLS=/home/xxzhang/miniconda3/bin
PYTHON=/usr/bin/python2.7
HEXDUMP=/usr/bin/hexdump
完成如上修改,放回到文件夹中去看最终的结果。
完成并保存成功了,然后我们开始运行。
试一下能不能运行这个指令,同样出现错误。
Error: freebayes executable not found. Please set path in /home/xxzhang/workplace/software/speedseq/bin/speedseq.config file
原来是这个地方出现了问题。
./freebayes: /lib64/libm.so.6: version GLIBC_2.27 not found (required by ./freebayes)
./freebayes: /lib64/libstdc++.so.6: version GLIBCXX_3.4.20’ not found (required by ./freebayes)
./freebayes: /lib64/libstdc++.so.6: version CXXABI_1.3.9’ not found (required by ./freebayes)
./freebayes: /lib64/libstdc++.so.6: version GLIBCXX_3.4.21’ not found (required by ./freebayes)
问题挺大的,所以我打算重新安装speedseq。
但是,我试图clone在git-hub上的资源。
Cloning into ‘speedseq’…
fatal: unable to access ‘https://github.com/hall-lab/speedseq/’: OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to github.com:443
显示,无法获取到这个软件。
(这个地方,问题还挺大的,不知道今天能不能弄完)
(2)出错2:/home/xxzhang/miniconda3/bin/lofreq2_call_pparallel.py --pp-threads 20 -s --sig 0.1 --bonf 1 -C 7 -f ./geneome/hg19/hg19.fa -S ./geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf --call-indels -l ./geneome/hg19/hg19.fa.exon.bed -o ./output_RNA/lofreq_t.vcf /home/xxzhang/workplace/QBRC/output_RNA/tumor/tumor.bam
这里主要有两处错误:/home/xxzhang/miniconda3/bin/lofreq2_call_pparallel.py
,这里属于失误。应该将 lofreq2_call_pparallel.py
,改为lofreq2_call_parallel.py
,这里错打了一个。
(不对,这里没有错误。)2021/6/11 17:29
- 文件找不到:
CRITICAL [2021-06-10 20:21:36,841]: Bed-file ./geneome/hg19/hg19.fa.exon.bed does not exist
这个文件来自于何处呢?
我按照gtf文件中,对于exon的注释,提取出exon的首尾的信息,最终转换为bed文件。
(3)出错3:文件顺序的错误。
- 同样还是索引错误的问题:CosmicCodingMuts.hg19.vcf and reference have incompatible contigs: No overlapping contigs found.需要重新排一个顺序。
ERROR MESSAGE: Input files /home/xxzhang/workplace/QBRC/./geneome/hg19/hg19.fa_resource/CosmicCodingMuts.hg19.vcf and reference have incompatible contigs: No overlapping contigs found.
ERROR /home/xxzhang/workplace/QBRC/./geneome/hg19/hg19.fa_resource/CosmicCodingMuts.hg19.vcf contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y]
ERROR reference contigs = [chr1, chr1_gl000191_random, chr1_gl000192_random, chr2, chr3, chr4, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr5, chr6, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7, chr7_gl000195_random, chr8, chr8_gl000196_random, chr8_gl000197_random, chr9, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18, chr18_gl000207_random, chr19, chr19_gl000208_random, chr19_gl000209_random, chr20, chr21, chr21_gl000210_random, chr22, chrM, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249, chrX, chrY]
这个问题,还挺大,远远不止是顺序不匹配的问题,还有它的这个顺序是:
[1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y]
而我们基因组的染色体号是chr
开头的。所以,在进行排序之前,还需要解决一下,我们之前遇到的那个添加chr
的问题。
问题分为两步解决:
首先,使用perl,在文本之前添加chr,可以先使用小一点的文本试一试,看看是否可行。在可行的基础上,添加大批量的数据进行尝试。
还是使用R来做这件事吧,直接合并一下即可。
(16:26)用R简单的写了一个函数,来处理这件事。
setwd("E://S-code//Rworkplace")
process<-function(a){
data<-read.table(a)
n<-dim(data)[1]
chr<-rep("chr", n)
chromosome<-paste(chr,data$V1,sep="")
data_new<-cbind(chromosome,data)
data_new<-data_new[,-2]
write.table(data_new,file="new.txt",sep="\t",quote=FALSE,row.names = FALSE,col.names=FALSE)
}
process("vcftest.txt")
(16:35)将文件提交到程序中进行处理。
需要一定的时间。
(18:38)现在两个小时过去了,这个程序还没有运行完成。所以,有时候写代码,不仅仅是要能够解决任务,还要考虑代码对于内存的消耗,以及运行时间。
(19:58)处理到猴年马月,我怀疑我代码写错了。
(20:45)换用其他的思路:使用我自己的计算机运行实在是太慢了。
使用服务器,继续运行(5秒中结束。。)所以,我自己的工作效率超级受到自身能力的影响。本来可以五秒中解决的事情,愣是等了4个小时。所以,更加坚定了自己深入学习Linux的指令的决心!!!
这个过程中,使用到的代码:
awk '{
if($0 !~ /^#/)
print "chr"$0;
else if(match($0,/(##contig=<ID=)(.*)/,m))
print m[1]"chr"m[2];
else print $0
}' CosmicCodingMuts.hg19.vcf > CosmicCodingMuts.hg19.chr.vcf
修改完成后,查看CosmicCodingMuts.hg19.chr.vcf文件。
chr1 137069 COSV70649555 G A . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN23945201;CDS=c.*909+552C>T;AA=p.?;HGVSC=ENST00000423372.3:c.*909+552C>T;HGVSG=1:g.137069G>A;CNT=1
chr1 137080 COSV70649318 A G . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN8964589;CDS=c.*909+541T>C;AA=p.?;HGVSC=ENST00000423372.3:c.*909+541T>C;HGVSG=1:g.137080A>G;CNT=1
chr1 137109 COSV70649279 G A . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN21849789;SNP;CDS=c.*909+512C>T;AA=p.?;HGVSC=ENST00000423372.3:c.*909+512C>T;HGVSG=1:g.137109G>A;CNT=1
chr1 137121 COSV70649283 G A . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN22491822;CDS=c.*909+500C>T;AA=p.?;HGVSC=ENST00000423372.3:c.*909+500C>T;HGVSG=1:g.137121G>A;CNT=1
实现目标。
接下来,我们需要单独对其进行排序。grep -o -E "^\w+([-+.]\w+)*" CosmicCodingMuts.hg19.vcf |uniq
chr1
chr10
……
和我们预料的一样,现在对它进行排序。/home/xxzhang/workplace/software/java/jdk1.8.0_291/bin/java -jar /home/xxzhang/workplace/QBRC/somatic_script/picard.jar SortVcf --INPUT CosmicCodingMuts.hg19.vcf --OUTPUT CosmicCodingMuts.hg19_2.vcf -SEQUENCE_DICTIONARY /home/xxzhang/workplace/QBRC/geneome/hg19/hg19.dict
前面运行挺顺畅的,卡在了这一步。
INFO 2021-06-11 21:20:32 SortVcf read 37,500,000 records. Elapsed time: 00:04:15s. Time for last 25,000: 0s. Last read position: chrX:54,209,249
(21:23)总结一下,现在需要解决的问题。
(1)完成安装speedseq,正常运行。
(2)对vcf文件进行排序处理。
(2021年6月12日,13:57)
来的时候,发现昨天晚上,一直希望解决的问题,始终没有解决。是我处理后的文件,有错误吗?
INFO 2021-06-11 21:20:32 SortVcf read 37,450,000 records. Elapsed time: 00:04:14s. Time for last 25,000: 0s. Last read position: chrX:49,033,311
INFO 2021-06-11 21:20:32 SortVcf read 37,475,000 records. Elapsed time: 00:04:15s. Time for last 25,000: 0s. Last read position: chrX:51,487,063
INFO 2021-06-11 21:20:32 SortVcf read 37,500,000 records. Elapsed time: 00:04:15s. Time for last 25,000: 0s. Last read position: chrX:54,209,249
卡在了这一处chrX:54,209,249,就很奇怪。我想看一下,这一处后面的内容是什么?为什么卡在了这里?
使用grep
指令,查看了上下文,并没有发现什么“异常”?
chrX 54209249 COSV104652506 C A . . GENE=FAM120C_ENST00000477084;STRAND=-;LEGACY_ID=COSM9340896;CDS=c.383G>T;AA=p.R128L;HGVSC=ENST00000477084.1:c.383G>T;HGVSP=ENSP00000420718.1:p.Arg128Leu;HGVSG=X:g.54209249C>A;CNT=1
chrX 54209249 COSV104652506 C A . . GENE=FAM120C_ENST00000328235;STRAND=-;LEGACY_ID=COSM9340896;CDS=c.383G>T;AA=p.R128L;HGVSC=ENST00000328235.4:c.383G>T;HGVSP=ENSP00000329896.4:p.Arg128Leu;HGVSG=X:g.54209249C>A;CNT=1
chrX 54209249 COSV104652506 C A . . GENE=FAM120C;STRAND=-;LEGACY_ID=COSM9340896;CDS=c.383G>T;AA=p.R128L;HGVSC=ENST00000375180.2:c.383G>T;HGVSP=ENSP00000364324.2:p.Arg128Leu;HGVSG=X:g.54209249C>A;CNT=1
chrX 54209257 COSV60260184 C T . . GENE=FAM120C_ENST00000477084;STRAND=-;LEGACY_ID=COSM4110272;CDS=c.375G>A;AA=p.A125%3D;HGVSC=ENST00000477084.1:c.375G>A;HGVSP=ENSP00000420718.1:p.Ala125%3D;HGVSG=X:g.54209257C>T;CNT=2
chrX 54209257 COSV60260184 C T . . GENE=FAM120C_ENST00000328235;STRAND=-;LEGACY_ID=COSM4110272;CDS=c.375G>A;AA=p.A125%3D;HGVSC=ENST00000328235.4:c.375G>A;HGVSP=ENSP00000329896.4:p.Ala125%3D;HGVSG=X:g.54209257C>T;CNT=2
chrX 54209257 COSV60260184 C T . . GENE=FAM120C;STRAND=-;LEGACY_ID=COSM4110272;CDS=c.375G>A;AA=p.A125%3D;HGVSC=ENST00000375180.2:c.375G>A;HGVSP=ENSP00000364324.2:p.Ala125%3D;HGVSG=X:g.54209257C>T;CNT=2
chrX 54209258 COSV60262952 G A . . GENE=FAM120C;STRAND=-;LEGACY_ID=COSM5444372;CDS=c.374C>T;AA=p.A125V;HGVSC=ENST00000375180.2:c.374C>T;HGVSP=ENSP00000364324.2:p.Ala125Val;HGVSG=X:g.54209258G>A;CNT=1
chrX 54209258 COSV60262952 G A . . GENE=FAM120C_ENST00000328235;STRAND=-;LEGACY_ID=COSM5444372;CDS=c.374C>T;AA=p.A125V;HGVSC=ENST00000328235.4:c.374C>T;HGVSP=ENSP00000329896.4:p.Ala125Val;HGVSG=X:g.54209258G>A;CNT=1
其后的文字也挺正常的,没有看出什么异常。
难道是dict文件的错误?
@SQ SN:chrUn_gl000248 LN:39786 M5:5a8e43bec9be36c7b49c84d585107776 UR:file:/home/zxx/QBRC/geneome/hg19/hg19.fa
@SQ SN:chrUn_gl000249 LN:38502 M5:1d78abec37c15fe29a275eb08d5af236 UR:file:/home/zxx/QBRC/geneome/hg19/hg19.fa
@SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/home/zxx/QBRC/geneome/hg19/hg19.fa
@SQ SN:chrY LN:59373566 M5:1e86411d73e6f00a10590f976be01623 UR:file:/home/zxx/QBRC/geneome/hg19/hg19.fa
觉得也没什么不对。
就是为什么,sortVCF
这个指令卡住了呢?
/home/xxzhang/workplace/software/java/jdk1.8.0_291/bin/java -jar /home/xxzhang/workplace/QBRC/somatic_script/picard.jar SortVcf --INPUT CosmicCodingMuts.hg19.vcf --OUTPUT CosmicCodingMuts.hg19_2.vcf --SEQUENCE_DICTIONARY /home/xxzhang/workplace/QBRC/geneome/hg19/hg19.dict
重新看一下,sortVCF具体的指令。
不行。
awk '{
chr_order="chrMnchr1nchr2nchr3nchr4nchr5nchr6nchr7nchr8nchr9nchr10nchr11nchr12nchr13nchr14nchr15nchr16nchr17nchr18nchr19nchr20nchr21nchr22nchrXnchrY"
cat "$0" | grep "^#" > .header.vcf
cat "$0" | grep -v "^#" | sort -k1,1 -k2,2n > .pre.sorted.vcf
echo -e $chr_order | while read line
do
cat .pre.sorted.vcf | grep "^$line"$'t' >> .header.vcf
done
mv .header.vcf sorted.vcf && rm .header.vcf .pre.sorted.vcf
}' CosmicCodingMuts.hg19.vcf
awk '/^#/{print}!/^#/{exit}' CosmicCodingMuts.hg19.vcf ;
|sed '/^#/d’'CosmicCodingMuts.hg19.vcf
|awk -F"\t" '($1~/^[0-9]+$/){sub("^chr","",$0);print $0}'
|sort -k1,1n -k2,2n|awk '{print "chr"$0}' ;
|sed '/^#/d' CosmicCodingMuts.hg19.vcf
|awk -F"\t" '($1!~/^[0-9]+$/){sub("^chr","",$0);print $0}'
sort -k1,1d -k2,2n|awk '{print "chr"$0}' > sort.vcf
突然想到,我为什么要这么麻烦呢?因为vcf文档本身就是一个文本文档,为什么不能使用sort
指令,对其直接进行排序呢?
我们来看一下vcf文档。
chr1 137024 COSV70649793 G C . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN26194893;CDS=c.*909+597C>G;AA=p.?;HGVSC=ENST00000423372.3:c.*909+597C>G;HGVSG=1:g.137024G>C;CNT=1
chr1 137069 COSV70649555 G A . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN23945201;CDS=c.*909+552C>T;AA=p.?;HGVSC=ENST00000423372.3:c.*909+552C>T;HGVSG=1:g.137069G>A;CNT=1
chr1 137080 COSV70649318 A G . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN8964589;CDS=c.*909+541T>C;AA=p.?;HGVSC=ENST00000423372.3:c.*909+541T>C;HGVSG=1:g.137080A>G;CNT=1
chr1 137109 COSV70649279 G A . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN21849789;SNP;CDS=c.*909+512C>T;AA=p.?;HGVSC=ENST00000423372.3:c.*909+512C>T;HGVSG=1:g.137109G>A;CNT=1
chr1 137121 COSV70649283 G A . . GENE=AL627309.1;STRAND=-;LEGACY_ID=COSN22491822;CDS=c.*909+500C>T;AA=p.?;HGVSC=ENST00000423372.3:c.*909+500C>T;HGVSG=1:g.137121G>A;CNT=1
所以,从这个角度来看,我们按照第一列,按照数值的大小递增排序,第二列,同样按照数值的大小递增排序即可。
一直没有找到比较合适的参数(也是自己的基础比较薄弱)。
糟糕的心情,换一张纸,重新写。