实验记录 | mutect问题详解:No tribble type was provided on the command line and the type of the file?

出错详情:

/home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java -Djava.io.tmpdir=./output_RNA/mutmp -Xmx31g -jar /home/xxzhang/workplace/QBRC//somatic_script/mutect-1.1.7.jar --analysis_type MuTect --reference_sequence ./geneome/hg19/hg19.fa --dbsnp ./geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf --cosmic ./geneome/hg19/hg19.fa_resource/CosmicCodingMuts.hg19.vcf  --input_file:tumor /home/xxzhang/workplace/QBRC/output_RNA/tumor/tumor.bam --input_file:normal /home/xxzhang/workplace/QBRC/output_RNA/normal/normal.bam --vcf ./output_RNA/mutect.vcf --out ./output_RNA/mutect.out

报出错误:

ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
ERROR Name FeatureType Documentation
ERROR BCF2 VariantContext (this is an external codec and is not documented within GATK)
ERROR VCF VariantContext (this is an external codec and is not documented within GATK)
ERROR VCF3 VariantContext (this is an external codec and is not documented within GATK)

我觉得可能是command的书写不和要求,导致程序无法决断输出文件的格式。
我们来看一下mutect的help文档的内容。
/home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java -Xmx31g -jar /home/xxzhang/workplace/QBRC//somatic_script/mutect-1.1.7.jar --analysis_type MuTect --help
关于mutect的详细信息,
也可参见链接
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_cga_tools_gatk_walkers_cancer_mutect_MuTect.html
(这个链接是打不开的,这个html文件实际存放于gatk的安装包中)

-T,–analysis_type <analysis_type> Name of the tool to run
-args,–arg_file <arg_file> Reads arguments from the
specified file
-I,–input_file <input_file> Input file containing sequence
data (SAM or BAM)
-rbs,–read_buffer_size <read_buffer_size> Number of reads per SAM file to
buffer in memory
-et,–phone_home <phone_home> Run reporting mode (NO_ET|AWS|
STDOUT)
-K,–gatk_key <gatk_key> GATK key file required to run
with -et NO_ET
-tag,–tag Tag to identify this GATK run
as part of a group of runs
-rf,–read_filter <read_filter> Filters to apply to reads
before analysis
-L,–intervals One or more genomic intervals
over which to operate
-XL,–excludeIntervals One or more genomic intervals
to exclude from processing
-isr,–interval_set_rule <interval_set_rule> Set merging approach to use for
combining interval inputs
(UNION|INTERSECTION)
-im,–interval_merging <interval_merging> Interval merging rule for
abutting intervals (ALL|
OVERLAPPING_ONLY)
-ip,–interval_padding <interval_padding> Amount of padding (in bp) to
add to each interval
-R,–reference_sequence <reference_sequence> Reference sequence file
-ndrs,–nonDeterministicRandomSeed Use a non-deterministic random
seed
-maxRuntime,–maxRuntime Stop execution cleanly as soon
as maxRuntime has been reached
-maxRuntimeUnits,–maxRuntimeUnits Unit of time used by maxRuntime
(NANOSECONDS|MICROSECONDS|
MILLISECONDS|SECONDS|MINUTES|
HOURS|DAYS)
-dt,–downsampling_type <downsampling_type> Type of read downsampling to
employ at a given locus (NONE|
ALL_READS|BY_SAMPLE)
-dfrac,–downsample_to_fraction <downsample_to_fraction> Fraction of reads to downsample
to
-dcov,–downsample_to_coverage <downsample_to_coverage> Target coverage threshold for
downsampling to coverage
-baq,–baq Type of BAQ calculation to
apply in the engine (OFF|
CALCULATE_AS_NECESSARY|
RECALCULATE)
-baqGOP,–baqGapOpenPenalty BAQ gap open penalty
-fixMisencodedQuals,–fix_misencoded_quality_scores Fix mis-encoded base quality
scores
-allowPotentiallyMisencodedQuals,–allow_potentially_misencoded_quality_scores Ignore warnings about base
quality score encoding
-OQ,–useOriginalQualities Use the base quality scores
from the OQ tag
-DBQ,–defaultBaseQualities Assign a default base quality
-PF,–performanceLog Write GATK runtime performance
log to this file
-BQSR,–BQSR Input covariates table file for
on-the-fly base quality score
recalibration
-DIQ,–disable_indel_quals Disable printing of base
insertion and deletion tags
(with -BQSR)
-EOQ,–emit_original_quals Emit the OQ tag with the
original base qualities (with
-BQSR)
-preserveQ,–preserve_qscores_less_than <preserve_qscores_less_than> Don’t recalibrate bases with
quality scores less than this
threshold (with -BQSR)
-globalQScorePrior,–globalQScorePrior Global Qscore Bayesian prior to
use for BQSR
-S,–validation_strictness <validation_strictness> How strict should we be with
validation (STRICT|LENIENT|
SILENT)
-rpr,–remove_program_records Remove program records from the
SAM header
-kpr,–keep_program_records Keep program records in the SAM
header
-sample_rename_mapping_file,–sample_rename_mapping_file <sample_rename_mapping_file> Rename sample IDs on-the-fly at
runtime using the provided
mapping file
-U,–unsafe Enable unsafe operations:
nothing will be checked at
runtime (ALLOW_N_CIGAR_READS|
ALLOW_UNINDEXED_BAM|
ALLOW_UNSET_BAM_SORT_ORDER|
NO_READ_ORDER_VERIFICATION|
ALLOW_SEQ_DICT_INCOMPATIBILITY|
LENIENT_VCF_PROCESSING|ALL)
-nt,–num_threads <num_threads> Number of data threads to
allocate to this analysis
-nct,–num_cpu_threads_per_data_thread <num_cpu_threads_per_data_thread> Number of CPU threads to
allocate per data thread
-mte,–monitorThreadEfficiency Enable threading efficiency
monitoring
-bfh,–num_bam_file_handles <num_bam_file_handles> Total number of BAM file
handles to keep open
simultaneously
-rgbl,–read_group_black_list <read_group_black_list> Exclude read groups based on
tags
-ped,–pedigree Pedigree files for samples
-pedString,–pedigreeString Pedigree string for samples
-pedValidationType,–pedigreeValidationType Validation strictness for
pedigree information (STRICT|
SILENT)
-variant_index_type,–variant_index_type <variant_index_type> Type of IndexCreator to use for
VCF/BCF indices (DYNAMIC_SEEK|
DYNAMIC_SIZE|LINEAR|INTERVAL)
-variant_index_parameter,–variant_index_parameter <variant_index_parameter> Parameter to pass to the
VCF/BCF IndexCreator
-l,–logging_level <logging_level> Set the minimum level of
logging
-log,–log_to_file <log_to_file> Set the logging location
-h,–help Generate the help message
-version,–version Output version information
Arguments for MalformedReadFilter:
-filterRNC,–filter_reads_with_N_cigar filter out reads with CIGAR containing the N operator, instead of stop
processing and report an error.
-filterMBQ,–filter_mismatching_base_and_quals if a read has mismatching number of bases and base qualities, filter
out the read instead of blowing up.
-filterNoBases,–filter_bases_not_stored if a read has no stored bases (i.e. a ‘*’), filter out the read
instead of blowing up.
Arguments for MuTect:
–enable_extended_output add many additional columns of
statistics to the output file
–enable_qscore_output output quality scores of all
bases used for ref and alt
–artifact_detection_mode used when running the caller
on a normal (as if it were a
tumor) to detect artifacts
–tumor_sample_name <tumor_sample_name> name to use for tumor in
output files
–bam_tumor_sample_name <bam_tumor_sample_name> if the tumor bam contains
multiple samples, only use
read groups with SM equal to
this value
–normal_sample_name <normal_sample_name> name to use for normal in
output files
–force_output force output for each site
–force_alleles force output for all alleles
at each site
–only_passing_calls only emit passing calls
–initial_tumor_lod <initial_tumor_lod> Initial LOD threshold for
calling tumor variant
–tumor_lod <tumor_lod> LOD threshold for calling
tumor variant
–fraction_contamination <fraction_contamination> estimate of fraction (0-1) of
physical contamination with
other unrelated samples
–minimum_mutation_cell_fraction <minimum_mutation_cell_fraction> minimum fraction of cells
which are presumed to have a
mutation, used to handle
non-clonality and
contamination
–normal_lod <normal_lod> LOD threshold for calling
normal non-germline
–dbsnp_normal_lod <dbsnp_normal_lod> LOD threshold for calling
normal non-variant at dbsnp
sites
–minimum_normal_allele_fraction <minimum_normal_allele_fraction> minimum allele fraction to be
considered in normal, useful
for normal sample contaminated
with tumor
–tumor_f_pretest <tumor_f_pretest> for computational efficiency,
reject sites with allelic
fraction below this threshold
–min_qscore <min_qscore> threshold for minimum base
quality score
–gap_events_threshold <gap_events_threshold> how many gapped events
(ins/del) are allowed in
proximity to this candidate
–heavily_clipped_read_fraction <heavily_clipped_read_fraction> if this fraction or more of
the bases in a read are
soft/hard clipped, do not use
this read for mutation calling
–fraction_mapq0_threshold <fraction_mapq0_threshold> threshold for determining if
there is relatedness between
the alt and ref allele read
piles
–pir_median_threshold <pir_median_threshold> threshold for clustered read
position artifact median
–pir_mad_threshold <pir_mad_threshold> threshold for clustered read
position artifact MAD
–required_maximum_alt_allele_mapping_quality_score required minimum value for
<required_maximum_alt_allele_mapping_quality_score> tumor alt allele maximum
mapping quality score
–max_alt_alleles_in_normal_count <max_alt_alleles_in_normal_count> threshold for maximum
alternate allele counts in
normal
–max_alt_alleles_in_normal_qscore_sum <max_alt_alleles_in_normal_qscore_sum> threshold for maximum
alternate allele quality score
sum in normal
–max_alt_allele_in_normal_fraction <max_alt_allele_in_normal_fraction> threshold for maximum
alternate allele fraction in
normal
–power_constant_qscore <power_constant_qscore> Phred scale quality score
constant to use in power
calculations
–power_constant_af <power_constant_af> Allelic fraction constant to
use in power calculations
-o,–out Call-stats output
-vcf,–vcf VCF output of mutation
candidates
-dbsnp,–dbsnp VCF file of DBSNP information
-cosmic,–cosmic VCF file of COSMIC sites
-normal_panel,–normal_panel <normal_panel> VCF file of sites observed in
normal
-cov,–coverage_file <coverage_file> write out coverage in WIGGLE
format to this file
-cov_q20,–coverage_20_q20_file <coverage_20_q20_file> write out 20x of Q20 coverage
in WIGGLE format to this file
-pow,–power_file <power_file> write out power in WIGGLE
format to this file
-tdf,–tumor_depth_file <tumor_depth_file> write out tumor read depth in
WIGGLE format to this file
-ndf,–normal_depth_file <normal_depth_file> write out normal read depth in
WIGGLE format to this file

最后出错的信息,有很大一部分来源于这下面的内容。

Available Reference Ordered Data types:
Name FeatureType Documentation
BCF2 VariantContext (this is an external codec and is not documented within GATK)
BEAGLE BeagleFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_beagle_BeagleCodec.html
BED BEDFeature (this is an external codec and is not documented within GATK)
BEDTABLE TableFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_table_BedTableCodec.html
EXAMPLEBINARY Feature (this is an external codec and is not documented within GATK)
GELITEXT GeliTextFeature (this is an external codec and is not documented within GATK)
OLDDBSNP OldDbSNPFeature (this is an external codec and is not documented within GATK)
RAWHAPMAP RawHapMapFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_hapmap_RawHapMapCodec.html
REFSEQ RefSeqFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_refseq_RefSeqCodec.html
SAMPILEUP SAMPileupFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_sampileup_SAMPileupCodec.html
SAMREAD SAMReadFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_samread_SAMReadCodec.html
TABLE TableFeature http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_table_TableCodec.html
VCF VariantContext (this is an external codec and is not documented within GATK)
VCF3 VariantContext (this is an external codec and is not documented within GATK)

我一时没怎么看明白,我看一下官网的实例。
我们这次所使用的mutect的版本是1.1.7。
参考链接:https://github.com/bcbio/bcbio-nextgen/issues/765
这个文档中显示,如果出现我们上述列出的错误,很可能的原因是,我们的输入文件是空的。

EDIT: scratch that. The reason is that some files are empty.
One of the files that caused the errors, that was *-vardict-multiallelic-decompose-vepeffects.vcf.gz, was in fact an empty file. And this made the GATK choke. Probably these files should be made as “dummy empty”, like other places in the pipeline.

sub fun_mutect{
  if ($pdx=~/mouse/) {$known=" ";} else {$known=" --cosmic ".$resource_cosmic." ";}
  system_call($java17." -Djava.io.tmpdir=".$mutect_tmp." -Xmx31g -jar ".$mutect." --analysis_type MuTect --reference_sequence ".$index.
    " --dbsnp ".$resource_dbsnp.$known." --input_file:tumor ".$tumor_bam." --input_file:normal ".$normal_bam.
    " --vcf ".$output."/mutect.vcf --out ".$output."/mutect.out");
}

我现在,还想把mutect的版本换一下,总觉得1.1.7这个版本太老了===>查看之前的安装记录,好复杂,放弃。
现在,我觉得问题应该是在input_file的声明上,按照官网的介绍:
(应该类如这样)

gatk Mutect2 \
   -R ref_fasta.fa \
   -I tumor.bam \
   -tumor tumor_sample_name \
   -I normal.bam \
   -normal normal_sample_name \

所以,将指令修改为:

/home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java -Djava.io.tmpdir=./output_RNA/mutmp -Xmx31g -jar /home/xxzhang/workplace/QBRC//somatic_script/mutect-1.1.7.jar --analysis_type MuTect --reference_sequence ./geneome/hg19/hg19.fa --dbsnp ./geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf --cosmic ./geneome/hg19/hg19.fa_resource/CosmicCodingMuts.hg19.vcf  --input_file /home/xxzhang/workplace/QBRC/output_RNA/tumor/tumor.bam  -tumor tumor --input_file /home/xxzhang/workplace/QBRC/output_RNA/normal/normal.bam  -normal normal --vcf ./output_RNA/mutect.vcf --out ./output_RNA/mutect.out

没啥意义。这个tumor/normal是针对于输出文件而言的。所以,整体上,这个尝试不对。

继续去谷歌上检索其他的解决方案。
(1)https://*.com/questions/19934021/vcf4-2-file-not-recognised-by-gatk
在这篇文章中,作者是这样说的:

I finally figured it out: It was something off the the VariantAnnotator vcf from GATK, I re-ran it and used the new file, I also deleted the old index file. The manual correction of the ESP VCF works! Its a bit boring, but it does the job. Hope this help anyone else getting this error! Replace all chromosome names manually using “find & replace” in Textedit. Its something about the formatting thats off.

翻译过来的意思就是说,
指的是,我们的两个参考的vcf文档还是有些问题对吗?

–dbsnp ./geneome/hg19/hg19.fa_resource/dbsnp.hg19.vcf
–cosmic ./geneome/hg19/hg19.fa_resource/CosmicCodingMuts.hg19.vcf

我们需要使用GATK的VariantAnnotator对其进行重新的处理。不得不说,这个gatk在格式方面非常的挑剔。我记得之前,我也遇到类似的棘手的问题。
我觉得我做这件事情,就像是欣赏景色一样,只是匆匆一瞥,却没有追究于其内部的使用了哪个文件,这个文件是什么样的?是不太称职的长官。

我们来看一下这两个vcf文件(其中的cosmic,我非常存疑。因为我们包括标记chr,排序,折腾了挺长的时间)。

检查了一下,终于找到了问题的核心,我的CosmicCodingMuts.hg19.vcf这个文件居然是空的。真的如之前的一个作者所说的,需要检查一下vcf文档是否为空。
但是,怎么会出现这种问题呢?我记得我之前是处理好的,到底发生了什么?
我需要检查一下,我在6.12号处理这一步骤的实验记录看一下。
我们有存档ComicCodingMuts.hg19_2.vcf的文件(已经添加chr,只是没有很好的排序)。好像还有一个ComicCodingMuts.hg19.vcf的文件(这个是不是我们需要的,只是忘记放到hg19.fa_resource的路径中了)。
我们将这个文件打开看一下。
grep -o -E "^\w+([-+.]\w+)" CosmicCodingMuts.hg19.vcf | uniq
得到的结果是:

chr1
chr10
chr11
……

可见这个文件并非我们想要的结果文档,将其重命名为ComicCodingMuts.hg19_1.vcf。
mv ComicCodingMuts.hg19.vcf ComicCodingMuts.hg19_1.vcf
然后接着ComicCodingMuts.hg19_1.vcf这个文档继续处理,对其进行sort排序。
sort -k1,1V -k2,2n CosmicCodingMuts.hg19_1.vcf >CosmicCodingMuts.hg19_3.vcf
程序开始运行中,重新调用指令。
grep -o -E "^\w+([-+.]\w)*" CosmicCodingMuts.hg19_3.vcf | uniq
得到结果:

chr1
chr2
chr3
chr4
chr5
chr6
……

我希望自己能够在今天中午吃饭前,把命令挂上去重新跑一遍(12:30)。
这个文件就是我们需要的。然后我们将其cp到hg19_resource的路径下。
cp CosmicCodingMuts.hg19_3.vcf ./hg19.fa_resource/
在hg19_resource的路径下,对其进行重新的命名,为CosmicCodingMuts.hg19.vcf。
这部分的操作结束之后。我们再回到QBRC的文件夹下,重新运行mutect的指令,看是否会重复同样的错误。
同样的错误,再次出现。
那么接下来,还有另一个线索:GATK的VariantAnnotator。===>我们有问题的可能是结果的文件,应该用处并不大。实在不行,给gatk发邮件问一下。

我们现在,修改somatic的指令,重新运行somatic.pl这个文件(看一下,shimmer的问题有没有得到解决===>我两天的工作,我心态很好的!)。

下面的任务:
(1)组会
(2)发邮件问作者,结果假阳性的影响因素。===>看看有没有相关的文献,对这个问题进行了思考。
===>汇报的时候有自己的想法。

(3)找到文章的数据,看一下,是否可以应用到我们的分析流程中。
(4)如何有效的汇报?
(5)理解每一个步骤的生物学意义,我什么时候能够像那个作者一样,学到那种程度呢?
(6)找一篇与mutation calling的验证相关的文献。

再次运行的时候,两个问题还是再次的出现了。
(1)问题1:

ERROR
/usr/bin/perl: symbol lookup error: /opt/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_handshake
Failed to run /home/xxzhang/workplace/software/Shimmer/shimmer.pl --counts!

https://www.cnblogs.com/dzqk/archive/2018/01/08/8236627.html

$p->{install_sets} =
    {
     core   => {
       lib     => $c->get('installprivlib'),
       arch    => $c->get('installarchlib'),
       bin     => $c->get('installbin'),
       script  => $c->get('installscript'),
       bindoc  => $bindoc,
       libdoc  => $libdoc,
       binhtml => $binhtml,
       libhtml => $libhtml,
     },
     site   => {
       lib     => $c->get('installsitelib'),
       arch    => $c->get('installsitearch'),
       bin     => $c->get('installsitebin')      || $c->get('installbin'),
       script  => $c->get('installsitescript')   ||
         $c->get('installsitebin') || $c->get('installscript'),
       bindoc  => $c->get('installsiteman1dir')  || $bindoc,
       libdoc  => $c->get('installsiteman3dir')  || $libdoc,
       binhtml => $c->get('installsitehtml1dir') || $binhtml,
       libhtml => $c->get('installsitehtml3dir') || $libhtml,
     },
     vendor => {
       lib     => $c->get('installvendorlib'),
       arch    => $c->get('installvendorarch'),
       bin     => $c->get('installvendorbin')      || $c->get('installbin'),
       script  => $c->get('installvendorscript')   ||
         $c->get('installvendorbin') || $c->get('installscript'),
       bindoc  => $c->get('installvendorman1dir')  || $bindoc,
       libdoc  => $c->get('installvendorman3dir')  || $libdoc,
       binhtml => $c->get('installvendorhtml1dir') || $binhtml,
       libhtml => $c->get('installvendorhtml3dir') || $libhtml,
     },
    };

/home/xxzhang/miniconda3/bin/perl /home/xxzhang/workplace/software/Shimmer/shimmer.pl --minqual 25 --ref /home/xxzhang/workplace/QBRC/geneome/hg19/hg19.fa /home/xxzhang/workplace/output/normal.bam /home/xxzhang/workplace/output/tumor.bam --outdir /home/xxzhang/workplace/output/

在自己的机器上跑shimmer就很顺畅。
perl /home/zxx/bin/shimmer.pl --minqual 25 --ref /media/zxx/TOSHIBA/QBRC/geneome/hg19/hg19.fa /media/zxx/TOSHIBA/normal.bam /media/zxx/TOSHIBA/tumor.bam --outdir /home/zxx/workplace/

/home/zxx/bin/shimmer.pl --counts --ref /media/zxx/TOSHIBA/QBRC/geneome/hg19/hg19.fa /media/zxx/TOSHIBA/normal.bam /media/zxx/TOSHIBA/tumor.bam --min_som_reads 10 --som_file /home/zxx/workplace//som_counts.txt --het_file /home/zxx/workplace//het_counts.txt --indel_file /home/zxx/workplace//indel_counts.txt --minqual 25 --minindel 10
Calling printCompCounts -bam1 /media/zxx/TOSHIBA/normal.bam -bam2 /media/zxx/TOSHIBA/tumor.bam -fasta /media/zxx/TOSHIBA/QBRC/geneome/hg19/hg19.fa -minqual 25
[mpileup] 2 samples in 2 input files
/home/zxx/bin/shimmer.pl --max_q 0.05 --test_fof /home/zxx/workplace//som_counts.fof --bh --vs_file /home/zxx/workplace//somatic_diffs.vs --vcf_file /home/zxx/workplace//somatic_diffs.vcf --outfile /home/zxx/workplace//som_counts.bh.txt /media/zxx/TOSHIBA/normal.bam /media/zxx/TOSHIBA/tumor.bam
Applying Benjamini-Hochberg correction to somatic change predictions with 105 tests. Results in /home/zxx/workplace//som_counts.bh.txt.
Applying Benjamini-Hochberg correction to somatic change predictions with 170 tests. Results in /home/zxx/workplace//indel_counts.bh.txt.

突然想到,卡在这样的一个问题上是很无聊的事情,如果不能够解决,不如就拜托其他同学解决。更加的省时省力。

上一篇:Caused by: java.io.IOException: On-disk size without header provided is 6, but block header contains


下一篇:rabbitmq安装报 Failed to start LSB: Enable AMQP service provided by RabbitMQ broker错误