Error-correction tools: proovread

BioInf-Wuerzburg/proovread - GitHub

These notes walk through the proovread paper to understand how the tool works internally.


Original paper: proovread: large-scale high-accuracy PacBio correction through iterative short read consensus





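Over the past decade, second-generation sequencing has rewritten the history of sequencing. Today, a single run of a HiSeq2500 can generate as much as 600Gb high-quality output data, which covers a human genome 200-fold. However, the reads are too short to assemble well, especially in repetitive regions. As a result, many short-read (SR) assemblers have appeared, such as ALLPATHS-LG (Gnerre et al., 2011), the Celera Assembler (Miller et al., 2008; Myers et al., 2000) and SOAPdenovo (Li et al., 2010).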

Repeats longer than the SRs cannot be resolved. The best current assembly strategy combines short reads with long-insert libraries and additional fosmid sequencing.

Then SMRT (single-molecule real-time) sequencing arrived. With the latest chemistry, this approach delivers reads44 kb. It is also free of sequence bias. Pacific Biosciences' third-generation sequencer, PacBio RS II, generates to date up to 400Mb per sequencing run.

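The accuracy of long reads (LRs) is low: about 99% for second-generation reads versus only 80%-85% for third-generation reads. The error profiles also differ: Although Illumina reads mainly contain miscalled bases with increasing frequency toward read ends, SMRT generates primarily insertions (10%) and deletions (5%) in a random pattern (Ross et al., 2013). SMRT can produce circular consensus sequences (CCS), but this also shortens the reads and thereby gives up the main advantage of third-generation sequencing.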


(i) The hierarchical genome-assembly process (HGAP) uses shorter SMRT reads contained within longer reads to generate pre-assemblies and to calculate consensus sequences (Chin et al., 2013). (Drawback: it needs a coverage of 80 to 100.)

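(ii) PacBioToCA (Koren et al., 2012) and LSC (Au et al., 2012) use Illumina SRs in a hybrid approach to correct SMRT reads. These approaches result in higher quality LRs. (Drawbacks: they need large computational resources; PacBioToCA lost >40% of the data; LSC handles transcriptome data only; the WGS integration makes it awkward to invoke.)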


(i) run on standard computers as well as computer grids and

(ii) can be easily adapted to different use cases.

Obviously, these objectives should not be at the cost of accuracy, length of corrected reads or throughput.


Mapping: sensitive and trusted hybrid alignments



(i) The expected error rates for SMRT sequencing are 10% for insertions and up to 5% for deletions (Ono et al., 2013; Ross et al., 2013). Thus, the costs for gaps in the LR, which correspond to deletions, are about twice as high as for gaps in the SR, which represent insertions.

(ii) Substitutions are comparatively rare (1%). This is reflected by a mismatch penalty of at least 10 times the cost of SR gaps.
(iii) The distribution of SMRT sequencing errors is random. Hence, contrasting to biological scenarios, continuous insertions or deletions are less likely, resulting in higher costs for gap extension than for gap opening.

proovread uses SHRiMP2 as its preferred mapper. Its versatile interface allowed us to completely implement the hybrid scoring model with the following parameters: insertions are the most frequent errors and are penalized as gap open with -1. Deletions occur about half as often and are thus penalized with -2. Extensions for insertions and deletions are scored with -3 and -4, respectively. Mismatches are at least 10 times as rare, resulting in a penalty of -11 (Supplementary Table S1).
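To make the scoring model concrete, here is a minimal Python sketch that scores an alignment given as a list of (operation, length) pairs. The penalties are the ones quoted above. The match reward `MATCH`, the function name `hybrid_score`, and the rule that a gap of length n costs one gap-open plus n extensions are assumptions made for illustration; aligners differ on that last convention, and this is not SHRiMP2's actual code.

```python
# Penalties quoted in the text above.
MISMATCH = -11
GAP_OPEN = {"I": -1, "D": -2}    # insertion / deletion gap open
GAP_EXTEND = {"I": -3, "D": -4}  # insertion / deletion gap extension

# ASSUMPTION: the match reward is not given in these notes; 5 is a placeholder.
MATCH = 5


def hybrid_score(ops):
    """Score an alignment described as (op, length) pairs.

    op is 'M' (match), 'X' (mismatch), 'I' (insertion) or 'D' (deletion).
    ASSUMPTION: a gap of length n costs open + n * extend.
    """
    score = 0
    for op, n in ops:
        if n <= 0:
            raise ValueError(f"non-positive length for op {op!r}")
        if op == "M":
            score += MATCH * n
        elif op == "X":
            score += MISMATCH * n
        elif op in GAP_OPEN:
            score += GAP_OPEN[op] + GAP_EXTEND[op] * n
        else:
            raise ValueError(f"unknown op {op!r}")
    return score


if __name__ == "__main__":
    # 10 matches, a 1-base insertion, 10 more matches
    print(hybrid_score([("M", 10), ("I", 1), ("M", 10)]))
```

With these numbers a single mismatch costs far more than a short gap, which mirrors the observation that substitutions are rare in SMRT reads while indels are common.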



Bowtie2 is supported as a second choice. However, corrections using Bowtie2 lagged behind owing to a limited set of parameters regarding scoring and sensitivity. The short reads can optionally be trimmed beforehand (sickle, https://github.com/najoshi/sickle) or error-corrected (Quake).

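Mapping has to distinguish true alignments from spurious ones: repetitive regions cause reads to pile up, and sequencing errors also affect the alignment scores. We therefore assess length normalized scores in a localized context.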

This introduces the concept of bins: LRs are internally represented by a consecutive series of small bins.

Only the highest scoring alignments of each bin, not the overall highest scoring alignments, up to the specified coverage cutoff are considered for the next step—the calculation of the consensus sequence.
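The bin-wise selection can be sketched roughly as follows. This is an illustrative sketch, not proovread's implementation: assigning an alignment to a bin by its midpoint, the bin size of 20, expressing the coverage cutoff as a number of alignments per bin (`max_per_bin`), and the function name `select_per_bin` are all assumptions.

```python
from collections import defaultdict


def select_per_bin(alignments, bin_size=20, max_per_bin=2):
    """Keep the best alignments in each bin of a long read.

    alignments: iterable of (start, end, score) in long-read coordinates,
    with end exclusive.
    ASSUMPTIONS: an alignment belongs to the bin containing its midpoint,
    and the coverage cutoff is a simple count of alignments per bin.
    """
    bins = defaultdict(list)
    for start, end, score in alignments:
        length = end - start
        if length <= 0:
            continue  # skip degenerate alignments
        normalized = score / length  # length-normalized score
        bin_index = ((start + end) // 2) // bin_size
        bins[bin_index].append((normalized, start, end, score))

    kept = []
    for bin_index in sorted(bins):
        # rank locally, within the bin, not across the whole read
        ranked = sorted(bins[bin_index], key=lambda a: a[0], reverse=True)
        kept.extend((s, e, sc) for _, s, e, sc in ranked[:max_per_bin])
    return kept


if __name__ == "__main__":
    alns = [(0, 10, 50), (2, 12, 30), (4, 14, 40), (40, 50, 10)]
    print(select_per_bin(alns))
```

Because the ranking is local, a low-scoring alignment in a sparsely covered bin survives, while a pile-up in a repeat region is cut down to its best-scoring members.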

Consensus call with quality computation and chimera detection

Quality and chimera trimming

untrimmed corrected LRs (isn't this exactly the final result we get?)


Iterative correction

Addresses the problem that correction is computationally demanding and time consuming.

Configuration and customization

The settings include scoring schemes, binning, masking, iteration procedure and post-processing.

Scalability and parallelization




