The importance of error correction for third-generation (long-read) sequencing is self-evident. The core advantage of third-generation data is read length; its one real drawback is the high error rate. Fortunately the errors are largely randomly distributed, so they can be handled algorithmically, which is why so many correction tools have been developed for third-generation data.
Error correction and assembly are inseparable: correction exists to serve assembly, and correcting purely for its own sake is pointless.
Current algorithms fall roughly into three classes: 1. self-correction of third-generation data; 2. correcting third-generation reads with second-generation reads; 3. hybrid correction using both.
Existing third-generation correction programs (PBcR supports both self-correction and short-read-based correction):
- MHAP self-correction (falcon also uses MHAP; SMRT's HGAP uses a different, slower self-correction algorithm; the core of self-correction is multiple sequence alignment)
- CCS
- Quiver,Arrow
- Sparc,PBdagcon
- PacBioToCA (correcting an earlier statement: this is a short-read-corrects-long-read algorithm)
- Proovread
- LoRDEC
- Ectools
- LSC
1.PacBioToCA in practice
Quick questions:
- What does PBcR stand for? The PacBio Corrected Reads (PBcR) pipeline.
- Who developed PBcR? Not PacBio itself, but the Center for Bioinformatics and Computational Biology at the University of Maryland.
- What does the PBcR program look like? As noted above, PBcR is just the name of a pipeline. The actual program is the Whole-Genome Shotgun Assembler (here version wgs-8.3rc2). It consists almost entirely of Perl scripts, most of which call other sub-programs (SAMtools, Jellyfish, pbutgcns, pbdagcon, BLASR, FALCON, etc.).
- How is PacBioToCA invoked? PacBioToCA is just one function inside PBcR; run "./PBcR" with the appropriate parameters to use it (see below).
- PBcR vs. HGAP vs. SMRT? PBcR vs. CA vs. Canu? Celera Assembler, Canu and PBcR can essentially be regarded as different versions of the same thing; HGAP is PacBio's official assembly pipeline; SMRT is short for single-molecule real-time sequencing.
PacBioToCA is a sub-program of the PBcR package dedicated to correcting third-generation reads with second-generation reads. As the name suggests, it converts PacBio (PB) reads into a format that CA (the Celera Assembler) can use for subsequent assembly.
Below is the self-correction script:
source /ifs4/BC_PUB/biosoft/pipeline/DNA/DNA_Denovo/PacBio/WGS-8.3/wgs-8.3rc2/Linux-amd64/bin/sampleData/setup.sh
/ifs4/BC_PUB/biosoft/pipeline/DNA/DNA_Denovo/PacBio/WGS-8.3/wgs-8.3rc2/Linux-amd64/bin/PBcR_V2 -sensitive -length -partitions -l chr22_60X -s ./pacbio.SGE.spec -threads -genomeSize -maxGap -fastq /ifs4/BC_RD/USER/lizhixin/my_project/PacBio_reads/PB_chr22.fastq
pacBioToCA github
/tools/wgs-7.0/Linux-amd64/bin/fastqToCA -libraryname illumina -technology illumina -reads illumina.fastq > illumina.frg
/tools/wgs-7.0/Linux-amd64/bin/pacBioToCA -length -partitions -l ec_pacbio -t -s pacbio.spec \
-fastq pacbio.filtered_subreads.fastq illumina.frg > run.out 2>&1
PBcR at the University of Maryland
PBcR SourceForge
pacBioToCA wiki (three papers, other links)
PacBio sequence error correction and assembly via pacBioToCA
Local path: E:\desktop\项目\*\二三代联合纠错 文章
Blog walkthrough: the PBcR correction and assembly algorithms
Self-correction requirements: at least ~15X coverage seems to be the current minimum; below that, self-correction is impossible. Self-correction can apparently resolve chimeras, but its core drawback is that the PB reads get shorter and roughly half of the data is lost.
The above is only my initial understanding; for details, read the official papers and experiment yourself.
The strength of the PBcR family is the many powerful long-read algorithms it has developed; in theory, the weaknesses of third-generation data can be compensated for algorithmically, and PBcR puts that into practice.
MHAP - MinHash Alignment Process (MHAP, pronounced MAP): locality-sensitive hashing to detect long-read overlaps and utilities
MinHash - a form of locality-sensitive hashing (LSH) that can quickly estimate the similarity of two sets.
In PBcR, MinHash is used to find overlaps quickly. Functionally it plays the same role as DALIGNER, but the underlying algorithm is different; clearly neither performs direct all-vs-all pairwise alignment, whose complexity would be unacceptable.
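As a rough illustration of the idea (not MHAP's actual implementation; the salted-MD5 "hash functions" here are my own stand-in): each read is reduced to a small signature of minimum hash values over its k-mer set, and the fraction of matching signature slots estimates the Jaccard similarity of the two k-mer sets, without ever comparing the sets directly.

```python
import hashlib

def kmers(seq, k=5):
    """All overlapping k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(kmer_set, num_hashes=64):
    """One slot per 'hash function': the minimum hash value over the set.
    Each function is simulated by salting MD5 with the function index."""
    return [
        min(int(hashlib.md5(f"{i}:{km}".encode()).hexdigest(), 16)
            for km in kmer_set)
        for i in range(num_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching slots is an unbiased estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two reads with many shared k-mers get many identical minimum values, so overlap candidates can be found by comparing short signatures instead of full sequences.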
The deeper you dig, the more impressive algorithms you find:
Complete-Striped-Smith-Waterman-Library (impressive! someone implemented and packaged this long ago, which is why most programming today is just combining and wrapping other people's programs)
As you can see, many of the powerful algorithms in bioinformatics come from computer science, so keep an eye on the latest CS algorithms; one of them may solve an open algorithmic problem in bioinformatics.
There are only so many algorithms in data mining, machine learning and text processing; read enough of them and they all converge, and bit by bit you can master them all.
That something as abstract yet as real and beautiful as mathematics should appear in an empty universe: is that not a trace left by a creator?
2.CCS
Circular consensus sequencing
PacificBiosciences/unanimity - Consensus library and applications (this software can run CCS directly)
This is easy to understand once you grasp the basic principle of third-generation sequencing.
Third-generation sequencing works on a completely different principle from second-generation. Illumina uses bridge PCR and sequencing-by-synthesis, reading a single DNA strand. PacBio library prep instead circularizes the double-stranded DNA, and the polymerase sequences around the circle; if the insert is short, it is read in multiple passes, and since those passes are repeated sequencing of the same insert, we can compute a circular consensus sequence (CCS) from them.
How do CCS reads differ from subreads? They are not the same: CCS is a correction process, while the subreads are the raw sequencing output.
ccs takes multiple reads of the same SMRTbell sequence and combines them, employing a statistical model, to produce one high quality consensus sequence.
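To make that concrete, here is a toy consensus over multiple passes. The real ccs tool aligns the passes and applies a statistical model; this sketch assumes the passes are already aligned to equal length and simply takes a per-column majority vote.

```python
from collections import Counter

def toy_ccs(passes):
    """Toy circular consensus: per-column majority vote over multiple passes
    of the same insert. Assumes the passes are pre-aligned to equal length;
    the real ccs uses alignment plus a statistical model instead."""
    assert len({len(p) for p in passes}) == 1, "passes must be equal length"
    consensus = []
    for column in zip(*passes):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)
```

With random errors, each pass errs in different columns, so the vote recovers the true base almost everywhere; that is exactly why more passes give higher CCS quality.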
3.Quiver & Arrow
PacificBiosciences/GenomicConsensus - the tools to use
bax2bam -f test.fofn -o subreads2
pbalign subreads.bam ref.fa mapped.bam
source setup.sh
samtools faidx ref.fa
arrow --algorithm=arrow -v -j8 mapped.bam -r ref.fa -o ref.arrowed.fq
Quiver is the legacy consensus model based on a conditional random field approach.(HGAP final "assembly polishing" step)
Arrow is an improved consensus model based on a more straightforward hidden Markov model approach.
If you are interested, study these two programs closely: both are written in pure Python, and between them they involve two important algorithms (conditional random fields and hidden Markov models).
4.Sparc & PBdagcon
PacificBiosciences/pbdagcon - A sequence consensus algorithm implementation based on using directed acyclic graphs to encode multiple sequence alignment
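A heavily simplified sketch of the DAG-consensus idea (the real pbdagcon builds the graph incrementally from pairwise alignments; here a gapped MSA is assumed as input and all names are mine): each read traces a path of (column, symbol) nodes, node weights count the reads supporting them, and the consensus keeps the heaviest symbol per column, dropping columns where the gap symbol wins.

```python
from collections import defaultdict

def dag_consensus(msa_rows):
    """Simplified pbdagcon-style consensus over a gapped MSA.
    Nodes are (column, symbol); weight = number of reads passing through."""
    ncols = len(msa_rows[0])
    weights = defaultdict(int)
    for row in msa_rows:
        for col, sym in enumerate(row):
            weights[(col, sym)] += 1
    out = []
    for col in range(ncols):
        # heaviest symbol at this column; '-' winning means most reads skip it
        sym = max("ACGT-", key=lambda s: weights[(col, s)])
        if sym != "-":
            out.append(sym)
    return "".join(out)
```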
A quick note on Sparc as well:
source setup.sh
blasr query.fa ref.fa -bestn -m -out mapped.m5
Sparc m mapped.m5 b ref.fa k c g t 0.2 o consensus.fa
5.Proovread
proovread – github
Published paper: proovread: large-scale high-accuracy PacBio correction through iterative short read consensus - 2014
In other words: large-scale, high-accuracy PacBio error correction through iterative short-read consensus.
This software improves on PacBioToCA and LSC: PacBioToCA loses >40% of the data, requires installing CA and must run on a cluster, while LSC was developed mainly for correcting human transcriptome data.
a new SMRT sequencing correction pipeline:
- runs on ordinary computers as well as on clusters
- applies to different settings (genome, transcriptome)
- loses no accuracy, read length or data volume
Box 1: simplified representation of the correction-by-short-read-consensus approach. Short reads (dark green bars) are aligned to the error-rich, chimera-prone long read (blue bar). Black strokes mark sequencing errors and unaligned positions; regions of high error rate prevent short reads from aligning there. Base qualities are shown below as the light green curve. During correction, most errors are removed and likely chimeric sites are broken (a big caveat: this cannot be assumed to work so sweepingly). Base qualities are inferred from the coverage and the consensus composition at each site. The processed reads and the chimera annotations are written to file.
Box 2: reads are trimmed using a quality cutoff and the chimera annotations, yielding proovread's primary output: high-accuracy long reads.
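The Box 2 step can be sketched roughly as follows (function name, thresholds and the exact trimming policy are my assumptions, not proovread's actual code): split the corrected read at annotated chimera breakpoints, then within each piece keep every sufficiently long run of bases whose quality stays above the cutoff.

```python
def trim_read(seq, quals, breakpoints, min_q=20, min_len=50):
    """Hypothetical Box-2-style trimming: cut at chimera breakpoints, then
    keep each maximal run of bases with quality >= min_q that is at least
    min_len long. Returns the list of retained segments."""
    segments = []
    starts = [0] + list(breakpoints)
    ends = list(breakpoints) + [len(seq)]
    for s, e in zip(starts, ends):
        run_start = None
        for i in range(s, e + 1):
            ok = i < e and quals[i] >= min_q
            if ok and run_start is None:
                run_start = i            # a high-quality run begins
            elif not ok and run_start is not None:
                if i - run_start >= min_len:
                    segments.append(seq[run_start:i])
                run_start = None         # the run ends here
    return segments
```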
Detecting chimera breakpoints
The implementation of the theoretical model strongly depends on the used mapping software.
As default, proovread uses SHRiMP2 (David et al., 2011) for mapping. Its versatile interface allowed us to completely implement the hybrid scoring model with the following parameters: insertions are the most frequent errors and are penalized as gap open with –1. Deletions occur about half as often and are thus penalized with –2. Extensions for insertions and deletions are scored with –3 and –4, respectively. Mismatches are at least 10 times as rare, resulting in a penalty of –11 (Supplementary Table S1). All results presented here have been generated using these settings with SHRiMP2 version 2.2.3.
SHRiMP - SHort Read Mapping Package homepage
SHRiMP2 manual
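The asymmetric scoring quoted above can be written as a Gotoh-style dynamic program with separate insertion and deletion states. The gap and mismatch penalties below are the ones from the paper; the +10 match score is my assumption (it is not stated in the quote).

```python
NEG = float("-inf")

def hybrid_align_score(a, b, match=10, mismatch=-11,
                       ins_open=-1, ins_ext=-3, del_open=-2, del_ext=-4):
    """Global alignment score under an asymmetric affine gap model:
    insertions (extra bases in a) open cheaper than deletions, as in the
    proovread/SHRiMP2 settings quoted above. Match score is assumed."""
    n, m = len(a), len(b)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends in match/mismatch
    I = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends in gap in b (insertion)
    D = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends in gap in a (deletion)
    M[0][0] = 0
    for i in range(1, n + 1):
        I[i][0] = ins_open + (i - 1) * ins_ext
    for j in range(1, m + 1):
        D[0][j] = del_open + (j - 1) * del_ext
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1]) + s
            I[i][j] = max(M[i - 1][j] + ins_open, I[i - 1][j] + ins_ext)
            D[i][j] = max(M[i][j - 1] + del_open, D[i][j - 1] + del_ext)
    return max(M[n][m], I[n][m], D[n][m])
```

With mismatches so expensive (-11) relative to gap opens (-1/-2), the optimal alignment prefers short indels over substitutions, which matches the PacBio error profile the parameters were tuned for.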
6.LoRDEC
LoRDEC: a hybrid error correction program for long, PacBio reads
Published paper: LoRDEC: accurate and efficient long read error correction - 2014
We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy.
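A toy version of the core idea (LoRDEC actually uses a succinct GATB-based de Bruijn graph; the function names and the breadth-first search here are mine): build the graph from the short reads, then search it for a path connecting the two solid anchors that flank an erroneous region of the long read. That path is the corrective sequence.

```python
from collections import defaultdict, deque

def build_dbg(short_reads, k=4):
    """de Bruijn graph of the short reads: (k-1)-mer node -> set of next bases."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[-1])
    return graph

def bridge(graph, left, right, max_steps=30):
    """BFS for a sequence starting at solid anchor `left` ((k-1)-mer) and
    ending at solid anchor `right`, walking the graph one base at a time.
    Returns the corrective sequence, or None if no path exists."""
    km1 = len(left)
    queue = deque([left])
    while queue:
        path = queue.popleft()
        if path[-km1:] == right:
            return path
        if len(path) - km1 >= max_steps:
            continue  # bound path length so cycles cannot loop forever
        for base in sorted(graph.get(path[-km1:], ())):
            queue.append(path + base)
    return None
```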
7.Ectools in practice
GitHub source: Ectools - tools for error correction and working with long read data (with detailed step-by-step instructions)
ECTools homepage
Published paper: Error correction and assembly complexity of single molecule sequencing reads - 2014
Just a few dozen simple Python scripts.
Overall, this correction algorithm uses unitigs (assembled from the second-generation reads) to correct the long third-generation reads:
- Assemble the second-generation reads into unitigs; output organism.utg.fasta
- Create a working directory: mkdir organism_correct; create a symlink: ln -s /path/to/organism.utg.fasta
- Filter out long reads shorter than 1 kb, making sure more than 20X of data remains for correction
- Partition the long reads: python ${ECTOOLS_HOME}/partition.py 20 500 pbreads.length_filtered.fa
- Copy the correction script into the working directory: cp ${ECTOOLS_HOME}/correct.sh . ; edit the global variables in correct.sh
- Install nucmer
- Run each partition: $> for i in {0001..000N}; do cd $i; qsub -cwd -j y -t 1:${NUM_FILES_PER_PARTITION} ../correct.sh; cd ..; done
- When all jobs finish, merge the corrected output: cat ????/*.cor.fa > organism.cor.fa
- Use convert-fasta-to-v2.pl to make a Celera frg file from organism.cor.fa (optional)
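The partitioning step can be sketched like this (my reading of the two partition.py arguments as files-per-directory and reads-per-file is an assumption, as is everything named below): spread the long reads over numbered directories of small FASTA files so each file can become one grid-engine task, matching the ????/*.cor.fa layout merged above.

```python
import os

def partition_reads(records, files_per_dir=20, reads_per_file=500, outdir="."):
    """Hypothetical partitioner: write (name, seq) records into directories
    0001, 0002, ..., each holding up to files_per_dir FASTA files of
    reads_per_file reads apiece."""
    reads_per_dir = reads_per_file * files_per_dir
    for i, (name, seq) in enumerate(records):
        d = os.path.join(outdir, f"{i // reads_per_dir + 1:04d}")
        os.makedirs(d, exist_ok=True)
        fname = os.path.join(
            d, f"p{(i // reads_per_file) % files_per_dir + 1:04d}.fa")
        with open(fname, "a") as fh:   # append: several reads share a file
            fh.write(f">{name}\n{seq}\n")
```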
From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction.
8.LSC
LSC - a long read error correction tool (for RNA-Seq)
Published paper: Improving PacBio Long Read Accuracy by Short Read Alignment
LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy.
The workflow of standard LSC, and an outline of error correction based on the HC transformation.
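Homopolymer compression itself is simple; a minimal sketch (function names are mine): collapse every run of identical bases to a single base, keeping the run lengths so positions can be mapped back after alignment. This helps because indels in homopolymer runs are the dominant PacBio error type.

```python
from itertools import groupby

def hc_compress(seq):
    """Collapse each homopolymer run to one base; also return the run
    lengths so the transformation is invertible."""
    bases, runs = [], []
    for base, run in groupby(seq):
        bases.append(base)
        runs.append(len(list(run)))
    return "".join(bases), runs

def hc_decompress(compressed, runs):
    """Inverse transform: re-expand each base to its original run length."""
    return "".join(b * n for b, n in zip(compressed, runs))
```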
A Biostars thread well worth a careful read: "Question: What tools you use or know for PacBio Long Read error correction?"
The thread covers the following tools (and will keep you reading for quite a while):
Pacific Biosciences - the PB GitHub home base; Bioinformatics Workshop - PB pipelines
PacBio RS - curated Zhihu thread
Identify adapter sequences in pacbio reads; BBMap - BBMap short read aligner, and other bioinformatic tools.
BBTools - homepage
The SeqAnswers PacBio forum: http://seqanswers.com/forums/archive/index.php/f-39.html
To be continued~