Hybrid assembly with long and short reads improves discovery of gene family expansions


Abstract

Background

Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation.

Methods

We developed a hybrid assembly pipeline called “Alpaca” that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation.

Results

Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca showed the most agreement with a conspecific reference and predicted tandemly repeated genes absent from the other assemblies.

Conclusion

Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.

Background

Tandemly duplicated genes are important contributors to genomic and phenotypic variation both among and within species [1]. Clusters of tandemly duplicated genes have been associated with disease resistance [2], stress response [3], and other biological functions [4,5]. The analysis of tandem repeats in most organisms is confounded by their underrepresentation in genome assemblies constructed from short-read sequence data, typically Illumina reads, in which the reads are shorter than the repeats [6,7,8,9].

The ALLPATHS-LG software [10] overcomes some of the assembly limitations of short-read sequencing by a clever combination of Illumina paired-end reads from both short-insert and long-insert libraries. Applied to human and mouse genomes, the ALLPATHS assembler produced assemblies with more contiguity, as indicated by contig N50 and scaffold N50, than had been attainable from other short-read sequence assemblers. ALLPATHS also performs well on many other species [11,12]. The ALLPATHS assemblies approached the quality of Sanger-era assemblies by measures such as exon coverage and total genome coverage. However, the ALLPATHS assemblies captured only 40% of genomic segmental duplications present in the human and mouse reference assemblies [10]. Similarly, an ALLPATHS assembly of the rice (Oryza sativa Nipponbare) genome [13] was missing nearly 12 Mbp of the Sanger-era reference genome, including more than 300 Kbp of annotated coding sequence. These findings illustrate the potential for loss of repeat coding sequence in even the highest quality draft assemblies constructed exclusively from short-read sequence data.
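Because contig N50 and scaffold N50 are the contiguity measures cited above, a minimal sketch of how N50 is commonly computed may help. The function name and the toy lengths are illustrative assumptions, not taken from ALLPATHS or any assembly discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (illustrative values, not from any real assembly).
contigs = [100, 80, 60, 40, 20]
print(n50(contigs))  # prints 80
```

The same function applies to scaffold N50 when given scaffold lengths instead of contig lengths.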

Long-read sequencing offers great potential to improve genome assemblies. Read lengths from PacBio platforms (Pacific Biosciences, Menlo Park CA) vary but reach into the tens of kilobases [9]. The base call accuracy of individual reads is about 87% [14] and chimeras, i.e. falsely joined sequences, can occur within reads [15]. Although low base call accuracy and chimeric reads create challenges for genome assembly, these challenges can be addressed by a hierarchical approach [9] in which the reads are corrected and then assembled. The pre-assembly correction step modifies individual read sequences based on their alignments to other reads from any platform. The post-correction assembly step can use a long-read assembler such as Celera Assembler [16,17,18], Canu [19], HGAP [20], PBcR [21], MHAP [22], or Falcon [23]. Because most of the errors in PacBio sequencing are random, PacBio reads can be corrected by alignment to other PacBio reads, given sufficient coverage redundancy [24]. For example, phased diploid assemblies of two plant genomes and one fungal genome were generated by hierarchical approaches using 100X to 140X PacBio [25] and a human genome was assembled from 46X PacBio plus physical map data [23]. Despite the potential of long-read assembly, high coverage requirements increase cost and thereby limit applicability.

Several hybrid approaches use low-coverage PacBio to fill gaps in an assembly of other data. The ALLPATHS pipeline for bacterial genomes maps uncorrected long reads to the graph of an assembly in progress [26]. SSPACE-LongRead, also for bacterial genomes, maps long reads to contigs assembled from short reads [27]. PBJelly [28] maps uncorrected long reads to the sequence of previously assembled scaffolds and performs local assembly to fill the gaps. In tests on previously-existing assemblies of eukaryotic genomes, PBJelly was able to fill most of the intra-scaffold gaps between contigs using 7X to 24X long-read coverage [28]. These gap filling approaches add sequence between contigs but still rely on the contig sequences of the initial assemblies. As such, gap filling may not correct assembly errors such as missing segmental duplications or collapsed representations of tandemly duplicated sequence. Long reads that span both copies of a genomic duplication, including the unique sequences at the repeat boundaries, are needed during the initial contig assembly to avoid the production of collapsed repeats.

We developed a novel hybrid pipeline named Alpaca (ALLPATHS and Celera Assembler) that exploits existing tools to assemble Illumina short-insert paired-end short reads (SIPE), Illumina long-insert paired-end short reads (LIPE), and PacBio unpaired long reads. Unlike other approaches that use Illumina or PacBio sequencing for only certain limited phases of the assembly, Alpaca uses the full capabilities of the data throughout the entire assembly process: 1) contig structure is primarily formed by long reads that are error corrected by short reads, 2) consensus accuracy is maximized by the highly accurate base calls in Illumina SIPE reads, and 3) scaffold structure is enhanced by Illumina LIPE that can provide high-coverage connectivity at scales similar to the PacBio long reads. We targeted low-coverage, long-read data in order to make the pipeline a practical tool for non-model systems and for surveys of intraspecific structural variation.

We evaluated the performance of Alpaca using data from Oryza sativa Nipponbare (rice), assembling the genome sequence of the same O. sativa Nipponbare accession used to construct the 382 Mbp reference, which had been constructed using clone-by-clone assembly, Sanger-sequenced BAC ends, physical and genetic map integration, and prior draft assemblies [29]. We also sequenced and assembled three accessions of Medicago truncatula, a model legume, and compared these to the M. truncatula Mt4.0 reference assembly of the A17 accession [30]. The Mt4.0 reference had been constructed using Illumina sequencing, an ALLPATHS assembly, Sanger-sequenced BAC ends, a high-density linkage map, plus integration of prior drafts that integrated Sanger-based BAC sequencing and optical map technology [31].
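To give a sense of scale for the coverage targets stated in the abstract (20X long reads, about 50X SIPE, and about 50X LIPE), the arithmetic can be sketched as follows. Using the 382 Mbp reference size as a stand-in for the true genome size is an assumption, and the function and variable names are illustrative only.

```python
def bases_needed(genome_size_bp, coverage):
    """Total sequenced bases needed for a given fold coverage."""
    return genome_size_bp * coverage


# 382 Mbp rice reference size, used here as a proxy for genome size (assumption).
RICE_REFERENCE_BP = 382_000_000

# Coverage targets from the abstract: 20X long reads, ~50X SIPE, ~50X LIPE.
targets = {"PacBio long reads": 20, "Illumina SIPE": 50, "Illumina LIPE": 50}

for library, fold in targets.items():
    gbp = bases_needed(RICE_REFERENCE_BP, fold) / 1e9
    print(f"{library}: {fold}X ~ {gbp:.2f} Gbp")
```

Under these assumptions the long-read requirement comes to roughly 7.64 Gbp, compared with about 19.10 Gbp for each Illumina library.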

For the Medicago analyses, where no high quality reference sequence was available for the accessions whose genomes we assembled, we focused our evaluation on the performance of Alpaca on large multigene families that play important roles in plant defense (the NBS-LRR family) and in various regulatory processes involving cell to cell communications (the Cysteine-Rich Peptide, or CRP, gene family). Members of these multigene families are highly clustered: the reference genome of M. truncatula harbors more than 846 NBS-LRR genes, with approximately 62% of them in tandemly arrayed clusters, and 1415 annotated CRP genes, with approximately 47% of them in tandemly arrayed clusters. Resolving variation in gene clusters like these is crucial for identifying the contribution of copy number variation (CNV) to phenotypic variation as well as for understanding the evolution of complex gene families.
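The fraction of a gene family located in tandemly arrayed clusters, as quoted above, could be estimated from gene coordinates along the following lines. This is only a sketch under stated assumptions: the clustering rule (family members on the same chromosome separated by at most `max_gap_bp`) and the threshold value are hypothetical, since this section does not state the criteria used for the reference annotation.

```python
from collections import defaultdict


def fraction_in_tandem_clusters(genes, max_gap_bp):
    """Fraction of genes that sit in a cluster of two or more family members.

    genes: iterable of (chromosome, start, end) tuples for one gene family.
    max_gap_bp: hypothetical threshold; neighbours on the same chromosome
        whose gap is at most this value are placed in the same cluster.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in genes:
        by_chrom[chrom].append((start, end))

    total = 0
    clustered = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cluster_size = 0
        prev_end = None
        for start, end in intervals:
            total += 1
            if prev_end is not None and start - prev_end <= max_gap_bp:
                # Close enough to the current cluster: extend it.
                cluster_size += 1
                prev_end = max(prev_end, end)
            else:
                # Close the previous cluster and start a new one.
                if cluster_size >= 2:
                    clustered += cluster_size
                cluster_size = 1
                prev_end = end
        if cluster_size >= 2:
            clustered += cluster_size
    return clustered / total if total else 0.0


# Toy coordinates (illustrative, not real M. truncatula annotations).
toy_genes = [
    ("chr1", 100, 200),
    ("chr1", 250, 400),
    ("chr1", 5000, 5100),
    ("chr2", 10, 90),
]
print(fraction_in_tandem_clusters(toy_genes, max_gap_bp=100))  # prints 0.5
```

This simple rule ignores any unrelated genes lying between family members, so a real analysis would likely need a stricter definition of "tandem".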

 

