SPAdes

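Impressions after use:

It is fine for assembling a small genome, but for a very large genome with many libraries it is better not to use it. Even a server with 768 GB of RAM was not enough.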


Home page:

http://bioinf.spbau.ru/spades

Manual:

http://spades.bioinf.spbau.ru/release3.6.1/manual.html

Note, that SPAdes was initially designed for small genomes. It was tested on single-cell and standard bacterial and fungal data sets. SPAdes is not intended for larger genomes (e.g. mammalian size genomes) and metagenomic projects. For such purposes you can use it at your own risk.

SPAdes also has separate modules for assembling highly polymorphic diploid genomes and for TruSeq barcode assembly. For more information see the dipSPAdes manual and the truSPAdes manual.

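Read error correction is enabled by default. If you use your own correction software, you can turn off the SPAdes correction step.

Supported input data:

Illumina: either FASTA or FASTQ format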

paired end

mate pairs

single(unpaired)reads

IonTorrent:

fastq

bam

Sanger, Oxford Nanopore and PacBio reads can be provided in both formats since SPAdes does not run error correction for these types of data.

SPAdes should not be used if only PacBio, Oxford Nanopore, Sanger reads or additional contigs are available.

Download and unpack; no installation is needed. The version is 3.6.1.

wget http://spades.bioinf.spbau.ru/release3.6.1/SPAdes-3.6.1-Linux.tar.gz
tar -xzf SPAdes-3.6.1-Linux.tar.gz
cd SPAdes-3.6.1-Linux/bin/

Location on the server:

~/spades/SPAdes-3.6.1-Linux/bin

It is suggested to add the SPAdes installation directory to the PATH variable.
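A minimal sketch of that PATH change for the current shell session. It assumes SPAdes was unpacked under ~/spades as in this post; adjust SPADES_BIN (a variable name chosen here for illustration) if it lives elsewhere.

```shell
# Assumed install location, taken from the path used in this post.
SPADES_BIN="$HOME/spades/SPAdes-3.6.1-Linux/bin"

# Prepend it to PATH for the current shell session only.
export PATH="$SPADES_BIN:$PATH"

# Report whether spades.py can now be resolved by name.
if command -v spades.py >/dev/null 2>&1; then
    echo "spades.py found at: $(command -v spades.py)"
else
    echo "spades.py not found; check SPADES_BIN"
fi
```

To make the change permanent, the export line can be appended to a shell startup file such as ~/.bashrc.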

Test that it works:

./spades.py --test

Version 3.6.1 of SPAdes supports paired-end reads, mate-pairs and unpaired reads. SPAdes can take as input several paired-end and mate-pair libraries simultaneously.

Example command:

nohup ~/spades/SPAdes-3.6.1-Linux/bin/spades.py --careful \
    --pe1-1 SHT-6K_rmpcr-1_paired.fq --pe1-2 SHT-6K_rmpcr-2_paired.fq \
    --pe2-1 NHD1114_L1_1_paired.fq --pe2-2 NHD1114_L1_2_paired.fq \
    --pe3-1 NHD1114_L2_1_paired.fq --pe3-2 NHD1114_L2_2_paired.fq \
    --pe4-1 NHD1115_L1_1_paired.fq.gz --pe4-2 NHD1115_L1_2_paired.fq.gz \
    --pe5-1 NHD1115_L2_1_paired.fq.gz --pe5-2 NHD1115_L2_2_paired.fq.gz \
    --mp1-1 SHT-3K-1_rmpcr-1_paired.fq --mp1-2 SHT-3K-1_rmpcr-2_paired.fq \
    --mp2-1 SHT-3K-2_rmpcr-1_paired.fq --mp2-2 SHT-3K-2_rmpcr-2_paired.fq \
    --mp3-1 5k_rmpcr-1_paired.fastq --mp3-2 5k_rmpcr-2_paired.fastq \
    --mp4-1 SHT-6K_rmpcr-1_paired.fq --mp4-2 SHT-6K_rmpcr-2_paired.fq \
    --s1 SHT_500_1.fq --s2 SHT_500_2.fq \
    --pacbio SHT_filtered_subreads20150723.fastq \
    -t 10 -k 21,33,55,77,99,127 -o SPAdes_results &

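When libraries are specified on the command line, at most five paired-end and five mate-pair libraries can be given. Paired-end libraries default to fr orientation and mate-pair libraries to rf.

For single-end (unpaired) data: --s1 test1.fq --s2 test2.fq. Use the --s form here, not -s.

If you have third-generation (PacBio), Nanopore, or contig data: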

Specifying data for hybrid assembly
--pacbio <file_name>
File with PacBio reads. More information on PacBio reads is provided in section 3.1.

--nanopore <file_name>
File with Oxford Nanopore reads.

--sanger <file_name>
File with Sanger reads

--trusted-contigs <file_name>
Reliable contigs of the same genome, which are likely to have no misassemblies and small rate of other errors (e.g. mismatches and indels). This option is not intended for contigs of the related species.

--untrusted-contigs <file_name>
Contigs of the same genome, quality of which is average or unknown. Contigs of poor quality can be used but may introduce errors in the assembly. This option is also not intended for contigs of the related species.

The directory given with -o must already exist: create it first, then start the run.

-t <int> (or --threads <int>)
Number of threads. The default value is 16.

-m <int> (or --memory <int>)
Set memory limit in Gb. SPAdes terminates if it reaches this limit. The default value is 250 Gb. Actual amount of consumed RAM will be below this limit. Make sure this value is correct for the given machine. SPAdes uses the limit value to automatically determine the sizes of various buffers, etc.
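Since the limit should match the machine, a rough way to pick a value for -m is to read the total RAM and leave some headroom. The sketch below assumes a Linux host with /proc/meminfo; the 10% headroom and the variable names are arbitrary choices made here, not SPAdes recommendations.

```shell
# Total physical RAM in kB from /proc/meminfo (Linux only); falls back to 0 if unavailable.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
mem_kb=${mem_kb:-0}

# Convert to whole gigabytes and keep roughly 10% free for the rest of the system.
mem_gb=$(( mem_kb / 1024 / 1024 ))
limit_gb=$(( mem_gb * 9 / 10 ))

echo "Total RAM: ${mem_gb} GB; candidate setting: -m ${limit_gb}"
```

On a shared server the value may need to be lower still, because other jobs compete for the same memory.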

--tmp-dir <dir_name>
Set directory for temporary files from read error correction. The default value is <output_dir>/corrected/tmp

--sc
Use this option for single-cell, small-genome data.

-k <int,int,...>
Comma-separated list of k-mer sizes to be used (all values must be odd, less than 128 and listed in ascending order). If --sc is set the default values are 21,33,55. For multicell data sets K values are automatically selected using maximum read length (see note for assembling long Illumina paired reads for details). To properly select K values for IonTorrent data read section 3.3.

--careful
Tries to reduce the number of mismatches and short indels. Also runs MismatchCorrector – a post processing tool, which uses BWA tool (comes with SPAdes). This option is recommended.

--continue
Continues SPAdes run from the specified output folder starting from the last available check-point. Check-points are made after:

  • error correction module is finished
  • iteration for each specified K value of assembly module is finished
  • mismatch correction is finished for contigs or scaffolds

For example, if specified K values are 21, 33 and 55 and SPAdes was stopped or crashed during assembly stage with K = 55, you can run SPAdes with the --continue option specifying the same output directory. SPAdes will continue the run starting from the assembly stage with K = 55. Error correction module and iterations for K equal to 21 and 33 will not be run again. Note that all options except -o <output_dir> are ignored if --continue is set.
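A resumed run might be launched as sketched below. The output directory name is the one used in this post's example command; continue.log is a hypothetical log file name. The guard only exists so the snippet prints the command instead of failing when spades.py or the directory is missing.

```shell
# Output directory of the interrupted run (name taken from this post's example).
OUT_DIR="SPAdes_results"

# Only -o is meaningful here; other options are ignored with --continue.
CONTINUE_CMD="spades.py --continue -o $OUT_DIR"

if command -v spades.py >/dev/null 2>&1 && [ -d "$OUT_DIR" ]; then
    # Resume in the background and keep the console output in a log file.
    nohup $CONTINUE_CMD > continue.log 2>&1 &
else
    # Dry run: show what would be executed.
    echo "would run: $CONTINUE_CMD"
fi
```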

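On k-mer sizes: for single-cell data, add --sc. For multicell data, use neither --sc nor -k, and SPAdes will choose the k-mer sizes from your read length (the longer the reads, the larger the k-mers). -k sets the k-mer values you want explicitly; every value must be odd.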
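The constraints on -k (odd values, below 128, ascending order) are easy to get wrong when typing a long list, so a quick check before launching a long run can help. This is an illustrative helper written for this post, not part of SPAdes; validate_k is a made-up name.

```shell
# validate_k LIST: succeed if LIST is a non-empty comma-separated list of
# odd integers, each below 128, in strictly ascending order.
validate_k() {
    if [ -z "$1" ]; then echo "empty list"; return 1; fi
    prev=0
    for k in $(printf '%s' "$1" | tr ',' ' '); do
        case "$k" in
            ''|*[!0-9]*) echo "not an integer: $k"; return 1 ;;
        esac
        if [ $((k % 2)) -eq 0 ]; then echo "even value: $k"; return 1; fi
        if [ "$k" -ge 128 ]; then echo "value too large: $k"; return 1; fi
        if [ "$k" -le "$prev" ]; then echo "not ascending at: $k"; return 1; fi
        prev=$k
    done
    echo "ok"
}

validate_k 21,33,55,77,99,127      # the list from the example command: prints "ok"
validate_k 21,55,33 || true        # prints "not ascending at: 33"
```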


--restart-from <check_point>
Restart SPAdes run from the specified output folder starting from the specified check-point. Check-points are:

  • ec – start from error correction
  • as – restart assembly module from the first iteration
  • k<int> – restart from the iteration with specified k values, e.g. k55
  • mc – restart mismatch correction

In comparison to the --continue option, you can change some of the options when using --restart-from. You can change any option except: all basic options, all options for specifying input data (including --dataset), --only-error-correction option and --only-assembler option. For example, if you ran assembler with k values 21,33,55 without mismatch correction, you can add one more iteration with k=77 and run mismatch correction step by running SPAdes with following options:

--restart-from k55 -k 21,33,55,77 --mismatch-correction -o <previous_output_dir>

Since all files will be overwritten, do not forget to copy your assembly from the previous run if you need it.

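These two options are handy: there is no need to rerun everything from scratch.

If you have more libraries than the command line allows, you need to use a YAML file.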

By using a YAML file you can provide an unlimited number of paired-end, mate-pair and unpaired libraries. Basically, YAML data set file is a text file, in which input libraries are provided as a comma-separated list in square brackets. Each library is provided in braces as a comma-separated list of attributes. The following attributes are available:

  • orientation ("fr", "rf", "ff")
  • type ("paired-end", "mate-pairs", "hq-mate-pairs", "single", "pacbio", "nanopore", "sanger", "trusted-contigs", "untrusted-contigs")
  • interlaced reads (comma-separated list of files with interlaced reads)
  • left reads (comma-separated list of files with left reads)
  • right reads (comma-separated list of files with right reads)
  • single reads (comma-separated list of files with single reads)

To properly specify a library you should provide its type and at least one file with reads. Orientation is an optional attribute. Its default value is "fr" (forward-reverse) for paired-end libraries and "rf" (reverse-forward) for mate-pair libraries.

The value for each attribute is given after a colon. Comma-separated lists of files should be given in square brackets. For each file you should provide its full path in double quotes. Make sure that files with right reads are given in the same order as corresponding files with left reads.


The example YAML file specifies two paired-end libraries, one mate-pair library, and one PacBio data set.
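A data set file of that shape can be sketched as follows, written out with a shell heredoc. It follows the attribute rules quoted above; every /FULL_PATH/... file name is a placeholder invented for illustration, and dataset.yaml is an arbitrary file name.

```shell
# All /FULL_PATH/... entries below are placeholders, not real data files.
cat > dataset.yaml <<'EOF'
[
  {
    orientation: "fr",
    type: "paired-end",
    left reads: ["/FULL_PATH/pe1_1.fq"],
    right reads: ["/FULL_PATH/pe1_2.fq"]
  },
  {
    orientation: "fr",
    type: "paired-end",
    left reads: ["/FULL_PATH/pe2_1.fq"],
    right reads: ["/FULL_PATH/pe2_2.fq"]
  },
  {
    orientation: "rf",
    type: "mate-pairs",
    left reads: ["/FULL_PATH/mp1_1.fq"],
    right reads: ["/FULL_PATH/mp1_2.fq"]
  },
  {
    type: "pacbio",
    single reads: ["/FULL_PATH/pacbio.fastq"]
  }
]
EOF

# The file would then replace the --pe1-1 style options, for example:
#   spades.py --dataset dataset.yaml -o SPAdes_results
echo "wrote $(grep -c 'type:' dataset.yaml) library blocks to dataset.yaml"
```

Further libraries are added as extra blocks in braces, which is how the command-line limit on library count is avoided.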

Notes:

  • If --dataset is used to specify the data, options of the --pe1-1 kind can no longer be used.
  • We recommend nesting all files with long reads of the same data type in a single library block.

Additional contigs

In case you have contigs of the same genome generated by other assembler(s) and you wish to merge them into SPAdes assembly, you can specify additional contigs using --trusted-contigs or --untrusted-contigs. First option is used when high quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution. Second option is used for less reliable contigs that may have more errors or contigs of unknown quality. These contigs will be used only for gap closure and repeat resolution. The number of additional contigs is unlimited.

Note, that SPAdes does not perform assembly using genomes of closely-related species. Only contigs of the same genome should be specified.

SPAdes output

SPAdes stores all output files in <output_dir> , which is set by the user.

  • <output_dir>/corrected/ directory contains reads corrected by BayesHammer in *.fastq.gz files; if compression is disabled, reads are stored in uncompressed *.fastq files
  • <output_dir>/contigs.fasta contains resulting contigs
  • <output_dir>/scaffolds.fasta contains resulting scaffolds
  • <output_dir>/assembly_graph.fastg contains SPAdes assembly graph in FASTG format

To view FASTG files we recommend using the Bandage tool. Note that sequences stored in assembly_graph.fastg correspond to contigs before repeat resolution (edges of the assembly graph). Contigs after repeat resolution (scaffolding) are stored in contigs.fasta (scaffolds.fasta) and can be represented as paths in the assembly graph.

The full list of <output_dir> content is presented below:

  • contigs.fasta – resulting contigs
  • scaffolds.fasta – resulting scaffolds
  • before_rr.fasta – contigs before repeat resolution
  • assembly_graph.fastg – assembly graph
  • corrected/ – files from read error correction
      • configs/ – configuration files for read error correction
      • corrected.yaml – internal configuration file
      • Output files with corrected reads
  • params.txt – information about SPAdes parameters in this run
  • spades.log – SPAdes log
  • dataset.info – internal configuration file
  • input_dataset.yaml – internal YAML data set file
  • K<##>/ – directory containing files from the run with K=<##>

SPAdes will overwrite these files and directories if they exist in the specified <output_dir>.
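Once a run finishes, a quick sanity check of contigs.fasta and scaffolds.fasta is to count sequences and bases. The helper below is a generic FASTA summary written for this post (summarize_fasta is a made-up name); it uses a tiny stand-in file so it can be tried without a finished assembly.

```shell
# summarize_fasta FILE: print the number of sequences and total bases in a FASTA file.
summarize_fasta() {
    awk '/^>/ {n++; next} {bases += length($0)} END {printf "%d sequences, %d bases\n", n, bases}' "$1"
}

# Tiny stand-in file so the function can be tried without a finished run.
demo=$(mktemp)
printf '>NODE_1\nACGTACGT\nACGT\n>NODE_2\nGGCC\n' > "$demo"
summarize_fasta "$demo"    # prints: 2 sequences, 16 bases
rm -f "$demo"

# After a real run (directory name assumed from this post's example command):
#   summarize_fasta SPAdes_results/contigs.fasta
#   summarize_fasta SPAdes_results/scaffolds.fasta
```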

freemao

FAFU
