snakemake教程-03额外特征

2024-01-25 17:31:10

文章目录

- 额外的特征

参考： https://snakemake.readthedocs.io/en/stable/tutorial/additional_features.html

额外的特征

接下来，我们将介绍更多上面的案例没有包括的特征。更多细节和更多特征，可以参考 Writing Workflows, Frequently Asked Questions或者使用命令行帮助(snakemake --help)。

1. Benchmarking（基准测试）

使用benchmark指令，Snakemake可以度量一项工作的运行时间。我们可以在bwa_map中激活benchmarking:

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    benchmark:
        repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

可以多次重复基准测试，了解测量量的可变性。这可以通过另一个注释benchmark文件来实现，例如：repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)。Snakemake会被告知执行三次。重复度量可以作为tab分割的基准文件的后续行。

运行代码为：

##删掉保护的bam文件
rm -rf /share/inspurStorage/home1/jialh/tools/snakemake/snakemake-tutorial/sorted_reads

##重新运行整个流程
snakemake --forceall --cores 4

2. Modularization（模块化）

为了重用构建的模块或者简化结构化的大型流程，有时候需要合理地split a workflow into modules。对此，Snakemake提供了include指令来将另一个Snakefile包含进当前的Snakefile，例如：

include: "path/to/other.snakefile"

另外，Snakemake还允许定义子流程（define sub-workflows，详情可参考https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#snakefiles-sub-workflows）。基本示例如下：

subworkflow otherworkflow:
    workdir:
        "../path/to/otherworkflow"
    snakefile:
        "../path/to/otherworkflow/Snakefile"
    configfile:
        "path/to/custom_configfile.yaml"

rule a:
    input:
        otherworkflow("test.txt")
    output: ...
    shell:  ...

练习

将读段比对相关的rules作为分开的Snakefile, 然后用include指令，使它们能够在我们的示例流程中工作。

mapping_snakefile文件中写入：

##snakemake --cores 4 -s Snakefile
##snakemake --cores 4 mapped_reads/B.bam
##snakemake --cores 4 mapped_reads/C.bam
def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    benchmark:
        repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

主snakefile为：

include: "/home1/jialh/tools/snakemake/snakemake-tutorial/mapping_Snakefile" ##注意需要使用引号
#SAMPLES = ["A", "B"]
configfile: "config.yaml"
rule all:
    input:
        "plots/quals.svg"

##snakemake --cores 4 sorted_reads/B.bam
##snakemake --cores 4 sorted_reads/{A,B,C}.bam
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

##snakemake --cores 4 sorted_reads/{A,B}.bam.bai
##snakemake --dag sorted_reads/{A,B}.bam | dot -Tsvg > dag_test.svg
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

##snakemake --cores 4  calls/all.vcf
print(config["pmr"])
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    params:
        pmr=config["pmr"]
    log:
        "logs/bcftools_call/all_vcf.log"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv -P {params.pmr} - > {output}"

##snakemake --dag calls/all.vcf | dot -Tsvg > dag_vcf.svg

##使用自己写的脚本。
##snakemake --cores 4  plots/quals.svg
rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"

3.Automatic deployment of software dependencies（自动部署软件依赖项）

为了完全重复数据分析，能够执行每个步骤和指定所有使用的参数是不够的。使用的软件工具和库也需要写清楚。在本教程中，你已经看到了Conda如何为整个流程指定独立的软件环境。使用Snakefile，你能够更进一步，指定每个rule的Conda环境。这种方式，你可以使用冲突的软件版本（例如，结合Python2和Python3）。

在我们的案例中，除了外部环境，我们可以指定每条rule的环境，例如：

rule samtools_index:
  input:
      "sorted_reads/{sample}.bam"
  output:
      "sorted_reads/{sample}.bam.bai"
  conda:
      "envs/samtools.yaml"
  shell:
      "samtools index {input}"

其中，envs/samtools.yaml可以被定义为：

channels:
  - default
  - bioconda
  - conda-forge
dependencies:
  - samtools =1.9

Snakemake可以被执行为：

snakemake --use-conda --cores 4 --forceall

如果自动创建需要的环境和然后在执行工作前激活它们。最好是指定该环境需要的packages的最低和最高版本。上述运行结果如下：

会尝试下载和激活所需的环境。

4. Tool wrappers（工具包装）

为了简化流行工具的使用，Snakemake提供了包装资源库（ Snakemake wrapper repository）。A wrapper是一段短的脚本（通常包装了命令行应用，使之在Snakemake内部能够直接处理。对此，Snakemake提供了wrapper指令来代替shell, script或者run。例如bwa_map可以被写成如下形式：

rule bwa_mem:
  input:
      ref="data/genome.fa",
      sample=lambda wildcards: config["samples"][wildcards.sample]
  output:
      temp("mapped_reads/{sample}.bam")
  log:
      "logs/bwa_mem/{sample}.log"
  params:
      "-R '@RG\tID:{sample}\tSM:{sample}'"
  threads: 8
  wrapper:
      "0.15.3/bio/bwa/mem"

wrapper指令需要（部分）URL直接指向存储库中的包装器。这些可以在对应的数据库（database）中看到。在调用时，Snakemake将自动下载请求的包装器版本。进一步，结合--use-conda，所需的软件将自动在执行之前下载。

[注意：按照实际操作看，–use-conda会导致程序执行很慢，而且可能导致奇怪的错误，慎用。]

5. Cluster execution

默认情形下，Snakemake在本地机器上执行工作。然而，其也可以在分布式环境下执行工作，例如：compute cluster or batch system。如果节点共享公共的文件系统，Snakemake支持三种可选的执行模式。

在集群环境中，计算工作通常通过类似qsub的命令提交给shell脚本。Snakemake提供了一种generic mode在clusters上执行。通过调用Snakemake：

snakemake --cluster qsub --jobs 100

此部分暂时用不到，略过。

6. Using –cluster-status

可用于检测，是否一个cluster job被完全执行成功了。

7.Constraining wildcards

Snakemake使用正则表达式来匹配输入文件和输出文件，进而决定jobs之间的依赖关系。有时候，限制一个通配符的值非常有用。这可以通过增加描述允许通配符的值的集合的正则表达式来实现。例如，通配符sample在输出文件"sorted_reads/{sample}.bam"可以限制其只允许字母数字式样的sample name为"sorted_reads/{sample,[A-Za-z0-9]+}.bam"。每个规则都可以定义限制，或者在全局使用wildcard_constrains关键字，正如Wildcards所展示的。这种机制能够解决两种类型的混淆。

其可以避免歧义规则，例如两个或者多个规则被用于产生相同的输出文件。其他处理混淆规则的方法会在 Handling Ambiguous Rules中描述。
其有助于基于匹配指导正则表达式，使得通配符被分配到文件名的正确部分。考虑输出文件{sample}.{group}.txt，假设目标文件是A.1.normal.txt。不清楚到底是dataset="A.1"和group="normal"，还是dataset="A"和group="1.normal"是正确的赋值。此处，通过{sample,[A-Z]+}.{group}限制数据集通配符可以解决此问题。

当解决混淆规则时，最好的办法是首先尝试通过合适的文件结构解决混淆，例如，通过将不同步骤的输出文件划分到不同的目录。

码农公寓