Skip to content Skip to sidebar Skip to footer

Snakemake - Dynamically Derive The Targets From Input Files

I have a large number of input files organized like this: data/ ├── set1/ │ ├── file1_R1.fq.gz │ ├── file1_R2.fq.gz │ ├── file2_R1.fq.gz │ �

Solution 1:

I think you misunderstand how snakemake works. When you run snakemake you define which output you want, either on the command-line, or else the input of the first rule in the Snakefile (your rule all) will be generated. Since you do not specify any output (snakemake -np), Snakemake will try to generate the input of rule all.

The input of your rule all basically is:

"somepath/clipped_and_trimmed_reads/{sample}_R1.fq.gz"

Snakemake, unfortunately, does not know how to generate output from this... We need to tell Snakemake which files to use. We can do this manually:

rule all:
    input:
        "somepath/clipped_and_trimmed_reads/file1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/asdf1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/zxcv1_R1.fq.gz"

But this becomes quite cumbersome as we get more files, and as you specify in the question, is not what you want. We need to write a small function that that gets all the filenames for us.

import glob
import re

data=config["raw_data"]
samples = []
locations = {}
for file in glob.glob(data + "/**", recursive=True):
    if'_R1.fq.gz' in file:
        split = re.split('/|_R1', file)
        filename, directory = split[-2], split[-3]
        locations[filename] = directory  # we will need this one later
        samples.append(filename)

We can now feed this to our rule all:

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{sample}_R1.fq.gz", output=config["output"], sample=samples),
        read2=expand("{output}/clipped_and_trimmed_reads/{sample}_R2.fq.gz", output=config["output"], sample=samples)

Note that we do not have sample as a wildcard anymore, but we 'expand' it into our read1 and read2, making all possible combinations of output and sample.

We are only halfway done however.. If we call Snakemake like this, it will know exactly which output we want, and which rule can generate this (rule clip_and_trim_reads). However it still does not know where to look for these files. Fortunately we already have a dictionary where we stored these (stored in locations).

rule clip_and_trim_reads:
    input:
        read1=lambda wildcards: expand("{data}/{set}/{sample}_R1.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample),
        read2=lambda wildcards: expand("{data}/{set}/{sample}_R2.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample)
    output:
        ...

Now everything should work!! And even better; since all our results from the rule clip_and_trim_reads are written to a single folder, continuing from here should be much easier!

p.s. I haven't tested any of the code, so probably not everything works on the first try. However the message remains.

Post a Comment for "Snakemake - Dynamically Derive The Targets From Input Files"