Intro to Snakemake

Under the hood, showyourwork! is essentially a wrapper around Snakemake. The code builds the article PDF by parsing the showyourwork.yml config file and the ms.tex manuscript to construct the computational graph for the workflow, which identifies the scripts it needs to execute and the datasets it needs to download to produce all the figures in the article. If you poke around the API, you’ll see that showyourwork! defines several Snakemake rules to do these various tasks, then hands over full control to Snakemake.

If your article consists only of text and figures that can be generated by running lightweight scripts, you probably don’t need to worry about any of this. But for certain use cases, it can be convenient to extend or even override some of the showyourwork! functionality by defining custom Snakemake rules. Below we discuss a few examples of this.

The Snakefile

Every showyourwork! article repository is instantiated with a blank Snakefile at the repository root. This file gets included at the start of the main (build) step of the workflow, and may thus be used to define custom rules or to run custom Python code during the workflow. Almost everything you need to know about Snakefiles can be found in the Snakemake documentation, but we’ll go over the basics below.

Snakefiles are, at their core, Python scripts with a little extra functionality. Any valid Python script is also a valid Snakefile, so that should give you lots of flexibility to define your custom commands. However, the main thing you probably want to use the Snakefile for is to define custom rules for your workflow. Snakefile rules tell Snakemake how to generate an output file from given input files, much like rules in a classic Makefile. Snakemake rules usually look something like this:

rule simulation:
    input:
        "dataset1.dat",
        "dataset2.dat"
    output:
        "results.dat"
    conda:
        "environment.yml"
    params:
        seed=42,
        iterations=1000,
        mode="fast"
    script:
        "src/scripts/run_simulation.py"

In this example, we’ve defined a rule called simulation, which tells Snakemake how to produce the output file results.dat. Specifically, this file can be generated by running the script src/scripts/run_simulation.py in an isolated conda environment with specs given in environment.yml. The rule also tells Snakemake that the files dataset1.dat and dataset2.dat are dependencies of results.dat, meaning (1) the rule cannot be executed if those files are not present (and there’s no other Snakemake rule capable of generating them) and (2) whenever either of those two files is modified, this rule will be re-executed the next time the workflow runs in order to keep results.dat up to date with its inputs. Finally, the rule specifies three parameters under the params key, which can be accessed within the script via the snakemake.params object (e.g., snakemake.params["seed"]). Note that there’s no need to explicitly import snakemake within run_simulation.py, as it gets automagically inserted into the namespace.
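To make the script side of this rule concrete, here is a minimal sketch of what src/scripts/run_simulation.py might look like. The run_simulation function, its toy computation, and the interpretation of the mode parameter are all assumptions made for illustration; only the snakemake.input, snakemake.output, and snakemake.params accessors come from Snakemake itself.

```python
import random
from pathlib import Path


def run_simulation(input_paths, output_path, seed, iterations, mode):
    """Toy stand-in for a real simulation (hypothetical).

    Counts the lines in the input files and scales that count by the
    mean of a batch of seeded random draws.
    """
    rng = random.Random(seed)  # seeded generator, so reruns are reproducible
    n_lines = sum(len(Path(p).read_text().splitlines()) for p in input_paths)
    step = 10 if mode == "fast" else 1  # assumed meaning of "mode"
    draws = [rng.random() for _ in range(0, iterations, step)]
    result = n_lines * sum(draws) / len(draws)
    Path(output_path).write_text(f"{result:.6f}\n")
    return result


# Snakemake injects a `snakemake` object when it runs this file through the
# `script` key; the guard lets the file also be imported outside a workflow.
if "snakemake" in globals():
    run_simulation(
        input_paths=list(snakemake.input),
        output_path=snakemake.output[0],
        seed=snakemake.params["seed"],
        iterations=snakemake.params["iterations"],
        mode=snakemake.params["mode"],
    )
```

Keeping the computation in a plain function and confining the snakemake accesses to the last few lines is a design choice, not a requirement; it simply makes the script easier to test on its own.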

Note

The argument to the script key must be a Python script. If your script is in a different language, you can instead pass the shell key and provide a string containing the shell command Snakemake should execute to produce the output file, e.g., jupyter execute notebook.ipynb. If you do that, remember to include the script (notebook.ipynb) as an explicit input to your rule so that Snakemake can track dependencies properly!

Note that Snakemake also provides a run key which allows users to specify Python code directly. To ensure commands are run in isolated conda environments (to maximize reproducibility), showyourwork! does not support this. Please use either script or shell in your rules, and remember to always provide a conda environment file.

There are a lot of other features supported within rules; for instance, input files and parameters can be provided as functions, adding another layer of flexibility to your workflow. Rules can also be declared within for loops, if statements, etc. For the full list of features, please refer to the Snakemake documentation.
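As a rough illustration of an input function, the sketch below defines a callable that receives the rule's wildcards and returns the list of input paths. The function name simulation_inputs, the run wildcard, and the file naming scheme are hypothetical; in a Snakefile the function itself would be given under the input key in place of a literal list of file names.

```python
from types import SimpleNamespace


def simulation_inputs(wildcards):
    """Return the input files for a given (hypothetical) `run` wildcard."""
    return [f"src/data/{wildcards.run}_part{i}.dat" for i in (1, 2)]


# Snakemake would call the function with the wildcards matched from the
# rule's output; here a SimpleNamespace mimics that object.
example = simulation_inputs(SimpleNamespace(run="planck"))
print(example)
```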

Intermediate results

An example usage of the Snakefile is discussed in the Zenodo integration guide, where we show how to define a Snakemake rule to generate intermediate results. The idea here is that partitioning one’s workflow into pipeline steps and plotting steps can make life easier for the author (and the interested reader) when writing or editing the article. For example, suppose one of the figures in an article depends on running a computationally expensive simulation. If this simulation is run within the script that generates the figure, any change to that script will result in a re-execution of the simulation the next time the article is built. Thus, if one wanted to change something as simple as the color of one of the lines in the figure, the entire simulation would have to be run again.

The way around this is to split the script into a simulation script and a plotting script. The former generates an intermediate results file, and the latter loads that file to do the plotting. This way, the plotting is decoupled from the simulation, and changes to the plotting script will not trigger re-execution of the expensive computation.
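The split described above might look roughly like the following sketch. For brevity the two steps are shown in one block, but in practice simulate and plot would live in separate scripts, each with its own rule. The function names, the JSON results format, the toy random walk, and the use of matplotlib are assumptions chosen for illustration.

```python
import json
import random
import tempfile
from pathlib import Path


def simulate(results_path, n_steps=1000, seed=0):
    """Expensive step (a toy random walk here); writes the intermediate file."""
    rng = random.Random(seed)
    position, walk = 0.0, []
    for _ in range(n_steps):
        position += rng.gauss(0.0, 1.0)
        walk.append(position)
    Path(results_path).write_text(json.dumps({"walk": walk}))


def load_results(results_path):
    """Read the intermediate file back without re-running the simulation."""
    return json.loads(Path(results_path).read_text())["walk"]


def plot(results_path, figure_path, color="C0"):
    """Cheap step: only loads the intermediate file and draws the figure."""
    import matplotlib  # imported lazily; assumed to be installed

    matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(load_results(results_path), color=color)
    ax.set_xlabel("step")
    ax.set_ylabel("position")
    fig.savefig(figure_path)
    plt.close(fig)


# Demonstrate the hand-off through the intermediate file.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results.json"
    simulate(results, n_steps=100, seed=1)
    print(len(load_results(results)), "steps loaded from the intermediate file")
```

With this layout, changing the color argument only touches the plotting code, so the file written by simulate can be reused as is.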

In the Zenodo integration guide, we show how to define a custom Snakemake rule to make this work. In that guide, we also discuss how showyourwork! extends the Snakemake cache command to allow caching of intermediate results on Zenodo, which can help others avoid re-running expensive computations when reproducing your work.

Variables in the TeX file

Another use case for custom rules is the definition of dynamic variables in the TeX manuscript. For example, say I have a script called age_of_universe.py that infers the age of the universe from some cosmological dataset:

File: age_of_universe.py

import paths
from my_awesome_code import get_age_of_universe

# Load the data
dataset = paths.data / "planck.dat"

# Compute the age
age = get_age_of_universe(dataset)

# Write it to disk
with open(paths.output / "age_of_universe.txt", "w") as f:
    print(f"{age:.3f}", file=f)

I would like to report this age in the text of my article, but I want to avoid having to re-type it every time I make changes to my workflow that affect this quantity. We can easily automate this by defining a custom Snakemake rule:

File: Snakefile

rule age_of_universe:
    input:
        "src/data/planck.dat"
    output:
        "src/tex/output/age_of_universe.txt"
    script:
        "src/scripts/age_of_universe.py"

Then, in my TeX file, I can do the following:

File: ms.tex

Based on a detailed analysis of Planck observations of the cosmic
microwave background, we have determined the age of the universe
to be \variable{output/age_of_universe.txt} Gyr.

That’s it! This functionality can easily be adapted to automatically populate tables in your article or anything else that can be generated programmatically from your workflow. Note that showyourwork! automatically parses \variable calls and adds their arguments as explicit dependencies of the manuscript, so that any changes to these files will trigger a re-run of the compile step. For more information on this command, see The \variable command.
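As a sketch of the table use case, a script could write pre-formatted LaTeX rows to a text file in the same way the age was written above. The write_table_rows helper, the row labels, and the placeholder values are hypothetical, and whether the resulting file can be pulled in with \variable inside a particular table environment is an assumption to check against your own manuscript.

```python
from pathlib import Path


def write_table_rows(rows, output_path):
    """Write (label, value) pairs as LaTeX table rows, one per line."""
    lines = [f"{label} & {value:.3f} \\\\" for label, value in rows]
    Path(output_path).write_text("\n".join(lines) + "\n")
    return lines


# Placeholder labels and values, for illustration only.
for line in write_table_rows([("A", 1.0), ("B", 2.5)], "table_rows.txt"):
    print(line)
```

In a real workflow the output path would point into the output directory (e.g., paths.output / "table_rows.txt") and be declared as the output of a custom rule, just like age_of_universe.txt.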

Mixed figure environments

Note

Coming soon: how to deal with \figure environments with figures that are generated by multiple different scripts, or if you’d like to include figures generated by a given script in multiple figure environments. It’s easy if you define your own Snakemake rules.

Advanced usage

It is also possible to entirely override showyourwork! rules. When ingesting user-defined rules from the Snakefile, the code automatically gives precedence to those rules over showyourwork! rules (by setting a higher ruleorder for all user rules). This means that if there are two rules that can generate the same output, Snakemake will always favor the user-defined rule. You can take advantage of this to provide custom rules to build individual figures or even the article PDF itself.

Using existing (data) files in a workflow by ignoring timestamps

When starting up a project or when in a rapid development phase, it can be useful to tell Snakemake to ignore changes to a file or its timestamp when running the build. For example, you may have a slow rule that generates a data file by querying an external data archive, and you just want to use a temporary subset of the data or an existing copy of it. Snakemake supports this with the ancient() function: wrapping an input file in ancient() tells Snakemake to ignore that file's timestamp and treat it as older than any of the rule's outputs, so changes to it do not trigger a re-run. See the Snakemake documentation for more information about how to use this in a rule.