Intro to Snakemake
Under the hood, showyourwork! is essentially a wrapper around Snakemake. The
code builds the article PDF by parsing the showyourwork.yml config file and
the ms.tex manuscript to build the computational graph for the workflow,
identifying which scripts it needs to execute and which datasets it needs to
download to produce all the figures in the article. If you poke around the
API, you’ll see that showyourwork! defines several Snakemake rules to do
these various tasks, then hands over full control to Snakemake.
If your article consists only of text and figures that can be generated by running lightweight scripts, you probably don’t need to worry about any of this. But for certain use cases, it can be convenient to extend or even override some of the showyourwork! functionality by defining custom Snakemake rules. Below we discuss a few examples of this.
The Snakefile
Every showyourwork! article repository is instantiated with a blank Snakefile
at the repository root. This file gets included at the start of the main (build) step of the
workflow, and may thus be used to define custom rules or to run custom Python
code during the workflow. Almost everything you need to know about Snakefiles can
be found in the Snakemake documentation,
but we’ll go over the basics below.
Snakefiles are, at their core, Python scripts with a little extra functionality.
Any valid Python script is also a valid Snakefile, so that should give you lots
of flexibility to define your custom commands. However, the main thing you probably
want to use the Snakefile for is to define custom rules for your workflow.
Snakefile rules tell Snakemake how to generate an output file from given
input files, much like rules in a classic Makefile. Snakemake rules
usually look something like this:
rule simulation:
    input:
        "dataset1.dat",
        "dataset2.dat"
    output:
        "results.dat"
    conda:
        "environment.yml"
    params:
        seed=42,
        iterations=1000,
        mode="fast"
    script:
        "src/scripts/run_simulation.py"
In this example, we’ve defined a rule called simulation, which tells
Snakemake how to produce the output file results.dat. Specifically,
this file can be generated by running the script src/scripts/run_simulation.py
in an isolated conda environment with specs given in environment.yml.
The rule also tells Snakemake that the files dataset1.dat and dataset2.dat
are dependencies of results.dat, meaning (1) the rule cannot be executed
if those files are not present (and there’s no other Snakemake rule capable
of generating them) and (2) whenever either of those two files is modified,
this rule will be re-executed the next time the workflow runs in order to keep
results.dat up to date with its inputs.
Finally, the rule specifies three parameters under the params key, which can be
accessed within the script via the snakemake.params object
(e.g., snakemake.params["seed"]). Note that there’s
no need to explicitly import snakemake within run_simulation.py, as
it gets automagically inserted into the namespace.
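As a rough illustration, a script such as run_simulation.py might consume these parameters along the lines sketched below. The random-walk "simulation", the run_simulation helper, and the output format are hypothetical placeholders; only the snakemake.params and snakemake.output accessors come from Snakemake itself.

```python
import random


def run_simulation(seed, iterations, mode):
    """Hypothetical stand-in for a real simulation: a seeded 1-D random walk."""
    rng = random.Random(seed)
    # Assumption for illustration: "fast" mode takes coarser steps.
    step = 1.0 if mode == "fast" else 0.1
    position = 0.0
    for _ in range(iterations):
        position += rng.choice((-step, step))
    return position


# Snakemake injects the `snakemake` object when it runs a file through the
# `script` key; the guard lets the file also be imported outside a workflow.
if "snakemake" in globals():
    params = snakemake.params
    result = run_simulation(params["seed"], params["iterations"], params["mode"])
    # Write the result to the rule's (single) declared output file.
    with open(snakemake.output[0], "w") as f:
        print(f"{result:.3f}", file=f)
```

Keeping the computation in a plain function that takes ordinary arguments makes it possible to test the script without running the whole workflow.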
Note
The argument to the script key must be a Python script.
If your script is in a different language, you can instead pass the
shell key and provide a string containing the shell command
Snakemake should execute to produce the output file, e.g.,
jupyter execute notebook.ipynb. If you do that, remember to include
the script (notebook.ipynb) as an explicit input to your rule so
that Snakemake can track dependencies properly!
Note that Snakemake also provides a run key which allows users
to specify Python code directly. To ensure commands are run in isolated
conda environments (to maximize reproducibility), showyourwork! does
not support this. Please use either script or shell in your rules,
and remember to always provide a conda environment file.
There are a lot of other features supported within rules; for instance, input files and parameters can be provided as functions, adding another layer of flexibility to your workflow. Rules can also be declared within for loops, if statements, etc. For the full list of features, please refer to the Snakemake documentation.
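To make the idea of an input function a little more concrete, here is a minimal sketch. Snakemake calls such a function with a wildcards object whose attributes hold the values matched in the rule's output pattern; the simulation_inputs name, the run wildcard, and the directory layout are assumptions made up for this example.

```python
from types import SimpleNamespace


def simulation_inputs(wildcards):
    """Return the input files for one simulation run.

    The `run` wildcard and the file layout are hypothetical; adapt them
    to the output pattern of your own rule.
    """
    return [
        f"src/data/{wildcards.run}/dataset1.dat",
        f"src/data/{wildcards.run}/dataset2.dat",
    ]


# In a Snakefile, the function itself (not its result) would be given to the
# rule, e.g. `input: simulation_inputs`, alongside an output pattern such as
# "src/data/{run}/results.dat".

# Outside Snakemake, the function can be exercised with a stand-in object:
print(simulation_inputs(SimpleNamespace(run="baseline")))
```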
Intermediate results
An example usage of the Snakefile is discussed in the Zenodo integration guide, where we show how to define a Snakemake rule to generate intermediate results. The idea here is that partitioning one’s workflow into pipeline steps and plotting steps can make it easier for the author (and the interested reader) while writing or editing the article. For example, suppose one of the figures in an article depends on running a computationally expensive simulation. If this simulation is run within the script that generates the figure, any changes to that script will result in a re-execution of the simulation the next time the article is built. Thus, if one wanted to change something as simple as the color of one of the lines in the figure, the entire simulation would have to be run again.
The way around this is to split the script into a simulation script and a plotting script. The former generates an intermediate results file, and the latter loads that file to do the plotting. This way, the plotting is decoupled from the simulation, and changes to the plotting script will not trigger re-execution of the expensive computation.
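The split described above might look roughly like the following sketch. In a real article the simulation and plotting functions would live in two separate scripts, each with its own rule; they are shown together here only for brevity. The toy random walk, the JSON results format, and all function names are assumptions for illustration, and the plotting step assumes matplotlib is available in the environment.

```python
import json
import random
from pathlib import Path


def run_simulation(results_path, n_steps=1000, seed=0):
    """Expensive step (here a toy random walk): write the results to disk."""
    rng = random.Random(seed)
    position, trajectory = 0.0, []
    for _ in range(n_steps):
        position += rng.gauss(0.0, 1.0)
        trajectory.append(position)
    Path(results_path).write_text(json.dumps({"trajectory": trajectory}))


def load_results(results_path):
    """Cheap step: read the stored results instead of recomputing them."""
    return json.loads(Path(results_path).read_text())["trajectory"]


def plot_results(results_path, figure_path, color="C0"):
    """Plotting step: restyling the figure only requires re-running this."""
    import matplotlib  # imported lazily so the simulation step does not need it

    matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
    import matplotlib.pyplot as plt

    trajectory = load_results(results_path)
    fig, ax = plt.subplots()
    ax.plot(trajectory, color=color)
    ax.set_xlabel("step")
    ax.set_ylabel("position")
    fig.savefig(figure_path)
    plt.close(fig)
```

With this layout, changing the color argument touches only the plotting step, so the intermediate results file remains up to date and the simulation is not repeated.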
In the Zenodo integration guide, we show how to define a custom Snakemake rule to
make this work. In that guide, we also discuss how showyourwork! extends
the Snakemake cache command to allow caching of intermediate results on
Zenodo, which can help others avoid re-running expensive computations when
reproducing your work.
Variables in the TeX file
Another use case for custom rules is the definition of dynamic variables in
the TeX manuscript. For example, say I have a script called age_of_universe.py
that infers the age of the universe from some cosmological dataset:
age_of_universe.py
import paths
from my_awesome_code import get_age_of_universe

# Load the data
dataset = paths.data / "planck.dat"

# Compute the age
age = get_age_of_universe(dataset)

# Write it to disk
with open(paths.output / "age_of_universe.txt", "w") as f:
    print(f"{age:.3f}", file=f)
I would like to report this age in the text of my article, but I want to avoid having to retype it every time I make changes to my workflow that affect this quantity. We can easily automate this by defining a custom Snakemake rule:
Snakefile
rule age_of_universe:
    input:
        "src/data/planck.dat"
    output:
        "src/tex/output/age_of_universe.txt"
    script:
        "src/scripts/age_of_universe.py"
Then, in my TeX file, I can do the following:
ms.tex
Based on a detailed analysis of Planck observations of the cosmic
microwave background, we have determined the age of the universe
to be \variable{output/age_of_universe.txt} Gyr.
That’s it! This functionality can easily be adapted to automatically populate tables in
your article or anything else that can be generated programmatically from your
workflow. Note that showyourwork! automatically parses \variable
commands and adds their arguments as explicit dependencies of the manuscript,
so that any changes to these files will trigger a re-run of the compile step.
For more information on this command, see The \variable command.
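As one possible way to adapt this to tables, a script could write the body rows of a LaTeX table to a file that is then pulled into the manuscript. The sketch below only covers the formatting step; the format_table_rows helper, the three-column layout, and the placeholder values are all hypothetical, and it assumes the included file may contain arbitrary LaTeX.

```python
def format_table_rows(rows):
    """Format (name, value, uncertainty) tuples as LaTeX tabular rows."""
    lines = [
        f"{name} & {value:.2f} & {error:.2f} \\\\"
        for name, value, error in rows
    ]
    return "\n".join(lines) + "\n"


# Placeholder values, for illustration only.
rows = [("Parameter A", 1.0, 0.25), ("Parameter B", 2.5, 0.5)]

# In a workflow script, this string would be written to a file under the
# output directory (e.g. paths.output / "table_rows.tex", a hypothetical
# name) and declared as the output of a custom rule.
print(format_table_rows(rows))
```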
Mixed figure environments
Note
Coming soon: how to deal with \figure environments with figures
that are generated by multiple different scripts, or if you’d like to
include figures generated by a given script in multiple figure environments.
It’s easy if you define your own Snakemake rules.
Advanced usage
It is also possible to entirely override showyourwork! rules. When ingesting
user-defined rules from the Snakefile, the code automatically gives precedence
to those rules over showyourwork! rules (by setting a higher ruleorder for
all user rules). This means that if there are two rules that can generate the
same output, Snakemake will always favor the user-defined rule.
You can take advantage of this to provide custom rules to build individual
figures or even the article PDF itself.
Using existing (data) files in a workflow by ignoring timestamps
When starting up a project or when in a rapid development phase, it can be useful to
tell Snakemake to ignore changes to a file or its timestamp when running the build. For
example, you may have a slow rule that generates a data file by querying an external data
archive, and you just want to use a temporary subset of the data or an existing copy of the
data. Snakemake supports this with the ancient() function, which is applied to an input
file in a rule. See the Snakemake
documentation for more information about how to use this in a rule.
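To build some intuition for what ignoring timestamps means, the sketch below implements a simplified, Make-style staleness check. It is a conceptual illustration only and not how Snakemake is implemented (Snakemake can also take other changes into account when deciding what to re-run); the needs_rerun function and its ancient argument are hypothetical names chosen for this example.

```python
import os


def needs_rerun(inputs, output, ancient=()):
    """Make-style staleness check (a simplification, not Snakemake's code).

    The output is stale if it is missing or older than any input, except
    that inputs listed in `ancient` are never treated as newer.
    """
    if not os.path.exists(output):
        return True
    output_mtime = os.path.getmtime(output)
    return any(
        os.path.getmtime(path) > output_mtime
        for path in inputs
        if path not in ancient
    )
```

In this picture, marking a data file as ancient means that a fresh copy of that file, even one with a newer modification time, does not by itself make the downstream output stale.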