Zenodo integration#
showyourwork! integrates with the Zenodo and Zenodo Sandbox services to easily allow users to host datasets and simulation results used in the workflow. There are two main ways in which this integration occurs: with static datasets and with dynamic datasets.
Static datasets#
If your workflow depends on data that cannot be programmatically generated,
such as data collected from a telescope, a mass spectrometer, a DNA
sequencer, etc., that data should be made available to anyone trying to
reproduce your results. Instead of committing the dataset directly to the
repository, we recommend you archive it on an online open-access
file-hosting service. You can use whichever service you want, but if you use
Zenodo, showyourwork! does all the communicating back-and-forth for you. All
you need to do is specify the ID of the (public) archive and some information about the
files your workflow needs in the showyourwork.yml
config file.
For detailed information about how to work with static Zenodo datasets, please
check out datasets.
Dynamic datasets#
The showyourwork! workflow has the ability to cache the results of intermediate steps in your pipeline on either Zenodo or Zenodo Sandbox. This is useful for workflows that entail running lengthy computations, simulations, etc., that third-party users may not want to run on their own. It’s also useful for builds on GitHub Actions, which has limited compute resources and a timeout of a few hours.
The way showyourwork! deals with these cases is to cache these lengthy computations on Zenodo or Zenodo Sandbox alongside a record of all the inputs that went into generating the cached output. If, on subsequent runs of the workflow, the inputs remain unchanged, showyourwork! will simply download the cached results from Zenodo or Zenodo Sandbox, maintaining the guarantee that the output you get follows deterministically from the given inputs.
Before we get into how this works and how to take advantage of it, let’s discuss how to set up the integration by generating Zenodo and Zenodo Sandbox API tokens.
Setting up Zenodo integration#
By default, caching takes place on Zenodo Sandbox, which is equal to Zenodo in all respects except that records on that service are temporary. This makes it perfect for caching intermediate results during the development cycle, where things are likely to change a lot and you may not want to assign a static, permanent DOI with any particular dataset.
In order for showyourwork! to communicate with Zenodo Sandbox, it needs access to an API token. You can generate it by clicking here. If you don’t yet have a Zenodo Sandbox account, you’ll be asked to create one.
Name the token something informative and make
sure to give it deposit:actions
and deposit:write
permissions. Copy the
token and store it somewhere secure.
Warning
Never commit your Zenodo API token (or any API token) directly to your repository!
Then, on your local computer, create an environment variable called
$SANDBOX_TOKEN
with value equal to the API token you just generated.
You can either do this manually in the terminal, e.g.,
export SANDBOX_TOKEN=YYYYYY
or by adding that line to your shell config file (.bashrc
, .zshrc
, etc.)
and re-starting your session.
In order for showyourwork! to have access to Zenodo Sandbox when running on GitHub
Actions, you’ll also have to provide this value as a secret (name it SANDBOX_TOKEN
);
read more about those
here.
If you’ve done all that, the next time you create a new article repository
using showyourwork setup, pass the --cache
option and showyourwork! will automatically create a Zenodo
Sandbox draft deposit which it will use to cache your intermediate results. Note that
you can also manually create a draft deposit by running showyourwork cache create
(see showyourwork cache, and below, for details).
Intermediate results#
Earlier, we mentioned that the Zenodo integration allows users to cache intermediate
results in their workflow.
But hang on–what’s an “intermediate result”? The standard procedure for generating
figures using showyourwork! is to define a figure script that generates the
figure output (PDF, PNG, etc.) and to specify that script using the \script
command in your TeX file (see The \script command). There’s no intermediate step
there–we simply go from figure script to figure output (and then to the article
PDF).
Suppose, however, that our figure script involves some lengthy computation, integration, or simulation. Every time we change anything in that script, showyourwork! will attempt to re-run the entire computation when asked to build your article PDF. This is good for reproducibility—i.e., to always ensure the output is up to date with the inputs—but it is extremely wasteful in cases where, e.g., we wish to tweak some aspect of the plot, like the color of a line in a graph.
To avoid this, we can split our script into two: one that runs the simulation and saves the results, and one that loads the results and plots them. Let’s discuss how to do this by considering an example.
An expensive workflow#
Consider a workflow containing the following figure script and TeX file:
src/scripts/figure.py
#import simulation
import matplotlib.pyplot as plt
import paths
# Run the simulation for some inputs
simulation.run(x=10, y=25)
data = simulation.get_results()
# Plot the results
fig, ax = plt.subplots(1)
ax.plot(data, color="k")
fig.savefig(paths.figures / "figure.pdf")
src/tex/ms.tex
#...
\begin{figure}[ht!]
\script{figure.py}
\begin{centering}
\includegraphics{figures/figure.pdf}
\caption{Simulation results.}
\label{fig:figure}
\end{centering}
\end{figure}
...
where simulation
is some custom package we’re using to run
an expensive simulation. As we mentioned above, changing anything in the
file src/scripts/figure.py
, including something as trivial as the plot
line color, will result in a re-run of the entire simulation the next time
we build the article.
The streamlined version#
We would like to streamline our workflow by decoupling the plotting step
from the simulation step. We can do this by introducing a new script, which
we’ll call simulation.py
, that runs and saves the result of the simulation.
Then, in figure.py
, we load the result and plot our figure:
src/scripts/simulation.py
#import simulation
import numpy as np
import paths
# Run the simulation for some inputs
simulation.run(x=10, y=25)
data = simulation.get_results()
# Save the results
np.savetxt(paths.data / "simulation.dat", data)
src/scripts/figure.py
#import numpy as np
import matplotlib.pyplot as plt
import paths
# Load the data
data = np.loadtxt(paths.data / "simulation.dat")
# Plot the results
fig, ax = plt.subplots(1)
ax.plot(data, color="k")
fig.savefig(paths.figures / "figure.pdf")
Our workflow is now separable: changes to figure.py
will not result
in the re-execution of the simulation, as they are merely plotting changes.
The simulation will only be re-executed if we change something in simulation.py
,
like the input arguments to our simulation.run()
function.
In order to get this all to work, we need to tell showyourwork! two things:
(1) the script figure.py
has a dependency called simulation.dat
and
(2) the dependency simulation.dat
can be generated by running the script
simulation.py
. We accomplish this by (1) editing the config file:
showyourwork.yml
#dependencies:
src/scripts/figure.py:
- src/data/simulation.dat
(see dependencies for details) and (2) adding a custom rule to our Snakefile:
Snakefile
#rule simulation:
output:
"src/data/simulation.dat"
script:
"src/scripts/simulation.py"
(see Intro to Snakemake for details).
Caching the intermediate result#
The workflow above is now separable, but we’re still not caching anything.
If we commit and push it to GitHub, the runner will still have to execute
simulation.py
in order to generate simulation.dat
; the same goes for
third-party users who have cloned your repository. Adding caching functionality
can be done by adding a single line to the Snakefile
:
Snakefile
#rule simulation:
output:
"src/data/simulation.dat"
cache:
True
script:
"src/scripts/simulation.py"
which tells showyourwork! to cache the output of that rule (simulation.dat
).
Normally, if we were just running this in a regular Snakemake pipeline, this
would result in the data file getting cached in some local hidden folder. The
next time you run your workflow, Snakemake will check to see if any of the inputs
to the simulation
rule changed and, if not, it will restore simulation.dat
from the cache (if it’s needed).
showyourwork! builds on this functionality by also caching the file simulation.dat
on Zenodo Sandbox, allowing the results to be restored on any computer running your
workflow (as long as they have the correct SANDBOX_TOKEN
; but more on this
in a moment). This means that, provided you have run your workflow locally first,
the runner on GitHub Actions will never have to execute simulation.py
, as
it can just download the result from Zenodo Sandbox. Recall that this procedure still
guarantees that you’ll get the same result as if you had run your entire
simulation (provided your workflow is deterministic), since a cache is only
restored if none of the upstream inputs to a rule have changed.
The cached files (and the hashes of the rule inputs)
are stored in a Zenodo Sandbox deposit draft with concept ID specified
in your zenodo.yml
config file. If you navigate to Zenodo Sandbox in your
browser and log in, you should see a draft with a title like
Data for user/repo [main]
, where user/repo
is your repository slug
and main
is the current branch. At any given time, you can only have
one draft per deposit, so if you change any of the inputs to your rule (e.g., if
you change the file simulation.py
), the draft will get overwritten with
a new version of the cache. Note, also, that drafts are private: only
users with access to your account can see their files.
Note
If you switch branches, or if you set up a repository without caching functionality and would like to add it, you can create a new Zenodo Sandbox deposit for the current branch by running
showyourwork cache create
Note
Snakemake only allows rules with a single output to be cached
(see here).
If you would like to cache a step in your build that generates multiple files,
the recommended approach is to generate all those files in a dedicated
directory and cache the directory itself and wrap it in the special directory
function within the rule:
rule simulation:
output:
directory("src/data/simulation")
cache:
True
script:
"src/scripts/simulation.py"
Snakemake (and showyourwork!) will automatically zip the contents of the directory and upload the archive to Zenodo Sandbox, then download and unzip it the next time it is needed.
Scripts that use the files in this output directory can have their rule inputs point directly to the directory itself.
rule figure:
input:
"src/data/simulation"
output:
"src/figures/figure.pdf"
script:
"src/scripts/figure.py"
You will also need to update the showyourwork.yml
config file to
reflect the dependency on the directory, rather than the individual files.
showyourwork.yml
#dependencies:
src/scripts/figure.py:
- src/data/simulation/
Freezing the cache#
Only one draft can exist at a given time on Zenodo Sandbox. This is inconvenient if, during development, you find the need to frequently switch back and forth between different sets of inputs (such as a few different randomizer seeds for a simulation). showyourwork! can help with this! At any time, you can “freeze” the latest draft on Zenodo Sandbox by running
showyourwork cache freeze
which publishes your draft into an actual deposit on Zenodo Sandbox. The next time you edit your workflow, a new draft will be created. Future builds can then draw from either the published version or the draft. This process can be repeated as often as you need, so that at any given time the workflow is able to restore intermediate results from any number of deposits on Zenodo Sandbox.
One useful side effect of freezing a deposit is that an API token is not
required to access it. This is especially convenient for working with
outside collaborators, who will not in general have access to your
SANDBOX_TOKEN
. If the cache has been frozen, they will be able to
restore output from the cache both locally and when building pull requests
on GitHub Actions.
Note
It is important to keep in mind that Zenodo Sandbox deposits cannot be deleted. While the service is temporary—and can be wiped at any time to free up space on the Zenodo servers—users should take care to not abuse it!
Publishing the cache#
When you’re ready to publish or distribute your article to the outside world –and you’re confident the inputs to your cached rules won’t change again– you should consider migrating your cached outputs from Zenodo Sandbox to Zenodo. You can easily do this by running
showyourwork cache publish
in the top level of your repo. This will transfer the latest draft of your Zenodo Sandbox deposit over to Zenodo and publish it, giving it a permanent DOI (digital object identifier) and making it visible to unauthenticated users. Once you do this, anyone can take advantage of the caching functionality, for all of eternity (well, for as long as Zenodo continues to exist).
Note that in order for this step to work, you’ll need proper authentication
in the form of an API token stored as the environment variable ZENODO_TOKEN
.
This is different from the Zenodo Sandbox token we generated above. If you
don’t have one, you may generate one by clicking
here.
If you don’t yet have a Zenodo account, you’ll be asked to create one.
Note
Once you publish your deposit, further changes to a cached rule’s inputs
will result in a new draft being created on Zenodo Sandbox (much like what
happens when you freeze
your cache). Future runs of your workflow
will be able to restore the cache from any of the frozen versions on Zenodo Sandbox,
published versions on Zenodo, or from the latest draft on Zenodo Sandbox.
Warning
Published Zenodo deposits are permanent! There is no way to delete a Zenodo deposit once
it’s published, as it now has a perennial DOI associated with it. Therefore,
it is important that users be responsible in their use of this service!
If you simply want a few different versions of your cache while you develop
your project, please use the freeze
feature instead (see above).
Deleting the cache#
You can delete the latest Zenodo Sandbox cache draft for the current branch by running
showyourwork cache delete
Note that, as we mentioned above, you can’t delete deposits once they have been published!