Introduction
Overview
Teaching: 10 min
Exercises: 1 min
Questions
What makes research data analyses reproducible?
Is preserving code, data, and containers enough?
Objectives
Understand principles behind computational reproducibility
Understand the concept of serial and parallel computational workflows
Computational reproducibility
A reproducibility quote
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
– Jonathan B. Buckheit and David L. Donoho, “WaveLab and Reproducible Research”, source
Computational reproducibility has many definitions. See, for example, The Turing Way definition of computational reproducibility: source
In other words: same data + same analysis = reproducible results
What about real life?
Example: Nature volume 533 issue 7604 (2016) surveying 1500 scientists. source
Half of researchers cannot reproduce their own results.
Slow uptake of best practices
Many “best practices” guidelines published. For example, “Ten Simple Rules for Reproducible Computational Research” by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, Eivind Hovig (2013) DOI:10.1371/journal.pcbi.1003285:
- For every result, keep track of how it was produced
- Avoid manual data manipulation steps
- Archive the exact versions of all external programs used
- Version control all custom scripts
- Record all intermediate results, when possible in standardized formats
- For analyses that include randomness, note underlying random seeds
- Always store raw data behind plots
- Generate hierarchical analysis output, allowing layers of increasing detail to be inspected
- Connect textual statements to underlying results
- Provide public access to scripts, runs, and results
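For instance, the rules about archiving program versions, version control and random seeds can often be addressed with a few shell lines recording provenance next to the results (a minimal sketch; the file name is arbitrary):

$ root-config --version  > provenance.txt   # exact ROOT version used
$ git rev-parse HEAD    >> provenance.txt   # exact commit of the custom analysis scripts
$ echo "SEED=12345"     >> provenance.txt   # random seed used for event generation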
Yet the uptake has been slow. Several reasons:
- sociological: publish-or-perish culture; missing incentives
- technological: lack of easy-to-use tools
Top-down approaches (funding bodies asking for Data Management Plans) combined with bottom-up approaches (building tools that integrate into the daily research workflow) are bringing about change.
A reproducibility quote
Your closest collaborator is you six months ago… and your younger self does not reply to emails.
Four questions
Four questions to aid robustness of analyses:
- Input data? Specify all input data and parameters.
- Analysis code? Specify all analysis code and the libraries used to analyse the data.
- Compute environment? Specify all requisite libraries and the operating system platform used to run the analysis.
- Runtime procedures? Specify all the computational steps taken to achieve the result.
Code and the containerised environment were covered in the previous two days; good!
Today we’ll cover the preservation of runtime procedures.
Exercise
Are containers enough to capture your runtime environment? What else might be necessary in your typical physics analysis scenarios?
Solution
Any external resources, such as database calls, must also be considered. Will the external database that you use still be there in two years?
Computational workflows
Use of interactive and graphical interfaces is not recommended, as one cannot reproduce user clicks easily.
Use of custom helper scripts (e.g. run.sh shell scripts) or custom orchestration scripts (e.g. Python glue code) running the analysis is much better.
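For illustration, such a hypothetical run.sh glue script might look like this (it hard-codes the ROOT commands of the RooFit example that appears later in this lesson):

#!/usr/bin/env bash
# hypothetical run.sh: every path, parameter and step ordering is hard-coded
set -e
mkdir -p results
root -b -q 'code/gendata.C(20000,"results/data.root")'                # step 1: generate data
root -b -q 'code/fitdata.C("results/data.root","results/plot.png")'   # step 2: fit data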
However, porting such glue code to new usage scenarios can be tedious work, and that time is better spent doing research.
Hence the birth of declarative workflow systems that express the computational steps more abstractly.
Example of a serial computational workflow typical for ATLAS RECAST analyses:
Example of a parallel computational workflow typical for Beyond Standard Model searches:
Many different computational data analysis workflow systems exist.
Different tools used in different communities: fit for use, fit for purpose, culture, preferences.
REANA
We shall use REANA reproducible analysis platform to explore computational workflows in this lesson. REANA is a pilot project and supports:
- multiple workflow systems (CWL, Serial, Yadage)
- multiple compute backends (Kubernetes, HTCondor, Slurm)
- multiple storage backends (Ceph, EOS)
Analysis preservation ab initio
Preserving analysis code and processes after publication is often too late: key information and knowledge may be lost during the lengthy analysis process.
Making research reproducible from the start, in other words making research “preproducible”, makes analysis preservation easy.
Preproducibility driving preservation: red pill. Preservation driving reproducibility: blue pill.
Key Points
Workflow is the new data.
Data + Code + Environment + Workflow = Reproducible Analyses
Before reproducibility comes preproducibility
First example
Overview
Teaching: 15 min
Exercises: 5 min
Questions
How to run analyses on REANA cloud?
What are the basic REANA command-line client usage scenarios?
How to monitor my analysis using REANA web interface?
Objectives
Get hands-on experience with REANA command-line client
Overview
In this lesson we shall run our first simple REANA example. We shall see:
- the structure of the example analysis and its associated reana.yaml file
- how to install the REANA command-line client
- how to connect the REANA client to a remote REANA cluster
- how to run the analysis on a remote REANA cluster
Checklist
Have you installed reana-client and/or have you logged into LXPLUS as described in Setup?
First REANA example
We shall get acquainted with REANA by means of running a sample analysis example:
Let’s start by cloning it:
$ git clone https://github.com/reanahub/reana-demo-root6-roofit
$ cd reana-demo-root6-roofit
What does the example do? The example emulates a typical particle physics analysis where the signal and background data is processed and fitted against a model. The example will use the RooFit package of the ROOT framework.
Four questions:
- Input data? None; we shall simulate it.
- Analysis code? Two files: the gendata.C macro generates the signal and background data; the fitdata.C macro fits the signal and background data.
- Compute environment? ROOT with RooFit.
- Runtime procedures? A simple serial workflow: first run gendata, then run fitdata.
Workflow definition:
START
|
|
V
+-------------------------+
| (1) generate data |
| |
| $ root gendata.C ... |
+-------------------------+
|
| data.root
V
+-------------------------+
| (2) fit data |
| |
| $ root fitdata.C ... |
+-------------------------+
|
| plot.png
V
STOP
The four questions expressed in reana.yaml fully define our analysis:
version: 0.6.0
inputs:
files:
- code/gendata.C
- code/fitdata.C
parameters:
events: 20000
data: results/data.root
plot: results/plot.png
workflow:
type: serial
specification:
steps:
- name: gendata
environment: 'reanahub/reana-env-root6:6.18.04'
commands:
- mkdir -p results && root -b -q 'code/gendata.C(${events},"${data}")'
- name: fitdata
environment: 'reanahub/reana-env-root6:6.18.04'
commands:
- root -b -q 'code/fitdata.C("${data}","${plot}")'
outputs:
files:
- results/plot.png
Note how the basic structure of reana.yaml answers the Four Questions: input data? analysis code? compute environment? workflow steps?
Install REANA command-line client
First we need to make sure we can use the REANA command-line client. Option 1: install and use it locally on your laptop. Option 2: use the version preinstalled on LXPLUS. See the setup instructions.
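If you opt for the local installation, the client is typically installable from PyPI (the Setup page lists the exact recommended version and environment):

$ pip install --user reana-client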
The client will offer several commands which we shall go through in this tutorial:
$ reana-client --help
Usage: reana-client [OPTIONS] COMMAND [ARGS]...
REANA client for interacting with REANA server.
Options:
-l, --loglevel [DEBUG|INFO|WARNING]
Sets log level
--help Show this message and exit.
Configuration commands:
ping Check connection to REANA server.
version Show version.
Workflow management commands:
create Create a new workflow.
delete Delete a workflow.
diff Show diff between two workflows.
list List all workflows and sessions.
Workflow execution commands:
logs Get workflow logs.
restart Restart previously run workflow.
run Shortcut to create, upload, start a new workflow.
start Start previously created workflow.
status Get status of a workflow.
stop Stop a running workflow.
validate Validate workflow specification file.
Workspace interactive commands:
close Close an interactive session.
open Open an interactive session inside the workspace.
Workspace file management commands:
download Download workspace files.
du Get workspace disk usage.
ls List workspace files.
mv Move files within workspace.
rm Delete files from workspace.
upload Upload files and directories to workspace.
Secret management commands:
secrets-add Add secrets from literal string or from file.
secrets-delete Delete user secrets by name.
secrets-list List user secrets.
You can use the --help option to learn more about any command, for example validate:
$ reana-client validate --help
Usage: reana-client validate [OPTIONS]
Validate workflow specification file.
The `validate` command allows to check syntax and validate the reana.yaml
workflow specification file.
Examples:
$ reana-client validate -f reana.yaml
Options:
-f, --file PATH REANA specifications file describing the workflow and
context which REANA should execute.
--help Show this message and exit.
Exercise
Validate our reana.yaml file to discover any errors. Use the validate command to do so.
Solution
$ reana-client validate -f ./reana.yaml
File reana-demo-root6-roofit/reana.yaml is a valid REANA specification file.
Connect REANA client to remote REANA cluster
The REANA client will interact with a remote REANA cluster.
The REANA client knows to which REANA cluster it connects by means of two environment variables:
$ export REANA_SERVER_URL=https://reana.cern.ch
$ export REANA_ACCESS_TOKEN=xxxxxx
The REANA client connection to the remote REANA cluster can be verified via the ping command:
$ reana-client ping
Connected to https://reana.cern.ch - Server is running.
The authentication uses a token that one can get by logging into REANA UI at reana.cern.ch.
Exercise
Get your REANA user token and connect to the REANA cluster at reana.cern.ch.
Solution
Log in to reana.cern.ch, copy your token, and use the commands above.
Run example on REANA cluster
Now that we have defined and validated our reana.yaml, and connected to the REANA production cluster, we can run the example easily via:
$ reana-client run -w roofit
[INFO] Creating a workflow...
roofit.1
[INFO] Uploading files...
File code/gendata.C was successfully uploaded.
File code/fitdata.C was successfully uploaded.
[INFO] Starting workflow...
roofit.1 is running
Here, we use the run command, which creates a new workflow named roofit, uploads its inputs as specified in the workflow specification, and finally starts the workflow.
While the workflow is running, we can enquire about its status:
$ reana-client status -w roofit
NAME RUN_NUMBER CREATED STARTED STATUS PROGRESS
roofit 1 2020-02-17T16:01:45 2020-02-17T16:01:48 running 1/2
After a minute, the workflow should finish and we can list the output files in the remote workspace:
$ reana-client ls -w roofit
NAME SIZE LAST-MODIFIED
code/gendata.C 1937 2020-02-17T16:01:46
code/fitdata.C 1648 2020-02-17T16:01:47
results/plot.png 15450 2020-02-17T16:02:44
results/data.root 154457 2020-02-17T16:02:17
We can also get the logs:
$ reana-client logs -w roofit | less
==> Workflow engine logs
2020-02-17 16:02:10,859 | root | MainThread | INFO | Publishing step:0, cmd: mkdir -p results && root -b -q 'code/gendata.C(20000,"results/data.root")', total steps 2 to MQ
2020-02-17 16:02:23,002 | root | MainThread | INFO | Publishing step:1, cmd: root -b -q 'code/fitdata.C("results/data.root","results/plot.png")', total steps 2 to MQ
2020-02-17 16:02:50,093 | root | MainThread | INFO | Workflow 424bc949-b809-4782-ba96-bc8cfa3e1a89 finished. Files available at /var/reana/users/b57e902f-fd11-4681-8a94-4318ae05d2ca/workflows/424bc949-b809-4782-ba96-bc8cfa3e1a89.
==> Job logs
==> Step: gendata
==> Workflow ID: 424bc949-b809-4782-ba96-bc8cfa3e1a89
==> Compute backend: Kubernetes
==> Job ID: 53c97429-25e9-4b74-94f7-c665d93fdbc2
==> Docker image: reanahub/reana-env-root6:6.18.04
==> Command: mkdir -p results && root -b -q 'code/gendata.C(20000,"results/data.root")'
==> Status: finished
==> Logs:
...
We can download the resulting plot:
$ reana-client download results/plot.png -w roofit
$ display results/plot.png
Exercise
Run the example workflow on the REANA cluster. Practice the status, ls, logs, and download commands. For example, can you get the logs of the gendata step only?
Solution
$ reana-client logs -w roofit --step gendata
Key Points
Use the reana-client command-line client to run containerised workflows from your laptop on remote compute clouds
Before running an analysis remotely, check its correctness locally via the validate command
As always, when in doubt, use the --help command-line argument
Developing serial workflows
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How to write serial workflows?
What is declarative programming?
How to develop workflows progressively?
Can I temporarily override workflow parameters?
Do I always have to build new Docker image when my code changes?
Objectives
Understand pros/cons between imperative and declarative programming styles
Get familiar with serial workflow development practices
Understand run numbers of your analysis
See how you can run only parts of the workflow
See how you can repeat workflow to fix a failed step
Overview
We have seen how to use REANA client to run containerised analyses on the REANA cloud.
In this lesson we see more use cases suitable for developing serial workflows.
Imperative vs declarative programming
Imperative programming feels natural: use a library and just write code. Example: C.
for (int i = 0; i < sizeof(people) / sizeof(people[0]); i++) {
    if (people[i].age < 20) {
        printf("%s\n", people[i].name);
    }
}
However, it also has its drawbacks. If you write scientific workflows imperatively and you need to port the code to use GPUs, to run on different compute architectures, or to scale up, considerable code refactoring may be necessary. This is no longer writing science code, but rather writing orchestration code to deploy said science code in different scenarios.
Enter declarative programming that “expresses the logic of a computation without describing its control flow”. Example: SQL.
SELECT name FROM people WHERE age<20
The idea of the declarative approach to scientific workflows is to express research as a series of data analysis steps and to leave to an independent “orchestration tool” or “workflow system” the task of running things properly on various deployment architectures.
This achieves a better separation of concerns between physics-code knowledge and computing-orchestration glue-code knowledge. However, development may feel less immediate. There are pros and cons; there is no silver bullet.
Imperative or declarative?
Imperative programming is about how you want to achieve something. Declarative programming is about what you want to achieve.
Developing workflows progressively
Developing workflows declaratively may feel less natural. How do we do that?
Start with earlier steps, run, debug, run, debug until satisfaction.
Continue with the later steps only afterwards.
How do we run only the first step of our example workflow? Use the TARGET step option:
$ reana-client run -w roofit -o TARGET=gendata
[INFO] Creating a workflow...
roofit.2
[INFO] Uploading files...
File code/gendata.C was successfully uploaded.
File code/fitdata.C was successfully uploaded.
[INFO] Starting workflow...
roofit.2 is running
After a minute, let us check the status:
$ reana-client status -w roofit
NAME RUN_NUMBER CREATED STARTED ENDED STATUS PROGRESS
roofit 2 2020-02-17T16:07:29 2020-02-17T16:07:33 2020-02-17T16:08:48 finished 1/1
and the workspace content:
$ reana-client ls -w roofit
NAME SIZE LAST-MODIFIED
code/gendata.C 1937 2020-02-17T16:07:30
code/fitdata.C 1648 2020-02-17T16:07:31
results/data.root 154458 2020-02-17T16:08:43
As we can see, the workflow ran only the first command and the data.root file was generated as expected. The final fitting step was not run and the final plot was not produced.
Workflow runs
We have run the analysis example anew. Similar to Continuous Integration systems, the REANA platform
runs each workflow in an independent workspace. To distinguish between various workflow runs of the
same analysis, the REANA platform keeps an incremental “run number”. You can obtain the list of all
your workflows by using the list command:
$ reana-client list
NAME RUN_NUMBER CREATED STARTED ENDED STATUS
roofit 2 2020-02-17T16:07:29 2020-02-17T16:07:33 2020-02-17T16:08:48 finished
roofit 1 2020-02-17T16:01:45 2020-02-17T16:01:48 2020-02-17T16:02:50 finished
You can use myanalysis.myrunnumber to refer to a given run number of an analysis:
$ reana-client ls -w roofit.1
$ reana-client ls -w roofit.2
To quickly see the differences between various workflow runs, you can use the diff command:
$ reana-client diff roofit.1 roofit.2 --brief
No differences in reana specifications.
DIFFERENCES IN WORKSPACE LISTINGS:
Files roofit.1/results/data.root and roofit.2/results/data.root differ
Only in roofit.1/results: plot.png
Workflow parameters
Another useful technique when developing a workflow is to use smaller data samples until the workflow is debugged. For example, instead of generating 20000 events, we can generate only 1000. While you could achieve this by simply modifying the workflow definition, REANA offers an option to run parametrised workflows, meaning that you can pass the wanted value on the command line:
$ reana-client run -w roofit -p events=1000
[INFO] Creating a workflow...
roofit.3
[INFO] Uploading files...
File code/gendata.C was successfully uploaded.
File code/fitdata.C was successfully uploaded.
[INFO] Starting workflow...
roofit.3 is running
The generated ROOT file is much smaller:
$ reana-client ls -w roofit.1 | grep data.root
results/data.root 154457 2020-02-17T16:02:17
$ reana-client ls -w roofit.3 | grep data.root
results/data.root 19216 2020-02-17T16:18:45
and the plot much coarser:
$ reana-client download results/plot.png -w roofit.3
Developing further steps
Now that we are happy with the beginning of the workflow, how do we continue to develop the rest? Running a new workflow every time could be very time-consuming; running the skimming step may take many more minutes than running the statistical analysis.
In these situations, you can take advantage of the restart functionality. The REANA platform allows you to restart a part of the workflow on the given workspace, starting from the workflow step specified by the FROM option:
$ reana-client restart -w roofit.3 -o FROM=fitdata
roofit.3.1 has been queued
Note that the run number got an extra digit, indicating the number of restarts of the given workflow. The full semantics of REANA run numbers is myanalysis.myrunnumber.myrestartnumber.
Let us enquire about the status of the restarted workflow:
$ reana-client status -w roofit.3.1
NAME RUN_NUMBER CREATED STARTED ENDED STATUS PROGRESS
roofit 3.1 2020-02-17T16:26:09 2020-02-17T16:26:10 2020-02-17T16:27:24 finished 1/1
Look at the number of steps of the 3.1 rerun and at the modification timestamps of the workspace files:
$ reana-client ls -w roofit.3.1
NAME SIZE LAST-MODIFIED
code/gendata.C 1937 2020-02-17T16:17:00
code/fitdata.C 1648 2020-02-17T16:17:01
results/plot.png 16754 2020-02-17T16:27:20
results/data.root 19216 2020-02-17T16:18:45
We can see that only the last step of the workflow was rerun, as wanted.
This technique is useful to debug later stages of the workflow without having to rerun the lengthy former stages of the workflow.
Exercise
Consider we would like to produce the final plot of the roofit example and change the title from “Fit example” to “RooFit example”. How do you do this in the most efficient way?
Solution
Amend fitdata.C, upload the changed file to the workspace, and rerun the past successful workflow starting from the fitdata step:
$ reana-client list
$ vim code/fitdata.C  # edit title printing statement
$ reana-client upload ./code/fitdata.C -w roofit.3
$ reana-client restart -w roofit.3 -o FROM=fitdata
$ reana-client list
$ reana-client status -w roofit.3.2
$ reana-client download -w roofit.3.2
Compile-time vs runtime code changes
Sometimes you have to build a new container image when your code changes (e.g. compiled C++ code); sometimes you do not (e.g. Python code, ROOT macros). Prefer the latter situation for more productivity when developing workflows.
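As a rough sketch of the two situations (the image name gitlab-registry.cern.ch/johndoe/myanalysis is hypothetical; the reana-client commands reuse the roofit example above):

# compile-time change (e.g. C++): rebuild and push the container image, then rerun
$ docker build -t gitlab-registry.cern.ch/johndoe/myanalysis .
$ docker push gitlab-registry.cern.ch/johndoe/myanalysis
# runtime change (e.g. a ROOT macro): just upload the new file and restart from the affected step
$ reana-client upload ./code/fitdata.C -w roofit.3
$ reana-client restart -w roofit.3 -o FROM=fitdata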
Key Points
Develop workflows progressively; add steps as needed
When developing a workflow, stay on the same workspace
When developing bytecode-interpreted code, stay with the same container
Use smaller test data before scaling out
Use workflows as Continuous Integration; make atomic commits that always work
HiggsToTauTau analysis: serial
Overview
Teaching: 5 min
Exercises: 20 min
Questions
Challenge: write the HiggsToTauTau analysis serial workflow and run it on REANA
Objectives
Develop a full HiggsToTauTau analysis workflow using the serial language
Get acquainted with writing moderately complex REANA examples
Overview
We have practiced writing and running workflows on REANA using a simple RooFit analysis example.
In this lesson we shall go back to the HiggsToTauTau analysis used throughout this workshop and we shall write a serial workflow to run the analysis on the REANA platform.
Recap
The past two days you have containerised HiggsToTauTau analysis by means of two GitLab repositories:
- awesome-analysis-eventselection with the skimming and histogramming steps;
- awesome-analysis-statistics with the fit.
You have used GitLab CI to build Docker images for these repositories such as:
gitlab-registry.cern.ch/johndoe/awesome-analysis-eventselection
gitlab-registry.cern.ch/johndoe/awesome-analysis-statistics
You have run the containerised analysis “manually” using docker, running commands such as:
bash skim.sh ...
bash histograms.sh ...
bash plot.sh ...
bash fit.sh ...
Objective
Let us now write a serial workflow so that the HiggsToTauTau example can be run sequentially on REANA.
Note: efficiency
Note that the serial workflow will not necessarily be efficient here, since it will run sequentially over the various dataset files and not process them in parallel. Do not pay attention to this inefficiency here; we shall speed up the example via parallel processing in the forthcoming “HiggsToTauTau analysis: parallel” episode coming after the coffee break.
Note: container directories and workspace directories
The awesome-analysis-eventselection and awesome-analysis-statistics repositories assume that you run code from certain absolute directories such as /analysis/skim. Note that when REANA starts a new workflow run, it creates a unique “workspace directory” for sharing read/write files between the workflow steps. It is good practice to keep the code and data directories readable and the workflow’s workspace writable in a clearly separated manner. In this way, the workflow does not risk writing over the inputs or the code provided by the container, which is good both for reproducibility purposes (inputs are not accidentally modified) and security purposes (code is not accidentally modified).
Note: REANA_WORKSPACE environment variable
The REANA platform provides a convenient set of environment variables that you can use in your scripts. One of them is REANA_WORKSPACE, which points to the workflow’s workspace and is unique for each run. You can use the $$REANA_WORKSPACE environment variable in your reana.yaml recipe to share the output of the skimming, histogramming, plotting and fitting steps. (Note the use of two leading dollar signs to escape the workflow parameter expansion that we have seen previously.)
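For example, a single step of your workflow might look roughly like this (a sketch only: the directory layout and the eosdir input parameter are placeholders you will define yourself; the image name is the one from the recap above):

- name: skimming
  environment: gitlab-registry.cern.ch/johndoe/awesome-analysis-eventselection
  commands:
    - mkdir -p $$REANA_WORKSPACE/skimming && cd /analysis/skim && bash ./skim.sh ${eosdir} $$REANA_WORKSPACE/skimming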
OK, challenge time!
With the above hints, please try to write the workflow either individually or in pairs.
Exercise
Write a reana.yaml representing the HiggsToTauTau analysis and run it on the REANA cloud.
Solution
$ cat reana.yaml
version: 0.6.0
inputs:
  parameters:
    eosdir: root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced
workflow:
  type: serial
  specification:
    steps:
      - name: skimming
        environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master
        commands:
          - mkdir $$REANA_WORKSPACE/skimming && cd /analysis/skim && bash ./skim.sh ${eosdir} $$REANA_WORKSPACE/skimming
      - name: histogramming
        environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master
        commands:
          - mkdir $$REANA_WORKSPACE/histogramming && cd /analysis/skim && bash ./histograms_with_custom_output_location.sh $$REANA_WORKSPACE/skimming $$REANA_WORKSPACE/histogramming
      - name: plotting
        environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master
        commands:
          - mkdir $$REANA_WORKSPACE/plotting && cd /analysis/skim && bash ./plot.sh $$REANA_WORKSPACE/histogramming/histograms.root $$REANA_WORKSPACE/plotting 0.1
      - name: fitting
        environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-statistics-stage3:master
        commands:
          - mkdir $$REANA_WORKSPACE/fitting && cd /fit && bash ./fit.sh $$REANA_WORKSPACE/histogramming/histograms.root $$REANA_WORKSPACE/fitting
outputs:
  files:
    - fitting/fit.png
Key Points
Writing serial workflows is like chaining shell script commands
Coffee break
Overview
Teaching: 15 min
Exercises: 0 min
Questions
Coffee break
Objectives
Refresh your mind
Discuss your experience
Key Points
Refresh your mind
Discuss your experience
Developing parallel workflows
Overview
Teaching: 15 min
Exercises: 10 min
Questions
How to scale up and run thousands of jobs?
What is a DAG?
What is a Scatter-Gather paradigm?
How to run Yadage workflows on REANA?
Objectives
Learn about Directed Acyclic Graphs (DAG)
Understand Yadage workflow language
Practice running and inspecting parallel workflows
Overview
We now know how to develop reproducible analyses on small scale using serial workflows.
In this lesson we shall learn how to scale up for real-life work, which requires using parallel workflows.
Workflows as Directed Acyclic Graphs (DAG)
Computational analyses can be expressed as a set of steps where some steps depend on other steps before they can begin their computations. In other words, the computational steps are expressed as a Directed Acyclic Graph, for example:
The REANA platform supports several DAG workflow specification languages:
- Common Workflow Language (CWL) originated in life sciences
- Yadage originated in particle physics
In this lesson we shall use the Yadage workflow specification language.
Yadage
Yadage enables describing complex computational workflows. Let us start by having a look at the Yadage specification for the RooFit example we have used in the earlier episodes:
stages:
- name: gendata
dependencies: [init]
scheduler:
scheduler_type: 'singlestep-stage'
parameters:
events: {step: init, output: events}
gendata: {step: init, output: gendata}
outfilename: '{workdir}/data.root'
step:
process:
process_type: 'interpolated-script-cmd'
script: root -b -q '{gendata}({events},"{outfilename}")'
publisher:
publisher_type: 'frompar-pub'
outputmap:
data: outfilename
environment:
environment_type: 'docker-encapsulated'
image: 'reanahub/reana-env-root6'
imagetag: '6.18.04'
- name: fitdata
dependencies: [gendata]
scheduler:
scheduler_type: 'singlestep-stage'
parameters:
fitdata: {step: init, output: fitdata}
data: {step: gendata, output: data}
outfile: '{workdir}/plot.png'
step:
process:
process_type: 'interpolated-script-cmd'
script: root -b -q '{fitdata}("{data}","{outfile}")'
publisher:
publisher_type: 'frompar-pub'
outputmap:
plot: outfile
environment:
environment_type: 'docker-encapsulated'
image: 'reanahub/reana-env-root6'
imagetag: '6.18.04'
We can see that the workflow consists of two steps: gendata does not depend on anything ([init]) and fitdata depends on gendata. This is how linear workflows are expressed in the Yadage language.
Running Yadage workflows
Let us save the above workflow as workflow.yaml in the RooFit example directory. How can we run the example on the REANA platform? We have to instruct REANA that we are going to use Yadage as our workflow engine. We can do that by writing a new specification file (say reana-yadage.yaml, keeping the original serial one intact) and specifying:
version: 0.6.0
inputs:
parameters:
events: 20000
gendata: code/gendata.C
fitdata: code/fitdata.C
workflow:
type: yadage
file: workflow.yaml
outputs:
files:
- fitdata/plot.png
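As with the serial example, you can validate the new specification file before running it (the validate command accepts the file via -f, as we saw earlier):

$ reana-client validate -f ./reana-yadage.yaml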
We can run the example on REANA in the usual way:
$ reana-client run -w roofityadage -f reana-yadage.yaml
Exercise
Run RooFit example using Yadage workflow engine on the REANA cloud. Upload code, run workflow, inspect status, check logs, download final plot.
Solution
Nothing changes in the usual user interaction with the REANA platform:
$ reana-client create -w roofityadage -f ./reana-yadage.yaml
$ reana-client upload ./code -w roofityadage
$ reana-client start -w roofityadage
$ reana-client status -w roofityadage
$ reana-client logs -w roofityadage
$ reana-client ls -w roofityadage
$ reana-client download plot.png -w roofityadage
Physics code vs orchestration code
Note that it wasn’t necessary to change anything in our code: we simply modified the workflow definition and we could run the RooFit code “as is” using another workflow engine. This is a simple demonstration of the separation of concerns between “physics code” and “orchestration code”.
Parallelism via step dependencies
We have seen how serial workflows can also be expressed in Yadage syntax using step dependencies. Note that, if the dependency graph permits, workflow steps that do not depend on each other or on the results of previous computations can be executed in parallel by the workflow engine; the physicist only has to supply the knowledge about which steps depend on which other steps, and the workflow engine takes care of efficiently starting and scheduling the tasks.
HiggsToTauTau analysis: simple version
Let us demonstrate how to write a simple Yadage workflow for the HiggsToTauTau analysis using simple step dependencies.
The workflow looks like:
stages:
- name: skim
dependencies: [init]
scheduler:
scheduler_type: singlestep-stage
parameters:
input_dir: {step: init, output: input_dir}
output_dir: '{workdir}/output'
step: {$ref: 'steps.yaml#/skim'}
- name: histogram
dependencies: [skim]
scheduler:
scheduler_type: singlestep-stage
parameters:
input_dir: {step: skim, output: skimmed_dir}
output_dir: '{workdir}/output'
step: {$ref: 'steps.yaml#/histogram'}
- name: fit
dependencies: [histogram]
scheduler:
scheduler_type: singlestep-stage
parameters:
histogram_file: {step: histogram, output: histogram_file}
output_dir: '{workdir}/output'
step: {$ref: 'steps.yaml#/fit'}
- name: plot
dependencies: [histogram]
scheduler:
scheduler_type: singlestep-stage
parameters:
histogram_file: {step: histogram, output: histogram_file}
output_dir: '{workdir}/output'
step: {$ref: 'steps.yaml#/plot'}
where steps are expressed as:
skim:
process:
process_type: 'interpolated-script-cmd'
script: |
mkdir {output_dir}
bash skim.sh {input_dir} {output_dir}
environment:
environment_type: 'docker-encapsulated'
image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
imagetag: master
publisher:
publisher_type: interpolated-pub
publish:
skimmed_dir: '{output_dir}'
histogram:
process:
process_type: 'interpolated-script-cmd'
script: |
mkdir {output_dir}
bash histograms_with_custom_output_location.sh {input_dir} {output_dir}
environment:
environment_type: 'docker-encapsulated'
image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
imagetag: master
publisher:
publisher_type: interpolated-pub
publish:
histogram_file: '{output_dir}/histograms.root'
plot:
process:
process_type: 'interpolated-script-cmd'
script: |
mkdir {output_dir}
bash plot.sh {histogram_file} {output_dir}
environment:
environment_type: 'docker-encapsulated'
image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
imagetag: master
publisher:
publisher_type: interpolated-pub
publish:
datamc_plots: '{output_dir}'
fit:
process:
process_type: 'interpolated-script-cmd'
script: |
mkdir {output_dir}
bash fit.sh {histogram_file} {output_dir}
environment:
environment_type: 'docker-encapsulated'
image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-statistics-stage3
imagetag: master
publisher:
publisher_type: interpolated-pub
publish:
fitting_plot: '{output_dir}/fit.png'
The workflow definition is similar to that of the Serial workflow and, as we can see, it already allows a certain degree of parallelism, because the fitting step and the plotting step can run simultaneously once the histograms are produced.
The graphical representation of the workflow is:
Let us try to run it on REANA cloud.
Exercise
Run the HiggsToTauTau analysis example using the simple Yadage workflow version. Take the workflow definition listed above, write the corresponding reana.yaml, and run the example on the REANA cloud.
Solution
$ vim workflow.yaml  # take contents above and store it as workflow.yaml
$ vim steps.yaml     # take contents above and store it as steps.yaml
$ vim reana.yaml     # this was the task
$ cat reana.yaml
version: 0.6.0
inputs:
  parameters:
    input_dir: root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced
workflow:
  type: yadage
  file: workflow.yaml
outputs:
  files:
    - fit/output/fit.png
Parallelism via scatter-gather paradigm
A useful paradigm of workflow languages is the “scatter-gather” behaviour, where we instruct the workflow engine to run a certain step over a certain input array in parallel, as if each element of the input were a single-item input (the “scatter” operation). The partial results processed in parallel are then assembled together (the “gather” operation). The “scatter-gather” paradigm allows expressing “map-reduce” operations with a minimum of syntax, without having to duplicate workflow code or statements.
Here is an example of the scatter-gather paradigm in the Yadage language, expressed as:
stages:
- name: map
dependencies: [init]
scheduler:
scheduler_type: multistep-stage
parameters:
input: {stages: init, output: input, unwrap: true}
batchsize: 3
scatter:
method: zip
parameters: [input]
- name: map2
dependencies: [map]
scheduler:
scheduler_type: multistep-stage
parameters:
input: {stages: map, output: outputA, unwrap: true}
batchsize: 2
scatter: ...
- name: reduce
dependencies: [map2]
scheduler:
scheduler_type: singlestep-stage
parameters:
input: {stages: 'map2', output: outputA}
Note the “scatter” happening over “input” with a wanted batch size.
In the next episode we shall see how the scatter paradigm can be used to speed up the HiggsToTauTau workflow using more parallelism.
Key Points
Computational analysis is a graph of inter-dependent steps
Fully declare inputs and outputs for each step
Use Scatter/Gather or Map/Reduce to avoid copy-paste coding
HiggsToTauTau analysis: parallel
Overview
Teaching: 10 min
Exercises: 20 min
Questions
Challenge: write the HiggsToTauTau analysis parallel workflow and run it on REANA
Objectives
Develop a full HiggsToTauTau analysis workflow using a parallel workflow language
Overview
We have seen an example of a full DAG-aware workflow language called Yadage and how it can be used to describe and run the RooFit example and a simple version of HiggsToTauTau example.
In this episode we shall see how to efficiently apply parallelism to speed up the HiggsToTauTau example via the scatter-gather paradigm introduced in the previous episode.
HiggsToTauTau analysis
The overall reana.yaml looks like:
version: 0.6.0
inputs:
parameters:
files:
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/GluGluToHToTauTau.root
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/VBF_HToTauTau.root
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/DYJetsToLL.root
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/TTbar.root
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W1JetsToLNu.root
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W2JetsToLNu.root
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W3JetsToLNu.root
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/Run2012B_TauPlusX.root
- root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/Run2012C_TauPlusX.root
cross_sections:
- 19.6
- 1.55
- 3503.7
- 225.2
- 6381.2
- 2039.8
- 612.5
- 1.0
- 1.0
short_hands:
- [ggH]
- [qqH]
- [ZLL,ZTT]
- [TT]
- [W1J]
- [W2J]
- [W3J]
- [dataRunB]
- [dataRunC]
workflow:
type: yadage
file: workflow.yaml
outputs:
files:
- fit/fit.png
Note that we define the input files, cross sections and short names as arrays. These are the arrays that we shall be scattering over.
HiggsToTauTau skimming
The skimming step definition looks like:
- name: skim
dependencies: [init]
scheduler:
scheduler_type: multistep-stage
parameters:
input_file: {step: init, output: files}
cross_section: {step: init, output: cross_sections}
output_file: '{workdir}/skimmed.root'
scatter:
method: zip
parameters: [input_file, cross_section]
step: {$ref: 'steps.yaml#/skim'}
where the step is defined as:
skim:
process:
process_type: 'interpolated-script-cmd'
script: |
./skim {input_file} {output_file} {cross_section} 11467.0 0.1
environment:
environment_type: 'docker-encapsulated'
image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
imagetag: master
publisher:
publisher_type: interpolated-pub
publish:
skimmed_file: '{output_file}'
Note the scatter paradigm that will launch nine parallel jobs, one for each input dataset file.
HiggsToTauTau histogramming
The histograms can be produced as follows:
- name: histogram
dependencies: [skim]
scheduler:
scheduler_type: multistep-stage
parameters:
input_file: {stages: skim, output: skimmed_file}
output_names: {step: init, output: short_hands}
output_dir: '{workdir}'
scatter:
method: zip
parameters: [input_file, output_names]
step: {$ref: 'steps.yaml#/histogram'}
with:
histogram:
process:
process_type: 'interpolated-script-cmd'
script: |
for x in {output_names}; do
python histograms.py {input_file} $x {output_dir}/$x.root;
done
environment:
environment_type: 'docker-encapsulated'
image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
imagetag: master
publisher:
publisher_type: interpolated-pub
glob: true
publish:
histogram_file: '{output_dir}/*.root'
HiggsToTauTau merging
Gather time! How do we merge scattered results?
- name: merge
dependencies: [histogram]
scheduler:
scheduler_type: singlestep-stage
parameters:
input_files: {stages: histogram, output: histogram_file, flatten: true}
output_file: '{workdir}/merged.root'
step: {$ref: 'steps.yaml#/merge'}
with:
merge:
process:
process_type: 'interpolated-script-cmd'
script: |
hadd {output_file} {input_files}
environment:
environment_type: 'docker-encapsulated'
image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
imagetag: master
publisher:
publisher_type: interpolated-pub
publish:
merged_file: '{output_file}'
HiggsToTauTau fitting
The fit can be performed as follows:
- name: fit
dependencies: [merge]
scheduler:
scheduler_type: singlestep-stage
parameters:
histogram_file: {step: merge, output: merged_file}
fit_outputs: '{workdir}'
step: {$ref: 'steps.yaml#/fit'}
with:
fit:
process:
process_type: 'interpolated-script-cmd'
script: |
python fit.py {histogram_file} {fit_outputs}
environment:
environment_type: 'docker-encapsulated'
image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-statistics-stage3
imagetag: master
publisher:
publisher_type: interpolated-pub
publish:
fit_results: '{fit_outputs}/fit.png'
HiggsToTauTau plotting
Challenge time! Add plotting step to the workflow.
Exercise
Following the example above, write plotting step and plug it into the overall workflow.
Solution
- name: plot
  dependencies: [merge]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      histogram_file: {step: merge, output: merged_file}
      plot_outputs: '{workdir}'
  step: {$ref: 'steps.yaml#/plot'}
with:
plot:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      python plot.py {histogram_file} {plot_outputs} 0.1
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      fitting_plot: '{plot_outputs}'
Full workflow
Assembling the previous stages visually, the full workflow looks like:
Running full workflow
We are now ready to run the example on the REANA cloud.
Exercise
Run the HiggsToTauTau parallel workflow on the REANA cloud. How many jobs does the workflow have? How much faster does it execute compared to the simple Serial version?
Solution
reana.yaml:
version: 0.6.0
inputs:
  parameters:
    files:
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/GluGluToHToTauTau.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/VBF_HToTauTau.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/DYJetsToLL.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/TTbar.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W1JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W2JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W3JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/Run2012B_TauPlusX.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/Run2012C_TauPlusX.root
    cross_sections:
      - 19.6
      - 1.55
      - 3503.7
      - 225.2
      - 6381.2
      - 2039.8
      - 612.5
      - 1.0
      - 1.0
    short_hands:
      - [ggH]
      - [qqH]
      - [ZLL,ZTT]
      - [TT]
      - [W1J]
      - [W2J]
      - [W3J]
      - [dataRunB]
      - [dataRunC]
workflow:
  type: yadage
  file: workflow.yaml
outputs:
  files:
    - fit/fit.png
workflow.yaml:
stages:
  - name: skim
    dependencies: [init]
    scheduler:
      scheduler_type: multistep-stage
      parameters:
        input_file: {step: init, output: files}
        cross_section: {step: init, output: cross_sections}
        output_file: '{workdir}/skimmed.root'
      scatter:
        method: zip
        parameters: [input_file, cross_section]
    step: {$ref: 'steps.yaml#/skim'}
  - name: histogram
    dependencies: [skim]
    scheduler:
      scheduler_type: multistep-stage
      parameters:
        input_file: {stages: skim, output: skimmed_file}
        output_names: {step: init, output: short_hands}
        output_dir: '{workdir}'
      scatter:
        method: zip
        parameters: [input_file, output_names]
    step: {$ref: 'steps.yaml#/histogram'}
  - name: merge
    dependencies: [histogram]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        input_files: {stages: histogram, output: histogram_file, flatten: true}
        output_file: '{workdir}/merged.root'
    step: {$ref: 'steps.yaml#/merge'}
  - name: fit
    dependencies: [merge]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        histogram_file: {step: merge, output: merged_file}
        fit_outputs: '{workdir}'
    step: {$ref: 'steps.yaml#/fit'}
  - name: plot
    dependencies: [merge]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        histogram_file: {step: merge, output: merged_file}
        plot_outputs: '{workdir}'
    step: {$ref: 'steps.yaml#/plot'}
steps.yaml:
skim:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      ./skim {input_file} {output_file} {cross_section} 11467.0 0.1
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      skimmed_file: '{output_file}'
histogram:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      for x in {output_names}; do
        python histograms.py {input_file} $x {output_dir}/$x.root;
      done
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    glob: true
    publish:
      histogram_file: '{output_dir}/*.root'
merge:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      hadd {output_file} {input_files}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      merged_file: '{output_file}'
fit:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      python fit.py {histogram_file} {fit_outputs}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-statistics-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      fit_results: '{fit_outputs}/fit.png'
plot:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      python plot.py {histogram_file} {plot_outputs} 0.1
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      fitting_plot: '{plot_outputs}'
Key Points
Use step dependencies to express main analysis stages
Use the scatter-gather paradigm in stages to massively parallelise DAG workflow execution
REANA usage scenarios remain the same regardless of workflow language details
A glimpse on advanced topics
Overview
Teaching: 15 min
Exercises: 5 min
Questions
Can I publish workflow results on EOS?
Can I use Kerberos to access restricted resources?
Can I use CVMFS software repositories?
Can I dispatch heavy computations to HTCondor?
Can I dispatch heavy computations to Slurm?
Can I open Jupyter notebooks on my REANA workspace?
Can I connect my GitLab repositories with REANA?
Objectives
Learn about advanced possibilities of REANA platform
Learn how to use Kerberos secrets to access restricted resources
Learn how to interact with remote storage solutions (EOS)
Learn how to interact with remote compute backends (HTCondor, Slurm)
Learn how to interact with remote code repositories (CVMFS, GitLab)
Learn how to open interactive sessions (Jupyter notebooks)
Overview
We now know how to write serial and parallel workflows.
What more do we need in order to use the system for real-life physics analyses?
Let’s scratch the surface of some more advanced topics:
- Publishing workflow artifacts on EOS
- Using CVMFS software repositories
- Using high-throughput computing backends: HTCondor
- Using high-performance computing backends: Slurm
- Opening interactive environments (notebooks) on workflow workspace
- Bridging GitLab with REANA
Publishing workflow results on EOS
REANA uses a shared filesystem for storing the results of your running workflows. These may be garbage-collected after a certain period of time. You can use the reana-client download command to download the results of your workflows, as we have seen in Episode 2. Is there a more automatic way?
One possibility is to add a final step to your workflow that publishes the results of interest on an outside filesystem. For example, how can you publish all resulting plots in your personal EOS folder?
First, you have to let the REANA platform know your Kerberos keytab so that the writing is authorised. We can do this by uploading appropriate “secrets”:
$ reana-client secrets-add --env CERN_USER=johndoe
$ reana-client secrets-add --env CERN_KEYTAB=johndoe.keytab
$ reana-client secrets-add --file ~/johndoe.keytab
Second, once we have the secrets, we can use a Kerberos-aware container image (such as reanahub/krb5) in the final publishing step of the workflow:
workflow:
type: serial
specification:
steps:
- name: myfirststep
...
- name: mysecondstep
...
- name: publish
kerberos: true
environment: 'reanahub/krb5'
commands:
- mkdir -p /eos/home/j/johndoe/myanalysis-outputs
- cp myplots/*.png /eos/home/j/johndoe/myanalysis-outputs/
Note the presence of the kerberos: true clause in the final publishing step, which tells the REANA system to initialise Kerberos authentication using the provided secrets for the workflow step at hand.
Exercise
Publish all produced HiggsToTauTau analysis plots to your EOS home directory.
Solution
Modify your workflow to add a final publishing step.
Hint: use a previously finished analysis run and the restart command so that you don’t have to rerun the full analysis again.
Using CVMFS software repositories
Many physics analyses need software residing on the CVMFS filesystem. Packaging this software into the container is possible, but it could make the container size enormous. Can we access the CVMFS filesystem at runtime instead?
REANA allows specifying custom resource needs in reana.yaml by means of a resources clause. An example:
workflow:
type: serial
resources:
cvmfs:
- fcc.cern.ch
specification:
steps:
- environment: 'cern/slc6-base'
commands:
- ls -l /cvmfs/fcc.cern.ch/sw/views/releases/
Exercise
Write a workflow that would run ROOT from the SFT repository on CVMFS and that would list all configuration flags enabled in that executable.
Solution
See also REANA documentation on CVMFS.
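One possible sketch, assuming the sft.cern.ch repository and an LCG view providing ROOT (the view path and base image are illustrative and may need adjusting to what is currently published on CVMFS):

workflow:
  type: serial
  resources:
    cvmfs:
      - sft.cern.ch
  specification:
    steps:
      - environment: 'cern/cc7-base'
        commands:
          # source an LCG view that provides ROOT, then list its enabled build features
          - source /cvmfs/sft.cern.ch/lcg/views/LCG_96/x86_64-centos7-gcc8-opt/setup.sh && root-config --features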
Using high-throughput computing backends: HTCondor
REANA uses Kubernetes as its default compute backend.
Massively parallel physics analyses profit from high-throughput computing (HTC) systems such as HTCondor to launch the same procedures over mutually independent data files.
If you would like to dispatch parts of the workflow to the HTCondor backend, you can use the compute_backend clause of the workflow specification, for example:
workflow:
type: serial
specification:
steps:
- name: gendata
environment: 'reanahub/reana-env-root6:6.18.04'
compute_backend: htcondorcern
commands:
- mkdir -p results
- root -b -q 'code/gendata.C(${events},"${data}")'
Note that the access control will be handled automatically via Kerberos, so this requires you to submit your keytab as in the EOS publishing example above.
Exercise
Modify HiggsToTauTau analysis to run the skimming part on the HTCondor cluster.
Solution
See also REANA documentation on HTCondor.
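A possible sketch, reusing the serial variant of the workflow from the earlier episode and adding a compute_backend hint to the skimming step (whether the shared workspace behaves identically on the HTCondor backend depends on the cluster configuration, so treat this as a starting point; remember to upload your keytab secrets first, as in the EOS publishing example):

steps:
  - name: skimming
    environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master
    compute_backend: htcondorcern
    commands:
      - mkdir $$REANA_WORKSPACE/skimming && cd /analysis/skim && bash ./skim.sh ${eosdir} $$REANA_WORKSPACE/skimming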
Using high-performance computing backends: Slurm
Another useful compute backend architecture is HPC with inter-connected nodes. This is useful for MPI and similar programming techniques.
REANA supports the Slurm job scheduler to send jobs to HPC clusters. You can simply use the compute_backend clause again to specify the wanted compute backend for each step:
workflow:
type: serial
specification:
steps:
- name: gendata
environment: 'reanahub/reana-env-root6:6.18.04'
compute_backend: slurmcern
commands:
- mkdir -p results
- root -b -q 'code/gendata.C(${events},"${data}")'
Exercise
Modify HiggsToTauTau analysis to run the histogramming part on the Slurm cluster.
Solution
See also REANA documentation on Slurm.
Opening interactive environments (notebooks) on workflow workspace
While your analysis workflows are running, you may want to open interactive sessions on the workflow workspace, for example to run a Jupyter notebook. This can be achieved via the open command:
$ reana-client open -w myanalysis.42
The command will generate a unique URL that will become active after a minute or two, where you will be able to open a notebook or a remote terminal on your workflow workspace.
When the notebook is no longer needed, it can be brought down via the close command:
$ reana-client close -w myanalysis.42
Exercise
Open a Jupyter notebook on your HiggsToTauTau analysis example and inspect the ROOT files there.
Solution
See also REANA documentation on running notebooks.
Bridging GitLab with REANA
When using GitLab for source code development, GitLab’s native Continuous Integration runners offer a comfortable testing environment for your analyses.
However, the CPU time is usually limited.
If you would like to run REANA workflows directly from GitLab, it is useful to bridge the REANA platform and the GitLab platform via OAuth technology.
This can be easily achieved from “Your profile” page on REANA user interface:
$ firefox https://reana.cern.ch/
Exercise
Connect your REANA account and your GitLab account and run an example analysis from GitLab on REANA cloud.
Solution
Key Points
Workflow specification uses hints to hide implementation complexity
Use the kerberos: true clause to automatically trigger Kerberos token initialisation
Use the resources clause to access CVMFS repositories
Use the compute_backend hint in your workflow steps to dispatch jobs to various HPC/HTC backends
Use the open/close commands to open and close interactive sessions on your workspace
Enable the REANA application on GitLab to run long-standing tasks that would time out in GitLab CI
Wrap-up
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What have we learned today?
Where to go from here?
Objectives
Learn about REANA community
Wrapping up
We have discussed the problem of reproducibility and reusability of scientific analyses.
We have practiced the declarative programming approach through several examples of containerised scientific workflows.
We have seen some of the capabilities of the REANA platform allowing you to easily run containerised scientific workflows on remote compute clouds.
Please experiment with the containerised scientific workflows and let us know what you think!
The REANA platform is in a pilot stage; the early feedback with real-life use cases is greatly appreciated.
Get in touch:
- Discuss on Forum
- Chat on Mattermost
- Follow us on Twitter
- Collaborate on GitHub
Thanks and here’s to many preproducible analyses ;)
Key Points
Experiment with containerised workflows to advance scientific reproducibility in your research