Developing serial workflows

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How to write serial workflows?

What is declarative programming?

How to develop workflows progressively?

Can I temporarily override workflow parameters?

Do I always have to build new Docker image when my code changes?

Objectives

Understand pros/cons between imperative and declarative programming styles

Get familiar with serial workflow development practices

Understand run numbers of your analysis

See how you can run only parts of the workflow

See how you can repeat workflow to fix a failed step

Overview

We have seen how to use REANA client to run containerised analyses on the REANA cloud.

In this lesson we see more use cases suitable for developing serial workflows.

Imperative vs declarative programming

Imperative programming feels natural: use a library and just write code. Example: C.

for (int i = 0; i < sizeof(people) / sizeof(struct people); i++) {
  if (people[i].age < 20) {
    printf("%s\n", people[i].name)
  }
}

However, it has also its drawbacks. If you write scientific workflows imperatively and you need port the code to use GPUs, to run on different compute architectures, or to scale up, it may be necessary to do considerable code refactoring. This is not writing science code, but rather writing orchestration for the said science code onto different deployment scenarios.

Enter declarative programming that “expresses the logic of a computation without describing its control flow”. Example: SQL.

SELECT name FROM people WHERE age<20

The idea of declarative approach to scientific worflows is to express research as a series of data analysis steps and let an independent “orchestration tool” or a “workflow system” the task of running things properly on various deployment architectures.

This achieves better separation of concerns between physics code knowledge and computing orchestration glue code knowledge. However, the development may be felt less immediate. There are pros and cons. There is no silver bullet.

Imperative or declarative?

Imperative programming is about how you want to achieve something. Declarative programming is about what you want to achieve.

Developing workflows progressively

Developing workflows declaratively may feel less natural. How do we do that?

Start with earlier steps, run, debug, run, debug until satisfaction.

Continue with later steps only afterwords.

How to run only first step of our example workflow? Use TARGET step option:

$ reana-client run -w roofit -o TARGET=gendata

[INFO] Creating a workflow...
roofit.2
[INFO] Uploading files...
File code/gendata.C was successfully uploaded.
File code/fitdata.C was successfully uploaded.
[INFO] Starting workflow...
roofit.2 is running

After a minute, let us check the status:

$ reana-client status -w roofit

NAME     RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS     PROGRESS
roofit   2            2020-02-17T16:07:29   2020-02-17T16:07:33   2020-02-17T16:08:48   finished   1/1

and the workspace content:

$ reana-client ls -w roofit

NAME                SIZE     LAST-MODIFIED
code/gendata.C      1937     2020-02-17T16:07:30
code/fitdata.C      1648     2020-02-17T16:07:31
results/data.root   154458   2020-02-17T16:08:43

As we can see, the workflow run only the first command and the data.root file was well generated. The final fitting step was not run and the final plot was not produced.

Workflow runs

We have run the analysis example anew. Similar to Continuous Integration systems, the REANA platform runs each workflow in an independent workspace. To distinguish between various workflow runs of the same analysis, the REANA platform keeps an incremental “run number”. You can obtain the list of all your workflows by using the list command:

$ reana-client list

NAME     RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS
roofit   2            2020-02-17T16:07:29   2020-02-17T16:07:33   2020-02-17T16:08:48   finished
roofit   1            2020-02-17T16:01:45   2020-02-17T16:01:48   2020-02-17T16:02:50   finished

You can use myanalysis.myrunnumber to refer to a given run number of an analysis:

$ reana-client ls -w roofit.1
$ reana-client ls -w roofit.2

To quickly know the differences between various workflow runs, you can use the diff command:

$ reana-client diff roofit.1 roofit.2 --brief

No differences in reana specifications.

DIFFERENCES IN WORKSPACE LISTINGS:
Files roofit.1/results/data.root and roofit.2/results/data.root differ
Only in roofit.1/results: plot.png

Workflow parameters

Another useful technique when developing a workflow is to use smaller data samples until the workflow is debugged. For example, instead of generating 20000 events, we can generate only 1000. While you could achieve this by simply modifying the workflow definition, REANA offers an option to run parametrised workflows, meaning that you can pass the wanted value on the command line:

$ reana-client run -w roofit -p events=1000

[INFO] Creating a workflow...
roofit.3
[INFO] Uploading files...
File code/gendata.C was successfully uploaded.
File code/fitdata.C was successfully uploaded.
[INFO] Starting workflow...
roofit.3 is running

The generated ROOT file is much smaller:

$ reana-client ls -w roofit.1 | grep data.root

results/data.root   154457   2020-02-17T16:02:17

$ reana-client ls -w roofit.3 | grep data.root

results/data.root   19216   2020-02-17T16:18:45

and the plot much coarser:

$ reana-client download results/plot.png -w roofit.3

Developing further steps

Now that we are happy with the beginning of the workflow, how do we continue to develop the rest? Running a new workflow every time could be very time consuming; running skimming may require many more minutes than running statistical analysis.

In these situations, you can take advantage of the restart functionality. The REANA platform allows to restart a part of the workflow on the given workspace starting from the workflow step specified by the FROM option:

$ reana-client restart -w roofit.3 -o FROM=fitdata

roofit.3.1 has been queued

Note that the run number got an extra digit, meaning the number of restarts of the given workflow. The full semantics of REANA run numbers is myanalysis.myrunnumber.myrestartnumber.

Let us enquire about the status of the restarted workflow:

$ reana-client status -w roofit.3.1

NAME     RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS     PROGRESS
roofit   3.1          2020-02-17T16:26:09   2020-02-17T16:26:10   2020-02-17T16:27:24   finished   1/1

Looking at the number of steps of the 3.1 rerun, and looking at modification timestamps of the workspace files:

$ reana-client ls -w roofit.3.1

NAME                SIZE    LAST-MODIFIED
code/gendata.C      1937    2020-02-17T16:17:00
code/fitdata.C      1648    2020-02-17T16:17:01
results/plot.png    16754   2020-02-17T16:27:20
results/data.root   19216   2020-02-17T16:18:45

We can see that only the last step of the workflow was rerun, as wanted.

This technique is useful to debug later stages of the workflow without having to rerun the lengthy former stages of the workflow.

Exercise

Consider we would like to produce the final plot of the roofit example and change the title from “Fit example” to “RooFit example”. How do you do this in the most efficient way?

Solution

Amend fitdata.C, upload changed file to the workspace, and rerun the past successful workflow starting from the fitdata step:
$ reana-client list
$ vim code/fitdata.C # edit title printing statement
$ reana-client upload ./code/fitdata.C -w roofit.3
$ reana-client restart -w roofit.3 -o FROM=fitdata
$ reana-client list
$ reana-client status -w roofit.3.2
$ reana-client download -w roofit.3.2

Compile-time vs runtime code changes

Sometimes you have to build a new container image when code changes (e.g. C++ compilation). sometimes you don’t (e.g. Python code, ROOT macros). Use latter for more productivity when developing workflows.

Key Points

Develop workflows progressively; add steps as needed

When developing a workflow, stay on the same workspace

When developing a bytecode-interpreted code, stay on the same container

Use smaller test data before scaling out

Use workflows as Continuous Integration; make atomic commits that always work

previous episode

Reproducible analyses

next episode

Developing serial workflows

Overview

Overview

Imperative vs declarative programming

Imperative or declarative?

Developing workflows progressively

Workflow runs

Workflow parameters

Developing further steps

Exercise

Solution

Compile-time vs runtime code changes

Key Points

previous episode

next episode