Step 3: Understanding how your pipeline works

Note

Remember to look at the example code.

The trickiest part of developing pipelines is understanding how your data flows through the pipeline.

Parameters and files are passed from one task to another down the chain of pipelined functions.

Whether you are learning how to use ruffus, trying out a new ruffus feature, or debugging a horrendously complicated pipeline (we have colleagues with >100 criss-crossing pipelined stages), your best friend is pipeline_printout(...).

Printing out which jobs will be run

pipeline_printout(...) takes the same parameters as pipeline_run(...) but, instead of running the pipeline, prints which tasks are and are not up-to-date.

The verbose parameter controls how much detail is displayed.

Let us take the two-step pipelined code we have previously written, but call pipeline_printout(...) instead of pipeline_run(...). This lists the tasks which will be run in the pipeline:

[Screenshot: pipeline_printout output listing the tasks which will be run]

To see the input and output parameters of each job in the pipeline, we can increase the verbosity from the default (1) to 3:

[Screenshot: pipeline_printout output at verbose = 3, showing the input and output parameters of each job]
This is very useful for checking that the input and output parameters have been specified
correctly.

Determining which jobs are out of date

It is often useful to see which tasks are or are not up-to-date. For example, if we were to run the pipeline in full, and then modify one of the intermediate files, the pipeline would be partially out of date.

Let us start by running the pipeline in full, but then modify job1.stage1 so that the second task is no longer up-to-date:

# run all pipelined tasks through to second_task
pipeline_run([second_task])

# "modify" job1.stage1 by truncating it, which updates its timestamp
open("job1.stage1", "w").close()

At a verbosity of 5, even jobs which are up-to-date will be displayed. We can now see that there is only one job in second_task(...) which needs to be re-run, because job1.stage1 has been modified after job1.stage2 (highlighted in blue):

[Screenshot: pipeline_printout output at verbose = 5, with the out-of-date job in second_task highlighted in blue]
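Up-to-dateness here comes down to comparing file modification times: a job is out of date when its output is missing or older than its input. A minimal stdlib sketch of that rule (the helper name needs_rerun is hypothetical, not part of the ruffus API):

```python
import os
import tempfile
import time

def needs_rerun(input_file, output_file):
    """Out of date if the output is missing or older than its input."""
    if not os.path.exists(output_file):
        return True
    return os.path.getmtime(input_file) > os.path.getmtime(output_file)

work_dir = tempfile.mkdtemp()
stage1 = os.path.join(work_dir, "job1.stage1")
stage2 = os.path.join(work_dir, "job1.stage2")

# Simulate a completed run: stage2 is written after stage1,
# so the second step counts as up to date.
open(stage1, "w").close()
open(stage2, "w").close()
print(needs_rerun(stage1, stage2))

# Modifying stage1 afterwards makes the second step out of date again.
# (Sleep past filesystem timestamp granularity so the new mtime is larger.)
time.sleep(1.1)
open(stage1, "w").close()
print(needs_rerun(stage1, stage2))
```

This is only the timestamp comparison; ruffus applies it per job across the whole chain of tasks, which is why touching job1.stage1 flags just the one job in second_task(...) for re-running.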