Chapter 4: Chaining pipeline Tasks together automatically

In the previous chapter, we explained that Ruffus determines the data flow through your pipeline by calling your task functions (normal Python functions written by you) with the right parameters at the right time, making sure that

  1. only out-of-date parts of the pipeline will be re-run
  2. multiple jobs can be run in parallel (on different processors if possible)
  3. pipeline stages can be chained together automatically

This chapter is devoted to the last item: how the output of one stage of the pipeline is piped in as the input of the next stage.

Tasks in the inputs parameters: Implicit dependencies

Ruffus treats the first two parameters of each job in each task as the inputs and outputs parameters respectively. If the inputs parameter contains strings, these will be treated as the names of files required by that job.

If the inputs parameter contains any tasks, Ruffus will take the output from these specified tasks as part of the current inputs parameter. In addition, such tasks will be listed as prerequisites, much as if you had included them in a separate @follows decorator.

For example, suppose we wanted to take the output files from task1 and feed them automatically to task2. We might write the following code:

task1_output_files = ("task1.output_a", "task1.output_b", "task1.output_c")

@follows(task1)
@files(task1_output_files, "task2.output")
def task2(input, output):
    pass

This can be replaced by the much more concise syntax:

@files(task1, "task2.output")
def task2(input, output):
    pass

This means:
  • Take the output from task1, and feed it automatically into task2.
  • Also make sure that task1 becomes a prerequisite of task2, so that task1 is brought up to date first.

In other words, task1 and task2 have been chained together automatically. This is both a great convenience and makes the flow of data through a pipeline much clearer.
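The idea can be illustrated with a small toy model. This is a sketch only, not how Ruffus is implemented: toy_files, task_outputs, prerequisites and resolved_inputs are all hypothetical names invented for this illustration. It shows a decorator that, on finding a task function among its inputs, substitutes that task's declared outputs and records the dependency.

```python
# Toy model only: NOT Ruffus internals. Every name here is hypothetical.
task_outputs = {}     # task function -> list of output file names it declares
prerequisites = {}    # task function -> set of tasks it depends on


def toy_files(inputs, outputs):
    """Hypothetical stand-in for a decorator such as @files."""
    def decorator(func):
        resolved = []
        deps = set()
        items = inputs if isinstance(inputs, (list, tuple)) else [inputs]
        for item in items:
            if callable(item):
                # A task among the inputs: use its outputs, record the dependency.
                resolved.extend(task_outputs[item])
                deps.add(item)
            else:
                resolved.append(item)
        task_outputs[func] = [outputs] if isinstance(outputs, str) else list(outputs)
        prerequisites[func] = deps
        func.resolved_inputs = resolved
        return func
    return decorator


@toy_files([], ["task1.output_a", "task1.output_b", "task1.output_c"])
def task1(input, output):
    pass


@toy_files(task1, "task2.output")
def task2(input, output):
    pass


print(task2.resolved_inputs)
print(prerequisites[task2] == {task1})
```

In this toy model, a single mention of task1 supplies both the input file names and the ordering constraint, which is the convenience the concise syntax offers.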

Referring to tasks by name in the inputs parameters

Chapter 1 explains that task functions can be defined in any order so long as undefined tasks are referred to by their (fully qualified if necessary) function name string.

You can similarly refer to tasks in the inputs parameter by name, as a text string. Normally Ruffus assumes that strings are file names. To indicate that you are referring to task function names instead, you need to wrap the relevant parameter, or the relevant (nested) parts of the parameter, with the indicator object output_from("task_name"). Thus,

@split(["a.file", ("b.file", output_from("task1", 76, "task2"))], "*.split")
def task3(input, output):
    pass

is equivalent to:

@split(["a.file", ("b.file", (task1, 76, task2))], "*.split")
def task3(input, output):
    pass
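Why a wrapper is needed at all can be seen in a small sketch. The code below is an illustration under stated assumptions, not the Ruffus implementation: task_name_marker and classify are hypothetical names. A plain string is always read as a file name, so only strings inside the marker object are read as task names, while non-string values pass through untouched.

```python
class task_name_marker:
    """Hypothetical marker, playing the role that output_from plays in the text."""

    def __init__(self, *names):
        self.names = names


def classify(param):
    """Label every leaf of a nested parameter as a file, a task name or other data."""
    if isinstance(param, task_name_marker):
        return [("task", n) if isinstance(n, str) else ("other", n)
                for n in param.names]
    if isinstance(param, str):
        return [("file", param)]
    if isinstance(param, (list, tuple)):
        labelled = []
        for item in param:
            labelled.extend(classify(item))
        return labelled
    return [("other", param)]


labelled = classify(["a.file", ("b.file", task_name_marker("task1", 76, "task2"))])
for kind, value in labelled:
    print(kind, value)
```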

Globs in the inputs parameters

As a syntactic convenience, Ruffus also allows you to specify a glob pattern (e.g. *.txt) in the inputs parameter, which will be expanded automatically to the matching file names. This applies to any strings within inputs which contain any of the characters *?[].
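The rule can be approximated with the Python standard library. The sketch below is an assumption about the general approach, not Ruffus code: is_glob and expand are hypothetical helpers, and the files are created in a temporary directory purely for the demonstration. A string containing any of the characters *?[] is handed to glob.glob, and any other string is kept as a literal file name.

```python
import glob
import os
import re
import tempfile

# The four characters named in the text above.
GLOB_CHARS = re.compile(r"[*?\[\]]")


def is_glob(name):
    """True if the string contains any glob character."""
    return bool(GLOB_CHARS.search(name))


def expand(name):
    """Expand a glob pattern to sorted matches; keep any other string as is."""
    return sorted(glob.glob(name)) if is_glob(name) else [name]


with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty files to match against.
    for fname in ("a.txt", "b.txt", "c.log"):
        open(os.path.join(tmp, fname), "w").close()
    matched = [os.path.basename(p) for p in expand(os.path.join(tmp, "*.txt"))]
    literal = expand("plain.file")

print(matched)
print(literal)
```

Note that a literal name is kept even when no such file exists, whereas a pattern with no matches expands to nothing.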

Mixing globs, tasks and files as inputs

Ruffus is very flexible in allowing you to mix glob patterns, references to tasks and file names in the data structures you pass as the inputs parameters.

Suppose, in the previous example,

  • that task1 produces the files
    "task1.output_a"
    "task1.output_b"
    "task1.output_c"
    
  • that the following additional files are also present
    "extra.a"
    "extra.c"
    

Then,

@files(["1_more.file", "2_more.file", task1, "extra.*"], "task2.output")
def task2(input, output):
    pass

would result in the combination of the specified file names, the results from the previous task, and the expansion of the glob:

input == [
            "1_more.file"   ,           # specified file
            "2_more.file"   ,           # specified file
            "task1.output_a",           # from previous task
            "task1.output_b",           # from previous task
            "task1.output_c",           # from previous task
            "extra.a"       ,           # from glob expansion
            "extra.c"       ,           # from glob expansion
         ]

In other words, glob patterns and tasks are expanded “in place” when they are part of Python lists, sets, or tuples.
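The "in place" behaviour for a list can be mimicked in a few lines. This is a sketch under assumptions, not Ruffus internals: TASK1 is a hypothetical placeholder standing in for the task1 function, and files_on_disk is an assumed directory listing taken from the example above. Each element is replaced at its own position, so the order of the original list is preserved.

```python
import fnmatch

# Assumed state, copied from the example in the text.
task1_outputs = ["task1.output_a", "task1.output_b", "task1.output_c"]
files_on_disk = ["extra.a", "extra.c", "1_more.file", "2_more.file"]
TASK1 = object()  # hypothetical placeholder standing in for the task1 function


def expand_in_place(inputs):
    """Replace task references and glob patterns at the position where they occur."""
    expanded = []
    for item in inputs:
        if item is TASK1:
            expanded.extend(task1_outputs)
        elif isinstance(item, str) and any(c in item for c in "*?[]"):
            expanded.extend(sorted(fnmatch.filter(files_on_disk, item)))
        else:
            expanded.append(item)
    return expanded


result = expand_in_place(["1_more.file", "2_more.file", TASK1, "extra.*"])
print(result)
```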

Appending globs or tasks to pre-existing lists, sets or tuples

Sometimes we want the inputs parameter to be a combination of globs and tasks together with an existing list of file names.

To elaborate on the above example, suppose we have a list of files:

file_list = [   "1_more.file",
                "2_more.file"]

Now we want the input to task2 to be:

file_list + task1 + "extra.*"

The closest we can come to expressing this in Python syntax is to wrap task1 and the glob in lists first, then add everything together:

@files(file_list + [task1] + ["extra.*"], "task2.output")
def task2(input, output):
    pass

The same also works with tuples:

file_list = (   "1_more.file",
                "2_more.file")

@files(file_list + (task1, "extra.*"), "task2.output")
def task2(input, output):
    pass

and with sets (using the set union operator):

file_list = set([   "1_more.file",
                    "2_more.file"])

@files(file_list | set([task1, "extra.*"]), "task2.output")
def task2(input, output):
    pass
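The wrapping is required by Python itself rather than by Ruffus, as the plain Python check below shows. Here task1 is only a stand-in function defined for the illustration and no pipeline is involved. Concatenating a list with a bare function raises TypeError, while the list, tuple and set forms all build the intended collection.

```python
def task1(input, output):  # stand-in task function, for illustration only
    pass


file_list = ["1_more.file", "2_more.file"]

# Adding a bare function to a list is not valid Python.
try:
    file_list + task1
    failed = False
except TypeError:
    failed = True

# Wrapping the extra items in the same container type makes each form valid.
as_list = file_list + [task1] + ["extra.*"]
as_tuple = tuple(file_list) + (task1, "extra.*")
as_set = set(file_list) | {task1, "extra.*"}

print(failed, len(as_list), len(as_tuple), len(as_set))
```

Unlike lists and tuples, a set has no defined order, so the set form gives no control over the order of the combined inputs.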

Understanding complex inputs and outputs parameters

In all cases, Ruffus tries to do the right thing, and to make the simple or obvious case require the simplest, least onerous syntax.

If Ruffus does not behave the way you expect, please write to the authors: it may be a bug!

In all other cases, the best thing to do is to write your Ruffus specifications and check the results of pipeline_printout to make sure that the parameters sent to your pipelined tasks reflect what you intended.

In other words, read the next chapter!