Chapter 9: Merge multiple inputs into a single result

At the conclusion of our pipeline, or at key points along the way, we might need a summary of our progress: data gathered from many files or disparate inputs and summarised in the output of a single job.

Ruffus uses the @merge decorator for this purpose.

Although @merge takes multiple inputs and produces a single output, Ruffus is again agnostic as to the sort of data contained within output. It can be a single (string) file name, or an arbitrarily complicated nested structure with numbers, objects etc. As always, strings contained within output (even within nested sequences) will be treated as file names for the purpose of checking whether the task is up-to-date.
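The idea that only the strings inside a nested output count as file names can be illustrated with a small stand-alone sketch. This is not Ruffus code: the helper collect_file_names and the name summary.log are hypothetical, and the sketch only assumes that nesting is done with lists and tuples.

```python
def collect_file_names(output):
    """Recursively gather every string found in a nested output structure.

    Hypothetical helper for illustration only; Ruffus's own logic may differ.
    """
    if isinstance(output, str):
        return [output]
    if isinstance(output, (list, tuple)):
        names = []
        for item in output:
            names.extend(collect_file_names(item))
        return names
    # Numbers, None and other objects are not treated as file names.
    return []


# A nested output mixing file names with other data.
nested_output = ["variance.result", [3, ("summary.log", 1.5)], None]
print(collect_file_names(nested_output))
```

Under these assumptions, only variance.result and summary.log would be candidates for an up-to-date check; the numbers and None are ignored.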

@merge

This example is borrowed from step 6 of the simple tutorial.

Combining partial solutions: Calculating variances

[Figure: step 6 of the simple tutorial pipeline (../../_images/simple_tutorial_step5_sans_key.png)]

We wanted to calculate the sample variance of a large list of random numbers. We have seen previously how we can split up this large problem into small pieces (using @split in Chapter 7), and work out the partial solutions for each sub-problem (calculating sums with @transform in Chapter 8).

All that remains is to join up the partial solutions from the different .sums files and turn these into the variance as follows:

variance = (sum_squared - sum * sum / N)/N

where N is the total number of values, summed over all chunks.

See the Wikipedia entry for a discussion of why this is a very naive approach!

To do this, we go through all the *.sums files and add up the sum, sum_squared and number of values from each chunk. We can then apply the above (naive) formula.
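As a worked example of combining partial results, the sketch below splits a small, made-up list of values into two chunks, computes the per-chunk sum, sum of squares and count, and then applies the naive formula to the totals. The variable names and the data are assumptions chosen for illustration.

```python
# Made-up data, split into two chunks as @split might have done.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
chunks = [values[:3], values[3:]]

# Partial solution for each chunk: (sum, sum of squares, number of values).
partials = [(sum(c), sum(x * x for x in c), len(c)) for c in chunks]

# Merge step: add up the partial results across all chunks.
total_sum = sum(p[0] for p in partials)
total_sum_squared = sum(p[1] for p in partials)
n = sum(p[2] for p in partials)

# Naive formula from the text.
variance = (total_sum_squared - total_sum * total_sum / n) / n
print(variance)  # prints 4.0
```

Because only three numbers per chunk are needed, the merge step never has to reread the original values.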

Merging files is straightforward in Ruffus:
@merge(step_5_calculate_sum_of_squares, "variance.result")
def step_6_calculate_variance (input_file_names, output_file_name):
    # Add together the sums and sums of squares from each input_file_name,
    # then calculate the variance and write it to output_file_name.
    pass

The @merge decorator tells Ruffus to take all the files from the step 5 task (i.e. *.sums), and produce a single merged file, variance.result.

Thus, if step_5_calculate_sum_of_squares created 1.sums and 2.sums, this would result in the following function call:

step_6_calculate_variance (["1.sums", "2.sums"], "variance.result")

The final result is, of course, in variance.result.
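One possible body for the merge function is sketched below. It assumes that each .sums file holds three whitespace-separated numbers in the order sum_squared, sum, count; the text does not specify this format, so adjust the parsing to whatever the step 5 task actually writes. The function is shown without the @merge decorator so that it runs without a pipeline, and the demo files are created in a temporary directory.

```python
import os
import tempfile


# In the pipeline this function would carry the decorator from the text:
#   @merge(step_5_calculate_sum_of_squares, "variance.result")
def step_6_calculate_variance(input_file_names, output_file_name):
    sum_squared, total, count = 0.0, 0.0, 0
    for input_file_name in input_file_names:
        # Assumed file format: "sum_squared sum count" (not from the source).
        with open(input_file_name) as sums_file:
            chunk_sum_squared, chunk_sum, chunk_count = sums_file.read().split()
        sum_squared += float(chunk_sum_squared)
        total += float(chunk_sum)
        count += int(chunk_count)

    # Naive variance formula from the text.
    variance = (sum_squared - total * total / count) / count
    with open(output_file_name, "w") as result_file:
        result_file.write("%s\n" % variance)


# Demo with two made-up chunk files, mimicking the call Ruffus would make.
with tempfile.TemporaryDirectory() as work_dir:
    first = os.path.join(work_dir, "1.sums")
    second = os.path.join(work_dir, "2.sums")
    result = os.path.join(work_dir, "variance.result")
    with open(first, "w") as f:
        f.write("36.0\n10.0\n3\n")     # chunk [2, 4, 4]
    with open(second, "w") as f:
        f.write("196.0\n30.0\n5\n")    # chunk [4, 5, 5, 7, 9]
    step_6_calculate_variance([first, second], result)
    with open(result) as f:
        result_text = f.read()

print(result_text.strip())  # prints 4.0
```

The function receives the full list of input file names in one call, which is what distinguishes a merge step from a per-file transform.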