The second half of this tutorial is a worked example: calculating the sample variance of 100,000 random numbers. This is similar to many computational projects: we tackle a big problem by splitting it up into many small problems solved in parallel, and then merge our piecemeal solutions into the final answer. These embarrassingly parallel problems motivated the original design of Ruffus, which has three dedicated decorators to handle them with ease:
- @split to break up the big problem
- @transform to solve the parts in parallel
- @merge to merge our piecemeal solutions into the final answer.
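Before meeting the decorators, it helps to see why the variance calculation decomposes so cleanly. Each parcel of numbers only has to report its count, sum, and sum of squares; those summaries merge into the overall sample variance. The sketch below shows this split/solve/merge pattern in plain Python (the function names are illustrative, not part of Ruffus):

```python
import random

def split_into_chunks(numbers, chunk_size):
    """Break the big list into parcels that could be processed in parallel."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

def summarise_chunk(chunk):
    """Per-chunk work: only the count, sum and sum of squares are needed."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def merge_summaries(summaries):
    """Combine the piecemeal summaries into the sample variance:
    s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    return (total_sq - total * total / n) / (n - 1)

random.seed(0)
numbers = [random.random() * 100.0 for _ in range(100000)]
chunks = split_into_chunks(numbers, 1000)
variance = merge_summaries([summarise_chunk(c) for c in chunks])
```

Because only three numbers per chunk need merging, the chunks can be summarised in any order, on any machine, which is exactly the property the Ruffus decorators exploit.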
*(figure: Step 5 of the pipeline)*
Suppose we had a list of 100,000 random numbers in the file random_numbers.list:
```python
import random

NUMBER_OF_RANDOMS = 100000

f = open('random_numbers.list', 'w')
for i in range(NUMBER_OF_RANDOMS):
    f.write('%g\n' % (random.random() * 100.0))
f.close()
```

We might want to calculate the sample variance more quickly by splitting the numbers into N parcels of 1000 each and working on them in parallel. In this case we know that N == 100, but usually the number of resulting files only becomes apparent after we have finished processing our starting file.
Our pipeline function needs to take the random numbers file random_numbers.list, read the random numbers from it, and write to a new file every 1000 lines.
The Ruffus decorator @split is designed specifically for splitting up input into an indeterminate number N of output files:
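A sketch of what the decorated task might look like is given below. The function name and the call signature come from later in this tutorial; the decorator arguments and the body are an illustration of the splitting logic described above, not Ruffus's own reference code (in the real pipeline the commented-out @split line would be active):

```python
import glob
import os

# In the pipeline this function would be decorated with:
#     @split("random_numbers.list", "*.chunks")
# so that Ruffus supplies the input file name and the list of files
# matching "*.chunks" as the two arguments.
def step_5_split_numbers_into_chunks(input_file_name, output_files):
    """Split the random numbers file into files of 1000 numbers each."""
    # Clean up detritus from any previous run.
    for old_file in glob.glob("*.chunks"):
        os.unlink(old_file)
    out = None
    with open(input_file_name) as input_file:
        for count, line in enumerate(input_file):
            if count % 1000 == 0:          # start a new chunk file
                if out:
                    out.close()
                out = open("%d.chunks" % (count // 1000 + 1), "w")
            out.write(line)
    if out:
        out.close()
```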
Ruffus will set
- input_file_name to "random_numbers.list"
- output_files to all files which match *.chunks (i.e. "1.chunks", "2.chunks", etc.)

The first time you run this function, *.chunks will return an empty list because no .chunks files have been created yet, resulting in the following call:
```python
step_5_split_numbers_into_chunks("random_numbers.list", [])
```

After that, *.chunks will match the list of .chunks files created by the previous pipeline run. Some of these files will be out of date or superfluous. These file names are usually only useful for removing detritus from previous runs (have a look at step_5_split_numbers_into_chunks(...)).
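The first-run versus later-run behaviour of the *.chunks pattern can be checked directly with Python's glob module (the temporary working directory below is just for illustration):

```python
import glob
import os
import tempfile

os.chdir(tempfile.mkdtemp())          # start in an empty working directory

# First run: no .chunks files exist yet, so the pattern matches nothing
# and output_files would be the empty list.
first_run_matches = glob.glob("*.chunks")

# Simulate a previous run having created some chunk files.
for name in ("1.chunks", "2.chunks"):
    open(name, "w").close()

# Later runs: the pattern now matches the leftovers from the previous run.
later_run_matches = sorted(glob.glob("*.chunks"))
```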
Note
The great value of correctly specifying the list of output files will become apparent in the next step of this tutorial, when we shall see how pipeline tasks can be “chained” together conveniently.
Remember to specify glob patterns which match all the files you are splitting up. You can cover different directories or groups of file names by using a list of globs, e.g.:
```python
@split("input.file", ['a*.bits', 'b*.pieces', 'somewhere_else/c*.stuff'])
def split_function(input_filename, output_files):
    "Code to split up 'input.file'"