Computational pipelines transform your data in stages until the final result is produced. Ruffus automates the plumbing in your pipeline: you supply the Python functions which perform the data transformations, and tell Ruffus how these pipeline stages, or task functions, are connected together.
Note
The best way to design a pipeline is to:
- write down the file names of the data as it flows through your pipeline
- write down the names of the functions which transform the data at each stage of the pipeline.
By letting Ruffus manage your pipeline parameters, you will get the following features for free:
- only out-of-date parts of the pipeline will be re-run
- multiple jobs can be run in parallel (on different processors if possible)
- pipeline stages can be chained together automatically
Let us start with the simplest possible pipeline: a single input data file transformed into a single output file. We will add some arbitrary extra parameters as well.
The @transform decorator tells Ruffus that this task function transforms each and every piece of input data into a corresponding output.
In other words, inputs and outputs have a 1 to 1 relationship.
Note
In the second part of the tutorial, we will encounter more decorators which can split up, join together or group inputs.
In other words, inputs and outputs can have many to one, many to many etc. relationships.
Let us provide inputs and outputs to our new pipeline:
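A sketch of the decorated task, matching the parameters described below, might look like this (the body does no real work yet; it simply creates an empty output file):

    from ruffus import *

    first_task_params = 'job1.input'

    # make sure the input file is there
    open('job1.input', "w")

    @transform(first_task_params,                       # input file name
               suffix(".input"),                        # suffix to match on the input
               ".output1",                              # replacement suffix for the output
               "some_extra.string.for_example", 14)     # two arbitrary extra parameters
    def first_task(input_file, output_file,
                   extra_parameter_str, extra_parameter_num):
        # create the (empty) output file
        open(output_file, "w")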
The @transform decorator tells Ruffus to generate the appropriate arguments for our python function:
- The input file name is as given: job1.input
- The output file name is the input file name with its suffix of .input replaced with .output1
- There are two extra parameters, a string and a number.
This is exactly equivalent to the following function call:
    first_task('job1.input', 'job1.output1', "some_extra.string.for_example", 14)

Even though this (empty) function doesn’t do anything just yet, the output from Ruffus pipeline_run will show that this part of the pipeline completed successfully:
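The output should look something like the following (the exact wording may vary slightly between Ruffus versions):

    >>> pipeline_run([first_task])
        Job = [job1.input -> job1.output1, some_extra.string.for_example, 14] completed
    Completed Task = first_task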
This may seem like a lot of effort and complication for something so simple: a normal python function call. However, now that we have annotated a task, we can start using it as part of our computational pipeline:
Each task function of the pipeline is a recipe or rule which can be applied repeatedly to our data.
For example, one can have
- a compile() task which will compile any number of source code files, or
- a count_lines() task which will count the number of lines in any file or
- an align_dna() task which will align the DNA of many chromosomes.
Note
Key Ruffus Terminology:
A task is an annotated python function which represents a recipe or stage of your pipeline.
A job is each time your recipe is applied to a piece of data, i.e. each time Ruffus calls your function.
Each task or pipeline recipe can thus have many jobs, each of which can work in parallel on different data.
In the original example, we have made a single output file by supplying a single input parameter. We shall use much the same syntax to apply the same recipe to multiple input files. Instead of providing a single input, and a single output, we are going to specify the parameters for three jobs at once:
    # previously,
    # first_task_params = 'job1.input'
    first_task_params = [
        'job1.input',
        'job2.input',
        'job3.input',
    ]

    # make sure the input files are there
    open('job1.input', "w")
    open('job2.input', "w")
    open('job3.input', "w")

    pipeline_run([first_task])

Just by changing the inputs from a single file to a list of three files, we now have a pipeline which runs independently on three pieces of data. The results should look familiar:
    >>> pipeline_run([first_task])
        Job = [job1.input -> job1.output1, some_extra.string.for_example, 14] completed
        Job = [job2.input -> job2.output1, some_extra.string.for_example, 14] completed
        Job = [job3.input -> job3.output1, some_extra.string.for_example, 14] completed
    Completed Task = first_task
Best of all, it is easy to add another step to our initial pipeline.
We have to
- add another @transform decorated function (second_task()),
- specify first_task() as the source,
- use a suffix which matches the output from first_task():
    @transform(first_task, suffix(".output1"), ".output2")
    def second_task(input_file, output_file):
        # make output file
        open(output_file, "w")
- call pipeline_run() with the correct final task (second_task())
The full source code can be found here.
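Putting the pieces together, a minimal sketch of the complete two-stage pipeline (assembled from the snippets above) could read:

    from ruffus import *

    first_task_params = ['job1.input', 'job2.input', 'job3.input']

    # make sure the input files are there
    for input_file in first_task_params:
        open(input_file, "w")

    @transform(first_task_params, suffix(".input"), ".output1",
               "some_extra.string.for_example", 14)
    def first_task(input_file, output_file,
                   extra_parameter_str, extra_parameter_num):
        # make output file
        open(output_file, "w")

    @transform(first_task, suffix(".output1"), ".output2")
    def second_task(input_file, output_file):
        # make output file
        open(output_file, "w")

    pipeline_run([second_task])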
With very little effort, we now have three independent pieces of information coursing through our pipeline. Because second_task() transforms the output from first_task(), it magically knows its dependencies and that it too has to work on three jobs.
Though three jobs have been specified, Ruffus defaults to running them one after another. With modern CPUs, it is often a lot faster to run parts of your pipeline in parallel, all at the same time.
To do this, all you have to do is to add a multiprocess parameter to pipeline_run:
    >>> pipeline_run([second_task], multiprocess = 5)

In this case, Ruffus will try to run up to 5 jobs at the same time. Since our second task only has three jobs, these will be started simultaneously.
A job will be run only if the output file timestamps are out of date. If you ran the same code a second time,
    >>> pipeline_run([second_task])
nothing would happen, because:
- job1.output2 is more recent than job1.output1, and
- job2.output2 is more recent than job2.output1, and
- job3.output2 is more recent than job3.output1.
Let us see what happens when just 1 out of 3 pieces of data is modified:
open("job1.input1", "w") pipeline_run([second_task], verbose =2, multiprocess = 5)You would see that only the out of date jobs (highlighted) have been re-run:
    >>> pipeline_run([second_task], verbose = 2, multiprocess = 5)
        Job = [job1.input -> job1.output1, some_extra.string.for_example, 14] completed
        Job = [job3.input -> job3.output1, some_extra.string.for_example, 14] unnecessary: already up to date
        Job = [job2.input -> job2.output1, some_extra.string.for_example, 14] unnecessary: already up to date
    Completed Task = first_task
        Job = [job1.output1 -> job1.output2] completed
        Job = [job2.output1 -> job2.output2] unnecessary: already up to date
        Job = [job3.output1 -> job3.output2] unnecessary: already up to date
    Completed Task = second_task
In the above examples, the input and output parameters are file names. Ruffus was designed for pipelines which save intermediate data in files. This is not compulsory but saving your data in files at each step provides a few advantages:
- Ruffus can use file system time stamps to check if your pipeline is up to date
- Your data is persistent across runs
- This is a good way to pass large amounts of data across processes and computational nodes
Otherwise, task parameters could be all sorts of data, from lists of files, to numbers, sets or tuples. Ruffus imposes few constraints on what you would like to send to each stage of your pipeline.
Ruffus does, however, assume that all strings in your input and output parameters represent file names.
Input parameters which contain a glob pattern (e.g. *.txt) are expanded to the matching file names.
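For example (a hypothetical sketch; the file names are assumptions), a glob can stand in for an explicit list of input files:

    # hypothetical sketch: "*.input" is expanded to whichever matching
    # files exist in the working directory when the pipeline runs
    @transform("*.input", suffix(".input"), ".output1")
    def first_task(input_file, output_file):
        # create the corresponding output file
        open(output_file, "w")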
@transform is a 1:1 operation because it keeps the number of jobs constant entering and leaving the task. Each job can accept, for example, a pair of files as its input, or generate more than one output file.
Let us see this in action using the previous example:
- first_task_params is changed to 3 pairs of file names
- @transform for first_task is modified to produce pairs of file names:
- .output.1
- .output.extra.1
    from ruffus import *

    #---------------------------------------------------------------
    #   Create pairs of input files
    #
    first_task_params = [
        ['job1.a.input', 'job1.b.input'],
        ['job2.a.input', 'job2.b.input'],
        ['job3.a.input', 'job3.b.input'],
    ]

    for input_file_pairs in first_task_params:
        for input_file in input_file_pairs:
            open(input_file, "w")

    #---------------------------------------------------------------
    #
    #   first task
    #
    @transform(first_task_params, suffix(".input"),
               [".output.1", ".output.extra.1"],
               "some_extra.string.for_example", 14)
    def first_task(input_files, output_file_pairs,
                   extra_parameter_str, extra_parameter_num):
        # make both output files of the pair
        for output_file in output_file_pairs:
            open(output_file, "w")

    #---------------------------------------------------------------
    #
    #   second task
    #
    @transform(first_task, suffix(".output.1"), ".output2")
    def second_task(input_files, output_file):
        # make output file
        open(output_file, "w")

    #---------------------------------------------------------------
    #
    #   Run
    #
    pipeline_run([second_task])

This gives the following results:
    >>> pipeline_run([second_task])

We see that apart from having a file pair where previously there was a single file, little else has changed. We still have three pieces of data going through the pipeline in three parallel jobs.