The Python functions which do the actual work of each stage or task of a Ruffus pipeline are written by you. The role of Ruffus is to make sure these functions are called in the right order, with the right parameters, running in parallel (using multiprocessing) if desired. Ruffus manages the data flowing through your pipeline by supplying the correct parameters to your pipeline functions. In this way, you get the following features for free:
- only out-of-date parts of the pipeline will be re-run
- multiple jobs can be run in parallel (on different processors if possible)
- pipeline stages can be chained together automatically
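For example, a minimal two-stage pipeline might look something like the sketch below (the file names and task bodies are invented for illustration; the decorators used here are explained in later chapters):

    from ruffus import *

    # hypothetical starting files, created here so that the example runs as-is
    starting_files = ["sample1.input", "sample2.input"]
    for file_name in starting_files:
        open(file_name, "w").close()

    @transform(starting_files, suffix(".input"), ".processed")
    def process(input_file, output_file):
        # your own code does the real work; Ruffus only supplies the parameters
        open(output_file, "w").write("processed " + input_file)

    @transform(process, suffix(".processed"), ".summary")
    def summarise(input_file, output_file):
        open(output_file, "w").write("summary of " + input_file)

    # run everything needed to bring summarise() up to date,
    # with up to 4 jobs in parallel
    pipeline_run([summarise], multiprocess = 4)

Only the out-of-date jobs are re-run on subsequent invocations, and the chaining from process() to summarise() is handled automatically.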
Much of the functionality of Ruffus involves determining the data flow through your pipeline, by governing how the output of one stage of the pipeline is supplied as parameters to the functions of the next.
Very often it will be necessary to re-run a computational pipeline because part of the data has changed. Ruffus will run only those stages of the pipeline which are absolutely necessary.
By default, Ruffus uses file modification times to determine which parts of the pipeline are out of date, and which tasks need to be run again. This is so convenient that even if a pipeline is not file-based (if, for example, it uses database tables instead), it may be worthwhile to use dummy, “sentinel” files to manage the stages of a pipeline.
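For example, a stage which loads data into a database table might be bracketed by an invented sentinel file like this sketch:

    from ruffus import *

    # "load_data.sentinel" is a hypothetical sentinel file standing in for
    # work which happens inside a database rather than in files on disk
    @files("raw_data.txt", "load_data.sentinel")
    def load_into_database(input_file, sentinel_file):
        # ... load the contents of input_file into a database table here ...
        # touch the sentinel file so that Ruffus knows this stage is up to date
        open(sentinel_file, "w").close()

Ruffus then compares the modification times of "raw_data.txt" and "load_data.sentinel" as usual to decide whether the stage needs to be re-run.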
(It is also possible, as we shall see later, to add custom functions to determine which parts of the pipeline are out of date. See @parallel and @check_if_uptodate.)
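As a foretaste, a custom up-to-date check follows the pattern sketched below: the checking function receives the same parameters as the job and returns whether the job needs to run, together with an explanatory message.

    import os
    from ruffus import *

    def check_file_exists(input_file, output_file):
        # re-run the job only when the output file is missing
        if not os.path.exists(output_file):
            return True, "Missing file %s" % output_file
        return False, "File %s exists" % output_file

    @parallel([[None, "a.1"]])
    @check_if_uptodate(check_file_exists)
    def create_if_necessary(input_file, output_file):
        open(output_file, "w").close()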
Ruffus treats the first two parameters of each job in each task as the inputs and outputs parameters respectively. If these parameters are strings, or are sequences which contain strings, the strings will be treated as the names of files required by and produced by that job. The presence and modification times of these input and output files will be used to check whether it is necessary to rerun the job.
Apart from this, Ruffus imposes no other restrictions on the parameters for jobs, which are passed verbatim to task functions.
Most of the time, it is sensible to stick with file names (strings) in the inputs and outputs parameters, but Ruffus does not try to second-guess what sort of data you will be passing through your pipelines (beyond assuming that strings represent file names).
Thus, given the following over-elaborate parameters (parameter passing will be discussed in more detail from |manual.files.chapter_num|):
    [ [[1, 3], "afile.name", ("bfile.name", 72)],
      [[56, 3.3], set([custom_object(), "output.file"])],
      33.3,
      "oops"]

This will be passed “as is” to your task function:

    do_something([[1, 3], "afile.name", ("bfile.name", 72)],          # input
                 [[56, 3.3], set([custom_object(), "output.file"])],  # output
                 33.3,                                                 # extra parameter
                 "oops")                                               # extra parameter

Ruffus will interpret this as:

    Input_parameter   = [[1, 3], "afile.name", ("bfile.name", 72)]
    Output_parameter  = [[56, 3.3], set([custom_object(), "output.file"])]
    Other_parameter_1 = 33.3
    Other_parameter_2 = "oops"

Ruffus disregards the structure of your data, identifying only the (nested) strings. Thus there are 2 input files:

    "afile.name"
    "bfile.name"

and 1 output file:

    "output.file"
The following simple rules are used by Ruffus.
The pipeline stage will be rerun if:
- any of the input files are newer than the output files
- any of the output files are missing
In addition, it is possible to run jobs which create files from scratch.
- If no input file names are supplied, the job will run only if any output file is missing.
Finally, if no output file names are supplied, the job will always run.
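For example, a job with no input files (here the first parameter is None) will be re-run only when its output file is missing:

    from ruffus import *

    # no input files: "a.1" is (re-)created only when it is missing
    @files(None, "a.1")
    def create_from_scratch(no_input_file, output_file):
        open(output_file, "w").close()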
The example in the next chapter shows how this works in practice.
If the input files for a job are missing, the task function will have no way to produce its output. In this case, a MissingInputFileError exception will be raised automatically. For example:
    task.MissingInputFileError: No way to run job: Input file ['a.1'] does not exist for Job = ["a.1" -> "a.2", "A file"]
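The exception above would be produced by a job along the lines of the following sketch, if "a.1" did not exist when the pipeline was run:

    from ruffus import *

    @files("a.1", "a.2", "A file")
    def transform_file(input_file, output_file, extra_parameter):
        open(output_file, "w").write(open(input_file).read())

    # raises task.MissingInputFileError if "a.1" does not exist
    pipeline_run([transform_file])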
Note that modification times have a precision of one second under some older file systems (e.g. ext2/ext3). This may also be true for networked file systems.

Ruffus is very conservative: it assumes that files with exactly the same date stamp might have been created in the wrong order, and will treat the job as out of date. This can result in some jobs re-running unnecessarily, simply because an underlying coarse-grained file system does not distinguish between successively created files with sufficient accuracy.

To get around this, Ruffus makes sure that each task is punctuated by a 1 second pause (via time.sleep()). If this gets in the way, and you are using a modern file system with nanosecond timestamp resolution, you can turn off the delay by setting one_second_per_job to False in pipeline_run.
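For example (final_task here stands for whichever of your tasks you want to run up to):

    # assuming a modern file system with sub-second timestamp resolution
    pipeline_run([final_task], one_second_per_job = False)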
Later versions of Ruffus will allow file modification times to be saved at higher precision in a log file or database to get around this.