Step 1: An introduction to Ruffus pipelines

Overview

[Figure: Starting Data -> task_1() -> Intermediate Data 1 -> task_2() -> Intermediate Data 2 -> task_3() -> Final Result]

Computational pipelines transform your data in stages until the final result is produced. One easy way to understand pipelines is to imagine your data flowing through a series of pipes until it reaches its final destination. Even quite complicated processes can be simplified if we break them down into simple stages. Of course, it helps if we can visualise the whole process.
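As a rough illustration of data flowing through stages, here is a plain Python sketch with no Ruffus involved. The names task_1, task_2 and task_3 echo the overview diagram; the function bodies and the sample data are invented purely for illustration.

```python
# Plain Python, no Ruffus: each stage consumes the previous stage's output.
# The stage bodies and the sample data are hypothetical placeholders.

def task_1(starting_data):
    # Stage 1: strip surrounding whitespace from every record
    return [record.strip() for record in starting_data]

def task_2(intermediate_data_1):
    # Stage 2: drop records that are now empty
    return [record for record in intermediate_data_1 if record]

def task_3(intermediate_data_2):
    # Stage 3: summarise by counting the surviving records
    return len(intermediate_data_2)

starting_data = [" a ", "", "b", "  "]

# The stages are chained by hand here; this is the plumbing that
# a pipeline library is meant to take over.
final_result = task_3(task_2(task_1(starting_data)))
print(final_result)
```

With only three stages, chaining by hand is manageable. It becomes harder to maintain as stages are added, reordered, or need to be skipped when their results are already up to date.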

Ruffus is a way of automating the plumbing in your pipeline: you supply the Python functions which perform the data transformation, and tell Ruffus how these pipeline task functions are connected up. Ruffus will make sure that the right data flows down your pipeline in the right way at the right time.

Note

Ruffus refers to each stage of your pipeline as a task.

A gentle introduction to Ruffus syntax

Let us start with the usual “Hello World” programme.
We have the following two python functions which we would like to turn into an automatic pipeline:
def first_task():
    print("Hello ")

def second_task():
    print("world")

The simplest Ruffus pipeline would look like this:

from ruffus import *                # 1. Import Ruffus

def first_task():                   # Your code, which does the actual work
    print("Hello ")

@follows(first_task)                # 2. Decorate pipeline functions
def second_task():                  # Your code, which does the actual work
    print("world")

pipeline_run([second_task])         # 3. Run pipeline

The functions which do the actual work of each stage of the pipeline remain unchanged. The role of Ruffus is to make sure these functions are called in the right order, with the right parameters, running in parallel using multiprocessing if desired.

There are three simple parts to building a Ruffus pipeline:

  1. Importing ruffus
  2. “Decorating” functions which are part of the pipeline
  3. Running the pipeline!

“Decorators”

You need to tag, or decorate, existing functions to tell Ruffus that they are part of the pipeline.

Note

Python decorators are ways to tag or mark out functions.

They start with an @ prefix and take a number of parameters in parentheses.

@follows(first_task)        # Decorator
def second_task():          # Normal Python function
    ""

The ruffus decorator @follows makes sure that second_task follows first_task.

Multiple decorators can be used for each task function to add functionality to Ruffus pipeline functions.
However, the decorated python functions can still be called normally, outside of Ruffus.
Ruffus decorators can be added to (stacked on top of) any function in any order.
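To make the idea of tagging concrete, here is a minimal sketch of a decorator that only records information on a function and hands the function back unchanged. It is a toy stand-in, not Ruffus's implementation: toy_follows, other_task and the _toy_upstream attribute are hypothetical names made up for this example.

```python
# A toy tagging decorator; NOT how Ruffus implements @follows.
# toy_follows, other_task and _toy_upstream are hypothetical names.

def toy_follows(*upstream_tasks):
    """Return a decorator that records which tasks must run first."""
    def decorator(func):
        # Accumulate, so that stacked decorators combine rather than overwrite
        already_recorded = getattr(func, "_toy_upstream", ())
        func._toy_upstream = tuple(upstream_tasks) + already_recorded
        return func  # the original function is returned unchanged
    return decorator

def first_task():
    return "Hello "

def other_task():
    return "there "

# Two decorators stacked on top of the same function
@toy_follows(first_task)
@toy_follows(other_task)
def second_task():
    return "world"

# The decorated function can still be called like any other function
greeting = first_task() + second_task()
upstream_names = [task.__name__ for task in second_task._toy_upstream]
print(greeting)
print(upstream_names)
```

Because the toy decorator returns the very function it was given, decorating has no effect on what the function does when called directly; it only attaches extra information that a pipeline runner could read later.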

Running the pipeline

We run the pipeline by specifying the last stage (task function) of your pipeline. Ruffus will know what other functions this depends on, following the appropriate chain of dependencies automatically, making sure that the entire pipeline is up-to-date.

Because second_task depends on first_task, both functions are executed in order.
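The idea of following a chain of dependencies back from the last task can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not Ruffus's actual algorithm: toy_follows, toy_run and _toy_upstream are hypothetical names, and the sketch ignores up-to-date checking, parameters and parallel execution.

```python
# A toy dependency resolver; NOT how Ruffus is implemented.
# toy_follows, toy_run and _toy_upstream are hypothetical names.

def toy_follows(*upstream_tasks):
    """Record the tasks that must run before the decorated function."""
    def decorator(func):
        func._toy_upstream = tuple(upstream_tasks)
        return func
    return decorator

def toy_run(target_tasks):
    """Run each target after everything it depends on, each task only once."""
    completed = []

    def visit(task):
        if task in completed:
            return
        # Run the upstream tasks first (depth-first), then the task itself
        for upstream in getattr(task, "_toy_upstream", ()):
            visit(upstream)
        task()
        completed.append(task)

    for task in target_tasks:
        visit(task)
    return [task.__name__ for task in completed]

def first_task():
    print("Hello ")

@toy_follows(first_task)
def second_task():
    print("world")

# Only the last stage is named; first_task is found through the dependency
run_order = toy_run([second_task])
print(run_order)
```

Note that this sketch does no cycle detection, so two tasks that depend on each other would recurse until Python's recursion limit is hit.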

>>> pipeline_run([second_task], verbose = 1)

By default, Ruffus prints its progress through the pipeline, interleaved with the Hello printed by first_task and the world printed by second_task.

>>> pipeline_run([second_task], verbose = 1)
Hello
    Job completed
Completed Task = first_task
world
    Job completed
Completed Task = second_task