It is often very useful to group together disparate inputs into several categories, each of which lead to a separate output. In the example shown below, we produce separate summaries of results depending on which species the file belongs to.
Ruffus uses the term collate in a rough analogy to the way printers group together copies of documents appropriately.
Our example starts with some files which presumably have been created by some earlier stages of our pipeline. We simulate this here with this code:
files_names = [ "mammals.tiger.wild.animals" "mammals.lion.wild.animals" "mammals.lion.handreared.animals" "mammals.dog.tame.animals" "mammals.dog.wild.animals" "mammals.dog.feral.animals" "reptiles.crocodile.wild.animals" ] for f in files_names: open(f, "w").write(f)However, we are only interested in mammals, and we would like the files of each species to end up in its own directory, i.e. tiger, lion and dog:
import os os.mkdir("tiger") os.mkdir("lion") os.mkdir("dog")Now we would like to place each file in a different destination, depending on its species. The following regular expression marks out the species name r'mammals.([^.]+)'. For mammals.tiger.wild.animals, the first matching group (\1) == "tiger"
Then, the following:
from ruffus import * @collate('*.animals', # inputs = all *.animal files regex(r'mammals.([^.]+)'), # regular expression r'\1/animals.in_my_zoo', # single output file per species r'\1' ) # species name def capture_mammals(infiles, outfile, species): # summarise all animals of this species print "Collating %s" % species o = open(outfile, "w") for i in infiles: o.write(open(infile).read() + "\ncaptured\n") pipeline_run([capture_mammals])puts each captured mammal in its own directory:
Task = capture_mammals Job = [(mammals.lion.handreared.animals, mammals.lion.wild.animals) -> lion/animals.in_my_zoo] completed Job = [(mammals.tiger.wild.animals, ) -> tiger/animals.in_my_zoo] completed Job = [(mammals.dog.tame.animals, mammals.dog.wild.animals, mammals.dog.feral.animals) -> dog/animals.in_my_zoo] completedThe crocodile has been discarded because it isn’t a mammal and the file name doesn’t match the mammal part of the regular expression.