Chapter 15: add_inputs() and inputs(): Controlling both input and output files with @transform

The standard @transform allows you to send a list of data files to the same pipelined function, with the resulting output parameters automatically inferred from the input file names.

There are two situations where you might want additional flexibility:
  1. You need to add extra prerequisites or file names to the inputs of every single one of your jobs
  2. Less often, the actual input file names are some variant of the outputs of another task.

Either way, it is occasionally very useful to be able to generate the actual input as well as output parameters by regular expression substitution. The following examples show both how and why you would want to do this.
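
As a quick preview, the two forms look like this (a schematic sketch only; the task and file names are purely illustrative):

# append "extra.h" to the inputs of every job
@transform(previous_task, suffix(".cpp"), add_inputs("extra.h"), ".o")

# replace the inputs of every job with the matching ".py" script
@transform(previous_task, suffix(".cpp"), inputs(r"\1.py"), ".results")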

Adding additional input prerequisites per job

1.) Example: Compiling C++ code

Suppose we wished to compile some C++ ("*.cpp") files. The Ruffus code would look like this:

from ruffus import *

source_files = ["hasty.cpp", "tasty.cpp", "messy.cpp"]

# create the (empty) C++ source files in a first pipeline task
@originate(source_files)
def prepare_cpp_source(output_file_name):
    open(output_file_name, "w")

# compile each C++ file to an object file
@transform(prepare_cpp_source, suffix(".cpp"), ".o")
def compile(input_filename, output_file_name):
    open(output_file_name, "w")
This results in the following jobs:
>>> pipeline_run([compile], verbose = 2, multiprocess = 3)

    Job = [None -> hasty.cpp] completed
    Job = [None -> tasty.cpp] completed
    Job = [None -> messy.cpp] completed
Completed Task = prepare_cpp_source

    Job = [hasty.cpp -> hasty.o] completed
    Job = [messy.cpp -> messy.o] completed
    Job = [tasty.cpp -> tasty.o] completed
Completed Task = compile

2.) Example: Adding a header file with add_inputs(..)

All this is plain vanilla @transform syntax. But suppose we need to add a common header file, "universal.h", to every compilation. add_inputs(...) provides for this with a minimum of fuss:

# create header file
open("universal.h", "w")

# compile C++ files with extra header
@transform(prepare_cpp_source, suffix(".cpp"), add_inputs("universal.h"), ".o")
def compile(input_filename, output_file_name):
    open(output_file_name, "w")

Now the input parameter of each job is a Python list, with "universal.h" added alongside each "*.cpp" file:

>>> pipeline_run([compile], verbose = 2, multiprocess = 3)

    Job = [ [hasty.cpp, universal.h] -> hasty.o] completed
    Job = [ [messy.cpp, universal.h] -> messy.o] completed
    Job = [ [tasty.cpp, universal.h] -> tasty.o] completed
Completed Task = compile
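
Inside the task function, the first parameter is now that list. A typical job body might unpack it like this (a sketch only; the actual compiler invocation is assumed and not part of the original example):

def compile(input_filenames, output_file_name):
    # e.g. input_filenames == ["hasty.cpp", "universal.h"]
    source_file, header_file = input_filenames
    # ... run the real compiler here, e.g. via subprocess ...
    open(output_file_name, "w")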

Additional input prerequisites can be globs, tasks or pattern matches
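
For example, a glob pattern in add_inputs(...) pulls in every matching file as an extra prerequisite of each job. A hypothetical sketch (the "config/*.h" pattern and directory are purely illustrative):

# add every header under config/ to each compilation
@transform(prepare_cpp_source, suffix(".cpp"),
           add_inputs("config/*.h"),
           ".o")
def compile(input_filenames, output_file_name):
    open(output_file_name, "w")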

A common requirement is to include the corresponding header file in compilations. It is easy to use add_inputs to look up additional files via pattern matches.

3.) Example: Adding a matching header file

To make this example more fun, we shall also:
  1. Give each source code file its own ordinal
  2. Use add_inputs to add files produced by another task function
# each source file has its own index
source_names = [("hasty.cpp", 1),
                ("tasty.cpp", 2),
                ("messy.cpp", 3), ]
header_names = [sn.replace(".cpp", ".h") for (sn, i) in source_names]
header_names.append("universal.h")

#
#   create header and source files
#
for source, source_index in source_names:
    open(source, "w")

for header in header_names:
    open(header, "w")



from ruffus import *

#
#   look up embedded strings in each source file
#
@transform(source_names, suffix(".cpp"), ".embedded")
def get_embedded_strings(input_filename, output_file_name):
    open(output_file_name, "w")



# compile C++ files with extra header
@transform(source_names, suffix(".cpp"),
           add_inputs(  "universal.h",
                        r"\1.h",
                        get_embedded_strings     ), ".o")
def compile(input_params, output_file_name):
    open(output_file_name, "w")


pipeline_run([compile], verbose = 2, multiprocess = 3)

This script gives the following output:

>>> pipeline_run([compile], verbose = 2, multiprocess = 3)

    Job = [[hasty.cpp,  1] -> hasty.embedded] completed
    Job = [[messy.cpp,  3] -> messy.embedded] completed
    Job = [[tasty.cpp,  2] -> tasty.embedded] completed
Completed Task = get_embedded_strings

    Job = [[[hasty.cpp,  1],                                            # inputs
            universal.h,                                                # common header
            hasty.h,                                                    # corresponding header
            hasty.embedded, messy.embedded, tasty.embedded]             # output of get_embedded_strings()
           -> hasty.o] completed
    Job = [[[messy.cpp, 3],                                             # inputs
            universal.h,                                                # common header
            messy.h,                                                    # corresponding header
            hasty.embedded, messy.embedded, tasty.embedded]             # output of get_embedded_strings()
           -> messy.o] completed
    Job = [[[tasty.cpp, 2],                                             # inputs
            universal.h,                                                # common header
            tasty.h,                                                    # corresponding header
            hasty.embedded, messy.embedded, tasty.embedded]             # output of get_embedded_strings()
           -> tasty.o] completed
Completed Task = compile
We can see that the compile(...) task now has four sets of inputs:
  1. The original inputs (e.g. [hasty.cpp,  1])
And three additional sets added by add_inputs(...):
  1. A header file (universal.h) common to all jobs
  2. The matching header (e.g. hasty.h)
  3. The output from another task, get_embedded_strings() (e.g. hasty.embedded, messy.embedded, tasty.embedded)
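
Inside compile(...), input_params therefore arrives as a single sequence whose first element is the original (file name, index) pair, exactly as shown in the job output above. A sketch of unpacking it (the real compilation step is omitted):

def compile(input_params, output_file_name):
    # first element is the original input parameter;
    # the rest were added by add_inputs(...)
    original_input, common_header, matching_header, *embedded_files = input_params
    source_file_name, source_index = original_input
    open(output_file_name, "w")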

Note

For input parameters with nested structures (e.g. lists or tuples), the pattern matching is against the first file name string Ruffus comes across (in depth first search order).

So for ("hasty.cpp", 1), the pattern matches "hasty.cpp".

If in doubt, use pipeline_printout to check what parameters Ruffus is using.
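
For example, a dry run prints the parameters each job would receive without running anything (assuming the compile task defined above):

import sys
pipeline_printout(sys.stdout, [compile], verbose = 3)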

4.) Example: Using regex(..) instead of suffix(..)

Suffix pattern matching is much simpler and hence usually preferable to the more powerful regular expressions. However, we can rewrite the above example using regex(..) to give exactly the same output:

# compile C++ files with extra header
@transform(source_names, regex(r"(.+)\.cpp"),
           add_inputs(  "universal.h",
                        r"\1.h",
                        get_embedded_strings     ), r"\1.o")
def compile(input_params, output_file_name):
    open(output_file_name, "w")

Note

The backreference \g<0> usefully substitutes the entire substring matched by the regular expression.
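
For instance, \g<0> could be used to add a prerequisite named after the whole matched file name. A hypothetical sketch (the ".log" files are purely illustrative):

# add "hasty.cpp.log" etc. as an extra prerequisite of each job
@transform(source_names, regex(r"(.+)\.cpp"),
           add_inputs(r"\g<0>.log"),
           r"\1.o")
def compile(input_params, output_file_name):
    open(output_file_name, "w")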

Replacing all input parameters with inputs(...)

More rarely, it is necessary to replace all of the input parameters wholesale.

5.) Example: Running matching Python scripts

In the following example, we are not compiling C++ source files but invoking the corresponding Python scripts, which have the same names.

Given three C++ files and their corresponding Python scripts:

# each source file has its own index
source_names = [("hasty.cpp", 1),
                ("tasty.cpp", 2),
                ("messy.cpp", 3), ]

#
#   create c++ source files and corresponding python files
#
for source, source_index in source_names:
    open(source, "w")
    open(source.replace(".cpp", ".py"), "w")

The Ruffus code calls the Python script corresponding to each C++ file:

from ruffus import *


# run corresponding python files
@transform(source_names, suffix(".cpp"), inputs(  r"\1.py"), ".results")
def run_python_file(input_params, output_file_name):
    open(output_file_name, "w")


pipeline_run([run_python_file], verbose = 2, multiprocess = 3)
Resulting in this output:
>>> pipeline_run([run_python_file], verbose = 2, multiprocess = 3)
    Job = [hasty.py -> hasty.results] completed
    Job = [messy.py -> messy.results] completed
    Job = [tasty.py -> tasty.results] completed
Completed Task = run_python_file
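
In a real pipeline, the job body would actually execute the script rather than just touching the output file. A minimal sketch (using subprocess here is an assumption, not part of the original example):

import subprocess
import sys

def run_python_file(input_params, output_file_name):
    # input_params is the replaced input, e.g. "hasty.py" (as shown in the output above)
    result = subprocess.run([sys.executable, input_params],
                            capture_output = True, text = True, check = True)
    with open(output_file_name, "w") as results_file:
        results_file.write(result.stdout)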

6.) Example: Using regex(..) instead of suffix(..)

Again, the same code can be written (less clearly) using the more powerful regex(..) and Python regular expressions:

from ruffus import *


# run corresponding python files
@transform(source_names, regex(r"(.+)\.cpp"), inputs(  r"\1.py"), r"\1.results")
def run_python_file(input_params, output_file_name):
    open(output_file_name, "w")


pipeline_run([run_python_file], verbose = 2, multiprocess = 3)

This is about as sophisticated as @transform ever gets!