- @transform syntax in detail
The standard @transform allows you to send a list of data files to the same pipelined function and for the resulting outputs parameter to be automatically inferred from file names in the inputs.
- There are two situations where you might desire additional flexibility:
- You need to add additional prerequisites or filenames to the inputs of every one of your jobs.
- Less often, the actual input file names are some variant of the outputs of another task.
Either way, it is occasionally very useful to be able to generate the actual inputs as well as outputs parameters by regular expression substitution. The following examples will show you both how and why you would want to do this.
- Suppose we wish to compile some C++ ("*.cpp") files:
    source_files = "hasty.cpp", "tasty.cpp", "messy.cpp"
    for source_file in source_files:
        open(source_file, "w")
- The Ruffus code would look like this:
    from ruffus import *

    @transform(source_files, suffix(".cpp"), ".o")
    def compile(input_filename, output_file_name):
        open(output_file_name, "w")
- This results in the following jobs:
    >>> pipeline_run([compile], verbose = 2, multiprocess = 3)
        Job = [None -> hasty.cpp] completed
        Job = [None -> tasty.cpp] completed
        Job = [None -> messy.cpp] completed
    Completed Task = prepare_cpp_source
        Job = [hasty.cpp -> hasty.o] completed
        Job = [messy.cpp -> messy.o] completed
        Job = [tasty.cpp -> tasty.o] completed
    Completed Task = compile
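The way each output name follows from its input name can be approximated in plain Python. This is only a sketch of the naming rule that suffix(".cpp") and ".o" express, not Ruffus's own implementation; `swap_suffix` is a hypothetical helper introduced for illustration.

```python
def swap_suffix(filename, old_suffix, new_suffix):
    """Replace old_suffix at the end of filename with new_suffix."""
    if not filename.endswith(old_suffix):
        raise ValueError(f"{filename!r} does not end with {old_suffix!r}")
    return filename[: len(filename) - len(old_suffix)] + new_suffix


source_files = ["hasty.cpp", "tasty.cpp", "messy.cpp"]

# One (input, output) pair per job, mirroring the job list above.
jobs = [(name, swap_suffix(name, ".cpp", ".o")) for name in source_files]

for input_name, output_name in jobs:
    print(f"{input_name} -> {output_name}")
```

Each input file yields exactly one job, which is why the number of compile jobs equals the number of source files.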
So far this is plain vanilla @transform syntax. But suppose that we need to add a common header file "universal.h" to our compilation. The add_inputs indicator provides for this with a minimum of fuss:
    # create header file
    open("universal.h", "w")

    # compile C++ files with extra header
    @transform(source_files, suffix(".cpp"), add_inputs("universal.h"), ".o")
    def compile(input_filename, output_file_name):
        open(output_file_name, "w")
Now the input of each job is a Python list, with "universal.h" added to each "*.cpp" file:
    >>> pipeline_run([compile], verbose = 2, multiprocess = 3)
        Job = [ [hasty.cpp, universal.h] -> hasty.o] completed
        Job = [ [messy.cpp, universal.h] -> messy.o] completed
        Job = [ [tasty.cpp, universal.h] -> tasty.o] completed
    Completed Task = compile
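The shape of these input parameters can be mimicked without Ruffus. The sketch below only illustrates the resulting data structure; `with_added_inputs` is a hypothetical name, and the real work is done by add_inputs inside @transform.

```python
def with_added_inputs(filename, extra_inputs):
    """Return the job's input list: the original file, then the extras."""
    return [filename, *extra_inputs]


source_files = ["hasty.cpp", "tasty.cpp", "messy.cpp"]

# Every job receives the same extra header alongside its own source file.
job_inputs = [with_added_inputs(name, ["universal.h"]) for name in source_files]

for inputs in job_inputs:
    print(inputs)
```

Because the input is now a list rather than a single string, the task function has to index or unpack it to reach the source file.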
A common requirement is to include the corresponding header file in compilations. It is easy to use add_inputs to look up additional files via pattern matches.
- To make this example more fun, we shall also:
- Give each source code file its own ordinal
- Use add_inputs to add files produced by another task function
    # each source file has its own index
    source_names = [("hasty.cpp", 1),
                    ("tasty.cpp", 2),
                    ("messy.cpp", 3),
                    ]
    header_names = [sn.replace(".cpp", ".h") for (sn, i) in source_names]
    header_names.append("universal.h")

    #
    #   create header and source files
    #
    for source, source_index in source_names:
        open(source, "w")

    for header in header_names:
        open(header, "w")

    from ruffus import *

    #
    #   look up embedded strings in each source file
    #
    @transform(source_names, suffix(".cpp"), ".embedded")
    def get_embedded_strings(input_filename, output_file_name):
        open(output_file_name, "w")

    # compile C++ files with extra header
    @transform(source_names, suffix(".cpp"),
               add_inputs("universal.h",
                          r"\1.h",
                          get_embedded_strings),
               ".o")
    def compile(input_params, output_file_name):
        open(output_file_name, "w")

    pipeline_run([compile], verbose = 2, multiprocess = 3)
This script gives the following output:
    >>> pipeline_run([compile], verbose = 2, multiprocess = 3)
        Job = [[hasty.cpp, 1] -> hasty.embedded] completed
        Job = [[messy.cpp, 3] -> messy.embedded] completed
        Job = [[tasty.cpp, 2] -> tasty.embedded] completed
    Completed Task = get_embedded_strings
        Job = [[[hasty.cpp, 1],                                  # inputs
                universal.h,                                     # common header
                hasty.h,                                         # corresponding header
                hasty.embedded, messy.embedded, tasty.embedded]  # output of get_embedded_strings()
               -> hasty.o] completed
        Job = [[[messy.cpp, 3],                                  # inputs
                universal.h,                                     # common header
                messy.h,                                         # corresponding header
                hasty.embedded, messy.embedded, tasty.embedded]  # output of get_embedded_strings()
               -> messy.o] completed
        Job = [[[tasty.cpp, 2],                                  # inputs
                universal.h,                                     # common header
                tasty.h,                                         # corresponding header
                hasty.embedded, messy.embedded, tasty.embedded]  # output of get_embedded_strings()
               -> tasty.o] completed
    Completed Task = compile
- We can see that the compile(...) task now has four sets of inputs:
- The original inputs (e.g. [hasty.cpp, 1])
- And three additional inputs added by add_inputs(...):
- A header file (universal.h) common to all jobs
- The matching header (e.g. hasty.h)
- The outputs of another task, get_embedded_strings() (all of hasty.embedded, messy.embedded and tasty.embedded, for every job)
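The four sets of inputs listed above can be reproduced with Python's standard `re` module. This is a rough sketch that assumes the suffix match behaves like the regular expression `(.+)\.cpp$`; `build_compile_inputs` and `SOURCE_PATTERN` are hypothetical names, not part of Ruffus.

```python
import re

# Assumed stand-in for suffix(".cpp"): capture everything before the suffix.
SOURCE_PATTERN = re.compile(r"(.+)\.cpp$")


def build_compile_inputs(original, extra_templates, upstream_outputs):
    """Return the original input followed by every added input."""
    filename = original[0]  # the first string in the parameter drives the match
    match = SOURCE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} is not a .cpp file")
    # Templates without a backreference (e.g. "universal.h") pass through unchanged.
    extras = [match.expand(template) for template in extra_templates]
    return [original, *extras, *upstream_outputs]


embedded = ["hasty.embedded", "messy.embedded", "tasty.embedded"]
inputs_for_hasty = build_compile_inputs(
    ["hasty.cpp", 1], ["universal.h", r"\1.h"], embedded
)
print(inputs_for_hasty)
```

Only the `r"\1.h"` template varies from job to job; the common header and the upstream outputs are identical for every source file.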
Note
For input parameters with nested structures (lists or sets), the pattern is matched against the first filename string Ruffus comes across in a depth-first search.
So for ["hasty.c", 0], the pattern matches "hasty.c".
If in doubt, use pipeline_printout to check what parameters Ruffus is using.
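The depth-first rule in the note can be illustrated with a short recursive search. This sketch is an assumption about the behaviour described, not Ruffus's code; `first_string` is a hypothetical helper, and it handles only lists and tuples because sets have no defined order.

```python
def first_string(param):
    """Return the first string met in a depth-first walk, or None if there is none."""
    if isinstance(param, str):
        return param
    if isinstance(param, (list, tuple)):
        for item in param:
            found = first_string(item)
            if found is not None:
                return found
    return None


print(first_string(["hasty.c", 0]))
print(first_string([[1, ["deep.cpp"]], "shallow.cpp"]))
```

In the second call the nested string is found first, because the search descends into the first element before moving on to its siblings.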
Suffix pattern matching is much simpler and hence usually preferable to the more powerful regular expressions. The example above can nevertheless be rewritten with regex to give exactly the same output:
    # compile C++ files with extra header
    @transform(source_names, regex(r"(.+)\.cpp"),
               add_inputs("universal.h",
                          r"\1.h",
                          get_embedded_strings),
               r"\1.o")
    def compile(input_params, output_file_name):
        open(output_file_name, "w")
Note
The backreference \g<0> usefully substitutes the entire substring matched by the regular expression.
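The difference between `\1` and `\g<0>` is standard Python `re` behaviour and can be checked directly. The file names below are illustrative only.

```python
import re

cpp_pattern = re.compile(r"(.+)\.cpp")

# \1 is only the captured stem; \g<0> is everything the pattern matched.
object_name = cpp_pattern.sub(r"\1.o", "hasty.cpp")
backup_name = cpp_pattern.sub(r"\g<0>.bak", "hasty.cpp")

print(object_name)
print(backup_name)
```

Using `\g<0>` is convenient when the new name should extend the original file name rather than replace part of it.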
More rarely, it is necessary to replace all the input parameters wholesale.
In the following example, we are not compiling C++ source files but invoking the corresponding Python scripts, which have the same names.
Given three C++ files and their corresponding Python scripts:
    # each source file has its own index
    source_names = [("hasty.cpp", 1),
                    ("tasty.cpp", 2),
                    ("messy.cpp", 3),
                    ]

    #
    #   create C++ source files and corresponding Python files
    #
    for source, source_index in source_names:
        open(source, "w")
        open(source.replace(".cpp", ".py"), "w")
The Ruffus code will call the Python script corresponding to each C++ file:
    from ruffus import *

    # run corresponding python files
    @transform(source_names, suffix(".cpp"), inputs(r"\1.py"), ".results")
    def run_python_file(input_params, output_file_name):
        open(output_file_name, "w")

    pipeline_run([run_python_file], verbose = 2, multiprocess = 3)
- Resulting in this output:
    >>> pipeline_run([run_python_file], verbose = 2, multiprocess = 3)
        Job = [hasty.py -> hasty.results] completed
        Job = [messy.py -> messy.results] completed
        Job = [tasty.py -> tasty.results] completed
    Completed Task = run_python_file
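Unlike add_inputs, inputs discards the original parameter entirely, which is why the index (1, 2, 3) no longer appears in the job list. The sketch below imitates that substitution with the standard `re` module; `replace_inputs` and `CPP_PATTERN` are hypothetical names, and the pattern is an assumed equivalent of suffix(".cpp").

```python
import re

# Assumed stand-in for suffix(".cpp"): capture everything before the suffix.
CPP_PATTERN = re.compile(r"(.+)\.cpp$")


def replace_inputs(original, input_template, output_template):
    """Return (new_input, output); the original parameter is dropped."""
    filename = original[0]
    match = CPP_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} is not a .cpp file")
    return match.expand(input_template), match.expand(output_template)


source_names = [("hasty.cpp", 1), ("tasty.cpp", 2), ("messy.cpp", 3)]
jobs = [replace_inputs(source, r"\1.py", r"\1.results") for source in source_names]

for new_input, output in jobs:
    print(f"{new_input} -> {output}")
```

The original .cpp name is used only to drive the pattern match; neither it nor its index reaches the task function.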
Again, the same task can be written (less clearly) with the more powerful regex indicator and Python regular expressions:
    from ruffus import *

    # run corresponding python files
    @transform(source_names, regex(r"(.+)\.cpp"), inputs(r"\1.py"), r"\1.results")
    def run_python_file(input_params, output_file_name):
        open(output_file_name, "w")

    pipeline_run([run_python_file], verbose = 2, multiprocess = 3)
This is about as sophisticated as @transform ever gets!