@collate with add_inputs and inputs

@collate ( tasks_or_file_names, regex(matching_regex) | formatter(matching_formatter), [inputs(input_pattern_or_glob) | add_inputs(input_pattern_or_glob)] , output_pattern, [extra_parameters,...] )

Purpose:

Groups / collates sets of input files, each into a separate summary.

This variant of @collate allows additional inputs or dependencies to be added dynamically to the task.

Output file names are determined from tasks_or_file_names, i.e. from the output of upstream tasks, or from a list of file names.

This variant of @collate allows input file names to be derived in the same way.

add_inputs nests the original input parameters in a list before adding additional dependencies.

inputs replaces the original input parameters wholesale.

Only out-of-date tasks (determined by comparing input and output files) will be run.
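The out-of-date comparison can be pictured with a minimal sketch based on file modification times. This is an illustration only, not Ruffus's actual implementation: is_out_of_date is a hypothetical helper, and the real check may take more than timestamps into account.

```python
import os
import tempfile

def is_out_of_date(input_files, output_files):
    """Hypothetical helper: True if any output is missing or older than an input."""
    if not all(os.path.exists(f) for f in output_files):
        return True
    newest_input = max(os.path.getmtime(f) for f in input_files)
    oldest_output = min(os.path.getmtime(f) for f in output_files)
    return newest_input > oldest_output

with tempfile.TemporaryDirectory() as tmp:
    infile = os.path.join(tmp, "tuna.fish")
    outfile = os.path.join(tmp, "fish.summary")
    open(infile, "w").close()

    # The output does not exist yet, so the job needs to run.
    missing_output = is_out_of_date([infile], [outfile])

    # The output is newer than the input, so the job is up to date.
    open(outfile, "w").close()
    os.utime(infile, (1000, 1000))
    os.utime(outfile, (2000, 2000))
    fresh = is_out_of_date([infile], [outfile])

    # The input was modified after the output, so the job must rerun.
    os.utime(infile, (3000, 3000))
    stale = is_out_of_date([infile], [outfile])

print(missing_output, fresh, stale)
```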

Example of add_inputs

regex(r".+\.(.+)$"), r"\1.summary" creates a separate summary file for each suffix. Here we also add date of birth data for each species group:

from ruffus import *

animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"
# summarise by file suffix, adding the matching date of birth file to each input:
@collate(animal_files, regex(r".+\.(.+)$"), add_inputs(r"\1.date_of_birth"), r"\1.summary")
def summarize(infiles, summary_file):
    pass

This results in the following equivalent function calls:

summarize([ ["shark.fish",  "fish.date_of_birth"   ],
            ["tuna.fish",   "fish.date_of_birth"   ] ], "fish.summary")
summarize([ ["cat.mammals", "mammals.date_of_birth"],
            ["dog.mammals", "mammals.date_of_birth"] ], "mammals.summary")
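The grouping shown above can be reproduced with the standard re module alone. The following is a rough sketch of the idea, not Ruffus code: collate_add_inputs is a hypothetical helper, and it assumes that file names which do not match the regular expression are simply skipped.

```python
import re
from collections import defaultdict

def collate_add_inputs(file_names, pattern, extra_input, output):
    """Hypothetical helper: group inputs by their substituted output name,
    pairing each original input with its substituted extra dependency."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for name in sorted(file_names):
        match = regex.search(name)
        if match is None:
            continue  # assumption: non-matching file names are ignored
        # The original input is kept and nested alongside the added input.
        groups[match.expand(output)].append([name, match.expand(extra_input)])
    return dict(groups)

animal_files = ["tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"]
jobs = collate_add_inputs(animal_files, r".+\.(.+)$",
                          r"\1.date_of_birth", r"\1.summary")
for summary_file, infiles in jobs.items():
    print(infiles, summary_file)
```

Each key of jobs corresponds to one summarize call, and each value to its infiles argument.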

Example of inputs

Using inputs(...) will summarise only the dates of birth for each species group:

from ruffus import *

animal_files = "tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"
# summarise by file suffix, replacing each input with the matching date of birth file:
@collate(animal_files, regex(r".+\.(.+)$"), inputs(r"\1.date_of_birth"), r"\1.summary")
def summarize(infiles, summary_file):
    pass

This results in the following equivalent function calls:

summarize(["fish.date_of_birth"   ], "fish.summary")
summarize(["mammals.date_of_birth"], "mammals.summary")
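The replacement behaviour can be sketched in the same way with plain Python. Again, collate_inputs is a hypothetical helper rather than part of Ruffus, and the de-duplication step is an assumption inferred from the calls listed above, where each date of birth file appears only once per job.

```python
import re

def collate_inputs(file_names, pattern, replacement_input, output):
    """Hypothetical helper: group by substituted output name, keeping only
    the substituted replacement input (the original file name is dropped)."""
    regex = re.compile(pattern)
    groups = {}
    for name in sorted(file_names):
        match = regex.search(name)
        if match is None:
            continue  # assumption: non-matching file names are ignored
        new_input = match.expand(replacement_input)
        job_inputs = groups.setdefault(match.expand(output), [])
        if new_input not in job_inputs:  # assumption: duplicates collapse
            job_inputs.append(new_input)
    return groups

animal_files = ["tuna.fish", "shark.fish", "dog.mammals", "cat.mammals"]
jobs = collate_inputs(animal_files, r".+\.(.+)$",
                      r"\1.date_of_birth", r"\1.summary")
for summary_file, infiles in jobs.items():
    print(infiles, summary_file)
```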

Parameters:

  • tasks_or_file_names

    can be a:

    1. Task / list of tasks.

      File names are taken from the output of the specified task(s)

    2. (Nested) list of file name strings.
      File names containing *[]? will be expanded as a glob.

      E.g. "a.*" => "a.1", "a.2"

  • matching_regex

    is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.

  • input_pattern

    Specifies the resulting input(s) to each job. Must be wrapped in an inputs or an add_inputs indicator object.

    Can be a:

    1. Task / list of tasks.

      File names are taken from the output of the specified task(s)

    2. (Nested) list of file name strings.

      Strings will be subject to substitution. File names containing *[]? will be expanded as a glob. E.g. "a.*" => "a.1", "a.2"

  • output_pattern

    Specifies the resulting output file name(s).

  • extra_parameters

    Any extra parameters are passed to the task function. Strings undergo substitution first; all other values are passed verbatim.

  1. Outputs and optional extra parameters are passed to the task function after string substitution in any strings. Non-string values are passed through unchanged.
  2. Each collate job consists of the input files that map, by string substitution, to the same set of output / extra parameter matches.
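The first note can be illustrated with a short sketch using the standard re module. substitute_params is a hypothetical helper, and recursing into nested lists and tuples is an assumption made here for illustration. The point is that only strings are rewritten, while other values are returned untouched.

```python
import re

def substitute_params(match, params):
    """Hypothetical helper: expand back-references in strings, recurse into
    lists and tuples, and pass every other value through unchanged."""
    if isinstance(params, str):
        return match.expand(params)
    if isinstance(params, (list, tuple)):
        return type(params)(substitute_params(match, p) for p in params)
    return params

match = re.search(r".+\.(.+)$", "tuna.fish")
extras = substitute_params(match, [r"\1.log", 42, None, (r"species_\1",)])
print(extras)
```

Here the strings pick up the captured suffix "fish", while 42 and None are left as they are.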

See @collate for more straightforward ways to use collate.