@collate

@collate ( tasks_or_file_names, regex(matching_regex) | formatter(matching_formatter), output_pattern, [extra_parameters,...] )

Purpose:

Groups / collates sets of input files, each into a separate summary.

Only out-of-date tasks (determined by comparing input and output files) will be run.
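
The idea of an out-of-date check can be sketched with file modification times. This is a minimal illustration only, not Ruffus's own implementation, which may use additional information; the function name needs_update is hypothetical.

```python
import os

def needs_update(input_files, output_files):
    """Return True if an output is missing or older than the newest input.

    Hypothetical helper: a make-style timestamp comparison, shown only to
    illustrate what "out of date" means here.
    """
    if not output_files:
        return True
    if not all(os.path.exists(f) for f in output_files):
        return True
    # With no inputs, existing outputs are treated as up to date.
    newest_input = max((os.path.getmtime(f) for f in input_files), default=0.0)
    oldest_output = min(os.path.getmtime(f) for f in output_files)
    return oldest_output < newest_input
```

A job whose summary file is newer than all of its inputs would be skipped under this rule; touching any input would cause it to run again.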

Output file names and strings in the extra parameters are determined from tasks_or_file_names, i.e. from the output of upstream tasks, or a list of file names.

String replacement is specified through the regex or formatter indicators.

@collate groups together all inputs which result in identical output and extra parameters.

It is a many-to-fewer operation.

Example:

regex(r".+\.(.+)$"), r"\1.summary" creates a separate summary file for each suffix:

from ruffus import collate, regex

animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]
# summarise by file suffix; the pattern matches the whole file name,
# so the output is just the suffix plus ".summary"
@collate(animal_files, regex(r".+\.(.+)$"), r"\1.summary")
def summarize(infiles, summary_file):
    pass

Parameters:

  • tasks_or_file_names

    can be a:

    1. Task / list of tasks.

      File names are taken from the output of the specified task(s)

    2. (Nested) list of file name strings.
      File names containing *[]? will be expanded as a glob.

      E.g.: "a.*" => "a.1", "a.2"

  • matching_regex

    is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.

  • output_pattern

    Specifies the resulting output file name(s).

  • extra_parameters

    Any extra parameters are passed verbatim to the task function.

  1. Outputs and optional extra parameters are passed to the task function after string substitution in any strings. Non-string values are passed through unchanged.
  2. Each collate job consists of the input files which string substitution maps to a single set of output / extra parameter matches.
  3. In the example above, a.fish and b.fish both produce fish.summary after regular expression substitution, and are collated into a single job: ["a.fish", "b.fish" -> "fish.summary"], while c.mammals and d.mammals both produce mammals.summary, and are collated into a separate job: ["c.mammals", "d.mammals" -> "mammals.summary"]
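
The grouping described in these notes can be sketched in plain Python, without Ruffus. This is an illustration of the idea only, not the library's implementation: the helper collate_jobs is hypothetical, it assumes that inputs which do not match the pattern are skipped, and it uses a pattern that matches the whole file name so that the substituted output is just the suffix plus ".summary".

```python
import re
from collections import defaultdict

def collate_jobs(file_names, pattern, output_pattern):
    """Group file names by the output name that regex substitution produces.

    Hypothetical helper for illustration; not part of Ruffus.
    """
    compiled = re.compile(pattern)
    jobs = defaultdict(list)
    for name in file_names:
        if compiled.search(name) is None:
            continue  # assumption: inputs that do not match are ignored
        output = compiled.sub(output_pattern, name)
        jobs[output].append(name)  # identical outputs share one job
    return dict(jobs)

animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]
jobs = collate_jobs(animal_files, r".+\.(.+)$", r"\1.summary")
for output, inputs in jobs.items():
    print(inputs, "->", output)
```

Four inputs collapse into two jobs here, which is the many-to-fewer behaviour: every input that maps to the same output string lands in the same list.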

Example 2:

Suppose we had the following files:

cows.mammals.animal
horses.mammals.animal
sheep.mammals.animal

snake.reptile.animal
lizard.reptile.animal
crocodile.reptile.animal

pufferfish.fish.animal

and we wanted to end up with three different output files:

cows.mammals.animal
horses.mammals.animal
sheep.mammals.animal
    -> mammals.results

snake.reptile.animal
lizard.reptile.animal
crocodile.reptile.animal
    -> reptile.results

pufferfish.fish.animal
    -> fish.results

This is the @collate code required:

animals = [     "cows.mammals.animal",
                "horses.mammals.animal",
                "sheep.mammals.animal",
                "snake.reptile.animal",
                "lizard.reptile.animal",
                "crocodile.reptile.animal",
                "pufferfish.fish.animal"]

@collate(animals, regex(r"(.+)\.(.+)\.animal"),  r"\2.results")
# \1 = species [cows, horses, ...]
# \2 = phylogenetic group [mammals, reptile, fish]
def summarize_animals_into_groups(species_files, result_file):
    "... more code here"
    pass
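
Each job calls the task function with the list of collated input files and a single output file name. The body below is a hypothetical example of what such a function might do (it writes one line per input file with that file's line count); the @collate decorator is left off so that the sketch runs without Ruffus installed.

```python
import os

def summarize_animals_into_groups(species_files, result_file):
    """Write the name and line count of each input file to result_file.

    Hypothetical body for illustration; species_files is the list of inputs
    collated into one job, result_file is that job's single output.
    """
    with open(result_file, "w") as out:
        for path in sorted(species_files):
            with open(path) as handle:
                line_count = sum(1 for _ in handle)
            out.write(f"{os.path.basename(path)}\t{line_count}\n")
```

For the reptile group, the function would be called once with the three *.reptile.animal files and "reptile.results" as the output name.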

See @merge for an alternative way to summarise files.