- Purpose:
Groups / collates sets of input files, each into a separate summary.
Only out-of-date tasks (determined by comparing input and output files) will be run.
Output file names and strings in the extra parameters are determined from tasks_or_file_names, i.e. from the output of upstream tasks, or from a list of file names.
String replacement occurs either through suffix matches (via suffix) or through the formatter or regex indicators.
@collate groups together all inputs which result in identical output and extra parameters.
It is a many-to-fewer operation.
- Example:
regex(r".+\.(.+)$"), r"\1.summary" creates a separate summary file for each suffix:

    from ruffus import *

    animal_files = "a.fish", "b.fish", "c.mammals", "d.mammals"

    # summarise by file suffix:
    # the pattern matches the whole file name, so only the suffix (\1) survives
    @collate(animal_files, regex(r".+\.(.+)$"), r"\1.summary")
    def summarize(infiles, summary_file):
        pass

- Parameters:
- tasks_or_file_names
can be a:
- Task / list of tasks.
File names are taken from the output of the specified task(s).
- (Nested) list of file name strings (as in the example above).
- File names containing *[]? will be expanded as a glob.
E.g.: "a.*" => "a.1", "a.2"
- matching_regex
is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.
- matching_formatter
is a formatter indicator object, optionally containing a Python regular expression (re).
- output_pattern
Specifies the resulting output file name(s).
- extra_parameters
Any extra parameters are passed to the task function.
- Outputs and optional extra parameters are passed to the function after string substitution in any strings. Non-string values are passed through unchanged.
- Each collate job consists of input files which are aggregated by string substitution to a single set of output / extra parameter matches.
- In the example above, a.fish and b.fish both produce fish.summary after regular expression substitution, and are collated into a single job:
    ["a.fish", "b.fish" -> "fish.summary"]
  while c.mammals and d.mammals both produce mammals.summary, and are collated into a separate job:
    ["c.mammals", "d.mammals" -> "mammals.summary"]
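The grouping rule described above can be previewed without running a pipeline. The following is a minimal sketch using only the Python standard library, not Ruffus's own implementation. The helper name collate_preview is hypothetical, and the sketch assumes that substitution behaves like re.sub, that inputs which do not match the pattern are skipped, and that extra parameters are hashable.

```python
import re
from collections import defaultdict


def collate_preview(file_names, pattern, output_template, *extras):
    """Group inputs by the (output, extras) pair they produce after substitution.

    Hypothetical helper for illustration only; it is not part of Ruffus.
    """
    compiled = re.compile(pattern)
    jobs = defaultdict(list)
    for name in file_names:
        if compiled.search(name) is None:
            continue  # assumption: inputs that do not match are skipped
        # Substitute into the output name and into every string extra;
        # non-string extras are passed through unchanged.
        output = compiled.sub(output_template, name)
        substituted = tuple(
            compiled.sub(extra, name) if isinstance(extra, str) else extra
            for extra in extras
        )
        # Inputs with identical output and extras land in the same job.
        jobs[(output, substituted)].append(name)
    return dict(jobs)


animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]

# The pattern consumes the whole file name, so only the suffix remains.
jobs = collate_preview(animal_files, r".+\.(.+)$", r"\1.summary")
jobs_with_extras = collate_preview(animal_files, r".+\.(.+)$", r"\1.summary", r"\1", 42)

for (output, extra_values), inputs in jobs.items():
    print(inputs, "->", output, extra_values)
```

Each key of the returned dictionary corresponds to one job: the output name plus the substituted extra parameters. The integer extra parameter (42) is left untouched, mirroring the rule that non-string values are passed through unchanged.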
Example 2:
Suppose we had the following files:

    cows.mammals.animal
    horses.mammals.animal
    sheep.mammals.animal
    snake.reptile.animal
    lizard.reptile.animal
    crocodile.reptile.animal
    pufferfish.fish.animal

and we wanted to end up with three different resulting outputs:

    cows.mammals.animal
    horses.mammals.animal
    sheep.mammals.animal       -> mammals.results

    snake.reptile.animal
    lizard.reptile.animal
    crocodile.reptile.animal   -> reptile.results

    pufferfish.fish.animal     -> fish.results

This is the @collate code required:

    from ruffus import *

    animals = [
        "cows.mammals.animal",
        "horses.mammals.animal",
        "sheep.mammals.animal",
        "snake.reptile.animal",
        "lizard.reptile.animal",
        "crocodile.reptile.animal",
        "pufferfish.fish.animal"]

    # \1 = species [cows, horses]
    # \2 = phylogenetics group [mammals, reptile, fish]
    @collate(animals, regex(r"(.+)\.(.+)\.animal"), r"\2.results")
    def summarize_animals_into_groups(species_file, result_file):
        "... more code here"
        pass
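To check what \1 and \2 capture for each file name before wiring the pattern into a pipeline, the regular expression can be tried directly with Python's re module. This is only a sketch for inspecting the pattern; the helper name capture_table is hypothetical and nothing here calls Ruffus.

```python
import re

animals = [
    "cows.mammals.animal",
    "horses.mammals.animal",
    "sheep.mammals.animal",
    "snake.reptile.animal",
    "lizard.reptile.animal",
    "crocodile.reptile.animal",
    "pufferfish.fish.animal",
]

pattern = re.compile(r"(.+)\.(.+)\.animal")


def capture_table(file_names):
    """Return (file name, \\1, \\2, output name) for every matching file name."""
    rows = []
    for name in file_names:
        match = pattern.fullmatch(name)
        if match is None:
            continue  # skip names that do not fit the pattern
        species, group = match.group(1), match.group(2)
        # Expand the output template using the captured groups.
        rows.append((name, species, group, match.expand(r"\2.results")))
    return rows


rows = capture_table(animals)
for name, species, group, output in rows:
    print(f"{name:28} \\1={species:12} \\2={group:8} -> {output}")
```

Seven inputs map to only three distinct output names, which is why the example yields three jobs.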
See @merge for an alternative way to summarise files.