Converting FROM the Collins Economics Result Object (CERO) format

Output-Independent Instructions

Setting up a FromCERO configuration file

Like all other configuration files for this program, the configuration file must be in YAML format. The highest hierarchical level (i.e. with the least or no indentation) is referred to as a FromCERO object. For the configuration to output anything meaningful, it is necessary to define the option:

  • procedures: (list[dict|str]) - a list of one or more procedure objects. Procedure objects are explained below.

Procedures define the mutation(s) and, if desired, the export of the mutated data to a file. If a procedure does not specify an export file then ConCERO will, by default, export the procedure's output to output.csv in the current working directory. The default output file can be overridden by specifying the file option of the FromCERO object.

It is recommended that the following option be specified:

  • file: (str) - Names the file into which all procedure objects are exported. Procedure objects will export into this file unless a procedure-object-specific file has been defined. The extension of file determines the exported file type. Supported file types are:

    • Numpy arrays - npy
    • GAMS Data eXchange format - gdx (temporarily unsupported)
    • HAR files - har
    • Shock files - shk
    • Portable Network Graphics format - png
    • Portable Document Format - pdf
    • PostScript - ps
    • Encapsulated PostScript - eps
    • Scalable Vector Graphics - svg

Other options include:

  • sets: (dict: str -> List[str]) - sets is a dictionary mapping str to a list of str. sets provides an easy and convenient way to select groups of CERO identifiers (see CERO Identifiers), as opposed to simply listing all the identifiers that are of interest for output. More detail about sets is provided below in the section Sets.
  • map: (dict: str -> str) - key-value pairs that map the “old” identifier to a “new” identifier.
  • ref_dir: (str) where ref_dir is a file path relative to the current working directory. By default, all file names are interpreted as being relative to the configuration file. Providing this option overrides the default.
  • lstrip: (str) where str, if provided, strips the left-most substring from all identifiers that make up the input. If the string does not match the start of the identifier (if the identifier is a str), or the first field of the identifier (if the identifier is a tuple), then a ValueError is raised. This option is designed to correspond to CEROs generated using ToCERO with the auto_prepend option provided.
  • libfuncs: (str|list[str]) - paths relative to ref_dir of python files containing functions to use as operation functions. Note that the .py filename extension must be included. The structure of a libfuncs file is discussed below in Libfuncs Files.

Note that, in general, properties at a lower level (i.e. more indentation) ‘inherit’ from a higher level.

So, an example configuration, in YAML format, is:

file: a_csv_file.csv
procedures:
    - <Procedure Object A>
    - <Procedure Object B>
    - <Procedure Object C>
    - <etc.>

Examples of complete configuration files can be found in the tests/data subdirectory of the ConCERO install path.
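
The inheritance rule noted above (lower levels inherit from higher levels, and output.csv is the fallback export file) can be illustrated with a small sketch. This is not ConCERO's implementation: the helper effective_options and the file names my_funcs.py and proc_b.csv are hypothetical, and the sketch assumes that every FromCERO-level option other than procedures is inheritable, which the documentation only states holds "in general".

```python
from collections import ChainMap

# FromCERO-level options (the outer, least-indented level of the YAML file).
from_cero_conf = {
    "file": "a_csv_file.csv",
    "libfuncs": "my_funcs.py",  # hypothetical file name
    "procedures": [
        {"name": "proc_a"},                        # no file: inherits it
        {"name": "proc_b", "file": "proc_b.csv"},  # overrides file
    ],
}

# Assumed fallback when neither level names a file.
DEFAULTS = {"file": "output.csv"}


def effective_options(procedure, parent):
    """Look up options in the procedure first, then the parent, then defaults."""
    inherited = {k: v for k, v in parent.items() if k != "procedures"}
    return ChainMap(procedure, inherited, DEFAULTS)


for proc in from_cero_conf["procedures"]:
    opts = effective_options(proc, from_cero_conf)
    print(proc["name"], "->", opts["file"])
```

With this configuration, proc_a resolves to a_csv_file.csv and proc_b resolves to proc_b.csv.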

Procedure Objects

Conceptually, procedure objects provide the instructions to select data from a CERO, mutate that data (if necessary), and then either (a) output this data into a file, or (b) return outputs for later export into a global file (specified by the file option in the FromCERO object). Any mutations that are applied to a procedure object’s inputs are isolated from any other procedure object and the CERO itself - i.e. each procedure can be considered a ‘silo’ separate from others.

A procedure object can be either a str or a dict. The dict form is the more general form - if a procedure object is provided as the (str) ser_obj, it is immediately converted to the equivalent form {'name': ser_obj}. The complete list of options is:

  • name: (str) - the name given to the procedure. By default, the procedure is given the name Unnamed_proc.
  • file: (str) - if provided, the output from this procedure object, and only this procedure object, will be exported to file.
  • inputs: list(str|list(str)) - a list of identifiers corresponding to identifiers in the CERO. If an item of the list is a string with one or more commas, or is itself a list, then the item will be interpreted as a tuple-form identifier. See CERO Identifiers.
  • outputs: list(str|list(str)) - a list of identifiers that are to be exported to the file. If outputs are not specified, then ConCERO will export all updated inputs after all operations are performed. Read Description of the output process-flow to understand what is meant by updated inputs. If it is desirable that none of the data series be exported to a file in a conventional manner, which is the case if - for example - plotting output, then specify outputs, but leave the corresponding value blank to indicate a value of None. If an item of the list is a string with one or more commas, or is itself a list, then the item will be interpreted as a tuple-form identifier. See CERO Identifiers.
  • operations: list[operations objects] - to mutate the inputs into a desirable form for export, operations must be applied to mutate the data. operations is a list of operations objects, which modify the data in a sequential manner. See Operations Objects for more information.
  • libfuncs: (str|list[str]) - Identical in meaning to the equivalent FromCERO object option. Is inherited from a FromCERO object if not given.
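
The tuple-form rule described for inputs and outputs can be sketched as follows. The function parse_identifier is a hypothetical illustration, not part of ConCERO, and it assumes that fields are split on commas exactly as written (no whitespace trimming).

```python
def parse_identifier(item):
    """Interpret one item of an inputs/outputs list (hypothetical helper).

    A string containing a comma, or a list, becomes a tuple-form identifier;
    any other string is kept as a plain str identifier.
    """
    if isinstance(item, str):
        if "," in item:
            return tuple(item.split(","))
        return item
    return tuple(item)


print(parse_identifier("a"))          # a
print(parse_identifier("a,b"))        # ('a', 'b')
print(parse_identifier(["a", "b"]))   # ('a', 'b')
```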

Below is a shell showing the two different procedure object types:

procedures:
    - name: (str)
      inputs: (list[str])
      operations: (list[operation])
      file: (str)
    - (str)

The 1st procedure object is in dictionary form, and the 2nd is in string form.
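
The conversion from string form to dictionary form, together with the default name, can be sketched as below. The helper normalise_procedure is hypothetical and only mirrors the behaviour described in this section.

```python
def normalise_procedure(proc):
    """Return the dict form of a procedure object (hypothetical helper)."""
    if isinstance(proc, str):
        # A str procedure object is shorthand for {'name': <that str>}.
        proc = {"name": proc}
    # Apply the documented default name when none is given.
    return {"name": "Unnamed_proc", **proc}


print(normalise_procedure("my_proc"))
print(normalise_procedure({"inputs": ["a", "b"]}))
```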

Inheritance paths

Below is an outline of how options are inherited:

  • inputs - If inputs is undefined, then inputs is the entire CERO (whatever that may be at runtime).
  • outputs - If outputs is undefined or True, then all inputs are outputs. If outputs is False or None, then there are no outputs. A list or str can be provided to select specific outputs.
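
A minimal sketch of the outputs rule, assuming the procedure object has already been loaded into a Python dict (so that a blank YAML value appears as None). The function resolve_outputs is hypothetical, ignores set references, and takes inputs to mean the updated inputs left after all operations have run.

```python
def resolve_outputs(procedure, inputs):
    """Decide which identifiers a procedure exports (hypothetical helper)."""
    if "outputs" not in procedure or procedure["outputs"] is True:
        return list(inputs)   # undefined or True: all inputs are outputs
    outputs = procedure["outputs"]
    if outputs is False or outputs is None:
        return []             # False or None: nothing is exported
    if isinstance(outputs, str):
        return [outputs]      # a single identifier given as a str
    return list(outputs)      # an explicit list of identifiers


inputs = ["a", "b", "c"]
print(resolve_outputs({"name": "p"}, inputs))
print(resolve_outputs({"name": "p", "outputs": None}, inputs))
print(resolve_outputs({"name": "p", "outputs": ["a", "c"]}, inputs))
```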

Operations Objects

An operation refers to the process of applying a function to some inputs to return one or more outputs. Unlike separate procedures, operations (within the same procedure object) cannot be considered to operate in a ‘silo-ed’ manner, and therefore the order of operations is significant. Each item of the list operations must be an operation object - that is, a dict, which may contain the options:

  • func: (str) - func is the name of a function present in a libfuncs library that is applied to arrays (see below). The functions available can be easily expanded by:
    1. Correctly identifying the class of the new function - see Classes of User-Specified Functions for operating on CEROs.
    2. Adding the function to a python source code file, with the associated function decorator (as explained in Classes of User-Specified Functions for operating on CEROs), and providing that file to ConCERO with the libfuncs procedure option. The system libfuncs.py will be searched after any referenced files.
  • arrays: list(str|list(str)) - arrays defines which of the inputs func will manipulate. If arrays is not provided, arrays defaults to all procedure object inputs. Note that any manipulation applied to arrays will be in effect for all subsequent operations.
  • rename: (list|dict) - providing this option renames arrays after the application of func (if provided). If rename is provided as a list, then the list is parsed as identifiers (see CERO Identifiers) and must be the same length as arrays. If provided as a dict, only those arrays matching keys in the dict are renamed to the corresponding value. Regardless of the form of rename (i.e. list or dict), references to sets can be made. In the specific case that arrays contains exactly one identifier, rename can be provided as a str. If rename is provided and the new identifier values are not already in arrays, then rename expands arrays to include the new identifiers (and the data series corresponding to the original identifiers are left untouched). By using this behaviour, rename can be used to apply func to specific arrays without altering the original arrays.
  • start_year: (int) - this option constrains the dataset to years after and including start_year. This option may be useful to avoid attempting to apply func to missing data.
  • end_year: (int) - this option constrains the dataset to years before and including end_year. This option may be useful to avoid attempting to apply func to missing data.
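
The inclusive year window defined by start_year and end_year can be sketched as below. A CERO is a pandas.DataFrame, so the plain dict keyed by integer year used here is a simplifying assumption, and constrain_years is a hypothetical helper.

```python
def constrain_years(series, start_year=None, end_year=None):
    """Keep only the years within [start_year, end_year], both inclusive."""
    return {
        year: value
        for year, value in series.items()
        if (start_year is None or year >= start_year)
        and (end_year is None or year <= end_year)
    }


series = {2017: None, 2018: 1.0, 2019: 1.5, 2020: 2.0, 2021: None}
# Trim the years with missing data before applying a function.
print(constrain_years(series, start_year=2018, end_year=2020))
```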

Any additional options are passed to func as keyword arguments.
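
To illustrate how additional options reach func, the sketch below separates the options named in this section from everything else and forwards the remainder as keyword arguments. The names split_operation, scale and factor are hypothetical, and ConCERO's own dispatch may differ in detail.

```python
# Options with a documented meaning in an operation object.
RESERVED = {"func", "arrays", "rename", "start_year", "end_year"}


def split_operation(operation):
    """Separate the documented options from the keyword arguments for func."""
    reserved = {k: v for k, v in operation.items() if k in RESERVED}
    kwargs = {k: v for k, v in operation.items() if k not in RESERVED}
    return reserved, kwargs


def scale(values, factor=1.0):
    """Hypothetical operation function: multiply every value by factor."""
    return [v * factor for v in values]


operation = {"func": "scale", "arrays": ["a"], "factor": 3.0}
reserved, kwargs = split_operation(operation)
result = scale([1.0, 2.0], **kwargs)
print(kwargs, result)
```

In the equivalent YAML, factor would simply be written as another key of the operation object, alongside func and arrays.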

Sets

The sets option must have the following form:

sets: dict[str -> list(str)]

The sets option provides a powerful way to list many identifiers with a small number of references. An example configuration of sets is:

sets:
    ASET:
        - a
        - b
        - c

A user can then specify all the elements of the set (for inputs, arrays and outputs) by referencing the set. For example:

sets:
    ASET:
        - a
        - b
        - c
        - d
        - e
procedures:
    - name: a_procedure
      inputs: ASET
      operations:
        - func: a_func
    - name: b_procedure
      inputs: ASET
      operations:
        - func: b_func

Which is equivalent to the more verbose:

procedures:
    - name: a_procedure
      inputs:
        - a
        - b
        - c
        - d
        - e
      operations:
        - func: a_func
    - name: b_procedure
      inputs:
        - a
        - b
        - c
        - d
        - e
      operations:
        - func: b_func
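
The equivalence between the two configurations above amounts to replacing each set name with its members. A minimal sketch, assuming a hypothetical helper expand_sets and plain (non-tuple) identifiers only:

```python
def expand_sets(identifiers, sets):
    """Replace every set name in identifiers with the members of that set."""
    if isinstance(identifiers, str):
        identifiers = [identifiers]
    expanded = []
    for item in identifiers:
        # Items that are not set names are kept as ordinary identifiers.
        expanded.extend(sets.get(item, [item]))
    return expanded


sets = {"ASET": ["a", "b", "c", "d", "e"]}
print(expand_sets("ASET", sets))         # ['a', 'b', 'c', 'd', 'e']
print(expand_sets(["ASET", "z"], sets))  # ['a', 'b', 'c', 'd', 'e', 'z']
```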

Specifying sets is even more powerful when using them in the context of tuple-identifiers. For example, consider that these (100*100 = 10,000) identifiers were in the CERO (in python list form):

[('1', '1'), ('1', '2'), ('1', '3'), ..., ('1', '100'), ('2', '1'), ('2', '2'), ..., ('2', '100'),
 ('3', '1'), ..., ('3', '100'), ..., ('100', '100')]

Rather than listing all 10,000 identifiers, a user can create a set:

sets = {'century': ['1', '2', '3', ..., '100']}

and select all 10,000 identifiers by referencing the set twice with a comma in between - e.g. in YAML:

inputs:
    - century,century

Note that the selection takes place by using the cartesian product operation, and it is necessary that the cartesian product be convex.
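
The cartesian product selection can be reproduced with itertools.product from the standard library. The helper expand_tuple_reference is hypothetical; it treats any field that is not a set name as a literal value.

```python
from itertools import product


def expand_tuple_reference(reference, sets):
    """Expand a comma-separated reference into tuple-form identifiers."""
    fields = reference.split(",")
    # Each field is either a set name (many choices) or a literal (one choice).
    choices = [sets.get(field, [field]) for field in fields]
    return list(product(*choices))


sets = {"century": [str(n) for n in range(1, 101)]}
identifiers = expand_tuple_reference("century,century", sets)
print(len(identifiers))   # 10000
print(identifiers[:3])    # [('1', '1'), ('1', '2'), ('1', '3')]
```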

Libfuncs Files

A libfuncs file is a standard python source file. However, to use the definitions as operations in ConCERO, it is necessary to wrap the functions with specialised wrappers. Therefore, an example python source code file that provides ConCERO-compatible operations is:

from concero.libfuncs_wrappers import recursive_op

@recursive_op
def double_values(x):
    return 2*x

Here, the double_values function simply doubles the value of all input series. Note that series_op and dataframe_op are also wrappers that encapsulate functions to ensure they are ConCERO-compatible. For more information on how to use the wrappers, please consult Classes of User-Specified Functions for operating on CEROs.

Description of the output process-flow

Each procedure object corresponds to the output of an object into a file. Every procedure takes inputs (from a CERO), mutates these inputs in some way (or not), and then outputs some, if not all, of the mutated inputs into a file. More specifically, in converting a CERO to an output file, the general process flow is:

  1. From the given CERO, identify using inputs the relevant data series by their identifier.
  2. Copy those inputs to avoid disturbing/mutating the original CERO.
  3. From the copied inputs, perform a sequence of operations where, for each operation:
    1. All of the inputs, or a subset of inputs is selected (that is, the arrays).
    2. A function mutates the arrays.
    3. If given, arrays are renamed.
    4. The copied inputs get updated with the mutated arrays. For values of arrays that match inputs, those inputs are overwritten. Otherwise (in the event arrays have been renamed) they are added to inputs.
  4. Export outputs to either:
    1. file, if file is specified in the procedure object, or
    2. file as defined in the FromCERO object, if specified, or
    3. output.csv if file is unspecified in either the procedure or FromCERO objects.
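
Steps 1 to 3 of the process flow above can be sketched with plain Python containers. This is an illustration under stated assumptions, not ConCERO's code: the CERO is modelled as a dict of lists rather than a pandas.DataFrame, run_procedure is a hypothetical name, only the list form of rename is handled, and each function is assumed to return new data rather than mutate its argument.

```python
import copy


def run_procedure(cero, inputs, operations, funcs):
    """Select, copy and sequentially mutate inputs (hypothetical sketch)."""
    # Steps 1-2: select the inputs and copy them so the CERO is untouched.
    working = {ident: copy.deepcopy(cero[ident]) for ident in inputs}
    for op in operations:
        # Step 3.1: arrays defaults to all (current) inputs.
        arrays = op.get("arrays", list(working))
        # Step 3.2: apply the function, if one is given.
        func = funcs[op["func"]] if "func" in op else (lambda s: s)
        mutated = [func(working[ident]) for ident in arrays]
        # Step 3.3: rename (list form only in this sketch).
        names = op.get("rename", arrays)
        # Step 3.4: overwrite matching inputs, add renamed ones.
        for name, series in zip(names, mutated):
            working[name] = series
    return working


cero = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
funcs = {"double": lambda series: [2 * v for v in series]}
operations = [
    {"func": "double", "arrays": ["a"]},                      # overwrites a
    {"func": "double", "arrays": ["a"], "rename": ["a_x2"]},  # adds a_x2
]
result = run_procedure(cero, ["a", "b"], operations, funcs)
print(result)
```

Because the second operation renames its result, a keeps the value produced by the first operation, while a_x2 holds the doubled copy. The original cero dict is unchanged throughout.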

FromCERO Technical Specification

class from_cero.FromCERO(conf: dict, *args, parent=None, **kwargs)

Any additional arguments and keyword arguments are passed to the superclass at initialisation (i.e. the dict class).

Parameters:
  • conf ("Union[str,dict]") – A dictionary containing the configuration. If a str is provided, it is interpreted as a file (in YAML format) containing a configuration dictionary (relative to the current working directory).
  • parent (dict) – If provided, the created object will inherit from parent (a dict).

exec_procedures(cero)

Execute all the procedures of the FromCERO object.

Parameters:
  • cero (pandas.DataFrame) – A CERO to serve as input for the procedures. The argument is not mutated/modified.

static is_valid(conf: dict, raise_exception=True)

Performs static checks on conf to verify if conf can be converted to a FromCERO object.

Checks include:
  • Valid type.
  • Valid procedures.
  • If file given, that the user has write permissions in that directory.
Parameters:
  • conf (dict) – The object to check the validity of.
  • raise_exception (bool) – If True (the default) then an exception will be raised on failure. Otherwise an error message will be printed to stdout and False returned.
Return bool:

True if conf passes all static checks.

static load_config(conf, parent=None)

Loads configuration of FromCERO. If conf is a str, this is interpreted as a file (in YAML format) containing a configuration dictionary (relative to the current working directory). Otherwise conf must be a dictionary.

Parameters:
  • conf ("Union[str,dict]") – The configuration dictionary, or the path of a YAML file containing one.
Return dict:

static run_checks(conf: dict, cero: pandas.core.frame.DataFrame, raise_exception=True)

Performs runtime checks on conf, given cero.

Parameters:
  • conf (dict) – The object to check the validity of.
  • raise_exception (bool) – If True (the default) then an exception will be raised on failure. Otherwise an error message will be printed to stdout and False returned.
Return bool:

True if conf passes all runtime checks.

Created on Jan 22 08:44:08 2018

Section author: Lyle Collins <Lyle.Collins@csiro.au>

Code author: Lyle Collins <Lyle.Collins@csiro.au>