Source code for libfuncs_wrappers

#      ConCERO - a program to automate data format conversion and the execution of economic modelling software.
#      Copyright (C) 2018  CSIRO Energy Business Unit
#
#     This program is free software: you can redistribute it and/or modify
#     it under the terms of the GNU General Public License as published by
#     the Free Software Foundation, either version 3 of the License, or
#     (at your option) any later version.
#
#     This program is distributed in the hope that it will be useful,
#     but WITHOUT ANY WARRANTY; without even the implied warranty of
#     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#     GNU General Public License for more details.
#
#     You should have received a copy of the GNU General Public License
#     along with this program.  If not, see <https://www.gnu.org/licenses/>.

"""
The set of functions that *could* be applied to the CERO, and to data series within the CERO, is infinitely large, \
so it is impossible to provide all of these functions. It is therefore necessary that users provide \
functions as they are needed, by writing the appropriate Python 3 code and including the function in ``libfuncs.py``. \
To minimise the difficulty and complexity of achieving this, ConCERO includes 3 classes of *wrapper functions* that \
significantly reduce the difficulty of extending the power of ``FromCERO``.

A *wrapper function* is a function that encapsulates another function, and therefore has access to both the inputs \
and outputs of the encapsulated function. Because the wrapper function has access to the inputs, it can provide \
pre-processing on the input to reshape it into a specific form, and because it has access to the output of the \
function, it can post-process the output of the function - mutating it into a desirable form.

A wrapper function can be directed to encapsulate a function by preceding the function definition with a \
*decorator*. A *decorator* is a simple one-line statement consisting of the '\@' symbol followed by the name of \
the wrapper function. For example, to encapsulate ``func`` with the ``dataframe_op`` wrapper, the code is:

.. code-block:: python

    @dataframe_op
    def func(*args, **kwargs):
        ...
        return cero


The wrapper functions are stored in the ``libfuncs_wrappers`` module, but should *never* be altered by the user.

The 3 classes of wrappers, and how to apply them, are explained below, along with the case where no wrapper/decorator is provided.

Class 1 Functions - DataFrame Operations
----------------------------------------

Class 1 functions are the most general type of wrapper functions, and can be considered a superset of the other two. Class 1 functions operate on a ``pandas.DataFrame`` object, and therefore can operate on an entire CERO if need be. A class 1 function must have the following function signature:

.. code-block:: python

    @dataframe_op
    def func_name(df, *args, **kwargs):
        ...
        return cero

Note the following key features:

    * The function is preceded by the ``dataframe_op`` decorator (imported from ``libfuncs_wrappers``).
    * The first argument provided to ``func_name``, that is ``df``, will be a CERO (an instance of a pandas.DataFrame), \
    reduced by the ``arrays``/``inputs`` options.
    * The returned object (``cero``) must be a valid CERO. A valid CERO is a ``pandas.DataFrame`` object with a ``DatetimeIndex`` for columns and tuple/string-type values for the index.

The ``libfuncs`` function ``merge`` provides a simple example of how to apply this wrapper:

.. code-block:: python

    @dataframe_op
    def merge(df):
        df.iloc[0, :] = df.sum(axis=0) # Replaces the first series with the sum of all the series
        return df
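The ``dataframe_op`` wrapper also restricts ``df`` (via the ``locs``/``ilocs``/``start_year``/``end_year`` keyword arguments documented in the technical specifications) before the wrapped function is called. A minimal standalone sketch of that restriction, using only ``pandas`` - the frame and identifiers here are illustrative, not part of ConCERO:

```python
import pandas as pd

# Standalone sketch of the restriction dataframe_op performs before calling
# the wrapped function. The frame and identifiers are illustrative only.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["a", "b"],
    columns=pd.to_datetime(["2018", "2019", "2020"], format="%Y"),
)

locs = ["b"]                           # identifiers naming rows of df.index
start_year = pd.Timestamp(2019, 1, 1)  # integer years are converted similarly

ilocs = [df.index.get_loc(loc) for loc in locs]
start = df.columns.get_loc(start_year)
restricted = df.iloc[ilocs, start:]    # what the wrapped function receives
```
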

Class 2 Functions - Series Operations
-------------------------------------

Class 2 functions operate on a single ``pandas.Series`` object. Note that a single row of a ``pandas.DataFrame`` is \
an instance of a ``pandas.Series``. The series operations class can be considered a subset of DataFrame operations, \
and a superset of all recursive operations (discussed below).

Similar to class 1 functions, class 2 functions must fit the form:

.. code-block:: python

    @series_op
    def func(series, *args, **kwargs):
        ...
        return pandas_series

With similar features:

    * The function is preceded by the ``@series_op`` decorator (imported from ``libfuncs_wrappers``).
    * The first argument (``series``) must be of ``pandas.Series`` type.
    * The returned object (``pandas_series``) must be of ``pandas.Series`` type, and must be of the \
    same ``shape`` as ``series``.
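A minimal sketch of a class 2 function body is below. The ``normalise`` function is illustrative (not part of ``libfuncs``), and the ``@series_op`` decorator is omitted so the snippet runs on a plain ``pandas.Series``; in ``libfuncs.py`` the definition would be preceded by ``@series_op``.

```python
import pandas as pd

# Illustrative class 2 function body (not part of libfuncs). In libfuncs.py
# the definition would be preceded by @series_op; the decorator is omitted
# here so the snippet runs on a plain pandas.Series.
def normalise(series, base=1.0):
    # Scale the series so that its first value equals ``base``.
    return series * (base / series.iloc[0])

ser = pd.Series([2.0, 4.0, 6.0])
result = normalise(ser, base=100.0)  # same shape as the input, as required
```
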

Class 3 Functions - Recursive Operations
----------------------------------------

Recursive operations must fit the form:

.. code-block:: python

    @recursive_op
    def func(*args, **kwargs):
        ...
        return calc

Noting that:

    * Positional arguments are provided in the same order as their sequence in the data series.
    * The return value ``calc`` must be a single floating-point value.

Note that options can be provided to an operation object to alter the behaviour of the recursive operation. Those \
options are:

    * ``init: list(float)`` - values that precede the data series that serve as initialisation values.
    * ``post: list(float)`` - values that follow the data series for non-causal recursive functions.
    * ``auto_init: int`` - automatically prepend the first value in the array an ``auto_init`` number of times to the series (and therefore use that as the initial conditions).
    * ``auto_post: int`` - automatically postpend the last value in the array an ``auto_post`` number of times to the series (and therefore using that as the post conditions).
    * ``init_cols: list(int)`` - specifies the year(s) to use as initialisation values.
    * ``post_cols: list(int)`` - specifies the year(s) to use as post-pended values.
    * ``init_icols: list(int)`` - specifies the index (zero-indexed) to use as initialisation values.
    * ``post_icols: list(int)`` - specifies the index (zero-indexed) to use as post-pended values.
    * ``inplace: bool`` - If ``True``, then the recursive operation will be applied on the array \
        inplace, such that the result from a previous iteration is used in subsequent \
        iterations. If ``False``, the operation proceeds ignorant of the results of \
        previous iterations. ``True`` by default.
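The ``auto_init``/``auto_post`` options simply expand into ``init``/``post`` values taken from the ends of the series, which can be sketched in plain Python (mirroring the expansion performed inside ``recursive_op``):

```python
# Sketch of how auto_init/auto_post expand into init/post values
# (mirrors the expansion performed inside recursive_op).
array = [2.0, 1.0, 3.0]

auto_init, auto_post = 2, 1
init = [array[0]] * auto_init   # first value, repeated auto_init times
post = [array[-1]] * auto_post  # last value, repeated auto_post times
```
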

How these items are applied is probably best explained with an example - consider a recursive operation that \
implements a 3-sample moving-average filter. This can be implemented by including ``mv_avg_3()`` (below) in \
``libfuncs.py``:

.. code-block:: python

    @recursive_op
    def mv_avg_3(a, b, c):
        return (a + b + c)/3

It is also necessary to provide the arguments ``init`` and ``post`` in the configuration file, so the operation \
object looks something like:

.. code-block:: yaml

    func: mv_avg_3
    init:
        - 1
    post:
        - 2

This operation would transform the data series ``[2, 1, 3]`` into the values \
``[1.3333, 1.7778, 2.2593]`` - i.e. ``[(1+2+1)/3, (1.3333+1+3)/3, (1.7778+3+2)/3]``. If, instead, the configuration \
file looks like:
file looks like:

.. code-block:: yaml

    func: mv_avg_3
    init:
        - 1
    post:
        - 2
    inplace: False

Then the output of the same series would be ``[1.3333, 2, 2]`` - that is, ``[(1+2+1)/3, (2+1+3)/3, (1+3+2)/3]``.
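The arithmetic above can be checked with a short standalone sketch; ``apply_recursive`` is a hypothetical helper (not part of ConCERO) that mirrors the iteration ``recursive_op`` performs for a 3-argument function, using plain Python lists:

```python
def mv_avg_3(a, b, c):
    return (a + b + c) / 3

def apply_recursive(values, init, post, inplace=True):
    # Hypothetical helper mirroring recursive_op's iteration for a
    # 3-argument function; not part of ConCERO.
    buf = init + list(values) + post
    out = list(buf)
    no_ops = len(buf) - 3 + 1
    for i in range(no_ops):
        res = mv_avg_3(*buf[i:i + 3])
        if inplace:
            buf[i + len(init)] = res   # later iterations see this result
        else:
            out[i + len(init)] = res   # later iterations use original values
    result = buf if inplace else out
    return result[len(init): len(result) - len(post)]

inplace_result = [round(v, 4) for v in apply_recursive([2, 1, 3], [1], [2], inplace=True)]
# -> [1.3333, 1.7778, 2.2593]
ordinary_result = [round(v, 4) for v in apply_recursive([2, 1, 3], [1], [2], inplace=False)]
# -> [1.3333, 2.0, 2.0]
```
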

Wrapper-less Functions
----------------------

It is **strongly recommended** that users encapsulate functions with the predefined wrappers. This section should only be used as guidance for understanding how the wrappers operate with the ``FromCERO`` module, and for understanding how to write additional wrappers (which is a non-trivial exercise).

A function that is not decorated with a pre-defined wrapper (as discussed previously) must have the following function signature to be compatible with the ``FromCERO`` module:

.. code-block:: python

    def func_name(df, *args, locs=None, **kwargs):
        ...
        return cero

Where:

    * ``df`` receives the entire CERO (as handled by the calling class), and
    * ``locs`` receives a list of identifiers specifying which series of the CERO have been selected, and
    * ``cero`` is the returned dataframe and must be of CERO type. The FromCERO module will overwrite any values of its own CERO with those provided by ``cero``, based on an index match (after renaming takes place).
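A minimal sketch of a wrapper-less function follows; ``scale`` is illustrative only (not part of ConCERO), and stands in for whatever operation the user needs:

```python
import pandas as pd

# Illustrative wrapper-less function (not part of ConCERO). It receives the
# entire CERO plus the ``locs`` identifiers, and must perform any restriction
# itself before returning a valid CERO.
def scale(df, factor, locs=None, **kwargs):
    if locs is None:
        locs = df.index.tolist()
    return df.loc[locs] * factor  # restricted to the selected series

# A small CERO-like frame: DatetimeIndex columns, string index values.
cero = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=["a", "b"],
    columns=pd.to_datetime(["2018", "2019"], format="%Y"),
)
updated = scale(cero, 10.0, locs=["a"])
```
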

Other Notes
-----------

    * Avoid trying to create a renaming function - use the ``cero.rename_index_values()`` method - it has been designed to work \
    around a bug in Pandas (Issue #19497).
    * The system module ``libfuncs`` serves as a source of examples for how to use the function wrappers.

Technical Specifications of Decorators
--------------------------------------

.. autofunction:: dataframe_op
.. autofunction:: series_op
.. autofunction:: recursive_op
.. autofunction:: log_func

Created on Thu Dec 21 16:36:02 2017

@author: Lyle Collins
@email: Lyle.Collins@csiro.au
"""
import functools

import pandas as pd

import concero.conf as conf
from concero._identifier import _Identifier
from concero.cero import CERO

log = conf.setup_logger(__name__)

def log_func(func):
    """Logging decorator - for debugging purposes. To apply to function ``func``:

    .. code-block:: python

        @log_func
        def func(*args, **kwargs):
            ...
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("Function call: %s(%s, %s)" % (func.__name__, args, kwargs))
        result = func(*args, **kwargs)
        log.debug("Returned: %s" % (result,))
        return result

    return wrapper
def dataframe_op(func):
    """This decorator is designed to provide ``func`` (the encapsulated function) with a restricted form \
    of ``df`` (a CERO). A *restricted* ``df`` is the original ``df`` limited to a subset of rows and/or columns. \
    Note that a restriction on ``df.columns`` will be *compact* (the mathematical property), but this is not \
    necessarily the case for a restriction on ``df.index``.
    """

    @functools.wraps(func)
    def wrapper(df: pd.DataFrame, *args,
                locs: "List[Union[tuple, str]]" = None,
                ilocs: "List[int]" = None,
                start_year: "Union[pd.datetime, int]" = None,
                end_year: "Union[pd.datetime, int]" = None,
                **kwargs):
        """
        :param df: A CERO, which may or may not be a strict superset of the data to perform the operation on.
        :param args: Passed to the encapsulated function as positional arguments, immediately after the \
        restricted ``df``.
        :param locs: ``locs``, if provided, must be a list of identifiers that correspond to values of \
        ``df.index``. It is ``df``, reduced to these specific indices, that the wrapped function receives as \
        an argument. An error is raised if both ``locs`` and ``ilocs`` are specified.
        :param ilocs: Identical in nature to ``locs``, though instead a list of integers (zero-indexed) is \
        provided (corresponding to the row numbers of ``df``). An error is raised if both ``locs`` and \
        ``ilocs`` are specified.
        :param start_year: Note that ``df`` is a CERO, and CEROs have a ``pandas.DatetimeIndex`` on columns. \
        ``start_year`` restricts the CERO to years after and including ``start_year``.
        :param end_year: Note that ``df`` is a CERO, and CEROs have a ``pandas.DatetimeIndex`` on columns. \
        ``end_year`` restricts the CERO to years up to and including ``end_year``.
        :param kwargs: Keyword arguments to be passed to the encapsulated function.
        :return: The return value of the encapsulated function.
        """

        if not isinstance(df, pd.DataFrame):
            raise TypeError("First function argument must be of pandas.DataFrame type.")

        # Convert integer to datetime type
        if isinstance(start_year, int):
            start_year = pd.datetime(start_year, 1, 1)
        if isinstance(end_year, int):
            end_year = pd.datetime(end_year, 1, 1)

        # Get index locations
        if start_year is not None:
            start_year = df.columns.get_loc(start_year)
        if end_year is not None:
            end_year = df.columns.get_loc(end_year)

        if (locs is not None) and (ilocs is not None):
            raise TypeError("Only one of 'locs' or 'ilocs' can be provided (not both).")

        if locs is not None:
            ilocs = [df.index.get_loc(loc) for loc in locs]
        if ilocs is None:
            ilocs = pd.IndexSlice[0:]

        df_cp = df.iloc[ilocs, start_year:end_year].copy(deep=False)  # df_cp is always a different object to df

        ret = func(df_cp, *args, **kwargs)

        if ret is None:
            return ret
        elif issubclass(type(ret), pd.Series):
            # If series, convert to dataframe
            ret = pd.DataFrame(data=[ret])

        CERO.is_cero(ret)  # Performs checks to ensure ret is a valid CERO

        return ret

    return wrapper
def _rename(df,
            old_names: "Union[List[Union[tuple, str]], tuple, str]",
            new_names: "Union[List[Union[tuple, str]], tuple, str]",
            *args, **kwargs):
    """If lists are provided for ``old_names`` and ``new_names``, they must be of equal length. This method is an \
    obtuse way to rename in comparison to the `rename` method, but pandas has a bug that this method is designed to \
    work around... (GitHub pandas issue #19497)"""

    if isinstance(old_names, (str, tuple)):
        old_names = [old_names]
        new_names = [new_names]

    old_index_name = df.index.name
    labels = df.index.tolist()

    for old_name, new_name in zip(old_names, new_names):
        labels[df.index.get_loc(old_name)] = _Identifier.tupleize_name(new_name)
        # The line below *should* work when the bug is fixed (obviously untested)
        # df.rename({old_name: new_name}, axis="index", inplace=True)

    df.index = pd.Index(labels, tupleize_cols=False, name=old_index_name)

    return df
def series_op(func):
    """This decorator provides ``func`` (the encapsulated function) with the first ``pandas.Series`` \
    in a ``pandas.DataFrame`` (i.e. the first row in ``df``). Note that this wrapper is encapsulated within \
    the ``dataframe_op`` wrapper."""

    @dataframe_op
    @functools.wraps(func)
    def wrapper(df: pd.DataFrame, *args, **kwargs):
        """
        :param df: A dataframe with a single row. ``df`` must be of CERO type.
        :param args: Passed to the encapsulated function as positional arguments immediately after the \
        ``pandas.Series`` object.
        :param kwargs: Passed to the encapsulated function as keyword arguments.
        :return: Returns ``df`` with the first data series updated with the result of the encapsulated function.
        """

        for idx, ser in df.iterrows():
            # Note that pandas slicing is inclusive (in contrast to standard python list slicing)...
            valid_ser = ser[ser.first_valid_index(): ser.last_valid_index()]
            result = func(valid_ser, *args, **kwargs)

            if not isinstance(result, pd.Series):
                raise TypeError("A 'series_op' must return a pandas.Series object.")

            df.loc[idx, ser.first_valid_index(): ser.last_valid_index()] = result

        return df

    return wrapper
def recursive_op(func):
    """Applies the encapsulated function (``func``) iteratively to the elements of \
    ``array`` from left to right, with ``init`` prepended to ``array`` \
    and ``post`` postpended."""

    @series_op
    @functools.wraps(func)
    def wrapper(array: pd.Series, *args,
                init: list = None,
                post: list = None,
                inplace: bool = True,
                auto_init: int = None,
                auto_post: int = None,
                init_cols: list = None,
                post_cols: list = None,
                init_icols: list = None,
                post_icols: list = None,
                **kwargs) -> pd.Series:
        """
        :param pandas.Series array: A ``pandas`` series to which the encapsulated recursive function is applied.
        :param list init: ``init`` is pre-pended to ``array`` before the recursive operation is applied.
        :param list post: ``post`` is post-pended to ``array`` before the recursive operation is applied.
        :param int auto_init: Automatically prepend the first value in ``array`` an ``auto_init`` number of \
        times (and therefore use that as the initial conditions).
        :param int auto_post: Automatically postpend the last value in ``array`` an ``auto_post`` number of \
        times (and therefore use that as the post conditions).
        :param 'Union[int, List[int]]' init_cols: Specifies the year(s) to use as initialisation values.
        :param 'Union[int, List[int]]' post_cols: Specifies the year(s) to use as post-pended values.
        :param 'Union[int, List[int]]' init_icols: Specifies the indices (zero-indexed) to use as initialisation values.
        :param 'Union[int, List[int]]' post_icols: Specifies the indices (zero-indexed) to use as post-pended values.
        :param bool inplace: If ``True`` (the default), the operation will be applied on the array inplace, \
        such that the result from a previous iteration is used in subsequent iterations. If ``False``, the \
        operation proceeds ignorant of the results of previous iterations.
        :return pandas.Series: Returns the result of the recursively-applied function. Will copy the ``name`` \
        and ``index`` properties of the provided ``pandas.Series`` object to the returned object.
        """

        if [init is not None, auto_init is not None,
            init_cols is not None, init_icols is not None].count(True) >= 2:
            msg = "Only one of the keyword arguments 'init', 'auto_init', 'init_cols' and 'init_icols' can be provided."
            log.error(msg)
            raise ValueError(msg)

        if [post is not None, auto_post is not None,
            post_cols is not None, post_icols is not None].count(True) >= 2:
            msg = "Only one of the keyword arguments 'post', 'auto_post', 'post_cols' and 'post_icols' can be provided."
            log.error(msg)
            raise ValueError(msg)

        if not init:
            init = []
        if not post:
            post = []
        if not auto_init:
            auto_init = 0
        if not auto_post:
            auto_post = 0
        if not init_cols:
            init_cols = []
        if not post_cols:
            post_cols = []
        if init_icols is None:
            init_icols = []
        if post_icols is None:
            post_icols = []

        if not isinstance(auto_init, int) or auto_init < 0:
            msg = "'auto_init' keyword argument must be provided as a non-negative integer."
            log.error(msg)
            raise TypeError(msg)
        if not isinstance(auto_post, int) or auto_post < 0:
            msg = "'auto_post' keyword argument must be provided as a non-negative integer."
            log.error(msg)
            raise TypeError(msg)

        if auto_init:
            init = [array[0]] * auto_init
        if auto_post:
            post = [array[-1]] * auto_post

        dis_start = len(init)
        dis_end = len(post)

        sl_start = None
        sl_end = None

        if init_icols != []:
            if issubclass(type(init_icols), int):
                init_icols = [init_icols]
            init_cols = [dt.year for dt in array.index[init_icols].tolist()]
        if post_icols != []:
            if issubclass(type(post_icols), int):
                post_icols = [post_icols]
            post_cols = [dt.year for dt in array.index[post_icols].tolist()]

        if init_cols:
            if isinstance(init_cols, int):
                init_cols = [init_cols]
            try:
                init = array.loc[pd.to_datetime(init_cols, format="%Y")].tolist()
            except KeyError:
                msg = "Selected years for 'init_cols' (%s) are outside of the range of available data." % init_cols
                log.error(msg)
                raise KeyError(msg)
            sl_start = len(init)

        if post_cols:
            if isinstance(post_cols, int):
                post_cols = [post_cols]
            try:
                post = array.loc[pd.to_datetime(post_cols, format="%Y")].tolist()
            except KeyError:
                msg = "Selected years for 'post_cols' (%s) are outside of the range of available data." % post_cols
                log.error(msg)
                raise KeyError(msg)
            sl_end = -len(post)

        sl = slice(sl_start, sl_end)

        array_list = init + array.values.tolist()[sl] + post

        no_args = len(init) + len(post) + 1
        no_ops = len(array_list) - no_args + 1
        new_array = init + [None] * no_ops + post

        for i in range(no_ops):
            rec_args = array_list[i: i + no_args] + list(args)
            try:
                tmp = func(*rec_args, **kwargs)
                # tmp = func(*array_list[i - dis_start:i + dis_end + 1],  # This form can only be used for Python 3.5 onwards...
                #            *args, **kwargs)
            except TypeError as e:
                msg = e.__str__() + ". A likely cause is that initial conditions (or columns) have not been specified."
                log.error(msg)
                raise TypeError(msg)

            if tmp is None:
                raise ValueError("'recursive_op' functions must return a floating point value.")

            if inplace:
                array_list[i + len(init)] = tmp
            else:
                new_array[i + len(init)] = tmp

        if inplace:
            new_array = array_list

        new_array = new_array[dis_start: len(new_array) - dis_end]

        # Copy name and index to the new series
        new_array = pd.Series(data=new_array, index=array.index, name=array.name)

        return new_array

    return wrapper