Source code for to_cero

#      ConCERO - a program to automate data format conversion and the execution of economic modelling software.
#      Copyright (C) 2018  CSIRO Energy Business Unit
#
#     This program is free software: you can redistribute it and/or modify
#     it under the terms of the GNU General Public License as published by
#     the Free Software Foundation, either version 3 of the License, or
#     (at your option) any later version.
#
#     This program is distributed in the hope that it will be useful,
#     but WITHOUT ANY WARRANTY; without even the implied warranty of
#     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#     GNU General Public License for more details.
#
#     You should have received a copy of the GNU General Public License
#     along with this program.  If not, see <https://www.gnu.org/licenses/>.

"""
.. to_cero:

The ToCERO class provides methods for converting data files **to** the CERO format.

Critical to the successful use of this class is a configuration file in YAML format. \
Do not be intimidated by the acronym - the YAML format is very simple and human readable. Typically, \
study [1]_ of the YAML format should be unnecessary - copying a working configuration file and then \
altering it for the desired purpose should satisfy most users (the ``tests/data`` subdirectory provides many examples). This documentation will show you how to build a YAML \
configuration file for use with the ``ToCERO`` class in a gradual, example-by-example process. A technical reference to the ``ToCERO`` class \
follows.

Building a YAML file from scratch to convert *TO* the CERO format
-----------------------------------------------------------------

The configuration file can differ significantly depending on the type of file from which data is \
imported, but one aspect that all configuration files **must** have in common is the ``files`` field. As the \
name suggests, ``files`` specifies the input files that are sources of data for the conversion process. \
It therefore follows that a minimal (albeit useless) YAML configuration file will look like this:

    ``files:``

That is, a single line that doesn't specify anything. This simple file is interpreted as a `dict` with the key ``"files"`` and a corresponding value of `None` - the ``:`` identifies the key-value nature of the data. That is:

    ``{"files": None}``

This top-level dictionary object is referred to as a *ToCERO* object. The obvious next step is to specify some input files to convert. \
This is done by adding **indented** [2]_ subsequent lines with a hyphen, **followed by a space**, followed by the relevant data. For example:

.. code-block:: yaml

    files:
        - <File Object A>
        - <File Object B>
        - <File Object C>
        - etc.

The hyphens (followed by a space) on subsequent lines identify separate items that, collectively, are interpreted as a python `list`. The indentation identifies this list as the value for the key on the line above. Basically, the previous example is interpreted as the python object:

.. code-block:: python

    {"files": [<Python interpretation of File Object A>,
                      <Python interpretation of File Object B>,
                      <Python interpretation of File Object C>,
                      <etc.>]}

Note that each item of the ``"files"`` list can be either a `str` or a `dict`. If a `str`, the string must refer to a YAML file containing a `dict` defining a *file* object. If a `dict`, then that dict must be a file object. A file object is a dictionary with one mandatory key-value pair - that is, (in YAML form):

.. code-block:: yaml

    file: name_of_file

Where ``name_of_file`` is a file path *relative to the configuration file*. The option ``search_paths: List[str]``, provided as an option to the file object (or the encompassing ToCERO object), overrides this behaviour (paths earlier in the list are searched before paths later in the list).
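
As a sketch (the file and directory names here are hypothetical), a ``files`` entry can therefore be given inline as a `dict`, or as a `str` referring to a separate YAML file, with ``search_paths`` directing where the data file is looked for:

.. code-block:: yaml

    files:
        - file: a_file.csv
          search_paths:
              - data/inputs
              - data/archive
        - b_file_object.yaml

Here ``a_file.csv`` is searched for first in ``data/inputs`` and then in ``data/archive``, and ``b_file_object.yaml`` is expected to contain a `dict` defining a file object.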

Without further specification, if the file *type* is comma-separated-values (``CSV``) *and* the data is of the default format, ConCERO can import the entire file. The 'default format' is discussed in :ref:`import_guidelines`. ConCERO determines the file type:

    1. by the key-value pair ``type: <ext>`` in the *file object*, and if not provided then
    2. by the key-value pair ``type: <ext>`` in the *ToCERO object*, and if not provided then
    3. by determining the extension of the value of ``file`` in the *file object*, and if not determined then
    4. an error is raised.

Providing the ``type`` option allows the user to potentially extend the program to import files that the program author was not aware existed, provided the file is of a similar format to one of the known and supported formats. For example, if the program author had not been aware that ``shk`` files existed (and thus did not provide support for them), ``shk`` files could still be imported by specifying ``type: har`` (given their similarity to ``har`` files). As it happens, ``shk`` files *are* supported, so this is not necessary. Naturally, whether the import succeeds will depend on whether the underlying library allows importing that file type.
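
To illustrate (with a purely hypothetical file name and extension), a file whose contents happened to match the ``har`` layout but carried an unrecognised extension could be imported as follows (``head_arrs``, explained under `HAR files`_, lists the header arrays to import):

.. code-block:: yaml

    files:
        - file: a_file.dat
          type: har
          head_arrs:
              - HEA1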

With respect to step 2 (of determining the file type), it can be said that the file object *inherits* from the ToCERO object. Many key-value pairs can be inherited from the ToCERO object, which avoids duplicating information when some properties apply to all the input files. Given that every key-value pair has some effect on configuration, the term *option* is used hereafter to refer to a key-value pair. An example of a YAML file including all points discussed so far is:

.. code-block:: yaml

    files:
        - file: a_file.csv
        - file: b_file
          type: csv

In the example above, ``a_file.csv`` and ``b_file`` would be successfully imported (assuming they are both of default format). \
The file extension can be discerned with respect to ``a_file.csv``, and \
``b_file`` has the corresponding ``type`` specified. Note that the ``type`` option for ``b_file`` is indented at \
the same level as the ``file`` option, *not* at the level of the list.

A minimal configuration form that demonstrates inheritance (and assuming ``c_file`` is of default ``csv`` type) is:

.. code-block:: yaml

    type: 'csv'
    files:
        - file: a_file.csv
        - file: b_file
        - file: c_file

Note that, alternatively, the file name of ``c_file`` could be changed to include a file extension. \
An important point is that the inheritance of ``type`` does not \
mean you - the user - can lazily drop the file extensions. The file extension is part of the file name, and so it \
must be provided, if it exists, to find the correct file.

In most cases, more specification in the file object is necessary to import data. \
The necessary and additional options in the file object depend on the type of the file - whether it be \
`CSV files`_, `Excel files`_, `HAR files`_, `VD files`_ or `GDX files`_. That is, the supported types are \
``ToCERO.supported_file_types`` - a set of:

    * ``"csv"``
    * ``"xlsx"``
    * ``"har"``
    * ``"shk"``
..    * ``"gdx"``

.. _CSV files:

File Objects - CSV files
--------------------------

CSV files can be considered the simplest case with respect to data import. 'Under the hood' ConCERO uses the \
``pandas.read_csv()`` method to implement data import from CSVs (documentation for which can be found \
`here <http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.read_csv.html>`_ ). Any option \
available for the ``pandas.read_csv()`` method is also available to ConCERO by including that option \
in the file object.

There are also a few additional options that provide ConCERO-specific functionality (a combined example is given after \
the option descriptions below). These options are:

**series: (list)**

   the list specifies the series in the index that are \
   relevant, thereby providing a way to select data for inclusion in the CERO. Each item in the list is referred to \
   as a *series object*, which is a dictionary with the following options:

        **name: (str)**

           ``name`` identifies the elements of the index that will be converted into a CERO. ``name`` is a mandatory option.

        **rename: (str)**

   If provided, ``name`` is changed to the value given by ``rename`` after import into the CERO.

A series object can be provided as a string - this is equivalent to the series object ``{'name': series_name}``.

**orientation: (str)**

   ``'rows'`` by default. If the data is in columns with respect to time, change this option to ``'cols'`` (which effectively applies a transposition operation).

**skip_cols: (str|list)**

    A column name, or a list of column names to ignore.

And other ``pandas.read_csv()`` options that are regularly used include:

**usecols: (list)**

   From the pandas documentation - Return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid array-like usecols parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``. Note that ``usecols`` will take precedence over ``skip_cols``, and that the argument format for ``usecols`` for a ``csv`` file differs slightly from that for an ``xlsx`` file.

**index_col: (int|list)**

   The column or list of columns (zero-indexed) in which the identifiers reside or, if ``orientation=="cols"``, the column with the date index.

**header: (int|list)**

   The row or list of rows (zero-indexed) in which the date index resides or, if ``orientation=="cols"``, the rows with the data identifiers.

**nrows: (int)**

   Number of rows of the file to read. May be useful with very large ``csv`` files that have a lot of irrelevant data.
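
As a sketch that pulls the preceding options together (the file, column and series names are hypothetical):

.. code-block:: yaml

    files:
        - file: a_file.csv
          index_col: 0
          header: 0
          nrows: 20
          skip_cols: Notes
          series:
              - name: GDP
                rename: gross_domestic_product
              - Employment

In this sketch, only the first 20 rows of the file would be read, the column named ``Notes`` would be ignored, and only the series ``GDP`` and ``Employment`` would be retained (with ``GDP`` renamed to ``gross_domestic_product`` on import).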

For further documentation, please consult the \
`pandas documentation <http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.read_csv.html>`_.

.. _Excel files:

File Objects - Excel files
--------------------------

The process for importing Excel files is very similar to that for csv files. Underneath, the ``pandas.read_excel()`` \
method is used, with virtually identical options and identical meanings. Consequently, not all the standard options \
will be mentioned here - just the differences relative to those for ``csv`` files. For a complete list of available \
options, please consult `the pandas documentation <https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.read_excel.html>`_.

**sheet: (str)** or **sheet_name: (str)**

   The name of the sheet in the workbook to be imported.

**usecols: (list[int]|str)**

   Similar to the ``csv`` form of the option, ``usecols`` accepts a list of zero-indexed integers to identify the \
columns to be imported. **Unlike the csv option**, ``usecols`` will **not** accept a ``list`` of ``str``, but will accept \
a single ``str`` with an excel-like specification of columns. For example, ``usecols: A,C,E:H`` specifies the \
import of columns ``A``, ``C`` and all columns between ``E`` and ``H`` *inclusive*.
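
A sketch of an Excel file object (the file and sheet names are hypothetical):

.. code-block:: yaml

    files:
        - file: a_workbook.xlsx
          sheet: Results
          usecols: A,C,E:H
          index_col: 0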

.. _HAR files:

File Objects - HAR (or SHK) files
-----------------------------------

Throughout this section, ``shk`` files can be considered equivalent to ``har`` files, so for brevity \
``shk`` files are not referred to explicitly.

``har`` files contain one or more *header arrays*, each of which is an array of one or more dimensions \
(to a maximum of 7). Each dimension of each array has an associated *set*. Note that the terminology *set* can \
be considered misleading because, unlike the mathematical concept of a set, HAR sets *have an order*. \
The order of the set corresponds to the placement of items within the array.

To specify the import of a har file, only one option in the file object is necessary - that is, ``head_arrs`` with \
an associated list of strings specifying the names of header arrays to import from the file. Therefore, an example configuration file that specifies the import of a ``har`` file could \
look like:

.. code-block:: yaml

    files:
      - file: har_file.har
        head_arrs:
          - HEA1
          - HEA2

With the example configuration, header arrays ``HEA1`` and ``HEA2`` would be imported from file ``har_file.har``. *Note* \
that it is a restriction of the ``har`` format itself that header names cannot be longer than 4 characters.

In the example above, each header array name is interpreted as a string. The more general format for a header definition is \
a ``dict``, referred to as ``header_dict``. Each ``header_dict`` *must* have the option:

    * ``name: header_name``, where ``header_name`` is the name of the header.

``header_dict`` *must* also have the following option *if one of the dimensions of the array is to be interpreted as a time \
dimension*:

    * ``time_dim: (str)``, where the string is the name of the set indexing the time-dimension (note that the \
    format/data-type of the time dimension is irrelevant).

If the data has no time dimension (which *definitely should be avoided*) and therefore ``time_dim`` is not specified, \
then ``default_year`` **must** be provided (or inherited from the file object) - otherwise a ValueError will be \
thrown.
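
For example (the header and set names are hypothetical), one header can be given as a plain string while another is given in ``header_dict`` form, declaring which set indexes time:

.. code-block:: yaml

    files:
      - file: har_file.har
        head_arrs:
          - HEA1
          - name: HEA2
            time_dim: YEARS

Here ``HEA1`` relies on any ``time_dim`` or ``default_year`` given at the file level, while ``HEA2`` declares that the set named ``YEARS`` is its time dimension.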

Note that it may also be necessary to include some of the file-independent options if the time-dimension has a format \
that deviates from the default. Please see `File independent options`_ for more information.

.. _VD files:

File Objects - VD files
-----------------------

The author of the import connector is not familiar with the full diversity of VEDA data files (if indeed others exist). Consequently, the VEDA data file importer has been written with several assumptions. Specifically:

    #. Lines starting with an asterisk (``*``) are comments.
    #. The number of data columns remains constant throughout a single file.

If these assumptions are incorrect, please raise an issue on GitHub.

To specify the import of a vd file, it is mandatory to specify:

    * ``date_col: (int)``, where ``date_col`` is the zero-indexed number of the column containing the date.
    * ``val_col: (int)``, where ``val_col`` is the zero-indexed number of the column containing the values.

And optional to specify:

    * ``default_year: (int)`` - If left unspecified, all records with an invalid date in ``date_col`` are dropped. If specified (as a year), the value of ``date_col`` in all records with an invalid date are changed to ``default_year``.

Example:

.. code-block:: yaml

    files:
      - file: a_file.vd
        date_col: 3
        val_col: 8
        default_year: 2018

Note that it may also be necessary to include some of the file-independent options if the time-dimension has a format \
that deviates from the default. Please see `File independent options`_ for more information.

.. _GDX files:

File Objects - GDX files
------------------------

GDX files can be imported by providing the option:
    * ``symbols: list(dict)`` - where each `list` item is a `dict` (referred to as a "symbol dict").

Each symbol dict must have the options:

    * ``name: (str)`` - where ``name`` is the name of the symbol to load.
    * ``date_col: (int)`` - where ``date_col`` specifies the (zero-indexed) column that includes the date reference.
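
A sketch of a GDX file object (the file and symbol names are hypothetical):

.. code-block:: yaml

    files:
      - file: results.gdx
        symbols:
          - name: EMISSIONS
            date_col: 1
          - name: OUTPUT
            date_col: 2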

.. _File independent options:

File Independent Options
-------------------------

The options in this section are relevant to all input files, regardless of their type. They are:

**time_regex: (str)**

**time_fmt: (str)**

**default_year: (int)**

A fundamental principle ConCERO relies upon is that all data has some reference to time (noting that all data to date has been observed to reference the year only). The time-index data will typically be in a string format, and the year is interpreted by \
searching through the string, using the regular expression ``time_regex``. The default - ``'.*(\d{4})$'`` - will \
attempt to interpret the last four characters of the string as the year. Importantly, the match returns the \
year as the 1st 'group' (regular expression lingo). It is this first group that ``time_fmt`` is applied to in order to \
convert the string to a datetime object. The default - ``'%Y'`` - assumes that the string contains 4 digits \
corresponding to the year (and only that).

In the event that the date-time data isn't stored in the file itself, a ``default_year`` option (a single integer corresponding to the year - e.g. ``2017``) **must** be provided.
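
For example (with hypothetical file and header names, and as discussed for ``har`` files above), a header array without a time dimension could rely on a ``default_year`` given at the file level:

.. code-block:: yaml

    files:
      - file: har_file.har
        default_year: 2017
        head_arrs:
          - HEA1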

What follows is an example, using the defaults of ``time_regex`` and ``time_fmt``, to \
demonstrate how this works...

Let's assume the time index series is given, in CSV form, by:

    .. code-block:: text

        bs1b-2017,bs1b-br1r-pl1p-2018,bs1b-br1r-pl1p-2019,...

which is typically seen with VURM-related data. The last four digits are obviously the year, so the default \
setting is appropriate. The regex essentially simplifies the data to a list of strings:

    ``['2017', '2018', '2019', etc...]``

However, ConCERO needs to convert these strings to ``pandas.datetime`` format. This is done with \
``pandas.to_datetime()``, which matches the strings against a \
`strptime()-style pattern <https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior>`_. The default - ``'%Y'`` - \
will interpret the strings as four digits corresponding to the year - an obviously satisfactory result. Hence, the \
following options are appropriate to include in the YAML configuration file.

    .. code-block:: yaml

        time_regex: '.*(\d{4})$'
        time_fmt: '%Y'

*Note*: if the default settings (as per the example immediately above) are appropriate, specifying them is **not** necessary.
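
As a purely hypothetical illustration of when non-default settings would be needed (the label format here is invented for the example): if the time labels ended in a two-digit year, such as ``bs1b-y17``, options along the following lines might be appropriate:

.. code-block:: yaml

    time_regex: '.*y(\d{2})$'
    time_fmt: '%y'

Here the regular expression captures the two digits following the final ``y`` as the first group, and ``'%y'`` instructs pandas to interpret those digits as a two-digit year.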

.. [1] For a more thorough yet simple introduction to YAML files, `<http://docs.ansible.com/ansible/latest/YAMLSyntax.html>`_\
 is recommended.
.. [2] *'Indented'* refers to indentation with spaces (the YAML specification does not permit tabs in indentation); any \
number of spaces may be used, but it is critical that the indentation pattern *remains consistent* (which is a \
requirement in common with python).

ToCERO Technical Specification
------------------------------

.. autoclass:: ToCERO
    :members:

Created on Fri Jan 19 11:49:23 2018

.. sectionauthor:: Lyle Collins <Lyle.Collins@csiro.au>
.. codeauthor:: Lyle Collins <Lyle.Collins@csiro.au>
"""

import concero.conf
if getattr(concero.conf, "gdxpds_installed", False):
    import gdxpds #: Warning given if not imported before pandas
if getattr(concero.conf, "harpy_installed", False):
    import harpy

import re
import os
import itertools as it
from collections import OrderedDict
import builtins
import getpass

import pandas as pd
import xlrd

from concero.format_convert_tools import read_yaml
from concero._identifier import _Identifier
from concero.cero import CERO


[docs]class ToCERO(dict): _logger = concero.conf.setup_logger(__name__) class _FileObj(dict): supported_file_types = ["har", "csv", "xlsx", "xls", "shk", "gdx", "vd"] def __init__(self, *args, parent: dict=None, **kwargs): """ :param args: Passed to the superclass (`dict`) at initialisation. :param dict parent: Inherits at initialisation from parent. :param kwargs: Passed to the superclass (`dict`) at initialisation. """ conf = ToCERO._FileObj.load_config(*args, parent=parent, **kwargs) super().__init__(conf) @staticmethod def load_config(conf: dict, *args, parent: dict=None, **kwargs): """ :param 'Union[dict,str]' conf: The configuration dict, or if a `str`, the path (relative to the current working directory) of a YAML-format file containing the configuration dict. :return dict: """ # Defaults _conf = {"header":0, "index_col":0, "time_regex":r".*(\d{4})$", "time_fmt": r"%Y", "search_paths": [], "overwrite": False} if parent is None: parent = {} _conf.update(parent) if isinstance(conf, str): sp = _conf.get("search_paths") if not sp: sp = [os.path.abspath(".")] conf = ToCERO._FileObj._find_file(conf, sp) conf = read_yaml(conf) _conf.update(dict(conf, *args, **kwargs)) # search_paths initialisation, if not inherited if isinstance(_conf["search_paths"], str): _conf["search_paths"] = [os.path.abspath(_conf["search_paths"])] if _conf["search_paths"] == []: _conf["search_paths"].append(os.path.abspath(".")) # Identify file type by extension if not given - the type determines which import function to use _conf["type"] = _conf.get("type", os.path.splitext(_conf["file"])[1][1:]).lower() # Series limits the data import to only those data series specified if _conf.get("series"): for idx in range(len(_conf["series"])): if not isinstance(_conf["series"][idx], dict): # Attempt to convert to dict... _conf["series"][idx] = {"name": _conf["series"][idx]} _conf["series"][idx]["name"] = _Identifier.tupleize_name(_conf["series"][idx]["name"]) if _conf["series"][idx].get("rename"): _conf["series"][idx]["rename"] = _Identifier.tupleize_name(_conf["series"][idx]["rename"]) return _conf @staticmethod def is_valid(conf, raise_exception=True): if not conf.get("file"): msg = "Key-value pair \"file: FILE_NAME\" must be provided for all " + "\'files\'." ToCERO._logger.error(msg) if raise_exception: raise TypeError(msg) print(msg) return False return True @staticmethod def run_checks(conf, raise_exception=True): file = ToCERO._FileObj._find_file(conf["file"], conf["search_paths"], raise_exception=raise_exception) if not file: return False if not ToCERO._FileObj._check_permissions(file, raise_exception=raise_exception): return False return True @staticmethod def check_config(conf, *args, raise_exception=True, runtime=False, parent=None, **kwargs): conf = ToCERO._FileObj.load_config(conf, *args, parent=parent, **kwargs) if runtime: return ToCERO._FileObj.run_checks(conf, raise_exception=raise_exception) return ToCERO._FileObj.is_valid(conf, raise_exception=raise_exception) def import_file_as_cero(self): """ Executes the import process. :return pandas.DataFrame: The CERO. """ self["file"] = ToCERO._FileObj._find_file(self["file"], self["search_paths"]) try: df = self._import_file() # _import_file documents the state of df. except xlrd.biffh.XLRDError as e: msg = e.__str__() + " Failed to import file '%s' - invalid sheet name." % self["file"] ToCERO._logger.error(msg) raise ImportError(msg) except Exception as e: msg = e.__str__() + " Failed to import file '%s'." 
% self["file"] ToCERO._logger.error(msg) raise e.__class__(msg) # Throw away unnecessary rows if self.get("series"): df = ToCERO._FileObj._filter_series(df, [series["name"] for series in self["series"]]) assert isinstance(df, pd.DataFrame) assert isinstance(df.index, pd.Index) for series in self["series"]: # Rename rows if specified if "rename" in series: ds = df.loc[series["name"],] ds.name = series["rename"] df = df.append(ds) # Note that this will move data series to the end of dataframe. df.drop(labels=series["name"], inplace=True) assert (series[ "rename"] in df.index.tolist()) # Check new name has been properly added to the series assert isinstance(df, pd.DataFrame) # Find year in strings if isinstance(df.columns, pd.Int64Index): # Assumption: If column names are interpreted as integers, the integers must be years ts = pd.to_datetime(df.columns, format="%Y") else: try: ts = pd.Series([re.match(self["time_regex"], x).group(1) for x in df.columns.tolist()]) except AttributeError as e: msg = ("Error attempting to perform string matching on datetime values for file '%s'. A " +\ "likely cause is too few datetimes for the size of the data array.") % self["file"] ToCERO._logger.error(msg) raise e.__class__(msg) ts = pd.to_datetime(ts, format=self["time_fmt"]) # Interpret as datetime df.columns = ts if "prepend" in self: new_values = [_Identifier.prepend_identifier(self["prepend"], name) for name in df.index.tolist()] df.index = CERO.create_cero_index(new_values) CERO.is_cero(df) # Will raise exception if invalid CERO return df @staticmethod def _find_file(file, search_paths: list, raise_exception=True): """ Locates first occurance of ``file`` on ``search_paths`` and returns relative OS-specific path. :return str: """ orig_filename = file file = os.path.relpath(os.path.normpath(file)) for sp in search_paths: test_path = os.path.join(sp, file) msg = "ToCERO.find_file(): testing path: %s" % test_path ToCERO._logger.debug(msg) if os.path.isfile(test_path): return os.path.abspath(test_path) # return test_path else: msg = "File '%s' not found on any of the paths %s." % (orig_filename, search_paths) ToCERO._logger.error(msg) if raise_exception: raise FileNotFoundError(msg) print(msg) return False @staticmethod def _check_permissions(file, raise_exception=True): try: fp = open(file, 'r') fp.close() except PermissionError: msg = "Current user - '%s' - does not have permissions to read file '%s'." % (getpass.getuser(), file) ToCERO._logger.error(msg) if raise_exception: raise PermissionError(msg) print(msg) return False return True def _import_file(self) -> pd.DataFrame: """ Executes appropriate function depending on file type, returning a ``pandas.DataFrame`` that is not necessarily of CERO type. :return: ``pandas.DataFrame``. ``df`` (the returned dataframe) must be of have identifiers as the index and the time-based data in columns. The index should be of CERO type, which can be ensured by using the ``CERO.create_CERO_index()`` method. """ if self.get("type") not in ToCERO._FileObj.supported_file_types: raise TypeError(("\'type\' for input file %s is either: (a), not provided/inherited/deduced; " + "or (b), not supported. 
Supported types are %s.") % (self["file"], ToCERO._FileObj.supported_file_types)) elif self["type"] in ['xlsx', 'xls', 'csv']: df = self._import_csv_or_xlsx() elif self["type"] == 'gdx': df = self._import_gdx() elif self["type"] in ['har', 'shk']: df = self._import_har() elif self["type"] in ['vd']: df = self._import_vd() # If df does not fit these requirements, must be error in import... assert isinstance(df, pd.DataFrame) assert isinstance(df.index, pd.Index) try: df = df.astype(pd.np.float32, copy=False) except ValueError as e: raise e.__class__("Invalid dataframe type detected. One possible reason is invalid columns - " + "for example, columns that do not refer to a year when other columns do.") return df def _import_csv_or_xlsx(self) -> pd.DataFrame: # CSV/XLSX specific operations pd_opts = self.copy() file = pd_opts.pop("file") # This is messy, but there's not a neat way to select only pandas-relevant options read_csv_kwargs = set(["filepath_or_buffer", "sep", "delimiter", "header", "names", "index_col", "usecols", "squeeze", "prefix", "mangle_dupe_cols", "dtype", "engine", "converters", "true_values", "false_values", "skipinitialspace", "skiprows", "nrows", "na_values", "keep_default_na", "na_filter", "verbose", "skip_blank_lines", "parse_dates", "infer_datetime_format", "keep_date_col", "date_parser", "dayfirst", "iterator", "chunksize", "compression", "thousands", "decimal", "lineterminator", "quotechar", "quoting", "escapechar", "comment", "encoding", "dialect", "tupleize_cols", "error_bad_lines", "warn_bad_lines", "skipfooter", "skip_footer", "doublequote", "delim_whitespace", "as_recarray", "compact_ints", "use_unsigned", "low_memory", "buffer_lines", "memory_map", "float_precision"]) read_excel_kwargs = set(["io", "sheet_name", "header", "skiprows", "skip_footer", "index_col", "names", "usecols", "parse_dates", "date_parser", "na_values", "thousands", "convert_float", "converters", "dtype", "true_values", "false_values", "engine", "squeeze"]) if self["type"] == "csv": pandas_kwargs = read_csv_kwargs elif self["type"] in ["xlsx", "xls"]: pandas_kwargs = read_excel_kwargs for opt in list(pd_opts.keys()): if opt not in pandas_kwargs: pd_opts.pop(opt, None) # Get rid of pandas-irrelevant options. Some of the options ^^ may \ # refer to other file types. pd_op = "read_csv" # Changed later to read_excel if necessary if "usecols" not in self: if self.get("skip_cols") is not None: if self["type"] in ["csv"]: # Need this parameter to skip irrelevant columns of data if isinstance(self["skip_cols"], str): self["skip_cols"] = [self["skip_cols"]] sk_col_list = self["skip_cols"] pd_opts["usecols"] = lambda x: x not in sk_col_list else: raise TypeError("'skip_cols' is a valid option for files of 'csv' type only.") if self["type"] in ['xlsx', 'xls']: # Check excel-specific requirements if self.get("sheet", None) is None: raise TypeError( "Key-value pair \"sheet: SHEET\" must be specified for file \'%s\'." 
% self["file"]) else: pd_opts["sheet_name"] = self["sheet"] pd_op = "read_excel" if self.get("header") and not self.get("skiprows"): self["skiprows"] = self["header"] self["header"] = 0 if isinstance(pd_opts.get("converters"), dict): for k, v in pd_opts["converters"].items(): if isinstance(v, str): pd_opts["converters"][k] = getattr(builtins, v) # for k in ["header", "index_col", "nrows", "usecols"]: # # Grab pandas-relevant options # pd_opts[k] = self[k] pd_op = getattr(pd, pd_op) # Identify the correct pandas read OPeration try: df = pd_op(file, **pd_opts) # Program will fail on the line above if: # - Pandas version <0.22 (failure observed when pandas = 0.20), AND # - read_excel operation, AND # - usecols is a string which has a column range in it, where one of the indices has more than # one letter - e.g. usecols="B,C,E:AW". except ValueError as e: if re.match(r"^Passed header", e.__str__()): raise ValueError(e.__str__() + ". This is likely because a range has been specified with a '-' instead of a ':'") raise e if self["type"] in ['xlsx', "xls"] and self.get("nrows"): df = df.iloc[:self["nrows"], :] if self.get("orientation", "") == "cols": # Put into CERO orientation if necessary df = df.transpose() df.index = CERO.create_cero_index(df.index.tolist()) return df def _import_har(self) -> pd.DataFrame: # Load file using harpy try: hfo = harpy.HarFileObj.loadFromDisk(filename=self["file"]) except PermissionError as e: raise e har_headers_list = hfo.getRealHeaderArrayNames() if not self.get("head_arrs", self.get("headers")): # "headers" is for backwards compatibility self["head_arrs"] = har_headers_list # If "headers" not specified, get all headers if self.get("headers"): raise DeprecationWarning("Option 'headers' has been depracated in favour of 'head_arrs'.") header_dfs = [] for header in self.get("head_arrs", self.get("headers")): # "headers" is for backwards compatibility if isinstance(header, str): # If string, convert to dictionary header = {"name": header} try: assert isinstance(header, dict) except AssertionError: raise TypeError("Invalid header format (can be either str or dict).") # Checks valid header name try: assert (header["name"] in har_headers_list) except AssertionError: msg = "\'%s\' is an invalid header name for file \'%s\'." % (header["name"], self["file"]) raise ValueError(msg) # Inherits time_dim, default_year from self if possible header["time_dim"] = header.get("time_dim", self.get("time_dim")) header["default_year"] = header.get("default_year", self.get("default_year")) header["obj"] = hfo.getHeaderArrayObj(header["name"]) # ASSUMPTION: User wants to retrieve entire tensor # func = lambda x: header["obj"].SetElements[x] labels = OrderedDict() if header.get("time_dim"): if isinstance(header["time_dim"], str): # time_dim is positional index for idx, s in enumerate(header["obj"]["sets"]): if s["name"] == header["time_dim"]: header["time_dim"] = idx break else: raise ValueError("'time_dim' does not exist in har file.") try: assert isinstance(header["time_dim"], int) except AssertionError: raise TypeError("'time_dim' (for header %s) must be provided as an 'int' or 'str' - " + "an int if indexing the dimension in the header name list, or a str if " + "naming the time set." % header["obj"]["name"]) try: assert (header["time_dim"] < len(header["obj"]["sets"])) except AssertionError: raise TypeError("Invalid 'time_dim' (for header %s) - integer (zero-indexed) is too large " + "for the number of sets." 
% header["obj"].HeaderName) labels = [(x["name"], x["dim_desc"]) for i, x in enumerate(header["obj"]["sets"]) if i != header["time_dim"]] time_dim_labels = header["obj"]["sets"][header["time_dim"]]["dim_desc"] # Move that dimension to be the last... tspse_tup = tuple([i for i in range(len(header["obj"]["sets"])) if i != header["time_dim"]] + [ header["time_dim"]]) else: if not header.get("default_year"): raise ValueError( "The 'default_year' option must be provided for har files that do not have a specified 'time_dim' (time dimension).") # Assume we have to create time dimension... time_dim_labels = ["%d" % header.get("default_year")] # TODO: Ask Thomas what year the data references if time_dim is not specified labels = [(x["name"], x["dim_desc"]) for x in header["obj"]["sets"]] tspse_tup = tuple([i for i in range(len(header["obj"]["sets"]))]) # transpose-tuple array = header["obj"]["array"].transpose(tspse_tup) # ^^ ASSUMPTION: Sets and Elements in array have the same order as that HAR.Header.SetNames, SetElements # UPDATE: Checked with Florian that this assumption is correct. # Reshape into 2-dimensional array shape = [len(labs[1]) for labs in labels] new_dims = 1 for i in range(len(shape)): new_dims = new_dims * shape[i] new_dims = (new_dims, len(time_dim_labels)) ToCERO._logger.debug("new_dims: %s" % (new_dims,)) array = array.reshape(new_dims) # Note that reshaping is in C-order, which is itertools.product() order columns = time_dim_labels labels = list(it.product(*[x[1] for x in labels])) labels = [_Identifier.tupleize_name(name) for name in labels] if self.get("har_auto_prepend"): labels = [_Identifier.prepend_identifier(header["obj"]["coeff_name"], name) for name in labels] index = CERO.create_cero_index(labels) ToCERO._logger.debug("index: %s" % index) df = pd.DataFrame(data=array, index=index, columns=columns) header_dfs.append(df) return CERO.combine_ceros(header_dfs, overwrite=False, verify_cero=False) def _import_vd(self): # VEDA data file """ Import VEDA data files. Assumption: The number of columns in first line of data is consistent throughout file. :return: pandas.DataFrame (not of CERO type). """ self["default_year"] = self.get("default_year", None) if not issubclass(type(self["date_col"]), int): raise TypeError("'date_col' for file '%s' must be provided as an int." % self["file"]) if not issubclass(type(self["val_col"]), int): raise TypeError("'val_col' for file '%s' must be provided as an int." % self["file"]) with open(self["file"], "r") as f: data = f.readlines() data = [l.rstrip() for l in data if (l[0] != "\n" and l[0] != "*")] # Remove comments and empty lines data = [[l.rstrip("\"").lstrip("\"") for l in l.split(",")] for l in data] # Strip quotation marks def drop_data(line): try: line[self["date_col"]] = int(line[self["date_col"]]) # Attempt to convert to int except ValueError: if self["default_year"] is None: return False # False has the effect of dropping this record... 
line[self["date_col"]] = self["default_year"] # Set to the given default if provided return line data = list(map(drop_data, data)) data = list(filter(None, data)) no_cols = len(data[0]) # Assumes number of columns in first line holds for the rest index_col = [x for x in range(no_cols) if ((x != self["val_col"]) and (x != self["date_col"]))] df = pd.DataFrame(data=data) df.index = CERO.create_cero_index([[l[x] for x in index_col] for l in data]) df = df[[self["date_col"], self["val_col"]]] # Remove the now-unneeded data df = df.pivot(columns=self["date_col"]) # NOTE: Pivot can change index to non-logical order df.columns = df.columns.droplevel() return df def _import_gdx(self) -> pd.DataFrame: """ Import a gdx file. Some assumptions are made: * Year index is always the lowest-level in column hierarchy * gdxpds does not provide columns with distinct names (which is true at time of writing) :return: """ parent_dict = self.copy() parent_dict.pop("file") sym_defs = {} sym_defs.update(parent_dict) # if issubclass(type(self.get("symbols")), str): # self["symbols"] = [{"name": self["symbols"]}] if issubclass(type(self.get("symbols")), dict): self["symbols"] = [self["symbols"]] # elif self.get("symbols", []) == []: # self["symbols"] = [{"name": s.name} for s in gdxpds.list_symbols(self["file"])] # If symbols aren't specified, assume that user wants them all elif not issubclass(type(self.get("symbols", [])), list): raise TypeError("'symbols' must be provided as a dict, or a list of dicts. Each symbol must have 'name' and 'date_col' specified.") sym_tmp = [] for sym in self["symbols"]: tmp = sym_defs.copy() # if issubclass(type(sym), str): # tmp.update({"name": sym}) if issubclass(type(sym), dict): tmp.update(sym) else: raise TypeError("Symbol '%s' is of invalid type (not a dict)." % (sym)) sym_tmp.append(tmp) self["symbols"] = sym_tmp req_keys = ["name", "date_col"] for sym in self["symbols"]: try: assert issubclass(type(sym), dict) except AssertionError as e: ToCERO._logger.error("Symbol %s is not of dict type." % sym) raise e try: assert all([(k in sym) for k in req_keys]) except AssertionError as e: msg = "Symbol %s does not have all of %s specified." % (sym, req_keys) ToCERO._logger.error(msg) raise TypeError(msg) dfs_dict = OrderedDict([(sym["name"], gdxpds.to_dataframe(self["file"], sym["name"])[sym["name"]]) for sym in self["symbols"]]) df_list = [] for idx, sym in enumerate(self["symbols"]): df = dfs_dict[sym["name"]] # Renames the initial columns to a number string... col_labels = ["%d" % i for i in range(df.shape[1])] col_labels[-1] = "Value" col_labels[sym["date_col"]] = "YEAR" df.columns = pd.Index(col_labels) df.set_index(col_labels[:-1], inplace=True) new_level_order = [col_labels[sym["date_col"]]] + [x for i, x in enumerate(col_labels[:-1]) if i != sym["date_col"]] df = df.reorder_levels(new_level_order) #: Assumption: Year index is always the lowest-level in column hierarchy df = df.unstack(0) df.columns = df.columns.droplevel() df.index = pd.Index(df.index.tolist(), tupleize_cols=False) df_list.append(df) return CERO.combine_ceros(df_list, overwrite=False, verify_cero=False) @staticmethod def _filter_series(df: pd.DataFrame, names) -> pd.DataFrame: """Throws away unnecessary rows in ``df`` object.""" s_list = [] for name in names: try: s = df.loc[(name,),].iloc[0] # Ugly, but pandas isn't friendly with tuple index values assert (isinstance(s, pd.Series)) except KeyError as e: msg = e.__str__() + (". There are several likely reasons: \n" + \ "1. 
File orientation is in columns and this has not been specified.\n" + "2. Series names do not match those given in the file. Remember to " + "comma-separate the values if multiple columns are used as the index.\n" "3. Pandas is automatically converting the index values to a datatype other than " + "a string. Consider adding 'converters: {column_name: data_type}' to the configuration " + "file." ) raise KeyError(msg) except IndexError as e: msg = ("Invalid series identifier. A cause for this error may be " + "a lack of uniqueness in series identifier (consider expanding " + "the number index columns).") raise IndexError(msg) except AssertionError as e: raise e s_list.append(s) df = pd.concat(s_list, axis=1).transpose() return df def __init__(self, conf: dict, *args, parent: dict=None, **kwargs) -> pd.DataFrame: """Loads a ToCERO configuration, suitable for creating CEROs from data files. :param 'Union[dict,str]' conf: The configuration dictionary, or a path to a YAML file containing the configuration dictionary. If a path, it must be provided as an absolute path, or relative to the current working directory. :param args: Passed to the superclass (`dict`) at initialisation. :param kwargs: Passed to the superclass (`dict`) at initialisation. """ _conf = ToCERO.load_config(conf, parent=parent) super().__init__(_conf, *args, **kwargs) msg = "Loaded ToCERO configuration: %s" % self ToCERO._logger.debug(msg)
[docs]    def create_cero(self):
        """
        Create a CERO from the configuration (defined by ``self``).

        :return pd.DataFrame: A CERO is returned.
        """

        cero_series = []

        for file_obj in self["files"]:
            cero = file_obj.import_file_as_cero()
            cero_series.append(cero)

        cero = CERO.combine_ceros(cero_series,
                                  overwrite=[fo["overwrite"] for fo in self["files"]])
        return cero
[docs] @staticmethod def load_config(conf, parent: dict=None): """ :param 'Union[dict,str]' conf: A configuration dictionary, or a `str` to a path containing a configuration dictionary. :param dict parent: A dict from which to inherit. :return dict: The configuration dictionary (suitable as a ToCERO object). """ _conf = {"header": 0, "index_col": 0, "time_regex": r".*(\d{4})$", "time_fmt": r"%Y", "search_paths": [], "files": []} # Defaults if parent is None: parent = {} _conf.update(parent) if isinstance(_conf["search_paths"], str): _conf["search_paths"] = [_conf["search_paths"]] if isinstance(conf, str): try: if not _conf["search_paths"]: _conf["search_paths"] = os.path.abspath(os.path.dirname(conf)) conf = read_yaml(conf) # If conf is a configuration file, this will succeed except UnicodeDecodeError: # Try auto-import (i.e. assumes that file is of default format) - not supported for all import formats # Works by feeding in appropriate kwargs conf_file_ext = os.path.splitext(conf)[1][1:].lower() if conf_file_ext in ["csv"]: _conf["files"].append({"file": conf}) elif conf_file_ext in ["xlsx", "xls"]: _conf["files"].append({"file": conf, "sheet": "CERO"}) conf = dict(conf) _conf.update(conf) # Arguments provided at initialisation supercede configuration file values if isinstance(_conf["search_paths"], str): _conf["search_paths"] = [_conf["search_paths"]] if not _conf["search_paths"]: _conf["search_paths"].append(os.path.abspath(".")) # Search in working directory if a dict is provided... par_dict = _conf.copy() par_dict.pop("files") # Prevents infinite recursive inheritance for idx, file_obj in enumerate(_conf["files"]): try: _conf["files"][idx] = ToCERO._FileObj(file_obj, parent=par_dict) except TypeError: raise TypeError("'files' must be provided as a list.") return _conf
[docs] @staticmethod def is_valid(conf, raise_exception=True): """ Performs static validity checks on ``conf`` as a ``ToCERO`` object. :param dict conf: An object, which may or may not suitable as a ToCERO object. :param bool raise_exception: If `True` (the default) an exception will be raised in the event a test is failed. Otherwise (in this event) an error message is printed to stdout and `False` is returned. :return bool: A `bool` indicating the validity of ``conf`` as a ``ToCERO`` object. """ if not issubclass(type(conf["files"]), list): msg = 'Files must be specified as a list (or a single item).' + \ 'For example:\n' + \ 'files:\n' + \ ' - file: FILE_A\n' + \ ' [properties of FILE_A] \n' + \ ' - file: FILE_B\n' + \ ' [properties of FILE_B] \n' + \ ' - ... ' ToCERO._logger.error(msg) if raise_exception: raise TypeError(msg) print(msg) return False for file_obj in conf["files"]: if not issubclass(type(file_obj), ToCERO._FileObj): msg = "Object '%s' is of '%s' type, not '_FileObj'." % (file_obj, type(file_obj)) ToCERO._logger.error(msg) if raise_exception: raise TypeError(msg) print(msg) return False return True
[docs]    @staticmethod
    def run_checks(conf, raise_exception=True):
        """
        Performs dynamic validity checks on ``conf`` as a ``ToCERO`` object.

        :param dict conf: An object, which may or may not be suitable as a ToCERO object.
        :param bool raise_exception: If `True` (the default) an exception will be raised in the event a test is
            failed. Otherwise (in this event) an error message is printed to stdout and `False` is returned.
        :return bool: A `bool` indicating the validity of ``conf`` as a ``ToCERO`` object.
        """

        for file_obj in conf["files"]:
            if not ToCERO._FileObj.run_checks(file_obj, raise_exception=raise_exception):
                return False
        return True
    @staticmethod
    def check_config(conf, raise_exception=True, runtime=False):
        _conf = ToCERO.load_config(conf)

        if runtime:
            return ToCERO.run_checks(conf, raise_exception=raise_exception)

        return ToCERO.is_valid(_conf, raise_exception=raise_exception)

    @staticmethod
    def _find_file(file, search_paths: list, raise_exception=True):
        """
        Locates the first occurrence of ``file`` on ``search_paths`` and returns the relative OS-specific path.

        :return str: The relative path to file.
        """
        orig_filename = file
        file = os.path.relpath(os.path.normpath(file))
        for sp in search_paths:
            test_path = os.path.join(sp, file)
            msg = "ToCERO.find_file(): testing path: %s" % test_path
            ToCERO._logger.debug(msg)
            if os.path.isfile(test_path):
                return test_path
        else:
            msg = "File '%s' not found on any of the paths %s." % (orig_filename, search_paths)
            ToCERO._logger.error(msg)
            if raise_exception:
                raise FileNotFoundError(msg)
            print(msg)
            return False