Converting TO the Collins Economics Result Object (CERO) format
The ToCERO class provides methods for converting data files to the CERO format. Critical to the successful use of this class is a configuration file in YAML format. Do not be intimidated by the acronym - the YAML format is very simple and human-readable. Typically, study [1] of the YAML format should be unnecessary - copying a working configuration file and then altering it for the desired purpose should satisfy most users (the tests/data subdirectory provides many examples). This documentation shows you how to build a YAML configuration file for use with the ToCERO class in a gradual, example-by-example process. A technical reference to the ToCERO class follows.
Building a YAML file from scratch to convert TO the CERO format
The configuration file can differ significantly depending on the type of file from which data is imported, but one aspect that all configuration files must have in common is the files field. As the name suggests, files specifies the input files that are sources of data for the conversion process. It therefore follows that a minimal (albeit useless) YAML configuration file will look like this:
files:
That is, a single line that doesn't specify anything. This simple file is interpreted as a dict with the key "files" and a corresponding value of None - the : identifies the key-value nature of the data. That is:
{"files": None}
This top-level dictionary object is referred to as a ToCERO object. The obvious next step is to specify some input files to convert. This is done by adding indented [2] subsequent lines, each starting with a hyphen followed by a space, followed by the relevant data. For example:
files:
- <File Object A>
- <File Object B>
- <File Object C>
- etc.
The hyphens (each followed by a space) on subsequent lines identify separate items that collectively are interpreted as a python list. The indentation identifies this list as the value for the key in the line above. Basically, the previous example is interpreted as the python object:
{"files": [<Python interpretation of File Object A>,
<Python interpretation of File Object B>,
<Python interpretation of File Object C>,
<etc.>]}
Note that each item of the "files" list can be either a str or a dict. If a str, the string must refer to a YAML file containing a dict defining a file object. If a dict, then that dict must be a file object. A file object is a dictionary with one mandatory key-value pair - that is (in YAML form):

file: name_of_file

where name_of_file is a file path relative to the configuration file. The option search_paths: List[str], provided as an option to the file object (or the encompassing ToCERO object), overrides this behaviour (paths earlier in the list are searched before paths later in the list).
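For illustration, a sketch of a file object using search_paths (the paths shown are hypothetical):

files:
- file: name_of_file.csv
  search_paths:
  - /path/searched/first
  - /path/searched/second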
Without further specification, if the file type is comma-separated values (CSV) and the data is of the default format, ConCERO can import the entire file. The 'default format' is discussed in Guidelines for painless importing of data. ConCERO determines the file type:
1. by the key-value pair type: <ext> in the file object, and if not provided then
2. by the key-value pair type: <ext> in the ToCERO object, and if not provided then
3. by determining the extension of the value of file in the file object, and if that cannot be determined then
4. an error is raised.
Providing the type option allows the user to potentially extend the program to import files that the program author was not aware existed, provided the file is of a similar format to one of the known and supported formats. For example, if the program author was not aware that shk files existed (and thus did not provide support for them), shk files could be imported by specifying type: har (given their similarity to har files). As it is, shk files are supported, so this is not necessary. Naturally, whether the import succeeds depends on whether the underlying library allows importing that file type.
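For example, had shk files not been supported, a sketch of such a file object might look like (the file name is hypothetical):

files:
- file: a_file.shk
  type: har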
With respect to step 2 (of determining the file type), it can be said that the file object inherits from the ToCERO object. Many key-value pairs can be inherited from the ToCERO object, which avoids duplicating information in the case that some properties apply to all the input files. Given that every key-value pair has some effect on configuration, the term option is used to refer to any such key-value pair. An example of a YAML file including all points discussed so far is:
files:
- file: a_file.csv
- file: b_file
  type: csv
In the example above, a_file.csv and b_file would both be successfully imported (assuming they are both of default format). The file extension can be discerned for a_file.csv, and b_file has the corresponding type specified. Note that the type option for b_file is indented to the same level as the file option, not to the level of the list.
A minimal configuration form that demonstrates inheritance (assuming c_file is of the default csv type) is:
type: 'csv'
files:
- file: a_file.csv
- file: b_file
- file: c_file
Note that, alternatively, the file name of c_file could be changed to include a file extension. An important point is that the inheritance of type does not mean you - the user - can lazily drop the file extensions. The file extension is part of the file name, and so it must be provided, if it exists, to find the correct file.
In most cases, more specification in the file object is necessary to import data. The necessary and additional options in the file object depend on the type of the file - whether it be CSV files, Excel files, HAR files or GDX files. That is, the supported types are given by ToCERO.supported_file_types - a set of:

- "csv"
- "xlsx"
- "har"
- "shk"
File Objects - CSV files
CSV files can be considered the simplest case with respect to data import. 'Under the hood', ConCERO uses the pandas.read_csv() method to implement data import from CSVs (see the pandas documentation). Any option available for the pandas.read_csv() method is also available to ConCERO by including that option in the file object. There are also a few additional options that provide ConCERO-specific functionality. These options are:
series: (list)
    Specifies the series in the index that are relevant, thereby providing a way to select data for export to the CERO. Each item in the list is referred to as a series object, which is a dictionary with the following options:

    name: (str)
        Identifies the elements of the index that will be converted into a CERO. name is a mandatory option.

    rename: (str)
        If provided, changes name to the value provided by rename after export into the CERO.

    A series object can also be provided as a string - this is equivalent to the series object {'name': series_name}.

orientation: (str)
    'rows' by default. If the data is in columns with respect to time, change this option to 'cols' (thereby effectively calling a transposition operation).

skip_cols: (str|list)
    A column name, or a list of column names, to ignore.
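To illustrate, a sketch of a file object combining these ConCERO-specific options (the file, series and column names are hypothetical):

files:
- file: a_file.csv
  orientation: cols
  skip_cols: Notes
  series:
  - GDP
  - name: EMP
    rename: Employment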
Other pandas.read_csv() options that are regularly used include:

usecols: (list)
    From the pandas documentation - Return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid array-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Note that usecols takes precedence over skip_cols, and that the argument format for usecols for a csv file differs slightly from that for an xlsx file.

index_col: (int|list)
    The column or list of columns (zero-indexed) in which the identifiers reside or, if orientation == "cols", the column with the date index.

header: (int|list)
    The row or list of rows (zero-indexed) in which the date index resides or, if orientation == "cols", the rows with the data identifiers.

nrows: (int)
    Number of rows of the file to read. May be useful with very large csv files that have a lot of irrelevant data.
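As a sketch (hypothetical file name and values), these options might be combined like so:

files:
- file: a_file.csv
  index_col: 0
  header: 0
  usecols: [0, 2, 3]
  nrows: 100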
For further documentation, please consult the pandas documentation.
File Objects - Excel files
The process for importing Excel files is very similar to that for csv files. Underneath, the pandas.read_excel() method is used, with virtually identical options with identical meanings. Consequently, not all the standard options are mentioned here - just the differences in contrast to those for csv files. For a complete list of available options, please consult the pandas documentation.
sheet: (str) or sheet_name: (str)
    The name of the sheet in the workbook to be imported.

usecols: (list[int]|str)
    Similar to the csv form of the option, usecols accepts a list of zero-indexed integers to identify the columns to be imported. Unlike the csv option, usecols will not accept a list of str, but will accept a single str with an Excel-like specification of columns. For example, usecols: A,C,E:H specifies the import of columns A, C and all columns between E and H inclusive.
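A sketch of an Excel file object (the file and sheet names are hypothetical):

files:
- file: a_file.xlsx
  sheet: Sheet1
  usecols: A,C,E:H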
File Objects - HAR (or SHK) files
In reading this section of the documentation, shk files can be considered equivalent to har files, so only har files are referred to hereafter.

har files contain one or more header arrays, where each header array is an array of one or more dimensions (to a maximum of 7). Each dimension of each array has an associated set. Note that the terminology set can be considered misleading because, unlike the mathematical concept of a set, HAR sets have an order. The order of the set corresponds to the placement of items within the array.

To specify the import of a har file, only one option in the file object is necessary - that is, head_arrs, with an associated list of strings specifying the names of the header arrays to import from the file. Therefore, an example configuration file that specifies the import of a har file could look like:
files:
- file: har_file.har
  head_arrs:
  - HEA1
  - HEA2
With the example configuration, header arrays HEA1 and HEA2 would be imported from the file har_file.har. Note that it is a restriction of the har format itself that header names cannot be longer than 4 characters.
In the example above, each header array name is given as a string. The more general format for a header definition is a dict, referred to as a header_dict. Each header_dict must have the option:

name: header_name
    where header_name is the name of the header.

A header_dict must also have the following option if one of the dimensions of the array is to be interpreted as a time dimension:

time_dim: (str)
    where the string is the name of the set indexing the time dimension (note that the format/data-type of the time dimension is irrelevant).

If the data has no time dimension (which should definitely be avoided) and therefore time_dim is not specified, then default_year must be provided (or inherited from the file object) - otherwise a ValueError will be thrown.
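For example, a sketch of a file object using header_dicts (the header names, set name and year are hypothetical; YEAR is assumed to be the set indexing the time dimension of HEA1):

files:
- file: har_file.har
  head_arrs:
  - name: HEA1
    time_dim: YEAR
  - name: HEA2
    default_year: 2018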
Note that it may also be necessary to include some of the file-independent options if the time-dimension has a format that deviates from the default. Please see File independent options for more information.
File Objects - VD files
The author of the import connector is not familiar with the full diversity of VEDA data files (if indeed there is any diversity). Consequently, the VEDA data file importer has been written with several assumptions. Specifically:
- Lines starting with an asterisk (*) are comments.
- The number of data columns remains constant throughout a single file.
If these assumptions are incorrect, please raise an issue on GitHub.
To specify the import of a vd file, it is mandatory to specify:

date_col: (int)
    where date_col is the zero-indexed number of the column containing the date.

val_col: (int)
    where val_col is the zero-indexed number of the column containing the values.

And optional to specify:

default_year: (int)
    If left unspecified, all records with an invalid date in date_col are dropped. If specified (as a year), the value of date_col in all records with an invalid date is changed to default_year.
Example:

files:
- file: a_file.vd
  date_col: 3
  val_col: 8
  default_year: 2018
Note that it may also be necessary to include some of the file-independent options if the time-dimension has a format that deviates from the default. Please see File independent options for more information.
File Objects - GDX files
GDX files can be imported by providing the option:

symbols: (list[dict])
    where each list item is a dict (referred to as a "symbol dict").

Each symbol dict must have the options:

name: (str)
    where name is the name of the symbol to load.

date_col: (int)
    where date_col specifies the (zero-indexed) column that includes the date reference.
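An illustrative sketch (the file and symbol names are hypothetical):

files:
- file: a_file.gdx
  symbols:
  - name: GDP
    date_col: 0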
File Independent Options
The options in this section are relevant to all input files, regardless of their type. They are:

- time_regex: (str)
- time_fmt: (str)
- default_year: (int)
A fundamental principle ConCERO relies upon is that all data has some reference to time (noting that all data to date has been observed to reference the year only). The time-index data will typically be in string format, and the year is interpreted by searching through the string using the regular expression time_regex. The default - '.*(\d{4})$' - attempts to interpret the last four characters of the string as the year. Importantly, the match returns the year as the first 'group' (in regular-expression lingo), and it is this first group to which time_fmt is applied to convert the string to a datetime object. The default - '%Y' - assumes that the string contains 4 digits corresponding to the year (and only that).

In the event that the date-time data isn't stored in the file itself, a default_year option (a single integer corresponding to the year - e.g. 2017) must be provided.
What follows is an example, using the defaults of time_regex and time_fmt, to demonstrate how this works. Let's assume the time index series is given, in CSV form, by:

bs1b-2017,bs1b-br1r-pl1p-2018,bs1b-br1r-pl1p-2019,...

which is typically seen with VURM-related data. The last four digits are obviously the year, so the default setting is appropriate. The regex essentially simplifies the data to a list of strings:

['2017', '2018', '2019', ...]
However, ConCERO needs to convert these strings to datetime format. This is done by matching the strings against a strptime-style pattern, given by time_fmt. The default - '%Y' - will interpret each string as four digits corresponding to the year - an obviously satisfactory result. Hence, the following options are appropriate to include in the YAML configuration file:

time_regex: '.*(\d{4})$'
time_fmt: '%Y'
Note: if the default settings (as per the example immediately above) are appropriate, specifying them is not necessary.
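To make this concrete, the following minimal Python sketch shows what the two-step interpretation amounts to under the default settings (this illustrates the mechanics only; ConCERO's internal implementation may differ):

import re
from datetime import datetime

time_regex = r'.*(\d{4})$'  # default: capture the trailing four digits as group 1
time_fmt = '%Y'             # default: interpret that group as a four-digit year

raw = 'bs1b-br1r-pl1p-2018'
year_str = re.match(time_regex, raw).group(1)  # -> '2018'
date = datetime.strptime(year_str, time_fmt)   # -> datetime(2018, 1, 1)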
[1] For a more thorough yet simple introduction to YAML files, http://docs.ansible.com/ansible/latest/YAMLSyntax.html is recommended.
[2] 'Indented' refers to a consistent number of spaces (note that the YAML specification does not permit tabs for indentation, so spaces should be used). It is critical that the indentation pattern remains consistent throughout the file (a requirement in common with python).
ToCERO Technical Specification
class to_cero.ToCERO(conf: dict, *args, parent: dict = None, **kwargs) → pandas.core.frame.DataFrame
    Loads a ToCERO configuration, suitable for creating CEROs from data files.

    Parameters:
    - conf (Union[dict, str]) - The configuration dictionary, or a path to a YAML file containing the configuration dictionary. If a path, it must be provided as an absolute path, or relative to the current working directory.
    - args - Passed to the superclass (dict) at initialisation.
    - kwargs - Passed to the superclass (dict) at initialisation.

    create_cero()
        Create a CERO from the configuration (defined by self).
        Return pd.DataFrame: A CERO is returned.

    static is_valid(conf, raise_exception=True)
        Performs static validity checks on conf as a ToCERO object.
        Return bool: A bool indicating the validity of conf as a ToCERO object.

    static load_config(conf, parent: dict = None)
        Return dict: The configuration dictionary (suitable as a ToCERO object).
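As a usage sketch (the configuration file name is hypothetical), creating a CERO from a configuration file might look like:

from to_cero import ToCERO

tc = ToCERO("a_to_cero_config.yaml")  # path relative to the current working directory
cero = tc.create_cero()               # returns a pandas DataFrame (a CERO)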
Created on Fri Jan 19 11:49:23 2018
Section author: Lyle Collins <Lyle.Collins@csiro.au>
Code author: Lyle Collins <Lyle.Collins@csiro.au>