Code architecture

The code base of the project is a collection of Python packages stored as repositories in subgroups of the BONSAI coderefinery group. This chapter describes how it is organized, and it contains the following sections:

  1. Overview describes the organization of the BONSAI coderefinery group and introduces the concepts of utility, task and data transformation stage.

  2. Workflow management describes the management of the data transformation workflow.

  3. Task template describes the main features of the code architecture of a task package.

  4. Data package utility describes the main features of the utility to read and write data packages.

  5. Admin database describes the administrative database that stores sensitive and workflow-management information.

Code concerning calculations and the web application is not currently discussed here.

Overview

The BONSAI coderefinery group is organized as described in figure 1, with names in uppercase being placeholders.

bonsai/
  ├── documentation
  ├── personal/
  ├── web/
  ├── admin/
  ├── util/
  ├── collect/
  ├── clean/
  ├── load/
  ├── build/
  └── calc/

Figure 1. Tree structure of the BONSAI coderefinery group.

In the group root there is only one repository, documentation, hosting the current material. Subgroup personal may in turn hold one subgroup for each project member, to contain ad-hoc repositories, for example exploratory scripts. Subgroup web contains website-related repositories, such as frontend or backend. Subgroup admin hosts repositories used for managing the database server (database) and the computation cluster (workflow). Subgroup calc hosts repositories used for calculations, e.g., footprints, various decompositions and uncertainty or sensitivity analyses. Subgroup util hosts utilities: Python packages which do not receive or deliver data, but are called by other Python packages. Repositories in the remaining subgroups hold Python packages that are used in the data transformation workflow: each folder corresponds to a different stage in that workflow, and each repository holds a Python package that is executed by a task in that stage.

Figure 2. Illustrative DAG of the BONSAI data transformation workflow.

The complete data transformation workflow, from raw data to the BONSAI input-output system, is illustrated in figure 2. Collect tasks, i.e., those with a repository in bonsai/collect, fetch data from the web and store raw data, here interpreted as a collection of files. Some raw data (bulk contribution) is manually provided and so has no corresponding code repository. Each task in the workflow delivers either one data folder (in the collect stage) or, in all other stages, one data package: a collection of csv files (each representing a table) and a metadata yaml file compliant with frictionless syntax. bonsai/clean tasks deliver stand-alone data packages, i.e., without foreign-key relations to tables in other data packages. Clean data packages generated by clean tasks define either major or minor versions of the database. Element contributions define patches and are stored for now as separate data packages that follow the same entity-relation model as the major or minor clean data package, but store either new or different records.

Still working with files, the ‘merge’ stage tasks then combine the newer information from patches (element contributions) with older data to generate ‘merge’ data. All merge tasks execute the same merge algorithm, whose repository is stored as bonsai/util/merge, so there is no bonsai/merge folder. The last two stages are ‘load’ and ‘build’: they both deliver data packages in the main database, but they differ because ‘load’ tasks receive as input at least one ‘merge’ data package, whereas ‘build’ tasks receive only data packages already present in the main database. In the main database, data packages have foreign keys to other data packages and are either dimension data packages or data cubes: data cubes reference dimension data packages (and some dimension data packages reference each other).

In the collect, clean and merge stages each data package stream is independent, but in the load and build stages they are combined. In the load stage, tasks that deliver dimension data packages are executed before those that deliver data cubes. And within dimension data packages there is a strict sequence: the source dimension data package comes first and does not take any input data from the current stage; instead it queries (nonsensitive) data from the admin database. Then the function, flag, unit and miscellaneous dimension data packages are imported, followed by classification data packages and, finally, by data cubes.

In the build stage, there are three initial steps of harmonization, gap-filling and balancing. In these three steps a new version of each data cube is generated: in harmonization the native classification is replaced by a main classification; in gap-filling missing values are estimated; and in balancing values are adjusted so that constraints are satisfied. The following steps then lead to the compilation of a core input-output model, which is then expanded with transformations to its flow version and, finally, to its coefficient version.

Table 1 shows the data underlying figure 2. For each task, its input tasks (i.e., those generating its input data packages) are indicated in column ‘Inputs’, while column ‘Repository’ indicates where the code it executes is stored in CodeRefinery. Two notes follow: in this example the source data package B is a manual contribution, since its collect task is not associated with a repository; and all merge tasks execute code from the same repository.

Workflow management

The data transformation workflow is managed by Airflow, which executes DAGs specified in Python scripts:

dag_main.py
dag_major_<major>.py

where <major> = 0, 1, …, i.e., there is one such file per major version of the Bonsai database.

File dag_main.py queries the admin_db on a regular schedule and performs the following steps (a minimal sketch follows the list):

  1. Open the workflow log file

  2. Query the admin database for new requests

  3. If there is no new request, write to log file and stop. If there is a new request, write to log file and continue.

  4. From admin_db collect information: new release number, new release type and major db version.

  5. Check if input data is valid. Write to log file. If valid proceed.

  6. Update information in admin_db concerning coderefinery repositories and check consistency of admin_db. Write to log file. If valid proceed.

  7. Launch the dag_major_<major>.py corresponding to the requested release, passing as information: pointer to admin_db, new release number, new release type and major_db version.
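The sketch below illustrates how dag_main.py could implement these steps with standard Airflow operators. It is only a point of reference: the operator choice, callable names, schedule and conf keys are assumptions, not the project's actual code.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


def check_new_requests(**context):
    """Steps 1-3: open the workflow log file, query admin_db for new requests
    and write to the log; returning False stops the remaining tasks."""
    ...


def collect_and_validate(**context):
    """Steps 4-6: collect release number, release type and major db version,
    validate input data and admin_db consistency, and write to the log."""
    ...


with DAG(
    dag_id="dag_main",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # placeholder for the "regular schedule"
    catchup=False,
) as dag:
    new_request = ShortCircuitOperator(
        task_id="check_new_requests",
        python_callable=check_new_requests,
    )
    validate = PythonOperator(
        task_id="collect_and_validate",
        python_callable=collect_and_validate,
    )
    # Step 7: launch the DAG of the requested major version, passing the
    # common information (all values below are placeholders).
    trigger_major = TriggerDagRunOperator(
        task_id="trigger_dag_major",
        trigger_dag_id="dag_major_0",
        conf={
            "admin_db": "<pointer_to_admin_db>",
            "release_number": "<new_release_number>",
            "release_type": "<new_release_type>",
            "major_db_version": "<major_db_version>",
        },
    )
    new_request >> validate >> trigger_major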

Following the usual Airflow syntax, file dag_major_<major>.py contains the list of tasks of the corresponding DAG, followed by the hardcoded edge dependencies. There are different ways in which the following can be achieved (e.g., in the DAG operator, or in the manifest of the underlying Docker image), but each task should call a generic object, for now referred to as bonsai_task, to which it passes the common information received from dag_main.py and an additional task_id parameter that identifies the task in the task table of the admin_db. The generic bonsai_task then performs the following steps (a sketch of this generic task follows the list):

  1. Write to the workflow log file (id, time and status).

  2. Query the admin database for the status of dependencies, i.e., whether they have been initialized but have not changed since the last execution, and write to the workflow log.

  3. If the answer to the query in step 2 is positive, stop. Otherwise, write to tables task_version and edge_version and continue.

  4. From admin_db collect information on input and output data package versions and paths and create config.yaml file in location accessible to task.

  5. Create an instance for the task, e.g., a Docker container, and install the Python package from the Coderefinery repository as pip install git+ssh://git@source.bonsai.coderefinery.org/<url_to_repo>@<version>.

  6. Execute the task as <repo_name>.main --run <path_to_config.yaml>.

  7. Close the instance.

  8. Update admin_db and log with information about the success of task execution.
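A schematic sketch of these steps follows. Every helper (log_status, dependencies_unchanged, write_version_tables, write_config_yaml, repo_info, update_admin_db) is a placeholder stub, and the subprocess calls merely stand in for whatever instance (e.g., a Docker container) actually installs and runs the task; the python -m invocation is an assumed form of the execution command.

import subprocess


def log_status(task_id, message):
    """Placeholder: append id, time and status to the workflow log file."""


def dependencies_unchanged(admin_db, task_id):
    """Placeholder: query admin_db for the status of this task's dependencies."""
    return False


def write_version_tables(admin_db, task_id, release_number): pass
def write_config_yaml(admin_db, task_id): return "config.yaml"
def repo_info(admin_db, task_id): return "clean/some_task", "1.0.0", "some_task"
def update_admin_db(admin_db, task_id, success): pass


def bonsai_task(admin_db, release_number, release_type, major_db_version, task_id):
    log_status(task_id, "started")                               # step 1

    if dependencies_unchanged(admin_db, task_id):                # steps 2-3
        log_status(task_id, "skipped: dependencies unchanged")
        return
    write_version_tables(admin_db, task_id, release_number)

    # Step 4: collect input/output versions and paths and write config.yaml.
    config_path = write_config_yaml(admin_db, task_id)

    # Steps 5-6: create the instance, install the package and execute the task
    # (shown here as plain subprocess calls for brevity).
    repo_path, version, package = repo_info(admin_db, task_id)
    subprocess.run(
        ["pip", "install",
         f"git+ssh://git@source.bonsai.coderefinery.org/{repo_path}@{version}"],
        check=True,
    )
    result = subprocess.run(
        ["python", "-m", f"{package}.main", "--run", str(config_path)]
    )

    # Steps 7-8: close the instance and report success or failure.
    success = result.returncode == 0
    update_admin_db(admin_db, task_id, success=success)
    log_status(task_id, "finished" if success else "failed")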

Currently API info is handled via environment variables; for production maybe it is easier to write it explicitly in the config file? Otherwise, the values in config.yaml under api_info may be the names of environment variables, which are created by the Bonsai task operator in the instance that will execute the task.
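If the environment-variable option is chosen, the resolution could look roughly like the sketch below; the file name and key layout follow the collect config shown later in this chapter, and the os.environ lookup is the only added assumption.

import os

import yaml  # PyYAML

with open("config.yaml") as stream:
    config = yaml.safe_load(stream)

# Each value under api_info is taken to be the *name* of an environment
# variable set by the Bonsai task operator in the executing instance.
api_info = {
    key: os.environ.get(var_name)
    for key, var_name in (config.get("api_info") or {}).items()
}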

The first load task loads the source dimension data package, which needs to query the admin database. How to handle this? Should the necessary data for the sqlalchemy pointer be passed in config the same way as api_info in the ‘collect’ stage?

Task and utility template

The code architecture of each repository that executes a task must comply with the architecture described in figure 4. To start with, notice that a Python package has two different names: <python_distribution> is the name used to manage an installation, e.g., pip show <python_distribution> in the command line, and <python_package> is the name used to load the package within the Python interpreter, e.g., import <python_package>. Additionally, each task is expected to generate one data package, called <data_package>. The following paragraphs describe in some detail four files (tox.ini, docs/index.md, tests/test_main.py and src/<python_package>/main.py) and the folder tests/data/.

<python_distribution>/
  ├── tox.ini
  ├── ...
  ├── docs/
  │     ├── index.md
  │     └── ...
  ├── tests/
  │     ├── test_main.py
  │     ├── ...
  │     └── data/
  │           ├── config.yaml
  │           ├── input/
  │           │     ├── <data_package_0>/
  │           │     │     ├── <data_package_0>.metadata.yaml
  │           │     │     ├── <data_package_0>.gv.svg
  │           │     │     ├── <file_0>.csv
  │           │     │     └── ...
  │           │     └── ...
  │           └── output/
  │                 └── <data_package>/
  │                       ├── <data_package>.log
  │                       └── ...
  └── src/
        └── <python_package>/
              ├── main.py
              └── ...

Figure 4. Tree structure of a task repository.

Folder tests/data/ contains a template config.yaml with the structure of the arguments required to execute the task, and when the data in tests/data/input/ is correct, the task should return output matching tests/data/output/. Input data, if any, and usually output data are expected to be data packages, which consist of csv files and a metadata.yaml file.

File tox.ini contains instructions for automated testing and generation of documentation. In particular, it will launch tests/test_main.py which, among other things, generates the entity-relation diagrams of input and output data inside tests/data/. File docs/index.md should contain instructions for installation and a description of the entity-relation models of input and output data, including the diagrams. File tests/test_main.py should contain tests for the three main behaviours expected of the task (a sketch of such a test module follows the list):

  1. <python_package>.main --run <path_to_config.yaml>, where the config.yaml used is tests/data/config.yaml, and the task is executed.

  2. <python_package>.main --plot <path_to_metadata.yaml>, where the metadata.yaml files used are all those under tests/data/, and entity-relation diagrams are created in the relevant folder.

  3. <python_package>.main --export <export_path>, where all files under tests/data/ are exported to <export_path>.
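A possible skeleton for tests/test_main.py, assuming main() can be called directly with an argv-style list; the package name my_task is a stand-in for <python_package>, and the assertions are illustrative only.

from pathlib import Path

from my_task.main import main  # my_task is a stand-in for <python_package>

DATA = Path(__file__).parent / "data"


def test_run():
    # Behaviour 1: execute the task with the template config; the result
    # should match the reference data under tests/data/output/.
    main(["main", "--run", str(DATA / "config.yaml")])


def test_plot():
    # Behaviour 2: create entity-relation diagrams for every metadata.yaml
    # found under tests/data/.
    for metadata in DATA.rglob("*.metadata.yaml"):
        main(["main", "--plot", str(metadata)])


def test_export(tmp_path):
    # Behaviour 3: export all files under tests/data/ to the target path.
    main(["main", "--export", str(tmp_path)])
    assert any(tmp_path.iterdir())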

The content of file src/<python_package>/main.py should be:

import sys
from task_wrapper import task_wrapper
from <my_module> import <my_function>

@task_wrapper
def main(argv):
    """ useful information """
    input_data = argv[-1]

    # start edit
    output_data = <my_function>(input_data)
    # stop edit

    return output_data

if __name__ == "__main__":
    """ useful information """
    main(sys.argv)

Notice that the argv received by main is not the one originally passed: task_wrapper appends to it the dictionary read from config.yaml (if argv[1] == --run) and the input data packages (in a list). Need to check if this is possible.

The parts that should be edited by the task developer are those within the # start edit and # stop edit comments, the """ useful information """ blocks and the import line from <my_module> import <my_function>. The task_wrapper is a decorator imported from a utility (which should be declared in setup.cfg), and which performs the following actions (a minimal sketch of the decorator is given after this description):

  1. Parse and validate the list of arguments. The first element should be --run (or alias -r), --export (or -e) or --plot (or -p), each receiving a path as argument, as described above for the test behaviours. If something is invalid, provide useful feedback and stop; otherwise continue.

  2. Create the output folder if needed, open a log file in it, log the initial steps, pass the logger to the task, and close it after completion.

  3. In the case of the ‘run’ functionality, use the data_io package to load any input data, and append them to the list of arguments passed to the task.

  4. After the task is finished write the output data package to the output folder (if it is valid).

The ‘plot’ functionality also uses the data_io package. A ‘logger_setup’ standalone function should be available from the task_wrapper utility to be imported for development of tasks and utilities.
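A minimal sketch of the decorator and the standalone logger_setup follows, assuming PyYAML for config parsing and the data_io interface described in the next section; the exact layout of the input section of config.yaml is an assumption.

import functools
import logging
from pathlib import Path

import yaml      # PyYAML
import data_io   # the data package utility described in the next section


def logger_setup(log_path):
    """Standalone logger factory, importable for task and utility development."""
    logger = logging.getLogger("bonsai_task")
    logger.addHandler(logging.FileHandler(log_path))
    logger.setLevel(logging.INFO)
    return logger


def task_wrapper(func):
    @functools.wraps(func)
    def wrapper(argv):
        # Action 1: parse and validate the argument list.
        flag, path = argv[1], Path(argv[2])
        if flag not in ("--run", "-r", "--export", "-e", "--plot", "-p"):
            raise SystemExit(f"invalid option: {flag}")

        if flag in ("--run", "-r"):
            config = yaml.safe_load(path.read_text())
            # Action 2: create the output folder and open a log file in it.
            out_dir = Path(config["output"]["path"])
            out_dir.mkdir(parents=True, exist_ok=True)
            logger = logger_setup(out_dir / f"{config['output']['name']}.log")
            logger.info("task started")
            # Action 3: load input data packages and append them, together with
            # the config dictionary, to the task's arguments (the shape of the
            # input section is an assumption).
            inputs = [data_io.load(entry["path"])
                      for entry in config.get("input", [])]
            output = func(list(argv) + [config, inputs])
            # Action 4: write the output data package if it is valid.
            output.check()
            output.dump(out_dir)
            logger.info("task finished")
        elif flag in ("--plot", "-p"):
            data_io.load(path.parent).plot(path.parent)   # also uses data_io
        else:
            ...                                           # '--export' functionality

    return wrapper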

Collect stage:

Fields in config.yaml of the collect stage:

stage: 'collect'
output:
    name:
    path:
    version:
    create_path: False
    overwrite: False
api_info:
    username:
    password:
    key:
    token:

The task_wrapper will output a metadata.yaml file to the raw data folder with additional fields obtained from introspection or checking with the template metadata:

    files: <list with names of expected files>
    data_name: CONFIG
    data_path: CONFIG
    data_version: CONFIG
    datetime: RUNTIME
    repo: RUNTIME
    repo_version: RUNTIME ()
    contact_person:
    source:
    document:
    license:
    ...?

As a rule, a value in tests/data/<data_folder>/metadata.yaml can be a standard string, meaning it will be copied to the output metadata; RUNTIME, meaning it is determined at runtime; or CONFIG, meaning it is read from config. The same logic can apply to tests/data/config.yaml.
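Purely as an illustration of this convention, a resolver could walk the template metadata and substitute values; the mapping of data_name/data_path/data_version to the output section of config.yaml, and the runtime lookups shown, are assumptions based on the fields listed above.

from datetime import datetime, timezone


def resolve_metadata(template, config, repo=None, repo_version=None):
    """Copy literal values, fill CONFIG fields from config.yaml and RUNTIME
    fields from values known only at execution time."""
    runtime = {
        "datetime": datetime.now(timezone.utc).isoformat(),
        "repo": repo,
        "repo_version": repo_version,
    }
    resolved = {}
    for field, value in template.items():
        if value == "CONFIG":
            # e.g. data_name -> config['output']['name'] (assumed mapping)
            resolved[field] = config["output"][field.removeprefix("data_")]
        elif value == "RUNTIME":
            resolved[field] = runtime.get(field)
        else:
            resolved[field] = value   # standard string: copied as-is
    return resolved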

In the case of manually provided data, repo and repo_version can be None or MANUAL?

task_wrapper will not actually write the files; that must be managed by the content of the main function (in other stages the main function only needs to pass the objects). task_wrapper will also not generate entity-relation diagrams; for example, the plot function may include a condition such as: if stage == ‘collect’, do nothing.

The choice of values for the create_path and overwrite options assumes that the Bonsai task operator previously created the folder. Alternatively, task_wrapper could do that, in which case the options are different. There is also the issue of the root path: should a root path also be provided in config, from which the output path is relative?

The utility template does not need to have any particular structure, as long as it contains clear instructions on how to use it. The export of any test data and a test script that illustrates its application should be included. The task_wrapper utility may include a standalone ‘export’ functionality to be used for this purpose.

Clean stage:

Fields in config.yaml:

stage: 'clean'
input:
    name:
    path:
    version:
output:
    name:
    path:
    version:
    create_path: False
    overwrite: False

task_wrapper constraints: inputs are expected to be files, and the output is expected to have no upstream dependencies. The output is a data package, so a diagram should be created; maybe add a ‘datapackage: True’ field to metadata? As in ‘collect’, the metadata has non-standard fields with admin data (contact_person, source, document, license).

Merge stage:

Merge tasks all call the same ‘merge’ repository, which is expected to receive two inputs, new_contrib and previous_merge, and to deliver a new_merge. The distinction can be made by assuming that new_contrib comes from database ‘clean’ and previous_merge from ‘merge’ (if any).

In other stages, task_wrapper checks that the output template metadata and the generated metadata have the same entity-relation model. However, this check must be skipped for the merge “task”. The check instead is that both input data packages have the same entity-relation model. Admin metadata needs to be concatenated, so it is better to have a special sub-dictionary with lists. That is, in metadata there is:

admin:
    contact_person: […]
    source: […]
    document: […]
    license: […]

The output data package should be ‘merge’; no upstream dependencies are allowed.

Merge should add to each table a reference to the contribution from which each record originates. This means this information needs to be passed on by workflow management.
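Purely as an illustration of this idea for a single table (the real algorithm lives in bonsai/util/merge and may differ), the record-level behaviour could look like:

import pandas as pd


def merge_table(previous_merge, new_contrib, contribution_id):
    """Records in the new contribution replace records with the same primary
    key ('id' index) in the previous merge; new records are appended. Each
    record keeps a reference to the contribution it originates from."""
    new_contrib = new_contrib.copy()
    new_contrib["contribution"] = contribution_id
    kept = previous_merge.loc[~previous_merge.index.isin(new_contrib.index)]
    return pd.concat([kept, new_contrib]).sort_index()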

Load stage:

Admin metadata is discarded from now on, as it is explicitly stored in tables.

Inputs can be from database ‘merge’ or from ‘main’. Output is ‘main’.

Build stage:

Inputs and output are ‘main’.

Data package utility

While the task_wrapper utility handles specific paths that comply with the architecture of task repositories, the functionalities to handle data packages are in the ‘data_io’ package. This Python package should have a DataPackage class with the main methods and attributes described below.

from data_io import DataPackage followed by dp = DataPackage() initializes an empty data package.

dp.load(path, depth=0): given the path to a folder with a Frictionless-compliant metadata.yaml file, populates the data package. The depth argument indicates whether upstream dependencies up to a certain depth are loaded too (0 means none, -1 means all, and i > 0 indicates the highest level of dependency imported). Alternative syntax is dp = data_io.load(path, depth=0).

dp.build(metadata: dict, tables: list of pd.DataFrames): assembles a data package from Python objects. It can be done in steps, first building the metadata and then the tables.

dp.dump(path): Dumps to path (only depth zero is allowed for dumping).

dp.check(): Check internal consistency of data package.

dp.plot(path): Generate entity-relation diagram.
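A usage sketch of this interface (all paths and the metadata content are placeholders):

import pandas as pd

import data_io
from data_io import DataPackage

# Load an existing data package together with its direct upstream dependencies.
dp = data_io.load("path/to/some_data_package", depth=1)
dp.check()                       # internal consistency
dp.plot("path/to/diagrams")      # entity-relation diagram

# Assemble a new data package from Python objects and write it to disk.
metadata = {"datapackage_name": "example", "datapackage_version": "0.1.0"}
tables = [pd.DataFrame({"id": [1, 2], "value": [0.5, 1.5]}).set_index("id")]
new_dp = DataPackage()
new_dp.build(metadata=metadata, tables=tables)
new_dp.dump("path/to/output")    # only depth zero can be dumped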

The first attribute is dp.metadata, a dict with a syntax based on the Frictionless framework and some additional fields (an illustrative example follows the list):

  1. Root fields: datapackage_name, datapackage_version, root_path, task_name, task_version, repo_full_url, database, depth, dependencies. All except the last two are strings; depth is an int and dependencies is a dict of the form {<datapackage_0>: {version: str, path: str, depth: int}, ...}. Each dependency is an upstream data package, which is found by combining the root_path with dependencies[<datapackage_0>]['path'].

  2. Within each ['resources'][<table_0>] there is an additional field datapackage indicating either <datapackage_name> (optional) or a key of dependencies (mandatory).

  3. Within each ['resources'][<table_0>]['schema']['foreignKeys'][...]['reference'] there is an additional field datapackage indicating either <datapackage_name> (optional) or a key of dependencies (mandatory).
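An illustrative, heavily simplified dp.metadata follows; all names, versions and paths are placeholders, and resources are shown keyed by table name only to match the bracket notation above.

metadata = {
    "datapackage_name": "use_cube",
    "datapackage_version": "1.0.0",
    "root_path": "/data/main",
    "task_name": "load_use",
    "task_version": "0.1.0",
    "repo_full_url": "https://source.bonsai.coderefinery.org/bonsai/load/use",
    "database": "main",
    "depth": 1,
    "dependencies": {
        "unit_dimension": {"version": "1.0.0", "path": "dimensions/unit", "depth": 1},
    },
    "resources": {
        "use_table": {
            "datapackage": "use_cube",          # optional: the package itself
            "schema": {
                "primaryKey": "id",
                "foreignKeys": [
                    {
                        "fields": "unit_id",
                        "reference": {
                            "resource": "unit_table",
                            "fields": "id",
                            # mandatory: a key of 'dependencies'
                            "datapackage": "unit_dimension",
                        },
                    },
                ],
            },
        },
    },
}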

This architecture is not ideal since it does not preclude clashes of tables with the same name in different data packages. Maybe the name of the table should be a composite <datapackage_0.table_0>.

Additional fields are required for the entity-relation diagrams: within each ['resources'][<table_0>]['schema']['foreignKeys'][...]['reference'] there are the fields direction, with a value in ['forward', 'back'], and distance, an int.

Other attributes are the data packages and tables proper: dp.<data_package_0>.<table_0> is a pandas DataFrame holding table <table_0> of data package <data_package_0>. If depth is zero there is only one data package; otherwise upstream dependencies are stored in their own attributes. The index should be the primary key of the table (field name ‘id’). The constraints defined in the metadata resources should be valid.

The class has an inner join method dp.join(<child_table>, <parent_table>, [<fields>]) or similar.

Ideally we would like to have table and data package archetypes. Table archetypes are fact, dimension, classification and concordance; data package archetypes are data cube, dimension and classification. However, these cannot be subclasses, because archetypes are defined only in relation to other objects and subclassing the pandas DataFrame class is officially discouraged. Thus, we use archetype keys in the metadata both of the data package as a whole (in the root and in the dependencies field) and of each resource (= table). Method DataPackage.check() checks for archetype consistency.

Admin database

Administrative data is kept separately for security reasons. This includes sensitive information concerning: users (contact, affiliation, passwords); API keys and tokens; data sources and licenses; and workflow management (tasks, repositories, versions and their interaction). The part concerning users, APIs and data sources is discussed in the user experience chapter. The part concerning workflow management is discussed in the workflow tutorial. The full entity-relation model of the admin database is presented in figure 12.

Figure 12. Entity-relation diagram of the full admin database.

The workflow management minimal working example describes the lower and right part of the full admin database. The top left corner roughly overlaps with the source dimension data package, but contains additional sensitive information. It is not shown explicitly in order not to overwhelm the diagram, but all tables except ‘user_group’ have an additional field ‘created_by’ with a foreign key to ‘user’, and all tables have an additional field ‘create_time’. Table ‘user_group’ lists user groups with different permissions.