ObjTables: Working with complex data with the ease of spreadsheets, the rigor of schemas & the power of object-oriented programming

ObjTables is a toolkit that makes it easy to work with complex datasets by combining spreadsheets (e.g., Excel workbooks) with schemas (classes, their attributes, the type of each attribute, and the possible relationships between instances of classes) and an object-relational mapping system (similar to Active Record, Doctrine, Django, Hibernate, Propel, SQLAlchemy, etc.).

Together, these components make it easy to create datasets that are both human- and machine-readable and that can be easily shared with others, compared with similar datasets, re-used for additional analyses, and composed into more comprehensive datasets.

ObjTables consists of a format for describing schemas for spreadsheets; numerous data types for scientific data; a syntax for indicating the class and attribute represented by each table and column in a spreadsheet; software for using schemas to rigorously validate, merge, split, compare, and revision datasets; and software for using schemas to parse datasets into objects for further analysis in programming languages such as Python.

ObjTables is ideal for supplementary materials of journal articles, as well as for emerging domains which need to quickly build new formats for new types of data and associated software with minimal effort.

Overview

Use cases

  • Integrating heterogeneous data.
  • Collaboratively and iteratively building complex datasets and models.
  • Defining formats for new types of data and models.
  • Publishing re-usable supplementary materials to journal articles.
  • Sharing re-usable data with collaborators.

Components of the ObjTables toolkit

  • Format for schemas for tabular datasets.
  • Numerous data types for mathematics, science, chemistry, and biology.
  • Markup format for tabular datasets (e.g., Excel workbooks or collections of CSV/TSV files).
  • Software tools for working with tabular datasets.
  • Python package for additional flexibility and more complex analyses.

Benefits

  • Use collections of tables (e.g., an Excel workbook) to represent complex data consisting of multiple related objects of multiple types (e.g., rows of worksheets), each with multiple attributes (e.g., columns).
  • Use complex data types (e.g., numbers, strings, numerical arrays, symbolic mathematical expressions, chemical structures, biological sequences, etc.) within tables.
  • Use Excel as a graphical interface for viewing and editing complex datasets.
  • Use embedded tables and grammars to encode relational information into columns and groups of columns of tables.
  • Define clear schemas for tabular datasets.
  • Use schemas to rigorously validate tabular datasets.
  • Use schemas to parse tabular datasets into data structures for further analysis in languages such as Python.
  • Compare, merge, split, revision, and migrate tabular datasets.

What is an ObjTables schema and dataset?

An ObjTables schema defines the types of objects that the workbook can represent (the tables of the workbook), their attributes (the columns of each table), the type of each attribute (e.g., Boolean, integer, float, string, etc.), and the possible relationships between the instances of the objects (rows of the tables). Optionally, schemas can also define which attributes are required and how each attribute, object, and the entire dataset should be validated. As illustrated below, schemas can be defined using our simple tabular format, or as a Python module.

An ObjTables dataset is a workbook or collection of tables. The workbook defines the objects that constitute the dataset (the rows of the tables), the attributes of each object (the values of the cells), and the relationships between the objects (the values of the cells that represent relationships). As illustrated below, each table must be marked with a statement that begins with !! that indicates the class represented by the table. Similarly, each column must be marked with a statement that begins with ! that indicates the attribute represented by the column. These statements describe the link between workbooks and their schemas.
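
As a rough illustration of how these markers tie a table to its schema, the following sketch reads a table heading and its column headings using only the Python standard library. It is not the ObjTables parser: the function names `parse_table_heading` and `parse_column_headings` are hypothetical, and the heading text is adapted from the address book example in this document.

```python
import re

# Hypothetical helpers for illustration only; not part of the obj_tables package.


def parse_table_heading(line):
    """Return the key='value' metadata of a `!!ObjTables ...` table heading."""
    if not line.startswith('!!ObjTables'):
        raise ValueError('Table headings must begin with !!ObjTables')
    return dict(re.findall(r"(\w+)='([^']*)'", line))


def parse_column_headings(cells):
    """Return attribute names from cells marked with a single leading `!`."""
    names = []
    for cell in cells:
        if not cell.startswith('!') or cell.startswith('!!'):
            raise ValueError('Column headings must begin with a single !')
        names.append(cell[1:].strip())
    return names


heading = parse_table_heading(
    "!!ObjTables type='Data' tableFormat='row' id='Person' name='People'")
columns = parse_column_headings(
    ['!Name', '!Company', '!Email address', '!Phone number'])
print(heading['id'], columns)
```

Here the `id` entry names the class that the table represents, and each column heading names one attribute of that class.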

Example: Address book of CEOs

The following illustrates a schema for an address book of people and their employers, and an example address book of the CEOs of several technology companies. The documentation contains more complete examples that illustrate additional features of ObjTables. The documentation includes Excel, CSV, TSV, JSON, YAML, and Python files for these examples.

Tabular-formatted schema

!!ObjTables type='Schema' tableFormat='row' description='Table/model and column/attribute definitions' date='2020-03-10 21:34:50' objTablesVersion='0.0.8'
!Name         | !Type     | !Parent | !Format                           | !Verbose name | !Verbose name plural | !Description
Person        | Class     |         | row                               | Person        | People               |
name          | Attribute | Person  | String(primary=True, unique=True) | Name          |                      |
company       | Attribute | Person  | String                            | Company       |                      |
email_address | Attribute | Person  | Email                             | Email address |                      |
phone_number  | Attribute | Person  | String                            | Phone number  |                      |

Schema defined as a Python module

import obj_tables

class Person(obj_tables.Model):
    name = obj_tables.StringAttribute(
        primary=True,
        unique=True,
        verbose_name='Name')
    company = obj_tables.StringAttribute(
        verbose_name='Company')
    email_address = obj_tables.EmailAttribute(
        verbose_name='Email address')
    phone_number = obj_tables.StringAttribute(
        verbose_name='Phone number')

    class Meta(obj_tables.Model.Meta):
        table_format = obj_tables.TableFormat.row
        attribute_order = (
            'name',
            'company',
            'email_address',
            'phone_number',
        )
        verbose_name = 'Person'
        verbose_name_plural = 'People'

Tabular-formatted dataset

!!!ObjTables objTablesVersion='0.0.8' date='2020-03-14 13:19:04'
!!ObjTables type='Data' tableFormat='row' id='Person' name='People' date='2020-03-14 13:19:04' objTablesVersion='0.0.8'
!Name           | !Company | !Email address            | !Phone number
Mark Zuckerberg | Facebook | zuck@fb.com               | 650-543-4800
Reed Hastings   | Netflix  | reed.hastings@netflix.com | 408-540-3700
Sundar Pichai   | Google   | sundar@google.com         | 650-253-0000
Tim Cook        | Apple    | tcook@apple.com           | 408-996-1010

Python code for generating the dataset

cook = Person(
    name='Tim Cook',
    company='Apple',
    email_address='tcook@apple.com',
    phone_number='408-996-1010',
)

hastings = Person(
    name='Reed Hastings',
    company='Netflix',
    email_address='reed.hastings@netflix.com',
    phone_number='408-540-3700',
)

pichai = Person(
    name='Sundar Pichai',
    company='Google',
    email_address='sundar@google.com',
    phone_number='650-253-0000',
)

zuckerberg = Person(
    name='Mark Zuckerberg',
    company='Facebook',
    email_address='zuck@fb.com',
    phone_number='650-543-4800',
)
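
To show how such objects correspond to the rows of a tabular dataset, the sketch below writes plain dictionaries (stand-ins for `Person` instances) as tab-separated text with the `!!` and `!` markers. It uses only the standard library rather than the ObjTables writer; the helper name `to_tsv` and the `COLUMNS` mapping are hypothetical.

```python
import csv
import io

# Map each attribute name to its verbose name (column heading), as in the schema.
COLUMNS = {
    'name': 'Name',
    'company': 'Company',
    'email_address': 'Email address',
    'phone_number': 'Phone number',
}


def to_tsv(class_name, people):
    """Hypothetical helper: render dicts as an ObjTables-style TSV table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    # Table heading: names the class that the table represents.
    writer.writerow(
        ["!!ObjTables type='Data' tableFormat='row' id='%s'" % class_name])
    # Column headings: one `!`-prefixed heading per attribute.
    writer.writerow(['!' + heading for heading in COLUMNS.values()])
    # One row per object, with cells in the same order as the headings.
    for person in people:
        writer.writerow([person[attr] for attr in COLUMNS])
    return buffer.getvalue()


people = [{
    'name': 'Tim Cook',
    'company': 'Apple',
    'email_address': 'tcook@apple.com',
    'phone_number': '408-996-1010',
}]
tsv = to_tsv('Person', people)
print(tsv)
```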

Motivation

Despite numerous formats, software tools, and repositories, it remains difficult to re-use many types of data for additional investigation. In particular, it remains difficult to re-use most supplementary datasets to journal articles because they are often provided as Excel workbooks. While Excel is easy to use, Excel workbooks are difficult to re-use because Excel has limited support for multi-dimensional data, metadata, data validation, and analysis. Alternative formats, such as relational databases, which can handle multi-dimensional data, metadata, data validation, and analysis, are ill-suited to supplementary materials because they require substantially more knowledge, effort, and frequently complex software to utilize.

To make it easier to re-use data, ObjTables combines the ease of use of spreadsheets (e.g., Excel workbooks) with the rigor of schemas (i.e., a set of classes, their attributes, the type of each attribute, and the relationships among the classes) and structured metadata. This enables users to use Excel to view and edit arbitrarily complex datasets, and to use schemas and the ObjTables software tools to parse datasets and metadata into data structures suitable for analysis in languages such as Python with powerful tools such as NumPy, Pandas, scikit-learn, and SciPy.

To make it easier to build datasets, the ObjTables software can also use schemas to validate, merge, split, compare, revision, and migrate datasets. Together, these features also make ObjTables well-suited to emerging domains which need to create new formats for new types of data and associated software tools for viewing, editing, validating, and analyzing these formats with minimal effort. For example, ObjTables is well-suited to defining new formats for multi-omics datasets that combine genomics, transcriptomics, proteomics, biochemical, and microscopy data.

The ObjTables toolkit consists of a format for schemas for tabular datasets, numerous data types for scientific data and models, a markup format for tabular datasets (e.g., Excel workbooks), and software tools for using schemas to validate, compare, merge, split, revision, and migrate tabular datasets. ObjTables includes four software tools: a web application, a REST API, a command-line program, and a Python package.

This website provides an introduction to ObjTables, as well as links to use and obtain the ObjTables software and source code, examples, tutorials, and documentation.

Features

As described above, ObjTables was designed for cases, such as supplementary materials, where datasets need to be both human- and machine-readable, and for emerging domains which need to create new formats for new types of datasets (and associated software tools, including graphical tools for viewing and editing complex datasets) with minimal effort. ObjTables is well-suited to these use cases because it provides the functionality described below. ObjTables is well-suited to supplementary materials because it enables readers both to use Excel to view arbitrary datasets and to use schemas to parse arbitrary datasets into convenient data structures for further analysis. ObjTables is well-suited to developing data formats for emerging domains because it gives them the ability to quickly define precise formats for complex data, as well as user-friendly tools for creating, viewing, editing, validating, merging, comparing, revisioning, and migrating complex datasets.

Create Excel templates for building complex datasets

To make it easy to build datasets, the ObjTables software can generate template Excel workbooks for schemas with a table of contents, skeletons for the tables and columns, inline help, dropdown menus, and Excel validation.

Use Excel as a GUI for viewing and editing complex datasets

ObjTables enables users to leverage Excel as a graphical interface for viewing and editing complex datasets. ObjTables Excel datasets have the following features:

  • Table of contents: Datasets can include a worksheet that describes the data represented by each worksheet and provides hyperlinks to each worksheet.
  • Formatted class titles: Each worksheet includes a title bar that describes the data captured by the worksheet. The title bars are formatted, frozen, and protected from editing.
  • Formatted attribute headings: Each workbook includes headings for each column and group of columns. The headings are formatted, auto-filtered, frozen, and protected from editing.
  • Inline help for attributes: ObjTables uses Excel comments to embed help information about each attribute into its heading.
  • Select menus for enumerations and relationships: ObjTables provides dropdown menus for each attribute that represents an enumeration, a one-to-one relationship, or a many-to-one relationship.
  • Instant validation: ObjTables uses Excel to validate several basic properties of attributes. Note, due to the limitations of Excel, this provides limited validation. The ObjTables software provides more extensive validation.
  • Hidden extra rows and columns: To help users focus on their data, ObjTables hides all empty rows and columns.
  • Protection from unintentional editing: To help users avoid mistakes, ObjTables protects worksheets.

Iteratively build and revision complex datasets

The ObjTables software leverages Git to make it easy to build datasets iteratively, revision datasets, and track their provenance, including when each revision was made, who made it, and why it was made.

Iteratively build schemas and migrate complex datasets

To make it easy to build schemas iteratively, the ObjTables software can revision schemas, as well as migrate datasets between different versions of schemas (e.g., adding, removing, and renaming tables and columns).
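
The kind of change a migration applies can be sketched in a few lines of plain Python. This is only an illustration of adding, removing, and renaming columns on rows held as dictionaries; it does not use or reproduce the ObjTables migration tools, and `migrate_rows` and its parameters are hypothetical names.

```python
def migrate_rows(rows, renamed=None, removed=(), added=None):
    """Hypothetical helper: apply column renames, removals, and additions.

    rows:    list of dicts, one per object
    renamed: dict mapping old column names to new column names
    removed: collection of column names to drop
    added:   dict mapping new column names to default values
    """
    renamed = renamed or {}
    added = added or {}
    migrated = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key in removed:
                continue  # column no longer exists in the new schema
            new_row[renamed.get(key, key)] = value
        for key, default in added.items():
            new_row.setdefault(key, default)  # new column gets its default
        migrated.append(new_row)
    return migrated


rows = [{'name': 'Tim Cook', 'company': 'Apple', 'phone_number': '408-996-1010'}]
migrated = migrate_rows(
    rows,
    renamed={'company': 'employer'},
    removed={'phone_number'},
    added={'title': None},
)
print(migrated)
```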

Rigorously validate and quickly debug complex datasets

ObjTables makes it easy to validate and debug datasets at multiple levels:

  • Attribute validation: Validations of individual attributes can be defined declaratively. More complex validations can be implemented using the Python package.
  • Instance validation: Users can use the Python package to implement custom instance-level validation by customizing the `obj_tables.Model.validate` method of each class.
  • Class-level validation: Most attributes can be constrained to have unique values across all instances. Python modules that implement schemas can also capture tuples of attributes that must be unique across all instances of a class. See the Python documentation for more information.
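
The three levels above can be sketched with plain dictionaries and the standard library. This is an illustration of the idea only, not the ObjTables implementation (which, as noted above, hooks instance-level checks into `obj_tables.Model.validate`); the function names, the loose email pattern, and the rule that a person needs some contact detail are all assumptions made for the example.

```python
import re

# A deliberately loose email pattern for illustration; real validators are stricter.
EMAIL_PATTERN = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')


def validate_attributes(person):
    """Attribute-level checks: each value is checked on its own."""
    errors = []
    if not person.get('name'):
        errors.append('name is required')
    email = person.get('email_address')
    if email and not EMAIL_PATTERN.match(email):
        errors.append('email_address is not a valid email: %r' % email)
    return errors


def validate_instance(person):
    """Instance-level check (hypothetical rule): some contact detail is needed."""
    if not (person.get('email_address') or person.get('phone_number')):
        return ['%s has neither an email address nor a phone number'
                % person.get('name')]
    return []


def validate_unique(people, attr='name'):
    """Class-level check: values of `attr` must be unique across instances."""
    seen, errors = set(), []
    for person in people:
        value = person.get(attr)
        if value in seen:
            errors.append('duplicate %s: %r' % (attr, value))
        seen.add(value)
    return errors


people = [
    {'name': 'Tim Cook', 'email_address': 'tcook@apple.com'},
    {'name': 'Tim Cook', 'email_address': 'not-an-email'},
    {'name': 'Reed Hastings'},
]
errors = validate_unique(people)
for person in people:
    errors += validate_attributes(person) + validate_instance(person)
print('\n'.join(errors))
```

Collecting every error before reporting, rather than stopping at the first one, makes it quicker to debug a large workbook in one pass.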

Merge and split datasets

To help users build large datasets, the ObjTables software can merge datasets by identifying common objects, joining them, and concatenating their relationships to other objects. To help users break down datasets into smaller, more manageable pieces, the ObjTables software can split datasets by cutting relationships and identifying all of the resulting connected subsets of the dataset.
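
A minimal sketch of both operations, assuming a dataset is represented as a dictionary that maps each object's unique key to the set of keys of its related objects. This representation and the names `merge` and `split` are assumptions for illustration; the ObjTables software works on full objects rather than bare keys.

```python
def merge(dataset_a, dataset_b):
    """Join objects that share a key and combine their relationships."""
    merged = {}
    for dataset in (dataset_a, dataset_b):
        for key, related in dataset.items():
            merged.setdefault(key, set()).update(related)
    return merged


def split(dataset):
    """Return the connected subsets of a dataset.

    Relationships are treated as undirected edges. Any relationships that
    should be cut must be removed from `dataset` before calling this.
    """
    neighbors = {key: set() for key in dataset}
    for key, related in dataset.items():
        for other in related:
            neighbors.setdefault(other, set())
            neighbors[key].add(other)
            neighbors[other].add(key)

    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


a = {'Tim Cook': {'Apple'}, 'Apple': set()}
b = {'Reed Hastings': {'Netflix'}, 'Apple': set()}
merged = merge(a, b)
parts = sorted(sorted(part) for part in split(merged))
print(parts)
```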

Compare/difference datasets

To help users compare and review changes to datasets, the ObjTables software can determine if datasets are semantically equal and identify their differences.
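
One way to picture a semantic comparison is to match objects by their primary attribute rather than by row order. The sketch below does this for rows held as dictionaries; it is an illustration under that assumption, not the ObjTables comparison algorithm, and `difference` is a hypothetical name.

```python
def difference(old, new, key='name'):
    """Compare two lists of dicts, matching objects by their `key` value."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    changed = {}
    for k in old_by_key.keys() & new_by_key.keys():
        before, after = old_by_key[k], new_by_key[k]
        attrs = {
            attr: (before.get(attr), after.get(attr))
            for attr in set(before) | set(after)
            if before.get(attr) != after.get(attr)
        }
        if attrs:
            changed[k] = attrs

    return {
        'added': sorted(new_by_key.keys() - old_by_key.keys()),
        'removed': sorted(old_by_key.keys() - new_by_key.keys()),
        'changed': changed,
    }


old = [
    {'name': 'Tim Cook', 'company': 'Apple'},
    {'name': 'Reed Hastings', 'company': 'Netflix'},
]
new = [
    {'name': 'Reed Hastings', 'company': 'Netflix'},
    {'name': 'Tim Cook', 'company': 'Apple Inc.'},
    {'name': 'Sundar Pichai', 'company': 'Google'},
]
diff = difference(old, new)
reordered = difference(old, list(reversed(old)))
print(diff)
```

Because objects are matched by key, reordering the rows of a table does not register as a difference.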

Query and analyze complex datasets

The ObjTables Python package makes it easy to find objects in datasets and use Python to conduct complex analyses of datasets such as numerical simulations.

Pretty print datasets for publication

To make it easy to create files suitable for supplementary materials of journal articles, ObjTables can pretty print datasets with tables of contents, formatted table titles and column headings, and inline help.

Visualize schemas for datasets

To help users understand schemas, ObjTables can generate UML diagrams of schemas.

Components of the ObjTables toolkit

To make it easy to work with complex data, ObjTables provides a complete and coordinated set of tools for building, validating, analyzing, and sharing complex data. This includes a simple format for schemas for datasets, numerous data types for scientific research, a markup syntax for tabular-formatted datasets, and software tools for validating, merging, comparing, revisioning, and migrating datasets. For use cases which require additional flexibility, ObjTables also provides a Python library which can be used to implement custom data types and validations.

Format for schemas for tabular datasets

ObjTables schemas capture the format of each table, including the name and data type of each column, which cells represent relationships among the entries in the tables, and constraints on the value of each cell. ObjTables supports three modes of encoding relationships into cells in tables.

  • Columns for relationships among objects represented by entries in tables: Relationships from one (primary) object to other (related) objects can be captured by (a) incorporating a column that represents a unique key for each related object into the table that represents the related objects and (b) encoding the keys for the related objects as a comma-separated list into a column in the table that represents the primary objects.
  • Embedded tables for *-to-one relationships: To help users encode complex datasets into a minimal number of tables, ObjTables can also encode instances of related classes into groups of columns. ObjTables uses merged headings to distinguish these columns.
  • Embedded grammars for relationships: To help users encode complex datasets into a minimal number of tables, grammars can be used to encode instances of related classes into a single column. These grammars can be defined declaratively in EBNF format using Lark.
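
The first mode, a column of comma-separated keys, can be sketched as follows. The sketch assumes rows are held as dictionaries; `resolve_relationships`, the `companies` column, and the sample rows (including 'Example Person') are made up for illustration and are not part of ObjTables or its examples.

```python
def resolve_relationships(primary_rows, related_rows, column, key):
    """Replace comma-separated keys in `column` with the related rows."""
    related_by_key = {}
    for row in related_rows:
        if row[key] in related_by_key:
            raise ValueError('duplicate key: %r' % row[key])
        related_by_key[row[key]] = row

    resolved = []
    for row in primary_rows:
        keys = [k.strip() for k in row.get(column, '').split(',') if k.strip()]
        missing = [k for k in keys if k not in related_by_key]
        if missing:
            raise ValueError('unknown related objects: %s' % ', '.join(missing))
        # Copy the row, swapping the key list for the related rows themselves.
        resolved.append(dict(row, **{column: [related_by_key[k] for k in keys]}))
    return resolved


companies = [{'name': 'Apple'}, {'name': 'Google'}]
people = [
    {'name': 'Tim Cook', 'companies': 'Apple'},
    {'name': 'Example Person', 'companies': 'Apple, Google'},
]
resolved = resolve_relationships(people, companies, column='companies', key='name')
print(resolved[1]['companies'])
```

Requiring the key column to be unique is what makes the lookup unambiguous, which is why the sketch rejects duplicate keys and unknown references.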

Numerous data types

ObjTables provides numerous data types, including for mathematics, science, chemoinformatics, and genomics.

Markup format for tabular datasets (e.g., Excel workbooks)

The format includes syntax for declaring which cells represent each table, instance, and attribute; declaring which entries represent metadata such as the date that a table was updated; and declaring which entries represent comments.

Software tools for working with tabular datasets

ObjTables includes a web application, a REST API, a command-line program, and a Python package for working with datasets. These tools can be used to pretty print, validate, compare, revision, and migrate datasets.

Python package for additional flexibility

For more flexibility, the Python package can be used to incorporate custom data types, define custom validation, and query and analyze datasets.

Software tools

The ObjTables toolkit includes four software interfaces: a web application, a REST API, a command-line program, and a Python package. The web application, REST API, and command-line program provide the same features. In addition to these features, the Python package can programmatically query, edit, merge, split, revision, and migrate datasets. The Python package is also more flexible. For example, the Python package can support additional data types and custom validation.

We recommend beginning with the web application, REST API, or command-line program. We recommend using the Python package when more flexibility is required, such as a custom data type, or to analyze datasets with Python tools such as NumPy, Pandas, scikit-learn, or SciPy.

For convenience, a Dockerfile for building an Ubuntu image with the ObjTables tools is also available. The image can be used to run the web application, REST API, command-line program, or Python package. In addition, the ObjTables source code is available from GitHub.

Web application

A web application is available at objtables.org/app.

REST API

A REST API is available at objtables.org/api.

Command-line program

A command-line program is available from PyPI.

Python package

A Python package is available from PyPI.

The Python package provides more flexibility than the web application, REST API, and command-line program for custom data types and custom validation. The Python package is also best-suited for analyzing datasets.

Dockerfile

A Dockerfile for building a Docker image is available from GitHub.

Source code

The source code is available from GitHub.

Use cases

ObjTables was designed to help users share complex data with the ease of spreadsheets and the rigor of schemas. ObjTables excels at cases where datasets need to be both human and machine-readable. For example, ObjTables is well-suited to supplementary materials of journal articles where it is important to share materials in a human-readable format that doesn't require special domain-specific software (e.g., Excel workbooks), and where it is important to share materials in a format that enables their re-use for additional studies. ObjTables is also well-suited to emergent fields which need to quickly build new formats (and associated software tools) for new types of data. For example, we have used ObjTables to build formats for describing whole-cell models and the datasets needed to build and validate whole-cell models.

Publishing re-usable supplementary materials

Although supplementary materials often contain valuable data, supplementary materials are underutilized because they are often provided in custom formats that are difficult to understand, parse, and re-use.

ObjTables addresses this issue by enabling authors to publish materials in a tabular format which can easily be read by humans and computers: (a) ObjTables enables authors to pretty print their data with tables of contents and inline help, (b) ObjTables enables authors to provide schemas for parsing their data, and (c) ObjTables enables readers to use these schemas to parse and analyze published data with minimal effort. Together, this makes it easier for authors to publish supplementary materials that are easy for others to re-use for additional studies.

Sharing re-usable data and models

Researchers often need to send their collaborators new datasets and models that cannot be described in any existing format. This often requires collaborators to write custom code to parse these custom datasets and models. The substantial effort needed to write this code is a frequent barrier to collaboration.

ObjTables makes it easier to share re-usable data and models with collaborators by (a) enabling researchers to rigorously describe the structure of their data or model with a schema, (b) enabling researchers to capture metadata about their data or model, (c) providing researchers software tools for validating their data, and (d) enabling collaborators to use these schemas to parse data from their colleagues quickly.

Building, validating and analyzing complex datasets and models

Many fields aim to understand how behaviors emerge from complex networks. This often requires integrating diverse data about different parts of the network. For example, systems biology aims to understand how cellular behavior emerges from genotype, often using genomics, biochemical, and other data. Excel is a popular tool for merging data because it's flexible and easy to use. However, Excel only supports a few data types, and Excel has limited support for multi-dimensional data. In addition, it is difficult to debug and analyze Excel workbooks.

By combining Excel with schemas, ObjTables makes it easy to build, validate, and analyze complex datasets: (a) users can use Excel to assemble diverse data into tables, (b) users can quickly define schemas for their data, and (c) users can use these schemas to validate their data and parse their data into data structures suitable for further analysis in languages such as Python. For example, we have used ObjTables to build integrated datasets of the biochemistry of Mycoplasma pneumoniae and H1 human embryonic stem cells.

ObjTables also makes it easy to build datasets iteratively over time by helping users revision data with Git and migrate their data as they revise their schemas.

Defining formats for new types of data and models

New areas of science often require new types of data and new kinds of models. In turn, this often requires new formats to capture these data and models and new software for working with these formats, including new tools for parsing and validating data and models described in these formats. Creating these formats is often an obstacle for new domains that have limited resources. Evolving these formats as new approaches emerge is also challenging because this often requires updating the software tools for the format and converting old files to the revised format.

ObjTables addresses this issue by making it easy to define schemas for domain-specific data and providing software tools for parsing, manipulating, and validating data encoded in these schemas. For example, we have used ObjTables to create WC-KB, a format for the experimental omics, biochemical, and physiological data needed to model cellular biochemistry. We have also used ObjTables to create WC-Lang, a format for whole-cell models of all of the biochemical activity in a cell. Creating these formats required minimal code.

Examples, tutorials, documentation, and help

Extensive examples, interactive tutorials, and documentation for the ObjTables formats and software tools are available through the links below. Please contact us for further help.

Examples

The documentation contains several example schemas and datasets.

Command-line program installation

Installation instructions for the command-line program are available at docs.karrlab.org. A Dockerfile for building an Ubuntu Linux image with ObjTables is available from the ObjTables Git repository.

Python package installation

Installation instructions for the Python package are available at docs.karrlab.org. A Dockerfile for building an Ubuntu Linux image with ObjTables is available from the ObjTables Git repository.

Tutorials for the Python package

A Jupyter notebook with interactive tutorials is available at sandbox.karrlab.org.

Docs for the schema and dataset formats

Documentation for the formats for schemas and the formats for datasets is available at objtables.org/docs.

Docs for the REST API

Documentation for the REST API is available at objtables.org/api.

Docs for the command-line program

Documentation for the command-line program is available inline by running obj-tables --help.

Docs for the Python package

An introduction to the Python package is available at objtables.org/docs. Detailed documentation is available at docs.karrlab.org.

Further help

Please contact the Karr Lab with any questions.