ObjTables is a toolkit that makes it easy to work with complex datasets by combining spreadsheets (e.g., Excel workbooks) with schemas (classes, their attributes, the type of each attribute, and the possible relationships between instances of classes) and an object-relational mapping system (similar to Active Record, Doctrine, Django, Hibernate, Propel, SQLAlchemy, etc.).
Together, these components make it easy to create datasets that are both human- and machine-readable, and that can be easily shared with others, compared with similar datasets, re-used for additional analyses, and composed into more comprehensive datasets.
ObjTables consists of a format for describing schemas for spreadsheets; numerous data types for scientific data; a syntax for indicating the class and attribute represented by each table and column in a spreadsheet; software for using schemas to rigorously validate, merge, split, compare, and revision datasets; and software for using schemas to parse datasets into objects for further analysis in programming languages such as Python.
ObjTables is ideal for supplementary materials of journal articles, as well as for emerging domains which need to quickly build new formats for new types of data and associated software with minimal effort.
An ObjTables schema defines the types of objects that the workbook can represent (the tables of the workbook), their attributes (the columns of each table), the type of each attribute (e.g., Boolean, integer, float, string, etc.), and the possible relationships between the instances of the objects (rows of the tables). Optionally, schemas can also define which attributes are required and how each attribute, object, and the entire dataset should be validated. As illustrated below, schemas can be defined using our simple tabular format, or as a Python module.
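To make the idea of required attributes and attribute-level and dataset-level checks concrete, here is a minimal standalone sketch in plain Python. It does not use the ObjTables package: `AttributeSpec`, `validate`, and the simple email pattern are hypothetical names and choices made only for this illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class AttributeSpec:
    """Hypothetical description of one column; not an ObjTables class."""
    name: str
    py_type: type = str
    required: bool = False
    unique: bool = False
    pattern: str = ""  # optional regular expression the value must match


def validate(rows, specs):
    """Return a list of error messages for a list of row dictionaries."""
    errors = []
    # Values seen so far for each unique attribute (a dataset-level check).
    seen = {spec.name: set() for spec in specs if spec.unique}
    for i_row, row in enumerate(rows, start=1):
        for spec in specs:
            value = row.get(spec.name)
            if value is None or value == "":
                if spec.required:
                    errors.append(f"row {i_row}: '{spec.name}' is required")
                continue
            if not isinstance(value, spec.py_type):
                errors.append(
                    f"row {i_row}: '{spec.name}' must be {spec.py_type.__name__}")
                continue
            if spec.pattern and not re.fullmatch(spec.pattern, str(value)):
                errors.append(f"row {i_row}: '{spec.name}' has an invalid format")
            if spec.unique:
                if value in seen[spec.name]:
                    errors.append(f"row {i_row}: '{spec.name}' must be unique")
                seen[spec.name].add(value)
    return errors


person_specs = [
    AttributeSpec("name", required=True, unique=True),
    AttributeSpec("company"),
    AttributeSpec("email_address", pattern=r"[^@\s]+@[^@\s]+\.[^@\s]+"),
]

rows = [
    {"name": "Tim Cook", "company": "Apple", "email_address": "tcook@apple.com"},
    {"name": "Tim Cook", "company": "Apple", "email_address": "not-an-email"},
    {"name": "", "company": "Netflix"},
]

errors = validate(rows, person_specs)
for error in errors:
    print(error)
```

The sketch reports every problem it finds rather than stopping at the first one, which tends to be more useful when debugging a spreadsheet by hand.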
An ObjTables dataset is a workbook or collection of tables. The workbook defines the objects that constitute the dataset (the rows of the tables), the attributes of each object (the values of the cells), and the relationships between the objects (the values of the cells that represent relationships). As illustrated below, each table must be marked with a statement that begins with !! that indicates the class represented by the table. Similarly, each column must be marked with a heading that begins with ! that indicates the attribute represented by the column. These statements describe the link between workbooks and their schemas.
The following illustrates a schema for an address book of people and their employers, and an example address book of the CEOs of several technology companies. The documentation contains more complete examples that illustrate additional features of ObjTables, and includes Excel, CSV, TSV, JSON, YAML, and Python files for these examples.
!!ObjTables type='Schema' tableFormat='row' description='Table/model and column/attribute definitions' date='2020-03-10 21:34:50' objTablesVersion='0.0.8' | ||||||
---|---|---|---|---|---|---|
!Name | !Type | !Parent | !Format | !Verbose name | !Verbose name plural | !Description |
Person | Class | | row | Person | People | |
name | Attribute | Person | String(primary=True, unique=True) | Name | | |
company | Attribute | Person | String | Company | | |
email_address | Attribute | Person | Email | Email address | | |
phone_number | Attribute | Person | String | Phone number | | |

    import obj_tables


    class Person(obj_tables.Model):
        name = obj_tables.StringAttribute(
            primary=True,
            unique=True,
            verbose_name='Name')
        company = obj_tables.StringAttribute(
            verbose_name='Company')
        email_address = obj_tables.EmailAttribute(
            verbose_name='Email address')
        phone_number = obj_tables.StringAttribute(
            verbose_name='Phone number')

        class Meta(obj_tables.Model.Meta):
            table_format = obj_tables.TableFormat.row
            attribute_order = (
                'name',
                'company',
                'email_address',
                'phone_number',
            )
            verbose_name = 'Person'
            verbose_name_plural = 'People'

!!!ObjTables objTablesVersion='0.0.8' date='2020-03-14 13:19:04' |
---|
!!ObjTables type='Data' tableFormat='row' id='Person' name='People' date='2020-03-14 13:19:04' objTablesVersion='0.0.8' | |||
---|---|---|---|
!Name | !Company | !Email address | !Phone number |
Mark Zuckerberg | Facebook | zuck@fb.com | 650-543-4800 |
Reed Hastings | Netflix | reed.hastings@netflix.com | 408-540-3700 |
Sundar Pichai | Google | sundar@google.com | 650-253-0000 |
Tim Cook | Apple | tcook@apple.com | 408-996-1010 |

    cook = Person(
        name='Tim Cook',
        company='Apple',
        email_address='tcook@apple.com',
        phone_number='408-996-1010',
    )
    hastings = Person(
        name='Reed Hastings',
        company='Netflix',
        email_address='reed.hastings@netflix.com',
        phone_number='408-540-3700',
    )
    pichai = Person(
        name='Sundar Pichai',
        company='Google',
        email_address='sundar@google.com',
        phone_number='650-253-0000',
    )
    zuckerberg = Person(
        name='Mark Zuckerberg',
        company='Facebook',
        email_address='zuck@fb.com',
        phone_number='650-543-4800',
    )

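To show how the `!!` table statement and the `!` column headings tie a table to its schema, the following is a rough standalone sketch that reads a tab-separated table into plain dictionaries. It is not the ObjTables parser: `parse_table` is a hypothetical helper, and the sketch assumes that the first non-empty line is the `!!` statement and the second is the heading row.

```python
import re

# A tab-separated version of part of the example table (date and version omitted).
TABLE = """\
!!ObjTables type='Data' tableFormat='row' id='Person' name='People'
!Name\t!Company\t!Email address\t!Phone number
Tim Cook\tApple\ttcook@apple.com\t408-996-1010
Reed Hastings\tNetflix\treed.hastings@netflix.com\t408-540-3700
"""


def parse_table(text):
    """Split a marked-up table into metadata, column headings, and records."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines[0].startswith("!!"):
        raise ValueError("table must begin with a '!!' statement")
    # Collect key='value' pairs from the '!!' statement, e.g. id='Person'.
    metadata = dict(re.findall(r"(\w+)='([^']*)'", lines[0]))
    headings = lines[1].split("\t")
    if not all(heading.startswith("!") for heading in headings):
        raise ValueError("each column heading must begin with '!'")
    headings = [heading[1:] for heading in headings]
    # Pair each cell with the heading of its column.
    records = [dict(zip(headings, line.split("\t"))) for line in lines[2:]]
    return metadata, headings, records


metadata, headings, records = parse_table(TABLE)
print(metadata["id"], headings)
print(records[0])
```

Here the `id` in the `!!` statement names the class (`Person`), and each heading names one of its attributes by its verbose name, which is what would let a schema-aware reader build objects like those shown above.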
Despite numerous formats, software tools, and repositories, it remains difficult to re-use many types of data for additional investigation. In particular, it remains difficult to re-use most supplementary datasets to journal articles because they are often provided as Excel workbooks. While Excel is easy to use, Excel workbooks are difficult to re-use because Excel has limited support for multi-dimensional data, metadata, data validation, and analysis. Alternative formats such as relational databases, which can handle multi-dimensional data, metadata, data validation, and analysis, are ill-suited to supplementary materials because they require substantially more knowledge and effort, and frequently complex software, to use.
To make it easier to re-use data, ObjTables combines the ease of use of spreadsheets (e.g., Excel workbooks) with the rigor of schemas (i.e., a set of classes, their attributes, the type of each attribute, and the relationships among the classes) and structured metadata. This enables users to use Excel to view and edit arbitrarily complex datasets, and to use schemas and the ObjTables software tools to parse datasets and metadata into data structures suitable for analysis in languages such as Python with powerful tools such as NumPy, Pandas, scikit-learn, and SciPy.
To make it easier to build datasets, the ObjTables software can also use schemas to validate, merge, split, compare, revision, and migrate datasets. Together, these features also make ObjTables well-suited to emerging domains which need to create new formats for new types of data and associated software tools for viewing, editing, validating, and analyzing these formats with minimal effort. For example, ObjTables is well-suited to defining new formats for multi-omics datasets that combine genomics, transcriptomics, proteomics, biochemical, and microscopy data.
The ObjTables toolkit consists of a format for schemas for tabular datasets, numerous data types for scientific data and models, a markup format for tabular datasets (e.g., Excel workbooks), and software tools for using schemas to validate, compare, merge, split, revision, and migrate tabular datasets. ObjTables includes four software tools: a web application, a REST API, a command-line program, and a Python package.
This website provides an introduction to ObjTables, as well as links to use and obtain the ObjTables software and source code, examples, tutorials, and documentation.
As described above, ObjTables was designed for cases, such as supplementary materials, where datasets need to be both human- and machine-readable, and for emerging domains which need to create new formats for new types of datasets (and associated software tools, including graphical tools for viewing and editing complex datasets) with minimal effort. ObjTables is well-suited to these use cases because it provides the functionality described below. For supplementary materials, it enables readers both to use Excel to view arbitrary datasets and to use schemas to parse those datasets into convenient data structures for further analysis. For emerging domains, it provides the ability to quickly define precise formats for complex data, as well as user-friendly tools for creating, viewing, editing, validating, merging, comparing, revisioning, and migrating complex datasets.
To make it easy to build datasets, the ObjTables software can generate template Excel workbooks for schemas with a table of contents, skeletons for the tables and columns, inline help, dropdown menus, and Excel validation.
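As a rough illustration of the skeleton part of such a template, the sketch below writes the `!!` table statement and the `!` column headings for one class as tab-separated text. It is only an assumption-laden stand-in: `make_template` is a hypothetical helper, it omits the date and version fields, and it does not produce an Excel workbook, a table of contents, inline help, or dropdown menus.

```python
def make_template(model_id, verbose_name_plural, column_names):
    """Return an empty, tab-separated table skeleton for one class."""
    statement = (
        "!!ObjTables type='Data' tableFormat='row' "
        f"id='{model_id}' name='{verbose_name_plural}'"
    )
    # Every column heading is marked with a leading '!'.
    headings = "\t".join("!" + name for name in column_names)
    return statement + "\n" + headings + "\n"


template = make_template(
    "Person",
    "People",
    ["Name", "Company", "Email address", "Phone number"],
)
print(template)
```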
ObjTables enables users to leverage Excel as a graphical interface for viewing and editing complex datasets. ObjTables Excel datasets have the following features:
The ObjTables software leverages Git to make it easy to build datasets iteratively, revision datasets, and track their provenance, including when each revision was made, who made it, and why it was made.
To make it easy to build schemas iteratively, the ObjTables software can revision schemas, as well as migrate datasets between different versions of schemas (e.g., adding, removing, and renaming tables and columns).
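The kind of change a migration applies can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the ObjTables migration tool: `migrate` and its `renamed`, `removed`, and `added` parameters are hypothetical, and records are modeled as simple dictionaries.

```python
def migrate(records, renamed=None, removed=(), added=None):
    """Return new records with columns renamed, removed, and added."""
    renamed = renamed or {}
    added = added or {}
    migrated = []
    for record in records:
        new_record = {}
        for column, value in record.items():
            if column in removed:
                continue  # drop columns that no longer exist in the schema
            new_record[renamed.get(column, column)] = value
        for column, default in added.items():
            # New columns start with a default value unless already present.
            new_record.setdefault(column, default)
        migrated.append(new_record)
    return migrated


old = [{"name": "Tim Cook", "company": "Apple", "fax": ""}]
new = migrate(
    old,
    renamed={"company": "employer"},
    removed={"fax"},
    added={"title": None},
)
print(new)
```

The function builds new records instead of editing the old ones, so the original dataset remains available for comparison with the migrated one.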
ObjTables makes it easy to validate and debug datasets at multiple levels:
To help users build large datasets, the ObjTables software can merge datasets by identifying common objects, joining them, and concatenating their relationships to other objects. To help users break down datasets into smaller, more manageable pieces, the ObjTables software can split datasets by cutting relationships and identifying all of the resulting connected subsets of the dataset.
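Both operations can be pictured on a small graph of objects. In the sketch below, which is an assumption-based illustration rather than the ObjTables implementation, a dataset is a dictionary that maps each object's identifier to the identifiers of related objects; `merge` joins objects that share an identifier and unions their relationships, and `split` removes the listed relationships and returns the connected subsets that remain. Both function names and the data layout are hypothetical.

```python
def merge(*datasets):
    """Merge {id: related ids} mappings, joining objects that share an id."""
    merged = {}
    for dataset in datasets:
        for obj_id, related in dataset.items():
            merged.setdefault(obj_id, set()).update(related)
    return merged


def split(dataset, cut=()):
    """Cut the given (id, id) relationships and return the connected subsets."""
    cut = {frozenset(edge) for edge in cut}
    # Build an undirected adjacency map that skips the cut relationships.
    neighbors = {}
    for obj_id, related in dataset.items():
        neighbors.setdefault(obj_id, set())
        for other in related:
            neighbors.setdefault(other, set())
            if frozenset((obj_id, other)) not in cut:
                neighbors[obj_id].add(other)
                neighbors[other].add(obj_id)
    # Depth-first search from each unvisited object finds one subset at a time.
    subsets, visited = [], set()
    for start in sorted(neighbors):
        if start in visited:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        visited |= component
        subsets.append(component)
    return subsets


a = {"Tim Cook": {"Apple"}, "Apple": set()}
b = {"Apple": {"Cupertino"}, "Reed Hastings": {"Netflix"}}

merged = merge(a, b)
whole = split(merged)
pieces = split(merged, cut=[("Tim Cook", "Apple")])
print(len(whole), len(pieces))
```

In this example, "Apple" appears in both inputs, so the merged dataset has a single "Apple" object that carries the relationships from both.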
To help users compare and review changes to datasets, the ObjTables software can determine if datasets are semantically equal and identify their differences.
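One way to think about semantic equality is that two datasets are equal when they contain the same objects with the same attribute values, regardless of the order of rows or columns. The following sketch illustrates that reading with plain dictionaries; `difference` and `is_equal` are hypothetical helpers written for this example, and it assumes each record has a unique `name` that serves as its key.

```python
def difference(left, right, key="name"):
    """Describe how two lists of records differ, ignoring row and column order."""
    left_by_key = {record[key]: record for record in left}
    right_by_key = {record[key]: record for record in right}
    report = {
        "only_in_left": sorted(left_by_key.keys() - right_by_key.keys()),
        "only_in_right": sorted(right_by_key.keys() - left_by_key.keys()),
        "changed": {},
    }
    for k in sorted(left_by_key.keys() & right_by_key.keys()):
        a, b = left_by_key[k], right_by_key[k]
        # Record (left value, right value) for each attribute that differs.
        changes = {
            attr: (a.get(attr), b.get(attr))
            for attr in sorted(a.keys() | b.keys())
            if a.get(attr) != b.get(attr)
        }
        if changes:
            report["changed"][k] = changes
    return report


def is_equal(left, right, key="name"):
    """Return True if the two datasets hold the same objects and values."""
    report = difference(left, right, key)
    return not (report["only_in_left"] or report["only_in_right"] or report["changed"])


v1 = [
    {"name": "Tim Cook", "company": "Apple"},
    {"name": "Reed Hastings", "company": "Netflix"},
]
v2 = [
    {"company": "Netflix", "name": "Reed Hastings"},
    {"name": "Tim Cook", "company": "Apple Inc."},
    {"name": "Sundar Pichai", "company": "Google"},
]

report = difference(v1, v2)
print(is_equal(v1, list(reversed(v1))))
print(report)
```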
The ObjTables Python package makes it easy to find objects in datasets and use Python to conduct complex analyses of datasets such as numerical simulations.
To make it easy to create files suitable for supplementary materials of journal articles, ObjTables can pretty print datasets with tables of contents, formatted table titles and column headings, and inline help.
To help users understand schemas, ObjTables can generate UML diagrams of schemas.
To make it easy to work with complex data, ObjTables provides a complete and coordinated set of tools for building, validating, analyzing, and sharing complex data. This includes a simple format for schemas for datasets, numerous data types for scientific research, a markup syntax for tabular-formatted datasets, and software tools for validating, merging, comparing, revisioning, and migrating datasets. For use cases which require additional flexibility, ObjTables also provides a Python library which can be used to implement custom data types and validations.
ObjTables schemas capture the format of each table, including the name and data type of each column, which cells represent relationships among the entries in the tables, and constraints on the value of each cell. ObjTables supports three modes of encoding relationships into cells in tables.
The ObjTables toolkit includes four software interfaces: a web application, a REST API, a command-line program, and a Python package. The web application, REST API, and command-line program provide the same features. In addition to these features, the Python package can programmatically query, edit, merge, split, revision, and migrate datasets. The Python package is also more flexible. For example, it can support additional data types and custom validation.
We recommend beginning with the web application, REST API, or command-line program. We recommend using the Python package when more flexibility is required, such as a custom data type, or to analyze datasets with Python tools such as NumPy, Pandas, scikit-learn, or SciPy.
For convenience, a Dockerfile for building an Ubuntu image with the ObjTables tools is also available. The image can be used to run the web application, REST API, command-line program, or Python package. In addition, the ObjTables source code is available from GitHub.
ObjTables was designed to help users share complex data with the ease of spreadsheets and the rigor of schemas. ObjTables excels at cases where datasets need to be both human- and machine-readable. For example, ObjTables is well-suited to supplementary materials of journal articles, where it is important to share materials in a human-readable format (e.g., Excel workbooks) that doesn't require special domain-specific software, and in a format that enables their re-use for additional studies. ObjTables is also well-suited to emerging fields which need to quickly build new formats (and associated software tools) for new types of data. For example, we have used ObjTables to build formats for describing whole-cell models and the datasets needed to build and validate whole-cell models.
Although supplementary materials often contain valuable data, supplementary materials are underutilized because they are often provided in custom formats that are difficult to understand, parse, and re-use.
ObjTables addresses this issue by enabling authors to publish materials in a tabular format which can easily be read by humans and computers: (a) ObjTables enables authors to pretty print their data with tables of contents and inline help, (b) ObjTables enables authors to provide schemas for parsing their data, and (c) ObjTables enables readers to use these schemas to parse and analyze published data with minimal effort. Together, this makes it easier for authors to publish supplementary materials that are easy for others to re-use for additional studies.
Researchers often need to send their collaborators new datasets and models that cannot be described in any existing format. This often requires collaborators to write custom code to parse these custom datasets and models. The substantial effort needed to write this code is a frequent barrier to collaboration.
ObjTables makes it easier to share re-usable data and models with collaborators by (a) enabling researchers to rigorously describe the structure of their data or model with a schema, (b) enabling researchers to capture metadata about their data or model, (c) providing researchers software tools for validating their data, and (d) enabling collaborators to use these schemas to parse data from their colleagues quickly.
Many fields aim to understand how behaviors emerge from complex networks. This often requires integrating diverse data about different parts of the network. For example, systems biology aims to understand how cellular behavior emerges from genotype, often using genomics, biochemical, and other data. Excel is a popular tool for merging data because it's flexible and easy to use. However, Excel only supports a few data types, and Excel has limited support for multi-dimensional data. In addition, it is difficult to debug and analyze Excel workbooks.
By combining Excel with schemas, ObjTables makes it easy to build, validate, and analyze complex datasets: (a) users can use Excel to assemble diverse data into tables, (b) users can quickly define schemas for their data, and (c) users can use these schemas to validate their data and parse their data into data structures suitable for further analysis in languages such as Python. For example, we have used ObjTables to build integrated datasets of the biochemistry of Mycoplasma pneumoniae and H1 human embryonic stem cells.
ObjTables also makes it easy to build datasets iteratively over time by helping users revision data with Git and migrate their data as they revise their schemas.
New areas of science often require new types of data and new kinds of models. In turn, this often requires new formats to capture these data and models and new software for working with these formats, including new tools for parsing and validating data and models described in these formats. Creating these formats is often an obstacle for new domains that have limited resources. Evolving these formats as new approaches emerge is also challenging because this often requires updating the software tools for the format and converting old files to the revised format.
ObjTables addresses this issue by making it easy to define schemas for domain-specific data and providing software tools for parsing, manipulating, and validating data encoded in these schemas. For example, we have used ObjTables to create WC-KB, a format for the experimental omics, biochemical, and physiological data needed to model cellular biochemistry. We have also used ObjTables to create WC-Lang, a format for whole-cell models of all of the biochemical activity in a cell. Creating these formats required minimal code.
Extensive examples, interactive tutorials, and documentation for the ObjTables formats and software tools are available through the links below. Please contact us for further help.