ObjTables: a toolkit for parsing and validating tables with relational schemas

ObjTables is a toolkit for working with complex data as collections of tables. It combines the ease of use and flexibility of Excel with the rigor and power of defined schemas.

ObjTables makes it easy to

  • Use collections of tables (e.g., an Excel workbook) as an interface to view and edit complex datasets that describe relationships and other complex datatypes,
  • Use embedded tables and grammars to encode relational information into columns and groups of columns of tables,
  • Define schemas for collections of tables,
  • Use these schemas to validate collections of tables and parse them into Python data structures for analysis,
  • Conduct operations on complex datasets such as comparing and merging datasets, and
  • Edit schemas and migrate datasets to new versions of schemas.

The ObjTables toolkit includes five components:

  • Tabular format for describing schemas for datasets encoded as collections of tables. This includes the data type of each column, relationships between rows in tables, and two methods for encoding complex relational information into cells:
    • Embedded tables for *-to-one relationships: To help users encode complex relationships into a minimal number of tables, optionally, ObjTables can encode related models into groups of columns. ObjTables uses merged headings to distinguish these columns.
    • Embedded grammars for relationships: To help users encode complex relationships into a minimal number of tables, optionally, grammars can be used to encode relationships to multiple instances of multiple models within a single column. These grammars can be defined declaratively in EBNF format using Lark.
  • Numerous data types for scientific research including types for chemoinformatics, genomics, and mathematics.
  • Software tools for generating Excel workbooks for editing datasets according to schemas. This enables users to use Excel as a graphical user interface for editing datasets. As outlined below, this leverages a variety of features of Excel.
  • Software tools for parsing and validating datasets according to schemas.
  • Software tools for manipulating complex datasets, including creating, analyzing, comparing, merging, and migrating datasets.

ObjTables enables users to leverage Excel as a graphical user interface for viewing and editing complex datasets. Excel-encoded datasets have the following features:

  • Table of contents: Each Excel-encoded dataset includes a worksheet that describes the contents of each worksheet, displays the number of instances of each model, and provides hyperlinks to the worksheets that encode the instances of each model.
  • Formatted model titles: Each worksheet that encodes a model includes a title bar which describes the model. These title bars are formatted, frozen, and protected from editing.
  • Formatted attribute headings: Each worksheet that encodes a model includes headings for each column and group of columns. These headings are formatted, auto-filtered, frozen, and protected from editing.
  • Inline help for attributes: ObjTables uses Excel comments to embed help information about each attribute into its heading.
  • Select menus for enumerations and relationships: ObjTables provides select menus for each attribute that encodes an enumeration, a one-to-one relationship, or a many-to-one relationship.
  • Instant validation: ObjTables uses Excel to validate several properties of attributes. Note that, because of the limitations of Excel, this validation is limited. ObjTables Python schemas can be used to implement thorough validations.
  • Hidden extra rows and columns: To help users focus on the attributes of their models, ObjTables protects and hides all additional rows and columns.

ObjTables makes it easy to implement rigorous validations of datasets:

  • Attribute validation: Validations of individual attributes can be defined declaratively (e.g. string(min_length=8)). More complex validations can be defined using a Python schema or by implementing custom types of attributes.
  • Instance validation: Users can implement custom instance-level validations by creating a Python module which implements a schema and implementing the validate methods of the classes.
  • Model-level validation: Most attributes can be constrained to have unique values across all instances (e.g., string(unique=True)). Python modules which implement schemas can also capture tuples of attributes which must be unique across all instances of a model. See the documentation for more information.
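
The three levels of validation listed above can be pictured with a small, library-free sketch. The function names and the record layout below are hypothetical illustrations and are not the ObjTables API; they only mirror the ideas of a declarative check such as string(min_length=8), a custom check on one instance, and a uniqueness check across all instances of a model.

```python
# Hypothetical, library-free sketch; none of these names come from ObjTables.

def validate_attribute(value, min_length=8):
    """Attribute level: mirror a declarative check such as string(min_length=8)."""
    if not isinstance(value, str) or len(value) < min_length:
        return 'value must be a string of at least {} characters'.format(min_length)
    return None


def validate_instance(record):
    """Instance level: a custom rule that spans several attributes of one record."""
    if record['end'] < record['start']:
        return 'end must not be before start'
    return None


def validate_unique(records, key):
    """Model level: list the values of `key` that occur in more than one record."""
    seen, duplicates = set(), []
    for record in records:
        value = record[key]
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


# Illustrative records (made up for this sketch)
records = [
    {'id': 'feature_0001', 'start': 10, 'end': 20},
    {'id': 'feature_0002', 'start': 30, 'end': 25},
    {'id': 'feature_0001', 'start': 40, 'end': 50},
]
attribute_errors = [validate_attribute(r['id']) for r in records]
instance_errors = [validate_instance(r) for r in records]
duplicate_ids = validate_unique(records, 'id')
```

In this sketch each function returns None (or an empty list) when the check passes, so the three kinds of errors can be collected and reported together.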

ObjTables provides four user interfaces to the software tools:

  • Web app (below): The web app enables users to use ObjTables without having to install any software.
  • REST API: The REST API enables users to use ObjTables programmatically without having to install any software.
  • Command line interface: The command line interface enables users to use ObjTables without having to upload data to this website.
  • Python library: The Python library enables users to extend ObjTables with custom attributes and validation and use ObjTables to analyze complex datasets.

ObjTables is available open-source under the MIT license.

ObjTables was developed to implement languages for describing whole-cell computational models and the data needed to build and verify them.

Web app

Tabular format datasets

File formats

ObjTables supports three formats for collections of tables, or datasets:

  • Excel workbook: Each worksheet contains a table which encodes the instances of a model. The names of the worksheets should be ! followed by the name of a model in the schema (e.g., !ModelName).
  • Set of CSV or TSV files: Each file contains a table which encodes the instances of a model. The names of the files should follow the pattern dir/!*.csv (e.g., Doc/!ModelName.tsv). Sets of CSV and TSV files can be uploaded to this server as a zip archive.
  • Single text file with multiple CSV or TSV formatted tables: Each table within the dataset encodes the instances of a model.
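
As a rough illustration of the naming convention for sets of CSV or TSV files, the sketch below recovers the model name from a file path. The helper `model_name_from_path` is a hypothetical name chosen for this example and is not part of ObjTables; the sketch assumes that any file whose name does not start with ! is simply not a model table.

```python
import os


def model_name_from_path(path):
    """Return the model name encoded in a path such as 'Doc/!ModelName.tsv'.

    Returns None for files that do not follow the !<ModelName>.csv/.tsv pattern.
    Hypothetical helper for illustration; not part of ObjTables.
    """
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext.lower() not in ('.csv', '.tsv'):
        return None
    if not stem.startswith('!') or len(stem) == 1:
        return None
    return stem[1:]


paths = ['Doc/!Parent.tsv', 'Doc/!Child.tsv', 'Doc/notes.tsv', 'Doc/!Parent.txt']
models = {path: model_name_from_path(path) for path in paths}
```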

Model metadata

The first row in each table must describe the version of ObjTables used by the table and the model which the table represents (e.g., !!ObjTables ObjTablesVersion='2.0' TableID='<model_name>').

Optionally, the first row can contain additional key-value pairs which represent additional metadata such as a description of the table and the date when the table was updated (e.g., !!ObjTables ObjTablesVersion='2.0' TableID='model_name' Description='Model description').
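
The metadata row can be read with a short regular expression. The following is a minimal sketch rather than the parser that ObjTables uses: `parse_header` is a hypothetical name, and the sketch assumes that every value is wrapped in single quotes and contains no embedded quotes.

```python
import re

# A metadata row: '!!' + format name, followed by key='value' pairs.
HEADER_PATTERN = re.compile(r"^!!(\w+)((?:\s+\w+='[^']*')*)\s*$")
PAIR_PATTERN = re.compile(r"(\w+)='([^']*)'")


def parse_header(line):
    """Split a metadata row into its format name and a dict of key-value pairs.

    Hypothetical helper for illustration; assumes single-quoted values.
    """
    match = HEADER_PATTERN.match(line.strip())
    if match is None:
        raise ValueError('not a metadata row: {!r}'.format(line))
    return match.group(1), dict(PAIR_PATTERN.findall(match.group(2)))


format_name, metadata = parse_header(
    "!!ObjTables ObjTablesVersion='2.0' TableID='model_name' "
    "Description='Model description'")
```

Here `metadata['TableID']` identifies the model that the table represents, and any additional pairs, such as the description, are kept alongside it.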

Model instances (rows) and attributes (columns)

The attributes of each model are represented by the columns of its table, each instance of a model is represented by a row of its table, and the attributes of each instance are encoded into the cells of the table. The first rows in the table should define the attributes represented by each column; each cell should contain a ! followed by the name of an attribute in the schema (e.g., !Id).
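
The mapping from rows and columns to instances and attributes can be sketched in a few lines of plain Python. `read_table` is a hypothetical helper and not the ObjTables reader; it assumes a single heading row, that the metadata row has already been removed, and that columns whose headings do not begin with ! should be ignored.

```python
def read_table(rows):
    """Turn a heading row plus data rows into one dict per instance.

    Hypothetical helper for illustration. `rows` is a list of lists of cell
    strings whose first entry is the heading row.
    """
    headings = rows[0]
    # Keep only the columns whose heading starts with '!'
    columns = [(index, heading[1:])
               for index, heading in enumerate(headings)
               if heading.startswith('!')]
    return [{name: row[index] for index, name in columns} for row in rows[1:]]


rows = [
    ['!Id', '!Name', 'Notes'],
    ['jane_doe', 'Jane Doe', 'checked'],
    ['john_doe', 'John Doe', ''],
]
instances = read_table(rows)
```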

Encoding multiple models in a single table

To help users encode complex relationships into a minimal number of tables, relationships between models can be encoded into groups of columns and individual cells.

  • Encoding *-to-one related data into groups of columns: *-to-one relationships to other instances of models can be encoded into a series of consecutive columns by setting the formats of the related models to multiple_cells in the schema. When this feature is used, tables must include an additional row of column headings which indicates the groups of columns.
  • Encoding *-to-many related data into individual columns: *-to-many relationships to other instances of models can be encoded into a single column by setting the formats of the related models to cell and defining a grammar for serializing and deserializing the related instances to/from strings which can be written to and read from individual cells. These grammars can be defined declaratively in EBNF format using Lark.
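
Both encodings can be illustrated with a library-free sketch. It is not how ObjTables parses tables: `parse_grouped_table` is a hypothetical helper, it assumes that the extra heading row repeats the group name above every column of the group (a merged heading may be stored differently in a real file), and it splits *-to-many cells on commas instead of using a Lark grammar.

```python
def parse_grouped_table(group_row, heading_row, data_rows, list_columns=()):
    """Parse rows whose columns may belong to a named group of columns.

    Hypothetical helper for illustration. Columns under a group heading are
    nested into a sub-dict; columns named in `list_columns` are split on commas.
    """
    records = []
    for row in data_rows:
        record = {}
        for group, heading, cell in zip(group_row, heading_row, row):
            name = heading.lstrip('!')
            if group:
                # Column belongs to an embedded (*-to-one) related instance
                record.setdefault(group.lstrip('!'), {})[name] = cell
            elif name in list_columns:
                # Cell encodes several related instances (*-to-many)
                record[name] = [part.strip() for part in cell.split(',')
                                if part.strip()]
            else:
                record[name] = cell
        records.append(record)
    return records


group_row = ['', '', '', '',
             '!FavoriteVideoGame', '!FavoriteVideoGame', '!FavoriteVideoGame']
heading_row = ['!Id', '!Name', '!Gender', '!Parents',
               '!Name', '!Publisher', '!Year']
data_rows = [
    ['jamie_doe', 'Jamie Doe', 'female', 'jane_doe, john_doe',
     'Legend of Zelda', 'Nintendo', '1986'],
]
children = parse_grouped_table(group_row, heading_row, data_rows,
                               list_columns=('Parents',))
```

Nesting the grouped columns keeps the two !Name columns apart: one becomes the name of the child and the other the name of the embedded game.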

Encoding models into transposed tables

Optionally, a model can be encoded into a transposed table in which the columns represent instances of the model and the rows represent attributes of the model. This feature is useful for models which are intended to only have a single instance. Models can be encoded into transposed tables by setting their format to column rather than row.
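
A transposed table carries the same information as a row-oriented table with its rows and columns swapped. The sketch below shows the conversion in plain Python; `transpose` is a hypothetical helper, and the attribute names and values are made up for illustration.

```python
def transpose(rows):
    """Swap rows and columns, padding short rows with empty cells.

    Hypothetical helper for illustration.
    """
    width = max(len(row) for row in rows)
    padded = [list(row) + [''] * (width - len(row)) for row in rows]
    return [list(column) for column in zip(*padded)]


# Transposed layout: one attribute per row, one instance per column
transposed = [
    ['!Id', 'model_1'],
    ['!Name', 'Example model'],
    ['!Version', '0.0.1'],
]
row_oriented = transpose(transposed)
```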

Comments

Comments about individual model instances can be encoded by inserting rows whose first cell begins with %.

Additional tables, rows, and columns outside the scope of schemas

Excel workbooks can include additional worksheets. Worksheets that are outside the scope of the schema must have names which do not begin with an !.

Tables can include additional columns. Columns that are outside of the scope of the schema must have names which do not begin with an !.

Tables can include additional empty rows and rows with comments. Rows that contain comments must begin with %. Each comment is associated with the next data row, except for comments below the final data row, which are associated with the final row.
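
The rule for associating comments with data rows can be written down as a short loop. This is a sketch of the rule as stated here, not the ObjTables implementation: `attach_comments` is a hypothetical helper, and it assumes that the metadata and heading rows have already been removed.

```python
def attach_comments(rows):
    """Pair each data row with the comments that apply to it.

    Hypothetical helper for illustration. Returns a list of
    (data_row, comments) tuples.
    """
    records = []
    pending = []
    for row in rows:
        first = row[0] if row else ''
        if first.startswith('%'):
            # Comment row: hold it until the next data row appears
            pending.append(first[1:].strip())
        elif any(cell.strip() for cell in row):
            records.append((row, pending))
            pending = []
        # Empty rows are skipped
    if pending and records:
        # Comments below the final data row belong to the final row
        records[-1][1].extend(pending)
    return records


rows = [
    ['% first comment'],
    ['jane_doe', 'Jane Doe'],
    [],
    ['john_doe', 'John Doe'],
    ['% trailing comment'],
]
records = attach_comments(rows)
```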

Example

The following example illustrates how to encode parents, children, and their favorite video games into two tables according to the example schema below. The FavoriteVideoGame model enables information about the favorite game of each child to be encapsulated into a group of columns within the Child table. The FavoriteVideoGame model also enables the Python representation of the data to encapsulate information about the favorite game of each child into a separate class. This helps make tables and Python code more human-readable.

!!SBtab TableID='Parent' SBtabVersion='2.0'
!Id          !Name
jane_doe     Jane Doe
john_doe     John Doe
mary_roe     Mary Roe
richard_roe  Richard Roe

!!SBtab TableID='Child' SBtabVersion='2.0'
                                                        !FavoriteVideoGame
!Id        !Name        !Gender  !Parents               !Name                 !Publisher       !Year
jamie_doe  Jamie Doe    female   jane_doe, john_doe     Legend of Zelda       Nintendo         1986
jimie_doe  Jimie Doe    male     jane_doe, john_doe     Super Mario Brothers  Nintendo         1985
linda_roe  Linda Roe    female   mary_roe, richard_roe  Sonic the Hedgehog    Sega             1991
mike_roe   Michael Roe  male     mary_roe, richard_roe  SimCity               Electronic Arts  1989

Excel and Python files for the example are available here.

Additional examples are available at SBtab.net.

Formats for schemas

Schemas can either be defined using the tabular format described here or defined using the Python API. The tabular format is easier to use. The Python API enables methods for manipulating data to be encapsulated with schemas. This makes it easy to define custom validations such as for the element balance of a chemical reaction or for the lack of cycles in a network. Furthermore, the software tools can generate Python schemas from tabular-formatted schemas, which is a convenient starting point for further development.

Tabular format

Tabular-formatted schemas should begin with a single header row which indicates that the dataset is encoded in an ObjTables schema (!!ObjTables TableID="DEFINITION" ...).

After the header row, the schema file should contain a table that defines the models/tables and their attributes/columns. Each row in the table should define a single model or attribute.

Tabular-formatted schemas can be saved in comma-separated (.csv), tab-separated (.tsv), or Excel (.xlsx) format.

The table should contain the following columns:

  • !Name: Name of the component
  • !Type: Type of the component (Model or Attribute)
  • !Parent:
    • Models: Not applicable
    • Attributes: Name of the model that the attribute belongs to
  • !Format:
    • Tables:
      • row: Encode the instances of the model as rows
      • column: Encode the instances of the model as columns (i.e. transposed table)
      • multiple_cells: Encode the instances of the model within a group of columns in the tables of the one-to-many and one-to-one related models
      • cell: Encode the instances of the model within columns of the related models, optionally, using a grammar
    • Columns: One of the data types listed below (e.g., String, Float). Arguments for the data types can be described in parentheses (e.g., String(min_length=5)). The arguments enable users to customize how data types function and are validated. See the documentation for more information about the types and their arguments.
  • (Optional) !Description: Description of the component
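
To make the column layout concrete, the sketch below loads a small tab-separated schema into a nested dictionary using only the standard library. It is not the ObjTables schema loader: `load_schema` and the dictionary layout are hypothetical, the schema content is made up, and the sketch assumes that the !! header row has already been removed and that the !Type column contains Model or Attribute.

```python
import csv
import io

# Illustrative schema table (tab-separated), without the '!!' header row
SCHEMA_TSV = (
    "!Name\t!Type\t!Parent\t!Format\t!Description\n"
    "Child\tModel\t\trow\tRepresents a child\n"
    "Id\tAttribute\tChild\tString(min_length=5)\tIdentifier\n"
    "Name\tAttribute\tChild\tString\t\n"
)


def load_schema(text):
    """Group the rows of a tabular schema by model.

    Hypothetical helper for illustration; not the ObjTables loader.
    """
    rows = list(csv.DictReader(io.StringIO(text), delimiter='\t'))
    models = {}
    # First pass: models, so attributes may appear in any order
    for row in rows:
        if row['!Type'] == 'Model':
            models[row['!Name']] = {
                'format': row['!Format'],
                'description': row['!Description'],
                'attributes': {},
            }
    # Second pass: attach each attribute to its parent model
    for row in rows:
        if row['!Type'] == 'Attribute':
            models[row['!Parent']]['attributes'][row['!Name']] = row['!Format']
    return models


schema = load_schema(SCHEMA_TSV)
```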

Example

The following example illustrates a schema for encoding three models of parents, children, and their favorite video games into two tables of parents and children, with the favorite games of the children embedded into a group of columns within the children table.

!!SBtab TableID='DEFINITION' TableName='Table/model and column/attribute definitions' SBtabVersion='2.0'
!Name              !Type   !Parent  !Format                                        !Description
Parent             Table            row                                            Represents a parent
Id                 Column  Parent   slug                                           Identifier
Name               Column  Parent   string
Child              Table            row                                            Represents a child
Id                 Column  Child    slug                                           Identifier
Name               Column  Child    string
Gender             Column  Child    enum(['female', 'male'])
Parents            Column  Child    manyToMany('Parent', related_name='children')
FavoriteVideoGame  Column  Child    manyToOne('Game', related_name='children')
Game               Table            multiple_cells                                 Represents a video game
Name               Column  Game     string(unique=True)
Publisher          Column  Game     string
Year               Column  Game     integer

Excel and Python files for the example are available here.

Additional examples are available at SBtab.net.

Python format

Schemas can also be implemented as Python modules. The software can convert tabular-formatted schemas into Python modules. The Python module format provides more flexibility than the tabular format. For example, Python-formatted schemas can encapsulate methods into schemas, which can be used to implement custom validations.

Data types

ObjTables supports numerous datatypes and makes it easy to implement additional types. For example, SBtab extends ObjTables by adding a variety of types for genomics, systems biology, and synthetic biology research.

Strings

  • Long string
  • Regular expression string
  • String

Numbers

  • Boolean
  • Integer
  • Float
  • Positive integer
  • Positive float

Dates/times

  • Date
  • Date/time
  • Time

Internet

  • Email
  • URL

Enumerations

  • Enumeration

Relationships

  • One-to-one
  • One-to-many
  • Many-to-one
  • Many-to-many

Chemistry (SBtab)

  • Chemical formula
  • Chemical structure (SMILES, BpForms, BcForms)

Genomics (SBtab)

  • DNA, RNA, and protein sequences (Biopython)
  • Feature location (Biopython)
  • Frequency position matrix (Biopython)

Informatics (SBtab)

  • Ontology term (Pronto)

Mathematics (SBtab)

  • Array (NumPy)
  • Matrix (NumPy)
  • Python expressions
  • Symbolic expression (SymPy)
  • Symbolic symbol (SymPy)

Physics (SBtab)

  • Units (Pint)

Software tools

ObjTables includes a variety of methods for working with schemas and datasets:

  • Generate a Python module that implements a tabular-formatted schema.
  • Generate a template Excel, CSV, or TSV file(s) for a schema.
  • Programmatically construct, search, and analyze datasets.
  • Use Git to track revisions of datasets.
  • Normalize datasets into a deterministically reproducible ordering.
  • Validate that a dataset adheres to a schema and report any errors.
  • Pretty format a dataset according to a schema.
  • Use a schema to compare the semantic meaning of two datasets.
  • Use a schema to convert a dataset to an alternate format.
  • Use a schema to convert a dataset to a dictionary of pandas data frames.
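
The ideas behind normalizing and comparing datasets can be sketched without the library. The functions below are hypothetical illustrations and are not the ObjTables tools: they assume that a dataset is a dictionary that maps model names to lists of records, and that every record has a unique id key.

```python
def normalize(dataset):
    """Return a copy of the dataset in a deterministic order.

    Hypothetical helper for illustration. Models are sorted by name, records
    by their 'id', and the keys of each record alphabetically.
    """
    return {
        model: sorted(
            (dict(sorted(record.items())) for record in records),
            key=lambda record: record['id'])
        for model, records in sorted(dataset.items())
    }


def differences(first, second):
    """List the (model, id) pairs whose records differ between two datasets."""
    first_n, second_n = normalize(first), normalize(second)
    diffs = []
    for model in sorted(set(first_n) | set(second_n)):
        first_by_id = {r['id']: r for r in first_n.get(model, [])}
        second_by_id = {r['id']: r for r in second_n.get(model, [])}
        for id_ in sorted(set(first_by_id) | set(second_by_id)):
            if first_by_id.get(id_) != second_by_id.get(id_):
                diffs.append((model, id_))
    return diffs


first = {'Parent': [{'id': 'john_doe', 'name': 'John Doe'},
                    {'id': 'jane_doe', 'name': 'Jane Doe'}]}
# Same content as `first`, in a different order
second = {'Parent': [{'name': 'Jane Doe', 'id': 'jane_doe'},
                     {'id': 'john_doe', 'name': 'John Doe'}]}
# One record changed
third = {'Parent': [{'id': 'jane_doe', 'name': 'Jane Doe'},
                    {'id': 'john_doe', 'name': 'Jonathan Doe'}]}
```

Because both datasets are put into the same order before they are compared, `differences(first, second)` is empty even though the rows and keys were written in a different order, while `differences(first, third)` reports the changed record.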

User interfaces

ObjTables includes four user interfaces to the software tools described above.

Web app

A web app is available above.

REST API

A REST API is available at objtables.org/api.

Command line interface

A command line interface is available from PyPI.

Python library

A Python library is available from PyPI.

Python modules which implement ObjTables schemas make it easy to create datasets, parse files into structured Python representations, query and edit datasets, and save datasets to files.

The following example illustrates how to programmatically create, manipulate, analyze, and export the same dataset of parents and children described above.

Importing Python modules which implement schemas

import parents_children

Creating datasets

# Create parents
jane_doe = parents_children.Parent(id='jane_doe', name='Jane Doe')
john_doe = parents_children.Parent(id='john_doe', name='John Doe')
mary_roe = parents_children.Parent(id='mary_roe', name='Mary Roe')
richard_roe = parents_children.Parent(id='richard_roe', name='Richard Roe')

# Create children
jamie_doe = parents_children.Child(id='jamie_doe',
                                   name='Jamie Doe',
                                   gender=parents_children.Child.gender.enum_class.female,
                                   parents=[jane_doe, john_doe])
jamie_doe.favorite_video_game = parents_children.Game(name='Legend of Zelda: Ocarina of Time',
                                                      publisher='Nintendo',
                                                      year=1998)

jimie_doe = parents_children.Child(id='jimie_doe',
                                   name='Jimie Doe',
                                   gender=parents_children.Child.gender.enum_class.male,
                                   parents=[jane_doe, john_doe])
jimie_doe.favorite_video_game = parents_children.Game(name='Super Mario Brothers',
                                                      publisher='Nintendo',
                                                      year=1985)
linda_roe = parents_children.Child(id='linda_roe',
                                   name='Linda Roe',
                                   gender=parents_children.Child.gender.enum_class.female,
                                   parents=[mary_roe, richard_roe])
linda_roe.favorite_video_game = parents_children.Game(name='Sonic the Hedgehog',
                                                      publisher='Sega',
                                                      year=1991)
mike_roe = parents_children.Child(id='mike_roe',
                                  name='Michael Roe',
                                  gender=parents_children.Child.gender.enum_class.male,
                                  parents=[mary_roe, richard_roe])
mike_roe.favorite_video_game = parents_children.Game(name='SimCity',
                                                     publisher='Electronic Arts',
                                                     year=1989)

Querying datasets

mike_roe = mary_roe.children.get_one(id='mike_roe')
mikes_parents = mike_roe.parents
mikes_sisters = mikes_parents[0].children.get(gender=parents_children.Child.gender.enum_class.female)

Editing datasets

jamie_doe.favorite_video_game.name = 'Legend of Zelda'
jamie_doe.favorite_video_game.year = 1986

Validating datasets

import obj_tables

objects = [jane_doe, john_doe, mary_roe, richard_roe,
           jamie_doe, jimie_doe, linda_roe, mike_roe]
errors = obj_tables.Validator().run(objects)
assert errors is None

Parsing data from files

import obj_tables.io

filename = 'obj_tables/web_app/examples/parents_children.xlsx'
objects = obj_tables.io.Reader().run(filename, sbtab=True,
                                    models=[parents_children.Parent, parents_children.Child],
                                    group_objects_by_model=True)
parents = objects[parents_children.Parent]
jane_doe_2 = next(parent for parent in parents if parent.id == 'jane_doe')

Exporting datasets to files

filename = 'obj_tables/web_app/examples/parents_children_copy.xlsx'
objects = [jane_doe, john_doe, mary_roe, richard_roe,
           jamie_doe, jimie_doe, linda_roe, mike_roe]
obj_tables.io.Writer().run(filename, objects,
                          models=[parents_children.Parent, parents_children.Child],
                          sbtab=True)

Analyzing datasets

assert jane_doe.is_equal(jane_doe_2)

Resources for working with ObjTables

Below are several resources which can be helpful for working with ObjTables.

  • Microsoft Excel
  • WPS Office: Free editor for Linux, Mac, and Windows

Tutorials, documentation, and help

Documentation for the formats for schemas and datasets

Documentation for the format for the schemas and the formats for the collections of tables is available above. Additional information is available at docs.karrlab.org.

Query builder for the REST API

A visual interface for building REST queries is available at objtables.org/api.

Documentation for the REST API

Documentation for the REST API is available at objtables.org/api.

Documentation for the command line program

Documentation for the command line program is available inline by running obj-tables --help.

Tutorial for the Python API

A Jupyter notebook with an interactive tutorial is available at sandbox.karrlab.org.

Documentation for the Python API

Documentation for the Python API is available above and at docs.karrlab.org.

Questions

Please contact the Karr Lab with any questions.

Contributing to ObjTables

To contribute to the software, please submit a Git pull request.

About ObjTables

Source code

ObjTables is available open-source from GitHub.

License

ObjTables is released under the MIT license.

Citing ObjTables

Coming soon!

Team

ObjTables was developed by the Karr Lab at the Icahn School of Medicine at Mount Sinai in New York, US and the Applied Mathematics and Computer Science, from Genomes to the Environment research unit at the Institut National de la Recherche Agronomique in Jouy en Josas, FR.

  • Arthur Goldberg
  • Jonathan Karr
  • Wolfram Liebermeister
  • Timo Lubitz

Acknowledgements

ObjTables was supported by a National Institutes of Health P41 award, a National Institutes of Health MIRA R35 award, and a National Science Foundation INSPIRE award.

Questions/comments

Please contact the Karr Lab with any questions or comments.