BpForms is a toolkit for unambiguously representing the primary sequences of biopolymer forms. By representing these sequences concretely, BpForms aims to (a) facilitate precise discussion of DNA modification, post-transcriptional processing, and post-translational processing; (b) facilitate the determination of the structures of biopolymer forms; (c) facilitate the integration of data about these processes; and (d) enable whole-cell models that represent these processes and the functions of modified DNA, RNA, and proteins.
BpForms includes a notation for describing biopolymer forms, as well as this website, a JSON REST API, a command line interface, and a Python API for calculating properties of biopolymer forms. These tools are available open-source under the MIT license.
BpForms notation
Overview
The BpForms notation represents biopolymers as FASTA sequences augmented with (a) multiple-letter alphabet-defined monomers delimited by curly brackets and (b) user-defined monomers described in square brackets by one or more attributes separated by "|". The structure, monomer-bond-atom, monomer-displaced-atom, left-bond-atom, left-displaced-atom, right-bond-atom, and right-displaced-atom attributes are required to calculate the chemical formula, molecular weight, and charge. All other attributes are optional.
BpForms has several pre-built alphabets.
Examples
- [id: "dI" | name: "deoxyinosine"]ACGC: represents deoxyinosine at the first position
- AC[id: "dI" | name: "deoxyinosine"]GC: represents deoxyinosine at the third position
- AC{2mG}C[id: "dI" | name: "deoxyinosine"]: represents guanosine methylation at the third position and deoxyinosine at the last position
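To make the position counting in these examples concrete, the following is a minimal sketch of how such a string could be split into monomers. It is not the parser used by the BpForms package; the function name split_monomers is hypothetical, and the sketch assumes that double quotes inside square brackets are never escaped.

```python
def split_monomers(sequence):
    """Split a BpForms-style string into monomer tokens (illustrative only).

    Single characters, {...} alphabet monomers, and [...] inline monomers
    each count as one monomer. Hypothetical helper, not the BpForms API.
    """
    closers = {"{": "}", "[": "]"}
    tokens = []
    i = 0
    while i < len(sequence):
        ch = sequence[i]
        if ch.isspace():
            # Ignore whitespace between monomers
            i += 1
            continue
        if ch in closers:
            # Scan to the matching closing bracket, skipping quoted text
            in_quotes = False
            j = i + 1
            while j < len(sequence):
                c = sequence[j]
                if c == '"':
                    in_quotes = not in_quotes
                elif c == closers[ch] and not in_quotes:
                    break
                j += 1
            else:
                raise ValueError(f"unclosed {ch!r} at index {i}")
            tokens.append(sequence[i:j + 1])
            i = j + 1
        else:
            # A single-letter monomer from the alphabet
            tokens.append(ch)
            i += 1
    return tokens


tokens = split_monomers('AC{2mG}C[id: "dI" | name: "deoxyinosine"]')
for position, token in enumerate(tokens, start=1):
    print(position, token)
```

Counting tokens this way places {2mG} at position 3 and the inline deoxyinosine monomer at position 5, the last position.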
Structures of monomers
The structure attribute describes the chemical structure of the inline monomer. This attribute is a SMILES-encoded string. Each monomer can only have one structure. This attribute is required to calculate the structure of the BpForm.
Examples
[id: "dI"
| structure: "O=C1NC=NC2=C1N=CN2"
]
Linkages between monomers
The monomer-bond-atom, monomer-displaced-atom, left-bond-atom, left-displaced-atom, right-bond-atom, and right-displaced-atom attributes describe the linkages between monomers and their backbone and between successive monomers. Each monomer can have multiple bonds and multiple displaced atoms. These attributes are required to calculate the structure of the BpForm.
Examples
[id: "dI"
| structure: "O=C1NC=NC2=C1N=CN2"
| monomer-bond-atom: Monomer / N / 10 / 0
| monomer-displaced-atom: Monomer / H / 10 / 0
]
Uncertainty about the primary sequence
BpForms can represent two types of uncertainty in the primary sequences of biopolymer forms.
- The delta-mass and delta-charge attributes describe uncertainty in the chemical identity of the monomer.
- The position attribute describes uncertainty in the position of the monomer within the sequence.
Examples
- [id: "dAMP" | delta-mass: 1 | delta-charge: 1]: indicates the presence of an additional proton whose exact location is not known.
- [id: "dI" | position: 2-3]: indicates that deoxyinosine may occur at either the second or the third position.
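As an illustration of how a position value such as 2-3 could be interpreted, here is a small sketch. The helper names parse_position and candidate_positions are hypothetical and are not part of the BpForms Python API; the sketch assumes positions are 1-based and that the value is either a single integer or a start-end pair.

```python
def parse_position(value):
    """Parse a position value such as "2-3" into a (start, end) pair.

    Hypothetical helper: assumes 1-based, inclusive positions and only
    handles "N" or "N-M" values.
    """
    start_text, sep, end_text = value.strip().partition("-")
    start = int(start_text)
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid position range: {value!r}")
    return start, end


def candidate_positions(value):
    """List every position at which the monomer may occur."""
    start, end = parse_position(value)
    return list(range(start, end + 1))


print(candidate_positions("2-3"))
```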
Metadata about monomers
BpForms can represent several types of metadata about monomers.
- The id and name attributes are human-readable labels for monomers. Each monomer can have only one id and one name.
- The synonym attribute is an additional human-readable label. Monomers can have multiple synonyms.
- The identifier attribute indicates entries in databases and ontologies which are equivalent to the monomer. Monomers can have multiple identifiers. The namespace and id of each identifier must be separated by a "/".
- The comments attribute describes additional information about the monomer. Monomers can only have one comment.
Examples
- [id: "dI" | name: "deoxyinosine"]
- [id: "dI" | synonym: "deoxyinosine" | synonym: "2'-deoxyinosine"]
- [id: "dI" | identifier: "chebi" / "CHEBI:28997" | identifier: "pubchem.compound" / "65058"]
- [id: "dI" | comments: "A purine 2'-deoxyribonucleoside that is inosine ..."]
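The attribute rules above (attributes separated by "|", some repeatable, others limited to one per monomer) can be sketched as a small parser. This is an illustrative sketch, not the grammar used by the BpForms package: the names parse_inline_monomer, split_identifier, and SINGLE_VALUED are hypothetical, and the code assumes that quoted values never contain escaped double quotes.

```python
from collections import defaultdict

# Attributes the text above limits to one value per monomer
SINGLE_VALUED = {"id", "name", "structure", "comments"}


def parse_inline_monomer(text):
    """Parse '[key: value | key: value ...]' into {key: [values]}."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a monomer in square brackets")
    body = body[1:-1]

    # Split on "|" characters that are outside double quotes
    fields, current, in_quotes = [], [], False
    for ch in body:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "|" and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))

    attrs = defaultdict(list)
    for field in fields:
        if not field.strip():
            continue
        # Split at the first ":" so colons inside values are preserved
        key, sep, value = field.partition(":")
        if not sep:
            raise ValueError(f"attribute without a value: {field!r}")
        attrs[key.strip()].append(value.strip())

    for key in SINGLE_VALUED:
        if len(attrs.get(key, [])) > 1:
            raise ValueError(f"only one {key!r} is allowed per monomer")
    return dict(attrs)


def split_identifier(value):
    """Split '"namespace" / "id"' into an unquoted (namespace, id) pair."""
    namespace, sep, ident = value.partition("/")
    if not sep:
        raise ValueError('identifier must contain a "/"')
    return namespace.strip().strip('"'), ident.strip().strip('"')


monomer = parse_inline_monomer(
    '[id: "dI" | identifier: "chebi" / "CHEBI:28997"'
    ' | identifier: "pubchem.compound" / "65058"]'
)
print([split_identifier(value) for value in monomer["identifier"]])
```

Values are kept as lists so that repeatable attributes such as synonym and identifier retain every entry, while the single-valued check mirrors the one-per-monomer rules stated above.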
Integrating BpForms into pathway databases, models, and in silico designs
Pathway databases (BioPAX)
We are exploring a recommended best practice for encoding BpForms in BioPAX.
Example
Coming soon.
Models (SBML)
The annotation attribute of species can be used to describe species with BpForms.
Example
<species>
  <annotation>
    <bpforms:bpform xmlns:bpforms="https://www.bpforms.org/ns">
      <bpforms:alphabet>dna</bpforms:alphabet>
      <bpforms:sequence>A{m2C}GT</bpforms:sequence>
    </bpforms:bpform>
  </annotation>
</species>
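An annotation of this shape can be written and read back with Python's standard library. The sketch below is illustrative only: the function names build_species_annotation and read_bpform are hypothetical, and, as in the example above, the species and annotation elements are left without the SBML namespace that a complete SBML document would declare.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example annotation above
BPFORMS_NS = "https://www.bpforms.org/ns"


def build_species_annotation(alphabet, sequence):
    """Build a <species> element carrying a BpForms annotation."""
    ET.register_namespace("bpforms", BPFORMS_NS)
    species = ET.Element("species")
    annotation = ET.SubElement(species, "annotation")
    bpform = ET.SubElement(annotation, f"{{{BPFORMS_NS}}}bpform")
    ET.SubElement(bpform, f"{{{BPFORMS_NS}}}alphabet").text = alphabet
    ET.SubElement(bpform, f"{{{BPFORMS_NS}}}sequence").text = sequence
    return species


def read_bpform(species):
    """Return (alphabet, sequence) from a species element, or None."""
    ns = {"bpforms": BPFORMS_NS}
    bpform = species.find("annotation/bpforms:bpform", ns)
    if bpform is None:
        return None
    alphabet = bpform.findtext("bpforms:alphabet", namespaces=ns)
    sequence = bpform.findtext("bpforms:sequence", namespaces=ns)
    return alphabet, sequence


species = build_species_annotation("dna", "A{m2C}GT")
xml_text = ET.tostring(species, encoding="unicode")
print(xml_text)

# Round trip: parse the serialized XML and recover the annotation
print(read_bpform(ET.fromstring(xml_text)))
```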
In silico designs (SBOL)
These encoding URIs can be used to describe components in in silico designs:
- DNA: http://edamontology.org/format_XXXX
- RNA: http://edamontology.org/format_YYYY
- Protein: http://edamontology.org/format_ZZZZ
Example
<sbol:Sequence>
  <sbol:encoding rdf:resource="http://edamontology.org/format_XXXX"/>
  <sbol:elements>A{m2C}GT</sbol:elements>
</sbol:Sequence>
BpForm interfaces
Python package
The BpForms Python package is available from PyPI.
Command line interface
The BpForms command line interface is available from PyPI.
Source code
BpForms is available open-source from GitHub.
Documentation
Please see the BpForms documentation.
License
BpForms is released under the MIT license.
About BpForms
Citing BpForms
Lang PF, Chebaro Y & Karr JR. BpForms: a toolkit for concretely describing modified DNA, RNA and proteins. arXiv:1903.10042.
Team
BpForms was developed by Jonathan Karr, Yassmine Chebaro, and Paul Lang in the Karr Lab at the Icahn School of Medicine at Mount Sinai in New York, USA.
Acknowledgements
BpForms was supported by a National Institutes of Health P41 award, a National Institutes of Health MIRA R35 award, and a National Science Foundation INSPIRE award.
Questions/comments
Please contact Jonathan Karr with any questions or comments.