Using CIF dictionaries with PyCIFRW

Introduction

CIF dictionaries describe the meaning of the datanames found in CIF data files in a machine-readable format - the CIF format. Each block in a CIF dictionary defines a single dataname by assigning values to a limited set of attributes. This set of attributes used by a dictionary is called its 'Dictionary Definition Language' or DDL. Three languages have been used in IUCr-supported CIF dictionaries: DDL1 (the original language), DDL2 (heavily developed by the macromolecular community), and DDLm (a new standard that aims to unite the best of DDL1 and DDL2). DDL2 and DDLm both allow algorithms to be defined for datanames. These algorithms describe how to derive values for datanames from other quantities in the data file.

Knowing the dictionary that a given datafile is written with reference to thus allows us to do two things: to validate that datanames and values match the constraints imposed by the definition; and, in the case of DDL2 and DDLm, to calculate values which might then be used for checking or simply to fill in missing information.

Dictionaries

DDL dictionaries can be read into CifFile objects just like CIF data files. For this purpose, CifFile objects automatically support save frames (used in DDL2 and DDLm dictionaries), which are accessed just like CifBlocks using their save frame name. By default save frames are not listed as keys in CifFiles as they do not form part of the CIF standard.

The more powerful CifDic object creats a unified interface to DDL1, DDL2 and DDLm dictionaries. A CifDic is initialised with a single file name or CifFile object, and will accept the grammar keyword:

    cd = CifFile.CifDic("cif_core.dic",grammar='1.1')

Definitions are accessed using the usual notation, e.g. cd['_atom_site_aniso_label']. Return values are always CifBlock objects. Additionally, the CifDic object contains a number of instance variables derived from dictionary global data:

dicname
The dictionary name + version as given in the dictionary
diclang
'DDL1','DDL2', or 'DDLm'

CifDic objects provide a large number of validation functions, which all return a Python dictionary which contains at least the key result. result takes the values True, False or None depending on the success, failure or non-applicability of each test. In case of failure, additional keys are returned depending on the nature of the error.

Validation with PyCIFRW

A top level function is provided for convenient validation of CIF files:

    CifFile.Validate("mycif.cif",dic = "cif_core.dic")

This returns a tuple (valid_result, no_matches). valid_result and no_matches are Python dictionaries indexed by block name. For valid_result, the value for each block is itself a dictionary indexed by item_name. The value attached to each item name is a list of (check_function, check_result) tuples, with check_result a small dictionary containing at least the key result. All tests which passed or were not applicable are removed from this dictionary, so result is always False. Additional keys contain auxiliary information depending on the test. Each of the items in no_matches is a simple list of item names which were not found in the dictionary.

If a simple validation report is required, the function validate_report can be called on the output of the above function, printing a simple ASCII report. This function can be studied as an example of how to process the structure returned by the validate function.

A somewhat nicer interface to validation is provided in the ValidationResult class (thanks to Boris Dusek), which is initialised with the return value from validate:

    val_report = ValidationResult(validate("mycif.cif",dic="cif_core.dic"))

This class provides the report method, producing a human-readable report, as well as boolean methods which return whether or not the block is valid or if items appear in the block that are not present in the dictionary - is_valid and has_no_match_items respectively.

Limitations on validation

  1. (DDL2 only) When validating data dictionaries themselves, no checks are made on group and subgroup consistency (e.g. that a specified subgroup is actually defined).

  2. (DDL1 only) Some _type_construct attributes in the DDL1 spec file are not machine-readable, so values cannot be checked for consistency

  3. DDLm validation methods are still in development so are not comprehensive.