2. Package elements in brief
There are multiple stages between a clinical source database and a final research database. CRATE separates these stages.
2.1. Terminology
I will refer to ‘anonymisation’ throughout this document, sometimes as shorthand for ‘pseudonymisation’, in which IDs are removed and replaced by a generated pseudonym.
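To make the idea of a generated pseudonym concrete, here is a minimal sketch that derives a stable pseudonym from an ID with a keyed hash. It illustrates the general technique only and is not a description of CRATE's own method; the function name, the key, and the record layout are all hypothetical.

```python
import hashlib
import hmac


def pseudonymise_id(patient_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for patient_id (hypothetical helper)."""
    # A keyed hash (HMAC) gives the same pseudonym for the same ID,
    # but cannot be recomputed by someone who lacks the key.
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key and record, for illustration only.
key = b"replace-with-a-long-random-secret"
record = {"patient_id": "1234567", "note": "Seen in clinic."}

# The research copy carries the pseudonym, not the original ID.
research_record = {
    "pseudonym": pseudonymise_id(record["patient_id"], key),
    "note": record["note"],
}
print(research_record)
```

Because the same ID always maps to the same pseudonym, records for one patient can still be linked to each other in the research database.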
2.2. Database connections
Most of the CRATE tools talk to one or more databases. They do this via SQLAlchemy, which uses a unified URL scheme to define a database connection. You will need to create a URL for every database you wish to use. In some cases you may need to install drivers. See “Internal database interfaces” below.
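As an illustration of the URL scheme, the sketch below assembles a URL in the general SQLAlchemy form dialect+driver://user:password@host:port/database, using only the standard library. The helper name and all connection details are invented for the example; the point is that special characters in the password need percent-encoding before they go into the URL.

```python
from urllib.parse import quote


def make_db_url(dialect, driver, user, password, host, port, database):
    """Build a SQLAlchemy-style database URL (hypothetical helper)."""
    # Percent-encode the user and password so that characters such as
    # '@' or '/' cannot be mistaken for URL delimiters.
    return (
        f"{dialect}+{driver}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Invented credentials and database name, for illustration only.
url = make_db_url(
    "mysql", "mysqldb", "researcher", "p@ss/word", "localhost", 3306, "anonymous_output"
)
print(url)
# mysql+mysqldb://researcher:p%40ss%2Fword@localhost:3306/anonymous_output
```

A string like this is what you would place in a configuration file or pass to SQLAlchemy's create_engine(), provided the relevant driver is installed.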
2.3. Preprocessing
Your data may need reshaping or supplementing. For example, while you will want to remove addresses and postcodes from the raw data, you may want to add less specific but nonetheless useful UK Office for National Statistics (ONS) geographical information. Your source database may also be heavily normalised [1], and you may want to de-normalise it to make life easier for your researchers.
CRATE provides the following optional pre-processing steps:
crate_postcodes
: takes a downloaded UK ONS Postcode Database (ONS PD) file and inserts it into a database, for later linking to your source data.
crate_preprocess_rio
: adds fields, indexes, and views to a Servelec RiO database (or a Servelec RiO CRIS Extract Program [RCEP] database) to apply some de-normalisation of RiO Core and RiO Non-Core data to make it simpler for end users. This program also generates data dictionary options for the next step.
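The idea of linking source data to a postcode lookup table can be sketched with an in-memory SQLite database. The table names, column names, postcodes, and area codes below are all invented for illustration; they are not the ONS PD layout or anything that crate_postcodes creates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical source table holding identifiable postcodes.
    CREATE TABLE patient_address (patient_id INTEGER, postcode TEXT);
    -- Hypothetical lookup table mapping postcodes to a coarser area code.
    CREATE TABLE postcode_lookup (postcode TEXT PRIMARY KEY, area_code TEXT);
    INSERT INTO patient_address VALUES (1, 'AB1 2CD'), (2, 'EF3 4GH');
    INSERT INTO postcode_lookup VALUES ('AB1 2CD', 'AREA_A'), ('EF3 4GH', 'AREA_B');
    """
)

# Join on the postcode, but select only the less specific area code,
# so the postcode itself need not travel to the research database.
rows = conn.execute(
    """
    SELECT a.patient_id, p.area_code
    FROM patient_address AS a
    LEFT JOIN postcode_lookup AS p ON p.postcode = a.postcode
    ORDER BY a.patient_id
    """
).fetchall()
print(rows)
```

The LEFT JOIN keeps patients whose postcode is missing from the lookup table; their area code simply comes back as NULL.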
2.4. Data dictionary generation and editing
- As it copies a database, CRATE removes identifiable information according to a data dictionary, which is essentially a spreadsheet with one row for every column in the source database.
- You can create and edit this data dictionary manually, but CRATE also provides a way to autogenerate a draft of it. Use the command crate_anonymise --draftdd to start a data dictionary, or crate_anonymise --incrementaldd to discover new database columns and add them to an existing data dictionary. Both commands refer to a CRATE anonymiser configuration file, which can contain guidance, in the form of options named ddgen..., on how to draft the data dictionary.
To be more helpful, preprocessors (including crate_preprocess_rio) can create these options for you; see the --settings-filename option above.
These suggestions incorporate knowledge about the specific database (e.g. which
fields contain patient IDs; which contain references to external document
files; etc). You can take those suggestions, and add them to your CRATE
configuration file. If you do that, the autogenerated data dictionary will (we
hope) be much closer to what you want.
However, you should always review your data dictionary by hand prior to anonymisation.
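The "one row per source column" idea can be sketched as follows. This is a toy illustration against an in-memory SQLite database, not CRATE's drafting logic or its data dictionary format; the output column names (table, column, datatype, decision) and the REVIEW marker are hypothetical.

```python
import csv
import io
import sqlite3


def draft_data_dictionary(conn):
    """Return one dict per source column, each marked for human review."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    rows = []
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, datatype, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            rows.append(
                {"table": table, "column": column,
                 "datatype": datatype, "decision": "REVIEW"}
            )
    return rows


# Hypothetical source table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id INTEGER, surname TEXT, dob TEXT)")
draft = draft_data_dictionary(conn)

# Write the draft as CSV so that it can be opened as a spreadsheet.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["table", "column", "datatype", "decision"])
writer.writeheader()
writer.writerows(draft)
print(buffer.getvalue())
```

Every row starts out marked for review, which mirrors the advice above: an autogenerated draft is a starting point, not a finished data dictionary.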
2.5. Anonymisation
You can use the crate_anonymise (or crate_anonymise_multiprocess) commands to perform the main anonymisation. This can be done in a “full” way, dropping existing tables and starting from scratch, or in an “incremental” way, looking for changes to the source database (with respect to the anonymised database) and changing the anonymised database accordingly.
This tool uses a configuration file that you create and edit. Use crate_anonymise --democonfig to generate a demonstration file. (For some databases, such as RiO, you can mix in the suggested options from crate_preprocess_rio.)
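One way to picture the difference between a full and an incremental run is to compare a hash of each source row with a hash stored at the previous run. The sketch below shows that general technique; it is an assumption made for illustration, not a statement of how crate_anonymise detects changes, and all names and data are hypothetical.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Hash a row's contents in a key-order-independent way."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_incremental_update(source_rows: dict, stored_hashes: dict):
    """Return (primary keys to (re)write, primary keys to delete)."""
    # New rows have no stored hash; changed rows have a different one.
    to_write = [
        pk for pk, row in source_rows.items()
        if stored_hashes.get(pk) != row_hash(row)
    ]
    # Rows that have vanished from the source should leave the destination.
    to_delete = [pk for pk in stored_hashes if pk not in source_rows]
    return sorted(to_write), sorted(to_delete)


# Hypothetical data: row 1 unchanged, row 2 edited, row 3 deleted, row 4 new.
source_rows = {
    1: {"note": "unchanged"},
    2: {"note": "edited text"},
    4: {"note": "new row"},
}
stored_hashes = {
    1: row_hash({"note": "unchanged"}),
    2: row_hash({"note": "original text"}),
    3: row_hash({"note": "since deleted"}),
}
to_write, to_delete = plan_incremental_update(source_rows, stored_hashes)
print(to_write, to_delete)  # [2, 4] [3]
```

A full run, by contrast, would ignore the stored hashes, drop the destination tables, and process every source row again.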
2.6. Natural language processing (NLP)
You can use the crate_nlp (or crate_nlp_multiprocess) commands to pass text from one or more databases/tables/columns to an NLP tool, and to write the resulting structured data back to a database table.
CRATE includes some built-in natural language tools, including regular expression (regex) parsers for numerical results.
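As a rough illustration of what a regex parser for numerical results does, the sketch below pulls C-reactive protein (CRP) values out of free text. The pattern, the function name, and the sample sentence are invented for this example and are far simpler than a production parser would need to be; they are not CRATE's own expressions.

```python
import re

# Hypothetical pattern: "CRP", an optional linking word or symbol,
# a number, then the unit mg/L (matched case-insensitively).
CRP_PATTERN = re.compile(
    r"\bCRP\s*(?:is|was|of|=|:)?\s*(\d+(?:\.\d+)?)\s*(mg/l)\b",
    re.IGNORECASE,
)


def extract_crp(text: str) -> list:
    """Return a list of (value, unit_as_written) tuples found in text."""
    return [
        (float(value), unit)
        for value, unit in CRP_PATTERN.findall(text)
    ]


sample = "CRP was 12.5 mg/L on admission; repeat CRP 8 mg/l."
print(extract_crp(sample))
```

Each tuple is the kind of structured result that could be stored as one row in an output table, alongside a reference back to the source text.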
The GATE NLP system is also supported, via a Java program. Use crate_nlp_build_gate_java_interface to build this before you use it for the first time.
The MedEx-UIMA system is also supported, via a Java program. Use crate_nlp_build_medex_java_interface to build this before you use it for the first time.
This tool uses a configuration file that you create and edit. Use crate_nlp --democonfig to generate a demonstration file.
2.7. Web front end
CRATE offers a web front end that supports researcher access to the data, and allows managers to operate a specific consent-to-contact process.
It uses a configuration file. Use crate_print_demo_crateweb_config to create a starting config that you can edit, and crate_generate_new_django_secret_key to generate a random secret key for your site (which goes into the config).
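For readers curious about what generating a random secret key involves, here is a minimal standard-library sketch. It is not the implementation behind crate_generate_new_django_secret_key; the alphabet, the default length of 50, and the function name are assumptions chosen for illustration.

```python
import secrets
import string

# Assumed alphabet: letters, digits, and some punctuation.
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*(-_=+)"


def generate_secret_key(length: int = 50) -> str:
    """Return a random key drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(generate_secret_key())
```

The secrets module draws from the operating system's secure random source, which is what makes it suitable for keys; the general-purpose random module is not.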
The crate_django_manage command provides options for:
- building the structure of the admin database (migrate);
- collecting statically served files (collectstatic);
- creating a superuser (createsuperuser);
- manually changing a password (changepassword);
- populating a consent database (populate);
- testing the back-end messaging system by sending an e-mail (test_email);
and a few other things that other scripts provide more convenient interfaces to.
Other scripts include:
- crate_launch_django_server for a test Django server;
- crate_launch_cherrypy_server to launch a production-grade CherryPy server;
- crate_launch_celery to launch the Celery message-handling backend;
- crate_launch_flower for the Flower tool to monitor the Celery/RabbitMQ backend;
- crate_windows_service to set up or test a Windows service for the web server system. (The CRATE Windows service does the equivalent of running both crate_launch_cherrypy_server and crate_launch_celery, in the background.)
2.8. Testing and additional tools
Other tools include:
crate_make_demo_database
: creates a demonstration database for testing.
crate_test_anonymisation
: fetches raw and anonymised data (from a source and a destination database), for a human to compare with a tool like Meld to verify the accuracy of anonymisation.
crate_estimate_mysql_memory_usage
: estimates the memory footprint of MySQL.
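The raw-versus-anonymised comparison that a human would perform in a tool like Meld can be approximated in a few lines with the standard library's difflib. This sketch only illustrates the comparison itself; the sample text and the [XXX] replacement marker are invented, and nothing here reflects the output format of crate_test_anonymisation.

```python
import difflib


def diff_raw_vs_anonymised(raw: str, anonymised: str) -> list:
    """Return unified-diff lines showing what anonymisation changed."""
    return list(
        difflib.unified_diff(
            raw.splitlines(),
            anonymised.splitlines(),
            fromfile="raw",
            tofile="anonymised",
            lineterm="",
        )
    )


# Invented sample text, for illustration only.
raw_text = "Patient: John Smith\nSeen in clinic today."
anon_text = "Patient: [XXX]\nSeen in clinic today."

for line in diff_raw_vs_anonymised(raw_text, anon_text):
    print(line)
```

Lines beginning with "-" show raw content that was removed, and lines beginning with "+" show what replaced it, so a reviewer can check that every identifier was caught and that no clinical content was lost.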
Footnotes
[1] https://en.wikipedia.org/wiki/Database_normalization