17. Things to do
Todo
Write more documentation here!
(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/docs/source/website_using/index.rst, line 28.)
Todo
Upgrade to django-pyodbc-azure==2.0.6.1, which will require Django 2.0.6 and pyodbc 3.0+.
(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/docs/source/installation/database_drivers.rst, line 484.)
Todo
Also: use Celery Beat to refresh regularly, and possibly to trigger withdrawal of consent if the consent mode has changed; see http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/crateweb/consent/models.py:docstring of crate_anon.crateweb.consent.models.ConsentMode.refresh_from_primary_clinical_record, line 18.)
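A minimal sketch of the Celery Beat approach (the task name and schedule here are illustrative assumptions, not CRATE's actual configuration):

    from celery import Celery
    from celery.schedules import crontab

    app = Celery("crate_consent")

    # Hypothetical periodic task: refresh all consent modes nightly.
    app.conf.beat_schedule = {
        "refresh-consent-modes": {
            "task": "consent.tasks.refresh_all_consent_modes",
            "schedule": crontab(hour=2, minute=0),  # daily at 02:00
        },
    }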
Todo
Also make an automatic opt-out list.
(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/crateweb/consent/models.py:docstring of crate_anon.crateweb.consent.models.ConsentMode.refresh_from_primary_clinical_record, line 22.)
17.1. Older: some redundant?
Fix parsing bug under SQL Server.
Redo consent system as per ethics
Pull consent/discharge info from (1) RiO new, (2) RiO old, (3) CRATE
“After discharge” code
functools.lru_cache is not thread-safe
- Symptom: KeyError at /pe_df_results/4/
  (<crate_anon.crateweb.research.research_db_info.ResearchDatabaseInfo object at ...>,
  <TableId(db='RiO', schema='dbo', table='GenSENFunctionTest') at ...>)
  at get_mrid_linkable_patient_tables(), in:
      if self.table_contains_rid(table):
  which is defined as:
      @lru_cache(maxsize=1000)
      def table_contains_rid(self, table: TableId):
- See https://bugs.python.org/issue28969
- Thus, possible thread-safe alternatives:
  https://noamkremen.github.io/a-simple-threadsafe-caching-decorator.html
  http://codereview.stackexchange.com/questions/91656/thread-safe-memoizer
  https://pythonhosted.org/cachetools/
  http://stackoverflow.com/questions/213455/python-threadsafe-object-cache
- Then, also, the Django cache system:
  https://docs.djangoproject.com/en/1.10/topics/cache/
  https://github.com/rchrd2/django-cache-decorator
  https://gist.github.com/tuttle/9190308
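For illustration, a minimal thread-safe memoizer along the lines of the links above (a sketch: no maxsize or eviction, and the lock also serializes the underlying computation):

    import threading
    from functools import wraps

    def threadsafe_memoize(fn):
        """Like functools.lru_cache, but guarded by a lock."""
        cache = {}
        lock = threading.Lock()

        @wraps(fn)
        def wrapper(*args):
            with lock:
                if args not in cache:
                    cache[args] = fn(*args)
                return cache[args]

        return wrapper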
Work the following pipeline overview into the main docs:
PIPELINE

===============================================================================
0. BEFORE FIRST USE
===============================================================================

a) Software prerequisites.

   1) MySQL 5.6 or later. For Ubuntu 14.10:

        $ sudo apt-get install mysql-server-5.6 mysql-server-core-5.6 \
            mysql-client-5.6 mysql-client-core-5.6

      Download the corresponding MySQL Workbench from
      http://dev.mysql.com/downloads/workbench/
      ... though it may moan about library incompatibilities ... but also
      sensible to use Ubuntu 15.04?

   2) Stuff that should come with Ubuntu: git; Python 2.7.

   3) This toolkit:

        $ git clone https://github.com/RudolfCardinal/anonymise

   4) GATE:
      - Download GATE Developer.
      - java -jar gate-8.0-build4825-installer.jar
      - Read the documentation; it's quite good.

b) Ensure that the PYTHONPATH is pointing to necessary Python support files:

     $ . SET_PATHS.sh

   To ensure it's working:

     $ ./anonymise.py --help

c) Ensure that all source database(s) are accessible to the Processing
   Computer.

d) Write a draft config file, giving connection details for all source
   databases. To generate a demo config for editing:

     $ ./anonymise.py --democonfig > MYCONFIG.ini

   Edit it so that it can access your databases.

e) Ensure that the data dictionary (DD) has been created, and then updated
   and verified by a human. To generate a draft DD from your databases for
   editing:

     $ ./anonymise.py MYCONFIG.ini --draftdd

   Edit it with any TSV editor (e.g. Excel, LibreOffice Calc).

===============================================================================
1. PRE-PROCESSING
===============================================================================

a) Ensure that the databases are copied and ready.

b) Add in any additional data. For example, if you want to process a postcode
   field to geographical output areas, such as
   http://en.wikipedia.org/wiki/ONS_coding_system
   then do it now; add in the new fields. Don't remove the old (raw)
   postcodes; they'll be necessary for anonymisation.

c) UNCOMMON OPTION: anonymise using NLP to find names (see below). If you
   want to anonymise using NLP to find names, rather than just use the name
   information in your source database, run nlp_manager.py now, using (for
   example) the Person annotation from GATE's
   plugins/ANNIE/ANNIE_with_defaults.gapp application, and send the output
   back into your database. You'll need to ensure the resulting data has
   patient IDs attached, probably with a view (see (d) below).

d) Ensure every table that relates to a patient has a common field with the
   patient ID that's used across the database(s) to be anonymised. Create
   views if necessary. The data dictionary should reflect this work.

e) Strongly consider using a row_id (e.g. integer primary key) field for each
   table. This will make natural language batch processing simpler (see
   below).

===============================================================================
2. ANONYMISATION (AND FULL-TEXT INDEXING) USING A DATA DICTIONARY
===============================================================================

OBJECTIVES:

- Make a record-by-record copy of tables in the source database(s). Handle
  tables that do, and tables that don't, contain patient-identifiable
  information.
- Collect patient-identifiable information and use it to "scrub" free-text
  fields; for example, with forename=John, surname=Smith, and spouse=Jane,
  one can convert freetext="I saw John in clinic with Sheila present" to
  "I saw XXX in clinic with YYY present" in the output.
- Deal with date, numerical, textual, and number-as-text information
  sensibly.
- Allow other aspects of information restriction, e.g. truncating dates of
  birth to the first of the month.
- Apply one-way encryption to patient ID numbers (storing a secure copy for
  superuser re-identification).
- Enable linking of data from multiple source databases with a common
  identifier (such as the NHS number), similarly encrypted.
- For performance reasons, enable parallel processing and incremental
  updates.
- Deal with binary attachments containing text.

For help: anonymise.py --help
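To illustrate the scrubbing objective, a minimal sketch in Python (hypothetical: CRATE's real scrubber is configured via the data dictionary and also handles dates, numbers, and fuzzy matching):

    import re

    def scrub(text, patient_words, third_party_words):
        """Replace patient identifiers with XXX, third-party ones with YYY."""
        for word in patient_words:
            text = re.sub(r"\b%s\b" % re.escape(word), "XXX", text,
                          flags=re.IGNORECASE)
        for word in third_party_words:
            text = re.sub(r"\b%s\b" % re.escape(word), "YYY", text,
                          flags=re.IGNORECASE)
        return text

    print(scrub("I saw John in clinic with Sheila present",
                patient_words=["John", "Smith"],
                third_party_words=["Jane", "Sheila"]))
    # -> I saw XXX in clinic with YYY present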
a) METHOD 1: THREAD-BASED. THIS IS SLOWER.

     anonymise.py <configfile> [--threads=<n>]

b) METHOD 2: PROCESS-BASED. THIS IS FASTER.

     See the example in launch_multiprocess.sh

---------------------------------------------------------------------------
Work distribution
---------------------------------------------------------------------------

- Best performance comes from multiprocess (not multithreaded) operation.
- Drop/rebuild tables: single-process operation only.
- Non-patient tables:
  - if there's an integer PK, split by row;
  - if there's no integer PK, split by table (in sequence of all tables).
- Patient tables: split by patient ID. (You have to process all scrubbing
  information from a patient simultaneously, so that's the unit of work.
  Patient IDs need to be integer for this method, though for no other
  reason.)
- Indexes: split by table (in sequence of all tables requiring indexing).
  (Indexing a whole table at a time is fastest, not index by index.)

---------------------------------------------------------------------------
Incremental updates
---------------------------------------------------------------------------

- Supported via the --incremental option.
- The problems include:
  - Aspects of patient data (e.g. address/phone number) might, in a very
    less than ideal world, change rather than being added to. How to detect
    such a change?
  - If a new phone number is added (even in a sensible way) -- or, more
    importantly, a new alias (following an anonymisation failure) -- one
    should re-scrub all records for that patient, even records previously
    scrubbed.
- Solution:
  - Only tables with a suitable PK can be processed incrementally. The PK
    must appear in the destination database (and therefore can't be
    sensitive, but should be an uninformative integer). This is so that if a
    row is deleted from the source, one can check by looking at the
    destination.
  - For a table with a src_pk, one can set the add_src_hash flag. If set,
    then a hash of all source fields (more specifically: all that are not
    omitted from the destination, plus any that are used for scrubbing, i.e.
    scrubsrc_patient or scrubsrc_thirdparty) is created and stored in the
    destination database.
  - Let's call tables that use the src_pk/add_src_hash system "hashed"
    tables.
  - During incremental processing:
    1. Non-hashed tables are dropped and rebuilt entirely. Any records in a
       hashed destination table that don't have a matching PK in their
       source table are deleted.
    2. For each patient, the scrubber is calculated. If the *scrubber's*
       hash has changed (stored in the secret_map table), then all
       destination records for that patient are reworked in full (i.e. the
       incremental option is disabled for that patient).
    3. During processing of a table (either per-table for non-patient
       tables, or per-patient then per-table for patient tables), each row
       has its source hash recalculated. For a non-hashed table, the row is
       then reprocessed normally. For a hashed table, if there is a record
       with a matching PK and a matching source hash, that record is
       skipped.
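The src_pk/add_src_hash mechanism can be sketched as follows (simplified and illustrative: the anonymiser's actual choice of fields and hash function is configurable, and these names are not CRATE's real API):

    import hashlib

    def source_hash(row, relevant_fields):
        """Hash the source fields that influence the destination record."""
        blob = "|".join(str(row[f]) for f in relevant_fields)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def row_can_be_skipped(row, relevant_fields, dest_hash_by_pk):
        """A hashed-table row is skipped if the destination already holds
        this PK with an identical source hash."""
        return dest_hash_by_pk.get(row["src_pk"]) == source_hash(
            row, relevant_fields)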
---------------------------------------------------------------------------
Anonymising multiple databases together
---------------------------------------------------------------------------

- RATIONALE: a scrubber will be built across ALL source databases, which may
  improve anonymisation.
- If you don't need this, you can anonymise them separately (even into the
  same destination database, if you want to, as long as table names don't
  overlap).
- The intention is that if you anonymise multiple databases together, they
  must share a patient numbering (ID) system. For example, you might have
  two databases using RiO numbers; you can anonymise them together. If they
  also have an NHS number, that can be hashed as a master PID, for linking
  to other databases (anonymised separately). (If you used the NHS number as
  the primary PID, the practical difference would be that you would ditch
  any patients who have a RiO number but no NHS number recorded.)
- Each database must use a consistent name for this field, across all
  tables, WITHIN that database.
- This field, which must be an integer, must fit into a BIGINT UNSIGNED
  field (see wipe_and_recreate_mapping_table() in anonymise.py).
- However, the databases don't have to use the same *name* for the field.
  For example, RiO might use "id" to mean "RiO number", while CamCOPS might
  use "_patient_idnum1".

===============================================================================
3. NATURAL LANGUAGE PROCESSING
===============================================================================

OBJECTIVE: send free-text content to natural language processing (NLP) tools,
storing the results in structured form in a relational database -- for
example, to find references to people, drugs/doses, cognitive examination
scores, or symptoms.

- For help: nlp_manager.py --help
- The Java element needs building; use buildjava.sh
- STRUCTURE: see nlp_manager.py; CamAnonGatePipeline.java
- Run the Python script in parallel; see launch_multiprocess_nlp.sh

---------------------------------------------------------------------------
Work distribution
---------------------------------------------------------------------------

- Parallelize by source_pk.

---------------------------------------------------------------------------
Incremental updates
---------------------------------------------------------------------------

- Here, incremental updates are simpler, as the NLP just requires a record
  taken on its own.
- Nonetheless, we still need to deal with the conceptual problem of source
  record modification. How would we detect that? One method is to hash the
  source record and store the hash with the destination.
- Solution:
  1. Delete any destination records without a corresponding source.
  2. For each record, hash the source. If a destination record exists with a
     matching hash, skip it.

===============================================================================
EXTRA: ANONYMISATION USING NLP
===============================================================================

OBJECTIVE: remove other names not properly tagged in the source database.
Here, we have a preliminary stage.
Instead of the usual:

                            free text
    source database ----------------------------------> anonymiser
           |                                                ^
           |              scrubbing information             |
           +------------------------------------------------+

we have:

                            free text
    source database ----------------------------------> anonymiser
       |       |                                         ^    ^
       |       |          scrubbing information          |    |
       |       +-----------------------------------------+    |
       |                                                      |
       +---> NLP software ---> list of names -----------------+
             (stored in source DB or separate DB)

For example, you could:

a) run the NLP processor to find names, feeding its output back into a new
   table in the source database, e.g. with these options:

     inputfielddefs = SOME_FIELD_DEF
     outputtypemap = person SOME_OUTPUT_DEF
     progenvsection = SOME_ENV_SECTION
     progargs = java -classpath {NLPPROGDIR}:{GATEDIR}/bin/gate.jar:{GATEDIR}/lib/*
         CamAnonGatePipeline
         -g {GATEDIR}/plugins/ANNIE/ANNIE_with_defaults.gapp
         -a Person
         -it END_OF_TEXT_FOR_NLP
         -ot END_OF_NLP_OUTPUT_RECORD
         -lt {NLPLOGTAG}
     input_terminator = END_OF_TEXT_FOR_NLP
     output_terminator = END_OF_NLP_OUTPUT_RECORD
     # ...

b) add a view to include patient numbers, e.g.

     CREATE VIEW patient_nlp_names AS
         SELECT notes.patient_id,
                nlp_person_from_notes._content AS nlp_name
         FROM notes
         INNER JOIN nlp_person_from_notes
             ON notes.note_id = nlp_person_from_notes._srcpkval;

c) feed that lot to the anonymiser, including the NLP-generated names as
   scrubsrc_* field(s).

===============================================================================
4. SQL ACCESS
===============================================================================

OBJECTIVE: research access to the anonymised database(s).

a) Grant READ-ONLY access to the output database for any relevant user.
b) Don't grant any access to the secret mapping database! This is for
   trusted superusers only.
c) You're all set.

===============================================================================
5. FRONT END (SQL BUILDER) FOR NON-SQL AFICIONADOS
===============================================================================

Python web site script, reading from the data dictionary, etc.

- The script knows each field's type, from the data dictionary.
- Collapsible list. Comments.
- Possible operators include: < <= = >= > !=
- For strings: = / LIKE / MATCH(x) AGAINST (...)
- SELECT outputfields WHERE constraints:
  - preview with COUNT(*) as output;
  - automatic joins on patient (NOT a separate query);
  - SQL preview.
- Raw SQL option? There is a danger of SQL injection... and it may be too
  much effort for what you get: everyone doing it seriously will use SQL.
- The front end should be accessible via HTTPS only.

===============================================================================
6. HOT-SWAPPING TWO DATABASES
===============================================================================

Since anonymisation is slow, you may want a live research database and
another that you can update offline. When you're ready to swap, you'll want
to:

- create DEFUNCT
- rename LIVE -> DEFUNCT
- rename OFFLINE -> LIVE

then either revert:

- rename LIVE -> OFFLINE
- rename DEFUNCT -> LIVE

or commit:

- drop DEFUNCT

How?
http://stackoverflow.com/questions/67093/how-do-i-quickly-rename-a-mysql-database-change-schema-name
https://gist.github.com/michaelmior/1173781
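MySQL has no single "rename database" statement (hence the links above); the usual approach is to move each table across schemas with RENAME TABLE. A sketch, assuming the MySQLdb driver and placeholder schema names and credentials:

    import MySQLdb  # assumption: the MySQLdb/mysqlclient driver is installed

    def move_all_tables(cursor, src_db, dest_db):
        """Move every table from src_db into dest_db via RENAME TABLE."""
        cursor.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = %s", (src_db,))
        for (table,) in cursor.fetchall():
            # RENAME TABLE works across schemas on the same server.
            cursor.execute("RENAME TABLE `%s`.`%s` TO `%s`.`%s`"
                           % (src_db, table, dest_db, table))

    conn = MySQLdb.connect(user="admin", passwd="...")  # placeholders
    cursor = conn.cursor()
    cursor.execute("CREATE DATABASE defunct")
    move_all_tables(cursor, "live", "defunct")    # LIVE -> DEFUNCT
    move_all_tables(cursor, "offline", "live")    # OFFLINE -> LIVE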
===============================================================================
7. AUDITING ACCESS TO THE RESEARCH DATABASE
===============================================================================

a) MYSQL RAW ACCESS.

   - You need an auditing tool, so we've provided one; see the contents of
     the "mysql_auditor" directory.
   - Download and install mysql-proxy, at least version 0.8.5, from
     https://dev.mysql.com/downloads/mysql-proxy/
     Install its files somewhere sensible.
   - Configure (by editing) mysql_auditor.sh.
   - Run it. By default it's configured for daemon mode. So you can do this:

       sudo ./mysql_auditor.sh CONFIGFILE start

   - By default the logs go in /var/log/mysql_auditor; the audit*.log files
     contain the queries, and the mysqlproxy*.log files contain information
     from the mysql-proxy program.
   - The audit log is a comma-separated value (CSV) file with these columns:
     - date/time, in ISO-8601 format with local timezone information, e.g.
       "2015-06-24T12:58:29+0100";
     - client IP address/port, e.g. "127.0.0.1:52965";
     - MySQL username, e.g. "root";
     - current schema (database), e.g. "test";
     - query, e.g. "SELECT * FROM mytable".
     Query results (or result success/failure status) are not shown.
   - To open fresh log files daily, run

       sudo FULLPATH/mysql_auditor.sh CONFIGFILE restart

     daily (e.g. from your /etc/crontab, just after midnight). Logs are named
     e.g. audit_2015_06_24.log, for their creation date.

b) FRONT END. The nascent front end will also audit queries. (Since this runs
   a web service that in principle can have access to proper data, it's
   probably better to run a username system rather than rely on MySQL
   usernames alone. Therefore, it can use a single username and a
   database-based auditing system. The administrator could also pipe its
   MySQL connection via the audit proxy, but doesn't have to.)
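Given the column layout above, the audit log is easy to post-process; for example (a sketch: the path and the filter are illustrative only):

    import csv

    # Columns, per the list above: date/time, client ip:port, username,
    # current schema, query.
    with open("/var/log/mysql_auditor/audit_2015_06_24.log") as f:
        for when, client, user, schema, query in csv.reader(f):
            if "secret_map" in query.lower():  # e.g. flag sensitive access
                print("%s %s %s" % (when, user, query))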
===============================================================================
X. TROUBLESHOOTING
===============================================================================

-------------------------------------------------------------------------------
Q. Error: [Microsoft][SQL Server Native Client 11.0]Connection is busy with
   results for another command.
-------------------------------------------------------------------------------
A. In /etc/odbc.ini, for that DSN, set MARS_Connection = yes
   - https://msdn.microsoft.com/en-us/library/cfa084cz(v=vs.110).aspx
   - https://msdn.microsoft.com/en-us/library/h32h3abf(v=vs.110).aspx
   - Rationale: we use gen_patient_ids() to iterate through patients, but
     then we fetch data for that patient via the same connection to the
     source database(s). Therefore, we're operating multiple result sets
     through one connection.

-------------------------------------------------------------------------------
Q. How to convert from SQL Server to MySQL?
-------------------------------------------------------------------------------
A. MySQL Workbench. Use the "ODBC via connection string" option if others
   aren't working:
     DSN=XXX;UID=YYY;PWD=ZZZ
   If the schema definitions are not seen, it's a permissions issue:
   http://stackoverflow.com/questions/17038716
   ... but you can copy the database using anonymise.py, treating all tables
   as non-patient tables (i.e. doing no actual anonymisation).

-------------------------------------------------------------------------------
Q. "MySQL server has gone away"?
-------------------------------------------------------------------------------
A. Probably: processing a big binary field, and MySQL's max_allowed_packet
   parameter is too small. Try increasing it (e.g. from 16M to 32M). See also
   http://www.camcops.org/documentation/server.html

-------------------------------------------------------------------------------
Q. What settings do I need in /etc/mysql/my.cnf?
-------------------------------------------------------------------------------
A. Probably these:

     [mysqld]
     max_allowed_packet = 32M

     innodb_strict_mode = 1
     innodb_file_per_table = 1
     innodb_file_format = Barracuda

     # Only for MySQL prior to 5.7.5
     # (http://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-20.html):
     innodb_log_file_size = 320M

     # For more performance, less safety:
     innodb_flush_log_at_trx_commit = 2

     # To save memory?
     # Default is 8; suggestion is ncores * 2
     # innodb_thread_concurrency = ...

     [mysqldump]
     max_allowed_packet = 32M

-------------------------------------------------------------------------------
Q. _mysql_exceptions.OperationalError: (1118, 'Row size too large (> 8126).
   Changing some columns to TEXT or BLOB or using ROW_FORMAT=DYNAMIC or
   ROW_FORMAT=COMPRESSED may help. In current row format, BLOB prefix of 768
   bytes is stored inline.')
-------------------------------------------------------------------------------
A. See above. If you need to change the log file size, FOLLOW THIS PROCEDURE:
   https://dev.mysql.com/doc/refman/5.0/en/innodb-data-log-reconfiguration.html

-------------------------------------------------------------------------------
Q. Segmentation fault (core dumped)?
-------------------------------------------------------------------------------
A. Use the Microsoft JDBC driver instead of the Microsoft ODBC driver for
   Linux, which is buggy.

-------------------------------------------------------------------------------
Q. "Killed."
-------------------------------------------------------------------------------
A. Out of memory. Suggest reducing the MySQL memory footprint; see notes.txt.
   Steps have already been taken to reduce memory usage by the anonymise
   program itself.

-------------------------------------------------------------------------------
Q. Anonymisation is slow.
-------------------------------------------------------------------------------
A. Make sure you have indexes created on all patient_id fields, because the
   tool will use these to find (a) values for scrubbing, and (b) records for
   anonymisation. This makes a huge difference!

-------------------------------------------------------------------------------
Q. Can't create FULLTEXT index(es)?
-------------------------------------------------------------------------------
A. MySQL 5.6 is required to use FULLTEXT indexes with InnoDB tables (as
   opposed to MyISAM tables, which don't support transactions). On Ubuntu
   14.04, the default MySQL is 5.5, so use:

     sudo apt-get install mysql-server-5.6 mysql-server-core-5.6 \
         mysql-client-5.6 mysql-client-core-5.6

-------------------------------------------------------------------------------
Q. How to search with FULLTEXT indexes?
-------------------------------------------------------------------------------
A. In conventional SQL, you would use:
     ... WHERE field LIKE '%word%'
   In a field having a MySQL FULLTEXT index, you can use:
     ... WHERE MATCH(field) AGAINST ('word')
   There are several variants. See
   https://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
17.2. Known bugs elsewhere affecting CRATE
wkhtmltopdf font size bug
See the notes next to PATIENT_FONTSIZE in config/settings.py, and https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2505
If you try to use django-debug-toolbar when proxying via a Unix domain socket, you need to use a custom INTERNAL_IPS setting; see the specimen config file, and the sketch following this list.
SQL Server returns a rowcount of -1; this is normal.
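For the Unix-domain-socket case above, a commonly used workaround (a sketch; CRATE's specimen config file is authoritative) is an INTERNAL_IPS value whose membership test always succeeds, since REMOTE_ADDR is empty when requests arrive via a socket rather than TCP/IP:

    # In the Django settings file (illustrative, not CRATE's exact code):
    class EverythingContainer(object):
        """Container that claims to hold any item, for INTERNAL_IPS."""
        def __contains__(self, item):
            return True

    INTERNAL_IPS = EverythingContainer()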
17.3. Newer
- BENCHMARK name blacklisting (forenames + surnames, minus English words, minus eponyms): speed, precision, recall. Share results with MB.
- Should privileged clinical queries be integrated with CRATE in any way? Advantages would include allowing the receiving user to run the query themselves, removing the need for RDBM intervention and the RDBM-to-recipient data transfer considerations, while ensuring the receiving user doesn't have unrestricted access (e.g. via SQL Server Management Studio). There may also be a UI advantage. Implementation would be by letting such a query use the privileged connection (see get_executed_researchdb_cursor; see also the DJANGO_DEFAULT_CONNECTION constant, currently unused).
- restrict anonymiser to specific patient IDs (for subset generation +/- custom pseudonym)
- option to filter out all free text automatically (as part of full anonymisation)
- Library of public SQL queries (name, description, SQL), editable by RDBM (e.g. “date of most recent progress note”); database-specific, but widely used on one system.
- Personal library?
- Personal configurable highlight colours (with default set if none configured).
- More of JL’s ideas from 8 Jan 2018.
- XXX consent mode lookup
- When the Windows service stops, it is still failing to kill child processes. See crate_anon/tools/winservice.py. A possible approach is sketched below.
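One possible approach (a sketch assuming the psutil package; this is not the current winservice.py implementation) is to terminate the whole process tree explicitly when the service stops:

    import psutil

    def kill_process_tree(pid, timeout=5):
        """Terminate a process and all its descendants, then force-kill
        anything that ignored the polite request."""
        parent = psutil.Process(pid)
        procs = parent.children(recursive=True) + [parent]
        for proc in procs:
            proc.terminate()
        gone, alive = psutil.wait_procs(procs, timeout=timeout)
        for proc in alive:
            proc.kill()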