17. Things to do

Todo

Write more documentation here!

(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/docs/source/website_using/index.rst, line 28.)

Todo

Upgrade to django-pyodbc-azure==2.0.6.1, which will require Django 2.0.6, pyodbc 3.0+.

(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/docs/source/installation/database_drivers.rst, line 480.)

Todo

Also: use Celery beat to refresh regularly, optionally also triggering withdrawal of consent if the consent mode has changed; see http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html

(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/crateweb/consent/models.py:docstring of crate_anon.crateweb.consent.models.ConsentMode.refresh_from_primary_clinical_record, line 18.)
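A minimal sketch of what the Celery beat configuration might look like (illustrative only; the app name, task name, and schedule below are hypothetical, not CRATE's actual configuration):

    from celery import Celery
    from celery.schedules import crontab

    app = Celery("crateweb_consent")  # hypothetical app name

    app.conf.beat_schedule = {
        "refresh-consent-modes-nightly": {
            # Hypothetical task; CRATE would register its own task here.
            "task": "consent.tasks.refresh_consent_modes",
            "schedule": crontab(hour=1, minute=30),  # run daily at 01:30
        },
    }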

Todo

Also: make an automatic opt-out list.

(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/crateweb/consent/models.py:docstring of crate_anon.crateweb.consent.models.ConsentMode.refresh_from_primary_clinical_record, line 22.)

Todo

Trying Celery --concurrency=1 (from launch_celery.py) as this should prevent multiple Celery threads doing the same work twice if you call crate_django_manage resubmit_unprocessed_tasks more than once. There’s a risk that this breaks Flower or other Celery status monitoring (as it did with Celery v3.1.23, but that was a long time ago).

(The original entry is located in /home/rudolf/Documents/code/crate/crate_anon/docs/source/changelog.rst, line 563.)

17.1. Older: some redundant?

  • Fix parsing bug under SQL Server.

  • Redo consent system as per ethics

  • Pull consent/discharge info from (1) RiO new, (2) RiO old, (3) CRATE

  • “After discharge” code

  • functools.lru_cache is not thread-safe

    - Symptom:
        KeyError at /pe_df_results/4/
        (<crate_anon.crateweb.research.research_db_info.ResearchDatabaseInfo object at ...>,
        <TableId(db='RiO', schema='dbo', table='GenSENFunctionTest') at ...>)
    
        at get_mrid_linkable_patient_tables():
    
            if self.table_contains_rid(table):
    
        which is defined as:
    
            @lru_cache(maxsize=1000)
            def table_contains_rid(self, table: TableId):
    
    - https://bugs.python.org/issue28969
    
    - Thus:
      https://noamkremen.github.io/a-simple-threadsafe-caching-decorator.html
      http://codereview.stackexchange.com/questions/91656/thread-safe-memoizer
      https://pythonhosted.org/cachetools/#
      http://stackoverflow.com/questions/213455/python-threadsafe-object-cache
    
    - Then, also, the Django cache system:
      https://docs.djangoproject.com/en/1.10/topics/cache/
      https://github.com/rchrd2/django-cache-decorator
      https://gist.github.com/tuttle/9190308
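    - A minimal sketch of one such approach (illustrative only, not CRATE's
      code): guard a plain dict with a lock. The decorated function below is
      hypothetical.

          import threading
          from functools import wraps

          def threadsafe_memoize(func):
              """Memoize func(*args) behind a lock; args must be hashable."""
              cache = {}
              lock = threading.Lock()

              @wraps(func)
              def wrapper(*args):
                  with lock:
                      if args not in cache:
                          cache[args] = func(*args)
                      return cache[args]

              return wrapper

          @threadsafe_memoize
          def table_contains_rid(table_id):  # hypothetical example
              ...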
    
  • Work into main docs: PIPELINE:

    PIPELINE
    
    ===============================================================================
    0. BEFORE FIRST USE
    ===============================================================================
    
    a)  Software prerequisites
    
        1)  MySQL 5.6 or later. For Ubuntu 14.10:
    
                $ sudo apt-get install mysql-server-5.6 mysql-server-core-5.6 \
                    mysql-client-5.6 mysql-client-core-5.6
                - Download the corresponding MySQL Workbench from
                    http://dev.mysql.com/downloads/workbench/
                  ... though it may moan about library incompatibilities
    
            ... but also sensible to use Ubuntu 15.04?
    
        2)  Stuff that should come with Ubuntu:
                git
                Python 2.7
    
        3)  This toolkit:
    
                $ git clone https://github.com/RudolfCardinal/anonymise
    
        4)  GATE
            - Download GATE Developer
            - java -jar gate-8.0-build4825-installer.jar
            - Read the documentation; it's quite good.
    
    b)  Ensure that the PYTHONPATH is pointing to necessary Python support files:
    
            $ . SET_PATHS.sh
    
        To ensure it's working:
    
            $ ./anonymise.py --help
    
    c)  Ensure that all source database(s) are accessible to the Processing
        Computer.
    
    d)  Write a draft config file, giving connection details for all source
        databases. To generate a demo config for editing:
    
            $ ./anonymise.py --democonfig > MYCONFIG.ini
    
        Edit it so that it can access your databases.
    
    e)  Ensure that the data dictionary (DD) has been created, and then updated and
        verified by a human. To generate a draft DD from your databases for
        editing:
    
            $ ./anonymise.py MYCONFIG.ini --draftdd
    
        Edit it with any TSV editor (e.g. Excel, LibreOffice Calc).
    
    ===============================================================================
    1. PRE-PROCESSING
    ===============================================================================
    
    a)  Ensure that the databases are copied and ready.
    
    b)  Add in any additional data. For example, if you want to process a postcode
        field to geographical output areas, such as
            http://en.wikipedia.org/wiki/ONS_coding_system
        then do it now; add in the new fields. Don't remove the old (raw) postcodes;
        they'll be necessary for anonymisation.
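
        As a rough sketch only (the table, column, and file names here are
        hypothetical), one way to add such fields:

            # Join raw postcodes against an ONS postcode-to-output-area lookup
            # and write the result back as a new table with extra columns.
            import pandas as pd
            from sqlalchemy import create_engine

            engine = create_engine("mysql+mysqldb://user:password@host/sourcedb")

            addresses = pd.read_sql(
                "SELECT address_id, postcode FROM patient_address", engine)
            lookup = pd.read_csv("ons_postcode_to_output_area.csv")
            # lookup columns assumed: postcode, output_area

            merged = addresses.merge(lookup, on="postcode", how="left")
            merged.to_sql("patient_address_with_oa", engine,
                          if_exists="replace", index=False)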
    
    c)  UNCOMMON OPTION: anonymise using NLP to find names. See below.
        If you want to anonymise using NLP to find names, rather than just use the
        name information in your source database, run nlp_manager.py now, using
        (for example) the Person annotation from GATE's
            plugins/ANNIE/ANNIE_with_defaults.gapp
        application, and send the output back into your database. You'll need to
        ensure the resulting data has patient IDs attached, probably with a view
        (see (d) below).
    
    d)  Ensure every table that relates to a patient has a common field with the
        patient ID that's used across the database(s) to be anonymised.
        Create views if necessary. The data dictionary should reflect this work.
    
    e)  Strongly consider using a row_id (e.g. integer primary key) field for each
        table. This will make natural language batch processing simpler (see
        below).
    
    ===============================================================================
    2. ANONYMISATION (AND FULL-TEXT INDEXING) USING A DATA DICTIONARY
    ===============================================================================
    
    OBJECTIVES:
        - Make a record-by-record copy of tables in the source database(s).
          Handle tables that do and tables that don't contain patient-identifiable
          information.
        - Collect patient-identifiable information and use it to "scrub" free-text
          fields; for example, with forename=John, surname=Smith, and spouse=Jane,
          one can convert freetext="I saw John in clinic with Jane present" to
          "I saw XXX in clinic with YYY present" in the output. Deal with date,
          numerical, textual, and number-as-text information sensibly.
        - Allow other aspects of information restriction, e.g. truncating dates of
          birth to the first of the month.
        - Apply one-way encryption to patient ID numbers (storing a secure copy for
          superuser re-identification).
        - Enable linking of data from multiple source databases with a common
          identifier (such as the NHS number), similarly encrypted.
        - For performance reasons, enable parallel processing and incremental
          updates.
        - Deal with binary attachments containing text.
    
        For help: anonymise.py --help
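
    A toy illustration of the scrubbing objective above (not CRATE's actual
    scrubber; in practice the word lists come from the source database via the
    data dictionary):

        import re

        def scrub(text, patient_words, third_party_words):
            """Replace whole-word matches: patient -> XXX, third party -> YYY."""
            for words, placeholder in ((patient_words, "XXX"),
                                       (third_party_words, "YYY")):
                for word in words:
                    text = re.sub(r"\b" + re.escape(word) + r"\b",
                                  placeholder, text, flags=re.IGNORECASE)
            return text

        print(scrub("I saw John in clinic with Jane present",
                    patient_words=["John", "Smith"],
                    third_party_words=["Jane"]))
        # -> "I saw XXX in clinic with YYY present"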
    
    a)  METHOD 1: THREAD-BASED. THIS IS SLOWER.
            anonymise.py <configfile> [--threads=<n>]
    
    b)  METHOD 2: PROCESS-BASED. THIS IS FASTER.
        See example in launch_multiprocess.sh
    
        ---------------------------------------------------------------------------
        Work distribution
        ---------------------------------------------------------------------------
        - Best performance from multiprocess (not multithreaded) operation.
        - Drop/rebuild tables: single-process operation only.
        - Non-patient tables:
            - if there's an integer PK, split by row
            - if there's no integer PK, split by table (in sequence of all tables).
        - Patient tables: split by patient ID.
          (You have to process all scrubbing information from a patient
          simultaneously, so that's the unit of work. Patient IDs need to be
          integer for this method, though for no other reason.)
        - Indexes: split by table (in sequence of all tables requiring indexing).
          (Indexing a whole table at a time is fastest, not index by index.)
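        - A sketch of one natural way to implement the patient-ID split above
          (illustrative only):

              # With N worker processes, worker k handles every patient whose
              # integer ID satisfies pid % N == k, so all of a patient's
              # records (and scrubbing information) stay with one worker.
              def patient_ids_for_worker(all_patient_ids, nprocesses,
                                         process_index):
                  return [pid for pid in all_patient_ids
                          if pid % nprocesses == process_index]

              # e.g. with 4 processes, worker 0 takes PIDs 0, 4, 8, ...
              print(patient_ids_for_worker(range(12), 4, 0))  # [0, 4, 8]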
    
        ---------------------------------------------------------------------------
        Incremental updates
        ---------------------------------------------------------------------------
        - Supported via the --incremental option.
        - The problems include:
            - Aspects of patient data (e.g. address/phone number) might, in a
              less than ideal world, change rather than simply being added to.
              How do we detect such a change?
            - If a new phone number is added (even in a sensible way), or, more
              importantly, a new alias (following an anonymisation failure),
              then we should re-scrub all records for that patient, even
              records previously scrubbed.
        - Solution:
            - Only tables with a suitable PK can be processed incrementally.
              The PK must appear in the destination database (and therefore can't
              be sensitive, but should be an uninformative integer).
              This is so that, if a row is deleted from the source, the
              corresponding destination row can be found (by its PK) and
              deleted too.
            - For a table with a src_pk, one can set the add_src_hash flag.
              If set, then a hash of all source fields (more specifically: all that
              are not omitted from the destination, plus any that are used for
              scrubbing, i.e. scrubsrc_patient or scrubsrc_thirdparty) is created
              and stored in the destination database.
            - Let's call tables that use the src_pk/add_src_hash system "hashed"
              tables.
            - During incremental processing:
                1. Non-hashed tables are dropped and rebuilt entirely.
                   Any records in a hashed destination table that don't have a
                   matching PK in their source table are deleted.
                2. For each patient, the scrubber is calculated. If the
                   *scrubber's* hash has changed (stored in the secret_map table),
                   then all destination records for that patient are reworked
                   in full (i.e. the incremental option is disabled for that
                   patient).
                3. During processing of a table (either per-table for non-patient
                   tables, or per-patient then per-table for patient tables), each
                   row has its source hash recalculated. For a non-hashed table,
                   this is then reprocessed normally. For a hashed table, if there
                   is a record with a matching PK and a matching source hash, that
                   record is skipped.
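        - A conceptual sketch of the "hashed table" skip test above (simplified;
          in practice the hash covers the configured source fields):

              import hashlib

              def source_hash(row_values):
                  """Hash the relevant source fields of one row."""
                  joined = "\x01".join(str(v) for v in row_values)
                  return hashlib.sha256(joined.encode("utf-8")).hexdigest()

              def can_skip(src_pk, row_values, dest_hashes):
                  """dest_hashes: {destination PK: stored source hash}."""
                  return dest_hashes.get(src_pk) == source_hash(row_values)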
    
        ---------------------------------------------------------------------------
        Anonymising multiple databases together
        ---------------------------------------------------------------------------
        - RATIONALE: A scrubber will be built across ALL source databases, which
          may improve anonymisation.
        - If you don't need this, you can anonymise them separately (even into
          the same destination database, if you want to, as long as table names
          don't overlap).
        - The intention is that if you anonymise multiple databases together,
          then they must share a patient numbering (ID) system. For example, you
          might have two databases using RiO numbers; you can anonymise them
          together. If they also have an NHS number, that can be hashed as a master
          PID, for linking to other databases (anonymised separately). (If you used
          the NHS number as the primary PID, the practical difference would be that
          you would ditch any patients who have a RiO number but no NHS number
          recorded.)
        - Each database must use a consistent name for this field, across all
          tables, WITHIN that database.
        - This field, which must be an integer, must fit into a BIGINT UNSIGNED
          field (see wipe_and_recreate_mapping_table() in anonymise.py).
        - However, the databases don't have to use the same *name* for the field.
          For example, RiO might use "id" to mean "RiO number", while CamCOPS might
          use "_patient_idnum1".
    
    ===============================================================================
    3. NATURAL LANGUAGE PROCESSING
    ===============================================================================
    
    OBJECTIVES: Send free-text content to natural language processing (NLP) tools,
    storing the results in structured form in a relational database -- for example,
    to find references to people, drugs/doses, cognitive examination scores, or
    symptoms.
    
        - For help: nlp_manager.py --help
        - The Java element needs building; use buildjava.sh
    
        - STRUCTURE: see nlp_manager.py; CamAnonGatePipeline.java
    
        - Run the Python script in parallel; see launch_multiprocess_nlp.sh
    
        ---------------------------------------------------------------------------
        Work distribution
        ---------------------------------------------------------------------------
        - Parallelize by source_pk.
    
        ---------------------------------------------------------------------------
        Incremental updates
        ---------------------------------------------------------------------------
        - Here, incremental updates are simpler, as the NLP just requires a record
          taken on its own.
        - Nonetheless, we still need to deal with the conceptual problem of
          source record modification; how would we detect that?
            - One method would be to hash the source record, and store that with
              the destination...
        - Solution:
            1. Delete any destination records without a corresponding source.
            2. For each record, hash the source.
               If a destination exists with the matching hash, skip.
    
    ===============================================================================
    EXTRA: ANONYMISATION USING NLP.
    ===============================================================================
    
    OBJECTIVE: remove other names not properly tagged in the source database.
    
    Here, we have a preliminary stage. Instead of the usual:
    
                            free text
        source database -------------------------------------> anonymiser
                    |                                           ^
                    |                                           | scrubbing
                    +-------------------------------------------+ information
    
    
    we have:
    
                            free text
        source database -------------------------------------> anonymiser
              |     |                                           ^  ^
              |     |                                           |  | scrubbing
              |     +-------------------------------------------+  | information
              |                                                    |
              +---> NLP software ---> list of names ---------------+
                                      (stored in source DB
                                       or separate DB)
    
    For example, you could:
    
        a) run the NLP processor to find names, feeding its output back into a new
           table in the source database, e.g. with these options:
    
                inputfielddefs =
                    SOME_FIELD_DEF
                outputtypemap =
                    person SOME_OUTPUT_DEF
                progenvsection = SOME_ENV_SECTION
                progargs = java
                    -classpath {NLPPROGDIR}:{GATEDIR}/bin/gate.jar:{GATEDIR}/lib/*
                    CamAnonGatePipeline
                    -g {GATEDIR}/plugins/ANNIE/ANNIE_with_defaults.gapp
                    -a Person
                    -it END_OF_TEXT_FOR_NLP
                    -ot END_OF_NLP_OUTPUT_RECORD
                    -lt {NLPLOGTAG}
                input_terminator = END_OF_TEXT_FOR_NLP
                output_terminator = END_OF_NLP_OUTPUT_RECORD
    
                # ...
    
        b) add a view to include patient numbers, e.g.
    
                CREATE VIEW patient_nlp_names
                AS SELECT
                    notes.patient_id,
                    nlp_person_from_notes._content AS nlp_name
                FROM notes
                INNER JOIN nlp_person_from_notes
                    ON notes.note_id = nlp_person_from_notes._srcpkval
                ;
    
        c) feed that lot to the anonymiser, including the NLP-generated names as
           scrubsrc_* field(s).
    
    
    ===============================================================================
    4. SQL ACCESS
    ===============================================================================
    
    OBJECTIVE: research access to the anonymised database(s).
    
    a)  Grant READ-ONLY access to the output database for any relevant user.
    
    b)  Don't grant any access to the secret mapping database! This is for
        trusted superusers only.
    
    c)  You're all set.
    
    ===============================================================================
    5. FRONT END (SQL BUILDER) FOR NON-SQL AFICIONADOS
    ===============================================================================
    
    ??? Python web site script, reading from data dictionary, etc.
    
        - Script knows field type, from data dictionary.
        - Collapsible list. Comments.
        - Possible operators include:
    
            <   <=   =   >=   >
                    !=
    
        - For strings: =    LIKE / MATCH(x) AGAINST (...)
    
        - SELECT outputfields WHERE constraints
        - preview with COUNT(*) as output
    
        - automatic joins on patient
          (NOT: separate query)
    
        - SQL preview
        - ??raw SQL option - danger of SQL injection
    
    ... all too much effort for what you get? Everyone doing it seriously will use
    SQL.
    
    The front end should be accessible via HTTPS only.
    
    ===============================================================================
    6. HOT-SWAPPING TWO DATABASES
    ===============================================================================
    
    Since anonymisation is slow, you may want a live research database and another
    that you can update offline. When you're ready to swap, you'll want to
    
        - create DEFUNCT
        - rename LIVE -> DEFUNCT
        - rename OFFLINE -> LIVE
    
    then either revert:
        - rename LIVE -> OFFLINE
        - rename DEFUNCT -> LIVE
    
    or commit:
        - drop DEFUNCT
    
    How?
    
        http://stackoverflow.com/questions/67093/how-do-i-quickly-rename-a-mysql-database-change-schema-name
        https://gist.github.com/michaelmior/1173781
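
    As a rough sketch of the usual workaround (MySQL has no RENAME DATABASE;
    connection details and schema names below are placeholders, and any views
    would need separate handling):

        import MySQLdb  # or pymysql

        def rename_database(conn, old_db, new_db):
            """Move every base table from old_db into new_db."""
            cursor = conn.cursor()
            cursor.execute("CREATE DATABASE IF NOT EXISTS `{}`".format(new_db))
            cursor.execute(
                "SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = %s AND table_type = 'BASE TABLE'",
                (old_db,))
            for (table,) in cursor.fetchall():
                cursor.execute("RENAME TABLE `{}`.`{}` TO `{}`.`{}`".format(
                    old_db, table, new_db, table))

        conn = MySQLdb.connect(host="localhost", user="root", passwd="...")
        rename_database(conn, "research_live", "research_defunct")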
    
    ===============================================================================
    7. AUDITING ACCESS TO THE RESEARCH DATABASE
    ===============================================================================
    
    a)  MYSQL RAW ACCESS.
    
      - You need an auditing tool, so we've provided one; see the contents of the
        "mysql_auditor" directory.
      - Download and install mysql-proxy, at least version 0.8.5, from
            https://dev.mysql.com/downloads/mysql-proxy/
        Install its files somewhere sensible.
      - Configure (by editing) mysql_auditor.sh
      - Run it. By default it's configured for daemon mode. So you can do this:
            sudo ./mysql_auditor.sh CONFIGFILE start
      - By default the logs go in /var/log/mysql_auditor; the audit*.log files
        contain the queries, and the mysqlproxy*.log files contain information from
        the mysql-proxy program.
      - The audit log is a comma-separated value (CSV) file with these columns:
            - date/time, in ISO-8601 format with local timezone information,
              e.g. "2015-06-24T12:58:29+0100";
            - client IP address/port, e.g. "127.0.0.1:52965";
            - MySQL username, e.g. "root";
            - current schema (database), e.g. "test";
            - query, e.g. "SELECT * FROM mytable"
        Query results (or result success/failure status) are not shown.
    
      - To open fresh log files daily, run
            sudo FULLPATH/mysql_auditor.sh CONFIGFILE restart
        daily (e.g. from your /etc/crontab, just after midnight). Logs are named
        e.g. audit_2015_06_24.log, for their creation date.
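
  - For example, to pull out queries touching a particular table from one
    day's log (a sketch; the filename and table name are examples only):

            import csv

            with open("/var/log/mysql_auditor/audit_2015_06_24.log",
                      newline="") as f:
                for when, client, username, schema, query in csv.reader(f):
                    if "mytable" in query.lower():
                        print(when, username, query)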
    
    b)  FRONT END.
    
        The nascent front end will also audit queries.
        (Since this runs a web service that in principle can have access to proper
        data, it's probably better to run a username system rather than rely on
        MySQL usernames alone. Therefore, it can use a single username, and a
        database-based auditing system. The administrator could also pipe its MySQL
        connection via the audit proxy, but doesn't have to.)
    
    ===============================================================================
    X. TROUBLESHOOTING
    ===============================================================================
    
    -------------------------------------------------------------------------------
    Q.  Error: [Microsoft][SQL Server Native Client 11.0]Connection is busy with
        results for another command.
    -------------------------------------------------------------------------------
    A.  In /etc/odbc.ini, for that DSN, set
            MARS_Connection = yes
        - https://msdn.microsoft.com/en-us/library/cfa084cz(v=vs.110).aspx
        - https://msdn.microsoft.com/en-us/library/h32h3abf(v=vs.110).aspx
    - Rationale: We use gen_patient_ids() to iterate through patients, but then
      we fetch data for that patient via the same connection to the source
      database(s). Therefore, we're operating multiple result sets through one
      connection.
    
    -------------------------------------------------------------------------------
    Q.  How to convert from SQL Server to MySQL?
    -------------------------------------------------------------------------------
    A.  MySQL Workbench.
        Use the "ODBC via connection string" option if others aren't working:
            DSN=XXX;UID=YYY;PWD=ZZZ
        If the schema definitions are not seen, it's a permissions issue:
            http://stackoverflow.com/questions/17038716
        ... but you can copy the database using anonymise.py, treating all tables
        as non-patient tables (i.e. doing no actual anonymisation).
    
    -------------------------------------------------------------------------------
    Q.  MySQL server has gone away... ?
    -------------------------------------------------------------------------------
    A.  Probably: processing a big binary field, and MySQL's max_allowed_packet
        parameter is too small. Try increasing it (e.g. from 16M to 32M).
        See also http://www.camcops.org/documentation/server.html
    
    -------------------------------------------------------------------------------
    Q.  What settings do I need in /etc/mysql/my.cnf?
    -------------------------------------------------------------------------------
    A.  Probably these:
    
        [mysqld]
        max_allowed_packet = 32M
    
        innodb_strict_mode = 1
        innodb_file_per_table = 1
        innodb_file_format = Barracuda
    
        # Only for MySQL prior to 5.7.5 (http://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-20.html):
        innodb_log_file_size = 320M
    
        # For more performance, less safety:
        innodb_flush_log_at_trx_commit = 2
    
        # To save memory?
        # Default is 8; suggestion is ncores * 2
        # innodb_thread_concurrency = ...
    
        [mysqldump]
        max_allowed_packet = 32M
    
    -------------------------------------------------------------------------------
    Q.  _mysql_exceptions.OperationalError: (1118, 'Row size too large (> 8126).
        Changing some columns to TEXT or BLOB or using ROW_FORMAT=DYNAMIC or
        ROW_FORMAT=COMPRESSED may help. In current row format, BLOB prefix of 768
        bytes is stored inline.')
    -------------------------------------------------------------------------------
    A.  See above.
        If you need to change the log file size, FOLLOW THIS PROCEDURE:
            https://dev.mysql.com/doc/refman/5.0/en/innodb-data-log-reconfiguration.html
    
    -------------------------------------------------------------------------------
    Q.  Segmentation fault (core dumped)... ?
    -------------------------------------------------------------------------------
    A.  Use the Microsoft JDBC driver instead of the Microsoft ODBC driver for
        Linux; the latter is buggy.
    
    -------------------------------------------------------------------------------
    Q.  "Killed."
    -------------------------------------------------------------------------------
    A.  Out of memory.
        Suggest reducing the MySQL memory footprint; see notes.txt.
        Steps have already been taken to reduce memory usage by the anonymise
        program itself.
    
    -------------------------------------------------------------------------------
    Q.  Anonymisation is slow.
    -------------------------------------------------------------------------------
    A.  (1) Make sure you have indexes created on all patient_id fields, because
            the tool will use these to find (a) values for scrubbing, and (b)
            records for anonymisation. This makes a huge difference!
    
    -------------------------------------------------------------------------------
    Q.  Can't create FULLTEXT index(es)
    -------------------------------------------------------------------------------
    A.  MySQL v5.6 is required to use FULLTEXT indexes with InnoDB tables (as
        opposed to MyISAM tables, which don't support transactions).
    
        On Ubuntu 14.04, default MySQL is 5.5, so use:
    
            sudo apt-get install mysql-server-5.6 mysql-server-core-5.6 \
                mysql-client-5.6 mysql-client-core-5.6
    
    -------------------------------------------------------------------------------
    Q.  How to search with FULLTEXT indexes?
    -------------------------------------------------------------------------------
    
    A.  In conventional SQL, you would use:
            ... WHERE field LIKE '%word%'
    
        In a field having a MySQL FULLTEXT index, you can use:
            ... WHERE MATCH(field) AGAINST ('word')
    
        There are several variants.
        See https://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
    

17.2. Known bugs elsewhere affecting CRATE

17.3. Newer

  • BENCHMARK name blacklisting (with forenames + surnames, minus English words, minus eponyms): speed, precision, recall. Share results with MB.
  • Should privileged clinical queries be in any way integrated with CRATE? Advantages would include allowing the receiving user to run the query themselves, avoiding RDBM intervention and RDBM-to-recipient data transfer considerations, while ensuring that the receiving user doesn’t have unrestricted access (e.g. via SQL Server Management Studio). There may also be a UI advantage. Implementation would be by letting that query use the privileged connection (see get_executed_researchdb_cursor; see also the DJANGO_DEFAULT_CONNECTION constant, currently unused).
  • restrict anonymiser to specific patient IDs (for subset generation +/- custom pseudonym)
  • option to filter out all free text automatically (as part of full anonymisation)
  • Library of public SQL queries (name, description, SQL), editable by RDBM (e.g. “date of most recent progress note”); database-specific, but widely used on one system.
  • Personal library?
  • Personal configurable highlight colours (with default set if none configured).
  • More of JL’s ideas from 8 Jan 2018.
  • XXX consent mode lookup
  • When the Windows service stops, it is still failing to kill child processes. See crate_anon/tools/winservice.py.
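
    One common approach to that last problem (a sketch only; not necessarily
    what winservice.py does or should do) is to terminate the whole child
    process tree via psutil:

        import psutil

        def kill_child_processes(parent_pid, timeout=5):
            """Terminate, then kill, all descendants of the given process."""
            try:
                parent = psutil.Process(parent_pid)
            except psutil.NoSuchProcess:
                return
            children = parent.children(recursive=True)
            for child in children:
                child.terminate()
            gone, alive = psutil.wait_procs(children, timeout=timeout)
            for child in alive:
                child.kill()  # force-kill anything still running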