15.1.3. crate_anon.anonymise.anonymise


Copyright (C) 2015-2018 Rudolf Cardinal (rudolf@pobox.com).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <http://www.gnu.org/licenses/>.


Anonymise multiple SQL-based databases using a data dictionary.

crate_anon.anonymise.anonymise.anonymise(args: Any) → None

Main entry point.

crate_anon.anonymise.anonymise.commit_admindb() → None

Execute a COMMIT on the admin database, which is using ORM sessions.

crate_anon.anonymise.anonymise.commit_destdb() → None

Execute a COMMIT on the destination database, and reset row counts.

crate_anon.anonymise.anonymise.create_indexes(tasknum: int = 0, ntasks: int = 1) → None

Create indexes for the destination tables.

crate_anon.anonymise.anonymise.delete_dest_rows_with_no_src_row(srcdbname: str, src_table: str, report_every: int = 100000, chunksize: int = 100000) → None

For a given source database/table, delete any rows in the corresponding destination table where there is no corresponding source row.

  • Can’t do this in a single SQL command, since the engine can’t necessarily see both databases.
  • Can’t do this in a multiprocess way, because we’re trying to do a DELETE WHERE NOT IN.
  • However, we can get stupidly long query lists if we try to SELECT all the values and use a DELETE FROM x WHERE y NOT IN (v1, v2, v3, …) query. This crashes the MySQL connection, etc.
  • Therefore, we need a temporary table in the destination.
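The reasoning in the list above can be sketched as follows. This is a minimal illustration, not CRATE's implementation: it uses two in-memory SQLite databases to stand in for a source and a destination that cannot see each other, and the names delete_orphans, chunked and _src_pks are hypothetical. Source PKs are copied into a temporary table in the destination in chunks, and the DELETE then refers only to destination-side tables.

```python
import sqlite3
from typing import Iterable, Iterator, List


def chunked(values: Iterable[int], chunksize: int) -> Iterator[List[int]]:
    """Yield successive lists of at most chunksize values."""
    chunk: List[int] = []
    for value in values:
        chunk.append(value)
        if len(chunk) >= chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def delete_orphans(src: sqlite3.Connection, dest: sqlite3.Connection,
                   table: str, pk: str, chunksize: int = 1000) -> int:
    """Delete destination rows whose PK is absent from the source table.

    Returns the number of rows deleted. Table and column names are
    interpolated into the SQL, so they must come from a trusted source.
    """
    # Hypothetical temporary table holding the source PKs (destination side).
    dest.execute("CREATE TEMP TABLE _src_pks (pk INTEGER PRIMARY KEY)")
    src_pks = (row[0] for row in src.execute(f"SELECT {pk} FROM {table}"))
    for chunk in chunked(src_pks, chunksize):
        # Insert in bounded chunks rather than building one huge query.
        dest.executemany("INSERT INTO _src_pks (pk) VALUES (?)",
                         [(value,) for value in chunk])
    cursor = dest.execute(
        f"DELETE FROM {table} WHERE {pk} NOT IN (SELECT pk FROM _src_pks)")
    deleted = cursor.rowcount
    dest.execute("DROP TABLE _src_pks")
    dest.commit()
    return deleted


# Usage: the source has lost row 2, so the destination copy of it is removed.
src = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for db in (src, dest):
    db.execute("CREATE TABLE note (id INTEGER PRIMARY KEY, txt TEXT)")
src.executemany("INSERT INTO note VALUES (?, ?)", [(1, "a"), (3, "c")])
dest.executemany("INSERT INTO note VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
print(delete_orphans(src, dest, "note", "id", chunksize=1))  # 1
```

The query text stays constant in size however many source rows exist, which is the point of the temporary table.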

crate_anon.anonymise.anonymise.drop_remake(incremental: bool = False, skipdelete: bool = False) → None

Drop and rebuild (a) mapping table, (b) destination tables. If incremental is True, doesn’t drop tables; just deletes destination information where source information no longer exists.

crate_anon.anonymise.anonymise.estimate_count_patients() → int

We can’t easily and quickly get the total number of patients, because they may be defined in multiple tables across multiple databases. We shouldn’t fetch them all into Python in case there are billions, and it’s a waste of effort to stash them in a temporary table and count unique rows, because this is all only for a progress indicator. So we approximate:

crate_anon.anonymise.anonymise.gen_index_row_sets_by_table(tasknum: int = 0, ntasks: int = 1) → Generator[[Tuple[str, List[crate_anon.anonymise.ddr.DataDictionaryRow]], NoneType], NoneType]

Generate (table, list-of-DD-rows-for-indexed-fields) tuples for all tables requiring indexing.

crate_anon.anonymise.anonymise.gen_nonpatient_tables_with_int_pk() → Generator[[Tuple[str, str, str], NoneType], NoneType]

Generate (source db name, source table, PK name) tuples for all tables that (a) don’t contain patient information and (b) do have an integer PK.

crate_anon.anonymise.anonymise.gen_nonpatient_tables_without_int_pk(tasknum: int = 0, ntasks: int = 1) → Generator[[Tuple[str, str], NoneType], NoneType]

Generate (source db name, source table) tuples for all tables that (a) don’t contain patient information and (b) don’t have an integer PK.

crate_anon.anonymise.anonymise.gen_patient_ids(tasknum: int = 0, ntasks: int = 1) → Generator[[int, NoneType], NoneType]

Generate patient IDs.

sources: dictionary mapping db name (key) to rnc_db database object (value)

crate_anon.anonymise.anonymise.gen_pks(srcdbname: str, tablename: str, pkname: str) → Generator[[int, NoneType], NoneType]

Generate PK values from a table.

crate_anon.anonymise.anonymise.gen_rows(dbname: str, sourcetable: str, sourcefields: Iterable[str], pid: Union[int, str] = None, intpkname: str = None, tasknum: int = 0, ntasks: int = 1, debuglimit: int = 0) → Generator[[List[Any], NoneType], NoneType]

Generate rows from a source table, optionally restricted to a single patient. Each row is a list of values, one for each field in sourcefields.

If the table has a PK and we’re operating in a multitasking situation, generate just the rows for this task (thread/process).
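One common way to give each task its own share of rows is to partition on the integer PK modulo the number of tasks. The sketch below assumes that scheme for illustration only; the source does not state which partitioning rule gen_rows uses, and gen_rows_for_task is a hypothetical name. It runs against SQLite.

```python
import sqlite3
from typing import Any, Iterator, List, Optional, Sequence, Tuple


def gen_rows_for_task(conn: sqlite3.Connection, table: str,
                      fields: Sequence[str],
                      intpkname: Optional[str] = None,
                      tasknum: int = 0,
                      ntasks: int = 1) -> Iterator[List[Any]]:
    """Yield rows as lists, restricted to this task's share if possible.

    Assumed rule (not taken from the source): a row belongs to task
    tasknum when pk % ntasks == tasknum. Without an integer PK, or with
    a single task, every row is generated.
    """
    columns = ", ".join(fields)
    sql = f"SELECT {columns} FROM {table}"
    params: Tuple[int, ...] = ()
    if intpkname is not None and ntasks > 1:
        sql += f" WHERE {intpkname} % ? = ?"
        params = (ntasks, tasknum)
    for row in conn.execute(sql, params):
        yield list(row)


# Usage: six rows split across three tasks, with no overlap between tasks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE progress_note (id INTEGER PRIMARY KEY, txt TEXT)")
conn.executemany("INSERT INTO progress_note VALUES (?, ?)",
                 [(i, f"note {i}") for i in range(1, 7)])
for task in range(3):
    ids = sorted(row[0] for row in gen_rows_for_task(
        conn, "progress_note", ["id", "txt"], intpkname="id",
        tasknum=task, ntasks=3))
    print(task, ids)
```

Because every PK falls into exactly one residue class, the tasks together cover each row once.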

crate_anon.anonymise.anonymise.identical_record_exists_by_hash(dest_table: str, pkfield: str, pkvalue: int, hashvalue: str) → bool

For a given PK in a given destination table, is there a record with the specified value for its source hash?
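A check of this kind lets an incremental run skip rows that have not changed since they were last processed. The sketch below shows the idea against SQLite; the column name src_hash, the helper row_hash and the use of SHA-256 are assumptions made for illustration and are not taken from the source.

```python
import hashlib
import sqlite3
from typing import Any, Sequence


def row_hash(values: Sequence[Any]) -> str:
    """Hash a source row's values (illustrative scheme, not CRATE's)."""
    joined = "\x1f".join(repr(value) for value in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def record_exists_by_hash(conn: sqlite3.Connection, dest_table: str,
                          pkfield: str, pkvalue: int, hashvalue: str,
                          hashfield: str = "src_hash") -> bool:
    """Is there a destination row with this PK and this stored source hash?"""
    sql = (f"SELECT 1 FROM {dest_table} "
           f"WHERE {pkfield} = ? AND {hashfield} = ? LIMIT 1")
    return conn.execute(sql, (pkvalue, hashvalue)).fetchone() is not None


# Usage: an unchanged source row matches; an edited one does not.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE note (id INTEGER PRIMARY KEY, txt TEXT, src_hash TEXT)")
original = (1, "seen in clinic")
conn.execute("INSERT INTO note VALUES (?, ?, ?)",
             (1, "seen in clinic", row_hash(original)))
print(record_exists_by_hash(conn, "note", "id", 1, row_hash(original)))
print(record_exists_by_hash(conn, "note", "id", 1,
                            row_hash((1, "seen in clinic again"))))
```

The first call prints True and the second False, so only the edited row would need reprocessing.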

crate_anon.anonymise.anonymise.identical_record_exists_by_pk(dest_table: str, pkfield: str, pkvalue: int) → bool

For a given PK in a given destination table, does a record exist?

crate_anon.anonymise.anonymise.opting_out_mpid(mpid: Union[int, str]) → bool

Does this patient, identified by master patient ID (MPID), wish to opt out?

crate_anon.anonymise.anonymise.opting_out_pid(pid: Union[int, str]) → bool

Does this patient, identified by patient ID (PID), wish to opt out?

crate_anon.anonymise.anonymise.patient_processing_fn(tasknum: int = 0, ntasks: int = 1, incremental: bool = False) → None

Iterate through patient IDs; build the scrubber for each patient; process source data for that patient, scrubbing it; insert the patient into the mapping table in the admin database.

crate_anon.anonymise.anonymise.process_nonpatient_tables(tasknum: int = 0, ntasks: int = 1, incremental: bool = False) → None

Copies all non-patient tables. If they have an integer PK, the work may be parallelized. If not, whole tables are assigned to different processes in parallel mode.
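For tables without an integer PK, the unit of parallel work is the whole table. A simple way to share tables among processes is round-robin assignment over a stable ordering, sketched below. The round-robin rule and the name tables_for_task are assumptions; the source says only that whole tables are assigned to different processes.

```python
from typing import Iterable, Iterator, Tuple

TableRef = Tuple[str, str]  # (source db name, source table)


def tables_for_task(tables: Iterable[TableRef], tasknum: int = 0,
                    ntasks: int = 1) -> Iterator[TableRef]:
    """Yield the tables that this task should copy in full.

    Tables are sorted first so that every process, working independently,
    derives the same ordering and hence a consistent assignment.
    """
    for index, table in enumerate(sorted(tables)):
        if index % ntasks == tasknum:
            yield table


# Usage: three tables shared between two processes.
all_tables = [("srcdb", "ward_lookup"), ("srcdb", "drug_codes"),
              ("srcdb", "clinic_list")]
for task in range(2):
    print(task, list(tables_for_task(all_tables, tasknum=task, ntasks=2)))
```

With a single task (the defaults), the generator yields every table, which matches non-parallel operation.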

crate_anon.anonymise.anonymise.process_patient_tables(tasknum: int = 0, ntasks: int = 1, incremental: bool = False) → None

Process all patient tables, optionally in a parallel-processing fashion.

crate_anon.anonymise.anonymise.process_table(sourcedbname: str, sourcetable: str, patient: crate_anon.anonymise.patient.Patient = None, incremental: bool = False, intpkname: str = None, tasknum: int = 0, ntasks: int = 1) → None

Process a table. This can either be a patient table (in which case the patient’s scrubber is applied and only rows for that patient are processed) or not (in which case the table is just copied).
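The two modes described above (scrub for a patient table, plain copy otherwise) can be illustrated with a toy example. This is a deliberately simplified sketch and not CRATE's scrubber: make_scrubber and process_rows are hypothetical names, and the whole-word regular-expression replacement stands in for whatever the real per-patient scrubber does.

```python
import re
from typing import Any, Callable, Iterable, Iterator, List, Optional

Scrubber = Callable[[str], str]


def make_scrubber(identifiers: Iterable[str],
                  replacement: str = "[XXX]") -> Scrubber:
    """Build a toy scrubber that blanks out the given identifiers.

    Matches whole words, case-insensitively. With no identifiers, the
    returned function leaves text unchanged.
    """
    words = [re.escape(item) for item in identifiers if item]
    if not words:
        return lambda text: text
    pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
    return lambda text: pattern.sub(replacement, text)


def process_rows(rows: Iterable[List[Any]],
                 scrubber: Optional[Scrubber] = None) -> Iterator[List[Any]]:
    """Scrub string values if a scrubber is given; otherwise just copy."""
    for row in rows:
        if scrubber is None:
            yield list(row)  # non-patient table: copied as-is
        else:
            yield [scrubber(value) if isinstance(value, str) else value
                   for value in row]


# Usage: the same row handled as patient data and as non-patient data.
rows = [[1, "Alice Smith was seen in clinic"]]
scrub = make_scrubber(["Alice", "Smith"])
print(list(process_rows(rows, scrub)))  # [[1, '[XXX] [XXX] was seen in clinic']]
print(list(process_rows(rows)))         # [[1, 'Alice Smith was seen in clinic']]
```

Non-string values such as the integer PK pass through untouched in both modes.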

crate_anon.anonymise.anonymise.show_dest_counts() → None

Show the number of records in all destination tables.

crate_anon.anonymise.anonymise.show_source_counts() → None

Show the number of records in all source tables.

crate_anon.anonymise.anonymise.wipe_and_recreate_destination_db(incremental: bool = False) → None

Drop and recreate all destination tables (as specified in the DD) in the destination database.

crate_anon.anonymise.anonymise.wipe_opt_out_patients(report_every: int = 1000, chunksize: int = 10000) → None

Delete any data from patients that have opted out (after their data was processed on a previous occasion). (Slightly complicated by the fact that the destination database can’t necessarily ‘see’ the mapping database, so we need to cache the RID keys in the destination database temporarily.)