6.6. Data dictionary (DD)¶
The data dictionary is a TSV file with a single header row, and columns as defined below. (The DD columns can be in any order as long as the header row matches the data, and the column heading names are exactly as follows.)
Once you have edited your anonymiser config file to point to your source database, you can generate a draft data dictionary like this:
crate_anonymise --draftdd > mydd.tsv
Now edit the data dictionary as required. Then make your config file point to the data dictionary you want to use.
6.6.1. src_db¶
This column specifies the source database, using a name that matches one from
the source_databases
list in the config file.
6.6.2. src_table¶
This column specifies the table name in the source database.
6.6.3. src_field¶
This column specifies the field (column) name in the source database.
6.6.4. src_datatype¶
This column gives the source column’s SQL data type (e.g. INT, VARCHAR(50)).
6.6.5. src_flags¶
This field can be blank or can contain a string made up of one or more characters. The characters have the following meanings:
Character | Meaning |
---|---|
K |
PK.
This field is the primary key (PK) for the table it’s in.
|
H | ADD SOURCE HASH.
Add source hash of the record, for incremental updates?
|
C |
CONSTANT.
Record contents are constant (will not change) for a given PK.
|
A |
ADDITION ONLY.
Addition only. It is assumed that records can only be added, not
deleted.
|
P |
PRIMARY PID.
Primary patient ID field. If set,
|
* |
DEFINES PRIMARY PIDS.
This field defines primary PIDs. If set, this row will be used
to search for all patient IDs, and will define them for this
database. Only those patients will be processed (for all tables
containing patient info). Typically, this flag is applied to a
SINGLE field in a SINGLE table, usually the principal patient
registration/demographics table.
|
M |
MASTER PID.
Master ID (e.g. NHS number).
The field will be hashed with the master PID hasher.
|
! |
OPT OUT.
This field is used to mark that the patient wishes to opt out
entirely. It must be in a table that also has a primary patient
ID field (because that’s the ID that will be omitted). If the
opt-out field contains a value that’s defined in the
optout_col_values setting (see below), that patient will be opted
out entirely from the anonymised database.
|
R |
REQUIRED SCRUBBER.
If this field is a scrub_src field (see below), and this flag is
set, then at least one non-NULL value for this field must be
present for each patient, or no information will be processed for
this patient. (Typical use: where you have a master patient index
separate from the patient name table, and data might have been
brought across partially, so there are some missing names. In
this situation, text might go unscrubbed because the names are
missing. Setting this flag for the name field will prevent this.)
|
6.6.6. scrub_src¶
One of the following values, or blank:
Value | Meaning |
---|---|
patient |
Contains patient-identifiable information that must be
removed from scrub_in fields. |
thirdparty |
Contains identifiable information about a carer,
family member, or other third party, which must be
removed from scrub_in fields. |
thirdparty_xref_pid |
This field is a patient identifier for ANOTHER patient (such as a relative). The scrubber should recursively include THAT patient’s identifying information as third-party information for THIS patient. |
6.6.7. scrub_method¶
Applicable to scrub_src fields, this column determines the manner in which this field should be treated for scrubbing. It must be one of the following values (or blank):
Value | Meaning |
---|---|
words |
Treat as a set of textual words. This is the default for all textual fields (e.g. CHAR, VARCHAR, TEXT). Typically used for names. Also OK for e-mail addresses. |
phrase |
Treat as a textual phrase (a sequence of words to be replaced only when they occur in sequence). Typically used for address components. |
number |
Treat as a number. This is the default for all numeric fields (e.g. INTEGER, FLOAT). If you have a phone number in a text field, use this method; it will be scrubbed regardless of spacing/punctuation. |
code |
Teat as an alphanumeric code. Suited to postcodes. Very like the numeric method, but permits non-digits. |
date |
Treat as a date. This is the default for all DATE/DATETIME fields. |
6.6.8. decision¶
One of the following two values:
Value | Meaning |
---|---|
OMIT |
Omit the field from the output entirely. |
include |
Include it. |
This is case sensitive, for safety.
6.6.9. inclusion_values¶
Either blank, or an expression that evaluates to a Python iterable (e.g. list or tuple) with Python’s ast.literal_eval() function (see https://docs.python.org/3.4/library/ast.html).
- If this is not blank/None, then it serves as a ROW INCLUSION LIST – the source row will only be processed if the field’s value is one of the inclusion values.
- It applies to the raw value from the database (before any transformation via
alter_method
). - This is not applied to
scrub_src
fields (which contribute to the scrubber regardless). - Note that
[None]
is a list with one member, None, whereasNone
is equivalent to leaving the field blank.
Examples:
[None, 0]
[True, 1, 'yes', 'true', 'Yes', 'True']
6.6.10. exclusion_values¶
As for inclusion_values
, but the row is excluded if the field’s value is in
the exclusion_values list.
6.6.11. alter_method¶
Manner in which to alter the data. Blank, or a comma-separated list of one or more of:
Component | Meaning |
---|---|
scrub |
Scrub in. Applies to text fields only. The field will have its contents anonymised (using information from other fields). Use this for any text field that end users might store free-text comments in. |
truncate_date |
Truncate this date to the first of the month. Applicable to text or date-as-text fields. |
binary_to_text=EXTFIELDNAME |
Convert a binary field (e.g. `VARBINARY`,
`BLOB`) to text (e.g. `LONGTEXT`). The binary
data is taken to be the representation of a
document. The field EXTFIELDNAME, which must
be in the same source table, must contain the
file extension (e.g. 'pdf' , '.pdf' ) or
a filename with that extension (e.g.
'/some/path/mything.pdf' ), so that the
anonymiser knows how to treat the binary data
to extract text from it. |
filename_format_to_text=FMT |
A more powerful way of specifying a filename that can be created using data from this table. The FMT parameter is an unquoted Python str.format() string; see https://docs.python.org/3.4/library/stdtypes.html#str.format. The dictionary passed to format() is created from all fields in the row. Using an example from RiO: if your
ClientDocuments table contains a ClientID
column (e.g. filename_format_to_text=C:\some\path\{ClientID}\docs\{Path}
You probably want to apply this
|
filename_to_text |
As for the binary-to-text option, but the field contains a filename (the contents of which is converted to text), rather than containing binary data directly. |
skip_if_extract_fails |
If one of the text extraction methods is specified, and this flag is also specified, then the data row will be skipped if text extrcation fails (rather than inserted with a NULL value for the text). This is helpful, for example, if your text-processing pipeline breaks; the option prevents rows being created erroneously with NULL text values, so that a subsequent incremental update will fix the problems once you’ve fixed your text extraction tools. |
html_unescape |
HTML encoding is removed, e.g. convert
& to & and < to < |
html_untag |
HTML tags are removed, e.g. from
<a href="http://somewhere">see link</a>
to see link |
You can specify multiple options separated by commas.
Not all are compatible (e.g. scrubbing is for text; date truncation is for dates).
If there’s more than one, text extraction from BLOBs/files is performed first. After that, they are executed in sequence. (The position of the skip-if-text-extraction-fails flag is immaterial.)
A typical combination might be:
filename_to_text,skip_if_extract_fails,scrub
or:
html_untag,html_unescape,scrub
6.6.12. dest_table¶
Table name in the destination database.
6.6.13. dest_field¶
Field (column) name in the destination database.
6.6.14. dest_datatype¶
SQL data type in the destination database.
If omitted, the source SQL data type is translated appropriately.
6.6.15. index¶
One of:
Value | Meaning |
---|---|
(blank) | No index. |
I |
Create a normal index on the destination field. |
U |
Create a unique index on the destination field. |
F |
Create a FULLTEXT index, for rapid searching within long text fields. Only applicable to one field per table. |
6.6.16. indexlen¶
Integer. Can be blank. If not, sets the prefix length of the index. This is mandatory in MySQL if you apply a normal (+/- unique) index to a TEXT or BLOB field. It is not required for FULLTEXT indexes.
6.6.17. comment¶
Field (column) comment, stored in the destination database.