15.5.3. crate_anon.preprocess.preprocess_pcmis¶
Copyright (C) 2015-2018 Rudolf Cardinal (rudolf@pobox.com).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <http://www.gnu.org/licenses/>.
PCMIS table structure
No proper documentation, but the structure is clear. See pcmis_information_schema.ods
- PatientDetails
PatientID -- PK; patient-defining field; VARCHAR(100)
FirstName
LastName
NHSNumber -- VARCHAR(100)
...
- Other per-patient things: Patient*
- CasesAll
CaseNumber -- appears to be unique; same #records as ReferralDetails
ReferralDate
- Many other per-case things: Case*
CaseNumber -- FK to CasesAll/ReferralDetails
- Group things: linked to cases via GroupMember
IMPORTANTLY: there are only two, Groups and GroupSession
and neither are identifiable.
- Carers: from PatientCarerDetails (CarerNumber, PatientID)
- Children: from PatientChildDetails (ChildNumber, PatientID)
- ReferralDetails
CaseNumber -- appears to be unique; same #records as CasesAll
PatientID -- not unique
PrimaryDiagnosis (e.g. 'F41.1')
- Non-patient stuff we'll filter out:
pcmis_UserProfiles
Users
- Then a lot of other things are index by ContactNumber, which probably
cross-refers to CaseContacts, having
ContactNumber INT
CaseNumber VARCHAR(100)
Decisions re database keys and anonymisation
For RiO, we had integer patient IDs but mangled into a text format. So there were distinct performance advantages in making an integer version. For PCMIS, patient IDs look like ‘JC000001’, ‘SB000001’ (where the letters are unrelated to patients’ initials; I’m not sure what they refer to). There are numerical overlaps if you ignore the letters. So there is no neat integer mapping; we’d be inventing an arbitrary new key if we added one.
So the tradeoff is simplicity (keep textual PK for patients) versus speed (parallel processing based on an integer operation). It’s natural to think of an integer hash of a string, but this hash has to operate in the SQL domain, be portable, and produce an integer (so SQL Server’s HASHBYTES is of no use. At present (2017-05-02), our PCMIS copy has ~53,000 patients in, and there are lots of tables with patients in.
Therefore, DECISION: create an integer PK.
However, we could do this deterministically. Since the length is fixed, and the numerical part goes up to 999999, and the letters are always upper case – ah, no, there are some like <digit><letter>999999. But 0-99 would be fine.
SELECT (
(ASCII(SUBSTRING(PatientID, 1, 1))) * 100000000 +
(ASCII(SUBSTRING(PatientID, 2, 1))) * 1000000 +
CAST(SUBSTRING(PatientID, 3, 6) AS BIGINT)
) AS patient_id_int
FROM PatientDetails
If we’re using SQLAlchemy, then use things like func.substr instead, but it’s a reasonable compromise for now to say that a specific database like PCMIS is going to be hosted on SQL Server, since PCMIS uses that
————– ——————- SQL Server SQLAlchemy ————– ——————- SUBSTR func.substr ASCII ————– ——————-
What about CaseNumber – is that identifying? If not, it can remain the internal key to identify cases. If it is, then we have to replace it. The first character is 1,9,A-Z except Q, X, Y (n=25). The second character is 0,A-Z except I, Q, U, X, Z (n=22). So, pretty spread. The digits seem to be approximately consecutive. So it does look more like an internal PK than something identifiable.
Mind you, very often it is identical to CaseNumber. So, do we need a second hash?
Our PCMIS copy certainly has free text (search the schema for text types).
Therefore, views and the like
MAIN SOFTWARE CHANGES
- Support non-integer PIDs/MPIDs.
- Add an AlterMethod that is hash=hash_config_key_name with e.g.
[hash_config_key_name]
method = hmacsha256
key = somesecretkey
TABLES
- If a table doesn’t have a PK, give it an AUTONUMBER integer PK (e.g. “crate_pk”). That looks to be true of ?all tables.
VIEWS
- In general, not needed: we can use PatientId and CaseNumber as non-integer fields.
- We do need the geography views, though.
DATA DICTIONARY AUTOGENERATIO
- PatientId: always the PID.
- NHSNumber: always the MPID.
- CaseNumber: belongs in ddgen_extra_hash_fields, and users should give it the same hash key as for the PID-to-RID conversion, since it’s often the same code.