15.1.17. crate_anon.anonymise.scrub


Copyright (C) 2015-2018 Rudolf Cardinal (rudolf@pobox.com).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <http://www.gnu.org/licenses/>.


Scrubber classes for CRATE anonymiser.

class crate_anon.anonymise.scrub.NonspecificScrubber(replacement_text: str, hasher: cardinal_pythonlib.hash.GenericHasher, anonymise_codes_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = True, blacklist: crate_anon.anonymise.scrub.WordList = None, scrub_all_numbers_of_n_digits: List[int] = None, scrub_all_uk_postcodes: bool = False)

get_hash() → str

Returns a hash of the scrubber itself.

scrub(text: str) → str

Returns a scrubbed version of the text.
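
The constructor options scrub_all_numbers_of_n_digits and anonymise_numbers_at_word_boundaries_only suggest regex-based replacement of digit runs. The following is a minimal sketch of that idea only; scrub_n_digit_numbers and its parameters are hypothetical names, not part of CRATE, and the real class may build its patterns differently.

```python
import re
from typing import List


def scrub_n_digit_numbers(text: str,
                          digit_counts: List[int],
                          replacement_text: str = "[~~~]",
                          at_word_boundaries_only: bool = True) -> str:
    """Hypothetical helper: replace every run of exactly n digits."""
    for n in digit_counts:
        if at_word_boundaries_only:
            # The digit run must stand alone as a "word".
            pattern = rf"\b\d{{{n}}}\b"
        else:
            # The run may touch letters, but must still be exactly n digits.
            pattern = rf"(?<!\d)\d{{{n}}}(?!\d)"
        # A function replacement avoids backslash processing in the text.
        text = re.sub(pattern, lambda _m: replacement_text, text)
    return text
```

With word boundaries enforced, a ten-digit number embedded in a longer code such as "AB1234567890" is left alone; relaxing the boundary requirement scrubs it too.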

class crate_anon.anonymise.scrub.PersonalizedScrubber(replacement_text_patient: str, replacement_text_third_party: str, hasher: cardinal_pythonlib.hash.GenericHasher, anonymise_codes_at_word_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = True, anonymise_numbers_at_numeric_boundaries_only: bool = True, anonymise_strings_at_word_boundaries_only: bool = True, min_string_length_for_errors: int = 4, min_string_length_to_scrub_with: int = 3, scrub_string_suffixes: List[str] = None, string_max_regex_errors: int = 0, whitelist: crate_anon.anonymise.scrub.WordList = None, nonspecific_scrubber: crate_anon.anonymise.scrub.NonspecificScrubber = None, debug: bool = False)

Accepts patient-specific (patient and third-party) information, and uses that to scrub text.

add_value(value: Any, scrub_method: crate_anon.anonymise.constants.SCRUBMETHOD, patient: bool = True, clear_cache: bool = True) → None

Add a specific value via a specific scrub_method.

The patient flag controls whether it’s treated as a patient value or a third-party value.
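
To illustrate how the patient flag could route a value to one of two replacement texts, here is a small self-contained sketch. TinyPersonalizedScrubber is a hypothetical stand-in written for this example; it ignores scrub methods, caching, whitelists and fuzzy matching, and should not be read as the CRATE implementation.

```python
import re
from typing import List


class TinyPersonalizedScrubber:
    """Hypothetical stand-in for illustration; not the CRATE class."""

    def __init__(self, replacement_text_patient: str,
                 replacement_text_third_party: str) -> None:
        self.replacement_text_patient = replacement_text_patient
        self.replacement_text_third_party = replacement_text_third_party
        self.patient_values: List[str] = []
        self.tp_values: List[str] = []

    def add_value(self, value: str, patient: bool = True) -> None:
        # The flag decides which list (and so which replacement) applies.
        target = self.patient_values if patient else self.tp_values
        if value and value not in target:
            target.append(value)

    @staticmethod
    def _replace(text: str, values: List[str], replacement: str) -> str:
        if not values:
            return text
        # Longest first, so longer values win over their own prefixes.
        ordered = sorted(values, key=len, reverse=True)
        alternatives = "|".join(re.escape(v) for v in ordered)
        pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
        return pattern.sub(lambda _m: replacement, text)

    def scrub(self, text: str) -> str:
        text = self._replace(text, self.patient_values,
                             self.replacement_text_patient)
        return self._replace(text, self.tp_values,
                             self.replacement_text_third_party)
```

For example, after add_value("Alice") and add_value("Bob", patient=False), scrubbing "Alice visited with Bob." replaces the first name with the patient text and the second with the third-party text.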

get_hash() → str

Returns a hash of the scrubber itself.

get_patient_regex_string() → str

Return the string version of the patient regex, sorted.

static get_scrub_method(datatype_long: str, scrub_method: Union[crate_anon.anonymise.constants.SCRUBMETHOD, NoneType]) → crate_anon.anonymise.constants.SCRUBMETHOD

Return the default scrub method for a given SQL datatype, unless overridden.
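
The "default unless overridden" behaviour can be sketched as below. Everything here is an assumption made for illustration: the ScrubMethod enum is a hypothetical stand-in for SCRUBMETHOD, and the mapping from SQL type names to methods is a guess, not taken from the CRATE source.

```python
from enum import Enum
from typing import Optional


class ScrubMethod(Enum):
    """Hypothetical stand-in for crate_anon's SCRUBMETHOD."""
    DATE = "date"
    NUMERIC = "number"
    WORDS = "words"


def default_scrub_method(datatype_long: str,
                         scrub_method: Optional[ScrubMethod]) -> ScrubMethod:
    """Return the override if given, else a default based on the SQL type."""
    if scrub_method is not None:
        return scrub_method  # an explicit choice always wins
    # Strip any length/precision suffix: "VARCHAR(255)" -> "VARCHAR".
    base_type = datatype_long.upper().split("(")[0].strip()
    if base_type in ("DATE", "DATETIME", "TIMESTAMP"):
        return ScrubMethod.DATE
    if base_type in ("CHAR", "VARCHAR", "NVARCHAR", "TEXT", "LONGTEXT"):
        return ScrubMethod.WORDS
    # Assumed fallback: treat everything else as numeric.
    return ScrubMethod.NUMERIC
```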

get_tp_regex_string() → str

Return the string version of the third-party regex, sorted.

scrub(text: str) → Union[str, NoneType]

Scrub some text and return the scrubbed result.

class crate_anon.anonymise.scrub.ScrubberBase(hasher: cardinal_pythonlib.hash.GenericHasher)

Scrubber base class.

get_hash() → str

Returns a hash of the scrubber itself.

scrub(text: str) → str

Returns a scrubbed version of the text.
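
The two-method interface (scrub plus get_hash) can be sketched as an abstract base class with one toy subclass. This is an illustration under stated assumptions: SketchScrubberBase and FixedWordScrubber are hypothetical names, and SHA-256 from hashlib stands in for the GenericHasher that the real constructor receives. The point shown is that hashing a normalised (sorted) description of the configuration gives the same hash for equivalent scrubbers and a different one when the configuration changes.

```python
import hashlib
import re
from abc import ABC, abstractmethod
from typing import Iterable


class SketchScrubberBase(ABC):
    """Hypothetical interface mirroring the two documented methods."""

    @abstractmethod
    def scrub(self, text: str) -> str:
        """Return a scrubbed version of the text."""

    @abstractmethod
    def get_hash(self) -> str:
        """Return a hash describing this scrubber's configuration."""


class FixedWordScrubber(SketchScrubberBase):
    def __init__(self, words: Iterable[str],
                 replacement_text: str = "[---]") -> None:
        # Sorting makes the configuration independent of input order.
        self.words = sorted({w.lower() for w in words})
        self.replacement_text = replacement_text

    def scrub(self, text: str) -> str:
        if not self.words:
            return text
        alternatives = "|".join(re.escape(w) for w in self.words)
        return re.sub(rf"\b(?:{alternatives})\b",
                      lambda _m: self.replacement_text,
                      text, flags=re.IGNORECASE)

    def get_hash(self) -> str:
        # Assumption: SHA-256 replaces the injected hasher object.
        signature = repr((self.replacement_text, self.words))
        return hashlib.sha256(signature.encode("utf-8")).hexdigest()
```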

class crate_anon.anonymise.scrub.WordList(filenames: Iterable[str] = None, words: Iterable[str] = None, replacement_text: str = '[---]', hasher: cardinal_pythonlib.hash.GenericHasher = None, suffixes: List[str] = None, at_word_boundaries_only: bool = True, max_errors: int = 0, regex_method: bool = False)

clear_cache() → None

Clear cached information.

get_hash() → str

Returns a hash of the scrubber itself.

scrub(text: str) → str

Returns a scrubbed version of the text.
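
As a rough illustration of how the words, suffixes and at_word_boundaries_only options might combine into a single regex, consider the sketch below. scrub_with_wordlist is a hypothetical function written for this example; it does not read files, cache anything, or support max_errors, and it may differ from what WordList actually does.

```python
import re
from typing import Iterable, List, Optional


def scrub_with_wordlist(text: str,
                        words: Iterable[str],
                        replacement_text: str = "[---]",
                        suffixes: Optional[List[str]] = None,
                        at_word_boundaries_only: bool = True) -> str:
    """Hypothetical helper: replace listed words, with optional suffixes."""
    cleaned = {w.strip() for w in words if w.strip()}
    if not cleaned:
        return text  # nothing to scrub; avoid building an empty pattern
    # Longest first, so longer words are preferred over their prefixes.
    ordered = sorted(cleaned, key=lambda w: (-len(w), w))
    core = "(?:" + "|".join(re.escape(w) for w in ordered) + ")"
    if suffixes:
        # Each word may optionally be followed by one of the suffixes.
        core += "(?:" + "|".join(re.escape(s) for s in suffixes) + ")?"
    if at_word_boundaries_only:
        core = r"\b" + core + r"\b"
    pattern = re.compile(core, re.IGNORECASE)
    return pattern.sub(lambda _m: replacement_text, text)
```

With words=["smith"] and suffixes=["s"], both "Smith" and "Smiths" are replaced, while "smithy" survives unless at_word_boundaries_only is set to False.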