15.1.2. crate_anon.anonymise.anonregex


Copyright (C) 2015-2018 Rudolf Cardinal (rudolf@pobox.com).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <http://www.gnu.org/licenses/>.


class crate_anon.anonymise.anonregex.TestAnonRegexes(methodName='runTest')[source]
crate_anon.anonymise.anonregex.escape_literal_string_for_regex(s: str) → str[source]

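Escape any regex metacharacters in s, so that the result matches s literally.

Start with \ -> \\
… this should be the first replacement in REGEX_METACHARS; otherwise the backslashes added when escaping other characters would themselves be escaped.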
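
The ordering point can be illustrated with a minimal sketch. This is not CRATE's implementation: the list METACHARS_SKETCH and the function escape_literal_sketch are hypothetical names chosen for this example, and the standard library's re.escape is shown only for comparison.

```python
import re

# Hypothetical metacharacter list for illustration. The backslash comes first,
# so that backslashes added by later replacements are not escaped again.
METACHARS_SKETCH = ["\\", "^", "$", ".", "|", "?", "*", "+",
                    "(", ")", "[", "]", "{", "}"]


def escape_literal_sketch(s: str) -> str:
    """Prefix every regex metacharacter in s with a backslash."""
    for c in METACHARS_SKETCH:
        s = s.replace(c, "\\" + c)  # str.replace returns a new string
    return s


literal = r"a.b\c (x)"
pattern = escape_literal_sketch(literal)
# The escaped pattern matches the literal text, and "." no longer matches "X".
assert re.fullmatch(pattern, literal) is not None
assert re.fullmatch(pattern, r"aXb\c (x)") is None
# re.escape gives the same guarantee using only the standard library.
assert re.fullmatch(re.escape(literal), literal) is not None
```

If the backslash were processed last instead, the backslash already inserted before "." would be doubled, and the pattern would no longer match the original text.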
crate_anon.anonymise.anonregex.get_anon_fragments_from_string(s: str) → List[str][source]

Takes a complex string, such as a name or address with its components separated by spaces, commas, etc., and returns a list of substrings to be used for anonymisation.
  • For example, from “John Smith”, return [“John”, “Smith”]; from “John D’Souza”, return [“John”, “D”, “Souza”]; from “42 West Street”, return [“42”, “West”, “Street”].
  • Try the examples listed below the function.
  • Note that this is a LIBERAL algorithm, i.e. one prone to anonymise too much (e.g. all instances of “Street” if someone has that as part of their address).
  • NOTE THAT WE USE THE “WORD BOUNDARY” FACILITY WHEN REPLACING, AND THAT TREATS APOSTROPHES AND HYPHENS AS WORD BOUNDARIES. Therefore, we don’t need the largest-level chunks, like D’Souza.
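
The splitting behaviour described above can be approximated with a short sketch. It assumes that every run of word characters is a fragment and that everything else (spaces, commas, apostrophes, hyphens) is a delimiter. The name fragments_sketch is hypothetical, and the real function may apply further rules that are not documented here.

```python
import re
from typing import List


def fragments_sketch(s: str) -> List[str]:
    """Return the runs of word characters in s, dropping all delimiters."""
    return re.findall(r"\w+", s)


print(fragments_sketch("John Smith"))      # ['John', 'Smith']
print(fragments_sketch("John D’Souza"))    # ['John', 'D', 'Souza']
print(fragments_sketch("42 West Street"))  # ['42', 'West', 'Street']
```

Because the typographic apostrophe is not a word character, “D’Souza” splits into “D” and “Souza”, which matches the word-boundary reasoning in the last bullet.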
crate_anon.anonymise.anonregex.get_code_regex_elements(s: str, liberal: bool = True, very_liberal: bool = True, at_word_boundaries_only: bool = True, at_numeric_boundaries_only: bool = False) → List[str][source]

Takes a STRING representation of a number or an alphanumeric code, which may include leading zeros (as for phone numbers), and produces a list of regex strings for scrubbing.

We allow all sorts of separators. For example, 0123456789 might appear as
  • (01234) 56789
  • 0123 456 789
  • 01234-56789
  • 0123.456.789

This can also be used for postcodes, which should have whitespace prestripped, so e.g. PE123AB might appear as
  • PE123AB
  • PE12 3AB
  • PE 12 3 AB
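
One way to tolerate such separators is to allow an optional run of separator characters between every pair of characters in the code. The sketch below illustrates that idea only. The separator set SEPARATORS and the function code_regex_sketch are assumptions made for this example, and the sketch does not model the liberal and very_liberal options.

```python
import re

# Assumed separator set: whitespace, round brackets, dots and hyphens.
SEPARATORS = r"[\s().-]*"


def code_regex_sketch(code: str) -> str:
    """Build a pattern matching code with optional separators between characters."""
    body = SEPARATORS.join(re.escape(ch) for ch in code)
    return r"\b" + body + r"\b"  # constrain the match to word boundaries


phone = re.compile(code_regex_sketch("0123456789"))
for text in ["(01234) 56789", "0123 456 789", "01234-56789", "0123.456.789"]:
    assert phone.search(text) is not None
assert phone.search("01234567890") is None  # a longer number is left alone

postcode = re.compile(code_regex_sketch("PE123AB"), re.IGNORECASE)
for text in ["PE123AB", "PE12 3AB", "PE 12 3 AB"]:
    assert postcode.search(text) is not None
```

The trailing word boundary is what stops the 10-digit pattern from matching the first ten digits of an 11-digit number.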
crate_anon.anonymise.anonregex.get_date_regex_elements(dt: datetime.date, at_word_boundaries_only: bool = False) → List[str][source]

Takes a datetime.date object and returns a list of regex strings with which to scrub that date.
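
As a rough illustration of what such a list might contain, the sketch below builds patterns for a few common renderings of one date. The set of formats, the separator class SEP and the name date_regex_elements_sketch are assumptions for this example. The formats that CRATE actually generates are not listed in this documentation.

```python
import datetime
import re
from typing import List

# English month names are hard-coded so the sketch does not depend on locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
SEP = r"[\s/.,-]+"  # assumed separators between date components


def date_regex_elements_sketch(dt: datetime.date) -> List[str]:
    """Return regex strings for several common renderings of dt."""
    day = "0?" + str(dt.day)            # optional leading zero
    month_num = "0?" + str(dt.month)
    name = MONTHS[dt.month - 1]
    rest = name[3:]
    # Accept the three-letter abbreviation or the full name, e.g. Sep/September.
    month_name = name[:3] + ("(?:" + rest + ")?" if rest else "")
    year = str(dt.year)
    return [
        day + SEP + month_num + SEP + year,   # 13/09/2014
        month_num + SEP + day + SEP + year,   # 09/13/2014
        year + SEP + month_num + SEP + day,   # 2014-09-13
        day + SEP + month_name + SEP + year,  # 13 Sep 2014
        month_name + SEP + day + SEP + year,  # September 13, 2014
    ]


elements = date_regex_elements_sketch(datetime.date(2014, 9, 13))
regex = re.compile("|".join("(?:" + e + ")" for e in elements), re.IGNORECASE)
for text in ["13 Sep 2014", "13/09/2014", "2014-09-13", "September 13, 2014"]:
    assert regex.search(text) is not None
assert regex.search("14 Sep 2014") is None
```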

crate_anon.anonymise.anonregex.get_number_of_length_n_regex_elements(n: int, liberal: bool = True, very_liberal: bool = False, at_word_boundaries_only: bool = True) → List[str][source]

Get a list of regex strings for scrubbing n-digit numbers – for example, to remove all 10-digit numbers as putative NHS numbers, or all 11-digit numbers as putative UK phone numbers.
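
A minimal sketch of the idea follows. It assumes that the strict form matches exactly n consecutive digits at word boundaries, and that a liberal form also tolerates spaces or hyphens between digits. The function name n_digit_regex_elements_sketch is hypothetical, and the exact meaning of liberal and very_liberal in CRATE may differ.

```python
import re
from typing import List


def n_digit_regex_elements_sketch(n: int, liberal: bool = True) -> List[str]:
    """Return regex strings matching numbers of exactly n digits."""
    elements = [r"\b\d{%d}\b" % n]  # strict: n consecutive digits
    if liberal and n > 1:
        # Assumed liberal form: spaces or hyphens may separate the digits.
        elements.append(r"\b(?:\d[\s-]*){%d}\d\b" % (n - 1))
    return elements


strict = re.compile(n_digit_regex_elements_sketch(10, liberal=False)[0])
assert strict.search("NHS number 1234567890.") is not None
assert strict.search("12345678901") is None    # 11 digits: not matched
assert strict.search("123 456 7890") is None   # spaced: needs the liberal form

liberal = re.compile("|".join(n_digit_regex_elements_sketch(10)))
assert liberal.search("123 456 7890") is not None
```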

crate_anon.anonymise.anonregex.get_phrase_regex_elements(phrase: str, at_word_boundaries_only: bool = True, max_errors: int = 0) → List[str][source]

phrase: e.g. ‘4 Privet Drive’
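
For a multi-word phrase, a plausible approach is to escape each word and allow any run of whitespace between words. The sketch below shows this under that assumption. The name phrase_regex_elements_sketch is hypothetical, and the max_errors option is not modelled because approximate matching is not available in the standard re module.

```python
import re
from typing import List


def phrase_regex_elements_sketch(phrase: str,
                                 at_word_boundaries_only: bool = True) -> List[str]:
    """Return a one-element list matching phrase with flexible whitespace."""
    words = [re.escape(w) for w in phrase.split()]
    if not words:
        return []  # nothing to scrub for an empty or blank phrase
    body = r"\s+".join(words)
    if at_word_boundaries_only:
        body = r"\b" + body + r"\b"
    return [body]


regex = re.compile(phrase_regex_elements_sketch("4 Privet Drive")[0], re.IGNORECASE)
assert regex.search("lives at 4  privet\nDrive, Surrey") is not None
assert regex.search("14 Privet Drive") is None    # "4" is inside "14"
assert regex.search("4 Privet Driveway") is None  # "Drive" is inside "Driveway"
```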

crate_anon.anonymise.anonregex.get_regex_from_elements(elementlist: List[str]) → Union[Pattern[~AnyStr], NoneType][source]

Convert a list of regex elements into a compiled regex, which will operate in case-insensitive fashion on Unicode strings.

crate_anon.anonymise.anonregex.get_regex_string_from_elements(elementlist: List[str]) → str[source]

Convert a list of regex elements into a single regex string.
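
Combining elements into one pattern is typically done by alternation. The sketch below shows one way to do this and to compile the result case-insensitively. It assumes that an empty list yields None, which is consistent with the optional return type of get_regex_from_elements but is not confirmed by this documentation. Both function names are hypothetical.

```python
import re
from typing import List, Optional


def regex_string_from_elements_sketch(elementlist: List[str]) -> str:
    """Join elements with "|", wrapping each in a non-capturing group."""
    return "|".join("(?:" + e + ")" for e in elementlist)


def regex_from_elements_sketch(elementlist: List[str]) -> Optional[re.Pattern]:
    """Compile the combined pattern, or return None if there are no elements."""
    if not elementlist:
        return None  # assumed behaviour for an empty list
    return re.compile(regex_string_from_elements_sketch(elementlist),
                      re.IGNORECASE | re.UNICODE)


regex = regex_from_elements_sketch([r"\bJohn\b", r"\bSmith\b"])
print(regex.sub("XXX", "john SMITH met Smithson"))  # XXX XXX met Smithson
```

Wrapping each element in a non-capturing group keeps an element that itself contains "|" from interfering with its neighbours.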

crate_anon.anonymise.anonregex.get_string_regex_elements(s: str, suffixes: List[str] = None, at_word_boundaries_only: bool = True, max_errors: int = 0) → List[str][source]

Takes a string and returns a list of regex strings with which to scrub. Options:
  • list of suffixes to permit, typically [“s”]
  • typographical errors
  • whether to constrain to word boundaries or not
    … if false: will scrub ANN from bANNed
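
The effect of the suffix and word-boundary options can be seen in a small sketch. The function string_regex_elements_sketch is a hypothetical stand-in written for this example. It does not model max_errors and is not CRATE's implementation.

```python
import re
from typing import List, Optional


def string_regex_elements_sketch(s: str,
                                 suffixes: Optional[List[str]] = None,
                                 at_word_boundaries_only: bool = True) -> List[str]:
    """Return a one-element list matching s, optionally followed by a suffix."""
    body = re.escape(s)
    if suffixes:
        body += "(?:" + "|".join(re.escape(x) for x in suffixes) + ")?"
    if at_word_boundaries_only:
        body = r"\b" + body + r"\b"
    return [body]


bounded = re.compile(string_regex_elements_sketch("Ann", suffixes=["s"])[0],
                     re.IGNORECASE)
print(bounded.sub("XXX", "Ann was banned; Anns too"))  # XXX was banned; XXX too

unbounded = re.compile(
    string_regex_elements_sketch("Ann", at_word_boundaries_only=False)[0],
    re.IGNORECASE)
print(unbounded.sub("XXX", "banned"))  # bXXXed
```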
crate_anon.anonymise.anonregex.get_uk_postcode_regex_elements(at_word_boundaries_only: bool = True) → List[str][source]

Get a list of regex strings for scrubbing UK postcodes.
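
For orientation, the sketch below uses a deliberately simplified pattern for the general shape of a UK postcode: one or two letters, a digit, an optional letter or digit, optional whitespace, then a digit and two letters. The pattern UK_POSTCODE_SKETCH is an assumption for illustration. It is looser than the official postcode rules and is not the pattern that CRATE generates.

```python
import re

# Simplified outward-code + inward-code shape; an assumption for illustration.
UK_POSTCODE_SKETCH = r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"

regex = re.compile(UK_POSTCODE_SKETCH, re.IGNORECASE)
print(regex.sub("[postcode]", "Moved from PE12 3AB to SW1A 1AA."))
# Moved from [postcode] to [postcode].
```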