15.1.2. crate_anon.anonymise.anonregex
Copyright (C) 2015-2018 Rudolf Cardinal (rudolf@pobox.com).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <http://www.gnu.org/licenses/>.
-
crate_anon.anonymise.anonregex.escape_literal_string_for_regex(s: str) → str
Escape any regex characters.
- Start with \ -> \\
- … this should be the first replacement in REGEX_METACHARS, so that the backslashes added by later replacements are not themselves escaped again.
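The ordering rule above can be illustrated with a minimal sketch. The list `SKETCH_METACHARS` and the function `sketch_escape_literal` are hypothetical names invented for this example; the real contents of REGEX_METACHARS are not shown on this page, so the list below is an assumption.

```python
import re

# Illustrative metacharacter list (an assumption, not CRATE's REGEX_METACHARS).
# The backslash comes first, so that backslashes added by later replacements
# are not escaped a second time.
SKETCH_METACHARS = ["\\", "^", "$", ".", "|", "?", "*", "+",
                    "(", ")", "[", "]", "{", "}"]


def sketch_escape_literal(s: str) -> str:
    """Prefix every listed regex metacharacter in s with a backslash."""
    for c in SKETCH_METACHARS:
        s = s.replace(c, "\\" + c)
    return s


# The escaped string, used as a pattern, matches the original text literally.
literal = "1+1=2 (ok)"
assert re.fullmatch(sketch_escape_literal(literal), literal) is not None
```

If the backslash were handled last instead, the backslash inserted before "." would itself be doubled, and the pattern would no longer match a literal dot.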
-
crate_anon.anonymise.anonregex.get_anon_fragments_from_string(s: str) → List[str]
Takes a complex string, such as a name or address with its components separated by spaces, commas, etc., and returns a list of substrings to be used for anonymisation.
- For example, from “John Smith”, return [“John”, “Smith”]; from “John D’Souza”, return [“John”, “D”, “Souza”]; from “42 West Street”, return [“42”, “West”, “Street”].
- Try the examples listed below the function.
- Note that this is a LIBERAL algorithm, i.e. one prone to anonymise too much (e.g. all instances of “Street” if someone has that as part of their address).
- Note that we use the “word boundary” facility when replacing, and that treats apostrophes and hyphens as word boundaries. Therefore, we don’t need the largest-level chunks, like D’Souza.
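The fragment examples above can be reproduced with a short sketch. This is not the function's actual implementation; `sketch_fragments` is a hypothetical name, and splitting on runs of non-word characters is an assumption chosen because it yields the documented outputs.

```python
import re
from typing import List


def sketch_fragments(s: str) -> List[str]:
    """Split on runs of non-word characters and drop empty pieces."""
    return [frag for frag in re.split(r"\W+", s) if frag]


print(sketch_fragments("John Smith"))
print(sketch_fragments("John D’Souza"))
print(sketch_fragments("42 West Street"))
```

Because the apostrophe is treated as a separator, “D’Souza” yields the small fragments “D” and “Souza” only, which matches the note that the largest-level chunk is not needed.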
-
crate_anon.anonymise.anonregex.get_code_regex_elements(s: str, liberal: bool = True, very_liberal: bool = True, at_word_boundaries_only: bool = True, at_numeric_boundaries_only: bool = False) → List[str]
Takes a STRING representation of a number or an alphanumeric code, which may include leading zeros (as for phone numbers), and produces a list of regex strings for scrubbing.
We allow all sorts of separators. For example, 0123456789 might appear as
- (01234) 56789
- 0123 456 789
- 01234-56789
- 0123.456.789
This can also be used for postcodes, which should have whitespace prestripped, so e.g. PE123AB might appear as
- PE123AB
- PE12 3AB
- PE 12 3 AB
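One way to tolerate the separators listed above is to allow an optional run of non-alphanumeric characters between every pair of adjacent characters in the code. The sketch below takes that approach; `sketch_code_regex` is a hypothetical helper, and the exact separator set the library accepts under `liberal` and `very_liberal` is not documented here, so the choice of `[\W_]*` is an assumption.

```python
import re


def sketch_code_regex(code: str, at_word_boundaries_only: bool = True) -> str:
    """Build one regex string that matches code with arbitrary separators."""
    sep = r"[\W_]*"  # assumed separator: any run of non-alphanumerics
    body = sep.join(re.escape(ch) for ch in code)
    return rf"\b{body}\b" if at_word_boundaries_only else body


phone = re.compile(sketch_code_regex("0123456789"))
for text in ["(01234) 56789", "0123 456 789", "01234-56789", "0123.456.789"]:
    assert phone.search(text)

postcode = re.compile(sketch_code_regex("PE123AB"), re.IGNORECASE)
for text in ["PE123AB", "PE12 3AB", "PE 12 3 AB"]:
    assert postcode.search(text)
```

With the word-boundary wrapper, the code is not matched when it is embedded in a longer run of digits or letters, such as 10123456789.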
-
crate_anon.anonymise.anonregex.get_date_regex_elements(dt: datetime.date, at_word_boundaries_only: bool = False) → List[str]
Takes a datetime.date object and returns a list of regex strings with which to scrub.
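The page does not list which date formats are covered, so the following is only a sketch of the general idea: generate one regex element per plausible ordering of day, month and year. The name `sketch_date_regex_elements`, the separator set, the English month names and the three orderings are all assumptions made for this example, and two-digit years are deliberately left out.

```python
import datetime
import re
from typing import List

# Hard-coded English month names (assumption; avoids locale-dependent lookups).
MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]


def sketch_date_regex_elements(dt: datetime.date,
                               at_word_boundaries_only: bool = False) -> List[str]:
    """Return regex strings for day-month-year, month-day-year and ISO order."""
    sep = r"[\s/.,\-]*"                      # assumed separators
    day = rf"0*{dt.day}"                     # optional leading zeros
    name = MONTH_NAMES[dt.month - 1]
    month = rf"(?:0*{dt.month}|{name}|{name[:3]})"
    year = str(dt.year)
    elements = [
        day + sep + month + sep + year,      # 13/09/2018, 13 Sep 2018
        month + sep + day + sep + year,      # Sep 13, 2018
        year + sep + month + sep + day,      # 2018-09-13
    ]
    if at_word_boundaries_only:
        elements = [rf"\b{e}\b" for e in elements]
    return elements


pattern = re.compile("|".join(sketch_date_regex_elements(datetime.date(2018, 9, 13))),
                     re.IGNORECASE)
assert pattern.search("Seen on 13 September 2018 in clinic")
```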
-
crate_anon.anonymise.anonregex.get_number_of_length_n_regex_elements(n: int, liberal: bool = True, very_liberal: bool = False, at_word_boundaries_only: bool = True) → List[str]
Get a list of regex strings for scrubbing n-digit numbers – for example, to remove all 10-digit numbers as putative NHS numbers, or all 11-digit numbers as putative UK phone numbers.
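As a rough illustration of scrubbing every n-digit number, the sketch below builds a single regex element. `sketch_n_digit_regex_elements` and its `allow_separators` flag are hypothetical; the flag stands in for the library's `liberal` and `very_liberal` options, whose precise behaviour is not described on this page.

```python
import re
from typing import List


def sketch_n_digit_regex_elements(n: int,
                                  allow_separators: bool = True,
                                  at_word_boundaries_only: bool = True) -> List[str]:
    """Return a one-element list matching any number with exactly n digits."""
    if allow_separators:
        # One digit, then (n - 1) further digits, each optionally preceded
        # by spaces or hyphens (an assumed separator set).
        body = r"\d(?:[\s\-]*\d){" + str(n - 1) + "}"
    else:
        body = r"\d{" + str(n) + "}"
    if at_word_boundaries_only:
        body = rf"\b{body}\b"
    return [body]


nhs_like = re.compile(sketch_n_digit_regex_elements(10)[0])
print(nhs_like.sub("[REDACTED]", "NHS number 123 456 7890 recorded."))
```

The word-boundary wrapper is what stops a 10-digit pattern from matching the first ten digits of an 11-digit number.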
-
crate_anon.anonymise.anonregex.get_phrase_regex_elements(phrase: str, at_word_boundaries_only: bool = True, max_errors: int = 0) → List[str]
phrase: e.g. ‘4 Privet Drive’
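A phrase such as ‘4 Privet Drive’ may appear in free text with varying whitespace between its words. The sketch below shows one way to allow for that; `sketch_phrase_regex` is a hypothetical helper, flexible whitespace is an assumption about what a phrase regex should tolerate, and `max_errors` is not modelled because the standard `re` module has no approximate matching.

```python
import re


def sketch_phrase_regex(phrase: str, at_word_boundaries_only: bool = True) -> str:
    """Match the words of phrase in order, separated by any whitespace."""
    words = [re.escape(w) for w in phrase.split()]
    body = r"\s+".join(words)
    return rf"\b{body}\b" if at_word_boundaries_only else body


address = re.compile(sketch_phrase_regex("4 Privet Drive"), re.IGNORECASE)
assert address.search("Lives at 4  Privet\nDrive, Surrey")
assert not address.search("14 Privet Drive")
```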
-
crate_anon.anonymise.anonregex.get_regex_from_elements(elementlist: List[str]) → Union[Pattern[~AnyStr], NoneType]
Convert a list of regex elements into a compiled regex, which will operate in case-insensitive fashion on Unicode strings.
-
crate_anon.anonymise.anonregex.get_regex_string_from_elements(elementlist: List[str]) → str
Convert a list of regex elements into a single regex string.
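The two conversion functions above can be pictured with the following sketch, which joins elements with `|` and compiles the result case-insensitively. The names `sketch_regex_string` and `sketch_compiled_regex` are hypothetical, wrapping each element in a non-capturing group is a design choice of this example, and returning None for an empty list is an assumption suggested by the optional return type.

```python
import re
from typing import List, Optional


def sketch_regex_string(elements: List[str]) -> str:
    """Join elements as alternatives, each in its own non-capturing group."""
    return "|".join(f"(?:{e})" for e in elements)


def sketch_compiled_regex(elements: List[str]) -> Optional[re.Pattern]:
    """Compile the joined elements, or return None if there are none."""
    if not elements:
        return None  # assumed behaviour for an empty element list
    return re.compile(sketch_regex_string(elements), re.IGNORECASE | re.UNICODE)


scrubber = sketch_compiled_regex([r"\bJohn\b", r"\bSmith\b"])
print(scrubber.sub("XXX", "john SMITH met Johnson"))
```

Grouping each element keeps an alternation inside one element from leaking into its neighbours when the list is joined.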
-
crate_anon.anonymise.anonregex.get_string_regex_elements(s: str, suffixes: List[str] = None, at_word_boundaries_only: bool = True, max_errors: int = 0) → List[str]
Takes a string and returns a list of regex strings with which to scrub. Options:
- suffixes: list of suffixes to permit, typically [“s”]
- max_errors: typographical errors
- at_word_boundaries_only: whether to constrain to word boundaries or not … if false: will scrub ANN from bANNed
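The effect of the suffix and word-boundary options can be sketched as follows. `sketch_string_regex_elements` is a hypothetical stand-in rather than the library's implementation, and it leaves out `max_errors`, since the standard `re` module cannot express approximate matches.

```python
import re
from typing import List, Optional


def sketch_string_regex_elements(s: str,
                                 suffixes: Optional[List[str]] = None,
                                 at_word_boundaries_only: bool = True) -> List[str]:
    """Return a one-element list matching s with an optional permitted suffix."""
    body = re.escape(s)
    if suffixes:
        alternatives = "|".join(re.escape(x) for x in suffixes)
        body += f"(?:{alternatives})?"
    if at_word_boundaries_only:
        body = rf"\b{body}\b"
    return [body]


strict = re.compile(sketch_string_regex_elements("Ann")[0], re.IGNORECASE)
loose = re.compile(
    sketch_string_regex_elements("Ann", at_word_boundaries_only=False)[0],
    re.IGNORECASE,
)
assert strict.search("bANNed") is None      # word boundaries protect "banned"
assert loose.search("bANNed") is not None   # without them, ANN is scrubbed
```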