15.4.22. crate_anon.nlp_manager.regex_parser
Copyright (C) 2015-2018 Rudolf Cardinal (rudolf@pobox.com).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <http://www.gnu.org/licenses/>.
class crate_anon.nlp_manager.regex_parser.NumeratorOutOfDenominatorParser(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, cfgsection: str, variable_name: str, variable_regex_str: str, expected_denominator: int, numerator_text_fieldname: str = 'numerator_text', numerator_fieldname: str = 'numerator', denominator_text_fieldname: str = 'denominator_text', denominator_fieldname: str = 'denominator', correct_numerator_fieldname: str = None, take_absolute: bool = True, commit: bool = False, debug: bool = False)
Base class for X-out-of-Y numerical results, e.g. for MMSE/ACE. Integer denominator, expected to be positive. Otherwise similar to SimpleNumericalResultParser.
dest_tables_columns() → Dict[str, List[sqlalchemy.sql.schema.Column]]
Returns a dictionary of {tablename: destination_columns}.
parse(text: str, debug: bool = False) → Generator[[Tuple[str, Dict[str, Any]], NoneType], NoneType]
Default parser for NumeratorOutOfDenominatorParser.
test_numerator_denominator_parser(test_expected_list: List[Tuple[str, List[Tuple[float, float]]]], verbose: bool = False) → None
Parameters: test_expected_list: list of tuples of (a) test string and (b) list of expected numerical (numerator, denominator) results, which can be an empty list.
Returns: none; will assert on failure.
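The shape of test_expected_list can be illustrated without CRATE itself. The following is a minimal standalone sketch, not CRATE's implementation: the pattern MMSE_SKETCH and the helper find_out_of are hypothetical names invented here, and the real parser's regexes are more elaborate.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified "X out of Y" pattern for illustration only.
MMSE_SKETCH = re.compile(
    r"\bMMSE\b\D{0,20}?(\d+(?:\.\d+)?)\s*(?:/|out\s+of)\s*(\d+)",
    re.IGNORECASE,
)


def find_out_of(text: str) -> List[Tuple[float, float]]:
    """Return every (numerator, denominator) pair found in the text."""
    return [(float(n), float(d)) for n, d in MMSE_SKETCH.findall(text)]


# Same structure as test_expected_list: (test string, expected results).
test_expected_list = [
    ("MMSE 25/30", [(25.0, 30.0)]),
    ("MMSE was 28 out of 30", [(28.0, 30.0)]),
    ("MMSE not done", []),  # an empty list means "expect no result"
]

for test_string, expected in test_expected_list:
    # Assert on failure, as the documented test method does.
    assert find_out_of(test_string) == expected, test_string
```

An empty expected list is as informative as a populated one: it checks that the pattern does not report a value where none exists.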
class crate_anon.nlp_manager.regex_parser.NumericalResultParser(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, cfgsection: str, variable: str, target_unit: str, regex_str_for_debugging: str, commit: bool = False)
DO NOT USE DIRECTLY. Base class for generic numerical results, where a SINGLE variable is produced.
dest_tables_columns() → Dict[str, List[sqlalchemy.sql.schema.Column]]
Returns a dictionary of {tablename: destination_columns}.
class crate_anon.nlp_manager.regex_parser.SimpleNumericalResultParser(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, cfgsection: str, regex_str: str, variable: str, target_unit: str, units_to_factor: Dict[str, float], take_absolute: bool = False, commit: bool = False, debug: bool = False)
Base class for simple single-format numerical results. Use this when you have a single variable to produce and also a single regex (in a standard format) that can produce it.
class crate_anon.nlp_manager.regex_parser.ValidatorBase(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, cfgsection: str, regex_str_list: List[str], validated_variable: str, commit: bool = False)
DO NOT USE DIRECTLY. Base class for validating regex parser sensitivity. The validator will find fields that refer to the variable, whether or not they meet the other criteria of the actual NLP processors (i.e. whether or not they contain a valid value). More explanation below.
Suppose we’re validating C-reactive protein (CRP). Key concepts:
- source (true state of the world): Pr present, Ab absent
- software decision: Y yes, N no
- signal detection theory (SDT) classification:
    - hit = Pr & Y = true positive
    - miss = Pr & N = false negative
    - false alarm = Ab & Y = false positive
    - correct rejection = Ab & N = true negative
- common SDT metrics:
    - positive predictive value, PPV = P(Pr | Y) = precision (*)
    - negative predictive value, NPV = P(Ab | N)
    - sensitivity = P(Y | Pr) = recall (*) = true positive rate
    - specificity = P(N | Ab) = true negative rate
    (*) common names used in the NLP context.
- other common classifier metric:
    F_beta score = (1 + beta^2) * precision * recall / ((beta^2 * precision) + recall)
    … which measures performance when you value recall beta times as much as precision; e.g. the F1 score when beta = 1. See https://en.wikipedia.org/wiki/F1_score
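The definitions above translate directly into arithmetic on the four SDT counts. This is a sketch only: the function name sdt_metrics and the example counts are made up for illustration and are not part of CRATE.

```python
from typing import Dict


def sdt_metrics(hits: int, misses: int, false_alarms: int,
                correct_rejections: int, beta: float = 1.0) -> Dict[str, float]:
    """Compute the metrics defined above from the four SDT counts.

    Assumes every denominator is non-zero; there is no guard for empty classes.
    """
    precision = hits / (hits + false_alarms)    # PPV = P(Pr | Y)
    recall = hits / (hits + misses)             # sensitivity = P(Y | Pr)
    specificity = correct_rejections / (correct_rejections + false_alarms)  # P(N | Ab)
    npv = correct_rejections / (correct_rejections + misses)                # P(Ab | N)
    f_beta = ((1 + beta ** 2) * precision * recall
              / ((beta ** 2 * precision) + recall))
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "npv": npv,
        "f_beta": f_beta,
    }


# Made-up counts, purely for illustration.
metrics = sdt_metrics(hits=80, misses=20, false_alarms=20, correct_rejections=880)
```

With beta = 1 and equal precision and recall, the F score equals both of them, which is a quick sanity check on the formula.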
Working from source to NLP, we can see there are a few types of “absent”:
- unselected database field containing text
    - field contains “CRP”, “C-reactive protein”, etc.; something that a human (or as a proxy: a machine) would judge as containing a textual reference to CRP.
        - Pr. Present: a human would judge that a CRP value is present, e.g. “today her CRP is 7, which I am not concerned about.”
            - H. Hit: software reports the value.
            - M. Miss: software misses the value. (maybe: “his CRP was twenty-one”.)
        - Ab1. Absent: reference to CRP, but no numerical information, e.g. “her CRP was normal”.
            - FA1. False alarm: software reports a numerical value. (maybe: “my CRP was 7 hours behind my boss’s deadline”)
            - CR1. Correct rejection: software doesn’t report a value.
    - Ab2. field contains no reference to CRP at all.
        - FA2. False alarm: software reports a numerical value. (a bit hard to think of examples…)
        - CR2. Correct rejection: software doesn’t report a value.
From NLP backwards to source:
- Software says value present.
    - Hit: value is present.
    - FA. False alarm: value is absent.
- Software says value absent.
    - CR. Correct rejection: value is absent.
    - Miss: value is present.
The key metrics are:
- precision = positive predictive value = P(Pr | Y)
    … relatively easy to check; find all the “Y” records and check manually that they’re correct.
- sensitivity = recall = P(Y | Pr)
    … Here, we want a sample that is enriched for “value actually present”, to keep the human reviewer’s workload manageable. For example, if 0.1% of text entries refer to CRP, then to assess 100 “Pr” samples we would have to review 100,000 text records, 99,900 of which are completely irrelevant. So we want an automated way of finding “Pr” records. That’s what the validator classes do.
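The review burden quoted above follows from dividing the number of wanted “Pr” samples by the prevalence. A small worked sketch (the variable names are illustrative, not from CRATE):

```python
# Figures taken from the example in the text.
prevalence = 0.001        # 0.1% of text entries refer to CRP
wanted_pr_samples = 100   # number of "Pr" records we want to assess

# Expected number of unselected records needed to yield that many "Pr" records.
records_to_review = round(wanted_pr_samples / prevalence)
irrelevant_records = records_to_review - wanted_pr_samples

print(records_to_review, irrelevant_records)
```

The lower the prevalence, the faster this number grows, which is why pre-selecting candidate records automatically matters.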
You can enrich for “Pr” records with SQL, e.g.
    SELECT textfield FROM sometable WHERE (
        textfield LIKE '%CRP%' OR textfield LIKE '%C-reactive protein%');
or similar, but really we want the best “CRP detector” possible. The best option is probably a regex, either in SQL (… “WHERE textfield REGEX 'myregex'”) or using these validator classes. (The main NLP regexes don’t distinguish between “CRP present, no valid value” and “CRP absent”, because regexes either match or don’t.)
Each validator class implements the core variable-finding part of its corresponding NLP regex class, but without the value or units. For example, the CRP class looks for things like “CRP is 6” or “CRP 20 mg/L”, whereas the CRP validator looks for things like “CRP”.
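The difference between a value-finding regex and its validator can be sketched as follows. This is an assumption-laden illustration, not CRATE's actual CRP regexes: CRP_VARIABLE, crp_validator, crp_value and classify are hypothetical names, and the patterns are deliberately simplistic.

```python
import re
from typing import Tuple

# Hypothetical "variable-finding" fragment shared by both patterns.
CRP_VARIABLE = r"\b(?:CRP|C-reactive\s+protein)\b"

# Validator sketch: the variable alone, with no value or units.
crp_validator = re.compile(CRP_VARIABLE, re.IGNORECASE)

# Value-parser sketch: the variable, an optional linking word, then a number.
crp_value = re.compile(
    CRP_VARIABLE + r"\s*(?:is|was|of|=|:)?\s*(\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def classify(text: str) -> Tuple[bool, bool]:
    """Return (mentions CRP at all, numerical CRP value found)."""
    mentions = crp_validator.search(text) is not None
    has_value = crp_value.search(text) is not None
    return mentions, has_value


for sample in ["CRP is 6", "CRP 20 mg/L", "her CRP was normal",
               "his CRP was twenty-one", "no bloods taken"]:
    print(sample, classify(sample))
```

Records where the first flag is true and the second is false are the ones worth reviewing by hand: they contain either a genuine absence (“her CRP was normal”) or a miss (“his CRP was twenty-one”).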
dest_tables_columns() → Dict[str, List[sqlalchemy.sql.schema.Column]]
Returns a dictionary of {tablename: destination_columns}.
parse(text: str) → Generator[[Tuple[str, Dict[str, Any]], NoneType], NoneType]
Parser for ValidatorBase.
test_validator(test_expected_list: List[Tuple[str, bool]], verbose: bool = False) → None
The bool in each test_expected_list tuple states whether the test string should produce any match. Note that “match anywhere” is the “search” function, whereas “match” matches only at the beginning of the string: https://docs.python.org/3/library/re.html#re.regex.match
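The distinction between the two functions can be seen with Python's standard re module. The pattern and strings below are illustrative only.

```python
import re

pattern = re.compile(r"CRP")
text = "today her CRP is 7"

# match() only succeeds if the pattern matches at the start of the string.
match_at_start = pattern.match(text)

# search() scans the whole string for the first position that matches.
match_anywhere = pattern.search(text)

print(match_at_start is None)        # no match: text does not begin with "CRP"
print(match_anywhere is not None)    # found: "CRP" occurs later in the text
print(pattern.match("CRP is 7") is not None)  # begins with "CRP", so match() works
```

A validator that asks “does this field refer to the variable anywhere?” therefore needs search semantics, not match semantics.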