7.2. NLP config file

This file controls the behaviour of the NLP manager. It defines source and destination databases, and one or more NLP definitions.

You can generate a specimen config file with

crate_nlp --democonfig > test_nlp_config.ini

You should save this, then edit it to your own needs.

Make it point to other necessary things, like your GATE installation if you want to use GATE NLP.

For convenience, you may want the CRATE_NLP_CONFIG environment variable to point to this file. (Otherwise you must specify it each time.)

7.2.1. Detail

The config file describes NLP definitions, such as ‘people_and_places’, ‘drugs_and_doses’, or ‘haematology_white_cell_differential’. You choose the names of these NLP definitions as you define them. You select one when you run the NLP manager (using the --nlpdef argument) [1].

The NLP definition sets out the following things.

  • “Where am I going to find my source text?” You specify this by giving one or more inputfielddefs. Each one of those specifies (via its own config section) a database/table/field combination, such as the ‘Notes’ field of the ‘Progress Notes’ table in the ‘RiO’ database, or the ‘text_content’ field of the ‘Clinical_Documents’ table in the ‘CDL’ database, or some such [2].
    • CRATE will always store minimal source information (database, table, integer PK) with the NLP output. However, for convenience you are also likely to want to copy some other key information over to the output, such as patients’ research identifiers (RIDs). You can specify these via the copyfields option to the input field definitions. For validation purposes, you might even choose to copy the full source text (just for convenience), but it’s unlikely you’d want to do this routinely (because it wastes space).
  • “Which NLP processor will run, and where will it store its output?” This might be an external GATE program specializing in finding drug names, or one of CRATE’s built-in regular expression (regex) parsers, such as for inflammatory markers from blood tests. The choice of NLP processor also determines the fields that will appear in the output; for example, a drug-detecting NLP program might provide fields such as ‘drug’, ‘dose’, ‘units’, and ‘route’, while a white-cell differential processor might provide output such as ‘cell type’, ‘value_in_billion_per_litre’, and so on [3].
    • In fact, GATE applications can simultaneously provide more than one type of output; for example, GATE’s demonstration people-and-places application yields both ‘person’ information (rule, firstname, surname, gender…) and ‘location’ information (rule, loctype…), and it can be computationally more efficient to run them together. Therefore, CRATE supports multiple types of output from ‘single’ NLP processors.
    • Each NLP processor may have its own set of options. For example, the GATE controller requires information about the specific external GATE app to run, and about any necessary environment variables. Others, such as CRATE’s build-in regular expression parsers, are simpler.
    • You might want all of your “drugs and doses” information to be stored in a single table (such that drugs found in your Progress_Notes and drugs found in your Clinical_Documents get stored together); this would be common and sensible (and CRATE will keep a record of where the information came from). However, it’s possible that you might want to segregate them (e.g. having C-reactive protein information extracted from your Progress_Notes stored in a different table to C-reactive protein information extracted from your High_Sensitivity_CRP_Notes_For_Bobs_Project table).
    • For GATE apps that provide more than one type of output structure, you will need to specify more than one output table.
    • You can batch different NLP processors together. For example, the demo config batches up CRATE’s internal regular expression NLP processors together. This is more efficient, because one record fetched from the source database can then be sent to multiple NLP processors. However, it’s less helpful if you are developing new NLP tools and want to be able to re-run just one NLP tool frequently.

All NLP configuration ‘chunks’ are sections within the NLP config file, which is in standard .INI format. For example, an input field definition is a section; a database definition is a section; an environment variable section for use with external programs is a section; and so on.

To allow incremental updates, CRATE will keep a master progress table, storing a reference to the source information (database, table, PK), a hash of the source information (to work out later on if the source has changed), and a date/time when the NLP was last run, and the name of the NLP definition that was run.

It’s definitely better if your source table has integer PKs, but you might not have a choice in the matter (and be unable to add one to a read-only source database), so CRATE also supports string PKs. In this instance it will create an integer by hashing the string and store that along with the string PK itself. (That integer is not guaranteed to be unique, because of hash collisions [4], but it allows some efficiency to be added.)

7.2.2. Parallel processing

There are two ways to parallelize CRATE NLP.

  1. You can run multiple NLP processors at the same time, by specifying multiple NLP processors in a single NLP definition within your configuration file.

    There can be different source of bottlenecks. One is if database access is limiting. Specifying multiple NLP processors means that text is fetched once (for a given set of input fields) and then run through multiple NLP processors in one go.

    However, GATE apps can take e.g. 1 Gb RAM per process, so be careful if trying to run several of those! CRATE’s regular expression parsers use very little RAM (and can go quite fast: e.g. 2 CPUs processing about 15,000 records through 10 regex parsers in about 166 s, or of the order of 1 kHz).

  2. You can run multiple simultaneous copies of CRATE’s NLP manager.

    This will divide up the work across the copies (by dividing up the records retrieved from the database).

You can use both strategies simultaneously.

7.2.3. Specimen config

Here’s the specimen config as of 2017-02-08:

# Configuration file for CRATE NLP manager (crate_nlp).
# Version 0.18.51 (2018-06-29).
#
# PLEASE SEE THE HELP.

# =============================================================================
# A. Individual NLP definitions
#
#    These map from INPUTS FROM YOUR DATABASE to PROCESSORS and a PROGRESS-
#    TRACKING DATABASE, and give names to those mappings.
# =============================================================================
# - referred to by the nlp_manager.py's command-line arguments
# - You are likely to need to alter these (particularly the bits in capital
#   letters) to refer to your own database(s).

# -----------------------------------------------------------------------------
# GATE people-and-places demo
# -----------------------------------------------------------------------------

[MY_NLPDEF_NAME_LOCATION_NLP]

    # Input is from one or more source databases/tables/fields.
    # This list refers to config sections that define those fields in more
    # detail.

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES

    # Which NLP processors shall we use?
    # Specify these as a list of (processor_type, config_section) pairs.
    # For possible processor types, see "crate_nlp --listprocessors".

processors =
    GATE procdef_gate_name_location

    # To allow incremental updates, information is stored in a progress table.
    # The database name is a cross-reference to another section in this config
    # file. The table name is hard-coded to 'crate_nlp_progress'.

progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
    # ... you should replace this with a hash phrase of your own, but it's not
    # especially secret (it's only used for change detection and users are
    # likely to have access to the source material anyway), and its specific
    # value is unimportant.

    # Temporary tablename to use (in progress and destination databases).
    # Default is _crate_nlp_temptable
# temporary_tablename = _crate_nlp_temptable

# -----------------------------------------------------------------------------
# KConnect (Bio-YODIE) GATE app
# -----------------------------------------------------------------------------

[MY_NLPDEF_KCONNECT]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_kconnect
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter

# -----------------------------------------------------------------------------
# Medex-UIMA drug-finding app
# -----------------------------------------------------------------------------

[MY_NLPDEF_MEDEX_DRUGS]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    Medex procdef_medex_drugs
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter

# -----------------------------------------------------------------------------
# CRATE number-finding Python regexes
# -----------------------------------------------------------------------------

[MY_NLPDEF_BIOMARKERS]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES

processors =
    # -------------------------------------------------------------------------
    # Biochemistry
    # -------------------------------------------------------------------------
    CRP procdef_crp
    CRPValidator procdef_validate_crp
    Sodium procdef_sodium
    SodiumValidator procdef_validate_sodium
    TSH procdef_tsh
    TSHValidator procdef_validate_tsh
    # -------------------------------------------------------------------------
    # Clinical
    # -------------------------------------------------------------------------
    Height procdef_height
    HeightValidator procdef_validate_height
    Weight procdef_weight
    WeightValidator procdef_validate_weight
    Bmi procdef_bmi
    BmiValidator procdef_validate_bmi
    Bp procdef_bp
    BpValidator procdef_validate_bp
    # -------------------------------------------------------------------------
    # Cognitive
    # -------------------------------------------------------------------------
    MMSE procdef_mmse
    MMSEValidator procdef_validate_mmse
    ACE procdef_ace
    ACEValidator procdef_validate_ace
    MiniACE procdef_mini_ace
    MiniACEValidator procdef_validate_mini_ace
    MOCA procdef_moca
    MOCAValidator procdef_validate_moca
    # -------------------------------------------------------------------------
    # Haematology
    # -------------------------------------------------------------------------
    ESR procdef_esr
    ESRValidator procdef_validate_esr
    WBC procdef_wbc
    WBCValidator procdef_validate_wbc
    Basophils procdef_basophils
    BasophilsValidator procdef_validate_basophils
    Eosinophils procdef_eosinophils
    EosinophilsValidator procdef_validate_eosinophils
    Lymphocytes procdef_lymphocytes
    LymphocytesValidator procdef_validate_lymphocytes
    Monocytes procdef_monocytes
    MonocytesValidator procdef_validate_monocytes
    Neutrophils procdef_neutrophils
    NeutrophilsValidator procdef_validate_neutrophils

progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter

    # Specify the maximum number of rows to be processed before a COMMIT is
    # issued on the database transaction(s). This prevents the transaction(s)
    # growing too large.
    # Default is 1000.
max_rows_before_commit = 1000

    # Specify the maximum number of source-record bytes (approximately!) that
    # are processed before a COMMIT is issued on the database transaction(s).
    # This prevents the transaction(s) growing too large. The COMMIT will be
    # issued *after* this limit has been met/exceeded, so it may be exceeded if
    # the transaction just before the limit takes the cumulative total over the
    # limit.
    # Default is 83886080.
max_bytes_before_commit = 83886080


# =============================================================================
# B. NLP processor definitions
#
#    These control the behaviour of individual NLP processors.
#    In the case of CRATE's built-in processors, the only configuration needed
#    is the destination database/table, but for some, like GATE applications,
#    you need to define more -- such as how to run the external program, and
#    what sort of table structure should be created to receive the results.
# =============================================================================
# - You're likely to have to modify the destination databases these point to,
#   but otherwise you can probably leave them as they are.

# -----------------------------------------------------------------------------
# Specimen CRATE regular expression processor definitions
# -----------------------------------------------------------------------------

    # Most of these are very simple, and just require a destination database
    # (as a cross-reference to a database section within this file) and a
    # destination table.

    # Biochemistry

[procdef_crp]
destdb = DESTINATION_DATABASE
desttable = crp
[procdef_validate_crp]
destdb = DESTINATION_DATABASE
desttable = validate_crp

[procdef_sodium]
destdb = DESTINATION_DATABASE
desttable = sodium
[procdef_validate_sodium]
destdb = DESTINATION_DATABASE
desttable = validate_sodium

[procdef_tsh]
destdb = DESTINATION_DATABASE
desttable = tsh
[procdef_validate_tsh]
destdb = DESTINATION_DATABASE
desttable = validate_tsh

    # Clinical

[procdef_height]
destdb = DESTINATION_DATABASE
desttable = height
[procdef_validate_height]
destdb = DESTINATION_DATABASE
desttable = validate_height

[procdef_weight]
destdb = DESTINATION_DATABASE
desttable = weight
[procdef_validate_weight]
destdb = DESTINATION_DATABASE
desttable = validate_weight

[procdef_bmi]
destdb = DESTINATION_DATABASE
desttable = bmi
[procdef_validate_bmi]
destdb = DESTINATION_DATABASE
desttable = validate_bmi

[procdef_bp]
destdb = DESTINATION_DATABASE
desttable = bp
[procdef_validate_bp]
destdb = DESTINATION_DATABASE
desttable = validate_bp

    # Cognitive

[procdef_mmse]
destdb = DESTINATION_DATABASE
desttable = mmse
[procdef_validate_mmse]
destdb = DESTINATION_DATABASE
desttable = validate_mmse

[procdef_ace]
destdb = DESTINATION_DATABASE
desttable = ace
[procdef_validate_ace]
destdb = DESTINATION_DATABASE
desttable = validate_ace

[procdef_mini_ace]
destdb = DESTINATION_DATABASE
desttable = mini_ace
[procdef_validate_mini_ace]
destdb = DESTINATION_DATABASE
desttable = validate_mini_ace

[procdef_moca]
destdb = DESTINATION_DATABASE
desttable = moca
[procdef_validate_moca]
destdb = DESTINATION_DATABASE
desttable = validate_moca

    # Haematology

[procdef_esr]
destdb = DESTINATION_DATABASE
desttable = esr
[procdef_validate_esr]
destdb = DESTINATION_DATABASE
desttable = validate_esr

[procdef_wbc]
destdb = DESTINATION_DATABASE
desttable = wbc
[procdef_validate_wbc]
destdb = DESTINATION_DATABASE
desttable = validate_wbc

[procdef_basophils]
destdb = DESTINATION_DATABASE
desttable = basophils
[procdef_validate_basophils]
destdb = DESTINATION_DATABASE
desttable = validate_basophils

[procdef_eosinophils]
destdb = DESTINATION_DATABASE
desttable = eosinophils
[procdef_validate_eosinophils]
destdb = DESTINATION_DATABASE
desttable = validate_eosinophils

[procdef_lymphocytes]
destdb = DESTINATION_DATABASE
desttable = lymphocytes
[procdef_validate_lymphocytes]
destdb = DESTINATION_DATABASE
desttable = validate_lymphocytes

[procdef_monocytes]
destdb = DESTINATION_DATABASE
desttable = monocytes
[procdef_validate_monocytes]
destdb = DESTINATION_DATABASE
desttable = validate_monocytes

[procdef_neutrophils]
destdb = DESTINATION_DATABASE
desttable = neutrophils
[procdef_validate_neutrophils]
destdb = DESTINATION_DATABASE
desttable = validate_neutrophils

# -----------------------------------------------------------------------------
# Specimen GATE demo people/places processor definition
# -----------------------------------------------------------------------------

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[procdef_gate_name_location]

    # Which database will this processor write to?

destdb = DESTINATION_DATABASE

    # Map GATE '_type' parameters to possible destination tables (in
    # case-insensitive fashion). What follows is a list of pairs: the first
    # item is the annotation type coming out of the GATE system, and the second
    # is the output type section defined in this file (as a separate section).
    # Those sections (q.v.) define tables and columns (fields).

outputtypemap =
    Person output_person
    Location output_location

    # GATE NLP is done by an external program.
    # SEE THE MANUAL FOR DETAIL.
    #
    # Here we specify a program and associated arguments, and an optional
    # environment variable section.
    # The example shows how to use Java to launch a specific Java program
    # (CrateGatePipeline), having set a path to find other Java classes, and then to
    # pass arguments to the program itself.
    #
    # NOTE IN PARTICULAR:
    # - Use double quotes to encapsulate any filename that may have spaces
    #   within it (e.g. C:/Program Files/...).
    #   Use a forward slash director separator, even under Windows.
    #   ... ? If that doesn't work, use a double backslash, \.
    # - Under Windows, use a semicolon to separate parts of the Java classpath.
    #   Under Linux, use a colon.
    # - So a Linux Java classpath looks like
    #       /some/path:/some/other/path:/third/path
    #   and a Windows one looks like
    #       C:/some/path;C:/some/other/path;C:/third/path
    # - To make this simpler, we can define the environment variable OS_PATHSEP
    #   (by analogy to Python's os.pathsep), as below.
    #
    # You can use substitutable parameters:
    #
    #   {X}
    #       Substitutes variable X from the environment you specify (see
    #       below).
    #   {NLPLOGTAG}
    #       Additional environment variable that indicates the process being
    #       run; used to label the output from CrateGatePipeline.

progargs = java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATEDIR}/bin/gate.jar"{OS_PATHSEP}"{GATEDIR}/lib/*"
    -Dgate.home="{GATEDIR}"
    CrateGatePipeline
    --gate_app "{GATEDIR}/plugins/ANNIE/ANNIE_with_defaults.gapp"
    --annotation Person
    --annotation Location
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --verbose

progenvsection = MY_ENV_SECTION

    # The external program is slow, because NLP is slow. Therefore, we set up
    # the external program and use it repeatedly for a whole bunch of text.
    # Individual pieces of text are sent to it (via its stdin). We finish our
    # piece of text with a delimiter, which should (a) be specified in the -it
    # parameter above, and (b) be set below, TO THE SAME VALUE. The external
    # program should return a TSV-delimited set of field/value pairs, like
    # this:
    #
    #       field1\tvalue1\tfield2\tvalue2...
    #       field1\tvalue3\tfield2\tvalue4...
    #       ...
    #       TERMINATOR
    #
    # ... where TERMINATOR is something that you (a) specify with the -ot
    # parameter above, and (b) set below, TO THE SAME VALUE.

input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD

    # If the external program leaks memory, you may wish to cap the number of
    # uses before it's restarted. Specify the max_external_prog_uses option if
    # so. Specify 0 or omit the option entirely to ignore this.

# max_external_prog_uses = 1000

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the output tables used by this GATE processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # (This is an additional thing we need for GATE applications, since CRATE
    # doesn't automatically know what sort of output they will produce.)

[output_person]

    # The tables and SPECIFIC output fields for a given GATE processor are
    # defined here.

desttable = person

renames =  # one pair per line; can quote, using shlex rules; case-sensitive
    firstName   firstname

destfields =
    rule        VARCHAR(100)
    firstname   VARCHAR(100)
    surname     VARCHAR(100)
    gender      VARCHAR(7)
    kind        VARCHAR(100)

    # ... longest gender: "unknown" (7)

indexdefs =
    firstname   64
    surname     64

    # ... a set of (indexed field, index length) pairs; length can be "None"

[output_location]

desttable = location
renames =
    locType     loctype
destfields =
    rule        VARCHAR(100)
    loctype     VARCHAR(100)
indexdefs =
    rule    100
    loctype 100


# -----------------------------------------------------------------------------
# Specimen Sheffield/KCL KConnect (Bio-YODIE) processor definition
# -----------------------------------------------------------------------------
# https://gate.ac.uk/applications/bio-yodie.html

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[procdef_gate_kconnect]

destdb = DESTINATION_DATABASE
outputtypemap =
    Disease_or_Syndrome output_disease_or_syndrome
progargs = java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATEDIR}/bin/gate.jar"{OS_PATHSEP}"{GATEDIR}/lib/*"
    -Dgate.home="{GATEDIR}"
    CrateGatePipeline
    --gate_app "{KCONNECTDIR}/main-bio/main-bio.xgapp"
    --annotation Disease_or_Syndrome
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --verbose
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the output tables used by this GATE processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[output_disease_or_syndrome]

desttable = kconnect_diseases
renames =
    Experiencer     experiencer
    Negation        negation
    PREF            pref
    STY             sty
    TUI             tui
    Temporality     temporality
    VOCABS          vocabs
destfields =
    # Found by manual inspection of KConnect/Bio-YODIE output from the GATE console:
    experiencer  VARCHAR(100)  # e.g. "Patient"
    negation     VARCHAR(100)  # e.g. "Affirmed"
    pref         VARCHAR(100)  # e.g. "Rheumatic gout"; PREFferred name
    sty          VARCHAR(100)  # e.g. "Disease or Syndrome"; Semantic Type (STY) [semantic type name]
    tui          VARCHAR(4)    # e.g. "T047"; Type Unique Identifier (TUI) [semantic type identifier]; 4 characters; https://www.ncbi.nlm.nih.gov/books/NBK9679/
    temporality  VARCHAR(100)  # e.g. "Recent"
    vocabs       VARCHAR(255)  # e.g. "AIR,MSH,NDFRT,MEDLINEPLUS,NCI,LNC,NCI_FDA,NCI,MTH,AIR,ICD9CM,LNC,SNOMEDCT_US,LCH_NW,HPO,SNOMEDCT_US,ICD9CM,SNOMEDCT_US,COSTAR,CST,DXP,QMR,OMIM,OMIM,AOD,CSP,NCI_NCI-GLOSS,CHV"; list of UMLS vocabularies
    inst         VARCHAR(8)    # e.g. "C0003873"; looks like a Concept Unique Identifier (CUI); 1 letter then 7 digits
    inst_full    VARCHAR(255)  # e.g. "http://linkedlifedata.com/resource/umls/id/C0003873"
    language     VARCHAR(100)  # e.g. ""; ?will look like "ENG" for English? See https://www.nlm.nih.gov/research/umls/implementation_resources/query_diagrams/er1.html
    tui_full     VARCHAR(255)  # e.g. "http://linkedlifedata.com/resource/semanticnetwork/id/T047"
indexdefs =
    pref    100
    sty     100
    tui     4
    inst    8

# -----------------------------------------------------------------------------
# Specimen KCL GATE pharmacotherapy processor definition
# -----------------------------------------------------------------------------
# https://github.com/KHP-Informatics/brc-gate-pharmacotherapy

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[procdef_gate_pharmacotherapy]

destdb = DESTINATION_DATABASE
outputtypemap =
    Prescription output_prescription
progargs = java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATEDIR}/bin/gate.jar"{OS_PATHSEP}"{GATEDIR}/lib/*"
    -Dgate.home="{GATEDIR}"
    CrateGatePipeline
    --gate_app "{GATE_PHARMACOTHERAPY_DIR}/application.xgapp"
    --include_set Output
    --annotation Prescription
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --show_contents_on_crash

#    -v
progenvsection = CPFT_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the output tables used by this GATE processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Note new "renames" option, because the names of the annotations are not
# always valid SQL column names.

[output_prescription]

desttable = medications_gate
renames =  # one pair per line; can quote, using shlex rules; case-sensitive
    drug-type           drug_type
    dose-value          dose_value
    dose-unit           dose_unit
    dose-multiple       dose_multiple
    Directionality      directionality
    Experiencer         experiencer
    "Length of Time"    length_of_time
    Temporality         temporality
    "Unit of Time"      unit_of_time
null_literals =
    # Sometimes GATE provides "null" for a NULL value; we can convert to SQL NULL.
    # Sequence of words; shlex rules.
    null
    ""
destfields =
    # Found by (a) manual inspection of BRC GATE pharmacotherapy output from
    # the GATE console; (b) inspection of
    # application-resources/schemas/Prescription.xml
    # Note preference for DECIMAL over FLOAT/REAL; see
    # https://stackoverflow.com/questions/1056323
    # Note that not all annotations appear for all texts. Try e.g.:
    #   Please start haloperidol 5mg tds.
    #   I suggest you start haloperidol 5mg tds for one week.
    rule            VARCHAR(100)  # not in XML but is present in a subset: e.g. "weanOff"; max length unclear
    drug            VARCHAR(200)  # required string; e.g. "haloperidol"; max length 47 from "wc -L BNF_generic.lst", 134 from BNF_trade.lst
    drug_type       VARCHAR(100)  # required string; from "drug-type"; e.g. "BNF_generic"; ?length of longest drug ".lst" filename
    dose            VARCHAR(100)  # required string; e.g. "5mg"; max length unclear
    dose_value      DECIMAL       # required numeric; from "dose-value"; "double" in the XML but DECIMAL probably better; e.g. 5.0
    dose_unit       VARCHAR(100)  # required string; from "dose-unit"; e.g. "mg"; max length unclear
    dose_multiple   INT           # required integer; from "dose-multiple"; e.g. 1
    route           VARCHAR(7)    # required string; one of: "oral", "im", "iv", "rectal", "sc", "dermal", "unknown"
    status          VARCHAR(10)   # required; one of: "start", "continuing", "stop"
    tense           VARCHAR(7)    # required; one of: "past", "present"
    date            VARCHAR(100)  # optional string; max length unclear
    directionality  VARCHAR(100)  # optional string; max length unclear
    experiencer     VARCHAR(100)  # optional string; e.g. "Patient"
    frequency       DECIMAL       # optional numeric; "double" in the XML but DECIMAL probably better
    interval        DECIMAL       # optional numeric; "double" in the XML but DECIMAL probably better
    length_of_time  VARCHAR(100)  # optional string; from "Length of Time"; max length unclear
    temporality     VARCHAR(100)  # optional string; e.g. "Recent"
    time_unit       VARCHAR(100)  # optional string; from "time-unit"; e.g. "day"; max length unclear
    unit_of_time    VARCHAR(100)  # optional string; from "Unit of Time"; max length unclear
    when            VARCHAR(100)  # optional string; max length unclear
indexdefs =
    rule    100
    drug    200
    route   7
    status  10
    tense   7

# -----------------------------------------------------------------------------
# Specimen KCL Lewy Body Diagnosis Application (LBDA) processor definition
# -----------------------------------------------------------------------------
# https://github.com/KHP-Informatics/brc-gate-LBD

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[procdef_gate_kcl_lbda]

    # "cDiagnosis" is the "confirmed diagnosis" field, as d/w Jyoti Jyoti
    # 2018-03-20; see also README.md. This appears in the "Automatic" and the
    # unnamed set. There is also a near-miss one, "DiagnosisAlmost", which
    # appears in the unnamed set.
    #   "Mr Jones has Lewy body dementia."
    #       -> DiagnosisAlmost
    #   "Mr Jones has a diagnosis of Lewy body dementia."
    #       -> DiagnosisAlmost, cDiagnosis
    # Note that we must use lower case in the outputtypemap.

destdb = DESTINATION_DATABASE
outputtypemap =
    cDiagnosis output_lbd_diagnosis
    DiagnosisAlmost output_lbd_diagnosis
progargs = java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATEDIR}/bin/gate.jar"{OS_PATHSEP}"{GATEDIR}/lib/*"
    -Dgate.home="{GATEDIR}"
    CrateGatePipeline
    --gate_app "{KCL_LBDA_DIR}/application.xgapp"
    --set_annotation "" DiagnosisAlmost
    --set_annotation Automatic cDiagnosis
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --verbose
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the output tables used by this GATE processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[output_lbd_diagnosis]

desttable = lewy_body_dementia_gate
null_literals =
    null
    ""
destfields =
    # Found by
    # (a) manual inspection of output from the GATE Developer console:
    # - e.g. {rule=Includefin, text=Lewy body dementia}
    # (b) inspection of contents:
    # - run a Cygwin shell
    # - find . -type f -exec grep cDiagnosis -l {} \;
    # - 3 hits:
    #       ./application-resources/jape/DiagnosisExclude2.jape
    #           ... part of the "Lewy"-detection apparatus
    #       ./application-resources/jape/text-feature.jape
    #           ... adds "text" annotation to cDiagnosis Token
    #       ./application.xgapp
    #           ... in annotationTypes
    # On that basis:
    rule            VARCHAR(100)  #
    text            VARCHAR(200)  #
indexdefs =
    rule    100
    text    200

# -----------------------------------------------------------------------------
# Specimen MedEx processor definition
# -----------------------------------------------------------------------------
# https://sbmi.uth.edu/ccb/resources/medex.htm

[procdef_medex_drugs]

destdb = DESTINATION_DATABASE
desttable = drugs
progargs = java
    -classpath {NLPPROGDIR}:{MEDEXDIR}/bin:{MEDEXDIR}/lib/*
    -Dfile.encoding=UTF-8
    CrateMedexPipeline
    -lt {NLPLOGTAG}
    -v -v
# ... other arguments are added by the code
progenvsection = MY_ENV_SECTION


# =============================================================================
# C. Environment variable definitions
#
#    We define environment variable groups here, with one group per section.
#    When a section is selected (e.g. by a "progenvsection" command in an NLP
#    processor definition), these variables can be substituted into the
#    "progargs" part of the NLP definition (for when external programs are
#    called) and are available in the operating system environment for those
#    programs themselves.
# =============================================================================
# - The environment will start by inheriting the parent environment, then add
#   variables here.
# - Keys are case-sensitive.
# - You'll need to modify this according to your local configuration.

[MY_ENV_SECTION]

GATEDIR = /home/myuser/somewhere/GATE_Developer_8.0
NLPPROGDIR = /home/myuser/somewhere/crate_anon/nlp_manager/compiled_nlp_classes
MEDEXDIR = /home/myuser/somewhere/Medex_UIMA_1.3.6
KCONNECTDIR = /home/myuser/somewhere/yodie-pipeline-1-2-umls-only
OS_PATHSEP = :


# =============================================================================
# D. Input field definitions
#
#    These define database "inputs" in more detail, including the database,
#    table, and field (column) containing the input, the associated primary key
#    field, and fields that should be copied to the destination to make
#    subsequent work easier (e.g. patient research IDs).
# =============================================================================
# - Referred to within the NLP definition, and cross-referencing database
#   definitions.
# - The 'copyfields' are optional.
# - The 'indexed_copyfields' are an optional subset of 'copyfields'; they'll be
#   indexed.

[INPUT_FIELD_CLINICAL_DOCUMENTS]

srcdb = SOURCE_DATABASE
srctable = EXTRACTED_CLINICAL_DOCUMENTS
srcpkfield = DOCUMENT_PK
srcfield = DOCUMENT_TEXT
copyfields = RID_FIELD
    TRID_FIELD
indexed_copyfields = RID_FIELD
    TRID_FIELD

    # Optional: specify 0 (the default) for no limit, or a number of rows (e.g.
    # 1000) to limit fetching, for debugging purposes.
# debug_row_limit = 0

[INPUT_FIELD_PROGRESS_NOTES]

srcdb = SOURCE_DATABASE
srctable = PROGRESS_NOTES
srcpkfield = PN_PK
srcfield = PN_TEXT
copyfields = RID_FIELD
    TRID_FIELD
indexed_copyfields = RID_FIELD
    TRID_FIELD


# =============================================================================
# E. Database definitions, each in its own section
#
#    These are simply URLs that define how to connect to different databases.
# =============================================================================
# Use SQLAlchemy URLs: http://docs.sqlalchemy.org/en/latest/core/engines.html

[SOURCE_DATABASE]

url = mysql+mysqldb://anontest:XXX@127.0.0.1:3306/anonymous_output?charset=utf8

[DESTINATION_DATABASE]

url = mysql+mysqldb://anontest:XXX@127.0.0.1:3306/anonymous_output?charset=utf8

Footnotes

[1]Internally, the config file section is represented by the NlpDefinition class, which acts as the master config class.
[2]Internally, this information is represented by the InputFieldConfig class.
[3]Internally, this information is represented by classes such as GateExtProgController and NumericalResultParser, which are subclasses of NlpParser.
[4]https://en.wikipedia.org/wiki/Hash_function