15.4.11. crate_anon.nlp_manager.nlp_manager


Copyright (C) 2015-2018 Rudolf Cardinal (rudolf@pobox.com).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <http://www.gnu.org/licenses/>.


Manage natural-language processing (NLP) via external tools.

Speed testing:

  • 8 processes, extracting person, location from a mostly text database
  • commit off during full (non-incremental) processing (much faster)
  • needs lots of RAM; e.g. Java subprocess uses 1.4 Gb per process as an average (rises from ~250Mb to ~1.4Gb and falls; steady rise means memory leak!); tested on a 16 Gb machine. See also the max_external_prog_uses parameter.

    from __future__ import division
    test_size_mb = 1887
    n_person_tags_found =
    n_locations_tags_found =
    time_s = 10333  # 10333 s for main bit; 10465 including indexing; is 2.9 hours
    speed_mb_per_s = test_size_mb / time_s

… 0.18 Mb/s … and note that’s 1.9 Gb of text, not of attachments

  • With incremental option, and nothing to do:

    same run took 18 s

  • During the main run, snapshot CPU usage:

      java about 81% across all processes, everything else close to 0
        (using about 12 Gb RAM total)
      … or 75-85% * 8 [from top]
      mysqld about 18% [from top]
      nlp_manager.py about 4-5% * 8 [from top]

TO DO:
  • comments for NLP output fields (in table definition, destfields)

crate_anon.nlp_manager.nlp_manager.delete_where_no_source(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, ifconfig: crate_anon.nlp_manager.input_field_config.InputFieldConfig, report_every: int = 100000, chunksize: int = 100000) → None

Delete destination records where source records no longer exist.

  • Can’t do this in a single SQL command, since the engine can’t necessarily see both databases.
  • Can’t use a single temporary table, since the progress database isn’t necessarily the same as any of the destination database(s).
  • Can’t do this in a multiprocess way, because we’re trying to do a DELETE WHERE NOT IN.
  • So we fetch all source PKs (which, by definition, do exist), keep them in memory, and do a DELETE WHERE NOT IN based on those specified values (or, if there are no PKs in the source, delete everything from the destination).
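
The approach in the list above can be sketched as follows. This is a minimal illustration, not CRATE's implementation: it uses two in-memory SQLite databases to stand in for a source and a destination that cannot see each other, and the function name delete_orphans_in_memory, the table names notes and nlp_output, and the column names pk and src_pk are all hypothetical.

```python
import sqlite3


def delete_orphans_in_memory(src, dest):
    """Delete destination rows whose source PK no longer exists (sketch)."""
    # Fetch every source PK and keep the whole list in memory.
    source_pks = [row[0] for row in src.execute("SELECT pk FROM notes")]
    if not source_pks:
        # No PKs in the source: delete everything from the destination.
        dest.execute("DELETE FROM nlp_output")
    else:
        # DELETE WHERE NOT IN, with one bound parameter per source PK.
        placeholders = ", ".join("?" * len(source_pks))
        dest.execute(
            "DELETE FROM nlp_output WHERE src_pk NOT IN (%s)" % placeholders,
            source_pks,
        )
    dest.commit()


# Two separate databases, so no single SQL statement can see both.
src = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
src.execute("CREATE TABLE notes (pk INTEGER PRIMARY KEY, note TEXT)")
dest.execute("CREATE TABLE nlp_output (src_pk INTEGER, value TEXT)")
src.executemany("INSERT INTO notes VALUES (?, ?)", [(1, "a"), (3, "c")])
dest.executemany(
    "INSERT INTO nlp_output VALUES (?, ?)", [(1, "x"), (2, "y"), (3, "z")]
)

delete_orphans_in_memory(src, dest)
remaining = [
    row[0]
    for row in dest.execute("SELECT src_pk FROM nlp_output ORDER BY src_pk")
]
```

Because every source PK becomes a bound parameter in a single statement, a sketch like this only works while the number of PKs stays under the database driver's parameter limit.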

Problems:

  • This is IMPERFECT if we have string source PKs and there are hash collisions (e.g. PKs for records X and Y both hash to the same thing; record X is deleted; then its processed version might not be).
  • With massive tables, we might run out of memory or (much more likely) SQL parameter slots. This is now happening; the error looks like: pyodbc.ProgrammingError: ('The SQL contains 30807 parameter markers, but 2717783 parameters were supplied', 'HY000')

A better way might be:

  • for each table, make a temporary table in the same database
  • populate that table with (source PK integer/hash, source PK string) pairs
  • delete where pairs don’t match (is that portable SQL?)
  • More efficient would be to make one table per destination database.

On the “delete where multiple fields don’t match”:

  • Single field syntax is

        DELETE FROM a WHERE a1 NOT IN (SELECT b1 FROM b)
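
For the multiple-field case, one option is a correlated NOT EXISTS against a temporary table of (hash, string) pairs held in the destination database. The sketch below tries this in SQLite only; whether the same statement is portable to other engines is not established here. The table names nlp_output and live_pks and the column names are hypothetical.

```python
import sqlite3

dest = sqlite3.connect(":memory:")
dest.execute(
    "CREATE TABLE nlp_output (src_pkhash INTEGER, src_pkstr TEXT, value TEXT)"
)
dest.executemany(
    "INSERT INTO nlp_output VALUES (?, ?, ?)",
    [
        (11, "X", "source still exists"),
        (11, "Y", "orphan: same hash as X, different string"),
        (22, "Z", "orphan"),
    ],
)

# (hash, string) pairs for the source records that still exist.
live_source_pks = [(11, "X")]

# The temporary table lives in the destination database, so the DELETE
# needs no long parameter list; the pairs are inserted row by row.
dest.execute("CREATE TEMPORARY TABLE live_pks (pkhash INTEGER, pkstr TEXT)")
dest.executemany("INSERT INTO live_pks VALUES (?, ?)", live_source_pks)

# Delete every destination row with no matching (hash, string) pair.
dest.execute(
    """
    DELETE FROM nlp_output
    WHERE NOT EXISTS (
        SELECT 1 FROM live_pks
        WHERE live_pks.pkhash = nlp_output.src_pkhash
          AND live_pks.pkstr = nlp_output.src_pkstr
    )
    """
)
dest.commit()
remaining = dest.execute(
    "SELECT src_pkhash, src_pkstr FROM nlp_output"
).fetchall()
```

Matching on both columns means that a hash collision between two string PKs no longer protects the orphaned row, since the string must match as well.
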

crate_anon.nlp_manager.nlp_manager.drop_remake(progargs, nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, incremental: bool = False, skipdelete: bool = False) → None

Drop output tables and recreate them.

crate_anon.nlp_manager.nlp_manager.main() → None

Command-line entry point.

crate_anon.nlp_manager.nlp_manager.process_nlp(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, incremental: bool = False, report_every: int = 500, tasknum: int = 0, ntasks: int = 1) → None

Main NLP processing function. Fetch text, send it to the GATE app (storing the results), and make a note in the progress database.
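
The tasknum and ntasks parameters suggest that the work can be divided among parallel processes. This page does not say how the division is made; the sketch below assumes integer PKs split by modulo, which is one common scheme and not necessarily the one used here. The helper pks_for_task is hypothetical.

```python
def pks_for_task(all_pks, tasknum=0, ntasks=1):
    """Return the integer PKs that one task would handle (assumed scheme)."""
    if not 0 <= tasknum < ntasks:
        raise ValueError("tasknum must satisfy 0 <= tasknum < ntasks")
    # Assumption: a record belongs to the task whose number equals
    # the PK modulo the number of tasks.
    return [pk for pk in all_pks if pk % ntasks == tasknum]


all_pks = list(range(1, 21))
ntasks = 8  # e.g. one task per process
shares = [pks_for_task(all_pks, tasknum, ntasks) for tasknum in range(ntasks)]
```

Under this assumption every record is handled by exactly one task, so the processes never need to coordinate over which text each one fetches.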

crate_anon.nlp_manager.nlp_manager.show_dest_counts(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition) → None

Show the number of records in all destination tables.

crate_anon.nlp_manager.nlp_manager.show_source_counts(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition) → None

Show the number of records in all source tables.