15.4.11. crate_anon.nlp_manager.nlp_manager
Copyright (C) 2015-2018 Rudolf Cardinal (rudolf@pobox.com).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <http://www.gnu.org/licenses/>.
Manage natural-language processing (NLP) via external tools.
Speed testing:
- 8 processes, extracting person, location from a mostly text database
- commit off during full (non-incremental) processing (much faster)
- needs lots of RAM; e.g. Java subprocess uses 1.4 Gb per process as an average (rises from ~250Mb to ~1.4Gb and falls; steady rise means memory leak!); tested on a 16 Gb machine. See also the max_external_prog_uses parameter.
    from __future__ import division
    test_size_mb = 1887
    n_person_tags_found =
    n_locations_tags_found =
    time_s = 10333  # 10333 s for main bit; 10465 including indexing; is 2.9 hours
    speed_mb_per_s = test_size_mb / time_s
… 0.18 Mb/s … and note that’s 1.9 Gb of text, not of attachments
- With incremental option, and nothing to do:
same run took 18 s
- During the main run, snapshot CPU usage:
- java about 81% across all processes, everything else close to 0 (using about 12 Gb RAM total)
- … or 75-85% * 8 [from top]
- mysqld about 18% [from top]
- nlp_manager.py about 4-5% * 8 [from top]
- TO DO:
- comments for NLP output fields (in table definition, destfields)
crate_anon.nlp_manager.nlp_manager.delete_where_no_source(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, ifconfig: crate_anon.nlp_manager.input_field_config.InputFieldConfig, report_every: int = 100000, chunksize: int = 100000) → None
Delete destination records where source records no longer exist.
- Can’t do this in a single SQL command, since the engine can’t necessarily see both databases.
- Can’t use a single temporary table, since the progress database isn’t necessarily the same as any of the destination database(s).
- Can’t do this in a multiprocess way, because we’re trying to do a DELETE WHERE NOT IN.
- So we fetch all source PKs (which, by definition, do exist), keep them in memory, and do a DELETE WHERE NOT IN based on those specified values (or, if there are no PKs in the source, delete everything from the destination).
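The approach in the last bullet can be sketched as follows. This is a minimal illustration, not the CRATE implementation: it uses the standard-library sqlite3 module with two separate in-memory databases standing in for the source and destination, and the table and column names (source_text, nlp_output, pk, src_pk) and the function name are hypothetical.

```python
import sqlite3


def delete_where_no_source_sketch(src, dest):
    """Delete rows from nlp_output whose source PK no longer exists."""
    # Fetch every source PK into memory: the destination connection
    # cannot see the source database, so no cross-database SQL is possible.
    src_pks = [row[0] for row in src.execute("SELECT pk FROM source_text")]
    if not src_pks:
        # No source records at all: every destination row is orphaned.
        dest.execute("DELETE FROM nlp_output")
    else:
        # One placeholder per source PK (hypothetical names throughout).
        placeholders = ", ".join("?" for _ in src_pks)
        dest.execute(
            f"DELETE FROM nlp_output WHERE src_pk NOT IN ({placeholders})",
            src_pks,
        )
    dest.commit()


# Two independent databases: neither connection can query the other.
src = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
src.execute("CREATE TABLE source_text (pk INTEGER PRIMARY KEY, note TEXT)")
src.executemany("INSERT INTO source_text VALUES (?, ?)", [(1, "a"), (3, "c")])
dest.execute("CREATE TABLE nlp_output (src_pk INTEGER, value TEXT)")
dest.executemany(
    "INSERT INTO nlp_output VALUES (?, ?)", [(1, "x"), (2, "y"), (3, "z")]
)

delete_where_no_source_sketch(src, dest)
remaining = [
    r[0] for r in dest.execute("SELECT src_pk FROM nlp_output ORDER BY src_pk")
]
print(remaining)  # prints [1, 3]
```

Note that this sketch binds one SQL parameter per source PK, so the number of parameters grows with the size of the source table.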
Problems:
- This is IMPERFECT if we have string source PKs and there are hash collisions (e.g. PKs for records X and Y both hash to the same thing; record X is deleted; then its processed version might not be).
- With massive tables, we might run out of memory or (much more likely) SQL parameter slots. This is now happening; the error looks like: pyodbc.ProgrammingError: ('The SQL contains 30807 parameter markers, but 2717783 parameters were supplied', 'HY000')
A better way might be:
- for each table, make a temporary table in the same database
- populate that table with (source PK integer/hash, source PK string) pairs
- delete where pairs don’t match (is that portable SQL?)
- More efficient would be to make one table per destination database.
On the “delete where multiple fields don’t match”:
- Single field syntax is
    DELETE FROM a WHERE a1 NOT IN (SELECT b1 FROM b)
- Multiple field syntax is
    DELETE FROM a WHERE NOT EXISTS (
        SELECT 1 FROM b WHERE a.a1 = b.b1 AND a.a2 = b.b2
    )
- In SQLAlchemy, exists():
    http://stackoverflow.com/questions/14600619
    http://docs.sqlalchemy.org/en/latest/core/selectable.html
Furthermore, in SQL neither NULL = NULL nor NULL <> NULL is true (both evaluate to NULL), so we have to do an explicit null check. In SQLAlchemy, you do that with “field == None”, which renders as IS NULL. See http://stackoverflow.com/questions/21668606
We’re aiming, therefore, for:
    DELETE FROM a WHERE NOT EXISTS (
        SELECT 1 FROM b WHERE a.a1 = b.b1 AND (
            a.a2 = b.b2 OR (a.a2 IS NULL AND b.b2 IS NULL)
        )
    )
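The NULL-safe statement above can be tried out with a small sketch. It assumes SQLite via the standard-library sqlite3 module (the docstring itself targets other engines through SQLAlchemy), keeps the a/b naming from the SQL, and uses a temporary table b of (source PK integer/hash, source PK string) pairs as suggested earlier; the result column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# a = destination table; (a1, a2) = (source PK integer/hash, source PK string).
conn.execute("CREATE TABLE a (a1 INTEGER, a2 TEXT, result TEXT)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [
        (1, None, "kept: integer PK, string part NULL on both sides"),
        (2, "abc", "kept: both fields match"),
        (2, "xyz", "deleted: a1 matches but a2 does not"),
        (9, None, "deleted: no matching source row"),
    ],
)

# b = temporary table holding the PK pairs that still exist in the source.
conn.execute("CREATE TEMPORARY TABLE b (b1 INTEGER, b2 TEXT)")
conn.executemany("INSERT INTO b VALUES (?, ?)", [(1, None), (2, "abc")])

# The explicit IS NULL branch is what keeps the (1, NULL) row:
# a.a2 = b.b2 alone is never true when both sides are NULL.
conn.execute(
    """
    DELETE FROM a WHERE NOT EXISTS (
        SELECT 1 FROM b WHERE a.a1 = b.b1 AND (
            a.a2 = b.b2 OR (a.a2 IS NULL AND b.b2 IS NULL)
        )
    )
    """
)

survivors = conn.execute("SELECT a1, a2 FROM a ORDER BY a1").fetchall()
print(survivors)  # prints [(1, None), (2, 'abc')]
```

Because the surviving PKs live in a table rather than in bound parameters, the statement itself takes no parameters, whatever the size of the source.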
crate_anon.nlp_manager.nlp_manager.drop_remake(progargs, nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, incremental: bool = False, skipdelete: bool = False) → None
Drop output tables and recreate them.
crate_anon.nlp_manager.nlp_manager.process_nlp(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, incremental: bool = False, report_every: int = 500, tasknum: int = 0, ntasks: int = 1) → None
Main NLP processing function. Fetch text, send it to the GATE app (storing the results), and make a note in the progress database.
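The tasknum and ntasks parameters suggest that the work can be split across several processes. The documentation above does not say how records are assigned to tasks, so the following is only a sketch under one common assumption: each process takes the integer PKs whose remainder modulo ntasks equals its tasknum. The helper name belongs_to_task is hypothetical.

```python
def belongs_to_task(pk: int, tasknum: int, ntasks: int) -> bool:
    """Assumed partitioning rule: assign a record to a task by PK modulo."""
    return pk % ntasks == tasknum


pks = list(range(1, 11))  # ten hypothetical source PKs
ntasks = 3

# Each task's share of the work; every PK lands in exactly one share.
shares = {
    tasknum: [pk for pk in pks if belongs_to_task(pk, tasknum, ntasks)]
    for tasknum in range(ntasks)
}
print(shares)  # prints {0: [3, 6, 9], 1: [1, 4, 7, 10], 2: [2, 5, 8]}
```

With the defaults tasknum=0 and ntasks=1, such a rule would select every record, which matches single-process operation.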