7.1. Overview of NLP

The purpose of NLP is to start with free text that a human wrote and end up with structured data that a machine can deal with.

CRATE provides a high-level NLP management system. It churns through databases (typically, databases that have already been de-identified by CRATE’s anonymisation system) and sends text to one or more NLP processors. It takes the results and stashes them in an NLP output database.

NLP processors can be disparate. For example, CRATE has support for:

7.1.1. Standard NLP output columns

All CRATE NLP processors use the following output columns:

Column SQL type Description
_pk BIGINT Arbitrary PK of output record
_nlpdef VARCHAR(64) Name of the NLP definition producing this row
_srcdb VARCHAR(64) Source database name (from CRATE NLP config)
_srctable VARCHAR(64) Source table name
_srcpkfield VARCHAR(64) PK field (column) name in source table
_srcpkval BIGINT PK of source record (or integer hash of PK if the PK is a string)
_srcpkstr VARCHAR(64) NULL if the table has an integer PK, but the PK itself if the PK was a string, to deal with hash collisions.
_srcfield VARCHAR(64) Field (column) name of source text

The length of the VARCHAR field is set by the MAX_SQL_FIELD_LEN constant.