caterpillar.processing package¶
caterpillar.processing.index module¶
An index represents a collection of documents and associated information about those documents. When a document is added to an index using an IndexWriter, some or all of its fields will be analysed (see caterpillar.processing.schema) and information about those fields stored in various sub-indexes. Caterpillar stores a number of sub-indexes:
- The frequencies index::

      {
          "term": count,
          "term": count,
          ...
      }

- The positions index (an inverted text index)::

      {
          "term": {
              "frame_id": [(start, end), (start, end)],
              ...
          },
          ...
      }

- The associations index::

      {
          term: {
              other_term: count,
              ...
          },
          ...
      }
Documents can be added to an index using an IndexWriter. Data can be read from an index using IndexReader. There can only ever be one IndexWriter active per index. There can be an unlimited number of IndexReader objects active per index.
The type of index stored by caterpillar is different from those stored by regular information retrieval libraries (like Lucene, for example). Caterpillar is designed for text analytics as well as information retrieval. One side effect of this is that caterpillar breaks documents down into frames. Breaking documents down into smaller parts (or context blocks) enables users to implement their own statistical methods for analysing text. Frames are a configurable component. See IndexWriter for more information.
Here is a quick example:
>>> from caterpillar.processing import index
>>> from caterpillar.processing import schema
>>> from caterpillar.storage.sqlite import SqliteStorage
>>> config = index.IndexConfig(SqliteStorage, schema.Schema(text=schema.TEXT))
>>> with index.IndexWriter('/tmp/test_index', config) as writer:
...     writer.add_document(text="This is my text")
...
'935ed96520ab44879a6e76195a9d7046'
>>> with index.IndexReader('/tmp/test_index') as reader:
... reader.get_document_count()
...
1
- exception caterpillar.processing.index.CaterpillarIndexError¶
Bases: exceptions.Exception
Common base class for index errors.
- exception caterpillar.processing.index.DocumentNotFoundError¶
Bases: caterpillar.processing.index.CaterpillarIndexError
No document by that name exists.
- class caterpillar.processing.index.IndexConfig(storage_cls, schema)¶
Bases: object
Stores configuration information about an index.
This object is a core part of any index. It is serialised and stored with every index so that an index can be opened. It tells an IndexWriter and IndexReader what type of storage class to use via storage_cls (must be a subclass of Storage) and structure of the index via schema (an instance of Schema).
In the interest of future-proofing this object, it also stores a version number with itself so that older/newer versions have the best possible chance of opening indexes.
This class might be extended later to store other things.
- dumps()¶
Dump this instance as a string for serialization.
- exception caterpillar.processing.index.IndexNotFoundError¶
Bases: caterpillar.processing.index.CaterpillarIndexError
No index exists at specified location.
- class caterpillar.processing.index.IndexReader(path)¶
Bases: object
Read information from an existing index.
Once an IndexReader is opened, it will not see any changes written to the index by an IndexWriter. To see any new changes you must open a new IndexReader.
To search an index, use searcher() to fetch a caterpillar.searching.IndexSearcher instance to execute the search with. A searcher will only work while this IndexReader remains open.
Access to the raw underlying associations, frequencies and positions indexes is provided by this class, but a caller needs to be aware that these may consume a LARGE amount of memory depending on the size of the index. As such, all access to these indexes is provided by generators (see get_frames() for example).
Once you are finished with an IndexReader you need to call the close() method.
IndexReader is a context manager and can be used via the with statement to make this easier. For example:
>>> with IndexReader('/path/to/index') as r:
...     # Do stuff
...     doc = r.get_document(d_id)
>>> # Reader is closed
Warning
While opening an IndexReader is quite cheap, it definitely isn’t free. If you are going to do a large amount of reading over a relatively short time span, it is much better to do so using one reader.
There is no limit to the number of IndexReader objects which can be active on an index. IndexReader objects are also thread-safe.
IndexReader doesn't cache any data. Every time you ask for data, the underlying caterpillar.storage.Storage instance is used to fetch it. If you were to call get_associations_index() 10 times, the data would be fetched from the storage instance each time rather than from some internal cache. The underlying storage instance may do some of its own caching, but that is transparent to us.
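For example, a brief sketch of streaming two of the raw indexes via their generators, reusing the /tmp/test_index from the quick example above (the loop bodies are placeholders):
>>> with index.IndexReader('/tmp/test_index') as reader:
...     for term, count in reader.get_frequencies():
...         pass  # (term, count): frame-level frequency of each term
...     for term, assoc in reader.get_associations_index():
...         pass  # assoc maps other_term -> co-occurrence count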
- begin()¶
Begin reading with this IndexReader.
From this point on, no changes to the underlying index made by an IndexWriter will be seen.
Warning
This method must be called before any reading can be done.
- close()¶
Release all resources used by this IndexReader.
Calling this method renders this instance unusable.
- get_associations_index()¶
Term associations for this Index.
This is used to record when two terms co-occur in a document. Be aware that only 1 co-occurrence for two terms is recorded per document no matter the frequency of each term. The format is as follows:
{
    term: {
        other_term: count,
        ...
    },
    ...
}
This method is a generator which yields key/value pair tuples.
- get_document(document_id)¶
Returns the document with the given document_id (str) as a dict.
- get_document_count()¶
Returns the int count of documents added to this index.
- get_documents(document_ids=None)¶
Generator that yields documents from this index as (id, data) tuples.
If present, the returned documents will be restricted to those with ids in document_ids (list).
- get_frame(frame_id)¶
Fetch frame frame_id (str).
- get_frame_count()¶
Return the int count of frames stored on this index.
- get_frame_ids()¶
Generator of ids for all frames stored on this index.
- get_frames(frame_ids=None)¶
Generator across frames from this index.
If present, the returned frames will be restricted to those with ids in frame_ids (list). Format of the frames index data is as follows:
{
    frame_id: { //framed data },
    frame_id: { //framed data },
    ...
}
This method is a generator that yields tuples of frame_id and frame data dict.
- get_frequencies()¶
Term frequencies for this index.
Be aware that a term's frequency is only incremented by 1 per frame no matter the frequency within that frame. The format is as follows:
{ term: count }
This method is a generator that yields key/value pairs as (term, count) tuples.
Note
If you want the term frequency at a document level rather than a frame level, you should count all of the term's positions returned by get_term_positions().
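A short sketch of that document-level count, assuming the index from the quick example (the structure returned by get_term_positions() is documented below):
>>> with index.IndexReader('/tmp/test_index') as reader:
...     positions = reader.get_term_positions('text')
...     total = sum(len(spans) for spans in positions.values())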
- get_metadata()¶
Get the metadata index.
This method is a generator that yields a key, value tuple. The index is in the following format:
{ "field_name": { "value": ["frame_id", "frame_id"], "value": ["frame_id", "frame_id"], "value": ["frame_id", "frame_id"], ... }, ... }
- get_plugin_data(plugin, container, keys=None)¶
All data stored for plugin (a caterpillar.processing.plugin.AnalyticsPlugin) from container (str).
If not None, the key/value pairs returned will be limited to those with a key in keys (list).
This method is a generator that yields key/value tuples.
- get_positions_index()¶
Get all term positions.
This is a generator which yields key/value pair tuples.
This is what is known as an inverted text index. Structure is as follows:
{ "term": { "frame_id": [(start, end), (start, end)], ... }, ... }
- get_revision()¶
Return the str revision identifier for this index.
The revision identifier is a version identifier. It gets updated every time the index gets changed.
- get_schema()¶
Get the caterpillar.processing.schema.Schema for this index.
- get_setting(name)¶
Get the setting identified by name (str).
- get_settings(names)¶
All settings listed in names (list).
This method is a generator that yields name/value pair tuples. The format of the settings index is:
{ name: value, name: value, ... }
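For instance, a hedged sketch collecting a couple of settings into a dict (the setting names are purely illustrative):
>>> with index.IndexReader('/tmp/test_index') as reader:
...     settings = dict(reader.get_settings(['name1', 'name2']))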
- get_term_association(term, association)¶
Returns a count of term associations between term (str) and association (str).
- get_term_frequency(term)¶
Return the frequency of term (str) as an int.
- get_term_positions(term)¶
Returns a dict of term positions for term (str).
Structure of returned dict is as follows:
{
    frame_id1: [(start, end), (start, end)],
    frame_id2: [(start, end), (start, end)],
    ...
}
- get_vocab_size()¶
Get total number of unique terms identified for this index (int).
- searcher(scorer_cls=<class 'caterpillar.searching.scoring.TfidfScorer'>)¶
Return an IndexSearcher for this Index.
- exception caterpillar.processing.index.IndexWriteLockedError¶
Bases: caterpillar.processing.index.CaterpillarIndexError
There is already an existing writer for this index.
- class caterpillar.processing.index.IndexWriter(path, config=None)¶
Bases: object
Write to an existing index or create a new index and write to it.
An instance of an IndexWriter represents a transaction. To begin the transaction, you need to call begin() on this writer. There can only be one active IndexWriter transaction per index. This is enforced via a write lock on the index. When you begin the write transaction, the IndexWriter instance tries to acquire the write lock. By default it will block indefinitely until it gets the write lock, but this can be overridden using the timeout argument to begin(). If begin() times out when trying to get the lock, IndexWriteLockedError is raised.
Writes to an index are internally buffered; once they reach IndexWriter.RAM_BUFFER_MB, flush() is called automatically. Alternatively, a caller is free to call flush() whenever they like. Calling flush() will take any in-memory writes recorded by this class and write them to the underlying storage object using the methods it provides.
Note
The behaviour of flush() depends on the underlying Storage implementation used. Some implementations might just record the writes in memory. Consult the specific storage type for more information.
Once you have performed all the writes/deletes you like, you need to call commit() to finalise the transaction. Alternatively, if there was a problem during your transaction, you can call rollback() instead to revert any changes you made using this writer. IMPORTANT - Finally, you need to call close() to release the lock.
Using IndexWriter this way should look something like this:
>>> writer = IndexWriter('/some/path/to/an/index')
>>> try:
...     writer.begin(timeout=2)  # Timeout in 2 seconds
...     # Do stuff, like add_document(), flush() etc...
...     writer.commit()  # Write the changes (calls flush())
... except IndexWriteLockedError:
...     pass  # Do something else, maybe try again
... except SomeOtherException:
...     writer.rollback()  # Error occurred, undo our changes
... finally:
...     writer.close()  # Release lock
This class is also a context manager and so can be used via the with statement. HOWEVER, be aware that using this class in a context manager will block indefinitely until the lock becomes available. Using the context manager has the added benefit of calling commit()/rollback() (if an exception breaks the context) and close() for you automatically:
>>> writer = IndexWriter('/some/path/to/a/index')
>>> with writer:
...     writer.add_document(field="value")
Again, be warned that this will block until the write lock becomes available!
Finally, pay particular attention to the frame_size argument of add_document(). This determines the size of the frames the document will be broken up into.
- add_document(frame_size=2, encoding='utf-8', encoding_errors='strict', **fields)¶
Add a document to this index.
We index TEXT fields by breaking them into frames for analysis. The frame_size (int) param controls the size of those frames. Setting frame_size to an int < 1 will result in all text being put into one frame or, to put it another way, the text not being broken up into frames.
Note
Because we always store a full positional index with each index, we are still able to do document level searches like TF/IDF even though we have broken the text down into frames. So, don’t fret!
encoding (str) and encoding_errors (str) are passed directly to str.decode() to decode the data for all TEXT fields. Refer to its documentation for more information.
**fields is the fields and their values for this document. Calling this method will look something like this:
>>> writer.add_document(field1=value1, field2=value2)
Any unrecognized fields are just ignored.
Raises TypeError if something other than str or bytes is given for a TEXT field, and IndexError if there are any problems decoding a field.
This method will call flush() if the internal buffers go over RAM_BUFFER_MB.
Returns the id (str) of the document added.
Internally what is happening here is that a document is broken up into its fields and a mini-index of the document is generated and stored with our buffers for writing out later.
- add_fields(**fields)¶
Add new fields to the schema.
All keyword arguments are treated as (field_name, field_type) pairs.
- begin(timeout=None)¶
Acquire the write lock and begin the transaction.
If this index has yet to be created, create it (folder and storage). If timeout (int) is omitted (or None), wait forever trying to acquire the lock. If timeout > 0, try to acquire the lock for that many seconds. If the lock period expires and the lock hasn't been acquired, raise IndexWriteLockedError. If timeout <= 0, raise IndexWriteLockedError immediately if the lock can't be acquired.
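For example, a hedged sketch of a non-blocking acquire using timeout <= 0 (the path is illustrative):
>>> writer = IndexWriter('/tmp/test_index')
>>> try:
...     writer.begin(timeout=0)  # Raise immediately if locked
... except IndexWriteLockedError:
...     pass  # Someone else holds the write lock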
- close()¶
Close this writer.
Calls rollback() if we are in the middle of a transaction.
- commit()¶
Commit changes made by this writer by calling flush() then commit() on the storage instance.
- delete_document(d_id)¶
Delete the document with given d_id (str).
Raises a DocumentNotFoundError exception if d_id doesn't match any document.
- flush()¶
Flush the internal buffers to the underlying storage implementation.
This method iterates through the frames that have been buffered by this writer and creates an internal index of them. Then, it merges that index with the existing index already stored via storage. Finally, it also flushes the internal frames and documents buffers to storage before clearing them. This includes both added documents and deleted documents.
Warning
Even though this method flushes all internal buffers to the underlying storage implementation, this does NOT constitute committing your changes! To actually have your changes persisted by the underlying storage implementation, you NEED to call commit()! Nothing is final until commit() is called.
- fold_term_case(merge_threshold=0.7)¶
Perform case folding on this index, merging words into names (camel-cased words or phrases) and vice versa depending on merge_threshold.
merge_threshold (float) is used to test when to merge two variants. When the ratio between the word and name versions of a term falls below this threshold, the merge is carried out.
Warning
This method calls flush before it runs and doesn’t use the internal buffers.
- merge_terms(merges)¶
Merge the terms in merges.
merges (list) should be a list of str tuples of the format (old_term, new_term,). If new_term is '' then old_term is removed. N-grams can be specified by supplying a str tuple instead of str for the old term. For example:
>>> (('hot', 'dog'), 'hot dog')
The n-gram case does not allow empty values for new_term.
Warning
This method calls flush before it runs and doesn’t use the internal buffers.
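For instance, a hedged sketch of a merges list exercising all three documented cases (the terms themselves are illustrative):
>>> writer.merge_terms([
...     ('colour', 'color'),          # rename old_term to new_term
...     ('foo', ''),                  # empty new_term removes old_term
...     (('hot', 'dog'), 'hot dog'),  # n-gram old_term as a str tuple
... ])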
- rollback()¶
Rollback any changes made by this writer.
- run_plugin(cls, **args)¶
Instantiates and runs the plugin.AnalyticsPlugin type in cls on this index.
Creates an instance of the plugin by passing it this index, then calls its run() method with **args and saves the result into container(s) prefixed with the plugin's name.
Any keyword arguments passed to this method are passed on to the run() method of the plugin.
Returns the plugin instance.
- set_schema(schema)¶
Update the schema for this index.
- set_setting(name, value)¶
Set the value of setting identified by name.
- exception caterpillar.processing.index.SettingNotFoundError¶
Bases: caterpillar.processing.index.CaterpillarIndexError
No setting by that name exists.
- exception caterpillar.processing.index.TermNotFoundError¶
Bases: caterpillar.processing.index.CaterpillarIndexError
Term doesn’t exist in index.
- caterpillar.processing.index.find_bi_gram_words(frames, min_bi_gram_freq=3, min_bi_gram_coverage=0.65)¶
This function finds bi-gram words from the specified frames iterable.
For two terms to be considered a bi-gram, the pair must occur at least min_bi_gram_freq (int) times across all frames, and the ratio of bi-gram appearances to non-bi-gram appearances must be no less than min_bi_gram_coverage (float). For example, if min_bi_gram_coverage is 0.65 (the default) and the bi-gram is good quality, then the frequency of the bi-gram good quality divided by the frequency of the term good must be higher than 0.65, and the frequency of the bi-gram good quality divided by the frequency of the term quality must also be higher than 0.65. If both of these conditions hold and the bi-gram frequency is >= min_bi_gram_freq, it will be returned as a bi-gram.
This function uses a caterpillar.processing.analysis.analyse.PotentialBiGramAnalyser to identify potential bi-grams. Names and stopwords are not considered for bi-grams.
Returns a list of bi-gram strings that pass the criteria.
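A hedged sketch tying this together with IndexReader and IndexWriter.merge_terms(), assuming the frames argument accepts what get_frames() yields:
>>> with index.IndexReader('/tmp/test_index') as reader:
...     bi_grams = index.find_bi_gram_words(reader.get_frames())
>>> with index.IndexWriter('/tmp/test_index') as writer:
...     writer.merge_terms([(tuple(bg.split(' ')), bg) for bg in bi_grams])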
caterpillar.processing.plugin module¶
- class caterpillar.processing.plugin.AnalyticsPlugin(index)¶
Bases: object
Plugins are registered on an index and allow external pieces of analytics to run on the index and store their results in a container.
Plugins are run by the index and get passed an instance of the index with which to work. This allows them to access the underlying data structures of the index if they desire. They are expected to return a dict from string -> dict (string -> string) which will be stored on the index. Each item in the returned dict will be added as a container to the storage object of the index.
Plugins must define a run() method by which they will be called. The method must return a dict of string -> dict(string -> string). They are also responsible for giving access to the underlying data structures they store on the index. How they do this is up to them.
- get_name()¶
Get the name of this plugin. Used when storing the output of a plugin on an index.
- run(**fields)¶
The run method is how an index will call the plugin passing any arguments it was called with.
This method must return a dict in the following format for storage on the index:
{
    container_name: {
        key(str): value(str),
        ...
    },
    container_name: {
        key(str): value(str),
        ...
    },
}
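A minimal sketch of a conforming plugin; the class name and returned values are illustrative, and only the documented contract (run() returning a dict of string -> dict(string -> string)) is assumed:
>>> from caterpillar.processing import plugin
>>> class ArgCountPlugin(plugin.AnalyticsPlugin):
...     def get_name(self):
...         return 'arg_count'
...     def run(self, **fields):
...         # Each key of the returned dict becomes a container
...         # on the index's storage object.
...         return {'summary': {'num_args': str(len(fields))}}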
caterpillar.processing.schema module¶
Indexes (see caterpillar.processing.index) in caterpillar must have a Schema. This module defines that schema and also provides a bunch of utility functions for working with schemas and CSV files.
- class caterpillar.processing.schema.BOOLEAN(indexed=False, stored=True)¶
Bases: caterpillar.processing.schema.CategoricalFieldType
bool field type that lets you index boolean values (True and False).
The field converts the bool values to text for you before indexing.
- class caterpillar.processing.schema.CATEGORICAL_TEXT(indexed=False, stored=True)¶
Bases: caterpillar.processing.schema.CategoricalFieldType
Configured field type for categorical text fields.
- class caterpillar.processing.schema.CategoricalFieldType(analyser=<caterpillar.processing.analysis.analyse.EverythingAnalyser object>, indexed=False, stored=True)¶
Bases: caterpillar.processing.schema.FieldType
Represents a categorical field type. Categorical fields can extend this class for convenience.
- value_of(raw_value)¶
Return the value of raw_value after being processed by this field type’s analyse method.
- class caterpillar.processing.schema.ColumnDataType¶
Bases: object
A ColumnDataType object identifies different data types in a CSV column.
There are five possible data types:

- FLOAT – A floating point number. Should be a string in a format that float() can parse.
- INTEGER – An integer. Should be in a format that int() can parse.
- STRING – A string type. This type ISN'T analysed. It is just stored.
- TEXT – A string type. This is like STRING but it IS analysed and stored (it is used to generate a frame stream).
- IGNORE – Ignore this column.
- class caterpillar.processing.schema.ColumnSpec(name, type)¶
Bases: object
A ColumnSpec object represents information about a column in a CSV file.
This includes the column’s name and its type.
- class caterpillar.processing.schema.CsvSchema(columns, has_header, dialect, sample_rows=[])¶
Bases: object
This class represents the schema required to process a particular CSV data file.
Required Arguments:

- columns – A list of ColumnSpec objects to define how the data should be processed.
- has_header – A boolean indicating whether the first row of the file contains headers.
- dialect – The dialect to use when parsing the file.

Optional Arguments:

- sample_rows – A list of row data that was used to generate the schema.
- as_index_schema(bi_grams=None, stopword_list=None)¶
Return a representation of this CsvSchema in the form of a Schema instance that can be used for generating a text index.
Optional Arguments:

- bi_grams – A list of bi-grams to use for TEXT columns.
- stopword_list – A list of stop words to use instead of the default English list.
- map_row(row)¶
Convert a row into dict form to make it easier to index.
Required Arguments:

- row – A list of values for a row.
- class caterpillar.processing.schema.FieldType(analyser=<caterpillar.processing.analysis.analyse.EverythingAnalyser object>, indexed=False, categorical=False, stored=True)¶
Bases: object
Represents a field configuration. Schema objects are built out of fields.
The FieldType object controls how a field is analysed via the analyser (caterpillar.processing.analysis.Analyser) attribute.
If you don't provide an analyser for your field, it will default to a caterpillar.processing.analysis.EverythingAnalyser.
- analyse(value)¶
Analyse value, returning a caterpillar.processing.analysis.tokenize.Token generator.
- equals(value1, value2)¶
Returns whether value1 is equal to value2.
- equals_wildcard(value, wildcard_value)¶
Returns whether value matches regex wildcard_value.
- evaluate_op(operator, value1, value2)¶
Evaluate operator (str from FieldType.FIELD_OPS) on operands value1 and value2.
- gt(value1, value2)¶
Returns whether value1 is greater than value2.
- gte(value1, value2)¶
Returns whether value1 is greater than or equal to value2.
- lt(value1, value2)¶
Returns whether value1 is less than value2.
- lte(value1, value2)¶
Returns whether value1 is less than or equal to value2.
- class caterpillar.processing.schema.ID(indexed=False, stored=True)¶
Bases: caterpillar.processing.schema.CategoricalFieldType
Configured field type that indexes the entire value of the field as one token. This is useful for data you don’t want to tokenize, such as the path of a file.
- class caterpillar.processing.schema.NUMERIC(indexed=False, stored=True, num_type=<type 'int'>, default_value=None)¶
Bases: caterpillar.processing.schema.CategoricalFieldType
Special field type that lets you index ints or floats.
- class caterpillar.processing.schema.Schema(**fields)¶
Bases: object
Represents the collection of fields in an index. Maps field names to FieldType objects which define the behavior of each field.
Low-level parts of the index use field numbers instead of field names for compactness. This class has several methods for converting between the field name, field number, and field object itself.
- add(name, field_type)¶
Adds a field to this schema.
name (str) is the name of the field. field_type (FieldType) is either an instantiated FieldType object or a FieldType subclass. If you pass an instantiated object, the schema will use that as the field configuration for this field. If you pass a FieldType subclass, the schema will automatically instantiate it with the default constructor.
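For example, a brief sketch using field types from this module (the field names are illustrative):
>>> s = schema.Schema()
>>> s.add('title', schema.TEXT)                      # Subclass: default constructor used
>>> s.add('rating', schema.NUMERIC(num_type=float))  # Instantiated field type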
- items()¶
Returns a list of ("field_name", field_object) pairs for the fields in this schema.
- names()¶
Returns a list of the names of the fields in this schema.
- class caterpillar.processing.schema.TEXT(analyser=<caterpillar.processing.analysis.analyse.DefaultAnalyser object>, indexed=True, stored=True)¶
Bases: caterpillar.processing.schema.FieldType
Configured field type for text fields.
- caterpillar.processing.schema.csv_has_header(csv_file, dialect, num_check_rows=50)¶
Custom heuristic for recognising a header row in CSV files. Intended as an alternative to the csv.Sniffer.has_header method, which doesn't work well for mostly-text CSV files.
The heuristic we use simply checks the total size of the header row compared to the average row size for the following sample rows. If a large discrepancy is found, we assume that the first row contains headers.
Required Arguments:

- csv_file – The CSV data file to analyse.
- dialect – CSV dialect to use.

Optional Arguments:

- num_check_rows – The number of rows to analyse (defaults to 50).
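An illustrative sketch of that size heuristic, not the library's actual implementation; the 50% discrepancy cut-off is an assumption:

    import csv

    def looks_like_header(csv_file, dialect, num_check_rows=50):
        """Guess whether csv_file starts with a header row (illustrative only)."""
        reader = csv.reader(csv_file, dialect)
        first_row = next(reader)
        sample = [row for _, row in zip(range(num_check_rows), reader)]
        if not sample:
            return False
        header_size = sum(len(cell) for cell in first_row)
        avg_size = float(sum(sum(len(cell) for cell in row)
                             for row in sample)) / len(sample)
        # A large discrepancy between the first row's size and the average
        # sample row size suggests the first row contains headers.
        return abs(header_size - avg_size) > 0.5 * avg_size  # Assumed cut-off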
- caterpillar.processing.schema.generate_csv_schema(csv_file, delimiter=',', encoding='utf8')¶
Attempt to generate a schema for the csv file automatically.
Required Arguments:

- csv_file – The CSV file to generate a schema for.

Optional Arguments:

- delimiter – CSV delimiter character.
- encoding – Character encoding of the file.
Returns a 2-tuple containing the generated schema and the sample rows used to generate the schema.
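A hedged usage sketch (the file name is an assumption):
>>> with open('reviews.csv', 'rb') as f:
...     csv_schema, sample_rows = schema.generate_csv_schema(f)
>>> index_schema = csv_schema.as_index_schema()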