omterms package

Submodules

omterms.cleaner module

OpenMaker text cleaner.

Author: Bulent Ozel e-mail: bulent.ozel@gmail.com

The module contains a set of tools to clean textual data.

class omterms.cleaner.TextCleaner(stopwords=[], exceptions=[], fstemmer=<function TextCleaner.<lambda>>, stemcheck=False)[source]

Bases: object

An object that contains a set of tools to clean and preprocess textual data.

Note:
The object uses the nltk.FreqDist object. For stem checks during pruning it needs an external stemmer.
Attributes:
exceptions (list of str): List of excepted terms.
stopwords (list of str): List of stopwords.
stemf: A stemmer function.
clean(words, display_top=10, logging=True, exceptions=[])[source]

Removes punctuation and stopwords from a corpus.

Args:
words (list of str): The input corpus as a list of words.
display_top (int, optional): Logging size (default 10).
logging (bool, optional): When True, stdout logging is done (default True).
exceptions (list of str, optional): The list of terms that will not
be pruned (default empty).
Returns:
(nltk.FreqDist): Returns the trimmed corpus as the NLTK obj.
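Example (an illustrative sketch added here, not from the original docstring; it assumes only the documented behaviour of clean and an installed NLTK):

>>> cleaner = TextCleaner(stopwords=['the', 'a', 'of'])
>>> fd = cleaner.clean('the art of making things'.split(), logging=False)
>>> sorted(fd.keys())
['art', 'making', 'things']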
extend_stopwords(spointer)[source]

The method extends the stopword list with new entries.

Args:
spointer (list of str or str): Either a file path string or
a list of stopwords.
Returns:
bool: True if successful, False otherwise.
Raises:
FileNotFoundError: Raised if a given file is not accessible.
static freq_dist(words)[source]

The static method computes the frequency distribution of a word list.

Args:
words (list of str): The list of words.
Returns:
(nltk.FreqDist): Returns frequency dist.
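Example (illustrative; the counts follow directly from nltk.FreqDist semantics):

>>> TextCleaner.freq_dist(['b', 'a', 'b']).most_common()
[('b', 2), ('a', 1)]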
isexception(term, exceptions=[], stemcheck=False)[source]

Checks whether a given term is an exception.

Args:
term (str): The term.
exceptions (list, optional): The list of exception terms (default empty).
stemcheck (bool, optional): Whether the stem of the term is also checked (default False).
Returns:
(bool): True if the term is an exception, False otherwise.
load_stopwords(spointer)[source]

The method loads a new stopword list, replacing the current one.

Note:
The internal stopword list is overwritten.
Args:
spointer (list of str or str): Either a file path string or
a list of stopwords.
Returns:
bool: True if successful, False otherwise.
Raises:
FileNotFoundError: Raised if a given file is not accessible.
static make_exceptions(exceptions, stemf=<function TextCleaner.<lambda>>, stemcheck=False)[source]

The static method makes the exception list and returns it.

Args:
exceptions (list): The list of exception terms.
stemf (x str -> y: str, optional): A stemmer function (default f(x) = x).
stemcheck (bool, optional): Whether the list is to be extended with the stems (default False).
Returns:
(set): The exception set. If stemcheck is opted, both the terms and their stems
are represented in the set.
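Example (an illustrative sketch; the one-character-chopping stemmer is a hypothetical toy, used only to show the stemcheck expansion):

>>> sorted(TextCleaner.make_exceptions(['makers'], stemf=lambda w: w[:-1], stemcheck=True))
['maker', 'makers']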
remove_contains(freq_dist, literals="., ():;!?\n`'=+/\\[]}{|><@#$%^&*_‘’“”.", exceptions=[])[source]

Removes the terms that contain the specified literals.

Args:
freq_dist (nltk.FreqDist): The term frequency distribution.
literals (list of str): The list of literals.
exceptions (list of str, optional): The exception list.
Returns:
(nltk.FreqDist): Returns frequency dist.
remove_numerals(freq_dist, remove_any=False, exceptions=[])[source]

The method removes terms with numeral literals.

Note:
When remove_any is selected, mixed terms such as 3D would vanish.
Args:
freq_dist (nltk.FreqDist): The term frequency distribution.
remove_any (bool, optional): If True, terms mixing numerals and letters are removed.
exceptions (list of str, optional): The exception list.
Returns:
(nltk.FreqDist): Returns frequency dist.
remove_panctuation(freq_dist, exceptions=[])[source]

The static method removes punctuation-only terms.

Args:
freq_dist (nltk.FreqDist): The term frequency distribution.
exceptions (list of str, optional): The exception list.
Returns:
(nltk.FreqDist): Returns frequency dist.
remove_rare_terms(freq_dist, below=3, exceptions=[])[source]

The method removes terms that have rare occurrences.

Note:
Such removal may help reduce erroneous and random terms.
Args:
freq_dist (nltk.FreqDist): The term frequency distribution.
below (int, optional): The minimum allowed frequency count (default 3).
exceptions (list of str, optional): The exception list.
Returns:
(nltk.FreqDist): Returns frequency dist.
remove_short_terms(freq_dist, threshold=1, exceptions=[])[source]

The method removes terms that are below a certain length.

Args:
freq_dist (nltk.FreqDist): The term frequency distribution.
threshold (int, optional): The character length of a term (default 1).
exceptions (list of str, optional): The exception list.
Returns:
(nltk.FreqDist): Returns frequency dist.
remove_stopwords(freq_dist, exceptions=[])[source]

The static method removes stopwords.

Args:
freq_dist (nltk.FreqDist): The term frequency distribution.
exceptions (list of str, optional): The exception list.
Returns:
(nltk.FreqDist): Returns frequency dist.
reset_exceptions()[source]

Resets the exception set to empty.

Returns:
(bool): True.
set_exceptions(exceptions, stemcheck=False)[source]

Sets instance-wide exception set.

Args:
exceptions (list of str): The list of exception terms.
stemcheck (bool, optional): Whether the list is to be extended with the stems (default False).
Returns:
(bool): True

omterms.datauis module

OpenMaker textual data holders and interfaces.

Author: Bulent Ozel e-mail: bulent.ozel@gmail.com

The module contains a set of tools used to hold and process text corpora and collections.

class omterms.datauis.Corpus(tf_dist, stemmer=None)[source]

Bases: object

A generic class to be used for foreground or background corpora.

Attributes:
tf_dist (nltk.FreqDist): An NLTK container for the tokenized and cleaned
terms in the corpus.
stemmer (x str -> y: str): A stemmer function.
stems (dict): A dictionary of terms and their stems.
labels (dict): Term-level labels.
scores (dict): A dictionary of terms and their corpus specificity scores
as of a reference corpus.
ref (dict): A dictionary of terms that holds the normalized occurrence frequency of
each term in a reference corpus.
sepficifs (set): A set of terms appointed/associated with the corpus externally.

To be implemented:

texts_raw (json): A JSON collection of the raw texts of the corpus.

texts_clean (json): A JSON collection of the cleaned/processed texts of the corpus.

tf_idf (json): Tf-idf analyses of the corpus.
compute_stems()[source]

The method computes and stores the stems of the terms in the corpus using the appointed stemmer.

Returns:
(bool): True
difference(other, as_corpus=False, stats=False)[source]

The method identifies and returns the difference of self from the other corpus.

Note:
Implementation needs style and refactoring.
Args:
other (Corpus): An instance of this Corpus class.
as_corpus (bool): When True it returns a new Corpus (default False).
stats (bool): When True and as_corpus is False, returns the frequency
count of the difference set.
Returns:
(dict): A dictionary of terms and their frequency counts.
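Example (illustrative, mirroring the union example below; it relies only on the documented return shape):

>>> d = Corpus(FreqDist('abbbc')).difference(Corpus(FreqDist('bccd')))
>>> 'a' in d and 'b' not in d
True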
get_count_uniques()[source]

The method returns the number of unique terms in the corpus.

Returns:
(int): The number of unique terms.
get_least_frequents(bottom=42)[source]

The method identifies and returns least frequent terms.

Args:
bottom (int): Size of the short list (default 42).
Returns:
(list of tuple of str and int): Returns the frequency dist
for the least frequent terms as list of tuples of term and frequency pairs.
get_size()[source]

The method returns the size of the corpus in terms of the number of terms it has.

Returns:
(int): Returns an integer. It is the summation of the raw frequency counts.
get_stems()[source]

The method returns a dictionary of terms and their corresponding stems.
Returns:
(dict): A dictionary of {term:stem of the term}.
get_top_frequents(top=42)[source]

The method identifies and returns top frequent terms.

Args:
top (int): Size of the short list (default 42).
Returns:
(list of tuple of str and int): Returns the frequency dist
for top terms as list of tuples of term and frequency pairs.
intersection(other, as_corpus=False, stats=False)[source]

The method identifies and returns the intersection of two corpora.

Args:
other (Corpus): An instance of this class.
as_corpus (bool): When True it returns a new Corpus (default False).
stats (bool): When True and as_corpus is False, returns the frequency
count of the intersection.
Returns:
(list of str): If as_corpus is False and stats is False,
it returns the list of joint terms.
(list of tuple of str and int): Returns the frequency dist
for the joint terms, if as_corpus is False and stats is True.
(Corpus): In all other cases it returns a new Corpus instance for the intersection.
Frequencies are the minimum of the two occurrences.
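Example (illustrative, mirroring the union example below; frequencies follow the documented minimum-of-the-two rule):

>>> Corpus(FreqDist('abbbc')).intersection(Corpus(FreqDist('bccd')), stats=True)
[('b', 1), ('c', 1)]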
label(marker, labels=None)[source]

The method labels the not-yet-labeled terms according to a user-defined scheme.

Args:
marker (x str -> y: str): A marker function
Returns:
(dict): A dictionary of {term:label of term}.
list_terms()[source]

It returns the list of terms in the corpus.

Note:
Implementation needs refactoring.
Returns:
(list): An alphabetically sorted list.
plot(top, cumulative=False)[source]

Plots the frequency distribution of the corpus.

Returns:
(bool): True
set_specific_set(terms)[source]

The function sets the set of corpus specific terms.

Args:
terms (set): The set of corpus-specific terms.
Returns:
(bool): True.
set_stemmer(stemmer)[source]

Appoints a new stemmer function to the corpus.

Args:
stemmer (x str -> y: str): A stemmer function (default None)
Returns:
(bool): True
tabulate(top)[source]

Tabulates the frequency distribution of the corpus.

Note:
Works better when used to see few top terms.
Returns:
(bool): True
to_pandas()[source]

The function exports its data into a pandas DataFrame.

Note:
ToDo: The function needs parameterization, generalization and error checks.
Returns:
(pandas.DataFrame): The tabulated data.
union(other, as_corpus=False, stats=False)[source]

The method identifies and returns the union of two corpora.

Args:
other (Corpus): An instance of this class.
as_corpus (bool): When True it returns a new Corpus (default False).
stats (bool): When True and as_corpus is False, returns the frequency
count of the union.
Returns:
(list of str): If as_corpus is False and stats is False,
it returns the list of terms present in either corpus.
(list of tuple of str and int): Returns the frequency dist
for the union of terms, if as_corpus is False and stats is True.
(Corpus): In all other cases it returns a new Corpus instance for the union.
Frequencies are the maximum of the two occurrences, as in the example below.
Examples:
>>> Corpus(FreqDist('abbbc')).union(Corpus(FreqDist('bccd')), stats = True)
[('a', 1), ('b', 3), ('c', 2), ('d', 1)]
class omterms.datauis.WikiArticles(collection)[source]

Bases: object

The object contains a set of tools to process the set of
documents collected and cleaned by the wiki crawler.
Attributes:
collection_json (str): This is a filename to the scraped data.

Each JSON document is expected to have the following fields:
  • theme: Topic identifier, e.g., Sustainability
  • theme.id: A unique category identifier
  • document.id: A unique document id
  • title: Title of the document
  • url: Full URL of the document
  • depth: The link distance from the seed document. The seed document's depth is 0.
  • text: The string data scraped from the page without tags. Punctuation is not
    required, but terms are expected to be delineated by white space.

collection (list of dict): The loaded JSON file as a native list of dictionaries.

collate(by_theme_id=None, by_doc_ids=[], marker='\n')[source]
The method collects the desired set of documents and concatenates them, creating a unified document.
The order of the merge is as follows:
  • When neither a theme id nor doc ids are provided, it collates the entire text.
  • If a theme is given, then all the documents under that theme are joined first.
  • When a list of docs is given, only those in the list are kept. Note that if both
    a theme id and doc ids are provided, precedence is on themes.
Args:
by_theme_id (int, optional): The theme id of the docs to be collated.
by_doc_ids (list of int, optional): The list of doc ids to be collated (default empty).
marker (str, optional): A delimiter (default newline).
Returns:
(str): The collated text.
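Example (a usage sketch; the file path is hypothetical and 'Sustainability' is the theme named in the attribute description above):

>>> wa = WikiArticles('./data/wiki_scrape.json')
>>> tid = wa.get_theme_id('Sustainability')
>>> text = wa.collate(by_theme_id=tid, marker='\n')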
display_documents_list(tid=None, stdout=True)[source]

Lists the articles' metadata and the crawling information on them.

Args:
tid (int, optional): Used if document info under a specific theme is desired;
otherwise a summary of the whole set is returned or displayed (default None).
stdout (bool, optional): Whether the info is to be displayed/printed
to standard io (default True).
Returns:
(list): A summary list of documents in the collection.
get_document_fields()[source]

The method lists the fields of each JSON document in its collection.

Args:
None
Returns:
(dict_keys): List of keys.
None: When the collection is empty.
get_theme_id(theme_name)[source]

The method returns the topic id of the first theme matching the given name.

Args:
theme_name (str): The theme or topic name.
Returns:
(int): A unique theme identifier.
get_theme_title(theme_id)[source]

The method returns the topic name.

Args:
theme_id (int): The unique theme id.
Returns:
(str): Topic name.
list_themes()[source]

The method lists the summary of themes/topics in the collection.

Args:
None
Returns:
(list of dict): A list of dictionaries where the keys are:
  • name: theme’s textual descriptor
  • id: theme’s unique id
  • count: number of articles under the theme.
load_corpus(collection=None)[source]

The method imports a JSON file into a native collection.

Args:
collection (str): A filename to a previously scraped data.
Returns:
(bool): True.
prune(themes_to_keep=[], docs_to_drop=[], istodrop=<function WikiArticles.<lambda>>)[source]
The method is used to filter out documents from the set.

The order of pruning is as follows:
  • When a non-empty list is provided, all the documents not belonging to the themes
    to be kept are pruned entirely. Note that when the initial list is empty it has no effect.
  • Of the remaining documents, those that appear in docs_to_drop are pruned.
  • Of the remaining docs, those that produce True at a call of the predicate function are dropped.

The function can be called repeatedly until a desired level of pruning is achieved.

Args:
themes_to_keep (list of int, optional): The list of theme ids to be kept (default empty).
docs_to_drop (list, optional): The list of doc ids to be dropped (default empty).
istodrop (x dict_item -> bool, optional): A predicate function (default lambda x: False).
Returns:
(bool): True.
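Example (a usage sketch continuing the WikiArticles sketch above; the ids are hypothetical and the predicate is assumed to receive a document record with the documented depth field):

>>> wa.prune(themes_to_keep=[1, 2],
...          docs_to_drop=[17],
...          istodrop=lambda doc: doc['depth'] > 2)
True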

omterms.interface module

OpenMaker term extraction application interface.

Author: Bulent Ozel e-mail: bulent.ozel@gmail.com

The application interface provides encapsulation and standardization for preparing a set of texts for further analyses. The standardized term extraction process covers tokenization, counting (both raw and normalized), cleaning, stemming (optional), and scoring (optional) choices. The tabulated output can be exported in .csv file format and/or Pandas dataframe format.

The input text(s) can be provided in any of the following formats:
  • raw text
  • tokenized text
  • tokenized and counted

In the same manner, if background-corpus-based scoring is desired, then the reference corpus can be provided in any of the above formats. Depending on the desired actions on the input text, the output may contain not only raw and normalized term frequency counts but also the stems of the terms, each term's frequency count in the background corpus (when provided), and the term's log-likelihood weight with respect to its prevalence in the reference corpus.

For details please see the README and tutorials that come along with the installation of this package.

Attributes:
OUTPUT_FOLDER (file path): The folder to export the tabulated outputs
in .csv files.
OUTPUT_FNAME (file name): The default output file name where results
for multiple texts are merged and presented.
STOPWORDS_STANDARD (file path): The location of the standard stopword list,
if desired and it exists.
STOPWORDS_SPECIFIC (file path): The location of the topic-specific stopword
list, if desired and it exists.
NOTALLOWED (list of str): The list of symbols that would flag the
removal of a term, if needed.

Note: The removal does not take place if the term is a specific term or
marked as an exception.
TERMS_SPECIFIC (file path): The location of the exception list of terms, which is
shielded from the cleaning process, if needed.

TOKENIZER_FUNC (x str -> y: str): A tokenizer function.

STEMMER_FUNC (x str -> y: str): A stemmer function,
if needed.

MIN_LENGTH (int): The minimum allowed term length.

MIN_FREQ (float): The minimum allowed frequency count for
the tabulated outputs.
MODEL_THRESHOLD (int): If scoring is requested and the input text
is not derived from the reference corpus, then the parameter is used when training the prediction model for the terms that do not occur in the reference corpus.
Todo:
  • Re-implement the module as a memoized object either via a class or
    a via wrapper function.
  • Add a functionality where the configuration parameters can be loaded
    from a JSON file.
  • Make tokenization an optional process, for the cases where the input
    is already provided in tokens.
omterms.interface.extract_terms(texts, tokenizer=<function tokenize_strip_non_words>, merge=False, min_termlength=1, min_tf=1, topics=[], extra_process=[], stemmer=<bound method PorterStemmer.stem of <PorterStemmer>>, refcorpus=None, export=False, basefname='omterms.csv', outputdir='./output/', notallowed_symbols="., ():;!?\n`'=+/\\[]}{|><@#$%^&*_‘’“”.", nonremovable_terms='./data/specifics_openmaker.txt', file_standard_stopwords='./data/stopwords_standard.txt', file_specific_sopwords='./data/stopwords_openmaker.txt', regression_threshold=1.0)[source]

Term extraction modules main driver function.

Args:

texts (str or dict of str or omterms.WikiArticles): The input text can be any of the following:

  • a string,
  • or a dictionary of strings where the key denotes the topic
    or any desired label/annotation regarding the text,
  • or a special data holder which contains labeled text scraped
    from Wikipedia articles.
tokenizer (x str -> y: str): The tokenizer
(default omterms.tokenizer.tokenize_strip_non_words).
merge (bool): When a collection of texts is provided via a dict or
WikiArticles, the parameter determines whether they should be concatenated for the term extraction (default False).

min_termlength (int): The minimum allowed term length (default 1).

min_tf (int): The minimum allowed frequency count for the tabulated
outputs (default 1).
topics (list of str, optional): The list of topics from the input
texts to be considered (default empty).

If a topic list is not provided and merging is not requested, but the input text is given either via a dict or via the WikiArticles data holder, then the topic list will be derived from the input collection automatically.

extra_process (list of str, optional): Whether stemming and/or scoring
is requested (default empty).
  • 'stem' is used to flag stemming.
  • 'compare' is used to flag scoring texts against the designated
    reference corpus.
stemmer (x str -> y: str, optional): A stemmer function,
if needed (default omterms.stemmer.porter).
refcorpus (str or list of str or dict of str or omterms.WikiArticles, optional): The reference corpus
(default None).
The refcorpus can be any of the following:
  • None: if it is None and a scoring process is requested, then
    NLTK's Brown corpus is loaded.
  • a string: a plain text.
  • a list of words: a list of words or tokens.
  • or a dictionary of strings, where the texts will
    be unified to form the reference corpus.
  • or a special data holder which contains labeled text scraped from
    Wikipedia articles, where all the texts from the collection will be combined to be used as the reference background corpus.
export (bool, optional): Whether the resulting tables should be
exported (default False).
basefname (str, optional): The output table name/prefix. It is effective
only when export is requested (default 'omterms.csv').
outputdir (str, optional): The file path, that is, the folder to export
the tabulated outputs in .csv files (default './output/').
notallowed_symbols (list of str, optional): The list of symbols
that would flag the removal of a term if needed (default omterms.tokenizer.CHARACTERS_TO_SPLIT).
nonremovable_terms (str, optional): File path to the list
of exceptions.
file_standard_stopwords (str, optional): The file path to the
standard stopword list, if desired and it exists.

Note: The removal does not take place if the term is a specific
term or marked as an exception.
file_specific_stopwords (str, optional): The file path to a
specific stopword list, if desired and it exists.

Note: The removal does not take place if the term is a specific term, that is, if it is marked as an exception.

regression_threshold (float, optional): If scoring is requested
and the input text is not derived from the reference corpus, then this parameter is used when training the prediction model for the terms that do not occur in the reference corpus (default 1.0).
Returns:
(pandas.DataFrame, optional): The tabulated data.
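Example (a minimal usage sketch; text_a and text_b stand for hypothetical raw input strings):

>>> from omterms.interface import extract_terms
>>> df = extract_terms('Makers share open hardware designs and build communities.')
>>> df = extract_terms({'Sustainability': text_a, 'Openness': text_b},
...                    extra_process=['stem', 'compare'], export=True)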

omterms.measures module

OpenMaker term scoring tools.

Author: Bulent Ozel e-mail: bulent.ozel@gmail.com

class omterms.measures.Scoring(sCorpus, rCorpus, nsteps=3, mutate=False, model_threshold=1.0)[source]

Bases: object

Given the term frequency distributions of a foreground (specific) corpus and a background (reference) corpus, the object provides tools that help compute the specificity of each term in the foreground corpus.

This kind of scoring is mainly to be used for the cases where an input text around a specific theme or topic is given. The process expects a tokenized, cleaned text with term counts.

Note:
It consumes a Corpus object, uses its methods and attributes, and mutates it unless desired otherwise.
Attributes:
sCorpus (Corpus): A Corpus class instance of the specific corpus to be scored.
rCorpus (Corpus): A Corpus class instance of the reference corpus.
common (list of str): The common terms between the foreground and background corpora.
distinct (list of str): The terms observed in the foreground but not in the background corpus.
model: A prediction model created during the instantiation process using the data of the class instance.
For details see the form_prediction_model method description.
compute_commons()[source]

Computes the specificity score of the terms in the corpus.

Note:

It is a simple log-likelihood measure. It compares the frequency count of a term in the specific corpus versus its frequency count in the reference corpus. The assumption here is that the reference corpus is a large enough sample of the language for observing the occurrence of a term. A higher/lower observation frequency of a term in the specific corpus is then a proxy indicator for the term choice while debating the topic.

The likelihood ratio for a term, $P_t$, is calculated as:

$P_t = \log\left( \frac{n_{tS}/N_S}{n_{tR}/N_R} \right)$

where
  • $n_{tS}$ is the raw frequency count of the term in the entire specific corpus
  • $n_{tR}$ is the raw frequency count of the term in the reference corpus
  • $N_S$ is the total number of terms in the specific corpus
  • $N_R$ is the total number of terms in the reference corpus

It should be noted that the frequency counts are calculated after having applied the same tokenization and post-processing, such as excluding stopwords, punctuation, rare terms, etc., to both the reference corpus and the specific corpus.

Args:
None
Returns:
(bool): Notifying completion of scoring.
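Example (a worked computation of the measure itself in plain Python, not a call into the library; the counts are hypothetical):

>>> import math
>>> ntS, NS = 40, 10000    # term count and total terms, specific corpus
>>> ntR, NR = 5, 100000    # term count and total terms, reference corpus
>>> round(math.log((ntS / NS) / (ntR / NR)), 2)
4.38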
compute_distincts()[source]
Computes the specificity score of the terms in the corpus when neither the term nor its stem
is matched by the background corpus.
Note:

It uses a log-linear regression model to predict the likelihood of the distinct terms. The model is trained using the scores and frequencies within the matching set.

See form_prediction_model method description for details.

Args:
None
Returns:
(bool): Notifying completion of scoring.
compute_stem_commons()[source]
Computes the specificity score of the terms in the corpus when the term as-is is not
matched by a term in the reference corpus. It matches the stems instead. The log-likelihood ratio is applied over the mean frequency counts of the matching stems.

See compute_commons method description for details.

Args:
None
Returns:
(bool): Notifying completion of scoring.
form_prediction_model(threshold=1.0)[source]

The method creates the prediction model to be used for distinct terms.

Note:

It is based on a log-linear regression. The model is created using the observed scores and frequencies within the matching set. The model aims to fit a best line to the logarithm of the observed term frequencies vs the associated scores.

Considering the fact that frequent distinct terms are likely among the ones with a higher specificity, the terms with relatively high scores are used for the regression. The R-squared values of the regression tests have been used for validation of the approach. By the same reasoning, among all the distinct terms the ones with relatively higher frequencies are considered for scoring.

ToDo:

As a second approach, the model training is to be improved by considering terms with relatively high term frequencies and high specificity scores. Observe the scatter plots for the insight.

An alternative, a third approach, would be forming logarithmic bins on the frequencies and using the distributional characteristics of each bin for making predictions. For instance, by simply predicting the median value as the guess.

Args:
threshold (float, optional): The default value is derived from regression tests on
test cases (default 1.0).
Returns:
(bool): Notifying completion of scoring.
get_scores_by(stype='raw')[source]

The method returns computed/available scores by the label of the terms.

Note:

The labels in this implementation correspond to:
  • raw: the term as-is was identified in the background corpus, so
    a log-likelihood scoring was applied
  • stem: not the term as-is but its stem was identified, so the mean of the observed
    stem occurrences in the background was used as the reference
  • noref: neither the term nor its stem was identified, so the prediction model was used
    for the frequent ones.
Args:
stype (str, optional): The term scoring type (default ‘raw’).
Returns:
(dict): The term scores.
plot(threshold=1.0, islog=True)[source]

Scatter plot of frequency vs scores.

Args:
threshold (float, optional): The default value is derived from regression tests on
test cases (default 1.0).

islog (bool): Whether the natural log of the frequency counts is to be plotted (default True).

Returns:
(bool): True.
predict(w, count, minp=0.001, minf=3)[source]
The method assigns a predicted score to a given term with a frequency
over the designated threshold. An internally formed prediction model is used. The natural logarithm of the raw frequency counts is passed to the model. See the form_prediction_model method description for details.
Args:
w (str): The term to be scored.
count (int): The raw frequency count.
minp (float, optional): The relative frequency threshold (default 0.001).
minf (int, optional): The raw frequency threshold (default 3).
Returns:
(float): The predicted score.

omterms.stemmer module

OpenMaker text stemmers.

Author: Bulent Ozel e-mail: bulent.ozel@gmail.com

omterms.tokenizer module

OpenMaker text tokenizer.

Author: Bulent Ozel e-mail: bulent.ozel@gmail.com

The module contains a set of basic tools in order to tokenize a given input text.

Todo:
  • Nothing at the moment ;)
omterms.tokenizer.normalise(s)[source]

Basic string normalisation.

Args:
s (str): Input string to normalise.
Returns:
(str): Normalised string.
omterms.tokenizer.tokenize(raw)[source]
The function tokenizes a string by splitting it on spaces, line breaks, or characters
in CHARACTERS_TO_SPLIT.
Args:
raw (str): Input string to split.
Returns:
(list of str): The list of terms.
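Example (illustrative; only spaces and line breaks are exercised here, since the exact token set otherwise depends on CHARACTERS_TO_SPLIT):

>>> tokenize('makers share\ndesigns')
['makers', 'share', 'designs']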
omterms.tokenizer.tokenize_strip_non_words(raw)[source]

Same as tokenize, but also removes non-word characters.

Args:
raw (str): Input string to split.
Returns:
(list of str): list of terms
omterms.tokenizer.tokenized_pprint(tokens)[source]

A pretty print function for strings tokenized by tokenize.

Args:
tokens (list of str): The list of terms.
Returns:
(str): The joined terms.

omterms.utilities module

OpenMaker text utilities.

The module holds a set of utilities for data handling and io related tasks.

Author: Bulent Ozel e-mail: bulent.ozel@gmail.com

omterms.utilities.format_output_fname(current_theme)[source]

Formatting output file name.

omterms.utilities.load_from_file(fname)[source]

The function loads a set of terms from a file.

Args:
fname (str): A file path string.
Returns:
(set): The set of terms.
Raises:
FileNotFoundError: Raised if the given file is not accessible.
omterms.utilities.pandas_filter_rows(df, col='Score', min_t=None, max_t=None)[source]

The method extracts rows from a Pandas data frame for the given score range. The scores above the minimum and below the maximum are selected.

Note:
This function should be generalized so that it can work on any predicate function.
Args:
df (pandas.core.frame.DataFrame): A Pandas data frame.
col (str): The column the filtering operation is to be applied to (default 'Score').
min_t (float): The minimum score threshold to be included, when assigned (default None).
max_t (float): The maximum score threshold to be included, when assigned (default None).
Returns:
df (pandas.core.frame.DataFrame): A Pandas data frame.
Raises:
TypeError: Raised if the column data type is not a number.
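Example (a usage sketch; df stands for a hypothetical frame such as one produced by Corpus.to_pandas):

>>> highly_specific = pandas_filter_rows(df, col='Score', min_t=1.0)
>>> mid_range = pandas_filter_rows(df, col='Score', min_t=0.5, max_t=1.5)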
omterms.utilities.pandas_merge_dfs(dfs)[source]

The method merges the given pandas data frames.

Args:
dfs (list of pandas.core.frame.DataFrame): The Pandas data frames to be merged.
Returns:
df (pandas.core.frame.DataFrame): A Pandas data frame.
omterms.utilities.pandas_rename_cols(df, cols=['TF', 'wTF', 'Score'], prefix='u')[source]

The method renames the designated columns of the pandas data frame.

Args:
df (pandas.core.frame.DataFrame): A Pandas data frame.
cols (list of str): The list of columns to be renamed (default ['TF', 'wTF', 'Score']).
prefix (str): The prefix prepended to the renamed columns (default 'u').
Returns:
df (pandas.core.frame.DataFrame): A Pandas data frame.
omterms.utilities.run_cleaning_process(Cleaner, tokens, exceptions=[], minL=1, minF=4, notallowed=['*'], logging=True)[source]

The function cleans and counts the words in the list.

Args:

Cleaner (omterms.cleaner.TextCleaner): The text cleaning object.

tokens (list of str): The list of words.

minL (int): The minimum allowed term length (default 1).

minF (int): The minimum allowed frequency count for the tabulated outputs
(default 4).
notallowed (list of str, optional): The list of symbols that would flag
the removal of a term if needed (default ['*']).

exceptions (list of str, optional): The list of exceptions.

logging (bool): Logging (default True).

Returns:
(nltk.FreqDist): Returns the trimmed corpus as an NLTK object, which is essentially
a Python dictionary of the cleaned terms. The keys are terms and the values are the frequency counts.
omterms.utilities.run_stemming_process(theCorpus, stemf)[source]

The function computes the stems of the terms in the corpus.

Args:
theCorpus (omterms.datauis.Corpus): The text corpus.
stemf (x str -> y: str): A stemmer function.
Returns:
theCorpus (omterms.datauis.Corpus): The text corpus.
omterms.utilities.run_tokenizing_process(text, tokenizer)[source]

The function tokenizes the given input text.

Args:

text (str): The input text.

tokenizer (x str -> y: str): The tokenizer function.

Returns:
(list of str): The list of words as tokens.
omterms.utilities.summary_corpus(SC, top=20, bottom=20)[source]

The function displays a summary of the corpus.

Args:
SC (omterms.datauis.Corpus): The text corpus.
top (int): The top most common items to be displayed (default 20).
bottom (int): The least common items to be displayed (default 20).
Returns:
(pandas.DataFrame): The tabulated data

Module contents