omterms package
Submodules
omterms.cleaner module
OpenMaker text cleaner.
Author: Bulent Ozel, e-mail: bulent.ozel@gmail.com
The module contains a set of tools to clean textual data.
class omterms.cleaner.TextCleaner(stopwords=[], exceptions=[], fstemmer=<function TextCleaner.<lambda>>, stemcheck=False)
Bases: object

An object that contains a set of tools to clean and preprocess textual data.

Note:
    The object uses an nltk.FreqDist object. For stem checks during pruning it needs an external stemmer.

Attributes:
    stopwords (list of str): List of stopwords.
    exceptions (list of str): List of excepted terms.
    stemf: A stemmer function.
-
clean
(words, display_top=10, logging=True, exceptions=[])[source]¶ Removes panctuations and stopwords from a corpus.
- Args:
words (
list
ofstr
): The input corpus as list of words. display_top (int
, optional): Logging size (default 10). logging (bool
): Optional. When true stdout logging is done (default True). exceptions (list
ofstr
, optional): The list terms that will notbe pruned (default None).- Returns:
- (
nltk.FreqDist
): Returns the trimmed corpus as the NLTK obj.
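Example (a minimal sketch; the stopword list and sample sentence are made up, and logging is disabled for brevity):

>>> from omterms.cleaner import TextCleaner
>>> cleaner = TextCleaner(stopwords=['the', 'a', 'is'])
>>> trimmed = cleaner.clean('the maker movement is a community'.split(), logging=False)
>>> trimmed.most_common(3)  # (term, count) pairs of the surviving terms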
extend_stopwords(spointer)
The method extends the stopwords list with a new one.

Args:
    spointer (list of str, or str): Either a file path string or a list of stopwords.

Returns:
    bool: True if successful, False otherwise.

Raises:
    FileNotFoundError: Raised if a given file is not accessible.
-
static
freq_dist
(words)[source]¶ The static method computes frequency distribution of a word list.
- Args:
- words (
list
ofstr
,str
): list of words. - Returns:
- (
nltk.FreqDist
): Returns frequency dist.
isexception(term, exceptions=[], stemcheck=False)
Checks whether the given term is in the exception list.

Args:
    term (str): The term.
    exceptions (list, optional): The list of exception terms (default empty).
    stemcheck (bool, optional): Whether the list is to be extended via the stems (default False).

Returns:
    (bool): True if the term is an exception, False otherwise.
load_stopwords(spointer)
The method reloads the stopwords list.

Note:
    The internal stopword list is overwritten.

Args:
    spointer (list of str or str): Either a file path string or a list of stopwords.

Returns:
    bool: True if successful, False otherwise.

Raises:
    FileNotFoundError: Raised if a given file is not accessible.
static make_exceptions(exceptions, stemf=<function TextCleaner.<lambda>>, stemcheck=False)
The static method makes the exception list and returns it.

Args:
    exceptions (list): The list of exception terms.
    stemf (x: str -> y: str, optional): A stemmer function (default f(x) = x).
    stemcheck (bool, optional): Whether the list is to be extended via the stems (default False).

Returns:
    (set): The exception set. If stemcheck is opted, both the terms and their stems are represented in the set.
remove_contains(freq_dist, literals="., ():;!?\n`'=+/\\[]}{|><@#$%^&*_‘’“”.", exceptions=[])
Removes the terms that contain the specified literals.

Args:
    freq_dist (nltk.FreqDist): The term frequency distribution.
    literals (list of str): List of literals.
    exceptions (list of str, optional): The exception list.

Returns:
    (nltk.FreqDist): Returns the frequency distribution.
remove_numerals(freq_dist, remove_any=False, exceptions=[])
The method removes terms with numeral literals.

Note:
    When remove_any is selected, mixed terms such as 3D would vanish as well.

Args:
    freq_dist (nltk.FreqDist): The term frequency distribution.
    remove_any (bool, optional): If True, mixed numeral-and-literal terms are removed.
    exceptions (list of str, optional): The exception list.

Returns:
    (nltk.FreqDist): Returns the frequency distribution.
remove_panctuation(freq_dist, exceptions=[])
The method removes punctuation-only terms.

Args:
    freq_dist (nltk.FreqDist): The term frequency distribution.
    exceptions (list of str, optional): The exception list.

Returns:
    (nltk.FreqDist): Returns the frequency distribution.
remove_rare_terms(freq_dist, below=3, exceptions=[])
The method removes terms that occur rarely.

Note:
    Such removal may help reduce erroneous and random terms.

Args:
    freq_dist (nltk.FreqDist): The term frequency distribution.
    below (int, optional): The minimum allowed frequency count (default 3).
    exceptions (list of str, optional): The exception list.

Returns:
    (nltk.FreqDist): Returns the frequency distribution.
remove_short_terms(freq_dist, threshold=1, exceptions=[])
The method removes terms that are below a certain length.

Args:
    freq_dist (nltk.FreqDist): The term frequency distribution.
    threshold (int, optional): The character length of a term (default 1).
    exceptions (list of str, optional): The exception list.

Returns:
    (nltk.FreqDist): Returns the frequency distribution.
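Since each remove_* method takes and returns an nltk.FreqDist, the cleaning steps chain naturally. A sketch of a typical pipeline (the token list and stopwords are hypothetical):

>>> from omterms.cleaner import TextCleaner
>>> cleaner = TextCleaner(stopwords=['the', 'and'])
>>> fd = cleaner.clean(tokens, logging=False)  # tokens: a list of words
>>> fd = cleaner.remove_numerals(fd)
>>> fd = cleaner.remove_short_terms(fd, threshold=2)
>>> fd = cleaner.remove_rare_terms(fd, below=3)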
omterms.datauis module
OpenMaker textual data holders and interfaces.
Author: Bulent Ozel, e-mail: bulent.ozel@gmail.com
The module contains a set of tools used to hold and process text corpora and collections.
class omterms.datauis.Corpus(tf_dist, stemmer=None)
Bases: object

A generic class to be used for foreground or background corpora.

Attributes:
    tf_dist (nltk.FreqDist): An NLTK container for the tokenized and cleaned terms in the corpus.
    stemmer (x: str -> y: str): A stemmer function.
    stems (dict): A dictionary of terms and their stems.
    labels (dict): Term-level labels.
    scores (dict): A dictionary of terms and their corpus-specificity scores with respect to a reference corpus.
    ref (dict): A dictionary of terms that holds the normalized occurrence frequency of each term in a reference corpus.
    specifics (set): A set of terms appointed/associated with the corpus externally.

To be implemented:
    texts_raw (json): A JSON collection of raw texts of the corpus.
    texts_clean (json): A JSON collection of cleaned/processed texts of the corpus.
    tf_idf (json): Tf-idf analyses of the corpus.
compute_stems()
The method computes a dictionary of terms and their corresponding stems using the appointed stemmer.

Returns:
    (bool): True.
difference(other, as_corpus=False, stats=False)
The method identifies and returns the difference of this corpus from the other.

Note:
    Implementation needs style and refactoring.

Args:
    other (Corpus): An instance of this Corpus class.
    as_corpus (bool): When True, returns a new Corpus (default False).
    stats (bool): When True and as_corpus is False, returns the frequency counts of the difference set.

Returns:
    (dict): A dictionary of terms and their frequency counts.
get_count_uniques()
The method returns the number of unique terms in the corpus.

Returns:
    (int): Returns an integer.
get_least_frequents(bottom=42)
The method identifies and returns the least frequent terms.

Args:
    bottom (int): Size of the short list (default 42).

Returns:
    (list of tuple of str and int): Returns the frequency distribution of the least frequent terms as a list of (term, frequency) tuples.
get_size()
Returns the size of the corpus in terms of the number of terms it has.

Returns:
    (int): Returns an integer. It is the summation of the raw frequency counts.
get_stems()
The function returns a dictionary of terms and their corresponding stems.

Returns:
    (dict): A dictionary of {term: stem of the term}.
get_top_frequents(top=42)
The method identifies and returns the top frequent terms.

Args:
    top (int): Size of the short list (default 42).

Returns:
    (list of tuple of str and int): Returns the frequency distribution of the top terms as a list of (term, frequency) tuples.
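An illustrative call, in the toy-corpus style of the union example further below (output omitted; the return format is as described above):

>>> from nltk import FreqDist
>>> from omterms.datauis import Corpus
>>> Corpus(FreqDist('abbbc')).get_top_frequents(top=2)  # most frequent terms first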
intersection(other, as_corpus=False, stats=False)
The method identifies and returns the intersection of two corpora.

Args:
    other (Corpus): An instance of this class.
    as_corpus (bool): When True, returns a new Corpus (default False).
    stats (bool): When True and as_corpus is False, returns the frequency counts of the intersection.

Returns:
    (list of str): If as_corpus is False and stats is False, returns the list of joint terms.
    (list of tuple of str and int): If as_corpus is False and stats is True, returns the frequency distribution of the joint terms.
    (Corpus): In all other cases returns a new Corpus instance for the intersection. Frequencies are the minimum of the two occurrences.
label(marker, labels=None)
The function labels not-yet-labeled terms according to the user-defined scheme.

Args:
    marker (x: str -> y: str): A marker function.

Returns:
    (dict): A dictionary of {term: label of the term}.
list_terms()
It returns the list of terms in the corpus.

Note:
    Implementation needs refactoring.

Returns:
    (list): An alphabetically sorted list.
set_specific_set(terms)
The function sets the set of corpus-specific terms.

Args:
    terms (set): The set of corpus-specific terms.

Returns:
    (bool): True.
set_stemmer(stemmer)
Appoints a new stemmer function to the corpus.

Args:
    stemmer (x: str -> y: str): A stemmer function (default None).

Returns:
    (bool): True.
tabulate(top)
Tabulates the corpus.

Note:
    Works better when used to view a few top terms.

Returns:
    (bool): True.
to_pandas()
The function exports its data into a pandas dataframe.

Note:
    ToDo: The function needs parameterization, generalization and error checks.

Returns:
    (pandas.DataFrame): The tabulated data.
union(other, as_corpus=False, stats=False)
The method identifies and returns the union of two corpora.

Args:
    other (Corpus): An instance of this class.
    as_corpus (bool): When True, returns a new Corpus (default False).
    stats (bool): When True and as_corpus is False, returns the frequency counts of the union.

Returns:
    (list of str): If as_corpus is False and stats is False, returns the list of terms in the union of both corpora.
    (list of tuple of str and int): If as_corpus is False and stats is True, returns the frequency distribution of the union terms.
    (Corpus): In all other cases returns a new Corpus instance for the union. Frequencies are the maximum of the two occurrences (see the example).

Examples:
    >>> Corpus(FreqDist('abbbc')).union(Corpus(FreqDist('bccd')), stats=True)
    [('a', 1), ('b', 3), ('c', 2), ('d', 1)]
-
class
omterms.datauis.
WikiArticles
(collection)[source]¶ Bases:
object
- The object contains a set of tools to process the set of
- documents collected and cleaned by the wiki crawler.
- Attributes:
- collection_json (
str
): This is a filename to the scraped data. Each JSON document is expected to have following fields: - theme: Topic identifier, ex: Sustainability - theme.id: A unique category identifier - document.id: A unique document id - title: Title of the document - url: Full URL of the document - depth: The link distance from the seed docuement. The seed documents depth is 0. - text: The string data scraped from the page without tags. Pancuations are not
required but terms are expected to be delineated by white space.
collection (
list
ofdict
): Loaded json file into native list of dictionaries.- collection_json (
collate(by_theme_id=None, by_doc_ids=[], marker='\n')
The method collects the desired set of documents and concatenates them, creating a unified document.

The order of the merge is as follows:
    - When neither a theme id nor doc ids are provided, it collates the entire text.
    - If a theme is given, all the documents under that theme are joined first.
    - When a list of docs is given, only those in the list are kept. Note that if both a theme id and doc ids are provided, precedence is on themes.

Args:
    by_theme_id (int, optional): The theme id of the docs to be collated.
    by_doc_ids (list of int, optional): The list of doc ids to be collated (default empty).
    marker (str, optional): A delimiter (default newline).

Returns:
    (str): The collated text.
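A usage sketch (the JSON file name and theme id are hypothetical):

>>> from omterms.datauis import WikiArticles
>>> articles = WikiArticles('scraped_wiki.json')
>>> text = articles.collate(by_theme_id=3)  # one string joining all docs of theme 3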
display_documents_list(tid=None, stdout=True)
Lists the articles' metadata and crawling information.

Args:
    tid (int, optional): Used if document info under a specific theme is desired; otherwise a summary of the whole set is returned or displayed (default None).
    stdout (bool, optional): Whether the info is to be displayed/printed to standard output (default True).

Returns:
    (list): A summary list of documents in the collection.
get_document_fields()
The method lists the fields of each JSON document in its collection.

Args:
    None

Returns:
    (dict_keys): List of keys.
    None: When the collection is empty.
get_theme_id(theme_name)
The method returns the topic id of the first theme matching the given name.

Args:
    theme_name (str): The theme or topic name.

Returns:
    (int): A unique theme identifier.
get_theme_title(theme_id)
The method returns the topic name.

Args:
    theme_id (int): The unique theme id.

Returns:
    (str): The topic name.
list_themes()
The method lists a summary of the themes/topics in the collection.

Args:
    None

Returns:
    (list of dict): A list of dictionaries whose keys are:
        - name: the theme's textual descriptor
        - id: the theme's unique id
        - count: the number of articles under the theme.
load_corpus(collection=None)
The method loads/imports a JSON file into a native collection.

Args:
    collection (str): A filename of previously scraped data.

Returns:
    (bool): True.
prune(themes_to_keep=[], docs_to_drop=[], istodrop=<function WikiArticles.<lambda>>)
The method is used to filter out documents from the set.

The order of pruning is as follows:
    - When a non-empty themes_to_keep list is provided, all the documents not belonging to those themes are pruned entirely. When the list is empty, it has no effect.
    - Of the remaining documents, those that appear in docs_to_drop are pruned.
    - Of the remaining docs, those for which the predicate function returns True are dropped.

The function can be called repeatedly until a desired level of pruning is achieved; see the sketch below.

Args:
    themes_to_keep (list of int, optional): The list of theme ids to be kept (default empty).
    docs_to_drop (list, optional): The list of doc ids to be dropped (default empty).
    istodrop (x: dict_item -> bool, optional): A predicate function (default lambda x: False).

Returns:
    (bool): True.
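A pruning sketch (the theme id, doc ids, and the depth-based predicate are hypothetical; depth is one of the documented document fields):

>>> articles.prune(themes_to_keep=[3])                 # keep only theme 3
>>> articles.prune(docs_to_drop=[17, 42])              # then drop two docs
>>> articles.prune(istodrop=lambda d: d['depth'] > 2)  # then drop deep crawls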
omterms.interface module
OpenMaker term extraction application interface.
Author: Bulent Ozel, e-mail: bulent.ozel@gmail.com

The application interface provides encapsulation and standardization for preparing a set of texts for further analyses. The standardized term extraction process covers tokenization, counting (both raw and normalized), cleaning, stemming (optional), and scoring (optional). The tabulated output can be exported in .csv file format and/or as a Pandas dataframe.

The input text(s) can be provided in any of the following formats:
    - raw text
    - tokenized text
    - tokenized and counted

In the same manner, if background-corpus-based scoring is desired, the reference corpus can be provided in any of the above formats. Depending on the desired actions on the input text, the output may contain not only the raw and normalized term frequency counts, but also the stems of the terms, their frequency counts in the background corpus (when provided), and each term's log-likelihood weight with respect to its prevalence in the reference corpus.

For details please see the README and the tutorials that come along with the installation of this package.

Attributes:
    OUTPUT_FOLDER (file path): The folder to export the tabulated outputs in .csv files.
    OUTPUT_FNAME (file name): The default output file name where the results for multiple texts are merged and presented.
    STOPWORDS_STANDARD (file path): The location of the standard stopword list, if desired and it exists.
    STOPWORDS_SPECIFIC (file path): The location of the topic-specific stopword list, if desired and it exists.
    NOTALLOWED (list of str): The list of symbols that would flag the removal of a term, if needed. Note: the removal does not take place if the term is a specific term or marked as an exception.
    TERMS_SPECIFIC (file path): The location of the exception list of terms, which is shielded from the cleaning process, if needed.
    TOKENIZER_FUNC (x: str -> y: str): A tokenizer function.
    STEMMER_FUNC (x: str -> y: str): A stemmer function, if needed.
    MIN_LENGTH (int): The minimum allowed term length.
    MIN_FREQ (float): The minimum allowed frequency count for the tabulated outputs.
    MODEL_THRESHOLD (int): If scoring is requested and the input text is not derived from the reference corpus, this parameter is used when training the prediction model for the terms that do not occur in the reference corpus.

Todo:
    - Re-implement the module as a memoized object, either via a class or via a wrapper function.
    - Add functionality so that the configuration parameters can be loaded from a JSON file.
    - Make tokenization an optional process, for the cases where the input is already provided as tokens.
omterms.interface.extract_terms(texts, tokenizer=<function tokenize_strip_non_words>, merge=False, min_termlength=1, min_tf=1, topics=[], extra_process=[], stemmer=<bound method PorterStemmer.stem of <PorterStemmer>>, refcorpus=None, export=False, basefname='omterms.csv', outputdir='./output/', notallowed_symbols="., ():;!?\n`'=+/\\[]}{|><@#$%^&*_‘’“”.", nonremovable_terms='./data/specifics_openmaker.txt', file_standard_stopwords='./data/stopwords_standard.txt', file_specific_sopwords='./data/stopwords_openmaker.txt', regression_threshold=1.0)
The term extraction module's main driver function.

Args:
    texts (str or dict of str or omterms.WikiArticles): The input text can be any of the following:
        - a string,
        - a dictionary of strings where the key denotes the topic or any desired label/annotation regarding the text,
        - or a special data holder which contains labeled text scraped from Wikipedia articles.
    tokenizer (x: str -> y: str): The tokenizer (default omterms.tokenizer.tokenize_strip_non_words).
    merge (bool): When a collection of texts is provided via a dict or WikiArticles, this parameter determines whether they should be concatenated for the term extraction (default False).
    min_termlength (int): The minimum allowed term length (default 1).
    min_tf (int): The minimum allowed frequency count for the tabulated outputs (default 1).
    topics (list of str, optional): The list of topics from the input texts to be considered (default empty). If a topic list is not provided and merge is not requested, but the input text is given either via a dict or via the WikiArticles data holder, then the topic list will be derived from the input collection automatically.
    extra_process (list of str, optional): Whether stemming and/or scoring is requested (default empty).
        - 'stem' is used to flag stemming.
        - 'compare' is used to flag scoring the texts against the designated reference corpus.
    stemmer (x: str -> y: str, optional): A stemmer function, if needed (default omterms.stemmer.porter).
    refcorpus (str or list of str or dict of str or omterms.WikiArticles, optional): The reference corpus (default None). It can be any of the following:
        - None: if it is None and a scoring process is requested, then NLTK's Brown corpus is loaded.
        - a string: a plain text.
        - a list of words: a list of words or tokens.
        - a dictionary of strings, where the texts will be unified into the reference corpus.
        - or a special data holder which contains labeled text scraped from Wikipedia articles, where all the texts from the collection will be combined to be used as the reference background corpus.
    export (bool, optional): Whether the resulting tables should be exported (default False).
    basefname (str, optional): The output table name/prefix. Effective only when export is requested (default 'omterms.csv').
    outputdir (str, optional): The file path, that is, the folder to export the tabulated outputs in .csv files (default './output/').
    notallowed_symbols (list of str, optional): The list of symbols that would flag the removal of a term, if needed (default omterms.tokenizer.CHARACTERS_TO_SPLIT).
    nonremovable_terms (str, optional): File path to the list of exceptions.
    file_standard_stopwords (str, optional): The file path to the standard stopword list, if desired and it exists. Note: the removal does not take place if the term is a specific term or marked as an exception.
    file_specific_sopwords (str, optional): The file path to a specific stopword list, if desired and it exists. Note: the removal does not take place if the term is a specific term, that is, if it is marked as an exception.
    regression_threshold (float, optional): If scoring is requested and the input text is not derived from the reference corpus, this parameter is used when training the prediction model for the terms that do not occur in the reference corpus (default 1.0).

Returns:
    (pandas.DataFrame, optional): The tabulated data.
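A minimal end-to-end sketch (the texts and topic labels are hypothetical; 'stem' and 'compare' are the documented extra_process flags):

>>> from omterms.interface import extract_terms
>>> docs = {'Sustainability': 'makers reuse open hardware designs ...',
...         'Innovation': 'distributed communities prototype tools ...'}
>>> df = extract_terms(docs, extra_process=['stem', 'compare'])
>>> df.head()  # terms with raw/normalized counts, stems and scores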
omterms.measures module
OpenMaker term scoring tools.
Author: Bulent Ozel, e-mail: bulent.ozel@gmail.com

class omterms.measures.Scoring(sCorpus, rCorpus, nsteps=3, mutate=False, model_threshold=1.0)
Bases: object

Given the term frequency distributions of a foreground (specific) corpus and a background (reference) corpus, the object provides tools that help compute the specificity of each term in the foreground corpus.

This kind of scoring is mainly to be used for the cases where an input text around a specific theme or topic is given. The process expects a tokenized, cleaned text with term counts.

Note:
    It consumes a Corpus object, uses its methods and attributes, and mutates it unless desired otherwise.

Attributes:
    sCorpus (Corpus): A Corpus instance of the specific corpus to be scored.
    rCorpus (Corpus): A Corpus instance of the reference corpus.
    common (list of str): The common terms between the foreground and background corpora.
    distinct (list of str): The terms observed in the foreground but not in the background corpus.
    model: A prediction model created during the instantiation process using the data of the class instance. For details see the form_prediction_model method description.
compute_commons()
Computes the specificity score of the terms in the corpus.

Note:
    It is a simple log-likelihood measure. It compares the frequency count of a term in the specific corpus versus its frequency count in the reference corpus. The assumption here is that the reference corpus is a large enough sample of the language for observing the occurrence of a term. A higher/lower observation frequency of a term in the specific corpus is then a proxy indicator of the term choice when debating the topic.

    The likelihood ratio for a term, P_t, is calculated as:

    .. math:: P_t = \log\left( \frac{n_{tS} / N_S}{n_{tR} / N_R} \right)

    where
        - n_{tS} is the raw frequency count of the term in the entire specific corpus,
        - n_{tR} is the raw frequency count of the term in the reference corpus,
        - N_S is the total number of terms in the specific corpus,
        - N_R is the total number of terms in the reference corpus.

    It should be noted that the frequency counts are calculated after applying the same tokenization and post-processing (such as excluding stopwords, punctuation, rare terms, etc.) to both the reference corpus and the specific corpus.

Args:
    None

Returns:
    (bool): Notifying completion of scoring.
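The score is easy to reproduce outside the package; a standalone sketch of the formula above (not the package's internal code):

import math

def specificity(n_tS, N_S, n_tR, N_R):
    """Log-likelihood ratio of a term's relative frequency in the
    specific corpus vs. the reference corpus."""
    return math.log((n_tS / N_S) / (n_tR / N_R))

# A term seen 30 times in a 10,000-term specific corpus but only
# 5 times in a 1,000,000-term reference corpus scores highly:
print(specificity(30, 10_000, 5, 1_000_000))  # log(600), about 6.4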
compute_distincts()
Computes the specificity score of the terms in the corpus when neither the term nor its stem is matched by the background corpus.

Note:
    It uses a log-linear regression model to predict the likelihood of the distinct terms. The model is trained using the scores and frequencies within the matching set. See the form_prediction_model method description for details.

Args:
    None

Returns:
    (bool): Notifying completion of scoring.
compute_stem_commons()
Computes the specificity score of the terms in the corpus when the term itself is not matched by a term in the reference corpus but its stem is. The log-likelihood ratio is applied over the mean frequency counts of the matching stems. See the compute_commons method description for details.

Args:
    None

Returns:
    (bool): Notifying completion of scoring.
form_prediction_model(threshold=1.0)
The method creates the prediction model to be used for distinct terms.

Note:
    It is based on a log-linear regression. The model is created using the observed scores and frequencies within the matching set. The model aims to fit a best line to the logarithm of the observed term frequencies vs. the associated scores.
    Considering the fact that frequent distinct terms are likely among the ones with a higher specificity, the terms with relatively high scores are used for the regression. The R-squared values of the regression tests have been used to validate the approach. By the same reasoning, among all the distinct terms, the ones with relatively higher frequencies are considered for scoring.

ToDo:
    As a second approach, the model training could be improved by considering terms with both relatively high term frequencies and high specificity scores. Observe the scatter plots for the insight.
    An alternative, third approach would be forming logarithmic bins on the frequencies and using the distributional characteristics of each bin when making predictions, for instance by simply predicting the median value as the guess.

Args:
    threshold (float, optional): The default value is derived from regression tests on test cases (default 1.0).

Returns:
    (bool): Notifying completion of scoring.
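A standalone sketch of such a log-linear fit (illustrative data; not the package's internal implementation):

import numpy as np

# Hypothetical (frequency, score) pairs from the matching set.
freqs  = np.array([120.0, 85.0, 60.0, 40.0, 22.0, 15.0, 9.0, 5.0])
scores = np.array([5.8, 5.1, 4.9, 4.2, 3.6, 3.1, 2.4, 1.8])

# Fit score ~ a * log(freq) + b, as described above.
a, b = np.polyfit(np.log(freqs), scores, deg=1)

def predict_score(freq):
    """Predicted specificity for a distinct term seen `freq` times."""
    return a * np.log(freq) + b

print(predict_score(50.0))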
get_scores_by(stype='raw')
The method returns the computed/available scores by the label of the terms.

Note:
    The labels in this implementation correspond to:
        - raw: the term as such was identified in the background corpus, so a log-likelihood scoring was applied.
        - stem: not the term itself but its stem was identified, so the mean of the observed stem occurrences in the background was used as the reference.
        - noref: neither the term nor its stem was identified, so the prediction model was used for the frequent ones.

Args:
    stype (str, optional): The term scoring type (default 'raw').

Returns:
    (dict): The term scores.
plot(threshold=1.0, islog=True)
Scatter plot of frequencies vs. scores.

Args:
    threshold (float, optional): The default value is derived from regression tests on test cases (default 1.0).
    islog (bool): Whether the natural log of the frequency counts is to be used (default True).

Returns:
    (bool): True.
predict(w, count, minp=0.001, minf=3)
The method assigns a predicted score to a given term with a frequency over the designated threshold. The internally formed prediction model is used; the natural logarithm of the raw frequency count is passed to the model. See the form_prediction_model method description for details.

Args:
    w (str): The term.
    count (int): The raw frequency count.
    minp (float, optional): The relative frequency threshold (default 0.001).
    minf (int, optional): The raw frequency threshold (default 3).

Returns:
    (float): The predicted score.
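A typical scoring workflow over two Corpus instances (the variable names are hypothetical; the methods are those documented above):

>>> from omterms.measures import Scoring
>>> scorer = Scoring(specific_corpus, reference_corpus)
>>> scorer.compute_commons()       # terms matched in the reference corpus
>>> scorer.compute_stem_commons()  # terms matched via their stems
>>> scorer.compute_distincts()     # unmatched terms, via the prediction model
>>> raw_scores = scorer.get_scores_by('raw')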
omterms.tokenizer module
OpenMaker text tokenizer.
Author: Bulent Ozel, e-mail: bulent.ozel@gmail.com
The module contains a set of basic tools to tokenize a given input text.

Todo:
    Nothing at the moment ;)
omterms.tokenizer.normalise(s)
Basic string normalisation.

Args:
    s (str): Input string to normalise.

Returns:
    (str): Normalised string.
omterms.tokenizer.tokenize(raw)
The function tokenizes a text by splitting it on spaces, line breaks, or the characters in CHARACTERS_TO_SPLIT.

Args:
    raw (str): Input string to split.

Returns:
    (list of str): List of terms.
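An illustrative call (the exact output depends on the contents of CHARACTERS_TO_SPLIT):

>>> from omterms.tokenizer import tokenize
>>> tokenize('Open-source hardware: makers, labs & tools!')  # splits on whitespace and the split characters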
omterms.utilities module
OpenMaker text utilities.
The module holds a set of utilities for data handling and IO-related tasks.
Author: Bulent Ozel, e-mail: bulent.ozel@gmail.com
omterms.utilities.load_from_file(fname)
The function loads a set of terms from a file.

Args:
    fname (str): A file path string.

Returns:
    (set): The set of terms.

Raises:
    FileNotFoundError: Raised if a given file is not accessible.
omterms.utilities.pandas_filter_rows(df, col='Score', min_t=None, max_t=None)
The method extracts rows from a Pandas data frame for the given score range. Scores above the minimum and below the maximum are selected.

Note:
    This function should be generalized so that it can work with any predicate function.

Args:
    df (pandas.core.frame.DataFrame): A Pandas data frame.
    col (str): The column the filtering operation is to be applied to (default 'Score').
    min_t (float): The minimum score threshold to be included, when assigned (default None).
    max_t (float): The maximum score threshold to be included, when assigned (default None).

Returns:
    df (pandas.core.frame.DataFrame): A Pandas data frame.

Raises:
    TypeError: Raised if the column data type is not a number.
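An illustrative call (df is a frame produced by, e.g., extract_terms; the threshold is arbitrary):

>>> from omterms.utilities import pandas_filter_rows
>>> high = pandas_filter_rows(df, col='Score', min_t=2.0)  # keep terms scoring above 2.0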
omterms.utilities.pandas_merge_dfs(dfs)
The method merges the given Pandas data frames into one.

Args:
    dfs (list of pandas.core.frame.DataFrame): The Pandas data frames to be merged.

Returns:
    df (pandas.core.frame.DataFrame): A Pandas data frame.
omterms.utilities.pandas_rename_cols(df, cols=['TF', 'wTF', 'Score'], prefix='u')
The method renames the designated columns of the Pandas data frame.

Args:
    df (pandas.core.frame.DataFrame): A Pandas data frame.
    cols (list of str): The list of columns to be renamed.
    prefix (str): The prefix used for renaming (default 'u').

Returns:
    df (pandas.core.frame.DataFrame): A Pandas data frame.
omterms.utilities.run_cleaning_process(Cleaner, tokens, exceptions=[], minL=1, minF=4, notallowed=['*'], logging=True)
The function cleans and counts the words in the list.

Args:
    Cleaner (omterms.cleaner.TextCleaner): The text cleaning object.
    tokens (list of str): The list of words.
    exceptions (list of str, optional): The list of exceptions.
    minL (int): The minimum allowed term length (default 1).
    minF (int): The minimum allowed frequency count for the tabulated outputs (default 4).
    notallowed (list of str, optional): The list of symbols that would flag the removal of a term if needed (default ['*']).
    logging (bool): Logging (default True).

Returns:
    (nltk.FreqDist): Returns the trimmed corpus as an NLTK object; essentially a Python dictionary of cleaned terms, where the keys are the terms and the values are the frequency counts.
omterms.utilities.run_stemming_process(theCorpus, stemf)
The function computes the stems of the terms in the corpus.

Args:
    theCorpus (omterms.datauis.Corpus): The text corpus.
    stemf (x: str -> y: str): The stemmer function.

Returns:
    theCorpus (omterms.datauis.Corpus): The text corpus.
omterms.utilities.run_tokenizing_process(text, tokenizer)
The function tokenizes the given input text.

Args:
    text (str): The input text.
    tokenizer (x: str -> y: str): The tokenizer function.

Returns:
    (list of str): The list of words as tokens.
omterms.utilities.summary_corpus(SC, top=20, bottom=20)
The function displays a summary of the corpus.

Args:
    SC (omterms.datauis.Corpus): The text corpus.
    top (int): The number of most common items to be displayed (default 20).
    bottom (int): The number of least common items to be displayed (default 20).

Returns:
    (pandas.DataFrame): The tabulated data.