Corpus¶
- class orangecontrib.text.corpus.Corpus(*args, **kwargs)[source]¶
Internal class for storing a corpus.
- __init__(domain=None, X=None, Y=None, metas=None, W=None, text_features=None, ids=None)[source]¶
- Parameters
domain (Orange.data.Domain) – the domain for this Corpus
X (numpy.ndarray) – attributes
Y (numpy.ndarray) – class variables
metas (numpy.ndarray) – meta attributes; e.g. text
W (numpy.ndarray) – instance weights
text_features (list) – meta attributes that are used for text mining. If None, they are inferred.
ids (numpy.ndarray) – Indices
- property dictionary¶
A token to id mapper.
- Type
corpora.Dictionary
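To illustrate what a token to id mapper provides, here is a minimal pure-Python sketch. It is not the library's implementation: the helper name `build_token_ids` is hypothetical, and assigning ids in order of first appearance is an assumption made for the example.

```python
from typing import Dict, Iterable, List


def build_token_ids(token_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Map every distinct token to an integer id (hypothetical helper)."""
    token2id: Dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            # Assumption: ids are handed out in order of first appearance.
            token2id.setdefault(token, len(token2id))
    return token2id


token_ids = build_token_ids([["human", "machine"], ["machine", "interface"]])
print(token_ids)
```

A mapping like this lets each document be stored as integer ids rather than repeated strings.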
- property documents¶
Returns a list of strings representing documents, created by joining the selected text features.
- documents_from_features(feats)[source]¶
- Parameters
feats (list) – A list of features to join.
Returns: a list of strings constructed by joining feats.
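The joining step can be sketched in plain Python as follows. This is an illustration only: the helper `join_features`, the dictionary-based rows, the single-space separator and the skipping of empty values are all assumptions, not behaviour confirmed by this reference.

```python
from typing import Dict, List, Sequence


def join_features(rows: Sequence[Dict[str, str]], feats: List[str],
                  sep: str = " ") -> List[str]:
    """Build one document string per row from the selected features."""
    documents = []
    for row in rows:
        # Assumption: empty or missing values are skipped so that they do
        # not leave stray separators in the joined document.
        parts = [str(row[f]) for f in feats if row.get(f)]
        documents.append(sep.join(parts))
    return documents


rows = [
    {"title": "Moby Dick", "text": "Call me Ishmael.", "author": "Melville"},
    {"title": "Emma", "text": "", "author": "Austen"},
]
docs = join_features(rows, ["title", "text"])
print(docs)
```

Only the features named in `feats` contribute to a document, so the `author` column is ignored here.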
- extend_attributes(X, feature_names, feature_values=None, compute_values=None, var_attrs=None, sparse=False, rename_existing=False)[source]¶
Append features to the corpus. If the feature_values argument is given, the new features are Discrete; otherwise they are Continuous.
- Parameters
X (numpy.ndarray or scipy.sparse.csr_matrix) – Feature values to append
feature_names (list) – List of strings containing feature names
feature_values (list) – A list of possible values for Discrete features.
compute_values (list) – Compute values for corresponding features.
var_attrs (dict) – Additional attributes appended to variable.attributes.
sparse (bool) – Whether the features should be marked as sparse.
rename_existing (bool) – When true and names are not unique, rename the existing features; when false, rename the new features.
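The effect of rename_existing on clashing names can be sketched as below. The helper `resolve_names` is hypothetical, and the `name (1)` suffix format is an assumption chosen for the illustration; the library may produce different unique names.

```python
from typing import List, Set, Tuple


def _suffix(name: str, taken: Set[str]) -> str:
    """Return the first 'name (i)' variant that is not already taken."""
    i = 1
    while f"{name} ({i})" in taken:
        i += 1
    return f"{name} ({i})"


def resolve_names(existing: List[str], new: List[str],
                  rename_existing: bool = False) -> Tuple[List[str], List[str]]:
    """Return (existing_names, new_names) with clashes resolved."""
    # The side that keeps its names is `keep`; the other side is renamed.
    keep, change = (new, existing) if rename_existing else (existing, new)
    taken = set(keep) | set(change)
    fixed = []
    for name in change:
        if name in keep:
            # Assumption: a clash is resolved by appending ' (i)'.
            name = _suffix(name, taken)
            taken.add(name)
        fixed.append(name)
    return (fixed, list(keep)) if rename_existing else (list(keep), fixed)


print(resolve_names(["len", "count"], ["count", "ratio"]))
print(resolve_names(["len", "count"], ["count", "ratio"], rename_existing=True))
```

With the default, the appended feature is renamed; with `rename_existing=True`, the feature already in the corpus is renamed instead.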
- extend_corpus(metadata, Y)[source]¶
Append documents to corpus.
- Parameters
metadata (numpy.ndarray) – Meta data
Y (numpy.ndarray) – Class variables
- static from_documents(documents, name, attributes=None, class_vars=None, metas=None, title_indices=None)[source]¶
Create corpus from documents.
- Parameters
documents (list) – List of documents.
name (str) – Name of the corpus
attributes (list) – List of tuples (Variable, getter) for attributes.
class_vars (list) – List of tuples (Variable, getter) for class vars.
metas (list) – List of tuples (Variable, getter) for metas.
title_indices (list) – List of indices into domain corresponding to features which will be used as titles.
- Returns
Corpus.
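The (Variable, getter) pairs can be read as "column description plus a function that extracts the column value from one document". The sketch below shows that pattern in plain Python, with strings standing in for Orange variables; `build_columns` and the sample documents are hypothetical and are not part of the library.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Stand-in for (Variable, getter): a column name plus an extractor function.
Column = Tuple[str, Callable[[Any], Any]]


def build_columns(documents: Sequence[Any],
                  columns: List[Column]) -> Tuple[List[str], List[List[Any]]]:
    """Apply every getter to every document, giving one row per document."""
    names = [name for name, _ in columns]
    table = [[getter(doc) for _, getter in columns] for doc in documents]
    return names, table


documents = [
    {"headline": "Rain expected", "body": "Showers all week.", "section": "weather"},
    {"headline": "Cup final", "body": "Home side wins.", "section": "sport"},
]
metas = [("title", lambda d: d["headline"]), ("text", lambda d: d["body"])]
class_vars = [("category", lambda d: d["section"])]

meta_names, meta_rows = build_columns(documents, metas)
class_names, class_rows = build_columns(documents, class_vars)
print(meta_names, meta_rows)
print(class_names, class_rows)
```

Because each getter receives the whole document, the source documents can be dictionaries, objects or any other structure the getters know how to read.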
- classmethod from_file(filename)[source]¶
Read a data table from a file. The path can be absolute or relative.
- Parameters
filename (str) – File name
sheet (str) – Sheet in a file (optional)
- Returns
a new data table
- Return type
Orange.data.Table
- classmethod from_numpy(*args, **kwargs)[source]¶
Construct a table from numpy arrays with the given domain. The number of variables in the domain must match the number of columns in the corresponding arrays. All arrays must have the same number of rows. Arrays may be of different numpy types, and may be dense or sparse.
- Parameters
domain (Orange.data.Domain) – the domain for the new table
X (np.array) – array with attribute values
Y (np.array) – array with class values
metas (np.array) – array with meta attributes
W (np.array) – array with weights
- Returns
- classmethod from_table(domain, source, row_indices=Ellipsis)[source]¶
Create a new table from selected columns and/or rows of an existing one. The columns are chosen using a domain. The domain may also include variables that do not appear in the source table; they are computed from source variables if possible.
The resulting data may be a view or a copy of the existing data.
- Parameters
domain (Orange.data.Domain) – the domain for the new table
source (Orange.data.Table) – the source table
row_indices (a slice or a sequence) – indices of the rows to include
- Returns
a new table
- Return type
Orange.data.Table
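The three forms accepted by row_indices (the Ellipsis default, a slice, or a sequence of indices) can be illustrated on a plain list. `select_rows` is a hypothetical helper written for this sketch and it always copies, whereas the method above may return a view.

```python
from typing import Any, List, Sequence, Union


def select_rows(rows: Sequence[Any],
                row_indices: Union[type(Ellipsis), slice, Sequence[int]] = ...
                ) -> List[Any]:
    """Pick rows by Ellipsis (all rows), a slice, or a sequence of indices."""
    if row_indices is ...:
        return list(rows)
    if isinstance(row_indices, slice):
        return list(rows[row_indices])
    return [rows[i] for i in row_indices]


rows = ["doc0", "doc1", "doc2", "doc3"]
print(select_rows(rows))               # all rows
print(select_rows(rows, slice(1, 3)))  # a contiguous range
print(select_rows(rows, [3, 0]))       # explicit indices, in the given order
```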
- classmethod from_table_rows(source, row_indices)[source]¶
Construct a new table by selecting rows from the source table.
- Parameters
source (Orange.data.Table) – an existing table
row_indices (a slice or a sequence) – indices of the rows to include
- Returns
a new table
- Return type
Orange.data.Table
- property ngrams¶
Ngram representations of documents.
- Type
generator
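As an illustration of what an n-gram representation of tokenized documents looks like, here is a small generator-based sketch. The function names, the default range of 1 to 2, and the use of tuples for n-grams are assumptions made for the example, not the library's output format.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n_min: int = 1,
           n_max: int = 2) -> Iterator[Tuple[str, ...]]:
    """Yield all n-grams of a token list for n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


def corpus_ngrams(token_lists: Iterable[Sequence[str]], n_min: int = 1,
                  n_max: int = 2) -> Iterator[List[Tuple[str, ...]]]:
    """Lazily yield one list of n-grams per document."""
    for tokens in token_lists:
        yield list(ngrams(tokens, n_min, n_max))


grams = list(corpus_ngrams([["new", "york", "city"], ["hello"]]))
print(grams)
```

A generator keeps memory use low because the n-grams of each document are only built when that document is reached.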
- property pos_tags¶
A list of lists containing POS tags. If there are no POS tags available, return None.
- Type
np.ndarray
- property pp_documents¶
Preprocessed documents (transformed).
- static retain_preprocessing(orig, new, key=Ellipsis)[source]¶
Set preprocessing of ‘new’ object to match the ‘orig’ object.
- set_text_features(feats: Optional[List[Orange.data.variable.Variable]]) → None [source]¶
Select which meta-attributes to include when mining text.
- Parameters
feats – List of text features to include. If None, infer them.
- set_title_variable(title_variable: Optional[Union[Orange.data.variable.StringVariable, str]]) → None [source]¶
Set the title attribute. Only one column can be a title attribute.
- Parameters
title_variable – Variable that needs to be set as the title variable. If it is None, do not set a variable.
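The "only one title column" rule can be sketched with a plain dictionary of per-column attributes. Everything here is an assumption for illustration: the helper `set_title`, the `title` flag name, and the choice to clear any previous flag even when None is passed.

```python
from typing import Dict, Optional


def set_title(columns: Dict[str, Dict[str, bool]],
              title_name: Optional[str]) -> None:
    """Mark at most one column as the title column (hypothetical helper)."""
    # Assumption: any previous title flag is cleared first, so two columns
    # can never be marked at the same time.
    for attrs in columns.values():
        attrs.pop("title", None)
    if title_name is not None:
        columns[title_name]["title"] = True


columns = {"headline": {"title": True}, "body": {}, "author": {}}
set_title(columns, "author")
print(columns)
```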
- store_tokens(tokens, dictionary=None)[source]¶
- Parameters
tokens (list) – List of lists containing tokens.
- property titles¶
Returns a list of titles.
- property tokens¶
A list of lists containing tokens. If tokens are not yet present, run the default preprocessor and return the tokens.
- Type
np.ndarray
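The "compute on first access" behaviour described for tokens can be sketched with a cached property. The class `LazyTokens` is hypothetical, and its fallback tokenizer (lowercasing and splitting on word characters) is a stand-in assumption, not the library's default preprocessor.

```python
import re
from typing import List, Optional


class LazyTokens:
    """Hypothetical stand-in showing lazily computed tokens."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents
        self._tokens: Optional[List[List[str]]] = None

    def store_tokens(self, tokens: List[List[str]]) -> None:
        # Explicitly stored tokens take precedence over the fallback.
        self._tokens = tokens

    @property
    def tokens(self) -> List[List[str]]:
        if self._tokens is None:
            # Assumption: a trivial tokenizer plays the role of the
            # default preprocessor in this sketch.
            self._tokens = [re.findall(r"\w+", doc.lower())
                            for doc in self.documents]
        return self._tokens


lazy = LazyTokens(["Hello, world!", "Text mining"])
print(lazy.tokens)
```

Reading `tokens` before any preprocessing has run triggers the fallback once; later reads reuse the stored result.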