---
title: NLP
keywords: fastai
sidebar: home_sidebar
summary: "API details."
description: "API details."
---
{% raw %}
{% endraw %} {% raw %}
%load_ext autoreload
%autoreload 2
%matplotlib inline
{% endraw %} {% raw %}
{% endraw %} {% raw %}
# Only needed for testing.
import pandas as pd
from string import ascii_lowercase
{% endraw %} {% raw %}
# Nonsense sample text.
text = [
    f"Row {i}: I went, yesterday; she wasn't here after school? Today. --2"
    for i in range(25_000)
]
{% endraw %} {% raw %}
df = pd.DataFrame(text, columns=['a'])
df.tail()
|       | a                                                 |
|-------|---------------------------------------------------|
| 24995 | Row 24995: I went, yesterday; she wasn't here ... |
| 24996 | Row 24996: I went, yesterday; she wasn't here ... |
| 24997 | Row 24997: I went, yesterday; she wasn't here ... |
| 24998 | Row 24998: I went, yesterday; she wasn't here ... |
| 24999 | Row 24999: I went, yesterday; she wasn't here ... |
{% endraw %} {% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}

tokenize[source]

tokenize(text, nlp=<spacy.lang.en.English object at 0x10ed70190>)

Word tokenize a single string.

Parameters
----------
text: str
    A piece of text to tokenize.
nlp: spacy tokenizer, e.g. spacy.lang.en.English
    By default, a spacy tokenizer with a small English vocabulary
    is used. NER, parsing, and tagging are disabled. Any spacy
    tokenizer can be passed in, but keep in mind that other configurations
    may slow down this function dramatically.

Returns
-------
list[str]: List of word tokens from a single input string.
{% endraw %} {% raw %}
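A quick sketch of overriding the nlp argument. The blank pipeline below is an illustrative assumption, not the library's default:

import spacy

# Hypothetical override: a blank English pipeline (tokenizer only).
custom_nlp = spacy.blank('en')
tokenize("She wasn't here after school.", nlp=custom_nlp)
# Tokens include split contractions, e.g. 'was' and "n't".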
{% endraw %} {% raw %}

tokenize_many[source]

tokenize_many(rows, chunk=1000, nlp=<spacy.lang.en.English object at 0x10ed70190>)

Word tokenize a sequence of strings using multiprocessing. The maximum
number of available processes is used.

Parameters
----------
rows: Iterable[str]
    A sequence of strings to tokenize. This could be a list, a column of
    a DataFrame, etc.
chunk: int
    This determines how many items are sent to the worker processes at a time.
    The default of 1,000 is usually fine, but if you have extremely
    long pieces of text and memory is limited, you can always decrease it.
    Very small chunk sizes may increase processing time. Note that larger
    values will generally cause the progress bar to update more choppily.
nlp: spacy tokenizer, e.g. spacy.lang.en.English
    By default, a spacy tokenizer with a small English vocabulary
    is used. NER, parsing, and tagging are disabled. Any spacy
    tokenizer can be passed in, but keep in mind that other configurations
    may slow down this function dramatically.

Returns
-------
list[list[str]]: Each nested list of word tokens corresponds to one
of the input strings.
{% endraw %} {% raw %}
# ~5-6 seconds
x = df.a.apply(tokenize)
{% endraw %} {% raw %}
# ~1-2 seconds
x = tokenize_many(df.a)

{% endraw %} {% raw %}
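If memory is limited and individual rows are very long, the chunk size can be lowered (the value below is illustrative, not a recommendation):

# Send fewer rows to the worker processes at a time.
x = tokenize_many(df.a, chunk=100)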
{% endraw %} {% raw %}

class Vocabulary[source]

Vocabulary(w2idx, w2vec=None, idx_misc=None, corpus_counts=None, all_lower=True)

{% endraw %} {% raw %}
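A minimal construction sketch based on the signature above; the toy word-to-index mapping is hypothetical:

# Hypothetical word-to-index mapping; other arguments keep their defaults.
w2idx = {'i': 0, 'went': 1, 'yesterday': 2, 'school': 3}
vocab = Vocabulary(w2idx)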
{% endraw %} {% raw %}

class Embeddings[source]

Embeddings(mat, w2i, mat_2d=None)

Embeddings object. Lets us easily map word to index, index to
word, and word to vector. We can use this to find similar words,
build analogies, or get 2D representations for plotting.
{% endraw %} {% raw %}
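A minimal construction sketch; the random matrix and tiny vocabulary below are placeholders, not real pretrained vectors:

import numpy as np

# Placeholder 50-dimensional vectors for a three-word vocabulary.
w2i = {'dog': 0, 'cat': 1, 'fish': 2}
mat = np.random.randn(len(w2i), 50)
emb = Embeddings(mat, w2i)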
{% endraw %} {% raw %}

back_translate[source]

back_translate(text, to, from_lang='en')

Paraphrase a piece of text via back translation: translate it into
another language, then back into the source language.

Parameters
----------
text: str
    The piece of text to alter.
to: str
    The intermediate language code (e.g. 'es', 'fr') to translate the
    text into before translating it back.
from_lang: str
    The language code of the input text. Defaults to 'en' (English).

Returns
-------
str: Same language and basically the same content as the original text,
    but usually with slightly altered grammar, sentence structure, and/or
    vocabulary.
{% endraw %} {% raw %}
text = """
Visit ESPN to get up-to-the-minute sports news coverage, scores, highlights and commentary for NFL, MLB, NBA, College Football, NCAA Basketball and more.
"""
back_translate(text, 'es')
'Visit ESPN to get coverage of sports news, scores, highlights and comments from the NFL, MLB, NBA, college football, NCAA basketball and more.'
{% endraw %} {% raw %}
text = """
Visit ESPN to get up-to-the-minute sports news coverage, scores, highlights and commentary for NFL, MLB, NBA, College Football, NCAA Basketball and more.
"""
back_translate(text, 'fr')
'Visit ESPN for up-to-date sports information, scores, highlights and commentary for the NFL, MLB, NBA, college football, NCAA basketball and more.'
{% endraw %} {% raw %}
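The from_lang argument should let non-English text round-trip through another language in the same way (hypothetical usage; output not shown):

back_translate('Bonjour tout le monde.', 'en', from_lang='fr')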
{% endraw %} {% raw %}

postprocess_embeddings[source]

postprocess_embeddings(emb, d=None)

Implements the algorithm from the paper:

All-But-The-Top: Simple and Effective Post-Processing
for Word Representations (https://arxiv.org/pdf/1702.01417.pdf)

There are three steps:

1. Compute the mean embedding and subtract this from the
   original embedding matrix.
2. Perform PCA and extract the top d components.
3. Eliminate the principal components from the mean-adjusted
   embeddings.

Parameters
----------
emb: np.array
    Embedding matrix of size (vocab_size, embedding_length).
d: int
    Number of components to use in PCA. Defaults to
    embedding_length/100 as recommended by the paper.
{% endraw %} {% raw %}
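The three steps map to a few lines of numpy/scikit-learn. A minimal sketch of the algorithm, not the library's implementation:

import numpy as np
from sklearn.decomposition import PCA

def all_but_the_top(emb, d=None):
    # Default to embedding_length/100 components, per the paper.
    d = d or max(emb.shape[1] // 100, 1)
    # 1. Subtract the mean embedding.
    centered = emb - emb.mean(axis=0)
    # 2. PCA: extract the top d principal components.
    u = PCA(n_components=d).fit(centered).components_
    # 3. Remove the projections onto those components.
    return centered - centered @ u.T @ u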
{% endraw %} {% raw %}

compress_embeddings[source]

compress_embeddings(emb, new_dim, d=None)

Reduce embedding dimension as described in the paper:

Simple and Effective Dimensionality Reduction for Word Embeddings
(https://lld-workshop.github.io/2017/papers/LLD_2017_paper_34.pdf)

Parameters
----------
emb: np.array
    Embedding matrix of size (vocab_size, embedding_length).
new_dim: int
    The dimension of the compressed embeddings, i.e. the number of
    columns in the output matrix.
d: int
    Number of components to use in the post-processing
    method described here: https://arxiv.org/pdf/1702.01417.pdf.
    Defaults to embedding_length/100 as recommended by the paper.

Returns
-------
np.array: Compressed embedding matrix of shape (vocab_size, new_dim).
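
Hypothetical usage, with a random matrix standing in for real embeddings:

import numpy as np

emb = np.random.randn(10_000, 300)
small = compress_embeddings(emb, new_dim=100)
small.shape  # (10000, 100)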
{% endraw %}