selkie.dep
— Dependency conversion
Overview
The top-level function for conversions among tree types is convert(). It takes optional arguments giving the type of input and the type of output. By default, input is 'tree' and output is 'efstemma'.
>>> from selkie.tree import parse_tree
>>> t = parse_tree('(S (NP (Pron this))'
... ' (VP (VBZ is)'
... ' (NP (DT a) (NN test))))')
...
>>> from selkie.dep import convert
>>> print(convert(t))
0 *root* _ _ _ _
1 this Pron _ _ 2
2 is VBZ _ root 0
3 a DT _ _ 4
4 test NN _ _ 2
To see how heads are assigned, one can specify 'headed'
output:
>>> print(convert(t, output='headed'))
0 (S
1 (NP
2 (Pron:head this))
3 (VP:head
4 (VBZ:head is)
5 (NP
6 (DT a)
7 (NN:head test))))
Or if one prefers a dependency tree to a stemma:
>>> print(convert(t, output='dep'))
0 (VBZ:root
1 (Pron this)
is
2 (NN
3 (DT a)
test))
The legal input and output types are:
'tree' for an unheaded constituency tree,
'headed' for a headed constituency tree,
'dep' for a dependency tree,
'stemma' for a Sentence possibly containing empty words,
'efstemma' for an ε-free stemma.
These reflect the steps of the conversion: mark_heads() converts an unheaded tree to a headed tree, dependency_tree() converts a headed tree to a dependency tree, stemma() converts a dependency tree to a stemma, and eliminate_epsilons() eliminates empty words.
All steps except the first are non-destructive. If given an unheaded tree as input, convert() makes a copy before calling mark_heads(), unless the keyword argument destructive=True is provided.
The keyword arguments projections and reductions may optionally be provided; they are passed directly to dependency_tree().
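The relationship between the five types and the four conversion steps can be sketched as follows. This is only an illustration: the type and step names come from the text above, but steps_between() is a hypothetical helper, not part of selkie.dep, and the assumption that conversion only runs forward through the chain is mine.

```python
# Hypothetical sketch: steps_between() is an illustrative helper,
# not a function provided by selkie.dep.
TYPES = ['tree', 'headed', 'dep', 'stemma', 'efstemma']
STEPS = ['mark_heads', 'dependency_tree', 'stemma', 'eliminate_epsilons']

def steps_between(input='tree', output='efstemma'):
    """Return the names of the steps leading from one type to another."""
    i, j = TYPES.index(input), TYPES.index(output)
    if i > j:
        # Assumption: the chain of conversions only runs forward.
        raise ValueError('cannot convert %r to %r' % (input, output))
    # STEPS[k] converts TYPES[k] to TYPES[k + 1].
    return STEPS[i:j]

print(steps_between())                 # all four steps
print(steps_between('headed', 'dep'))  # ['dependency_tree']
```

With the default arguments all four steps are listed, which matches the default behavior of convert() described above.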
Usage
The central function provided by selkie.dep is dependency_tree(), which converts a headed phrase-structure tree to a dependency tree. (It signals an error if it encounters a headless node.)
>>> h = parse_tree('''
... (S (NP:subj (Det the) (N:head dog))
... (VP:head (V:head chased)
... (NP:obj (Det a) (N:head cat)))
... (Adv:mod quickly))
... ''')
>>> from selkie.dep import dependency_tree
>>> d = dependency_tree(h)
>>> print(d)
0 (V:root
1 (N:subj
2 (Det the)
dog)
chased
3 (N:obj
4 (Det a)
cat)
5 (Adv:mod quickly))
The function dependency_tree() takes two keyword arguments: projections and reductions. They are passed directly to the tree() method of Projection, which is discussed below.
Note that the dependency tree may contain empty nodes. The conversion treats all terminal nodes alike, whether they have a string or None as their value for .word.
Projections
The dependency_tree() function works by converting the tree first to its projections, where a projection is defined as a list of nodes, each being the head of the previous. There is one projection for each leaf node. For example, in the tree h above, “the” has projection (Det), “dog” has projection (NP, N), “chased” has projection (S, VP, V), “a” has projection (Det), “cat” has projection (NP, N), and “quickly” has projection (Adv).
The left dependents of a projection are defined to be the concatenation of the left dependents of the nodes it contains, from outermost to innermost. The right dependents are defined to be the concatenation of the right dependents of the nodes, from innermost to outermost. For example, the only left dependent of (S, VP, V) is the subject NP, and its right dependents are the object NP and the adverb.
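The definitions above can be sketched in a few lines of standalone Python. The Node class and the functions projection() and dependents() are hypothetical stand-ins written for this illustration; they are not the selkie tree or Projection classes, and they cover only the head-following and dependent-ordering rules.

```python
class Node:
    """Hypothetical minimal headed-tree node (not selkie's tree class)."""
    def __init__(self, cat, children=(), word=None, head=False):
        self.cat = cat
        self.children = list(children)
        self.word = word
        self.head = head  # True if this node is the head child of its parent

def projection(node):
    """Follow head children from node down to a leaf."""
    nodes = [node]
    while nodes[-1].children:
        heads = [c for c in nodes[-1].children if c.head]
        if len(heads) != 1:
            raise ValueError('%s needs exactly one head child' % nodes[-1].cat)
        nodes.append(heads[0])
    return nodes

def dependents(nodes):
    """Return (left, right) dependents of a projection, as nodes."""
    pairs = list(zip(nodes, nodes[1:]))  # (parent, head child) pairs
    left, right = [], []
    for parent, head in pairs:            # outermost to innermost
        left.extend(parent.children[:parent.children.index(head)])
    for parent, head in reversed(pairs):  # innermost to outermost
        right.extend(parent.children[parent.children.index(head) + 1:])
    return left, right

# The example tree h from the Usage section, rebuilt by hand.
h = Node('S', [
    Node('NP', [Node('Det', word='the'), Node('N', word='dog', head=True)]),
    Node('VP', [Node('V', word='chased', head=True),
                Node('NP', [Node('Det', word='a'),
                            Node('N', word='cat', head=True)])],
         head=True),
    Node('Adv', word='quickly')])

p = projection(h)
left, right = dependents(p)
print([n.cat for n in p])      # ['S', 'VP', 'V']
print([n.cat for n in left])   # ['NP']
print([n.cat for n in right])  # ['NP', 'Adv']
```

The right dependents come out as the object NP followed by the adverb because the innermost node (VP) contributes its right sisters before the outermost node (S) does.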
The class Projection
represents a projection. One creates a
projection from a headed tree:
>>> from selkie.dep import Projection
>>> p = Projection(h)
This actually creates projections recursively for the entire tree.
Nodes.
The value of attribute nodes
is the list of nodes that make up
the projection:
>>> p.nodes
[<Tree S ...>, <Tree VP ...>, <Tree V chased>]
Ldeps, rdeps.
The attributes ldeps
and rdeps
contain the left and right
dependents, converted to projections:
>>> p.ldeps
[<Projection NP N dog>]
>>> p.rdeps
[<Projection NP N cat>, <Projection Adv quickly>]
Lr, parent, headsib.
Each non-root projection has values for lr, parent, and headsib, representing the configuration in which its root node occurs in the original tree. This configuration is called the “reduction”: it represents the attachment of the root of the projection to its parent.
For example, the projection for the subject NP occurs as a left dependent in S, with head child VP. Accordingly:
>>> sp = p.ldeps[0]
>>> sp.lr
'L'
>>> sp.parent
<Tree S ...>
>>> sp.headsib
<Tree VP ...>
(For the root projection, all three attributes have the value None
.)
Tree.
The method tree() converts a projection into a dependency tree. By default, the category of a projection is taken to be the part of speech of the head node (that is, nodes[-1].cat), and the role is the role (if any) of the root node (that is, nodes[0].role).
There are two boolean keyword arguments that can be used to select alternative definitions of category and role. If projections is true, then the category is the concatenation of all categories in the projection. For example:
>>> print(p.tree(projections=True))
0 (S_VP_V:root
1 (NP_N:subj
2 (Det the)
dog)
chased
3 (NP_N:obj
4 (Det a)
cat)
5 (Adv:mod quickly))
If reductions is true, then the role is represented by a Reduction object, which prints out as the concatenation of lr, nodes[0].cat, parent.cat, and headsib.cat. For example:
>>> print(p.tree(reductions=True))
0 (V:root
1 (N:'L_NP:subj_S_VP'
2 (Det:L_Det_NP_N the)
dog)
chased
3 (N:'R_NP:obj_VP_V'
4 (Det:L_Det_NP_N a)
cat)
5 (Adv:'R_Adv:mod_S_VP' quickly))
One can specify both projections
and reductions
, if desired.
Reduction
The class Reduction represents the configuration, in the original headed phrase structure tree, in which a dependent occurs. It has four attributes:
lr may be “L” for a dependent that precedes its head sibling, “R” for one that follows, or “root” for the root node.
dep is the category of the dependent.
parent is the category of the parent node.
head is the category of the head sibling.
Stemmas and governor arrays
A dependency stemma is represented by a Sentence instance, which contains Word instances representing the individual words of the sentence. A Sentence may itself have an index(), which is intended to represent its position in a collection of sentences such as a treebank. Otherwise, a Sentence is simply a list of Word instances. The word at position 0 is a pseudo-word representing the root.
To create a sentence with a known number of words, use make_sentence():
>>> from selkie.dep import make_sentence
>>> s = make_sentence(4, index='test')
>>> s[1].form = 'This'
>>> s[2].form = 'is'
>>> s[3].form = 'a'
>>> s[4].form = 'test'
>>> print(s)
0 *root* _ _ _ _
1 This _ _ _ 0
2 is _ _ _ 0
3 a _ _ _ 0
4 test _ _ _ 0
Alternatively, one can create an empty sentence and add words one at a
time. (Note that an “empty” sentence does contain a *root*
pseudo-word.)
>>> from selkie.dep import Sentence, Word
>>> s = Sentence()
>>> s.append(Word(form='hi'))
>>> s.append(Word(form='there'))
>>> print(s)
0 *root* _ _ _ _
1 hi _ _ _ 0
2 there _ _ _ 0
One can copy an existing word by using the copy() method:
>>> s[1].copy()
<Word None hi govr=0>
The copy is identical to the original except that its sent and index are both None.
- class selkie.dep.Sentence
The methods of Sentence are as follows:
- index()
Returns the index of the sentence.
- providence()
Returns the index as a string, or
None
.
- __len__()
Includes the root pseudo-word.
- __iter__()
Iterates over all words, including the root pseudo-word.
- __getitem__(i)
Returns the i-th word; the root pseudo-word is at 0.
- words()
Returns a list of word forms (strings), excluding the root pseudo-word.
- nwords()
Excludes the root pseudo-word.
- cmp(s, other)
Sentences are compared by comparing words from left to right until a difference is found. The root pseudo-words are assumed identical, and are not included in the comparison.
- append(w)
Adds w (not a copy) to the list of words.
- form(i)
Returns the form of the i-th word.
- cat(i)
Returns the category of the i-th word.
- cpos(i)
Returns the coarse category of the i-th word. This signals an error if the sentence is not a CoNLL sentence.
- lemma(i)
Returns the lemma of the i-th word.
- morph(i)
Returns the morph of the i-th word.
- govr(i)
Returns the governor of the i-th word.
- role(i)
Returns the role of the i-th word.
- column(c)
Returns the column named c, which should be one of 'form', 'cat', 'lemma', 'morph', 'govr', or 'role'. The column is a list of values, one for each word. It includes the root pseudo-word.
- class selkie.dep.Word
The members of Word are as follows:
- index
The position of the word in the sentence; the root pseudo-word has index 0.
- form
The printed form of the word.
- cat
The part of speech. In sentences read from a CoNLL-format file, the cat is a pair (cpos, fpos).
- lemma
The lemma, i.e., the key to use for lexical access.
- morph
Morphological information.
- govr
The index of the governor.
- role
The role with respect to the governor.
The methods of Word are:
- __lt__(other)
Comparison is done by comparing attribute values in the order form, cat, lemma, morph, govr, role. The attribute index is intentionally omitted, with the consequence that words at different positions in the sentence may be equal. The attribute cpos is also omitted; it is assumed that cpos, if present, is uniquely determined by cat.
- tagged_string()
Returns “form.cat”.
Conversion to Sentence (stemma)
A stemma is a list of Word objects, one for each word in the sentence. The Word class represents a word as the dependent in a dependency link. The function stemma() converts a dependency tree into a stemma. For example:
>>> from selkie.dep import stemma
>>> s = stemma(d)
>>> print(s, end='')
0 *root* _ _ _ _
1 the Det _ _ 2
2 dog N _ subj 3
3 chased V _ root 0
4 a Det _ _ 5
5 cat N _ obj 3
6 quickly Adv _ mod 3
The columns are: index, word, part of speech, lemma, role, and governor. The value for governor is the index of the governor, not the governor itself.
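To make the conversion concrete, here is a minimal standalone sketch of flattening a dependency tree into rows. DepNode and to_rows() are hypothetical names introduced for this illustration and are not the selkie implementation; the sketch omits the lemma column and the root pseudo-word row, and assumes that word indices follow the left-to-right order of the tree.

```python
class DepNode:
    """Hypothetical dependency-tree node: a word with left and right dependents."""
    def __init__(self, cat, word, role=None, left=(), right=()):
        self.cat, self.word, self.role = cat, word, role
        self.left, self.right = list(left), list(right)

def to_rows(root):
    """Flatten a dependency tree into (index, form, cat, role, govr) rows."""
    order = []  # (node, governor node) pairs in sentence order

    def visit(node, governor):
        for d in node.left:
            visit(d, node)
        order.append((node, governor))
        for d in node.right:
            visit(d, node)

    visit(root, None)
    # Words are numbered from 1; 0 is reserved for the root pseudo-word.
    index = {id(node): i for i, (node, _) in enumerate(order, start=1)}
    return [(index[id(node)], node.word, node.cat, node.role,
             0 if governor is None else index[id(governor)])
            for node, governor in order]

# The dependency tree d from the Usage section, rebuilt by hand.
d = DepNode('V', 'chased', 'root',
            left=[DepNode('N', 'dog', 'subj', left=[DepNode('Det', 'the')])],
            right=[DepNode('N', 'cat', 'obj', left=[DepNode('Det', 'a')]),
                   DepNode('Adv', 'quickly', 'mod')])

for row in to_rows(d):
    print(row)
```

The governor values produced for this tree are 2, 3, 0, 5, 3, 3, which agree with the governor column printed above.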
One can access a stemma like a list:
>>> s[2]
<Word 2 dog/N:subj govr=3>
>>> s[2].role
'subj'
>>> s[2].govr
3
>>> s[3]
<Word 3 chased/V:root govr=0>
The length of the stemma is the number of words in the sentence plus one for the root:
>>> len(s)
7
The element at index 0 is a pseudo-word representing the root of the sentence.
>>> s[0]
<Word 0 *root*>
The method words()
returns a list of word forms (strings)
excluding the root pseudo-word.
>>> s.words()
['the', 'dog', 'chased', 'a', 'cat', 'quickly']
Governor array
A very compact representation of a dependency tree is the governor array. This is simply a list of numbers representing, for each word, the index of the governor of that word.
>>> from selkie.dep import governor_array
>>> governor_array(d)
[2, 3, 0, 5, 3, 3]
The argument to governor_array() may be either a stemma or something that can be converted to a stemma using the function stemma().
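As an illustration of how a governor array can be read, the following sketch follows governor links from a word up to the root. The function path_to_root() is a hypothetical helper, not part of selkie.dep. It assumes, as in the output shown above, that the array has no entry for the root pseudo-word, so that the governor of word i is stored at position i - 1.

```python
def path_to_root(govrs, i):
    """Return the word indices from word i up to the root (index 0).

    Assumes govrs[k] is the governor of word k + 1, and that 0 denotes
    the root pseudo-word.
    """
    path = [i]
    while path[-1] != 0:
        g = govrs[path[-1] - 1]
        if g in path:
            raise ValueError('cycle through word %d' % g)
        path.append(g)
    return path

govrs = [2, 3, 0, 5, 3, 3]     # the governor array shown above
print(path_to_root(govrs, 4))  # [4, 5, 3, 0]
print(path_to_root(govrs, 3))  # [3, 0]
```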
DepLists
A DepLists object behaves as a list of lists. It is indexed by word index i, and returns the list of indices of words dependent on i. For example, in our example Sentence s, word 3 (chased) has dependents 2 (dog), 5 (cat), and 6 (quickly).
>>> from selkie.dep import DepLists
>>> deps = DepLists(s)
>>> deps[3]
[2, 5, 6]
>>> len(deps)
7
The DepLists
object prints out readably:
>>> print(deps)
[0] *root*
root: [3] chased
[1] the
[2] dog
None: [1] the
[3] chased
subj: [2] dog
obj: [5] cat
mod: [6] quickly
[4] a
[5] cat
None: [4] a
[6] quickly
It contains a pointer to the original sentence, which can be used for access to the identity of the dependents, etc.
>>> deps.sentence[2].form
'dog'
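The list-of-lists idea can be sketched independently of the DepLists class. The function dependents_lists() below is a hypothetical helper written for this illustration; it takes a governor array without an entry for the root pseudo-word (an assumption based on the governor array output shown earlier) and inverts it.

```python
def dependents_lists(govrs):
    """Invert a governor array into one list of dependents per word.

    Slot 0 of the result belongs to the root pseudo-word.
    """
    deps = [[] for _ in range(len(govrs) + 1)]
    for i, g in enumerate(govrs, start=1):
        deps[g].append(i)  # word i is a dependent of its governor g
    return deps

deps = dependents_lists([2, 3, 0, 5, 3, 3])
print(deps[3])    # [2, 5, 6]
print(deps[0])    # [3]
print(len(deps))  # 7
```

Because words are visited in increasing index order, each list of dependents comes out sorted from left to right.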
Lemmatization
The Sentence method lemmatize() sets the lemma, cpos, and morph attributes for each word.
The value for lemma is the lemmatized word. The module selkie.stemmer is used.
The value for cpos is the part of speech with inflection stripped. The known inflected tags are 'VBZ', 'VBG', 'VBN', 'VBP', 'VBD', 'NN', and 'NNS', and the stripped versions are 'V' or 'N'.
The value for morph is set to one of: '3s', 'ing', 'en', 'pl', 'ed', 'sg', 'pl'.
The method is destructive. It only works for English.
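The tag handling described above can be pictured as a lookup table. The sketch below is a guess at the pairing, not the selkie implementation: it assumes that the seven inflected tags and the seven morph values are listed in corresponding order, and that tags outside the table are left unchanged. INFLECTED and strip_inflection() are hypothetical names.

```python
# Assumption: the tags and morph values in the text are paired in order.
INFLECTED = {
    'VBZ': ('V', '3s'),
    'VBG': ('V', 'ing'),
    'VBN': ('V', 'en'),
    'VBP': ('V', 'pl'),
    'VBD': ('V', 'ed'),
    'NN':  ('N', 'sg'),
    'NNS': ('N', 'pl'),
}

def strip_inflection(tag):
    """Return (cpos, morph) for a tag.

    Assumption: an unknown tag is returned unchanged with morph None.
    """
    return INFLECTED.get(tag, (tag, None))

print(strip_inflection('VBZ'))  # ('V', '3s')
print(strip_inflection('DT'))   # ('DT', None)
```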
Eliminating epsilons
The Sentence method eliminate_epsilons() eliminates empty words (those whose form is None). It is possible for empty words to have dependents. Suppose word w has governor g, which is empty. The new governor of w is defined to be its lowest non-empty ancestor, where ancestor means the transitive closure of governor-of.
>>> h = parse_tree('''
... (VP (V:head thought)
... (CP (C:head)
... (S
... (NP:subj (Name:head John))
... (VP:head (V:head left)))))
... ''')
>>> s = stemma(dependency_tree(h))
>>> print(s)
0 *root* _ _ _ _
1 thought V _ root 0
2 _ C _ _ 1
3 John Name _ subj 4
4 left V _ _ 2
>>> print(s.eliminate_epsilons())
0 *root* _ _ _ _
1 thought V _ root 0
2 John Name _ subj 3
3 left V _ _ 1
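The rule for choosing the new governor can be sketched on plain lists. This is a minimal illustration under stated assumptions, not the selkie method: drop_empty_words() is a hypothetical function that takes parallel lists of forms and governors (with the root pseudo-word at position 0) and returns new lists rather than modifying a Sentence.

```python
def drop_empty_words(forms, govrs):
    """Remove words whose form is None and reattach their dependents.

    forms[i] and govrs[i] describe word i; position 0 is the root
    pseudo-word, whose governor is None.
    """
    def nonempty_ancestor(i):
        # Climb governor links until reaching the root or a non-empty word.
        g = govrs[i]
        while g != 0 and forms[g] is None:
            g = govrs[g]
        return g

    keep = [i for i in range(len(forms)) if i == 0 or forms[i] is not None]
    renumber = {old: new for new, old in enumerate(keep)}
    new_forms = [forms[i] for i in keep]
    new_govrs = [None] + [renumber[nonempty_ancestor(i)] for i in keep[1:]]
    return new_forms, new_govrs

# The example above: word 2 is the empty complementizer.
forms = ['*root*', 'thought', None, 'John', 'left']
govrs = [None, 0, 1, 4, 2]
print(drop_empty_words(forms, govrs))
# (['*root*', 'thought', 'John', 'left'], [None, 0, 3, 1])
```

The word “left” was governed by the empty word 2, so it is reattached to “thought”, and the remaining words are renumbered to close the gap.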
CoNLL Format
To get the raw contents of a file in CoNLL dependency format, use
selkie.io.iter_record_blocks()
.
>>> from selkie.io import iter_record_blocks
>>> from selkie.data import ex
>>> sent = next(iter_record_blocks(ex('depsent1')))
>>> sent[0]
['1', 'This', 'this', '_', 'pron', '_', '2', 'subj', '_', '_']
The fields are: index, form, lemma, cpos, fpos, morph, head, rel, phead, prel. The fields cpos, phead, and prel are considered “extra” information: they are optional, whereas fpos, head, and rel are obligatory. (Head and rel are obligatory, but need not be projective; phead and prel are optional, but must be projective.) Missing fields are represented with a single underscore character.
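The field layout can be made explicit by pairing a raw record with the field names. The helper parse_record() below is hypothetical and is not part of selkie.io or selkie.dep; it simply applies the field order and the underscore convention described above.

```python
FIELDS = ['index', 'form', 'lemma', 'cpos', 'fpos', 'morph',
          'head', 'rel', 'phead', 'prel']

def parse_record(record):
    """Pair a raw ten-field record with the field names.

    A single underscore marks a missing field and is mapped to None.
    """
    if len(record) != len(FIELDS):
        raise ValueError('expected %d fields, got %d'
                         % (len(FIELDS), len(record)))
    return {name: (None if value == '_' else value)
            for name, value in zip(FIELDS, record)}

rec = parse_record(['1', 'This', 'this', '_', 'pron', '_', '2', 'subj', '_', '_'])
print(rec['fpos'], rec['head'], rec['cpos'])  # pron 2 None
```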
- iter_sentences(fn)
The function iter_sentences() reads a CoNLL-format file as a sequence of selkie.dep.Sentence instances. It takes a filename as input, with an optional “#proj” or “#std” suffix. The function conll_sents() is a synonym.
The mapping between the raw fields and the Sentence attributes is done as follows. For each word, if both cpos and fpos are present, then the cat is fpos and cpos is added as an extra attribute. If only one is present, it becomes the cat. If the filename ends in #proj, the phead and prel are used; otherwise, the head and rel are used. (The suffix “#std” selects head and rel, but that is also the default.)
>>> from selkie.dep import iter_sentences
>>> s = next(iter_sentences(ex('depsent1')))
>>> print(s[1])
<Word 1 This/pron:subj (this) govr=2>
>>> s[1].cat
'pron'
- load_sentences(fn)
Returns a list rather than an iteration.
- save_sentences(sents, fn)
Takes a list of sentences and a filename as input.
>>> from tempfile import TemporaryDirectory
>>> from os.path import join
>>> from selkie.dep import save_sentences, load_sentences
>>> with TemporaryDirectory() as dfn:
...     fn = join(dfn, 'sents')
...     save_sentences([s], fn)
...     sents = load_sentences(fn)
...     print(sents[0])
...
0 *root* _ _ _ _
1 This pron this subj 2
2 is vb be mv 0
3 a dt a det 4
4 test n test prednom 2
If one loads a sentence and then saves it, the result may differ from the original. Namely, if the original records contain cpos but not fpos, the cpos will show up in the fpos position in the saved file.
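The mapping from raw fields to word attributes, and the reason a load-and-save round trip can move cpos into the fpos position, can be sketched as follows. The function word_attributes() is a hypothetical illustration of the rules stated for iter_sentences(), not the selkie code; it takes a dict of raw field values in which missing fields are None.

```python
def word_attributes(rec, projective=False):
    """Derive cat, cpos, govr and role from a dict of raw CoNLL fields."""
    cpos, fpos = rec.get('cpos'), rec.get('fpos')
    attrs = {}
    if cpos is not None and fpos is not None:
        attrs['cat'] = fpos
        attrs['cpos'] = cpos  # kept as an extra attribute
    else:
        # Only one is present: it becomes the cat, whichever it is.
        attrs['cat'] = fpos if fpos is not None else cpos
    # '#proj' selects phead/prel; otherwise head/rel are used.
    head, rel = ('phead', 'prel') if projective else ('head', 'rel')
    attrs['govr'] = None if rec.get(head) is None else int(rec[head])
    attrs['role'] = rec.get(rel)
    return attrs

only_cpos = {'cpos': 'pron', 'fpos': None, 'head': '2', 'rel': 'subj'}
print(word_attributes(only_cpos))
# {'cat': 'pron', 'govr': 2, 'role': 'subj'}
```

In the second case nothing records that the cat came from cpos, which is why a saved file would show it in the fpos position.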
Universal postag mapping
Das and Petrov (2011) [3145] introduced a set of universal
part-of-speech tags that were subsequently used in the McDonald et
al. delexicalized parsers. Petrov, Das & McDonald [3300]
describe a set of tag tables, which are installed in selkie.data
as conll/2006/universal-pos-tags
.
- load_umap(fn)
Loads a tag map from a file, returning a dict. (If given a relative pathname, it expands it relative to the universal-pos-tags directory.)
>>> from selkie.dep import load_umap
>>> map = load_umap('da-ddt.map')
>>> map['VA']
'VERB'
- apply_umap(tagmap, sent)
Takes a map and a sentence in which the word cat values are (cpos, fpos) pairs, and changes the cat values to be map[fpos].
- umapped_sents(fn, tagmap)
Takes a filename and a map, and generates a sequence of sentences in which the map has been applied to the parts of speech. It takes an optional flag projective=True whose meaning is the same as for conll_sents().
The following example assumes that one has downloaded the CoNLL 2006 data and stored its location under the config key data.conll:
>>> from selkie.config import config
>>> from os.path import expanduser, join
>>> conll = expanduser(config['data']['conll'])
>>> fn = join(conll, '2006', 'danish', 'ddt', 'train', 'danish_ddt_train.conll')
>>> from selkie.dep import umapped_sents
>>> s = next(umapped_sents(fn, map))
>>> s[1].form
'Samme'
>>> s[1].cat
'ADJ'
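The effect of applying a tag map can be sketched without the treebank. The function apply_tagmap() is a hypothetical, non-destructive stand-in for the behavior described for apply_umap(): it works on a plain list of (cpos, fpos) pairs rather than on a Sentence. The map contains only the single 'VA' entry quoted in the load_umap() example, and the cpos value 'V' is invented for the illustration.

```python
def apply_tagmap(tagmap, cats):
    """Replace each (cpos, fpos) pair by the universal tag for fpos."""
    return [tagmap[fpos] for (cpos, fpos) in cats]

tagmap = {'VA': 'VERB'}    # one entry, taken from the load_umap() example
cats = [('V', 'VA')]       # made-up (cpos, fpos) pair for illustration
print(apply_tagmap(tagmap, cats))  # ['VERB']
```

An fpos that is missing from the map raises KeyError in this sketch; how the library handles that case is not stated in the text.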