selkie.tree — Trees

The Tree class

class selkie.tree.Tree

The nodes of a tree are represented by instances of the class Tree. There is no separate node class: a node and the tree rooted at the node are both represented by a Tree instance.

word

A string or None.

cat

Category.

sem

Semantic translation or None.

id

An identifier.

parent

The parent node, if set_parents() has been called.

__init__(**kwargs)

Members can be specified as keywords.

copy(**kwargs)

Members can be specified as keywords.

__getitem__(i)

I-th node in preorder walk.

__iter__()

Preorder walk.

__str__()

Pretty-printed.

Basic node types

We wish to accommodate the nodes that occur in three kinds of trees: unheaded phrase-structure trees, headed phrase-structure trees, and dependency trees. The attributes word, children, and role are used to distinguish among node types.

Interior and leaf. The children attribute distinguishes between interior nodes and leaf nodes. The former have children; the latter do not. Governors and phrasal nodes are interior nodes. An interior node is headed if one of its children has the role 'head', and it is unheaded otherwise.

Lexical and nonlexical. A node that has a (boolean True) value for word is lexical; otherwise the node is nonlexical.

Taking the cross product yields the following five node types:

             Leaf    Interior
                     Headed          Unheaded
Nonlexical:  empty   headed phrase   unheaded phrase
Lexical:     word    governor

For a governor, a child with role 'head' has no special status.

Terminal and nonterminal. The interior-leaf distinction is not the same as the terminal-nonterminal distinction. The latter is a property of categories, as determined by a grammar. Terminal categories are not allowed to appear on the left-hand side of a rewrite rule, whereas nonterminal categories that are not useless appear on the left-hand side of at least one rewrite rule. It is possible to have leaf nodes with nonterminal categories: such nodes are null expansions. A tree generated by a constituent-structure grammar cannot have interior nodes labeled with terminal categories, but dependency grammars make no terminal-nonterminal distinction, and permit trees in which interior nodes are labeled with parts of speech, which would be terminal categories in a constituent-structure grammar.

Leaf words and empty leaves. The word attribute distinguishes between leaf words and empty leaves. The former have a non-null value for word, and the latter do not. Note that the expression leaf word is not redundant in a dependency tree: leaf words contrast with governors, which are interior nodes with a value for word. However, in a constituency tree, only leaf nodes have values for word, so in that context we can refer to leaf words simply as words.

Depending on the kind of category it has, an empty leaf may represent either a null terminal (like an empty complementizer) or a null expansion (corresponding to a rewrite rule with nothing on the right-hand side). We are careful not to refer to null terminals as “words,” reserving the term word for a node with a non-null value for word.

Governor versus phrase. The word attribute also distinguishes between governors and phrases. Both are interior nodes; the former has a non-null value for word, while the latter does not. That is, a governor is a node that has non-null values for both children and word. Governors are used in dependency trees; their children are called dependents. By contrast, a phrase or phrasal node has children but no word. Phrasal nodes are used in constituency trees.

Heads. We further subdivide phrasal nodes according to whether they have heads or not. The head of a phrasal node is defined to be a child whose role is “head.” A phrasal node with a head is a headed phrase, and a phrasal node without a head is an unheaded phrase. Governors and leaves are by definition headless.
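
The distinctions above amount to a small decision procedure. The following is a minimal sketch of that procedure, not selkie's implementation: the Node class and the classify() function are hypothetical stand-ins that model only the attributes word, children, and role.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)
    word: Optional[str] = None
    role: Optional[str] = None

def classify(n):
    """Return one of the five node types described in the text."""
    if not n.children:
        # Leaves are either lexical (word) or nonlexical (empty).
        return 'word' if n.word else 'empty'
    if n.word:
        # Interior and lexical: a governor; head children have no special status.
        return 'governor'
    # Interior and nonlexical: a phrase, headed if some child has role 'head'.
    if any(c.role == 'head' for c in n.children):
        return 'headed phrase'
    return 'unheaded phrase'

det = Node('Det', word='the')
n = Node('N', word='dog', role='head')
np = Node('NP', [det, n])
print(classify(np))                       # headed phrase
print(classify(det))                      # word
print(classify(Node('V', [n], 'barks')))  # governor
print(classify(Node('C')))                # empty
```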

Other attributes

Number of left dependents. The nld attribute is relevant only for governors. It indicates the number of left dependents. In the terminal string, the governor is ordered after its left dependents and before its right dependents. If a phrasal node or leaf has a value for nld, the value is ignored.

Parent. The parent attribute is set by the function set_parents(). It permits one to navigate not only down a tree, but also back up again.

Cat. The cat attribute represents the syntactic category of the tree. The category may be anything, though strings and Category instances are the commonest choices.

Role. The links in a dependency tree are often labeled. The link label indicates the relationship between governor and dependent, such as “subject” or “object.” The same relationship can be useful in constituent trees as indicating the role of a child relative to its parent (or the head of its parent).

As already mentioned, the role “head” has a special status if the parent is a phrasal node.

ID. Nodes are sometimes assigned identifiers, such as the indices used to encode movement relations or control.

Sem. The value for sem is the semantic translation of the node.

Example

Here is an example of constructing a tree manually, by constructing individual nodes. The first two arguments to the constructor are the category and a list of children. A word may be specified using the keyword “word.”

>>> from selkie.tree import Tree
>>> det = Tree('Det', word='the')
>>> n = Tree('N', word='dog')
>>> np = Tree('NP', [det, n])

Here are examples for the three main attributes.

>>> np.cat
'NP'
>>> np.children
[<Tree Det the>, <Tree N dog>]
>>> np.word
>>> det.word
'the'

One can set role and id:

>>> det.role = 'spec'
>>> n.role = 'head'
>>> np.id = 1

The nld attribute is only relevant for dependency trees: see below under “Dependents.”

One can print out the tree rooted at a node using print():

>>> print(np)
0    (NP &1
1       (Det:spec the)
2       (N:head dog))

Notice that nodes are numbered. One can access them directly by number:

>>> np[2]
<Tree N dog>

This is particularly useful for large trees.

Copy

The method copy() makes a shallow copy of a node. If the original node has children, a fresh copy of the child list is made, but the child nodes themselves are not copied.

>>> y = np.copy()
>>> y is np
False
>>> y.children is np.children
False
>>> y.children == np.children
True
>>> y.children[0] is np.children[0]
True

One can modify any of the attributes cat, children, word, nld, role, or id when making the copy.

>>> z = np.copy(children=[])
>>> print(z)
0    (NP &1)

Node functions

The Tree class has few methods. Instead, there is a large collection of functions that are intended to work with any object (though not all of them are fully general yet).

getcat(n)

Category or None.

getchildren(n)

List or None.

getparent(n)

Node or None.

getword(n)

String or None.

getnld(n)

Integer or None.

getrole(n)

Role or None.

getid(n)

ID or None.

getsem(n)

Semantic translation or None.

is_interior(n)

Coerce children to boolean.

is_leaf(n)

Non-false n with boolean false children.

is_governor(n)

Non-false children, non-false word.

is_phrase(n)

Non-false children, boolean false word.

is_headed_phrase(n)

Phrase with head child.

is_unheaded_phrase(n)

Phrase with no head child.

is_leaf_word(n)

Leaf, non-false word.

is_empty_leaf(n)

Leaf, boolean false word.

is_unary(n)

Phrase, one child.

nodetype(n)

String representing node type.

head_child(n)

A node or None.

head_index(n)

An integer; -1 if there is no head child.

child_index(n, c)

Integer; c’s index in children.

left_dependents(n)

List of nodes.

right_dependents(n)

List of nodes.

expansion(n)

Tuple of categories or None.

delete_child(n, i)

Delete a child.

Accessors

Instead of using the attributes directly, it is best to use the accessor functions getword(), getchildren(), getnld(), getrole(), getcat(), getsem(), getid(), and getparent(). These functions can be applied to arbitrary objects, not just Tree instances. If called on something that lacks the attribute in question, they return None. There is one exception: if a string is passed to getword(), it returns the string itself. (Hence a string behaves like a leaf node that has a value for word but has no category.)

Some examples:

>>> from selkie.tree import getcat, getword
>>> getcat(np)
'NP'
>>> getcat('hi')
>>> getword(det)
'the'
>>> getword('hi')
'hi'
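
The tolerant behavior described above can be pictured with getattr() and a default value. This is only a sketch under that assumption: getcat_sketch(), getword_sketch(), and the Leaf class are hypothetical names, and selkie's own accessors may be written differently.

```python
def getcat_sketch(x):
    """Category of x, or None if x has no cat attribute."""
    return getattr(x, 'cat', None)

def getword_sketch(x):
    """Word of x; a string counts as its own word."""
    if isinstance(x, str):
        return x
    return getattr(x, 'word', None)

class Leaf:
    """Hypothetical node-like object, used only for this illustration."""
    def __init__(self, cat, word):
        self.cat = cat
        self.word = word

the = Leaf('Det', 'the')
print(getcat_sketch(the))    # Det
print(getcat_sketch('hi'))   # None
print(getword_sketch('hi'))  # hi
print(getword_sketch(42))    # None
```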

Predicates

Basic predicates. The following functions are available to test for properties of a node: is_interior(), is_leaf(), is_governor(), is_phrase(), is_headed_phrase(), is_unheaded_phrase(), is_leaf_word(), and is_empty_leaf(). They have all been previously discussed. Some examples:

>>> from selkie.tree import is_interior, is_leaf, is_headed_phrase
>>> is_interior('hi')
False
>>> is_leaf(det)
True
>>> is_leaf('hi')
True
>>> is_headed_phrase(np)
True

Is empty. The function is_empty() tests whether a node is empty or not. This is technically not a property of the node itself, but of the tree rooted at the node: a node is empty just in case neither it nor any of its descendants has a value for word.

>>> from selkie.tree import is_empty
>>> is_empty(Tree())
True
>>> is_empty(Tree('NP', [Tree('N')]))
True
>>> is_empty(Tree('NP', [Tree('N', word='dog')]))
False
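
Because emptiness is a property of the whole subtree, it is naturally expressed as a recursion. The following sketch assumes a hypothetical Node class with only cat, children, and word; is_empty_sketch() illustrates the definition and is not selkie's code.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)
    word: Optional[str] = None

def is_empty_sketch(n):
    """True if neither n nor any descendant of n has a word."""
    if n.word:
        return False
    return all(is_empty_sketch(c) for c in n.children)

print(is_empty_sketch(Node()))                               # True
print(is_empty_sketch(Node('NP', [Node('N')])))              # True
print(is_empty_sketch(Node('NP', [Node('N', word='dog')])))  # False
```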

Is unary. The function is_unary() returns true just in case the node has exactly one child.

>>> from selkie.tree import is_unary
>>> is_unary(np)
False
>>> is_unary(det)
False
>>> is_unary(Tree('NP', [Tree('N', word='rice')]))
True

Node type. The function nodetype() returns one of the following: 'leaf', 'governor', 'unheaded phrase', or 'headed phrase'.

>>> from selkie.tree import nodetype
>>> nodetype(np)
'headed phrase'
>>> nodetype(det)
'leaf'
>>> nodetype('hi')
'leaf'

Structural access

Head child. The function head_child() returns the child whose role is “head,” if one exists. (If there is more than one, it returns only the first.)

>>> from selkie.tree import head_child
>>> head_child(np)
<Tree N dog>

Head index. The function head_index() returns the head child’s index in the children list. It returns -1 if there is no head child. Children are numbered from 0.

>>> from selkie.tree import head_index
>>> head_index(np)
1
>>> head_index('hi')
-1

Child index. The function child_index() takes two arguments, parent and child, and returns the index of the child in the parent’s children list. It returns -1 if the child is not found.

>>> from selkie.tree import child_index
>>> child_index(np, det)
0
>>> child_index(np, 'foo')
-1

Dependents. If the node has a value for nld, the function left_dependents() returns all children up to, but not including, the child at index nld. The function right_dependents() returns all remaining children. If the node has no value for nld, but it does have a head child, then left_dependents() returns all children preceding the head child, and right_dependents() returns all children following the head child. If the node has neither nld nor a head child, both functions signal an error.

>>> from selkie.tree import left_dependents, right_dependents
>>> left_dependents(np)
[<Tree Det the>]
>>> right_dependents(np)
[]
>>> sbj = Tree('N', word='dogs')
>>> v = Tree('V', word='chase')
>>> obj = Tree('N', word='cats')
>>> v.children = [sbj, obj]
>>> v.nld = 1
>>> left_dependents(v)
[<Tree N dogs>]
>>> right_dependents(v)
[<Tree N cats>]
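
The rule for splitting children into left and right dependents can be sketched in a few lines. The Node class and split_dependents() below are hypothetical; in particular, raising ValueError is an assumption, since the text says only that an error is signaled.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)
    word: Optional[str] = None
    role: Optional[str] = None
    nld: Optional[int] = None

def split_dependents(n):
    """Return (left, right) dependents, preferring nld over a head child."""
    if n.nld is not None:
        return n.children[:n.nld], n.children[n.nld:]
    for i, c in enumerate(n.children):
        if c.role == 'head':
            # The head child itself belongs to neither list.
            return n.children[:i], n.children[i + 1:]
    raise ValueError('node has neither nld nor a head child')

sbj = Node('N', word='dogs')
obj = Node('N', word='cats')
v = Node('V', [sbj, obj], word='chase', nld=1)
left, right = split_dependents(v)
print([c.word for c in left], [c.word for c in right])  # ['dogs'] ['cats']
```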

Expansion. If a node has children, the function expansion() returns a tuple consisting of the node’s category followed by the categories of its children. Some of the categories may be None. If the node has no children, the return value is None.

>>> from selkie.tree import expansion
>>> expansion(np)
('NP', 'Det', 'N')

Destructive

The function delete_child() takes a node and a child index, and deletes the child at that index. The value for nld is adjusted, if necessary. There is no return value.

>>> from selkie.tree import delete_child
>>> delete_child(v,0)
>>> left_dependents(v)
[]
>>> right_dependents(v)
[<Tree N cats>]
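
One way to picture the nld adjustment: deleting a left dependent shifts the governor's position one place to the left. The sketch below assumes exactly that rule. Node and delete_child_sketch() are hypothetical names, and selkie's actual bookkeeping may differ.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)
    word: Optional[str] = None
    nld: Optional[int] = None

def delete_child_sketch(n, i):
    """Destructively delete child i of n, keeping nld consistent."""
    del n.children[i]
    if n.nld is not None and i < n.nld:
        # Assumption: a deleted left dependent reduces the count by one.
        n.nld -= 1

v = Node('V', [Node('N', word='dogs'), Node('N', word='cats')],
         word='chase', nld=1)
delete_child_sketch(v, 0)
print(v.nld, [c.word for c in v.children])  # 0 ['cats']
```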

Trees

The functions that treat a Tree instance as representing a complete tree (rather than just a node) are summarized as follows.

is_headed_tree(t)

All interior nodes have heads.

is_unheaded_tree(t)

All interior nodes lack heads.

is_dependency_tree(t)

All interior nodes have words.

treetype(n)

A string representing the type.

load_trees(fn)

Returns a list of trees.

parse_trees(s)

Same, but s is string.

parse_tree(s)

Single tree, else error.

iter_trees(fn)

Iteration over trees.

tree_string(t)

Pretty-printed.

Prints tree string.

save_trees(t, fn)

Saves tree string to file.

load_tabular_trees(fn)

List of trees.

iter_tabular_trees(fn)

Iteration over trees.

save_tabular_trees(ts, fn)

Write to file.

Tree types

The type of a tree is defined by the type of interior nodes it contains.

A tree is an unheaded phrase-structure tree if all its interior nodes are unheaded phrasal nodes. A tree is a headed phrase-structure tree if all its interior nodes are headed phrasal nodes. A tree is a dependency tree if all its interior nodes are governors.

A hybrid tree is one that satisfies none of these three definitions.

All three types of tree contain identical leaf nodes. They differ only in their interior nodes. Technically, a tree containing no interior nodes (i.e., consisting of a single terminal node) satisfies all three definitions.

The following functions test tree types:

>>> from selkie.tree import is_headed_tree, is_unheaded_tree, is_dependency_tree
>>> is_headed_tree(np)
True
>>> is_unheaded_tree(np)
False
>>> is_dependency_tree(v)
True

The function treetype() returns the tree type: 'headed phrase', 'unheaded phrase', or 'governor' (the last for a dependency tree). It returns 'leaf' if the tree consists of a single leaf node, and None if the tree is hybrid.

>>> from selkie.tree import treetype
>>> treetype(np)
'headed phrase'
>>> treetype(v)
'governor'
>>> treetype(det)
'leaf'
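
The definition of tree type can be read as follows: collect the types of all interior nodes and see whether they agree. Here is a minimal sketch of that reading, using a hypothetical Node class and hypothetical helper names. It is not selkie's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)
    word: Optional[str] = None
    role: Optional[str] = None

def walk(n):
    """Yield n and all of its descendants in preorder."""
    yield n
    for c in n.children:
        yield from walk(c)

def interior_type(n):
    """Type of an interior node: governor, headed phrase, or unheaded phrase."""
    if n.word:
        return 'governor'
    if any(c.role == 'head' for c in n.children):
        return 'headed phrase'
    return 'unheaded phrase'

def treetype_sketch(t):
    """Shared type of all interior nodes, 'leaf' if none, None if hybrid."""
    kinds = {interior_type(n) for n in walk(t) if n.children}
    if not kinds:
        return 'leaf'
    return kinds.pop() if len(kinds) == 1 else None

np = Node('NP', [Node('Det', word='the'), Node('N', word='dog', role='head')])
hybrid = Node('S', [np, Node('VP', [Node('V', word='barks')])])
print(treetype_sketch(np))                       # headed phrase
print(treetype_sketch(Node('Det', word='the')))  # leaf
print(treetype_sketch(hybrid))                   # None
```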

Load and parse

Iter trees. The function iter_trees() reads trees in a lisp-like format from a file or string. Like all the load/save functions, iter_trees() interprets its argument as a filename if it is a Fn, and as the contents themselves if it is a string. Here is an example:

>>> from selkie.data import ex
>>> from selkie.tree import iter_trees
>>> ts = iter_trees(ex('tree2'))
>>> next(ts)
<Tree S ...>
>>> print(_)
0    (S
1       (NP:subj &1
           foo
2          (Det the)
3          (N dog))
4       (VP:head &2
5          (V chased)
6          (NP:dobj
7             (Det the)
8             (N cat))))

Load and parse. The function load_trees() simply returns:

list(iter_trees(fn))

The function parse_trees() also dispatches to iter_trees(), but it wraps its arguments in a Contents instance (from selkie.io) so that the argument is interpreted as a string representing a tree, rather than a filename. The function parse_tree() returns a single tree instead of a list of trees; it signals an error if its argument does not parse as a single tree.

>>> from selkie.tree import parse_tree
>>> foo = parse_tree('''
...     (NP:subj&1 foo
...         (Det the)
...         (N:head dog))
... ''')
>>> print(foo)
0    (NP:subj &1
        foo
1       (Det the)
2       (N:head dog))

Tabular tree files

There is also a tabular format for representing trees in a file. An example is provided by the file t1, whose contents are:

>>> from selkie.io import contents
>>> print(contents(ex('t1')), end='')
[       S
[       NP
+       Det     the
+       N       cat
]
[       VP
+       V       chased
[       NP
+       Det     the
+       N       dog
]
]
]

A record may contain up to six fields:

1  Record type  Left bracket for the beginning of a nonterminal node, right bracket for the end of a nonterminal node, and plus for a terminal node.
2  Category     Syntactic category.
3  Word         It may not contain a tab or newline, but any other character (including space) is allowed.
4  Role         A symbol representing the relation between the node and its parent or governor.
5  Head         A numeric index, identifying either a particular child, or a position among the children.
6  ID           A numeric index for the node.

None of the fields is obligatory. Additional fields beyond these six are also permitted, but ignored.
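
To make the record format concrete, here is a rough sketch of a reader for it. It assumes that fields are separated by tabs (suggested by the restriction on the word field), and it builds plain dicts rather than Tree instances. The names make_node() and parse_tabular_sketch() are hypothetical, and the head and id fields are kept as uninterpreted strings.

```python
FIELDS = ('cat', 'word', 'role', 'head', 'id')

def make_node(fields):
    """Build a dict node from a record; fields[0] is the record type."""
    # Pad missing fields with '' and drop anything beyond the sixth field.
    values = (fields[1:] + [''] * len(FIELDS))[:len(FIELDS)]
    node = {name: (value or None) for name, value in zip(FIELDS, values)}
    node['children'] = []
    return node

def parse_tabular_sketch(lines):
    """Yield one nested dict per top-level tree found in lines."""
    stack = []  # open nonterminal nodes, innermost last
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if not fields[0]:
            continue  # skip blank lines
        if fields[0] == ']':
            node = stack.pop()  # the innermost open node is now complete
        else:
            node = make_node(fields)
            if fields[0] == '[':
                stack.append(node)  # wait for the matching ']'
                continue
        # A complete node is attached to its parent, or emitted as a tree.
        if stack:
            stack[-1]['children'].append(node)
        else:
            yield node

text = '[\tS\n[\tNP\n+\tDet\tthe\n+\tN\tcat\n]\n[\tVP\n+\tV\tchased\n]\n]\n'
tree = next(parse_tabular_sketch(text.splitlines()))
print(tree['cat'], [c['cat'] for c in tree['children']])  # S ['NP', 'VP']
```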

The function iter_tabular_trees() can be used to read a file in tabular tree format. It is a generator over trees:

>>> from selkie.tree import iter_tabular_trees
>>> t1 = next(iter_tabular_trees(ex('t1')))
>>> print(t1)
0    (S
1       (NP
2          (Det the)
3          (N cat))
4       (VP
5          (V chased)
6          (NP
7             (Det the)
8             (N dog))))

The function load_tabular_trees() converts the generator into a list.

Conversely, the function save_tabular_trees() takes a tree iterator and a filename, and saves the trees in tabular format.

>>> from os.path import join
>>> from tempfile import TemporaryDirectory
>>> from selkie.tree import save_tabular_trees
>>> with TemporaryDirectory() as dirname:
...     fn = join(dirname, 'foo')
...     save_tabular_trees([foo], fn)
...     for line in open(fn): print(line, end='')
...
[       NP      foo     subj    0       1
+       Det     the
+       N       dog     head
]

Drawing

The function draw_tree() draws a tree. It requires the package “graphviz” to be installed. It takes a second argument, which is the filename to write. If the filename is omitted, a temp file is written and displayed on the screen.

Tree iterations

If one iterates directly over a tree, one iterates over its nodes in preorder. The iteration functions and related functions are summarized as follows.

preorder(n)

An iteration over nodes.

textorder(n)

An iteration over nodes.

iter_nodes(t)

A preorder walk.

nodes(t)

A list, in preorder.

iter_edges(t)

Parents in preorder, children in order.

edges(t)

A list.

subtrees(t, f)

Highest nodes that satisfy f (list).

iter_subtrees(t, f)

Iteration.

subtree(t, f)

Error if not unique.

paths(t)

Paths in tree as slash-joined cats.

leaves(t)

All leaves.

words(t)

Non-empty leaves.

tagged_words(t)

List of (word, cat) pairs.

terminal_string(t)

Space-joined string.

is_empty(t)

Yields empty string.

is_efree_tree(t)

All nodes have children or word.

is_unaryfree_tree(t)

No unary expansions.

copy_tree(t)

Deep copy.

delete_nodes(t, cs)

Delete nodes with cat in cs.

eliminate_epsilons(n)

Eliminate empty nodes, destructive.

set_parents(t)

Destructive.

getroot(n)

Root above node.

Preorder and text order walks

A walk is an iteration over the nodes of a tree. Two different walks are defined: preorder() and textorder(). In a preorder walk, one visits a node before any of its children. For phrase-structure trees, text order is identical to preorder, but for dependency trees, they differ. More precisely, in a text-order walk, any node that has a value for nld is visited after visiting its left dependents, but before visiting its right dependents.

To illustrate, we create two trees. The first is a headed phrasal tree:

>>> from selkie.tree import parse_tree
>>> h = parse_tree('''(S (NP (Det the) (N:head dog))
...                      (VP:head (V:head barked))
...                      (Adv loudly))''')
...

The second is a dependency tree:

>>> d = parse_tree('(V (N (Det the) dog) barked (Adv loudly))')

We can confirm that the preorder and text order walks for the phrasal tree are the same:

>>> from selkie.tree import preorder, textorder
>>> for node in preorder(h): print(repr(node))
...
<Tree S ...>
<Tree NP ...>
<Tree Det the>
<Tree N dog>
<Tree VP ...>
<Tree V barked>
<Tree Adv loudly>
>>> for node in textorder(h): print(repr(node))
...
<Tree S ...>
<Tree NP ...>
<Tree Det the>
<Tree N dog>
<Tree VP ...>
<Tree V barked>
<Tree Adv loudly>

But they differ for the dependency tree:

>>> for node in preorder(d): print(repr(node))
...
<Tree V barked ...>
<Tree N dog ...>
<Tree Det the>
<Tree Adv loudly>
>>> for node in textorder(d): print(repr(node))
...
<Tree Det the>
<Tree N dog ...>
<Tree V barked ...>
<Tree Adv loudly>
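
The difference between the two walks can be sketched with a pair of generators. The Node class below is a hypothetical stand-in, and the functions illustrate the definitions given above rather than selkie's own code.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)
    word: Optional[str] = None
    nld: Optional[int] = None

def preorder_sketch(n):
    """Visit a node before any of its children."""
    yield n
    for c in n.children:
        yield from preorder_sketch(c)

def textorder_sketch(n):
    """Visit a node with nld after its left dependents, before the rest."""
    k = 0 if n.nld is None else n.nld
    for c in n.children[:k]:
        yield from textorder_sketch(c)
    yield n
    for c in n.children[k:]:
        yield from textorder_sketch(c)

# Dependency tree for "the dog barked loudly".
dog = Node('N', [Node('Det', word='the')], word='dog', nld=1)
d = Node('V', [dog, Node('Adv', word='loudly')], word='barked', nld=1)
print([n.word for n in preorder_sketch(d)])   # ['barked', 'dog', 'the', 'loudly']
print([n.word for n in textorder_sketch(d)])  # ['the', 'dog', 'barked', 'loudly']
```

Treating a missing nld as zero makes the text-order walk coincide with preorder for phrasal nodes, which is the behavior described for phrase-structure trees.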

Nodes and edges

The __iter__() method of a tree, and the function iter_nodes(), are both synonyms for preorder(). The function nodes() turns the generator into a list.

An edge is a pair (p,c) where p is a parent node and c is one of its children. The function iter_edges() returns an iteration over the edges in a tree. The function edges() turns the iteration into a list.

Subtrees

Subtrees, iter subtrees. The function iter_subtrees() returns an iteration over subtrees that satisfy a given predicate. The function subtrees() turns the iteration into a list.

The difference between these functions and simply filtering the output of nodes() is that iter_subtrees() terminates the recursion whenever it finds a node matching the predicate.

>>> from selkie.tree import subtrees
>>> subtrees(h, lambda x: is_leaf(head_child(x)))
[<Tree NP ...>, <Tree VP ...>]

If the second argument is a string, it is taken to be the desired node category.

>>> subtrees(h, 'Adv')
[<Tree Adv loudly>]
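
The contrast with plain filtering shows up when a matching node contains another matching node. The following sketch uses a hypothetical Node class and hypothetical function names to illustrate the "highest match" behavior. It is not selkie's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)

def iter_highest(n, pred):
    """Yield the highest nodes satisfying pred; never look below a match."""
    if pred(n):
        yield n
    else:
        for c in n.children:
            yield from iter_highest(c, pred)

def all_nodes(n):
    """Yield every node in preorder."""
    yield n
    for c in n.children:
        yield from all_nodes(c)

inner = Node('NP', [Node('N')])
outer = Node('NP', [Node('Det'), inner])
s = Node('S', [outer, Node('VP')])

def is_np(x):
    return x.cat == 'NP'

print(len(list(iter_highest(s, is_np))))         # 1
print(len([x for x in all_nodes(s) if is_np(x)]))  # 2
```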

Subtree. The function subtree() takes the list produced by subtrees() and returns its sole member, if there is exactly one. It signals an error if the list is not a singleton list.

Paths and leaves

Paths. The function paths() returns the list of paths through the tree. A path is represented by a string in which node categories are separated by “/”. The categories in the path are ordered from root to leaf.

>>> from selkie.tree import paths
>>> paths(h)
['S/NP/Det', 'S/NP/N', 'S/VP/V', 'S/Adv']
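
A path list of this kind can be computed by carrying the path so far down the recursion. This sketch assumes that every category is a string. Node and paths_sketch() are hypothetical names, not part of selkie.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: str
    children: list = field(default_factory=list)

def paths_sketch(n, prefix=''):
    """Return root-to-leaf paths as slash-joined category strings."""
    here = prefix + '/' + n.cat if prefix else n.cat
    if not n.children:
        return [here]
    out = []
    for c in n.children:
        out.extend(paths_sketch(c, here))
    return out

h = Node('S', [Node('NP', [Node('Det'), Node('N')]),
               Node('VP', [Node('V')]),
               Node('Adv')])
print(paths_sketch(h))  # ['S/NP/Det', 'S/NP/N', 'S/VP/V', 'S/Adv']
```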

Leaves. The function leaves() returns the list of leaf nodes in a tree. The leaves are listed in preorder.

Words. The function words() differs from leaves() in two ways: it only includes leaves that have a value for word, and it uses a text-order walk.

>>> from selkie.tree import words
>>> words(h)
['the', 'dog', 'barked', 'loudly']

Tagged words. The function tagged_words() is like words(), except that it produces a list of pairs of form (word, cat).

>>> from selkie.tree import tagged_words
>>> tagged_words(h)
[('the', 'Det'), ('dog', 'N'), ('barked', 'V'), ('loudly', 'Adv')]

Terminal string. The function terminal_string() takes the output of words() and turns it into a string. The words are separated by spaces.

>>> from selkie.tree import terminal_string
>>> terminal_string(h)
'the dog barked loudly'

Predicates

Is e-free. The function is_efree_tree() returns true just in case the tree contains no empty nodes.

Is unary-free. The function is_unaryfree_tree() returns true just in case there are no unary-branching nodes in the tree.

Copy tree

The function copy_tree() does a deep copy of a tree. Unlike the node method copy(), copy_tree() does recurse through the whole tree, making copies of all nodes.

Transformations

The operations described in this section, as well as the transformations described in the chapters on head-marking and stemma conversion, are destructive. To protect a tree, make a copy before applying destructive operations.

Delete nodes. The function delete_nodes() deletes all nodes with a given category. (However, it never deletes the root node.)

Eliminate epsilons. The function eliminate_epsilons() eliminates all empty nodes from a tree. If the tree was initially headed, any heads that are empty will get deleted.

>>> from selkie.tree import eliminate_epsilons
>>> e = parse_tree('''
...   (S
...     (NP (N ))
...     (VP
...       (VBZ )
...       (RB surely)
...       (NP Fido)))
... ''')
>>> eliminate_epsilons(e)
>>> print(e)
0    (S
1       (VP
2          (RB surely)
3          (NP Fido)))
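
The effect shown above can be approximated by pruning every empty subtree from each child list. The sketch below uses a hypothetical Node class and assumes that the root is always kept. It illustrates the idea and is not selkie's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)
    word: Optional[str] = None

def is_empty_sketch(n):
    """True if neither n nor any descendant of n has a word."""
    return not n.word and all(is_empty_sketch(c) for c in n.children)

def eliminate_epsilons_sketch(n):
    """Destructively remove every empty subtree below n (n itself is kept)."""
    n.children = [c for c in n.children if not is_empty_sketch(c)]
    for c in n.children:
        eliminate_epsilons_sketch(c)

e = Node('S', [Node('NP', [Node('N')]),
               Node('VP', [Node('VBZ'),
                           Node('RB', word='surely'),
                           Node('NP', word='Fido')])])
eliminate_epsilons_sketch(e)
print([c.cat for c in e.children])              # ['VP']
print([c.cat for c in e.children[0].children])  # ['RB', 'NP']
```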

Set parents, get root. The function set_parents() destructively adds a parent attribute to every node in the tree, pointing back to the node’s parent.

After parents have been set in a tree, one can use the function getroot() to go from any node to the root node. It follows parent links up the tree to the root node.
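
Both operations are short when written out. The following sketch uses a hypothetical Node class. set_parents_sketch() and getroot_sketch() illustrate the description above and are not selkie's code.

```python
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    def __init__(self, cat, children=()):
        self.cat = cat
        self.children = list(children)
        self.parent = None

def set_parents_sketch(n, parent=None):
    """Destructively record each node's parent; the root's parent is None."""
    n.parent = parent
    for c in n.children:
        set_parents_sketch(c, n)

def getroot_sketch(n):
    """Follow parent links upward until a node without a parent is reached."""
    while n.parent is not None:
        n = n.parent
    return n

noun = Node('N')
s = Node('S', [Node('NP', [Node('Det'), noun]), Node('VP')])
set_parents_sketch(s)
print(noun.parent.cat)            # NP
print(getroot_sketch(noun) is s)  # True
```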

Tree builder

class selkie.tree.TreeBuilder

A stack-like data structure for constructing a tree.

start(c[, w, r, i])

Start a new interior node. Cat, word, role, id.

middle()

This is the head position.

end()

Pop node.

leaf(c, w[, r, i])

Create a leaf node.

trees()

A list; error if not finished.

tree()

A tree; error if not finished and unique.

Here is an example of use:

>>> from selkie.tree import TreeBuilder
>>> tb = TreeBuilder()
>>> tb.start('NP')
<Tree NP>
>>> tb = TreeBuilder()
>>> tb.start('S')
<Tree S>
>>> tb.start('NP', role='subj')
<Tree NP>
>>> tb.leaf('Det', 'the')
<Tree Det the>
>>> tb.leaf('N', 'dog')
<Tree N dog>
>>> tb.end()
<Tree NP ...>
>>> tb.start('VP', role='head')
<Tree VP>
>>> tb.leaf('V', 'barks', role='head')
<Tree V barks>
>>> tb.end()
<Tree VP ...>
>>> tb.end()
<Tree S ...>
>>> tb.tree()
<Tree S ...>
>>> print(_)
0    (S
1       (NP:subj
2          (Det the)
3          (N dog))
4       (VP:head
5          (V:head barks)))

The methods for building a phrasal node are start() and end(). Both return the node.

To build a dependency node, one also uses the method middle() to mark the position at which the governor occurs. For example:

>>> tb.start('V', word='chase')
<Tree V chase>
>>> tb.leaf('N', 'dogs')
<Tree N dogs>
>>> tb.middle()
>>> tb.leaf('N', 'cats')
<Tree N cats>
>>> tb.end()
<Tree V chase ...>
>>> tb.tree()
<Tree V chase ...>
>>> print(_)
0    (V
1       (N dogs)
        chase
2       (N cats))

The builder allows one to construct multiple trees; it saves them on a list until one calls either tree() or trees(). The latter returns the list of trees constructed. The former returns a single tree, and signals an error if there is not exactly one tree on the list. Both methods signal an error if there is an incomplete tree in progress. Both methods restore the builder to its empty state.
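
The stack-like behavior can be sketched as follows. BuilderSketch and Node are hypothetical names, the id argument is omitted, and the choice of RuntimeError is an assumption. The sketch only mirrors the behavior described above and does not reproduce selkie's TreeBuilder.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not selkie's Tree class)."""
    cat: Optional[str] = None
    children: list = field(default_factory=list)
    word: Optional[str] = None
    role: Optional[str] = None
    nld: Optional[int] = None

class BuilderSketch:
    def __init__(self):
        self._stack = []  # interior nodes still under construction
        self._done = []   # completed top-level trees

    def _attach(self, node):
        # A finished node goes to its open parent, or to the result list.
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self._done.append(node)

    def start(self, cat, word=None, role=None):
        node = Node(cat, word=word, role=role)
        self._stack.append(node)
        return node

    def middle(self):
        # The governor sits after the children seen so far.
        top = self._stack[-1]
        top.nld = len(top.children)

    def leaf(self, cat, word, role=None):
        node = Node(cat, word=word, role=role)
        self._attach(node)
        return node

    def end(self):
        node = self._stack.pop()
        self._attach(node)
        return node

    def trees(self):
        if self._stack:
            raise RuntimeError('a tree is still in progress')
        done, self._done = self._done, []  # reset to the empty state
        return done

    def tree(self):
        trees = self.trees()
        if len(trees) != 1:
            raise RuntimeError('expected exactly one tree')
        return trees[0]

tb = BuilderSketch()
tb.start('V', word='chase')
tb.leaf('N', 'dogs')
tb.middle()
tb.leaf('N', 'cats')
tb.end()
t = tb.tree()
print(t.nld, [c.word for c in t.children])  # 1 ['dogs', 'cats']
```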