A lexical entry has type Lexicon.Entry. It consists of a word, a part of speech, and an optional semantic translation.
>>> from seal.nlp.features import C
>>> from seal.nlp.grammar import Lexicon
>>> ent = Lexicon.Entry('dog', C('n'), 'DOG')
>>> ent.word
'dog'
>>> ent.pos
n
>>> ent.sem
'DOG'
A Lexicon consists of a set of lexical entries. The basic method is define(); it takes a word, a part of speech (category), and an optional semantic value.
>>> lex = Lexicon()
>>> lex.define('cat', C(['n','sg']))
>>> print(lex)
cat n[sg]
The lexicon can be accessed by word. The value is a list of entries.
>>> lex['cat']
[<Entry cat n[sg]>]
An error is signalled if the word is not present.
The length of the lexicon is the number of entries.
>>> len(lex)
1
For purposes of iteration, the elements of a lexicon are entries.
>>> list(lex)
[<Entry cat n[sg]>]
Grammar rules are represented by instances of the class Rule. A Rule has five attributes: lhs, rhs, bindings, variables, and sem. The lhs is a single category, and the rhs is a list of categories. The value for bindings is a list containing *'s, one for each variable used in the rule. The value for variables is a list of string representations for the variables, or None. The value for sem is an expression.
The constructor takes an lhs, rhs, sem, and a symbol table. The symbol table is a dict that maps variable names to the integers 0 through one less than the size of the table. The symbol table is optional; if omitted, variables are anonymous. The length of the bindings list is the size of the symbol table, if provided. Otherwise, it is one greater than the largest numeric variable occurring in either the lhs or rhs.
>>> from seal.nlp.grammar import Rule
>>> r = Rule('vp', ['v', 'np'], 'foo')
>>> r.lhs
'vp'
>>> r.rhs
['v', 'np']
>>> r.bindings
[]
>>> r.sem
'foo'
The Grammar class has a similar structure to the Lexicon class. Internally, it maintains two indices. A rule of form X -> Y1 ... Yn is indexed by X in the lefthand side index, and it is indexed by Y1 in the righthand side index.
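As a rough illustration of the double indexing (the class and attribute names below are invented for the sketch, not seal's actual internals), a grammar over plain symbols might be stored as:

```python
from collections import defaultdict

class TinyGrammar:
    """Illustrative sketch of a doubly indexed rule store."""

    def __init__(self):
        # X -> rules of form X -> Y1 ... Yn
        self._lhs_index = defaultdict(list)
        # Y1 -> rules whose righthand side begins with Y1
        self._rhs_index = defaultdict(list)

    def define(self, lhs, rhs):
        rule = (lhs, tuple(rhs))
        self._lhs_index[lhs].append(rule)
        if rhs:
            self._rhs_index[rhs[0]].append(rule)

    def expansions(self, x):
        return self._lhs_index.get(x, [])

    def continuations(self, y):
        return self._rhs_index.get(y, [])

g = TinyGrammar()
g.define('s', ['np', 'vp'])
g.define('vp', ['v', 'np'])
print(g.expansions('vp'))    # [('vp', ('v', 'np'))]
print(g.continuations('v'))  # [('vp', ('v', 'np'))]
```

With this layout, both expansions() and continuations() are single dict lookups.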
The basic method is define(). It takes a lhs, rhs, an optional semantic translation, and an optional symbol table.
>>> from seal.nlp.grammar import Grammar
>>> g = Grammar()
>>> g.define(C('s'), [C('np'), C('vp')])
>>> g.define(C('vp'), [C('v'), C('np')])
>>> print(g)
Start: s
Rules:
  [0] s -> np vp
  [1] vp -> v np
The attribute start contains the start category. It defaults to the lhs of the first rule defined.
>>> g.start s
The method expansions() takes a string X and returns the list of rules of form X -> Y1 ... Yn. Note that the input is just a string, not a full category.
>>> g.expansions('vp')
[<vp -> v np>]

The method continuations() returns the list of rules whose righthand side begins with a given symbol. For example:
>>> g.continuations('v')
[<vp -> v np>]
A grammar also has attributes declarations and lexicon. The value of declarations is generally None, unless the grammar is created by the grammar loader from a file that contains declarations.
The GrammarLoader reads a grammar file. Here is a simple example of the format. This is the contents of ex.g9.g. In the section headers (e.g., "% Features"), the space following the percent sign is optional, and the capitalization of the section name does not matter.
% Features
nform = sg/pl
vform = nform/ing
trans = i/t
bool = +/- default -

% Categories
S []
NP [form:nform, wh:bool]
VP [form:vform]
V [form:vform, trans:trans]
N [form:nform]
Det [form:nform]

% Rules
S -> NP[_f] VP[_f]
NP[_f] -> Det[_f] N[_f]
VP[_f] -> V[_f,i]
VP[_f] -> V[_f,t] NP

% Lexicon
the Det
a Det[sg]
cat N[sg]
dog N[sg]
dogs N[pl]
barks V[sg,i]
chases V[sg,t]
The grammar loader is called by the Grammar constructor when a filename is provided. For example:
>>> from seal.core.io import ex
>>> g = Grammar(ex.g9)
>>> print(g)
Start: S
Rules:
  [0] S -> NP[_f,-] VP[_f]
  [1] NP[_f,-] -> Det[_f] N[_f]
  [2] VP[_f] -> V[_f,i]
  [3] VP[_f] -> V[_f,t] NP[pl/sg,-]
Lexicon:
  a Det[sg]
  barks V[sg,i]
  cat N[sg]
  chases V[sg,t]
  dog N[sg]
  dogs N[pl]
  the Det[pl/sg]
The usual way to run gdev is from the shell:
python -m seal.gdev
When it starts up, it prints out the usage message, followed by a prompt (>). The commands are as follows.
When one calls seal.gdev from the shell, it instantiates the class Dev and calls its run() method. The run() method repeatedly reads a line from stdin and passes it to the com() method. Here is an example. First we instantiate Dev:
>>> from seal.nlp.gdev import Dev
>>> d = Dev()

Load grammar g9, along with its example sentences:
>>> d.com('g9')

Show the sentences. The numbers not in brackets indicate how many parses the grammar assigns to the sentence.
>>> d.com('s')
[0] 1 a cat barks
[1] 0 *a dogs barks
[2] 1 the cat chases the dog

Show the parse tree(s) for the current sentence:
>>> d.com('c')
[0] a cat barks
#Parses: 1
Parse 0:
0 (S
   1 (NP[sg,-]
      2 (Det[sg] a)
      3 (N[sg] cat))
   4 (VP[sg]
      5 (V[sg,i] barks)))
When the command is the name of a grammar file, Dev expects two files to exist: prefix.g should contain a grammar, and prefix.sents should contain a list of sentences. Each line of the sentence file is considered to be a sentence, except that empty lines and lines beginning with # are ignored. Leading and trailing whitespace is ignored. If the first non-whitespace character is *, it indicates that the example is ungrammatical. For example:
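The sentence-file conventions can be sketched as a small reader (a hypothetical helper, not seal's actual loader):

```python
def read_sents(text):
    """Parse the contents of a .sents file.

    One sentence per line; blank lines and lines beginning with '#'
    are skipped; leading and trailing whitespace is ignored; a leading
    '*' marks the example as ungrammatical.  Returns a list of
    (sentence, grammatical?) pairs.  Illustrative sketch only.
    """
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        good = not line.startswith('*')
        if not good:
            line = line[1:].strip()
        out.append((line, good))
    return out

print(read_sents("a cat barks\n*a dogs barks\n# comment\n\nthe cat chases the dog\n"))
# [('a cat barks', True), ('a dogs barks', False), ('the cat chases the dog', True)]
```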
>>> from seal.core.io import contents
>>> print(contents(ex.g9.sents), end='')
a cat barks
*a dogs barks
the cat chases the dog
Dev creates a parser from the grammar file, and uses it to parse each of the sentences in the sentence file. The predicted label is 'OK' if the parser deems the sentence to be grammatical, and '*' if the parser rejects it. The predicted labels are compared to the true labels, and the results are printed out.
This is currently broken.
Random generation from a feature grammar is a little tricky. It is not currently implemented, but the algorithm is described here.
To do random generation with feature grammars, we require both a downward and upward pass, analogous to the upward chart-filling step followed by the downward unwinding step in parsing. In parsing, we "enter" an expansion from the lower left corner (the first child), and proceed from child to child, finally "exiting" at the top. The "enter" steps involve unification of a node with a child category, and the "exit" step involves instantiation of the lefthand side category. In generation, we enter from the top, unifying the parent node with the lefthand side category.
In random generation, we fully instantiate rules as we generate a tree. Full instantiation means eliminating not only variables but also disjunctions, leaving a unique value for each attribute. The choice among disjuncts is made stochastically.
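As an illustrative sketch (the dict-of-attributes encoding below is invented, not seal's category representation), stochastic elimination of disjunctions might look like:

```python
import random

def instantiate(cat, rng=random):
    """Fully instantiate a category given as an attribute dict.

    Values that are tuples represent disjunctions, e.g. ('sg', 'pl');
    we pick one alternative at random, leaving a unique value for
    each attribute.  Illustrative sketch only.
    """
    out = {}
    for attr, val in cat.items():
        if isinstance(val, tuple):    # disjunction: choose stochastically
            val = rng.choice(val)
        out[attr] = val
    return out

rng = random.Random(0)
print(instantiate({'form': ('sg', 'pl'), 'trans': 'i'}, rng))
```

A seeded Random makes the stochastic choices reproducible, which is convenient for testing.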
We begin by fully instantiating the start category. We create a root node for the start category, but leave the children as yet unspecified.
Then, at each point, we have a parent node with a fully instantiated category, and we find the rules that could expand it. Rules are indexed by symbol in the grammar, not by complete feature sets, so we must scan through a candidate list to determine which ones actually match the given parent category. The result of each successful match is an updated symbol table. We make a list of the matching rules, and the symbol table for each.
We make a stochastic choice among the rules that match the parent category. We use the updated symbol table to fully instantiate each righthand-side category in turn, keeping track of further symbol-table updates as we go. As we instantiate each child category, we create a node possessing the category and insert it into the parent's child array. Then we recursively expand each child in turn.
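The loop just described can be sketched over a plain CFG, ignoring features and semantics (the grammar encoding and function below are invented for illustration; a depth bound stands in for real failure handling):

```python
import random

def generate(grammar, cat, rng=random, depth=0, max_depth=20):
    """Randomly expand cat top-down.

    Pick one of the matching rules at random, then recursively
    generate each child in turn.  Symbols with no expansions are
    terminals and are returned as plain strings.  Illustrative
    sketch over a plain CFG (no features).
    """
    rules = grammar.get(cat)
    if not rules:                       # terminal symbol
        return cat
    if depth >= max_depth:
        raise RuntimeError('generation failed (recursion too deep)')
    rhs = rng.choice(rules)             # stochastic choice among rules
    children = [generate(grammar, y, rng, depth + 1, max_depth) for y in rhs]
    return (cat, children)

g = {'S': [['NP', 'VP']], 'NP': [['cat'], ['dog']], 'VP': [['barks']]}
t = generate(g, 'S', random.Random(1))
print(t)
```

A real implementation would thread the symbol table through the child instantiations rather than treating each child independently.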
It is possible for generation to fail. For example, consider the following little grammar.
S -> A[f ?x] B[f ?x]
S -> A[f ?x]
A[f 1] -> foo
A[f 2] -> bar
B[f 2] -> baz
The f attribute ranges over 1 and 2; but if one chooses the first expansion for S, then one must choose value 2 for x, since B has no expansion with f 1. If one happens to choose 1, though, the problem is not detected until one attempts to generate a subtree from B[f 1].
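The failure can be simulated with a toy rule index keyed by (symbol, f-value) pairs; everything here is hypothetical, just to show where the bad choice surfaces:

```python
# Toy index for the little grammar above: (symbol, f) -> righthand side.
rules = {
    ('A', 1): ['foo'],
    ('A', 2): ['bar'],
    ('B', 2): ['baz'],
}

def expand(sym, f):
    """Look up the expansion for sym[f]; fail if there is none."""
    key = (sym, f)
    if key not in rules:
        raise LookupError('no rule for %s[f %d]' % (sym, f))
    return rules[key]

print(expand('A', 1))   # ['foo'] -- the choice f=1 looks fine here
try:
    expand('B', 1)      # ...but the failure only surfaces at B
except LookupError as e:
    print(e)            # no rule for B[f 1]
```

A generator therefore needs either backtracking or a retry-from-the-top strategy when such a dead end is reached.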
A node represents the results of generating from a pair (cat, sem). We may consider indexing nodes: a given node may well be needed multiple times, because many choices of first child may lead to the same requirements for the next child. The calling state is recorded with the node. If additional states call for the same node, they will also be recorded as callers.
To expand the node, we find all rules whose lhs is consistent with (cat, sem), and for each rule we create a new state. The node is passed along as states are expanded.
A state is like a parser edge. It represents a partial state of generating from a given rule. The first i children have been generated. Their categories have been merged into the current bindings, and their semantics has been unified into the rule semantics.
To advance a state, we substitute the current bindings into the category of the next child, and we unify the appropriate sub-AVS of the current semantics with the semantics of the next child. Then we generate from that (cat, sem) pair. The result is a list of trees. For each tree, we create a new state in which the tree's category is used to update the bindings, and the tree's semantics is unified into the semantics of the previous state to form the new state's semantics.
When a state has generated all children for its rule, a new tree is created, whose category comes from substituting the bindings into the rule lhs, and whose semantics is the state's semantics. That tree is passed back to the node, which passes it to all of its callers.
We keep a stack of active states. The discipline is depth-first, so that we generate a complete tree as quickly as possible.
>>> from seal.nlp.parser import Parser
>>> from seal.nlp.gen import Generator
>>> g = Grammar(ex.tinygen.g)
>>> p = Parser(g)
>>> ts = p('fido barks')
>>> print(ts[0])
0 (s : [subj fido; type bark]
   1 (np : fido
      2 (name fido)) : fido
   3 (vp : [type bark]
      4 (vi barks))) : [type bark]
>>> sem = ts[0].sem
>>> gen = Generator(g)
>>> it = gen(sem)
>>> t = next(it)
>>> print(t)
0 (s : [subj fido; type bark]
   1 (np : fido
      2 (name fido)) : fido
   3 (vp : 6.0
      4 (vi barks))) : [type bark]