GLab
About GLab
GLab is a language for talking about languages. The seal.glab module provides an interpreter, which may be called interactively:
$ glab
[glab]
Or it may be called on a file:
$ glab -g test.gn
| a
a
A notebook-based web application is also provided with CLD. It displays notebooks consisting of alternating user-provided expressions of the GLab language (“inputs”), and the results of evaluating those expressions (“outputs”). The inputs are editable text blocks. The outputs are not editable, but are automatically updated whenever an input is edited.
Tutorial
Start the interactive GLab interpreter simply by typing glab.
Variables and symbols
An atom represents a symbol. It evaluates to itself:
[glab] a
a
Variables begin with underscore. The operator :=
is used to
set the value of a variable:
[glab] _x := hi
[glab] _x
hi
Note that operator is a specialization of symbol, so any “stray” operators that are not part of a well-formed operator expression are treated as symbol literals:
[glab] *
<Op *>
Note that =
is used for equality testing only, not assignment:
[glab] _x = lo
False
[glab] _x
hi
Sequences, strings, sets
A sequence literal is marked with angle brackets. The elements may be separated by commas, though commas may be omitted if no ambiguity results. (Commas are generally optional wherever they occur.) Sequences evaluate to themselves:
[glab] <hi, there>
<hi, there>
[glab] <hi there>
<hi, there>
A double-quoted string is interpreted as a sequence of symbols separated by whitespace. A single-quoted string is interpreted as a sequence of characters:
[glab] "hi there"
<hi, there>
[glab] 'hi'
<h, i>
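In Python terms, the two conversions can be sketched as follows. The helper names are hypothetical, for illustration only; they are not part of seal.glab.

```python
# Illustrative sketch (not part of seal.glab): the two string-to-sequence
# conversions described above, using hypothetical helper names.

def double_quoted(s):
    # A double-quoted string is a sequence of whitespace-separated symbols.
    return s.split()

def single_quoted(s):
    # A single-quoted string is a sequence of characters.
    return list(s)

print(double_quoted("hi there"))  # ['hi', 'there']
print(single_quoted("hi"))        # ['h', 'i']
```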
Braces introduce set literals:
[glab] {a,b,c}
{a, b, c}
Operator expressions
Dot is used for concatenation:
[glab] <a,b> . <c,d>
<a, b, c, d>
The operator @
is used for set membership:
[glab] _S := {a,b,c,d,e}
[glab] a @ _S
True
Plus is used for set union:
[glab] _S := {a,b,c}
[glab] _S + {b,c,d}
{a, b, c, d}
Expressions
Examples:
[glab] _s := {a,b,c}
[glab] |_s|
3
[glab] |"dog"|
1
[glab] |'dog'|
3
[glab] _A := {'a', 'b c', 'c a'}
[glab] {_x in _A where |_x| = 2}
[glab] (_A x _B)
[glab] (_A . _e_)
[glab] _B := /a + e + i + o + u/
[glab] _C := {a, e, i, o, u}
[glab] _B = _C
[glab] /g . o:e . o:e . s . e/
[glab] {{0}}
Finite-state automata
The function new_fsa()
creates an FSA and makes it current:
[glab] _a1 := new_fsa()
The function E()
adds an edge to the current FSA:
[glab] E(1 the 2)
[glab] E(2 big 3)
[glab] E(2 red 3)
[glab] E(3 cat 4)
[glab] E(3 dog 4)
The function F()
declares a state final:
[glab] F(4)
One may call an FSA as a function to determine whether it accepts a string:
[glab] _a1
(An FSA containing 4 states)
[glab] _a1("the big cat")
True
The function computation()
prints out the computation.
[Currently buggy]:
[glab] computation(_a1, "the big cat")
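The acceptance test performed by a call like _a1("the big cat") can be sketched as follows. This is an illustration of the idea only, not the seal.glab implementation; the FSA is modeled as a set of (source, label, destination) edges plus a start state and a set of final states.

```python
# Illustrative sketch of FSA acceptance (not the seal.glab implementation).

def accepts(edges, start, finals, sentence):
    """True iff some path from start spells out the sentence and ends final."""
    states = {start}
    for word in sentence.split():
        states = {dst for (src, lab, dst) in edges
                  if src in states and lab == word}
    return bool(states & finals)

# The automaton built in the example above.
edges = {(1, 'the', 2), (2, 'big', 3), (2, 'red', 3),
         (3, 'cat', 4), (3, 'dog', 4)}
print(accepts(edges, 1, {4}, 'the big cat'))   # True
print(accepts(edges, 1, {4}, 'the cat'))       # False
```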
Debugging
Debugging functions:
[glab] trace(syntax)
[glab] untrace(syntax)
The GLab Language
Syntax
An atom is one of the following:
- An operator, as listed in the following table. Example: +
- A variable, which must begin with underscore. Example: _a1
- A string in single or double quotes. There is no significance to the choice between single quotes and double quotes, though the start and end quotes must of course match. Example: 'foo bar'
- A symbol literal, which is any unquoted word that is not an operator or variable. Example: a
The following table lists the operators, from highest (1) to lowest (5) precedence classes. Higher precedence operators “bind more tightly.” Operators in the same precedence class group left to right.
Prec | Op | Meaning
1 | : | Cross-product
2 | * | Kleene star (suffix operator)
3 | . | Concatenation
3 | x | Cross-product
4 | + | Addition
4 | - | Subtraction
4 | - | Set difference
5 | = | Equality
5 | @ | Set membership
Note that there are two cross-product operators, differing only in precedence. That is intentional. The colon operator is for letter pairs in transductions, whereas the times operator forms the cross product of longer strings.
Atoms are grouped into expressions. The following are the expression types:
1. Infix expression. Two subexpressions with an infix operator between them, representing an operator expression with two operands. Example: a.b
2. Postfix expression. A subexpression followed by a postfix operator, representing an operator expression with one operand. Example: a*
3. Size expression. A list of subexpressions in vertical bars. Example: |{a,b}|
4. Function call. A symbol followed by a parenthesized list of subexpressions, separated optionally by commas. Example: f(_x, _y)
5. Category literal. A symbol followed by a bracketed list of subexpressions. Example: VP[sg]
6. Sequence literal. A list of subexpressions in angle brackets. Example: <c,a,t>
7. Set literal. A list of subexpressions in braces. Example: {a,b}
8. Language literal. A list of subexpressions in slashes. Example: /a . b/
Semantically, expressions of types (1)-(4) represent functions applied to arguments. The remaining expression types represent literal objects: categories, sequences, sets, or languages.
A command statement consists of a command and some number of argument expressions.
There are two types of command statement:
Prefix command statement. The first expression is a symbol representing a prefix command, as listed in the top half of the following table, and the remaining expressions in the line are its arguments.
Infix command statement. The second expression is an infix command, as listed in the bottom half of the following table. The first expression, along with the third and following expressions, represent arguments of the command.
Commands are designated symbols, as listed in the following table.
Prefix commands
set | Set the value of a variable
include | Include another notebook
incr | Increment a variable
show | Show the value of a variable
parse | Parse a sentence
trace | Turn on tracing
good | Mark a sentence as good
bad | Mark a sentence as bad
results | Show the results of parsing
Infix commands
-> | Define a grammar rule
<- | Define a lexical entry
=> | (I forget)
At the highest level, a notebook consists of newline-terminated lines.
A line beginning with #
is a comment. The title of the
notebook must be the first line and begin with #T
followed by a
space and the actual title. Every other line is a statement,
which may be either a command statement or an expression.
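The line-level structure just described can be sketched as a small classifier. This is illustrative only; the actual notebook reader lives in seal.glab.

```python
# Illustrative sketch (not the seal.glab reader) of the line-level
# classification: title line, comments, and statements.

def classify(line):
    if line.startswith('#T '):
        return ('title', line[3:].strip())
    if line.startswith('#'):
        return ('comment', line)
    return ('statement', line)

print(classify('#T My Notebook'))   # ('title', 'My Notebook')
print(classify('_x := hi'))         # ('statement', '_x := hi')
```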
List of constants
The constants are:
_0_
— the empty set
_e_
— the empty string
_else_
— used in FSTs
Top
— the top of a lattice
Bottom
— the bottom of a lattice
Global variables
The global variables are:
_fsa_
- the current FSA/FST
_corpus_
- the current corpus
*notebook-dir*
- the current notebook directory
*output*
- the current output stream
*trace*
- a set of things to be traced
List of functions
Functions have a minimum number of arguments, a maximum number of arguments, and a flag indicating whether or not they take the symtab as a hidden argument (ENV), in order to get the values of global variables. A few of the functions have types associated with their parameters; that is not indicated in the table. All functions are built-in, in the sense that their implementation is provided by a Python function.
=>(*things)
— Expand
apply(fsa,x,ENV)
— Applies an FSA to an input
bad(s,ENV)
— Marks a sentence as bad
abs(x)
— Absolute value of number, or size of a container
accepts(fsa,seq)
— Whether or not an FSA accepts a sequence
check(x,ENV)
— Check something
compose(fst1,fst2,ENV)
— Compose FSTs
computation(fsa,seq)
— Show the computation of an FSA on a sequence
concat(*fsas)
— Concatenate FSAs
cross(*fsts)
— Take the cross product of FSTs.
E(src,lab,dst,[olab],ENV)
— Add an edge to the current FSA.
equals(x,y)
— Equality
ex(s,ENV)
— Add a sentence to the list of examples.
exp(x,y)
— Take x to the y
F(q,ENV)
— Declare state q final, in the current FSA
first(x)
— Return the first element
fsa(*edges)
— Create an FSA.
good(s,ENV)
— Declare a sentence to be good.
gt(x,y)
— Greater than
include(s,ENV)
— Include the notebook with the given name, searching in the current notebook directory.
incr(x,amt)
— Increment x by an amt
intersection(x,y)
— Intersect two sets
io(x,y)
— Not sure
islang(x)
— Whether x is something that can be used as a language
ismember(x,y)
— Whether x is a member of y
isstring(x)
— Whether x is a string
L(x)
— Convert x to a language
lang(x,ENV)
— Print the language
lt(x,y)
— Whether x is less than y
makecat(*ftrs)
— Turn a set of features into a category
minus(x,y)
— Subtraction
new_fsa(ENV)
— Start a new FSA
new_grammar(ENV)
— Start a new grammar
new_union(ENV)
— Start a new union
pair(x,y)
— Create a pair
parse(s,ENV)
— Parse a sentence
plus(x,y)
— Addition
rel(fst,[n],ENV)
— List [at most n tuples from] the relation computed by an FST
results(ENV)
— Run regression tests
seq(*elts)
— Create a sequence
set(*elts)
— Create a set
show(x,ENV)
— Print details
size(x)
— The size of a container
start(ENV)
— Get the start symbol
trace(what,ENV)
— Turn tracing on
type(x)
— The type of x
untrace(what,ENV)
— Turn tracing off
List of macros
A genuine macro is a function that suppresses evaluation of its arguments and constructs a new expression that is evaluated in the place of the macro expression. GLab does not support genuine macros, but it does permit functions to suppress evaluation of some or all of their arguments.
makelexent(w,pos,ENV)
— Evaluates pos but not w.
makerule(*cats,ENV)
— Cats are not evaluated.
new(type,ENV)
— Type is not evaluated. May set a global variable.
regex(x,ENV)
— x is not evaluated.
quote(x)
— Suppresses evaluation.
setvalue(var,val,ENV)
— Neither argument is evaluated.
List of setters
A “setter” is a function that can be placed on the left-hand side of an assignment. There is only one: start(fsa) = q expands into Python as set_start(fsa,q).
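The expansion can be illustrated with a small string rewrite. This is a hypothetical sketch only; the real mechanism operates on parsed expressions, not on raw text.

```python
import re

# Hypothetical string-level illustration of the setter expansion; the real
# mechanism operates on parsed expressions, not raw text.

def expand_setter(stmt):
    m = re.fullmatch(r'start\((\w+)\)\s*=\s*(\w+)', stmt.strip())
    if m:
        return 'set_start({},{})'.format(m.group(1), m.group(2))
    return stmt

print(expand_setter('start(fsa) = q'))   # set_start(fsa,q)
```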
GLab Implementation
Command-line invocation
GLab will use the current working directory as its working directory. If you create a notebook, it will be placed in a subdirectory whose name is your user name. The subdirectory will be automatically created, if necessary. To launch GLab:
$ python -m seal.glab
To use it, visit http://localhost:8000/.
WSGI web application
To run GLab under a web server (e.g., Apache) as a WSGI application,
create a file glab.wsgi
along the lines of the following:
import site
site.addsitedir('/home/me/mypython/lib/python2.6/site-packages')
import seal.glab
application = seal.glab.make_application('/home/me/myglabdir')
The WSGI script must have permissions 744.
This assumes that /home/me/mypython
was created using
virtualenv:
$ python virtualenv.py mypython
and that Seal was installed in mypython.
CGI web application
To run GLab as a CGI script, create a file called e.g. glab
in
the cgi-bin
directory, with contents along the lines of the following:
#!/home/me/mypython/bin/python
import site
site.addsitedir('/home/me/seal-0.11.x/python')
import seal.glab
application = seal.glab.make_application('/home/me/myglabdir')
from wsgiref.handlers import CGIHandler
CGIHandler().run(application)
The CGI script should have permissions 744.
Batch mode
The first line of a .gl
file is a notebook name prefixed with
#T
, and each subsequent line is a glab statement.
For example, ex.notebook.gl
contains:
#T My Notebook
set _x <a,b,c>
_x . <b,a>
It is interpreted as follows:
>>> from seal import ex
>>> interpret_file(ex.notebook.gl, show_syntax=True)
| #T My Notebook
| _x := <a,b,c>
: setvalue(_x, seq(a, b, c))
| _x . <b,a>
: concat(_x, seq(b, a))
<a, b, c, b, a>
The original line of text is echoed with |
as prompt, and
the parsed expression is echoed with :
as prompt. Then any
return value or error is printed.
By default, echo is on, meaning that each statement and value is
printed. It also means that errors are printed instead of terminating
processing. Echo can be turned off by providing
echo=False
. In that case, the only printing is what is
explicitly done with show
statements, and any exceptions
immediately terminate processing.
Parsing
Tokenization
The function tokenize()
takes a string and returns an iteration
containing tokens.
Tokenize assumes that its input represents a
single line of input. If any newlines happen to be present, they are
treated like spaces.
The tokens in the iteration are instances
of Token
, which is a specialization
of str
.
A Token has a member type
giving its token type,
and members filename
, line
, and offset
,
indicating exactly where the token occurred. String tokens also have
member quotes
indicating the quote character:
>>> for tok in tokenize('_a1 = foo("hi\\tbye")\n_a1+s'):
... print('{:6} {}'.format(tok.type, repr(tok)))
...
word '_a1'
= '='
word 'foo'
( '('
string 'hi\\tbye'
) ')'
word '_a1'
+ '+'
word 's'
In detail, the kinds of token are as follows:
- Word. A maximal sequence of word characters (alphanumerics plus underscore). The value for tok.type is 'word'.
- String. Surrounded by paired single quotes or double quotes. The value for tok.type is 'string', and the value for tok.quotes is either a single-quote or a double-quote character.
- Special. The type of a special token is the token itself. There are two cases:
  - Multi-character special. One of: ->, <-, =>, :=, @<, @>.
  - Single-character special. Any single character that is not a word character or whitespace.
Whitespace is not returned as a token, but it does separate words. Backslash is never interpreted as an escape character; it is treated like any other punctuation character.
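A minimal tokenizer following these rules can be sketched as follows. It is an illustration of the token classification only, not the actual seal.glab tokenizer, which also records filename, line, and offset on each Token.

```python
import re

# Minimal tokenizer following the rules above: an illustration only, not
# the seal.glab tokenizer (which also records filename, line, and offset).
TOKEN_RE = re.compile(r'''
      (\w+)                      # word: alphanumerics plus underscore
    | ('[^']*'|"[^"]*")          # string: paired single or double quotes
    | (->|<-|=>|:=|@<|@>)        # multi-character special
    | (\S)                       # single-character special
''', re.VERBOSE)

def tokenize(line):
    """Yield (type, text) pairs; whitespace separates tokens."""
    for word, string, multi, single in TOKEN_RE.findall(line):
        if word:
            yield ('word', word)
        elif string:
            yield ('string', string[1:-1])   # strip the quotes
        else:
            special = multi or single
            yield (special, special)         # type of a special is itself

print(list(tokenize('_a1 := foo("hi")')))
# [('word', '_a1'), (':=', ':='), ('word', 'foo'),
#  ('(', '('), ('string', 'hi'), (')', ')')]
```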
Grouping
After tokenization, pairs of grouping characters are mated to create a
syntactic skeleton. After grouping, the atoms are still tokens
(words, strings, or specials), but complex expressions belong to the
following classes, which are subclasses of tuple
:
- BracketExpr — [...] (paired square brackets)
- ParenExpr — (...) (paired parentheses)
- BraceExpr — {...} (paired braces)
- SeqExpr — <...> (paired angle brackets)
- AbsExpr — |...| (paired vertical bars)
- ToplevelExpr — wrapped around the expression as a whole
Example:
>>> exp = group(tokenize('g.[f, {a,b}]'))
>>> pprint(exp)
ToplevelExpr {
g
.
BracketExpr {
f
,
BraceExpr {
a
,
b
}
}
}
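The mating of grouping characters can be sketched with a stack. This is illustrative only: it builds plain (name, elements...) tuples rather than the dedicated tuple subclasses, and it omits angle brackets and vertical bars, which require disambiguation from the operators that use the same characters.

```python
# Illustrative stack-based grouping sketch (not the seal.glab code).
# Angle brackets and vertical bars are omitted for simplicity.

PAIRS = {'[': ']', '(': ')', '{': '}'}
NAMES = {'[': 'BracketExpr', '(': 'ParenExpr', '{': 'BraceExpr'}

def group(tokens):
    stack = [('ToplevelExpr', [])]
    closers = []                      # closing characters we are waiting for
    for tok in tokens:
        if tok in PAIRS:
            stack.append((NAMES[tok], []))
            closers.append(PAIRS[tok])
        elif closers and tok == closers[-1]:
            name, elts = stack.pop()
            closers.pop()
            stack[-1][1].append((name,) + tuple(elts))
        else:
            stack[-1][1].append(tok)
    name, elts = stack.pop()
    return (name,) + tuple(elts)

print(group(['g', '.', '[', 'f', ',', '{', 'a', '}', ']']))
# ('ToplevelExpr', 'g', '.', ('BracketExpr', 'f', ',', ('BraceExpr', 'a')))
```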
Normalization
The function normalize()
takes the output of grouping and
converts it into a fully parsed expression. It processes the skeletal
expression recursively, bottoming out with the tokens.
The auxiliary function normalize_token()
handles the individual
tokens. It replaces the atoms with atoms of the following types,
which are specializations of str
:
- Var — a variable. Created from word tokens that begin with underscore.
- Symbol — a symbol. Created from word tokens that do not begin with underscore. Also, quoted strings are converted to sequences of symbols.
- Op — an operator. Created from specials. There is a table of operators that is used to do the conversion. If the token is not in the table, a SyntaxError is signalled.
- SeqExpr — Quoted strings are shorthands for angle-bracket expressions. A double-quoted string is split at whitespace, and a single-quoted string is exploded into its characters. The resulting list is converted to a SeqExpr containing Symbol instances.
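The token-level conversion can be sketched with str subclasses, as follows. This is an illustration only; the operator table here is deliberately partial.

```python
# Illustrative sketch of the atom classes: specializations of str, as
# described above. The operator table is deliberately partial.

class Var(str): pass
class Symbol(str): pass
class Op(str): pass

OPERATORS = {':=', '.', '+', '-', '=', '@', '*', ':', 'x'}

def normalize_token(tok):
    if tok.startswith('_'):
        return Var(tok)
    if tok in OPERATORS:
        return Op(tok)
    if tok[0].isalnum():
        return Symbol(tok)
    raise SyntaxError('unknown operator: ' + tok)

t = normalize_token('_a1')
print(type(t).__name__, t)   # Var _a1
```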
A complex expression (produced by grouping) is normalized as follows.
First, each of the elements is normalized.
Then the function op_parse()
, which does
operator-precedence parsing, is called on the
normalized elements. Finally, commas are deleted. (They serve as
lowest-precedence separators, if present.)
Associated with each operator is a function name, and the combination
of the operator with its arguments is replaced by a Funcall
expression:
Funcall
— represents a function and its arguments
Function calls of the usual sort are also recognized and replaced with
Funcall
expressions. The arguments may be surrounded either by
parentheses or by square brackets:
>>> g = group(tokenize('''_x := {"a b".'!?'}'''))
>>> pprint(g)
ToplevelExpr {
_x
:=
BraceExpr {
a b
.
!?
}
}
>>> n = normalize(g)
>>> pprint(n)
ToplevelExpr {
Funcall {
setvalue
Var _x
BraceExpr {
Funcall {
concat
SeqExpr {
Symbol a
Symbol b
}
SeqExpr {
Funcall {
sym
33
}
Funcall {
sym
63
}
}
}
}
}
}
Here are a few points to note:
- The double-quoted string "a b" is treated as consisting of whitespace-separated symbols, whereas the single-quoted string '!?' is treated as consisting of single-character symbols.
- The non-word symbols ! and ? are replaced with calls to the function sym.
- The operators have been replaced with function names: for example, := has been replaced with setvalue. A table of operators is given below, after we discuss operator-precedence parsing.
Operator-precedence parsing
Operator-precedence parsing does the real work of normalization. The operator-precedence parser works as follows. The input sequence is the contents of a single group expression: a mix of tokens and subexpressions. Each element is assigned one or more categories, as follows:
- A comma token has category , .
- An operator has two categories: O (operator) and the syntactic type of the operator, which is either I (infix) or S (suffix).
- A Funcall created by reducing an infinite-arity infix operator has categories L (list) and A (argument).
- A ParenExpr has category P (parenthesized expression).
- A BracketExpr has category P.
- A Symbol has categories A (argument) and Y (function symbol).
- Everything else has category A (argument).
The parser passes through the element sequence, applying the following rules. The pattern is a sequence of categories, and the action is taken if the first n elements have the categories given. To “reduce” means to remove the indicated elements and replace them with a Funcall headed by the operator’s equivalent function, destructively changing the sequence of elements.
- If AIAO, then compare the precedence of the two operators (I and O). If the second has higher precedence, temporarily shift two elements to the right. Otherwise, reduce the first three elements.
- If LAO, do the same, but if a reduction is done, it consists in adding the A to the L’s argument list.
- If AIAP, then temporarily shift two elements to the right. (The AP may be a function call, and function call has highest precedence.) [Shouldn’t this be AIYP?]
- If LAP, then temporarily shift one element to the right.
- If AIA, then reduce the first three elements and terminate the most recent shift, if any.
- If LA, then add the A to the L’s argument list and terminate the most recent shift, if any.
- If YP, then reduce the first two elements and terminate the most recent shift, if any.
- If AS, then reduce the first two elements and terminate the most recent shift, if any.
- If none of the above rules apply, then cancel any temporary shifts, advance one element to the right, and restart.
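The core reduce-by-precedence idea can be illustrated with a much-simplified parser that handles only binary infix operators over a flat element list, with none of the suffix-operator, function-call, or list machinery described above.

```python
# A much-simplified illustration of reduce-by-precedence: only binary infix
# operators over a flat element list, grouping left to right on ties.
# Smaller numbers bind more tightly, matching the precedence classes above.

PREC = {'.': 3, '+': 4, '=': 5}
FUNC = {'.': 'concat', '+': 'plus', '=': 'equals'}

def op_parse(elements):
    elements = list(elements)
    while len(elements) > 1:
        # Operators sit at the odd positions; pick the tightest-binding one
        # (the leftmost on ties), then reduce it with its two operands.
        i = min(range(1, len(elements), 2), key=lambda j: PREC[elements[j]])
        funcall = (FUNC[elements[i]], elements[i - 1], elements[i + 1])
        elements[i - 1:i + 2] = [funcall]
    return elements[0]

print(op_parse(['a', '.', 'b', '+', 'c']))
# ('plus', ('concat', 'a', 'b'), 'c')
```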
Digesting
The function digest()
simplifies the syntax by replacing all
expressions, including complex literals, with Funcall
objects.
In particular, the expressions introduced by grouping that have not
been previously eliminated are eliminated now:
- BracketExpr — Brackets that are not recognized as part of a function call are grouping brackets. If there is only one sub-expression, it is returned. Otherwise an error is signalled.
- ParenExpr — Parentheses that are not recognized as part of a function call are grouping parentheses. If there is only one sub-expression, it is returned. Otherwise an error is signalled.
- BraceExpr — The return is a Funcall whose function is set and whose arguments are the elements of the BraceExpr.
- SeqExpr — The return is a Funcall whose function is seq and whose arguments are the elements of the SeqExpr.
- AbsExpr — The return is a Funcall whose function is abs and whose arguments are the elements of the AbsExpr.
- ToplevelExpr — If there is only one sub-expression, it is returned. Otherwise an error is signalled.
Example:
>>> print(digest(n))
setvalue(_x, set(concat(seq(a, b), seq(sym(33), sym(63)))))
In the final result, the expression tree consists only of the
following types: Funcall
, Var
, Symbol
, int
.
Parsing
The function parse()
performs the sequence of steps just
discussed: tokenization, grouping, normalization, and digesting:
>>> expr = parse('_x := {<a>.<b>}')
>>> print(expr)
setvalue(_x, set(concat(seq(a), seq(b))))
List of Operators
The following is the complete list of operators, along with their
precedence and the corresponding function
name. Infinite arity is represented by ...
. The comma is
“inert” in the sense that it never actually combines with arguments;
it is only used as a separator. Since commas
are deleted after operator-precedence parsing, there is also no
corresponding function.
Op | Prec | Function
x | 7 |
x | 6 |
x | 5 |
x | 4 |
a | 4 |
x | 4 |
x | 3 |
x | 3 |
x | 3 |
x | 3 |
x | 2 |
x | 2 |
x | 2 |
x | 2 |
x | 1 |
x | 1 |
x | 1 |
, | 0 |
Evaluation
Overview
There are four interrelated functions: evaluate()
,
apply()
, symeval()
, and setvalue()
. All take an
env
argument, which is simply a dict mapping names to values.
Variables, constants, and function names are all included in env
.
They are easy to tell apart because variables begin with underscore,
constants are nonalphabetic, and function names are alphabetic. The
user can only change the values of variables.
Of those four functions, the only one of any complexity is
apply()
. It takes a function name and an argument list. It
goes through the following steps:
1. The function name is looked up in the environment to get the actual function f. An error is signalled if the name is not found, or if its value is not a function. It is also permissible for the function “name” to be an actual Function object, in which case no lookup is done.
2. Checks are done to make sure that the argument list includes at least f.min_narg arguments, but not more than f.max_narg arguments. (The latter may have the value Unlimited.)
3. Each argument is evaluated, unless f.eval exists and has the value False for the argument position in question.
4. If f.types exists, the types of the arguments are checked.
5. If f.envarg is True, the environment itself is added to the argument list as a new final argument.
6. f.implementation is called on the argument list, and the result is returned.
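The steps can be sketched as follows, using a hypothetical Function record whose attribute names follow this documentation; this is an illustration, not the seal.glab code.

```python
# Sketch of the apply() steps, using a hypothetical Function record whose
# attribute names follow this documentation. Not the seal.glab code.

class Function:
    def __init__(self, imp, min_narg, max_narg, envarg=False):
        self.implementation = imp
        self.min_narg = min_narg
        self.max_narg = max_narg     # None stands in for Unlimited here
        self.envarg = envarg

def apply_function(name, args, env):
    # Step 1: look up the function, unless an actual Function was passed.
    f = name if isinstance(name, Function) else env.get(name)
    if not isinstance(f, Function):
        raise TypeError('not a function: %r' % (name,))
    # Step 2: arity checks.
    n = len(args)
    if n < f.min_narg or (f.max_narg is not None and n > f.max_narg):
        raise TypeError('wrong number of arguments')
    # Steps 3-4 (argument evaluation and type checks) are omitted here.
    # Step 5: pass the environment itself, if requested.
    if f.envarg:
        args = list(args) + [env]
    # Step 6: call the implementation.
    return f.implementation(*args)

env = {'plus': Function(lambda x, y: x + y, 2, 2)}
print(apply_function('plus', [2, 3], env))   # 5
```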
To get an environment populated with the standard functions, call Environment()
:
>>> env = Environment()
>>> expr = parse('_x := {<a>.<b>}')
>>> print(evaluate(expr, env))
None
>>> env['_x']
{<a, b>}
Interpreter
Evaluator.
An Evaluator
instance behaves like a function with an internal
environment. It
can be used to evaluate a sequence of statements:
>>> e = Evaluator()
>>> e('_x := <a,b,c>')
>>> e('_x')
<a, b, c>
When initialized, it uses Environment()
to create an environment,
and each time it is called it uses parse()
to turn the string
into an expression and evaluate()
to evaluate it in the environment.
Interpreter.
An Interpreter
also evaluates statements. Unlike an Evaluator,
it traps exceptions and captures the output of commands that do direct
printing, like show
. It also echoes the input statements, and
if created with the setting show_syntax=True
, it
also echoes the parsed version of each input line (for debugging).
The return value is a string containing all output:
>>> i = Interpreter(echo=True)
>>> i('_x := <d,o,g>')
'| _x := <d,o,g>\n'
>>> i('_x')
'| _x\n<d, o, g>\n'
>>> i('_y')
'| _y\nERROR: Unbound variable: _y\n'
It can either be called with a single string (as in the examples just shown), or with an iteration over strings, such as an open file:
>>> with open(ex.notebook.gl) as file:
... print(i(file), end='')
...
| #T My Notebook
| _x := <a,b,c>
| _x . <b,a>
<a, b, c, b, a>
The Interpreter calls two lower-level functions:
- interpret_file(file, output, env) — File may be a filename or an iterator over strings (e.g., an open file). The strings are parsed as input lines and evaluated. Processing continues even if an exception is encountered. All output is trapped and returned at the end as a string.
- parse_file(strs) — This is used by interpret_file() to parse the input. It takes an iterator over strings as input, and returns an iterator over triples, one for each input line. If the input line is empty or a comment, the triple is (None, None, line). If there is an error during parsing, the triple is (None, excep, line). Otherwise, the triple is (expr, None, line).
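The triple convention can be sketched with a stub parser standing in for the real GLab parser:

```python
# Sketch of the parse_file() triple convention, with a stub parser standing
# in for the real GLab parser.

def parse_file(lines, parse=lambda s: ('expr', s)):
    """Yield (expr, exception, line) triples, one per input line."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            yield (None, None, line)            # empty line or comment
        else:
            try:
                yield (parse(stripped), None, line)
            except Exception as e:
                yield (None, e, line)           # error during parsing

triples = list(parse_file(['#T My Notebook\n', '_x := <a,b,c>\n']))
print(triples[0][:2])   # (None, None)
print(triples[1][0])    # ('expr', '_x := <a,b,c>')
```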
Customization
Adding an operator
The operators are listed in _operators
. The key is the
operator, and the value is a makeop
expression. The arguments
to makeop
are: the operator string (identical to the key), the
precedence, the syntactic type (I
for infix or S
for
suffix), and the name of the GLab function that the operator should be
replaced with. A named GLab function is a Function
object that
is the value of a key in the environment symtab; the key is the
function’s name.
To add a multi-character operator, one must also add the operator to the list of
multi-character specials in the definition of _syntax
.
Adding a function
Add an entry to the environment symtab whose value is a Function
object. The arguments to the Function
constructor are as
follows:
- imp — a Python function that implements the GLab function.
- min_nargs — the minimum number of arguments that imp requires.
- max_nargs — the maximum number of arguments that imp accepts. None means that imp is declared *args.
- types — a list giving the required types for the first n arguments, where n is the length of types.
- eval — a list of booleans indicating which arguments should be evaluated. If None (the default), all arguments are evaluated.
- envarg — whether or not the Environment should be provided as a keyword argument. The default is False.
One must also define a Python function to serve as the
implementation. It will receive only positional arguments, with the
exception of the keyword argument env
, if envarg
is True.
Note that Python permits one to declare a function that accepts a variable
number of positional arguments as well as an env
keyword argument:
def foo(*args, env=None): ...