<%inherit file="/base/index.html" /> <%namespace file="/base/javascriptDefs.html" name="javascriptDefs" \ import="getMorphologicalParseScript, savePhonologyScript, generateMorphotacticsScript, generateMorphophonologyScript, generateProbabilityCalculatorScript, compilePhonologyScript, evaluateParserScript, applyPhonologyScript, applyPhonologyToDBScript"/> <%namespace file="/base/searchFields.html" name="searchFields" \ import="formSearchFields"/> <%def name="heading()">

Analysis

<%def name="analysis()">

stuff about the analysis

<%def name="phonology()">

stuff about the phonology

<%def name="morphotactics()">

stuff about the morphotactics

<%def name="morphophonology()">

stuff about the morphophonology

<%def name="probabilitycalculator()">

stuff about the probability calculator

<%def name="morphologicalparser()">

stuff about the morphological parser

<%def name="therest()">

On this page you can configure a morphological parser for your language. A morphological parser is a program that takes a word as input and returns a morphological analysis as output.

parser image

Configuring a morphological parser on the OLD entails specifying a phonology and a morphotactics as finite state transducers (FSTs) and composing them into a single morphophonology FST. A morphophonology specified as an FST functions symmetrically as both a parser and a generator: it can return the set of morpheme sequences compatible with a phonetic representation as well as the set of phonetic representations compatible with a sequence of morphemes. When parsing, a probability calculator (based on a bigram language model of morpheme sequences) chooses the best parse from the set of candidates.

parser image

As concerns language documentation and analysis, a morphological parser and its subcomponents can be used to:

There are three primary components to a morphological parser as implemented on an OLD application:

  1. a phonology (encoded as a finite state transducer, i.e., FST)
  2. a morphotactics (encoded as an FST)
  3. a morphological language model (assigns probabilities to morphological analyses)

The first two components (phonology and morphotactics) are composed into a single FST that we might call the morphophonology. The morphophonology FST is a program that takes a word as input and returns all morphological analyses compatible with the morphotactics and the phonology.

The morphological language model takes the output of the morphophonology, i.e., a list of possible morphological analyses, and returns the most probable such analysis.

Only the phonology must be specified by the users of the system. The morphotactics and morphological language model (i.e., the word probability calculator) can be induced from the data present in the database, i.e., the Forms that contain morphologically analyzed words. When no phonology is specified, the parser assumes that there are no phonological alternations. Also, when the database lacks words morphologically analyzed by the users, there will be neither morphotactics nor a language model.

Parser requirements

The morphological parser requires that foma (and its command line utility counterpart, flookup) be installed on the server. It also requires that the files 'phonology.foma', 'morphotactics.foma', 'morphophonology.foma', 'morphophonology.foma.bin' and 'probabilityCalculator.pickle' be present in the 'parser' directory. This section indicates whether these requirements are met. (See below for how to install foma and generate the requisite files.)
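The requirements check described above could be sketched in Python roughly as follows. This is an illustration only, not the code the OLD actually runs: the function name check_parser_requirements and the parser_dir argument are hypothetical, and it assumes that foma and flookup count as installed when they can be found on the PATH.

```python
import os
import shutil

# The files that the text above says must be present in the 'parser' directory.
REQUIRED_FILES = [
    "phonology.foma",
    "morphotactics.foma",
    "morphophonology.foma",
    "morphophonology.foma.bin",
    "probabilityCalculator.pickle",
]


def check_parser_requirements(parser_dir):
    """Return a dict mapping each requirement to True (met) or False (not met).

    Hypothetical helper: the real application may perform this check differently.
    """
    status = {}
    # Assumption: a program is "installed" if it is found on the PATH.
    for program in ("foma", "flookup"):
        status[program] = shutil.which(program) is not None
    for filename in REQUIRED_FILES:
        status[filename] = os.path.isfile(os.path.join(parser_dir, filename))
    return status
```

Calling check_parser_requirements('parser') returns one True/False entry per requirement, which is enough to build a status list like the one this section displays.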

Installing foma

The program foma must be installed on the server in order for the morphological parser to function.

This is a task for your system administrator. It involves installing libreadline, zlib1g-dev, foma and flookup. On a Debian-based system, try the following:


apt-get install libreadline-dev
apt-get install zlib1g-dev
wget http://dingo.sbs.arizona.edu/~mhulden/foma-0.9.15alpha.tar.gz
tar -xzvf foma-0.9.15alpha.tar.gz
cd foma
touch *.c *.h
make
make install

The steps above install foma but not the flookup utility. To install flookup, I downloaded the flookup binary (available here) and copied it to /usr/local/bin.

Phonology

The "phonology" of your language on an OLD application is the finite state transducer (FST) that represents the relation between the underlying phonemic shape of your lexical items and their surface realization in context.

The system assumes lexical items to be those Forms that lack space characters and are not tagged with a morphosyntactic category in the set {'S', 's', 'sent', 'sentence', 'Sentence'}. It further assumes that these lexical items are transcribed (in the transcription field) with an underlying phonemic transcription. In contrast, sentences, phrases, words and any other poly-morphemic Forms are assumed to have surface transcriptions.
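The test for lexical items described above can be sketched as a small Python predicate. This is a minimal illustration under two assumptions: the space check is made on the transcription, and a Form's category is available as a plain string (or None when untagged). The function name is hypothetical and the sample strings in the usage note are made up.

```python
# Categories that mark a Form as a sentence rather than a lexical item
# (the set given in the text above).
SENTENTIAL_CATEGORIES = {"S", "s", "sent", "sentence", "Sentence"}


def is_lexical_item(transcription, category=None):
    """Return True if a Form counts as a lexical item.

    A lexical item has no space characters in its transcription and is
    not tagged with one of the sentential categories.
    """
    has_space = " " in transcription
    is_sentence = category in SENTENTIAL_CATEGORIES
    return not has_space and not is_sentence
```

For example, is_lexical_item('abc', 'N') is True, while is_lexical_item('abc def', 'N') and is_lexical_item('abc', 'S') are both False.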

A finite state transducer (FST) is a computational formalism that can be thought of as representing the relation between two languages, where "language" is understood in the formal-language sense, i.e., as a set of strings. Among other things, FSTs can be used to represent the phonology of a language, i.e., the relationship between the underlying phonemic shape of a word and its surface realization.

There are several computer programs that can be used to create FSTs and use them to transform strings. Probably the most widely known of these is XFST, the Xerox Finite State Tool. The OLD uses the open source program foma. Other similar tools are SFST, HFST and PC-KIMMO. Programs like XFST and foma define languages that facilitate the specification of FSTs. The FST specification language implemented in foma allows one to generate FSTs via scripts containing SPE-style context-sensitive rewrite rules. It is by writing such a script that one specifies a phonology for a language being analyzed and documented on an OLD application.

Specifying a phonology

Use the text box below to encode the phonology of your language as an FST using foma's FST creation language. Requirements of the phonology:

See the 'foma scripting basics' and 'example phonology script' subsections below.

Resources:

Write the phonology

Clicking "Save & Compile Phonology" saves the phonology script to a file and generates a binary foma FST file so that the phonology can be tested.

Foma scripting basics

Example phonology script

Here is an example foma script which implements a phonology of the Blackfoot language. The rules are taken (with some modification and interpretation) from Frantz (1997).

toggle script

Phonology tester script

A simple Python script is provided below that can be used to test a phonology against a series of word/analysis pairs. To get the script, click the "toggle script" link below, copy and paste its contents into a text editor and save the resulting file as 'phonology_tester.py'. The script contains its own usage instructions.

toggle script

Test the phonology

Here you can test your phonology FST in two ways:

  1. apply phonology to a token
  2. apply the phonology to a subset of the Forms in the database

Apply the phonology to a token

Enter a string of morphemes (i.e., a morphological analysis of a word) in the text box below and click 'Phonologize Token'. This uses your phonology FST script to map an underlying morphophonemic representation to a surface phonetic one and is a good way of seeing whether your phonology is doing what you want it to. (The result is the same as that of running apply down morpheme-string from within foma, or of running echo "morpheme-string" | flookup -x -i phonology.foma.bin at the command line.)
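If you want to run the same command-line check from a script, something like the following Python sketch may help. It is not the code behind the 'Phonologize Token' button. The function names are hypothetical, and the output handling is an assumption: it expects flookup to print one result per line and to print '+?' when the input has no result.

```python
import subprocess


def flookup_command(fst_path, inverse=True):
    """Build the flookup argument list quoted in the text above."""
    args = ["flookup", "-x"]  # -x: do not echo the input string
    if inverse:
        args.append("-i")  # -i: apply the FST in the inverse (downward) direction
    args.append(fst_path)
    return args


def parse_flookup_output(output):
    """Return the result lines, dropping blanks and the '+?' no-result marker.

    Assumption: one result per line, '+?' when nothing was found.
    """
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines if line and line != "+?"]


def phonologize(morpheme_string, fst_path="phonology.foma.bin"):
    """Pipe a morpheme string through flookup and return the surface forms."""
    completed = subprocess.run(
        flookup_command(fst_path),
        input=morpheme_string + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_flookup_output(completed.stdout)
```

Note that phonologize only works on a machine where flookup is installed and the binary FST file exists. The other two functions have no such dependency.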


Apply the phonology to a subset of the database

To specify the set of Forms to which the phonology will be applied, enter search criteria by clicking 'Toggle interface for filtering Forms'. Then click the 'Phonologize DB Subset' button and the system will apply the phonology to the morpheme break line of each word of each Form in your search results.

Toggle interface for filtering Forms.

Morphotactics

The morphotactics of your language is its set of valid words, where a word is understood as a string of morphemes.

The morphotactics is deduced from the analyzed words and lexical items that the users of this OLD application enter. Specifically, the system searches the syntactic category string field of all Forms for valid syntactic category words. For example, a Form might have the syntactic category string "D N-? V-Agr". The syntactic category words in this example are "D", "N-?" and "V-Agr". Of these, "N-?" is invalid, because OLD applications use "?" as the category for unrecognized morphemes. From such a Form the system would conclude that a morpheme of category D can form a valid word, as can a morpheme of category V followed by one of category Agr.
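The extraction step in the example above can be sketched in Python as follows. This is only an illustration of the idea, assuming that words are separated by spaces and morphemes by hyphens, as in the example string. The function names are hypothetical.

```python
def valid_category_words(syncat_string, unknown="?", delimiter="-"):
    """Return each valid category word as a tuple of morpheme categories.

    A word is skipped if any of its categories is the unknown marker
    or is empty.
    """
    words = []
    for word in syncat_string.split():
        categories = tuple(word.split(delimiter))
        if unknown in categories or "" in categories:
            continue
        words.append(categories)
    return words


def collect_morphotactic_patterns(syncat_strings):
    """Gather the distinct category sequences found across many Forms."""
    patterns = set()
    for syncat_string in syncat_strings:
        patterns.update(valid_category_words(syncat_string))
    return patterns
```

With the string from the example, valid_category_words('D N-? V-Agr') returns [('D',), ('V', 'Agr')], i.e., the two patterns the text says the system would accept.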

Using this morphotactic and lexical information, an OLD application can automatically generate a finite state transducer that represents the set of valid words of your language.

Generate the morphotactics

Click the Generate Morphotactics button to have the system generate a morphotactic FST for your language based on the data in the database.

Morphophonology

The morphophonology is an FST created by composing the morphotactics FST with the phonology FST. This morphophonology FST is then written as a binary file that the flookup command line program can use to parse words.
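To see what composition buys you, here is a toy Python sketch in which each "transducer" is just a finite set of (upper, lower) string pairs. Real FSTs, as built by foma, encode such relations (including infinite ones) far more compactly, so this is a conceptual model only. All strings here are made up.

```python
def compose(first, second):
    """Compose two relations given as sets of (upper, lower) pairs.

    The result pairs an upper string of `first` with a lower string of
    `second` whenever they share an intermediate string.
    """
    return {(a, c) for (a, b) in first for (b2, c) in second if b == b2}


def apply_down(relation, upper):
    """Generation: all lower (surface) strings paired with an upper string."""
    return sorted(low for (up, low) in relation if up == upper)


def apply_up(relation, lower):
    """Parsing: all upper (analysis) strings paired with a lower string."""
    return sorted(up for (up, low) in relation if low == lower)


# Toy morphotactics: an identity relation over the valid morpheme strings.
morphotactics = {("a-b", "a-b"), ("a-c", "a-c")}
# Toy phonology: maps underlying strings to surface strings.
phonology = {("a-b", "ab"), ("a-c", "ab"), ("x-y", "xy")}

morphophonology = compose(morphotactics, phonology)
```

Here apply_up(morphophonology, 'ab') returns both 'a-b' and 'a-c', which is the kind of ambiguity the probability calculator is meant to resolve. The pair for 'x-y' is filtered out because the toy morphotactics does not license it.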

Generate the morphophonology

Click the Generate Morphophonology button to have the system generate the FST encoding the morphophonology of your language. Warning: depending on the size of your lexicon and the complexity of your phonology, this can take quite a long time (up to 5 minutes) and can monopolize your server's resources. It is suggested that you (a) be patient and (b) run this command during a low-usage period.

Probability calculator

The morphological language model assigns a probability to each word, i.e., to each sequence of morphemes. These probabilities are used to create a probability calculator that can rank morphological analyses.
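A bigram model over morpheme sequences can be sketched in Python as follows. This is a minimal illustration, not the calculator that the system pickles. The add-one smoothing, the boundary symbols and all function names are assumptions made for the example.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical word-boundary symbols


def train_bigram_model(analyzed_words):
    """Count unigrams and bigrams over morpheme sequences."""
    unigrams, bigrams = Counter(), Counter()
    for morphemes in analyzed_words:
        padded = [START] + list(morphemes) + [END]
        unigrams.update(padded[:-1])  # counts of bigram left-hand contexts
        bigrams.update(zip(padded, padded[1:]))
    vocabulary = {m for seq in analyzed_words for m in seq} | {START, END}
    return unigrams, bigrams, len(vocabulary)


def log_probability(morphemes, model):
    """Log probability of a morpheme sequence under the bigram model."""
    unigrams, bigrams, vocab_size = model
    padded = [START] + list(morphemes) + [END]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothing (an assumption) keeps unseen bigrams above zero.
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def best_parse(candidates, model):
    """Choose the most probable candidate analysis."""
    return max(candidates, key=lambda c: log_probability(c, model))
```

Given a model trained on the analyzed words in the database, best_parse ranks the candidate analyses returned by the morphophonology and keeps the most probable one.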

Generate the probability calculator

Click the Generate Probability Calculator button to have the system generate the probability calculator based on the morphological language model.

Test the parser

Here you can test the parser that you have configured in three ways. First, you can test it on words that you enter. Second, you can have the system divide the analyzed words in the database into training and test sets and generate an F-score for the parser. Third, you can have the system run the parser on the unanalyzed words present in the database.

Parse a word

Enter a word in the text box and click 'Parse' to parse it.

Evaluate the parser

Click the Evaluate Parser button to have the system divide the analyzed words in the database into training and test sets and generate an F-score for the parser as a measure of its accuracy.

(Note: this function creates a new probability calculator (based on a bigram language model) that draws on a randomly selected 90% of the analyzed words, i.e., the training set, and tests accuracy against the other 10%, i.e., the test set.)
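The evaluation procedure can be sketched in Python along these lines. The 90/10 split follows the note above. The particular definition of precision and recall used here (precision over the words the parser returned a parse for, recall over all test words) is an assumption, as are the function names.

```python
import random


def split_train_test(analyzed_words, train_fraction=0.9, seed=None):
    """Randomly split the analyzed words into training and test sets."""
    words = list(analyzed_words)
    random.Random(seed).shuffle(words)
    cutoff = int(len(words) * train_fraction)
    return words[:cutoff], words[cutoff:]


def f_score(gold, predicted):
    """F-score of predicted parses against gold parses.

    `gold` and `predicted` are parallel lists; a prediction of None means
    that the parser returned no parse for that word.
    """
    attempted = sum(1 for p in predicted if p is not None)
    correct = sum(1 for g, p in zip(gold, predicted) if p is not None and p == g)
    if attempted == 0 or correct == 0:
        return 0.0
    precision = correct / attempted  # correct parses among those attempted
    recall = correct / len(gold)     # correct parses among all test words
    return 2 * precision * recall / (precision + recall)
```

In practice one would train the probability calculator on the first list returned by split_train_test, parse the surface forms of the second list, and pass the gold and predicted analyses to f_score.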

Issues

Here I list some problems with the morphological parser implementation, some being general and some specific to my test language, Blackfoot.

${getMorphologicalParseScript()} ${savePhonologyScript()} ${applyPhonologyScript()} ${applyPhonologyToDBScript()} ${generateMorphotacticsScript()} ${generateMorphophonologyScript()} ${generateProbabilityCalculatorScript()} ${compilePhonologyScript()} ${evaluateParserScript()}