The module seal.data.brown behaves like an NLTK corpus, and indeed it dispatches to nltk.corpus.brown in most cases. However, it provides an alternative reduced tagset.
>>> from seal.data import brown
>>> brown.tagged_words(tagset='base')[:3]
[('The', 'AT'), ('Fulton', 'NP'), ('County', 'NN')]
Contrast this with the default tagset:
>>> brown.tagged_words()[:3]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
NLTK provides the Brown corpus, though the version in seal.data.brown is tweaked. The two basic functions are:
>>> brown.words()
>>> brown.tagged_words()

The latter takes an optional argument tagset. If absent or equal to "original", the full Brown tags are returned. If equal to "base", the prefix FW- and the suffixes -NC, -TL, and -HL are removed from the tags. There are a few places where -T occurs in the original as an error for -TL; these are also stripped.
One can also call the function brown.base() on a tag to strip its prefixes and suffixes, if any. In addition, the function brown.ispunct() indicates whether a tag is a punctuation tag or not, and brown.isproper() indicates whether a tag is a proper name tag or not.
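The stripping rule described above can be sketched in a few lines. This is an illustrative reimplementation of the documented behavior, not seal's actual code:

```python
def base(tag):
    """Strip the FW- prefix and the -NC, -TL, -HL suffixes from a Brown
    tag, along with the erroneous -T variant of -TL."""
    if tag.startswith('FW-'):
        tag = tag[3:]
    # Suffixes may stack (e.g. NN-TL-HL), so strip until none remain.
    stripped = True
    while stripped:
        stripped = False
        for suffix in ('-NC', '-TL', '-HL', '-T'):
            if tag.endswith(suffix):
                tag = tag[:-len(suffix)]
                stripped = True
    return tag
```

For example, base('NN-TL') yields 'NN' and base('FW-IN') yields 'IN'.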
Both brown.words() and brown.tagged_words() can be called with optional parameters categories or fileids, with the same interpretation as in NLTK.
There are 188 base tags, which break down as follows:
- 1 NIL
- 96 compound tags
- 91 simple tags, consisting of:
  - 9 punctuation tags
  - 4 proper-noun tags
  - 78 regular word tags, consisting of:
    - 21 tags for unique lexical items
    - 39 closed-class tags
    - 18 open-class tags
NIL. There are 157 tokens in the original that are tagged "NIL." This appears to be simply a gap in the tagging. They are not removed from the output of stripped().
Compound tags. The compound tags are for contracted word pairs; four of them are actually contracted word triples. They exclude possessives, inasmuch as the possessive marker is not an independent word. The majority of contractions involve either the combination of a verb tag with *, which represents the contraction "n't"; or the combination of a noun or pronoun tag with a verb or auxiliary tag. There are, however, a fair number of other cases as well. The simple tags making up compound tags all occur independently, with the exception of "PP," which occurs in a compound tag but not standing alone. It is probable that this is an error for PPS or PPO, particularly since it occurs at the end of a long tag that may have gotten truncated.
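Decomposing a compound tag into its simple components is straightforward. This assumes, as in the standard Brown tagset, that components are joined with '+':

```python
def split_compound(tag):
    """Split a compound (contraction) tag into its simple component tags.
    Components are joined with '+' in the standard Brown tagset."""
    return tag.split('+')
```

For example, split_compound('PPSS+BEM') yields ['PPSS', 'BEM'], while a simple tag is returned as a singleton list.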
Punctuation tags. Of the 91 simple tags, nine are punctuation tags:
' '' ( ) , -- . : ``
Proper-noun tags. Four tags represent proper nouns:
NP NP$ NPS NPS$
The tag NP includes titles such as "Mr." and "Jr.," as well as place names, month names, and the like. NPS includes words like "Republicans."
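The tests ispunct() and isproper() mentioned earlier amount to membership in the two small sets just listed. A minimal sketch (over base tags, not seal's actual implementation):

```python
# The nine punctuation tags and four proper-noun tags listed above.
PUNCTUATION_TAGS = {"'", "''", "(", ")", ",", "--", ".", ":", "``"}
PROPER_TAGS = {"NP", "NP$", "NPS", "NPS$"}

def ispunct(tag):
    """True if the base tag is a punctuation tag."""
    return tag in PUNCTUATION_TAGS

def isproper(tag):
    """True if the base tag is a proper-noun tag."""
    return tag in PROPER_TAGS
```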
Unique lexical items. There are 21 tags that represent unique lexical items. We ignore spelling variation, nonstandard dialect forms, and foreign words. A few of the possessive tags, namely DT$, JJ$, AP$, CD$, RB$, appear on only one word each, but those represent rare constructions or questionable tagging decisions, and are listed elsewhere.
Closed-class tags. There are 39 closed-class tags:
Conjunctions | |
---|---|
CC | and, but, or, nor, either, yet, neither, plus, minus, though |
CS | complementizers |
Specifiers | |
ABL | such, quite, rather |
ABN | all, half, many, nary |
AP | other, many, more, same, ... |
AP$ | other's |
AT | the, a(n), no, every |
DTI | some, any |
DTS | these, those |
DTX | either, neither, one |
DT | this, that, each, another |
DT$ | another's |
QLP | enough, indeed, still |
Numbers | |
CD | cardinal numbers |
CD$ | 1960's, 1961's |
OD | ordinal numbers |
Pronouns | |
PPS | he, it, she |
PPSS | I, they, we, you |
PPO | it, him, them, me, her, you, us |
PP$ | his, their, her, its, my, our, your |
PP$$ | his, mine, ours, yours, theirs, hers |
PPL | himself, itself, myself, herself, yourself, oneself |
PPLS | themselves, ourselves, yourselves |
PN | one; (some-, no-, any-, every-) + (-thing, -body) |
PN$ | one's, anyone's, everybody's, ... |
RN | here, then, afar |
Interrogatives | |
WDT | which, what, whichever, whatever |
WPS | who, that, whoever, what, whatsoever, whosoever |
WPO | whom, that, what, who |
WP$ | whose, whosever |
WRB | when, where, how, why, plus many variants |
Other Closed Classes | |
MD | modals |
NR | adverbial nouns: days of the week, cardinal directions, etc. |
NRS | plural adverbial nouns |
NR$ | possessive adverbial nouns |
QL | qualifiers (adverbs that modify quantifiers) |
IN | prepositions |
RP | particles |
UH | interjections |
Open-class tags. There are 18 open-class tags, of which two (JJ$ and RB$) appear to be the result of phrasal use of the possessive, and should probably be placed in the class of compound tags.
Nouns | |
---|---|
NN | singular |
NNS | plural |
NN$ | possessive |
NNS$ | possessive plural |
Verbs | |
VBZ | third-person singular |
VBD | past tense |
VB | uninflected form |
VBG | present participle |
VBN | past participle |
Adjectives | |
JJ | positive |
JJR | comparative |
JJS | intrinsically superlative |
JJT | morphologically superlative |
JJ$ | Great's |
Adverbs | |
RB | adverb |
RBR | comparative |
RBT | superlative |
RB$ | else's |
Another source of trees is the Penn Treebank, represented by the module ptb, which contains functions to access the treebank and its parts.
One may specify in the Seal configuration the pathname for the contents of LDC99T42.
The treebank consists of 2312 files divided into 25 sections. There is a traditional division into train, test, dev train, dev test, and reserve test parts:
Division | Sections | Files |
---|---|---|
dev_train | 00-01 | 0-198 |
train | 02-21 | 199-2073 |
reserve_test | 22 | 2074-2156 |
test | 23 | 2157-2256 |
dev_test | 24 | 2257-2311 |
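The section-to-division mapping in the table above can be encoded directly. A small sketch (illustrative names, not part of the ptb module):

```python
# The traditional division of WSJ sections, per the table above.
DIVISIONS = {
    'dev_train':    range(0, 2),    # sections 00-01
    'train':        range(2, 22),   # sections 02-21
    'reserve_test': range(22, 23),  # section 22
    'test':         range(23, 24),  # section 23
    'dev_test':     range(24, 25),  # section 24
}

def division_of(section):
    """Return the name of the division a WSJ section number belongs to."""
    for name, sections in DIVISIONS.items():
        if section in sections:
            return name
    raise ValueError(f'no such section: {section}')
```

For example, division_of(21) is 'train' and division_of(24) is 'dev_test'.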
The functions follow the conventions of the NLTK corpus readers. The function fileids() returns a list of file identifiers, which are actually numbers in the range [0,2312). One can also specify one or more categories. Category names are either WSJ section names, in the form '00', '01', up to '24', or one of the following: 'train', 'test', 'dev_train', 'dev_test', 'reserve_test'. One can get a list of the fileids in a given category, or the categories that a given file belongs to:
>>> from seal.data import ptb
>>> len(ptb.fileids())
2312
>>> len(ptb.fileids(categories='train'))
1875
>>> ptb.fileids('dev_train')[-5:]
[194, 195, 196, 197, 198]
>>> ptb.categories(0)
['00', 'dev_train']
>>> ptb.categories(2311)
['24', 'dev_test']
>>> for c in sorted(ptb.categories()):
...     if c.islower():
...         print(c, len(ptb.fileids(c)))
...
dev_test 55
dev_train 199
reserve_test 83
test 100
train 1875
One can obtain the filename for a given fileid:
>>> ptb.orig_filename(199)[-15:]
'02/wsj_0200.mrg'
Reverse look-up is also possible:
>>> ptb.orig_to_fileid('0200')
199
The reverse look-up table is loaded the first time that orig_to_fileid() is called.
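The lazy-initialization pattern described here is easy to sketch. This is a generic illustration (hypothetical FileTable class, not seal's implementation):

```python
class FileTable:
    """A forward table from fileid to name, with the reverse map built
    lazily on first lookup, as the text describes."""

    def __init__(self, names):
        self._names = list(names)
        self._reverse = None        # not built until needed

    def name(self, fileid):
        return self._names[fileid]

    def fileid(self, name):
        if self._reverse is None:   # first call: build the reverse table
            self._reverse = {n: i for i, n in enumerate(self._names)}
        return self._reverse[name]
```

Building the map once and caching it makes repeated reverse look-ups O(1) while costing nothing if they never happen.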
The method trees() returns a list of all the individual trees in the treebank or a slice of it:
>>> trees = ptb.trees(0)
>>> print(trees[0])
0 (
1   (S
2     (NP:SBJ
3       (NP
4         (NNP Pierre)
5         (NNP Vinken))
6       (, ,)
...
>>> len(ptb.trees(categories='dev_test'))
1346
There is also a function iter_trees() that returns iterators rather than lists.
In the original treebank, typical empty nodes look like this:
(NP-SBJ (-NONE- *-1) )
(SBAR (-NONE- 0) (S (-NONE- *T*-1) ))
We omit "-NONE-" and treat "*," "0," or "*T*" as the category. The word and children are both None. For example:
>>> trees = ptb.trees(categories='dev_test')
>>> tree = trees[30]
>>> np = tree[18]
>>> print(np)
0 (NP:SBJ
1   (*T* &1))
>>> t = np.children[0]
>>> t.cat
'*T*'
>>> t.word
''
>>> tree = trees[86]
>>> s1 = tree[36]
>>> print(s1)
0 (SBAR
1   (0)
2   (S
3     (*T* &1)))
>>> s1.children[0].cat
'0'
>>> s = s1.children[1]
>>> s.children[0].cat
'*T*'
The module ptb is summarized in the following table. The arguments f and c are optional and can also be provided by keyword, as fileids and categories, respectively.
fileids(c) | The file IDs in categories c |
categories(f) | The categories for fileids f |
trees(f,c) | The trees in the given files/categories |
words(f,c) | The words |
sents(f,c) | Sentences (lists of words) |
raw_sents(f,c) | Sentence strings |
abspath(f) | The absolute pathname for the fileid |
text_filename(f) | Pathname for the text file |
orig_filename(f) | The original pathname |
fileid_from_orig(o) | Convert original ID (4 digits) |
text_files(f,c) | List of text filenames |
orig_files(f,c) | List of original filenames |
The function fileid_from_orig() takes an original file identifier. It strips a trailing file suffix, if any, and then ignores everything except the last four characters, which should be digits, such as "0904," which represents file 04 in WSJ section 09. Accordingly, "parsed/mrg/wsj/09/wsj_0904.mrg," "wsj_0904.mrg," and simply "0904" are treated as synonymous.
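The normalization step just described can be sketched as follows; this is an illustration of the stated convention (hypothetical function name), not seal's code:

```python
import os

def normalize_orig(name):
    """Reduce an original Penn Treebank identifier to its four-digit core:
    strip a trailing file suffix, then keep the last four characters,
    which must be digits."""
    stem, _ = os.path.splitext(name)   # strip trailing suffix, if any
    core = stem[-4:]                   # e.g. '0904' = file 04 of section 09
    if not core.isdigit():
        raise ValueError(f'expected four digits, got {core!r}')
    return core
```

All three spellings from the text normalize to the same core, e.g. normalize_orig('parsed/mrg/wsj/09/wsj_0904.mrg') yields '0904'.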
Bikel [2767] reports a number of statistics for the standard training slice (sections 02--21) of the Penn Treebank. We can compute our own statistics and compare, as follows. (Be warned, the calls that iterate over trees take on the order of minutes to return.)
Number of sentences. Bikel counts 39,832 sentences. Our count agrees:
>>> count(ptb.trees(categories='train'))
39832
Number of word tokens. Bikel counts 950,028 word tokens (not including null elements). Our count agrees:
>>> count(n for t in ptb.trees(categories='train')
...       for n in t.nodes()
...       if n.isword())
950028
>>> count(ptb.words(categories='train'))
950028
Number of word types. Bikel counts 44,114 unique words (not including null elements). Our count is slightly higher. I do not know why there is a discrepancy.
>>> len(set(n.word for t in ptb.trees(categories='train')
...              for n in t.nodes()
...              if n.isword()))
44389
>>> len(set(ptb.words(categories='train')))
44389
Number of words with a count greater than 5. Bikel reports that 10,437 word types occur 6 times or more. Our count is again a little higher. (Here wcts is a table of word counts over the training data.)

>>> count(w for w in wcts if wcts[w] >= 6)
10530
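The frequency-threshold computation is generic and can be written against the standard library; a minimal sketch (hypothetical helper, shown here on toy data rather than the treebank):

```python
from collections import Counter

def types_with_min_count(words, k):
    """Count the word types that occur at least k times."""
    counts = Counter(words)
    return sum(1 for c in counts.values() if c >= k)

# Toy corpus: 'a' occurs 6 times, 'b' 3 times, 'c' once.
toy = ['a'] * 6 + ['b'] * 3 + ['c']
```

On the toy corpus, types_with_min_count(toy, 6) is 1 and types_with_min_count(toy, 3) is 2.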
Number of interior nodes. Bikel reports 904,748 brackets. Our count is quite a bit lower:
>>> count(n for t in ptb.trees(categories='train')
...       for n in t.nodes()
...       if n.isinterior())
792794
Number of nonterminal categories. Bikel reports 28 basic nonterminals, excluding roles ("function tags," in his terms) and indices. Including roles and indices, he reports 1184 full nonterm labels.
>>> ntcats = set(n.cat for t in ptb.trees(categories='train')
...              for n in t.nodes()
...              if n.isinterior())
>>> len(ntcats)
27
>>> sorted(ntcats)
[ADJP, ADVP, CONJP, FRAG, INTJ, LST, NAC, NP, NX, PP, PRN, PRT, PRT|ADVP, QP, RRC, S, SBAR, SBARQ, SINV, SQ, UCP, VP, WHADJP, WHADVP, WHNP, WHPP, X]

It is not clear what Bikel's extra category is. Possibly he went beyond the training data.
Actually, we should probably replace PRT|ADVP with either PRT or ADVP. That would leave only 26 categories.
Number of terminal categories. Bikel reports 42 unique part of speech tags. We count 55.
>>> parts = set(n.cat for t in ptb.trees(categories='train')
...             for n in t.nodes()
...             if n.isleaf())
>>> len(parts)
55
>>> sorted(parts)
[#, $, '', *, *?*, *EXP*, *ICH*, *NOT*, *PPA*, *RNR*, *T*, *U*, ,, -LRB-, -RRB-, ., 0, :, CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNP, NNPS, NNS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, ``]

Eliminating empty leaves reduces the number of parts of speech to 45:
>>> parts = set(n.cat for t in ptb.trees(categories='train')
...             for n in t.nodes()
...             if n.isleaf() and not n.isempty())
>>> len(parts)
45
>>> sorted(parts)
[#, $, '', ,, -LRB-, -RRB-, ., :, CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNP, NNPS, NNS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, ``]
Number of roles. Bikel does not count roles separately. We can:
>>> trn = ptb.trees(categories='train')
>>> roles = set(map(Node.role, trn.nodes()))
>>> roles
set([TMP, DIR, PRP-CLR, SBJ-TTL, LOC-HLN, TPC, CLR-TPC, CLF, CLF-TPC, PUT-TPC, PRD-TPC, NOM-TPC, LGS, PRP-TPC, PRD-TTL, TPC-TMP, MNR, TPC-PRD, LOC-PRD-TPC, DIR-PRD, LOC-TMP, SBJ, TMP-TPC, MNR-PRD, HLN, MNR-CLR, BNF, LOC-MNR, PRD-LOC-TPC, LOC-CLR, TTL, NOM-SBJ, CLR-LOC, NOM, DIR-TPC, TPC-CLR, PRD-TMP, CLR, TTL-PRD, TMP-CLR, TMP-HLN, LOC-TPC-PRD, PRP-PRD, LOC-TPC, None, LOC-CLR-TPC, VOC, EXT, MNR-TMP, PRD, NOM-LGS, CLR-TMP, TMP-PRD, ADV, DTV, NOM-PRD, TTL-SBJ, TPC-LOC-PRD, LOC-PRD, PRD-LOC, ADV-TPC, CLR-MNR, DIR-CLR, PUT, TTL-TPC, PRP, LOC, CLR-ADV, MNR-TPC])
>>> len(roles)
69
The categories occurring in the treebank can be divided into three groups: nonterminal categories, parts of speech, and empty categories.
Nonterminal categories label interior nodes, that is, nodes that have children. (In the treebank, no interior nodes are labeled with words.) There are 28 nonterminal categories; the full list is available as ptb.nonterminal_categories.
Parts of speech label nodes that have words. There are 45 parts of speech; the full list is available as ptb.parts_of_speech.
Empty categories label empty leaf nodes, that is, nodes that have neither children nor words. There are 10 empty categories, listed in the following table.
* | PRO or trace of NP-movement; preterminal cat is NP |
*?* | Ellipsis |
*EXP* | Pseudo-attachment: extraposition |
*ICH* | Pseudo-attachment: "interpret constituent here" (discontinuous dependency) |
*NOT* | "Anti-placeholder" in template gapping |
*PPA* | Pseudo-attachment: "permanent predictable ambiguity" |
*RNR* | Pseudo-attachment: right-node raising |
*T* | Trace of wh-movement |
*U* | Unit |
0 | Null complementizer |
NX is generally used in coordinate structures. It may be used for N-bar coordination: "the [NX red book] and [NX yellow pencils]." It is also used in non-constituent coordination structures such as "20 thin [NX] and 10 fat [NX] [NX dogs]," where "dogs" is treated as a right-node raised node. It is also used for book/movie titles that have premodifiers.
Lists of the categories are found in the following variables.
>>> len(ptb.nonterminal_categories)
28
>>> len(ptb.parts_of_speech)
45
>>> len(ptb.empty_categories)
10
These lists were constructed using the function collect_categories(). It returns a list containing three sets: nonterminal categories, parts of speech, and empty categories. A category is defined to be nonterminal if it appears on a node with children, a part of speech if it appears on a node with a word, and an empty category otherwise. Note that the empty string is included as an extra nonterminal category: there are some nonterminal nodes (root nodes) without a category.
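The classification rule can be sketched in a self-contained way. The Node class below is a minimal stand-in for seal's tree nodes, not its actual class:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Minimal stand-in for a treebank node."""
    cat: str
    word: Optional[str] = None
    children: List['Node'] = field(default_factory=list)

    def nodes(self):
        yield self
        for child in self.children:
            yield from child.nodes()

def collect_categories(trees):
    """Partition categories: nonterminal if the node has children, part
    of speech if it has a word, and empty category otherwise."""
    nonterms, parts, empties = set(), set(), set()
    for tree in trees:
        for node in tree.nodes():
            if node.children:
                nonterms.add(node.cat)
            elif node.word is not None:
                parts.add(node.cat)
            else:
                empties.add(node.cat)
    return [nonterms, parts, empties]
```

On a toy tree with an NP over (NNP Pierre) and an empty *T* node, the three sets come out as {'S', 'NP'}, {'NNP'}, and {'*T*'}.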
The roles that occur in the PTB are listed in the following table.
Role | Gloss | Type | Notes |
---|---|---|---|
ADV | Adverbial | form vs function | Used on NP or SBAR, but not ADVP or PP. Subsumes more-specific adverbial tags. |
BNF | Benefactive | adverbial | May be used on indirect object. |
CLF | Cleft | misc | It clefts. Marks the whole sentence; not actually a role. |
CLR | Closely related | misc | Intermediate between argument and modifier. |
DIR | Direction | adverbial | May be multiple: from, to. |
DTV | Dative | grammatical role | Only used if there is a double-object variant. Also ablative meaning: ask a question [of X]. But anything with for is BNF. Not used on indirect object! |
EXT | Extent | adverbial | Distance, amount. Not for obligatory complements, e.g. of weigh. |
HLN | Headline | misc | Marks the whole phrase; not actually a role. |
LGS | Logical subject | grammatical role | The NP in a passive by-phrase. |
LOC | Locative | adverbial | |
MNR | Manner | adverbial | |
NOM | Nominal | form vs function | Marks headless relatives behaving as substantives. Not actually a role. Co-occurs with SBJ and other argument roles. |
PRD | Predicate | grammatical role | Any predicate that is not a VP. Also, the so in do so. |
PRP | Purpose or reason | adverbial | |
PUT | Locative of put | grammatical role | |
SBJ | Subject | grammatical role | |
TMP | Temporal | adverbial | |
TPC | Topicalized | grammatical role | Only if there is a trace or resumptive pronoun after the subject. |
TTL | Title | misc | The title of a work, implies NOM. Marks the whole phrase; not actually a role. |
VOC | Vocative | grammatical role |
The module perseus contains small Latin and Greek treebanks from Project Perseus. The main method for these treebanks is stemmas(), which returns an iterator over the stemmas in the treebank. (Yes, "stemmata" is the correct plural, but it seems excessively pedantic, so we have anglicized.)
>>> from seal.data import perseus
>>> stemmas = list(perseus.latin.stemmas())
>>> len(stemmas)
3473
>>> print(stemmas[0])
0 *root*  _         _       _    _
1 In      r-------- in1     AuxP 4
2 nova    a-p---na- novus1  ATR  7
3 fert    v3spia--- fero1   PRED 8
4 animus  n-s---mn- animus1 SBJ  2
5 mutatas t-prppfa- muto1   ATR  6
6 dicere  v--pna--- dico2   OBJ  2
7 formas  n-p---fa- forma1  OBJ  5
8 corpora n-p---na- corpus1 OBJ  0
A dataset has a language and a version. Languages are specified as ISO 639-3 codes. There are currently five versions, as follows. The original CoNLL treebanks from the 2006 shared task have version orig. Datasets converted to the Das-Petrov universal tagset (DPU) have version umap. The Universal Dependency Treebank (UDT) with standard encoding has version uni, and with content-head encoding, version ch. The Penn Treebank (PTB) converted to dependencies using my adaptation of the Magerman-Collins (MC) rules has version dep; the same converted to the Das-Petrov tagset has version umap. The following table lists the currently available datasets. (DPU = Das-Petrov Universal tagset; UDT = Universal Dependency Treebank.)
Name | Lg | Ver | Description |
---|---|---|---|
arb.orig | arb | orig | CoNLL-2006 Arabic |
arb.umap | arb | umap | CoNLL-2006 + DPU, Arabic |
bul.orig | bul | orig | CoNLL-2006 Bulgarian |
bul.umap | bul | umap | CoNLL-2006 + DPU, Bulgarian |
ces.orig | ces | orig | CoNLL-2006 Czech |
ces.umap | ces | umap | CoNLL-2006 + DPU, Czech |
dan.orig | dan | orig | CoNLL-2006 Danish |
dan.umap | dan | umap | CoNLL-2006 + DPU, Danish |
deu.ch | deu | ch | UDT, content-head, German |
deu.orig | deu | orig | CoNLL-2006 German |
deu.umap | deu | umap | CoNLL-2006 + DPU, German |
deu.uni | deu | uni | UDT, German |
eng.dep | eng | dep | Penn Treebank, MC heads |
eng.umap | eng | umap | Penn Treebank, MC heads + DPU |
fin.ch | fin | ch | UDT, content-head, Finnish |
fra.ch | fra | ch | UDT, content-head, French |
fra.uni | fra | uni | UDT, French |
ind.uni | ind | uni | UDT, Indonesian |
ita.uni | ita | uni | UDT, Italian |
jpn.uni | jpn | uni | UDT, Japanese |
kor.uni | kor | uni | UDT, Korean |
nld.orig | nld | orig | CoNLL-2006 Dutch |
nld.umap | nld | umap | CoNLL-2006 + DPU, Dutch |
por.orig | por | orig | CoNLL-2006 Portuguese |
por.umap | por | umap | CoNLL-2006 + DPU, Portuguese |
por.uni | por | uni | UDT, Portuguese |
slv.orig | slv | orig | CoNLL-2006 Slovenian |
slv.umap | slv | umap | CoNLL-2006 + DPU, Slovenian |
spa.ch | spa | ch | UDT, content-head, Spanish |
spa.orig | spa | orig | CoNLL-2006 Spanish |
spa.umap | spa | umap | CoNLL-2006 + DPU, Spanish |
spa.uni | spa | uni | UDT, Spanish |
swe.ch | swe | ch | UDT, content-head, Swedish |
swe.orig | swe | orig | CoNLL-2006 Swedish |
swe.umap | swe | umap | CoNLL-2006 + DPU, Swedish |
swe.uni | swe | uni | UDT, Swedish |
tur.orig | tur | orig | CoNLL-2006 Turkish |
tur.umap | tur | umap | CoNLL-2006 + DPU, Turkish |
The name of a dataset is language-dot-version, for example dan.orig. The function dataset() gives access to a dataset by name:
>>> from seal.data import dep
>>> dep.dataset('dan.orig')
<Dataset dan.orig>
The function datasets() gives access to sets of datasets. Language or version may be specified:
>>> dep.datasets(lang='dan')
[<Dataset dan.orig>, <Dataset dan.umap>]
>>> len(dep.datasets(version='orig'))
18
>>> len(dep.datasets())
52
The class Dataset represents a treebank. There are two specializations, UMappedDataset and FilterDataset. Each dataset has a name, a description, a language represented as an ISO 639-3 code, and a version.
>>> ds = dep.dataset('dan.orig')
>>> ds.name
'dan.orig'
>>> ds.desc
'Danish, CoNLL-2006'
>>> ds.lang
'dan'
>>> ds.version
'orig'
Simple datasets also have a training file pathname, a test file pathname, and (sometimes) a dev file pathname. (To be precise, datasets in the uni and ch collections have a dev file pathname, but orig datasets do not.) The pathnames are also available for umapped datasets, but the files contain the original (unmapped) trees. Filter datasets do not have pathnames.
>>> ds.train[ds.train.find('conll'):]
'conll/2006/danish/ddt/train/danish_ddt_train.conll'
>>> ds.test[ds.test.find('conll'):]
'conll/2006/danish/ddt/test/danish_ddt_test.conll'
>>> ds.dev
>>>
A dataset instance has a sents() method that generates sentences for a specified section of the treebank. All treebanks have 'train' and 'test' sections. In addition, uni and ch datasets have a 'dev' section, and the English datasets have 'dev_train', 'dev_test', and 'reserve_test' sections.
>>> sents = list(ds.sents('train'))
>>> len(sents[0])
14
A convenience function called sents() is also available to retrieve the sentences for a particular segment of a dataset directly:
>>> sents = list(dep.sents('dan.orig', 'train'))
A sentence can be viewed as a list of records. Word 0 is always the root pseudo-word; "real" words start at position 1. The length of the sentence includes the root, so the last valid index is the length minus one.
>>> s = sents[0]
>>> s[0]
<Word 0 *root*>
>>> s[1]
<Word 1 Samme/AN:ROOT (/A.degree=po...) govr=0>
>>> s[13]
<Word 13 ./XP:pnct (/X) govr=1>
The Sentence and Word classes were discussed earlier. Each record is represented by a Word instance, with ten fields: i, form, lemma, cpos, cat, morph, govr, role, pgovr, and prole. The field cpos represents the coarse part of speech, and cat represents the fine part of speech. The fields pgovr and prole represent the word's governor and role in the projective stemma. They may not be available. The fields govr and role are always available, but they are not guaranteed to be projective.
All fields except i, govr, and pgovr are string-valued. If not available, their value is the empty string. The values for i, govr, and pgovr are integers. If they are not available, their value is None. The fields i and govr are always available, except that word 0 has no govr.
The values for govr and pgovr can be used as an index into the sentence, with the value 0 representing the root.
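Using govr as an index back into the sentence looks like this. The sketch below uses a toy sentence of (form, govr) pairs rather than seal's Word objects:

```python
# Toy sentence: (form, govr) pairs; index 0 is the root pseudo-word,
# whose govr is None.
sent = [('*root*', None), ('Samme', 0), ('cifre', 1), (',', 1)]

def governor_form(sent, i):
    """Return the form of word i's governor, using govr as an index
    into the sentence; the governor of a root-attached word is *root*."""
    govr = sent[i][1]
    return None if govr is None else sent[govr][0]
```

For example, word 2 ('cifre') is governed by word 1, so governor_form(sent, 2) yields 'Samme'.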
One can get just a list of word forms (strings) using the method words(). This provides suitable input for a standard parser. The root pseudo-word is not included. The method nwords() returns the number of words excluding the root.
>>> ws = s.words()
>>> ws[:3]
['Samme', 'cifre', ',']
>>> len(ws)
13
>>> s.nwords()
13
A sentence provides separate methods for each of the word attributes, indexed by the word number, with 0 being the root pseudo-word.
>>> s.form(0)
'*root*'
>>> s.form(1)
'Samme'
>>> s.form(13)
'.'
The attributes are as listed above: form, lemma, cpos, cat, morph, govr, role, pgovr, and prole.
>>> s.form(2)
'cifre'
>>> s.lemma(2)
''
>>> s.cpos(2)
'N'
>>> s.cat(2)
'NC'
>>> s.morph(2)
'gender=neuter|number=plur|case=unmarked|def=indef'
>>> s.govr(2)
1
>>> s.role(2)
'nobj'
Word forms need not be ASCII.
>>> from seal.core.misc import as_ascii
>>> as_ascii(s.form(12))
'v{e6}rtsnation'
Without as_ascii, the form would print as "værtsnation."
One can fetch a column as a tuple using the method column().
>>> g = s.column('govr')
>>> g[:5]
(None, 0, 1, 1, 7)
If desired, one can create a Sentence as follows.
>>> from seal.nlp.dep import Sentence, Word
>>> s = Sentence()
>>> s.append(Word(1, 'This', ('PRON', 'PRON'), 'this', '', 2, 'subj'))
>>> s.append(Word(2, 'is', ('VB', 'VB'), 'be', '', 0, 'mv'))
>>> s.append(Word(3, 'a', ('DT', 'DT'), 'a', '', 4, 'det'))
>>> s.append(Word(4, 'test', ('N', 'N'), 'test', '', 2, 'prednom'))
The numbers must be sequential from 1; they provide a quality check.
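The quality check on word numbers is just a comparison against the expected sequence; a minimal sketch (hypothetical helper, operating on the index values themselves):

```python
def check_indices(indices):
    """True if the word numbers run 1, 2, 3, ... with no gaps or repeats."""
    indices = list(indices)
    return indices == list(range(1, len(indices) + 1))
```

For example, [1, 2, 3, 4] passes the check, while [1, 3, 4] (a gap at 2) fails.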
On disk, the training and test files are in CoNLL dependency format. The sents() method uses seal.nlp.dep.conll_sents() to read them:
>>> from seal.nlp.dep import conll_sents
>>> f = conll_sents(ds.train)
>>> s = next(f)
>>> len(s)
14
The file seal.ex.depsent1 provides an example of the file format:
1 This this pron pron _ 2 subj 2 subj
2 is is vb vb _ 0 mv 0 mv
3 a a dt dt _ 4 det 4 det
4 test test n n _ 2 prednom 2 prednom
Each sentence is (obligatorily) terminated by an empty line. Fields are separated by single tab characters. There are ten fields: id, form, lemma, cpos, fpos, morph, govr, role, pgovr, prole.
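A reader for this format is short. The sketch below parses the generic layout just described (tab-separated fields, blank-line sentence terminators); it is an illustration, not seal's conll_sents():

```python
def read_conll(lines):
    """Parse CoNLL-style dependency records: one token per line, fields
    separated by tabs, each sentence terminated by an empty line.
    Yields one list of field-lists per sentence."""
    sent = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:            # empty line: sentence boundary
            if sent:
                yield sent
                sent = []
        else:
            sent.append(line.split('\t'))
    if sent:                    # tolerate a missing final blank line
        yield sent
```

Feeding it the two-token example above (with tab separators) yields a single sentence whose first record has form 'This'.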
The 'umap' versions of the treebanks are mapped from the 'orig' versions using the tag tables of Petrov, Das & McDonald [3300]. They are instances of UMappedDataset, which uses UMappedDepFile.
>>> ds = dep.dataset('dan.umap')
>>> s = next(ds.sents('train'))
>>> s[1].form
'Samme'
>>> s[1].cat
'ADJ'
The BioNLP dataset contains biomedical texts with annotations.