For purposes of export and import, a corpus is treated as a set of content items, of the following types:
Each content item has an ID that consists of a container name and a local ID that identifies it within the container. Containers include languages, the romanization registry, and GLab. The name of a container is unique within the corpus. The name for a language is the standard 3-letter ID, the name of the registry is roms, and the name for GLab is glab. (Uniqueness is guaranteed because neither 'roms' nor 'glab' is a valid language name.) An item's local ID is unique within its container. A local ID is not necessarily unique within the corpus, but the combination of container name and local ID yields a global ID that is unique. The local ID for a GLab notebook combines the user name and notebook number, e.g. abney-1. (User names are restricted to alphanumeric characters and underscore: see UserList.)
An instance of Container represents a view of a directory as a named container, and an instance of Item represents a view of either a file or directory as an item. Both Containers and Items may have children, which are Items. The descendants of a Container are computed as the transitive closure of its children, and the descendants of an Item are computed as the reflexive-transitive closure of its children.
As a unit of export, an item is identified with a list of kernel files. Two items may have overlapping sets of descendants, but no two items have any kernel files in common.
The files constituting the kernels are as follows. The C column indicates whether the item may have children.
Item | Kernel Files | C |
---|---|---|
Language | info | Y |
Lexicon | lexicon | - |
Text | orig, trans, media, clips, paras, transcript, toc | Y |
Romanization | rom | - |
Notebook | gn | - |
The cld script is used for corpus management, including export and import. Exporting creates a serialized ASCII representation of the items, or some subset of the items, of the corpus. Importing reads the serialized representation and creates or appends to files in the corpus.
The list, export, and delete commands take a list of selectors to indicate which items are to be included. Possible selectors are:
In this way, processing the list of selectors yields a list of selected named containers, and within each named container, a list of selected items. It is possible to select a container without selecting any items in it. In that case, the item wild-card is supplied, selecting all items in the container.
The commands are list, export, delete, and import. The first three accept a selection.
A language serves both as a Container and as an Item. Hence we must be able to distinguish between selecting a language as container (which includes all its descendants) versus selecting the language as item (just the info file, not the descendants). The language name denotes the container. The item has name 'language' within the container.
The Item.com_export method receives an ExportStream as its first argument. The sole method of interest for ExportStream is write_record. It takes any number of arguments, and writes them to the stream, separating them with tabs and terminating the line with a newline. (One should include neither tabs nor newlines in the call to write_record.) The first field should always be the record type; see the tables above.
The method Item.start_block writes the 'I' record that starts the item block.
A call to Item.com_import receives two arguments: an ImportStream and a type, which is the second field of the 'I' record.
When importing from a stream, an issue that arises is what to do when the relpath given in the stream is already occupied by an item in the corpus. The answer depends on the item type and the value of the onconflict import argument.
In the case of texts and notebooks, during import a new item is always allocated. The relevant containers (language, glab) maintain a mapping from stream-internal IDs to corpus IDs. The intern_item methods of the containers consult and update the table.
An complication arises in the case of texts. Unlike with other items, the position of a text in the corpus hierarchy is not uniquely determined by its container name plus ID. The information is encoded in two places in the export stream. A "relpath" entry is added to the text info file. Toc items list their children in the textinfo file, which determines not only the hierarchy but also the order.
An export file conventionally has the file suffix 'ef'. It is a tabular ASCII file; records are terminated with newline and fields are separated by tabs.
The first field of each record contains the record type, which determines the number and interpretation of the remaining fields. The following table lists the record types, along with the additional fields that a record of that type contains:
Record Type | Additional Fields | |
---|---|---|
A | Media | media-suffix, user-name, filename |
'__default__', media-suffix | ||
C | Snippet ("clip") | start, end, new-sentence-flag |
E | Translation ("English") | translation |
F | Lexical field name | index, long-name |
I | Starts a new item | item-type, container-name, local-id, item-subtype, path |
L | Lexical entry | form, sense-number, lexical-field* |
M | Metadata | attribute, value |
N | Notebook line | notebook-line |
P | Property | attribute, value |
R | Romanization record | romanized, unicode |
S | New translation unit | |
T | Token | form, sense-number, lpunc, rpunc |
Records are grouped into item blocks. An item block contains all the information for a single item. It begins with an 'I' record and continues up to the next 'I' record or end of file. There are five types of item block, as follows:
The item-type is 'text', the container-name is the language ID, the local-ID is numeric. The item-subtype is one of: 'orig', 'media', 'toc', 'stub'. The path starts with the language ID, and gives the location of the text in the hierarchy of texts.
The records that represent the contents of the text come in the following order:
Not all texts have all blocks. Only a media file has 'A' and 'C' records. Media files and original texts both have 'S', 'T', and 'E' records. The 'S' and 'T' records are obligatory for an original text. For a media file, they are optional, and represent the snippet contents, not translation units. 'E' records are always optional. Texts of type 'toc' or 'stub' contain only 'P' records.
The item-type is 'language', the container-name is the language ID, the local-ID is 'language'. Item-subtype and path are empty. The contents consist of:
The item-type is 'rom', the container-name is 'roms', and the local-ID is the name of the romanization. Item-subtype and path are empty. The contents consist of:
The item-type is 'notebook', the container-name is 'glab', the local-ID has the form user/number. The contents are: