The seal.cld.db package provides a database implemented as a directory hierarchy whose files and subdirectories represent database objects.
A File represents a persistent object, that is, one whose contents live on disk. Files support multiple users, including a permissions system and file locking and backup. The main complexity lies in synchronizing changes between the object in memory and the disk representation.
One does not directly instantiate File, rather, one defines specializations of File. When defining a specialization, the most basic information that is required is how to write the contents of the object to a stream, and how to read the contents from a stream. The methods __write__ and __read__ implement those operations, and must be provided by the specialization.
One also does not directly instantiate the specialization. Rather, a File is obtained from a persistent Directory, either by calling the Directory's __getitem__ method to obtain an existing object or by calling its new_child method to create a new object. (The root object is created by calling open_database, discussed below.) Each child has a name (a string) that uniquely identifies it relative to the parent directory.
When instantiating a child, the Directory determines which class to instantiate by determining the child's typename. The environment, which is accessible as the File member env, provides a one-one mapping between typenames (which are strings) and classes. The child's typename is determined as follows:
A File's representation on disk is created by the method __create__. The new_child method immediately calls __create__. The function create_database is used to create root File. It calls open_database to instantiate the root file, and then it calls the file's create method (no underscores). The create method is only available for root Files.
When a File is instantiated, its contents are not immediately loaded. The reason is that one should be able to list the contents of a directory without loading into memory the entire hierarchy, or even all the files in the directory. To break the recursion, Files are initially instantiated as unloaded stubs.
The possible states that a File may be in are as follows:
The following methods change the object's state:
Some of the actions signal an error unless the user has proper permissions. In particular:
To change the permissions of a File or Directory, the user must have admin permission.
When defining a specialization of File, it is the specializer's responsibility to manage the contents. To assure that the object behaves as expected, one should adhere to the following discipline:
Contents reside in private members, and the only access to the contents or to any part of them is via the object's access methods (like __getitem__) and update methods (like __setitem__).
Contents and parts of contents are either immutable, or they are constructed from "virtual view" classes that dispatch to the access and update methods of the File.
Each access method definition begins with a call to require_load, assuring that the content members have been set.
Each update method definition is protected by being wrapped in with self.writer(). The writer immediately calls require_load, so one can count on the content members existing. An update method is not obliged to modify the contents, but if it does, it should call modified.
The __read__ method is given an input stream, and it should set the content members. The first time it is called, the input stream will be empty; it should accommodate that case. It should not count on the content members being undefined: it is also used to re-initialize the File after a failed save.
The __init_contents__ method is called by __create__. The default implementation is a no-op, but specializations may override it to initialize the content members. It is an update method, meaning that it should either be protected with a call to with self.writer() or it should only call protected update methods. Since the writer calls require_load, the __read__ method actually gets called before __init_contents__ is called.
The __write__ method is given an output stream as argument. It should write the contents in the form expected by __read__.
One may specify that this File requires one or more other Files. Then any time a writer is created for this file, the required files are automatically added to it. (See section XX below.)
In addition to the usual contents, a File may contain metadata. For example, any specialization that sets __has_permissions__ to True will have a permissions metadata item. Metadata items are a special form of content, and are treated accordingly. They reside in private members; accessors should call require_load; and updaters should be protected by a writer.
A metadata item should be a specialization of Metadata, not of File. Among other things, a metadata item is not independently loadable and cannot be moved or deleted. The File that the metadata item belongs to is called the host. There is an implementation issue that is handled by the Metadata class: the require_load and writer methods do not directly call __load__ and __save__ (which do not exist for Metadata), but rather dispatch to the corresponding methods of the host.
A specialization of Metdata should have __read__ and __write__ methods, like a File. When the host's contents are read or written, calls are also placed to the __read__ or __write__ method of all of its metadata items.
There are two constraints, which are due to my laziness. The __write__ method must not write the line "##EOM", which is used as a separator between metadata sections. And the last line that is written must end in a newline.
To add metadata items, a specialization should set the class member __metadata__ to be a tuple that extends the parent class's value. Elements are pairs (member, class). For example:
class MyObject (Directory): __metadata__ = Directory.__metadata__ + (('_foo', Foo),)
The following provides a summary of the members and methods of File. Some of the following have not yet been introduced, but will be introduced in later sections.
Members:
Instantiation:
Creation:
Reading:
Writing:
Hierarchy modification:
A Directory behaves like a dict that maps names to child Files. I have already mentioned that the method __getitem__ accesses an existing child, and new_child adds a new child to the Directory.
Directory is a specialization of File, so one Directory may be a sub-Directory of another. The absolute location of a File can be given as a path, which is a sequence of names. The method follow takes a path as argument. If the path contains only one name, it simply calls __getitem__ with that name. Otherwise, it calls __getitem__ with the first name to get a subdirectory, and passes the remaining names to the subdirectory's follow method.
We have already encountered the methods that modify the directory hierarchy:
The set of children represents the Directory's primary contents. The children are read by the __read__ method and written by the __write__ method, and they are stored in the private member _children. Hence, specializations do not need to define __read__ and __write__.
Summary of members and methods of Directory:
The module seal.cld.db.core contains two functions for opening a database. An existing database is opened by calling open_database, and a new database is created by calling create_database. A database is really just a root File, that is, a File whose parent is None. The main thing that sets a database apart from any other File is that it creates an Environment for itself and its descendants. A root File also automatically includes a GroupsFile metadata item for use of the permissions system.
The Environment is a dict-like object containing global information. It contains a pointer to the database, under the key 'root', and it contains the tables that map between typenames and classes. If there is an index, it also contains a pointer to the index. (Environments are discussed in more detail later.)
Files other than the root are allowed to create an index, by having a list of typenames in the class variable __indexed__. A File with an index creates a fresh copy of its parent's environment, and sets 'index_root' to itself.
If a File is neither a root nor indexed, it simply uses its parent's environment unmodified.
The motivation for having indices is as follows. Organizing Files into a hierarchy, rather than using the usual relational representation, has the advantage that entire subhierarchies can be moved or deleted as a unit. It has the disadvantage that one requires a path, and not just a name, to find an object. Indices make it possible to access Files by identifier instead of path, where an identifier is a pair (typename, name).
A simple identifier suffices for Files that are indexed in the root directory. When there are multiple indices, we must also include the index name in identifiers. A global identifier is a triple (indexname, typename, name).
When writing specializations of Directory, one may also write specializations of Environment. To link the two, define the File method create_env. The default implementation instantiates and returns Environment, provided that the File is either the root or indexed. Otherwise, it returns None, which indicates that the parent environment should be used.
To delete a database that one has opened, call its delete method. A function delete_database is also available that takes just a filename.
A Directory is implemented as a disk directory containing a distinguished file called _children, which contains the actual contents. The distinction between the disk directory and the file _children is hidden in the implementation.
The distinction resides primarily in associating two different disk filenames with a Directory object. The method _filename returns the absolute pathname of the disk representation for the sake of relocating or deleting the File. In the case of a Directory, it returns the filename of the disk directory. The method _contents_filename returns the absolute pathname of the disk representation for the sake of loading and saving the File. In the case of a Directory, it returns the pathname of the _children file. In the case of a File that is not a Directory, _filename and _contents_filename are the same.
Because a File may be moved, we would like to minimize the number of things that need to be updated if a File's location in the hierarchy changes. There are three aspects to the issue: a File should be self-contained, a File should not cache context-dependent information, and external information that needs updating when the File moves or is deleted should be minimized.
Self-containedness. All pieces of the File should be containined within the physical file that represents it. To give an example: we wish to be able to identify a file's type by inspection, and one way of doing that would be to place type information in the parent's list of children. However, doing so would reduce self-containedness: the type information would reside outside the child file and would need to be updated if the child moved. Instead, in the current implementation, the filename on disk combines the File's name and typename.
This consideration also motivates the current implementation of metadata (including permissions), in which metadata is represented on disk as a section of the same file that contains the File contents. A less acceptable approach would be to store permission information in a sibling file.
No context-dependent cached information. To give an example, it would be natural to cache a File's physical disk location, but that information would need to be updated if the File is moved. In the current implementation, the pathname is always computed rather than cached.
One unavoidable form of context caching is the env member, which contains global information. The current approach is to insist that env be essentially immutable, and hence to restrict a File from being attached to a new parent whose value from env differs from the File's.
Minimizing location-dependent external records. If the File has an indexed type, then information about the location of a file is stored in an index. That is hardly avoidable, and must be updated if the File is relocated. In the current implementation, that is the main case of external information that must updated, apart from the obvious modifications to the new and old parents.
There are a couple of additional dependencies that arise in CLD.
user.media
PropDict maps suffixed names to
text IDs. If the text is deleted, the PropDict needs to be updated.A Lexicon contains lists of references to locations where tokens of a given lemma occur. If a TokenFile is deleted, references to it need to be deleted as well.
To manage updates to external records, there are three methods of File that can be used to notify resources of relevant events:
Created. A file is created by the Directory's new_child method or by its own create method (in the case of a root File). A newly created file receives a created call, which creates an entry in the index, if appropriate, and may be wrapped by specializations.
Moved. A file is relocated by its own reparent method. The file and all its descendants receive a moved call, which updates the index entries and may be wrapped by specializations.
Deleted. A file is deleted by its own delete method. The file and all its descendants receive a deleted call, which deletes the index entries and may be wrapped by specializations.
When an entire directory is moved or deleted, the reparent or delete method walks the subhierarchy and notifies each descendant of the change.
In the current implementation, metadata is stored in the same disk file as the File contents. For this reason, all Files are opened as text files using UTF-8 character encoding. Binary files such as audio and video are stored separately; they cannot directly provide the contents of a File.
The input stream passed to __read__ is not a genuine stream, but rather a "pseudo-stream" (of type MetadataInputStream) that is merely an iterator over lines of text. One MetadataInputStream is created for each section of the file metadata header.
When a File is initialized, the Environment and Metadata items are instantiated. To keep things from becoming an unmanageable snarl, the sequence is strictly as follows:
The exact sequence is as follows:
One uses open_database to instantiate the root File. It takes the class of the root File and a filename as arguments. It sets the env variable 'filename' to the filename and updates env with any additional keyword arguments. If the disk representation does not already exist, one should use create_database instead of open_database.
A specialization of Environment is instantiated within create_env, and keys are subsequently added to it. The __init__ method should not expect all keys to be present. The File initializer adds some keys, and open_database adds some more.
When Environment is instantiated, the File already exists; it is called the host. Environment.__init__ accesses the host's types variable and default_types variable to set up type tables.
An import paradox arises in setting the default_types in seal.cld.db.env. The env module is imported by file, which is imported by dir. But the classes needed to set default_types reside in file and dir. The solution is to set seal.cld.db.env.default_types in the __init__.py file.
Read checks are performed when one does require_load() - specifically, in File.__load__, just before calling the __read__ methods of the metadata items and the File itself. Write checks are performed when one does "with self.writer()" - specifically, in the Writer.lock_all method, which is called when the writer is entered, or when a File is added to an active Writer. The write check is done just before locking the file, which is the first step in writing it.
Permissions are stored in the File member _perm. A File may have independent Permissions, or it may have InheritedPermissions that always just defer to the parent. The class variable __has_permissions__ determines whether it has independent permissions or not. If it does have independent permissions, '_perm' is added at the beginning of the __metadata__ list (in __init__).
Permission-system items are metadata items of class Permissions or GroupsFile. They require special treatment.
(1) Permission-system items require admin checks for writing. Those checks are placed in the writer method of PermItem (of which both Permissions and GroupsFile are specializations); the writer method is called by all methods that modify contents.
One might expect the check to be placed in the __write__ method, but that would be incorrect. The Permissions metadata item gets written any time one writes its host, and to do that, one only needs write permission, not admin permission.
(2) Anyone should be allowed to examine permissions, even if they do not have read access to the Files that host the permission-system items. It is not enough to postpone the read check till after reading the permission-system items: one should not signal an error at all for reading permission-system items, only for attempting to read the File contents or other metadata files.
To deal with that issue, permission-system items have a require_load method that does not simply dispatch to their host.
The new_child method is the standard way to create a new file. There is also a method called need_child that returns an existing child, if available, and otherwise dispatches to new_child. (The child name is obligatory for need_child, but not for new_child.)
The new_child method is called with some subset of the keyword arguments name, suffix, and cls. Its behavior is influenced by two class variables: signature and childtype. The signature maps child names to classes, and childtype provides a default class. Both are optional.
There are four cases:
env['suffixes']
. If not found, an error is
signalled.env['types']
is used to get the class. If there
is no entry, an error is signalled.env['suffixes']
is used to get the suffix. If there is no entry, an error is
signalled.env['types']
maps the suffix to the given class,
otherwise an error is signalled.If the name is not provided, a call is placed to the allocate_name method of the index table. If there is no index table, an error is signalled.
New_child then does the following:
The File method writer dispatches to the function writer of seal.cld.db.disk. The function accepts any number of Files as arguments. It tosses out any that are already protected, a File being protected if it has a value for _writer. If any unprotected Files remain, a Writer instance is created and each unprotected File is added to it. The Writer instance is returned, or a dummy writer, if there are no unprotected Files on the list.
A Writer instance maintains a list of files. Adding a file consists of the following steps. It is possible to add additional Files to an existing Writer, so a check is first done to see if the File is protected; if so, no further action is taken. Otherwise, the File is appended to files, its _writer member is set to the Writer, and each of the Files that it requires is added to the Writer. Finally, if the Writer is already active, any new unlocked Files are locked.
A Writer should be created in the context of a with-statement. Accordingly, it expects a call to __enter__, and later a call to __exit__. The file becomes active when __enter__ is called, and becomes inactive again when __exit__ is called. When a Writer becomes active, all Files on the files list are locked. (Any Files added while the Writer is active are locked when they are added.)
Locking a File consists of the following steps. The Writer has a locks member that is None when the Writer is inactive, and a list when the Writer is active. To lock a File, a check is first done that it is writable, then its require_load method is called, and finally a Lock object is created for it and placed on the locks list. The Lock object locks the file on disk. While the Writer is active, each of the the locked files is said to be under the control of the writer.
On __exit__, each of the files is saved. (Unless an error has occurred, in which case each of the files is reloaded.) The method __save__ opens a temporary file for writing, and passes the resulting output stream to __write__. If writing completes without error, the current version of the file becomes a backup and the temporary file is moved into its place. After saving, each of the locks is released.
If an error occurs at any point - whether the error arises in the write-permission check, or in the body of the with statement, or during __write__ - then all of the files are restored to their saved state by calling _reload, which calls the __read__ method to restore the content members.
For some file classes, every time a file of the class is placed under the control of a writer, there is a related file that should be put under the control of the same writer. For example, when editing a TokenFile, one also needs the associated Lexicon to be editable, so that one can intern new word forms. There is a File method requires that can be overridden to provide an iteration over the required files. (The default implementation returns the empty list.) When the File is added to the Writer, its requires method is called, and each File that is returned is also added to the Writer.
Reparenting consists of the following steps:
The delete method does the following:
The Directory method reorder_children takes a list of child indices, and a target index. The target child is the child at the target index, or end-of-list. The children indicated by the indices are removed from the child list, then reinserted as a group, in the order given, just before the target child.
This only involves the _children list; it does not call reparent.
The Directory method reparent_children takes the same arguments: a list of child indices and a target index. The target index identifies the target child, which must be a real child and not end-of-list.
The target child's attachment_target method is called to determine the new parent. The default implementation of attachment_target returns the child itself, but classes may override it. (The method need_child is useful here.) The new parent's attachment_target method must return the new parent itself; otherwise an error is signalled.
To give a concrete example from CLD, a Text is initially created as a stub. Conceptually, one text attaches to another, but to be precise, a complex Text contains a Toc, and the child Text attaches to the Toc. Nonetheless, we may provide a Text as the target of reparent. The target Text's attachment_target method returns the Toc, creating it if necessary.
The children indicated by the indices are processed in the order given.
For each, child.reparent(newparent)
is called. If the
child originally preceded the new parent, it is attached at the
beginning of the new parent's children, and if it followed the new
parent, it is attached at the end. If multiple children are attached
at beginning or end, they occur in the order their indices are
listed.
The Directory method delete_children is called with a list of indices. The children at those indices are deleted by calling their delete method.
A method delete_child is also provided as a convenience. It takes a single index.
A CLD corpus provides an example of a Database. First, let us create a CLDManager:
>>> from seal.cld.toplevel import CLDManager >>> mgr = CLDManager('/tmp/foo.cld')
A string argument is interpreted as the application filename, which for CLD is the corpus. The corpus does not yet exist; we can create a corpus for testing using the 'create_test' command:
>>> mgr('create_test') Create corpus '/tmp/foo.cld' ...
A lot of output is generated as each file is created. Incidentally, create_test also creates a file called media in the same directory as the corpus (namely, /tmp).
Now we can load the corpus:
>>> corpus = mgr.corpus() >>> corpus <Corpus /tmp/foo.cld>
A Corpus object is a specialization of Database, which is a specialization of Directory. It behaves like a dict in which the keys are names of children:
>>> corpus['langs'] <LanguageList langs>
Like a dict, its length is the number of keys, and converting it to a list returns a list of keys:
>>> len(corpus) 4 >>> list(corpus) ['langs', 'users', 'roms', 'glab']
In this particular case, all four children are subdirectories. If one recurses far enough, one reaches a file:
>>> orig = corpus['langs']['deu']['texts']['1']['toc']['2']['orig'] >>> orig <TokenFile orig>
One can distinguish a file from a directory using the method is_directory():
>>> corpus.is_directory() True >>> orig.is_directory() False
Instead of repeatedly accessing members, one may use the follow() method, which takes a pathname:
>>> corpus.follow('/langs/deu/texts/1/toc/2/orig') <TokenFile orig>
Files may also behave like dicts or lists, depending on their contents. In this particular case, a TokenFile behaves like a list of TokenBlocks (essentially, tokenized sentences):
>>> list(orig) [<TokenBlock ('2', 0)>, <TokenBlock ('2', 1)>]
One may also go up the hierarchy using the method parent():
>>> orig.parent() <Text 2>