gentrie package

Module contents

Package providing a generalized trie implementation.

This package includes classes and functions to create and manipulate a generalized trie data structure. Unlike common trie implementations that only support strings as keys, this generalized trie can handle various types of tokens, as long as they are hashable.

Usage:

Example 1:

from gentrie import GeneralizedTrie, TrieEntry

trie = GeneralizedTrie()
trie.add(['ape', 'green', 'apple'])
trie.add(['ape', 'green'])
matches: set[TrieEntry] = trie.prefixes(['ape', 'green'])
print(matches)

Example 1 Output:

{TrieEntry(ident=2, key=['ape', 'green'])}

Example 2:

from gentrie import GeneralizedTrie, TrieEntry

# Create a trie to store website URLs
url_trie = GeneralizedTrie()

# Add some URLs with different components (protocol, domain, path)
url_trie.add(["https", "com", "example", "www", "/", "products", "clothing"])
url_trie.add(["http", "org", "example", "blog", "/", "2023", "10", "best-laptops"])
url_trie.add(["ftp", "net", "example", "ftp", "/", "data", "images"])

# Find all https URLs with "example.com" domain
suffixes: set[TrieEntry] = url_trie.suffixes(["https", "com", "example"])
print(suffixes)

Example 2 Output:

{TrieEntry(ident=1, key=['https', 'com', 'example', 'www', '/', 'products', 'clothing'])}}

Example 3:

from gentrie import GeneralizedTrie, TrieEntry

trie = GeneralizedTrie()
trie.add('abcdef')
trie.add('abc')
trie.add('qrf')
matches: set[TrieEntry] = trie.suffixes('ab')
print(matches)

Example 3 Output:

{TrieEntry(ident=2, key='abc'), TrieEntry(ident=1, key='abcdef')}
gentrie.GeneralizedKey

A GeneralizedKey is an object of any class that is a Sequence and that when iterated returns tokens conforming to the Hashable protocol.

Examples

  • str

  • bytes

  • list[bool]

  • list[int]

  • list[bytes]

  • list[str]

  • list[Optional[str]]

  • tuple[int, int, str]

alias of Sequence[Hashable | str]

class gentrie.GeneralizedTrie

Bases: object

A general purpose trie.

Unlike many trie implementations which only support strings as keys and token match only at the character level, it is agnostic as to the types of tokens used to key it and thus far more general purpose.

It requires only that the indexed tokens be hashable. This is verified at runtime using the gentrie.Hashable protocol.

Tokens in a key do NOT have to all be the same type as long as they can be compared for equality.

It can handle a Sequence of Hashable conforming objects as keys for the trie out of the box.

You can ‘mix and match’ types of objects used as token in a key as long as they all conform to the Hashable protocol.

The code emphasizes robustness and correctness.

Warning

GOTCHA: Using User Defined Classes As Tokens In Keys

Objects of user-defined classes are Hashable by default, but this will not work as naively expected. The hash value of an object is based on its memory address by default. This results in the hash value of an object changing every time the object is created and means that the object will not be found in the trie unless you have a reference to the original object.

If you want to use a user-defined class as a token in a key to look up by value instead of the instance, you must implement the __eq__() and __hash__() dunder methods in a content aware way (the hash and eq values must depend on the content of the object).

add(key: Sequence[Hashable | str]) int

Adds the key to the trie.

Parameters:

key (GeneralizedKey) – Must be an object that can be iterated and that when iterated returns elements conforming to the Hashable protocol.

Raises:

InvalidGeneralizedKeyError ([GTA001]) – If key is not a valid GeneralizedKey.

Returns:

Id of the inserted key. If the key was already in the trie, it returns the id for the already existing entry.

Return type:

TrieId

clear() None

Clears all keys from the trie.

Usage:

trie_obj.clear()
items() Generator[tuple[int, TrieEntry], None, None]

Returns an iterator for the trie.

The generator yields the TrieId and TrieEntry for each key in the trie.

Returns:

Generator for the trie.

Return type:

Generator[tuple[TrieId, TrieEntry], None, None]

keys() Generator[int, None, None]

Returns an iterator for all the TrieId keys in the trie.

The generator yields the TrieId for each key in the trie.

Returns:

Generator for the trie.

Return type:

Generator[TrieId, None, None]

prefixes(key: Sequence[Hashable | str]) set[TrieEntry]

Returns a set of TrieEntry instances for all keys in the trie that are a prefix of the passed key.

Searches the trie for all keys that are prefix matches for the key and returns their TrieEntry instances as a set.

Parameters:

key (GeneralizedKey) – Key for matching.

Returns:

set containing TrieEntry instances for keys that are prefixes of the key. This will be an empty set if there are no matches.

Return type:

set[TrieEntry]

Raises:

InvalidGeneralizedKeyError ([GTM001]) – If key is not a valid GeneralizedKey (is not a Sequence of Hashable objects).

Usage:

from gentrie import GeneralizedTrie, TrieEntry

trie: GeneralizedTrie = GeneralizedTrie()
keys: list[str] = ['abcdef', 'abc', 'a', 'abcd', 'qrs']
for entry in keys:
    trie.add(entry)
matches: set[TrieEntry] = trie.prefixes('abcd')
for trie_entry in sorted(list(matches)):
    print(f'{trie_entry.ident}: {trie_entry.key}')

# 2: abc
# 3: a
# 4: abcd
remove(ident: int) None

Remove the key with the passed ident from the trie.

Parameters:

ident (TrieId) – id of the key to remove.

Raises:
  • TypeError ([GTR001]) – if the ident arg is not a TrieId.

  • ValueError ([GTR002]) – if the ident arg is not a legal value.

  • KeyError ([GTR003]) – if the ident does not match the id of any keys.

suffixes(key: Sequence[Hashable | str], depth: int = -1) set[TrieEntry]

Returns the ids of all suffixes of the trie_key up to depth.

Searches the trie for all keys that are suffix matches for the key up to the specified depth below the key match and returns their ids as a set.

Parameters:
  • key (GeneralizedKey) – Key for matching.

  • depth (int, default=-1) – Depth starting from the matched key to include. The depth determines how many ‘layers’ deeper into the trie to look for suffixes.: * A depth of -1 (the default) includes ALL entries for the exact match and all children nodes. * A depth of 0 only includes the entries for the exact match for the key. * A depth of 1 includes entries for the exact match and the next layer down. * A depth of 2 includes entries for the exact match and the next two layers down.

Returns:

Set of TrieEntry instances for keys that are suffix matches for the key. This will be an empty set if there are no matches.

Return type:

set[TrieId]

Raises:

Usage:

from gentrie import GeneralizedTrie, TrieEntry

trie = GeneralizedTrie()
keys: list[str] = ['abcdef', 'abc', 'a', 'abcd', 'qrs']
for entry in keys:
    trie.add(entry)
matches: set[TrieEntry] = trie.suffixes('abcd')

for trie_entry in sorted(list(matches)):
    print(f'{trie_entry.ident}: {trie_entry.key}')

# 1: abcdef
# 4: abcd
values() Generator[TrieEntry, None, None]

Returns an iterator for all the TrieEntry entries in the trie.

The generator yields the TrieEntry for each key in the trie.

Returns:

Generator for the trie.

Return type:

Generator[TrieEntry, None, None]

class gentrie.Hashable(*args, **kwargs)

Bases: Protocol

Hashable is a protocol that defines key tokens that are usable with a GeneralizedTrie.

The protocol requires that a token object be hashable. This means that it implements both an __eq__() method and a __hash__() method.

Some examples of built-in types suitable for use as tokens in a key:

Note: frozensets and tuples are only hashable if their contents are hashable.

User-defined classes are hashable by default.

Usage:

from gentrie import Hashable
if isinstance(token, Hashable):
    print("token supports the Hashable protocol")
else:
    print("token does not support the Hashable protocol")
exception gentrie.InvalidGeneralizedKeyError

Bases: TypeError

Raised when a key is not a valid GeneralizedKey object.

This is a sub-class of TypeError.

exception gentrie.InvalidHashableError

Bases: TypeError

Raised when a token in a key is not a valid Hashable object.

This is a sub-class of TypeError.

class gentrie.TrieEntry(ident: int, key: Sequence[Hashable | str])

Bases: NamedTuple

A TrieEntry is a NamedTuple containing the unique identifer and key for an entry in the trie.

ident: int

TrieId Unique identifier for a key in the trie. Alias for field number 0.

key: Sequence[Hashable | str]

GeneralizedKey Key for an entry in the trie. Alias for field number 1.

gentrie.TrieId

Unique identifier for a key in a trie.

gentrie.is_generalizedkey(key: Sequence[Hashable | str]) bool

Tests key for whether it is a valid GeneralizedKey.

A valid GeneralizedKey is a Sequence that returns Hashable protocol conformant objects when iterated. It must have at least one token.

Parameters:

key (GeneralizedKey) – Key for testing.

Returns:

True if a valid GeneralizedKey, False otherwise.

Return type:

bool

gentrie.is_hashable(token: Hashable) bool

Tests token for whether it is a valid Hashable.

A valid Hashable is a hashable object.

Examples: bool, bytes, float, frozenset, int, str, None, tuple.

Parameters:

token (GeneralizedKey) – Object for testing.

Returns:

True if a valid Hashable, False otherwise.

Return type:

bool