Orthography Information

The Unicode CLDR data has three categories of character support for each orthography: basic, optional, and punctuation.

class jkUnicode.orthography.Orthography(info_obj, code, script, territory, info_dict)

The Orthography object represents an orthography. You usually don’t deal with this object directly; it is used internally by the jkUnicode.orthography.OrthographyInfo object.

Parameters
  • info_obj (jkUnicode.orthography.OrthographyInfo) – The parent info object.

  • code (str) – The orthography code.

  • script (str) – The script code of the orthography.

  • territory (str) – The territory code of the orthography.

  • info_dict (dict) – The dictionary which contains the rest of the information about the orthography.

almost_supported_basic(max_missing=5)

Is the orthography supported with at most max_missing base characters missing from the current parent cmap?

almost_supported_full(max_missing=5)

Is the orthography supported with at most max_missing characters (base, optional and punctuation characters) missing from the current parent cmap?

almost_supported_punctuation(max_missing=5)

Is the orthography supported with at most max_missing punctuation characters missing from the current parent cmap?
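
As a rough illustration of the three almost_supported_* checks, the sketch below counts the codepoints absent from a cmap and compares that count with max_missing. The function name is_almost_supported and the decision to exclude the fully supported case (zero missing) are assumptions made for illustration, not the library's implementation.

```python
def is_almost_supported(required, cmap, max_missing=5):
    """Return True if between 1 and max_missing required codepoints are absent.

    required: set of int codepoints the orthography needs (hypothetical input).
    cmap: dict mapping int codepoints to glyph names.
    Assumption: an orthography with nothing missing is not "almost" supported.
    """
    missing = {u for u in required if u not in cmap}
    return 0 < len(missing) <= max_missing


cmap = {0x41: "A", 0x42: "B"}
# U+00C4 is the only required codepoint that the cmap lacks
print(is_almost_supported({0x41, 0x42, 0xC4}, cmap))
```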

fill_from_default_orthography()

Sometimes the base unicodes are empty for a variant of an orthography. Try to fill them in from the default variant.

Call this only after the whole list of orthographies is present, or it will fail, because the default orthography may not be present until the whole list has been built.
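
The ordering constraint can be pictured as a two-pass build: first collect every orthography, then fill the empty variants. The sketch below is a simplified stand-in, not the library's code. The dictionary layout, the function name fill_from_default, and the use of 'DFLT' and 'dflt' as the keys of the default variant are assumptions.

```python
def fill_from_default(orthographies):
    """Copy base codepoints from the default variant into empty variants.

    orthographies: dict keyed by (code, script, territory) whose values are
    dicts with a "base" set. This layout is hypothetical.
    """
    for (code, script, territory), entry in orthographies.items():
        if entry["base"]:
            continue  # this variant already has base characters
        # Assumption: the default variant is keyed with 'DFLT' and 'dflt'
        default = orthographies.get((code, "DFLT", "dflt"))
        if default is not None:
            entry["base"] = set(default["base"])


# Pass 1: build the complete list. Pass 2: fill the gaps.
orthographies = {
    ("sr", "Cyrl", "RS"): {"base": set()},
    ("sr", "DFLT", "dflt"): {"base": {0x0430, 0x0431}},
}
fill_from_default(orthographies)
```

Running the fill step while the list is still being built would miss any default variant that has not been added yet, which is why it comes last.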

forget_cmap()

Forget the results of the last cmap scan.

from_dict(info_dict)

Read information for the current orthography from a dictionary. This method is called during initialization of the object and fills in a number of instance attributes:

name

The orthography name.

unicodes_base

The set of base characters for the orthography.

unicodes_optional

The set of optional characters for the orthography.

unicodes_punctuation

The set of punctuation characters for the orthography.

unicodes_any

The previous three sets combined.

property info

The parent OrthographyInfo object (read-only).

property name

The name of the orthography (read-only).

scan_cmap()

Scan the orthography against the current parent cmap. This fills in a number of instance attributes:

missing_base

A set of unicode values that are missing from the basic characters of the orthography.

missing_optional

A set of unicode values that are missing from the optional characters of the orthography.

missing_punctuation

A set of unicode values that are missing from the punctuation characters of the orthography.

missing_all

The union of the previous three sets.

num_missing_base, num_missing_optional, num_missing_punctuation, num_missing_all

The number of missing characters in each of the previous sets.

base_pc, optional_pc, punctuation_pc

The percentage values of support for the categories basic, optional, and punctuation characters.

The names of these attributes can be used in jkUnicode.orthography.OrthographyInfo.print_report.
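
To make the relationship between these attributes concrete, here is a minimal sketch that derives them from three character sets and a cmap. The function scan and its dictionary return value are hypothetical. The percentage formula (share of a category found in the cmap, with an empty category counted as 100) is an assumption, not taken from the library.

```python
def scan(unicodes_base, unicodes_optional, unicodes_punctuation, cmap):
    """Compare an orthography's character sets with a cmap (hypothetical helper)."""
    present = set(cmap)  # codepoints available in the font
    categories = {
        "base": unicodes_base,
        "optional": unicodes_optional,
        "punctuation": unicodes_punctuation,
    }
    result = {}
    missing_all = set()
    for key, wanted in categories.items():
        missing = wanted - present
        missing_all |= missing
        result["missing_" + key] = missing
        result["num_missing_" + key] = len(missing)
        # Assumption: percentage of the category found; empty category = 100
        found = len(wanted) - len(missing)
        result[key + "_pc"] = 100.0 * found / len(wanted) if wanted else 100.0
    result["missing_all"] = missing_all
    result["num_missing_all"] = len(missing_all)
    return result


report = scan({0x41, 0x42}, {0xC4}, {0x2E}, {0x41: "A", 0x2E: "period"})
print(report["missing_all"], report["base_pc"])
```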

support_basic()

Is the orthography supported (base and punctuation characters) for the current parent cmap?

support_full()

Is the orthography supported (base, optional and punctuation characters) for the current parent cmap?

support_minimal()

Is the orthography supported at the minimal level only (all base characters present, but no better level of support) for the current parent cmap?

support_minimal_inclusive()

Is the orthography supported at least minimally (all base characters present, regardless of optional and punctuation characters) for the current parent cmap?
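
Read literally, the support levels can be expressed in terms of the missing-character counts produced by scan_cmap(). The sketch below is an interpretation of the descriptions, not the library's code, and the function name support_levels is hypothetical. It covers three of the four checks; support_minimal is left out because its exact condition is not given here.

```python
def support_levels(num_missing_base, num_missing_optional, num_missing_punctuation):
    """Map missing-character counts to support levels (illustrative only)."""
    # Basic: base and punctuation characters are all present
    basic = num_missing_base == 0 and num_missing_punctuation == 0
    # Full: base, optional and punctuation characters are all present
    full = basic and num_missing_optional == 0
    # Minimal (inclusive): base characters are all present, nothing else required
    minimal_inclusive = num_missing_base == 0
    return {"basic": basic, "full": full, "minimal_inclusive": minimal_inclusive}


# All base and punctuation characters present, two optional ones missing
print(support_levels(0, 2, 0))
```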

uses_unicode_any(u)

Is the unicode used by this orthography in any set? This is relatively slow. Use jkUnicode.orthography.OrthographyInfo.build_reverse_cmap() if you need to access this information more often.

Parameters

u (int) – The codepoint.

uses_unicode_base(u)

Is the unicode used by this orthography in the base set? This is relatively slow. Use jkUnicode.orthography.OrthographyInfo.build_reverse_cmap() if you need to access this information more often.

Parameters

u (int) – The codepoint.

class jkUnicode.orthography.OrthographyInfo

The main Orthography Info object. It reads the information for each orthography from the files in the json subfolder. The JSON data is generated from the Unicode CLDR data via included Python scripts.

This object is expensive to instantiate due to disk access, so it is recommended to instantiate it once and then reuse it.

build_reverse_cmap()

Build a map from each unicode to a list of indices into the orthographies list for all orthographies that use it as a base or punctuation character.
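
The idea of such a reverse map can be sketched in a few lines. This is an illustration under assumptions rather than the library's implementation: the function name build_reverse_map and the representation of each orthography as a (base set, punctuation set) tuple are hypothetical.

```python
from collections import defaultdict


def build_reverse_map(orthographies):
    """Map each codepoint to the indices of the orthographies that use it.

    orthographies: list of (base_set, punctuation_set) tuples (hypothetical shape).
    """
    reverse = defaultdict(list)
    for index, (base, punctuation) in enumerate(orthographies):
        for u in base | punctuation:
            reverse[u].append(index)
    return dict(reverse)


orthographies = [
    ({0x41}, {0x2E}),        # index 0
    ({0x41, 0xC4}, set()),   # index 1
]
reverse = build_reverse_map(orthographies)
print(reverse[0x41])
```

Once the map exists, looking up the orthographies for a codepoint is a single dictionary access instead of a scan over every orthography.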

property cmap

The unicode to glyph name mapping, a dictionary. When you set the cmap, it is scanned against all orthographies belonging to the Orthography Info object.

get_almost_supported(max_missing=5)

Return a list of almost supported orthographies for the current cmap.

Parameters

max_missing (int) – The maximum allowed number of missing characters.

get_language_name(code)

Return the nice name for a language by its code.

Parameters

code (str) – The language code.

get_orthographies_for_char(char)

Get a list of orthographies which use a supplied character at base level.

Parameters

char (str) – The character.

get_orthographies_for_unicode(u)

Get a list of orthographies which use a supplied codepoint at base level.

Parameters

u (int) – The codepoint.

get_orthographies_for_unicode_any(u)

Get a list of orthographies which use a supplied codepoint at any level.

Parameters

u (int) – The codepoint.

get_script_name(code='DFLT')

Return the nice name for a script by its code.

Parameters

code (str) – The script code.

get_supported_orthographies(full_only=False)

Get a list of supported orthographies for the current cmap.

Parameters

full_only (bool) – Return only orthographies which have both basic and optional characters present for the current cmap.
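
A typical workflow combines the cmap property with get_supported_orthographies(). The helper below, supported_names, is hypothetical and only uses members documented on this page. It assumes that the returned list contains Orthography objects with a name property.

```python
def supported_names(info, cmap, full_only=False):
    """Return the sorted names of the orthographies supported by a cmap.

    info: an OrthographyInfo instance (or any object with the same members).
    cmap: dict mapping int codepoints to glyph names.
    """
    # Setting the cmap triggers a scan against all orthographies
    info.cmap = cmap
    supported = info.get_supported_orthographies(full_only=full_only)
    # Assumption: each list item exposes the documented `name` property
    return sorted(ot.name for ot in supported)
```

With the library installed, this could be called as supported_names(OrthographyInfo(), cmap) after importing OrthographyInfo from jkUnicode.orthography. Because instantiation is expensive, the same info object should be reused for every cmap.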

get_supported_orthographies_minimum()

Get a list of orthographies with minimal support for the current cmap only.

get_supported_orthographies_minimum_inclusive()

Get a list of orthographies with minimal or better support for the current cmap.

get_territory_name(code='dflt')

Return the nice name for a territory by its code.

Parameters

code (str) – The territory code.

orthography(code, script='DFLT', territory='dflt')

Access a particular orthography by its language, script and territory code.

Parameters
  • code (str) – The language code.

  • script (str) – The script code.

  • territory (str) – The territory code.

print_report(otlist, attr)

Print a formatted report for a given list of orthographies.

Parameters
  • otlist (list) – The list of orthographies.

  • attr (str) – The name of the attribute of the orthography object that will be shown in the report (missing_base, missing_optional, missing_punctuation, missing_all, num_missing_base, num_missing_optional, num_missing_punctuation, base_pc, optional_pc, punctuation_pc, unicodes_base, unicodes_optional, unicodes_punctuation).
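
Selecting the reported value by attribute name can be pictured with getattr. The sketch below is not the library's report format. The function format_report, the "U+XXXX" rendering of codepoint sets, and the stand-in objects built with SimpleNamespace are all assumptions for illustration.

```python
from types import SimpleNamespace


def format_report(otlist, attr):
    """Return one report line per orthography for the given attribute name."""
    lines = []
    for ot in otlist:
        value = getattr(ot, attr)
        if isinstance(value, (set, frozenset)):
            # Render codepoint sets in sorted U+XXXX notation
            value = " ".join("U+%04X" % u for u in sorted(value))
        lines.append("%s: %s" % (ot.name, value))
    return lines


# Stand-in for an Orthography object with a scanned cmap
german = SimpleNamespace(name="German", missing_base={0xDF, 0xC4}, num_missing_base=2)
print(format_report([german], "missing_base"))  # ['German: U+00C4 U+00DF']
```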

report_missing_punctuation()

Print a report of orthographies which have all basic letters present, but are missing punctuation characters.

report_near_misses(n=5)

Print a report of orthographies which have a maximum of n characters missing.

report_supported(full_only=False)

Print a report of supported orthographies for the current cmap.

Parameters

full_only (bool) – Only report orthographies which have both basic and optional characters present.

report_supported_minimum()

Print a report of minimally supported orthographies for the current cmap (no punctuation, no optional characters present).

report_supported_minimum_inclusive()

Print a report of minimally supported orthographies for the current cmap (no punctuation, no optional characters required).

jkUnicode.orthography.cased(codepoint_list)

Return a list of the codepoints with their case toggled: uppercase codepoints are replaced by their lowercase mapping and vice versa. If a codepoint has no lowercase or uppercase mapping, it is dropped from the list.

Parameters

codepoint_list (list) – The list of integer codepoints.
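
The behavior described for cased() can be approximated with Python's built-in string methods. The function toggle_case below is a hypothetical stand-in, not the library's implementation. It assumes that only one-to-one case mappings count, so a codepoint such as U+00DF, whose uppercase form in Python is two characters, is dropped.

```python
def toggle_case(codepoint_list):
    """Return the case-toggled codepoints, dropping those without a mapping."""
    result = []
    for u in codepoint_list:
        ch = chr(u)
        # Try the lowercase mapping first, then the uppercase mapping
        for mapped in (ch.lower(), ch.upper()):
            # Assumption: keep only one-to-one mappings that change the character
            if mapped != ch and len(mapped) == 1:
                result.append(ord(mapped))
                break
    return result


# "A" becomes "a", "a" becomes "A", the digit "1" has no case and is dropped
print(toggle_case([0x41, 0x61, 0x31]))  # [97, 65]
```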