Recognized Document Structure

Dedoc returns the result as a ParsedDocument structure serialized to JSON.

Dedoc supports a linear and a tree output format. In the linear format, every line of the document is a child of the root. In the tree format, Dedoc tries to reconstruct the logical structure of the document as a tree.

ParsedDocument

  1. version: str (required field) - Dedoc version
  2. warnings: List[str] (required field) - any warnings that occurred while processing the document
  3. metadata: DocumentMetadata (required field) - document meta-information
  4. content: DocumentContent (required field) - parsed document structure
  5. attachments: List[ ParsedDocument ] (optional field) - attached documents, returned only if processing of attached files is enabled. See the "with_attachment" parameter of the POST request
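
Because attachments are themselves ParsedDocument structures, a response can be processed recursively. The following is a minimal sketch, assuming the JSON has already been decoded into a Python dict; the sample document is hypothetical and trimmed to the fields used, and collect_warnings is an illustrative helper, not part of Dedoc.

```python
from typing import Any, Dict, List

# Hypothetical, trimmed ParsedDocument: only the fields used below are filled in.
parsed_document: Dict[str, Any] = {
    "version": "0.0.0",  # placeholder, not a real Dedoc version
    "warnings": [],
    "metadata": {"uid": "doc_uid_1", "file_name": "example.odt"},
    "content": {"tables": [], "structure": {}},
    "attachments": [
        {
            "version": "0.0.0",
            "warnings": ["sample warning"],
            "metadata": {"uid": "doc_uid_2", "file_name": "picture.png"},
            "content": {"tables": [], "structure": {}},
            # "attachments" is optional and is omitted here
        }
    ],
}


def collect_warnings(doc: Dict[str, Any]) -> List[str]:
    """Gather warnings from a document and, recursively, from its attachments."""
    found = [f'{doc["metadata"]["file_name"]}: {w}' for w in doc["warnings"]]
    # "attachments" is an optional field, so fall back to an empty list.
    for attachment in doc.get("attachments") or []:
        found.extend(collect_warnings(attachment))
    return found


all_warnings = collect_warnings(parsed_document)
print(all_warnings)
```

Reading the optional field with doc.get(...) keeps the same function usable whether or not attachment processing was requested.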

DocumentMetadata

Contains document meta-information (for example file name, size, access time).

  1. uid: str (required field) - unique document identifier (example: "doc_uid_auto_ba73d76a-326a-11ec-8092-417272234cb0")
  2. file_name: string (required field) - file name (example: "example.odt")
  3. bucket_name: str (optional field) - bucket name in which the file is located. Included when analyzing a file from a cloud storage (example: "dedoc")
  4. size: integer (required field) - file size in bytes (example: 20060)
  5. modified_time: integer (required field) - modification time of the document as a Unix timestamp (example: 1590579805)
  6. created_time: integer (required field) - creation time of the document as a Unix timestamp (example: 1590579805)
  7. access_time: integer (required field) - file access time as a Unix timestamp (example: 1590579805)
  8. file_type: string (optional field) - MIME type of the file (example: "application/vnd.oasis.opendocument.text")
  9. other_fields: dict (optional field) - each file type has its own set of meta-information; see the detailed description of other_fields
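
The time fields are plain integers, so they usually need converting before display. The following is a minimal sketch that reuses the example values from the list above; the to_utc helper is illustrative and interpreting the timestamps as UTC is an assumption.

```python
from datetime import datetime, timezone

# Example values copied from the field descriptions above.
metadata = {
    "uid": "doc_uid_auto_ba73d76a-326a-11ec-8092-417272234cb0",
    "file_name": "example.odt",
    "size": 20060,
    "modified_time": 1590579805,
    "created_time": 1590579805,
    "access_time": 1590579805,
    "file_type": "application/vnd.oasis.opendocument.text",
}


def to_utc(timestamp: int) -> datetime:
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


modified = to_utc(metadata["modified_time"])
# Optional fields may be absent, so read them with .get().
bucket = metadata.get("bucket_name")
other_fields = metadata.get("other_fields") or {}

print(metadata["file_name"], metadata["size"], "bytes, modified", modified.date())
```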

DocumentContent

Contains document content structure

  1. tables: List[Table] (required field) - list of tables in a document
  2. structure: TreeNode (required field) - document tree structure

Table

A table detected and parsed in a document

  1. cells: List[List[string]] (required field) - list of table rows, each of which is a list of cells. A cell contains text.
  2. metadata: TableMetadata (required field) - table meta information

TableMetadata

Contains table meta information

  1. uid: str (required field) - unique identifier.
  2. page_id: integer (optional field) - page number on which the table begins. Can be null.
  3. is_inserted: bool (optional field) - whether the table was inserted into the document.
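
Since cells is a list of rows and each row is a list of strings, a table can be walked with ordinary nested iteration. The following is a minimal sketch on a hypothetical table; treating the first row as a header is an assumption made for illustration, not something the format guarantees.

```python
from typing import Any, Dict, List

# Hypothetical table in the shape described above.
table: Dict[str, Any] = {
    "cells": [["Name", "Qty"], ["Bolt", "10"], ["Nut", "25"]],
    "metadata": {"uid": "table_uid_1", "page_id": None, "is_inserted": False},
}


def table_to_records(tbl: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn rows into dicts keyed by the first row (assumed to be a header)."""
    cells = tbl["cells"]
    if not cells:
        return []
    header, *body = cells
    return [dict(zip(header, row)) for row in body]


records = table_to_records(table)

# page_id is optional and can be null, so handle None explicitly.
page_id = table["metadata"].get("page_id")
page_label = "unknown page" if page_id is None else f"page {page_id}"
print(page_label, records)
```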

TreeNode

Contains document tree structure

  1. node_id : string (required field) - document item identifier. It is unique within one tree (i.e. no other node in this tree has the same node_id, but the same node_id can occur in an attachment). The identifier has the form 0.2.1, where each number is the serial number of the item at the corresponding level of the hierarchy.
    For example, node_id 0.2.1 means that this element is the second subhead of the third chapter. The first number is the root of the document. Numbering starts from 0.
  2. text: string (required field) - element text.
  3. annotations: List[ Annotation ] (required field) - the field describes any properties of the text, for example, boldness, font size, etc.
  4. metadata: ParagraphMetadata (required field) - meta-information relevant to the entire subparagraph, such as page number and position on this page.
  5. subparagraphs: List[ TreeNode ] (required field) - "children" of the current item (for example, sub-chapters for a chapter). The structure of "children" is similar to the current one.
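
The recursive subparagraphs field lends itself to a depth-first walk, and the nesting level of a node can be read from its node_id. The following is a minimal sketch on a hypothetical tree; make_node, iter_nodes and depth are illustrative helpers, not part of Dedoc.

```python
from typing import Any, Dict, Iterator, List, Optional


def make_node(node_id: str, text: str,
              children: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Build a hypothetical TreeNode with the five fields described above."""
    return {
        "node_id": node_id,
        "text": text,
        "annotations": [],
        "metadata": {"paragraph_type": "root" if node_id == "0" else "raw_text"},
        "subparagraphs": children or [],
    }


tree = make_node("0", "", [
    make_node("0.0", "Chapter 1"),
    make_node("0.1", "Chapter 2", [make_node("0.1.0", "Section 2.1")]),
])


def iter_nodes(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node["subparagraphs"]:
        yield from iter_nodes(child)


def depth(node_id: str) -> int:
    """Nesting level: the root "0" is 0, "0.2" is 1, "0.2.1" is 2."""
    return node_id.count(".")


for node in iter_nodes(tree):
    print("  " * depth(node["node_id"]) + node["node_id"], node["text"])
```

In the linear format the same walk still works: every line is a direct child of the root, so every non-root node has depth 1.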

Annotation

Contains text annotations

  1. start : integer (required field) - annotation start index.
  2. end : integer (required field) - annotation end index: the index of the last character covered by this annotation, plus 1. For example, if only the first character is annotated, then start = 0, end = 1; if the whole line is annotated, then start = 0, end = length of the text.
  3. name : string (required field) - annotation type (size, italic, etc.).
  4. value : string (required field) - annotation value (to learn more see Concrete types of annotations).
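
Because end is exclusive, start and end map directly onto Python slice bounds. The following is a minimal sketch on a hypothetical node; the annotation names and values shown ("bold"/"True", "size"/"12") are assumptions for illustration only.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical node fragment: only "text" and "annotations" are shown.
node: Dict[str, Any] = {
    "text": "Hello world",
    "annotations": [
        {"start": 0, "end": 5, "name": "bold", "value": "True"},
        {"start": 0, "end": 11, "name": "size", "value": "12"},
    ],
}


def annotated_spans(item: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (name, value, covered text) for every annotation of the node."""
    text = item["text"]
    return [
        (a["name"], a["value"], text[a["start"]:a["end"]])
        for a in item["annotations"]
    ]


spans = annotated_spans(node)
for name, value, covered in spans:
    print(f"{name}={value}: {covered!r}")
```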

Concrete types of annotations.

ParagraphMetadata

Contains paragraph meta information

  1. paragraph_type : string (required field) - paragraph type (paragraph, list item and so on). Possible values depend on the type of document. Default values: ['root', 'paragraph', 'raw_text', 'list', 'list_item', 'named_header']. Values for document_type 'law': ['root', 'raw_text', 'struct_unit', 'item', 'article', 'subitem', 'footer', 'header', 'title', 'part'].
  2. predicted_classes : Dict[str -> float] (optional field) - classifier results: each key is a paragraph type and each value is the probability that the paragraph is of this type. The list of keys depends on the type of document.
  3. page_id : integer (optional field) - page on which this paragraph begins.
  4. line_id : integer (optional field) - line number on which this paragraph begins.
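
When predicted_classes is present, the most probable paragraph type can be picked from it. The following is a minimal sketch on hypothetical metadata; the probabilities are made up, and best_prediction is an illustrative helper, not part of Dedoc.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical ParagraphMetadata with made-up classifier probabilities.
paragraph_metadata: Dict[str, Any] = {
    "paragraph_type": "list_item",
    "predicted_classes": {"list_item": 0.7, "paragraph": 0.2, "named_header": 0.1},
    "page_id": 0,
    "line_id": 12,
}


def best_prediction(meta: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return the (type, probability) pair with the highest probability.

    predicted_classes is optional, so None is returned when it is absent or empty.
    """
    predicted = meta.get("predicted_classes")
    if not predicted:
        return None
    best_type = max(predicted, key=predicted.get)
    return best_type, predicted[best_type]


best = best_prediction(paragraph_metadata)
print(paragraph_metadata["paragraph_type"], best)
```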