Recognized Document Structure
Dedoc returns the result as a ParsedDocument structure serialized as JSON.
Dedoc supports linear and tree formats. In the linear format, every document line is a child of the root. In
the tree format, dedoc tries to reconstruct the logical structure of the document as a tree.
ParsedDocument
- version: str (required field) - dedoc version
- warnings: List[str] (required field) - any warnings that occurred during document processing
- metadata: DocumentMetadata (required field) - document meta-information
- content: DocumentContent (required field) - parsed document structure
- attachments: List[ParsedDocument] (optional field) - attached documents, returned only if processing of attached files is enabled. See the "with_attachment" parameter in the POST request.
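As a rough illustration of the top-level layout, the sketch below loads a response body and checks the fields listed above. The sample JSON is hypothetical and heavily shortened (empty metadata and content, a placeholder version string), and the helper names are illustrative, not part of dedoc.

```python
import json

# Required top-level fields as listed above; "attachments" is optional.
REQUIRED_FIELDS = ("version", "warnings", "metadata", "content")


def missing_required_fields(parsed_document: dict) -> list:
    """Return the required ParsedDocument fields absent from the dict."""
    return [name for name in REQUIRED_FIELDS if name not in parsed_document]


def count_documents(parsed_document: dict) -> int:
    """Count the document itself plus all (possibly nested) attachments."""
    attachments = parsed_document.get("attachments") or []
    return 1 + sum(count_documents(item) for item in attachments)


# Hypothetical response body; all values are made up for illustration.
raw = """
{
  "version": "x.y",
  "warnings": [],
  "metadata": {},
  "content": {},
  "attachments": [
    {"version": "x.y", "warnings": [], "metadata": {}, "content": {}}
  ]
}
"""
document = json.loads(raw)
print("missing fields:", missing_required_fields(document))
print("documents including attachments:", count_documents(document))
```

Because each attachment is itself a ParsedDocument, the same check can be applied recursively to every attached file.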
DocumentMetadata
Contains document meta-information (for example, file name, size, access time).
- uid: str (required field) - unique document identifier (example: "doc_uid_auto_ba73d76a-326a-11ec-8092-417272234cb0")
- file_name: string (required field) - file name (example: "example.odt")
- bucket_name: str (optional field) - name of the bucket in which the file is located. Included when analyzing a file from cloud storage (example: "dedoc")
- size: integer (required field) - file size in bytes (example: 20060)
- modified_time: integer (required field) - modification time of the document as a Unix timestamp (example: 1590579805)
- created_time: integer (required field) - creation time of the document as a Unix timestamp (example: 1590579805)
- access_time: integer (required field) - file access time as a Unix timestamp (example: 1590579805)
- file_type: string (optional field) - MIME type of the file (example: "application/vnd.oasis.opendocument.text")
- other_fields: dict (optional field) - each file type has its own set of meta-information; see the detailed description of other_fields
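The time fields are plain integers, so a consumer usually converts them before display. A minimal sketch, using the example values listed above as a sample metadata dict; the helper name `unix_to_utc` is illustrative, not part of dedoc.

```python
from datetime import datetime, timezone


def unix_to_utc(timestamp: int) -> datetime:
    """Convert a Unix timestamp (seconds since the epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# Sample metadata assembled from the example values in the field list.
metadata = {
    "uid": "doc_uid_auto_ba73d76a-326a-11ec-8092-417272234cb0",
    "file_name": "example.odt",
    "size": 20060,
    "modified_time": 1590579805,
    "created_time": 1590579805,
    "access_time": 1590579805,
    "file_type": "application/vnd.oasis.opendocument.text",
}

modified = unix_to_utc(metadata["modified_time"])
size_kib = metadata["size"] / 1024
print(f'{metadata["file_name"]}: {size_kib:.1f} KiB, modified {modified.isoformat()}')

# Optional fields may be absent, so read them with .get() and a fallback.
print(metadata.get("bucket_name", "<not from cloud storage>"))
```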
DocumentContent
Contains document content structure
- tables: List[Table] (required field) - list of tables in a document
- structure: TreeNode (required field) - document tree structure
Table
Detected and parsed table in a document
- cells: List[List[string]] (required field) - list of lists of table cells. Each cell contains text.
- metadata: TableMetadata (required field) - table meta information
TableMetadata
Contains table meta-information
- uid: str (required field) - unique identifier.
- page_id: integer (optional field) - page number on which the table begins. Can be null.
- is_inserted: bool (optional field) - whether the table was inserted into the document.
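Since every table carries a unique uid in its metadata, one convenient way to work with the tables list is to index it by that uid. The sketch below assumes each inner list of cells is one table row; the uid, cell text, and helper name are made up for illustration.

```python
def index_tables(content: dict) -> dict:
    """Map each table's metadata uid to the table itself."""
    return {table["metadata"]["uid"]: table for table in content["tables"]}


# Hypothetical DocumentContent fragment; "structure" is left empty for brevity.
content = {
    "tables": [
        {
            "cells": [["Name", "Qty"], ["bolt", "4"], ["nut", "8"]],
            "metadata": {"uid": "table-uid-1", "page_id": None},
        }
    ],
    "structure": {},
}

tables_by_uid = index_tables(content)
table = tables_by_uid["table-uid-1"]

# Assumption: each inner list is one row of the table.
for cell_list in table["cells"]:
    print(" | ".join(cell_list))

# page_id is optional and may be null (None after JSON decoding).
page = table["metadata"].get("page_id")
print("page unknown" if page is None else f"starts on page {page}")
```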
TreeNode
Contains document tree structure
- node_id : string (required field) - document item identifier. It is unique within one tree (i.e. no other node in this tree has the same node_id, but it can occur in an attachment).
The identifier has the form 0.2.1, where each number is the serial number of the element at the corresponding level of the hierarchy.
For example, node_id 0.2.1 means that this element is the second subhead of the third chapter. The first number is the root of the document. Numbering starts from 0.
- text: string (required field) - element text
- annotations: List[ Annotation ] (required field) - the field describes any properties of the text, for example, boldness, font size, etc.
- metadata: ParagraphMetadata (required field) - meta-information relevant to the entire subparagraph, such as the page number and position on the page.
- subparagraphs: List[TreeNode] (required field) - "children" of the current item (for example, sub-chapters for a chapter). Each child has the same structure as the current item.
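The recursive shape of TreeNode lends itself to a depth-first walk. The sketch below builds a small hypothetical tree (texts are made up, annotations and metadata are left empty) and prints it as an indented outline, using the number of components in node_id as the nesting level. The helpers `make_node`, `iter_nodes`, and `node_path` are illustrative names, not part of dedoc.

```python
from typing import Iterator


def make_node(node_id: str, text: str, children=()) -> dict:
    """Build a minimal TreeNode-shaped dict (hypothetical helper for this example)."""
    return {
        "node_id": node_id,
        "text": text,
        "annotations": [],
        "metadata": {},
        "subparagraphs": list(children),
    }


def iter_nodes(node: dict) -> Iterator[dict]:
    """Yield the node and all of its descendants in depth-first order."""
    yield node
    for child in node["subparagraphs"]:
        yield from iter_nodes(child)


def node_path(node_id: str) -> list:
    """Split a node_id such as "0.2.1" into its per-level indices [0, 2, 1]."""
    return [int(part) for part in node_id.split(".")]


tree = make_node("0", "", [
    make_node("0.0", "Chapter 1"),
    make_node("0.1", "Chapter 2", [make_node("0.1.0", "Section 2.1")]),
])

for node in iter_nodes(tree):
    depth = len(node_path(node["node_id"])) - 1  # the root has depth 0
    print("  " * depth + f'{node["node_id"]} {node["text"]}')
```

In the linear format the same walk applies; every line is simply a direct child of the root.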
Annotation
Contains text annotations
- start : integer (required field) - annotation start index.
- end : integer (required field) - annotation end index: the index of the last character associated with this annotation, plus 1.
For example, if only the first character is annotated, then start = 0 and end = 1;
if the whole line is annotated, then start = 0 and end = the length of the line.
- name : string (required field) - annotation type (size, italic etc).
- value : string (required field) - annotation value
(to learn more see Concrete types of annotations).
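Because end is the index of the last annotated character plus 1, the pair (start, end) maps directly onto a Python slice. A minimal sketch with a made-up line of text and two made-up annotations:

```python
def annotated_text(text: str, annotation: dict) -> str:
    """Return the part of the text covered by the annotation (end is exclusive)."""
    return text[annotation["start"]:annotation["end"]]


# Hypothetical node text and annotations, for illustration only.
text = "Bold start, plain end"
annotations = [
    {"start": 0, "end": 4, "name": "bold", "value": "True"},
    {"start": 0, "end": len(text), "name": "size", "value": "12.0"},
]

for annotation in annotations:
    covered = annotated_text(text, annotation)
    print(f'{annotation["name"]}={annotation["value"]}: {covered!r}')
```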
Concrete types of annotations.
- AlignmentAnnotation : text alignment. The value of "name" field = "alignment".
"value" field may be "left" (left alignment), "right" (right alignment),
"both" (alignment on both sides), "center" (center alignment).
- BoldAnnotation : bold text. The value of "name" field = "bold".
"value" may be "True", if the font is bold and "False" otherwise.
- IndentationAnnotation : indent from the left edge of the page.
For documents in docx format, we mean the indentation from the inner borders of the document
(margin is not taken into account). The value of "name" field = "indentation".
"value" may be any real number converted to a string.
For docx documents, the indent is measured in twentieths of a point (1/1440 of an inch).
- ItalicAnnotation : italic text. The value of "name" field = "italic".
"value" may be "True", if the text is in italics and "False" otherwise.
- SizeAnnotation : font size in points (1/72 of an inch). The value of "name" field = "size".
"value" may be any real number converted to a string.
- StyleAnnotation : the name of the style applied to the text, for example, "heading 1".
The value of "name" field = "style".
"value" may be any string.
- TableAnnotation : table reference. The value of "name" field = "table".
"value" field = unique identifier of the table.
- UnderlinedAnnotation : underlined text. The value of "name" field = "underlined".
"value" may be "True", if the text is underlined and "False" otherwise.
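All annotation values arrive as strings, so a consumer typically converts them according to the annotation name. The sketch below is one possible mapping based on the list above; the function names are illustrative, and the twips conversion applies only to the docx case described for IndentationAnnotation.

```python
# Annotation names grouped by the kind of value they carry, per the list above.
BOOLEAN_NAMES = {"bold", "italic", "underlined"}
NUMERIC_NAMES = {"indentation", "size"}


def parse_value(annotation: dict):
    """Convert the string "value" to bool or float where the name calls for it."""
    name, value = annotation["name"], annotation["value"]
    if name in BOOLEAN_NAMES:
        return value == "True"
    if name in NUMERIC_NAMES:
        return float(value)
    # alignment, style and table values are kept as plain strings
    return value


def twips_to_points(twips: float) -> float:
    """Convert twentieths of a point (the docx indentation unit) to points."""
    return twips / 20


# Made-up annotations for illustration.
samples = [
    {"start": 0, "end": 5, "name": "bold", "value": "True"},
    {"start": 0, "end": 5, "name": "size", "value": "12.0"},
    {"start": 0, "end": 5, "name": "indentation", "value": "720"},
    {"start": 0, "end": 5, "name": "alignment", "value": "center"},
]
for sample in samples:
    print(sample["name"], "->", repr(parse_value(sample)))

indent_points = twips_to_points(parse_value(samples[2]))
print("indentation in points:", indent_points)
```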
ParagraphMetadata
Contains paragraph meta-information
- paragraph_type : string (required field) - paragraph type (paragraph, list item and so on).
Possible values depend on the type of document.
Default values: ['root', 'paragraph', 'raw_text', 'list', 'list_item', 'named_header']
Values for document_type 'law': ['root', 'raw_text', 'struct_unit', 'item', 'article', 'subitem',
'footer', 'header', 'title', 'part']
- predicted_classes : Dict[str -> float] (optional field) - classifier results: each key is a paragraph type and each value
is the probability that the paragraph is of this type. The list of keys depends on the type of document.
- page_id : integer (optional field) - page on which this paragraph begins.
- line_id : integer (optional field) - line number on which this paragraph begins.
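When predicted_classes is present, the most probable type can be read off with a simple maximum over the dictionary. A minimal sketch with made-up probabilities; falling back to paragraph_type when the optional field is missing is a choice made for this example, not documented dedoc behavior.

```python
def most_likely_type(metadata: dict) -> str:
    """Return the key with the highest probability, or paragraph_type if absent."""
    predicted = metadata.get("predicted_classes")
    if not predicted:
        # Assumption for this sketch: fall back to the required field.
        return metadata["paragraph_type"]
    return max(predicted, key=predicted.get)


# Hypothetical ParagraphMetadata dicts; the probabilities are invented.
with_scores = {
    "paragraph_type": "list_item",
    "predicted_classes": {"paragraph": 0.1, "list_item": 0.7, "named_header": 0.2},
    "page_id": 0,
    "line_id": 12,
}
without_scores = {"paragraph_type": "raw_text"}

print(most_likely_type(with_scores))
print(most_likely_type(without_scores))
```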