selkie.xml
— XML files
The module selkie.xml provides a convenient XML parser.
The standard Python library provides XML parsing, but its parsers
generally raise an error when they encounter ill-formed XML.
Unfortunately, a great deal of XML on the web is ill-formed, and
aborting is not a graceful way of dealing with it. The XML parser in
selkie.xml is designed to be robust and lightweight.
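To illustrate the strictness mentioned above, here is a minimal sketch that uses only the standard library. The helper name try_strict_parse and the sample string are illustrative and are not part of selkie.

```python
import xml.etree.ElementTree as ET

# Ill-formed input: the <p> element is never closed.
ill_formed = "<html><body><p>missing end tag</body></html>"

def try_strict_parse(text):
    """Return (ok, message) from the strict standard-library parser."""
    try:
        ET.fromstring(text)
        return (True, "parsed")
    except ET.ParseError as exc:
        # The standard parser aborts instead of repairing the input.
        return (False, str(exc))

ok, message = try_strict_parse(ill_formed)
print(ok)   # False
```

The standard parser stops at the first structural problem, which is the behavior selkie.xml is designed to avoid.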
Entities
The tag iterator has a method decode_xml_entities() that converts XML
entities like “&amp;” to characters (“&”).
The codes are listed in TagIterator.entity_table.
For example:
>>> from selkie.xml import TagIterator
>>> TagIterator.entity_table['amp']
'&'
One can override entity_table in a given instance of TagIterator
to add new entities. If TagIterator is instantiated with entities=False,
entity replacement is suppressed.
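The table-driven replacement described above can be sketched in a few lines. This is not the selkie implementation: ENTITY_TABLE, decode_entities, and every table entry other than 'amp' are assumptions made for illustration, and the entities flag only mimics the documented entities=False behavior.

```python
import re

# Hypothetical stand-in for TagIterator.entity_table.
# Only the 'amp' entry is confirmed by the text above.
ENTITY_TABLE = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"'}

_ENTITY = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')

def decode_entities(text, table=ENTITY_TABLE, entities=True):
    """Replace named entities found in `table`; leave unknown ones untouched."""
    if not entities:
        return text          # replacement suppressed
    return _ENTITY.sub(lambda m: table.get(m.group(1), m.group(0)), text)

print(decode_entities('hi&amp;bye'))                   # hi&bye
print(decode_entities('hi&amp;bye', entities=False))   # hi&amp;bye

# Overriding the table for one use, here to add a new entity.
custom = dict(ENTITY_TABLE, nbsp='\u00a0')
print(decode_entities('a&nbsp;b', table=custom) == 'a\u00a0b')   # True
```

Copying the table before extending it keeps the change local, which mirrors overriding entity_table on a single instance rather than on the class.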
XML trees
- load_xml(fn)
The main function. It reads an XML file and converts it to a tree. Optional keyword arguments are:
  - entities=False leaves all HTML entities as-is.
  - multiple=True causes the return value to be a list of trees. By default, only one tree is returned, and it is an error if there is more than one in the file.
To give an example, consider the file xml1:
>>> from selkie.io import contents
>>> print(contents(ex('xml1')), end='')
<html>
<body foo="hi&bye" bar=16>
A "little" <b>example</b>.
</body>
</html>
We read it as a tree:
>>> from selkie.xml import load_xml
>>> xml1 = load_xml(ex('xml1'))
>>> print(xml1)
0 (html
1    ('#CDATA' '\n')
2    (body
3       ('#CDATA' '\nA "little" ')
4       (b
5          ('#CDATA' example))
6       ('#CDATA' '.\n'))
7    ('#CDATA' '\n'))
Each XML node is represented by a Tree instance. It has one
nonstandard member, namely, ftrs.
node.cat - a string, the tag label or '#CDATA'
node.word - a string, the contents of a CDATA node
node.ftrs - a list of (att, value) pairs
node.children - a list of Tree instances, or None
The functions from selkie.tree can be useful with XML trees.
For example, one can pick out subtrees using the function subtrees().
It takes a second argument, which is either a category or a predicate.
>>> from selkie.tree import subtrees
>>> subtrees(xml1, 'b')
[<Tree b ...>]
>>> subtrees(xml1, lambda x: x.cat == '#CDATA' and x.word != '\n')
[<Tree #CDATA '\nA "little" '>, <Tree #CDATA example>, <Tree #CDATA '.\n'>]
The function subtree() is just like subtrees(), except
that it returns a single tree; it signals an error if the specified
tree does not exist or is not unique.
>>> from selkie.tree import subtree
>>> body = subtree(xml1, 'body')
The function terminal_string() returns the string contents of a node.
>>> from selkie.tree import terminal_string
>>> terminal_string(body)
'\nA "little" example .\n'
The subtrees() function will not recurse inside any node that it
returns. For example:
>>> subtrees(xml1, lambda x: not x.word)
[<Tree html ...>]
To retrieve all nodes matching a given criterion, use the function nodes()
and a list comprehension:
>>> from selkie.tree import nodes
>>> [n for n in nodes(xml1) if not n.word]
[<Tree html ...>, <Tree body ...>, <Tree b ...>]
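The difference between the two traversals can be sketched without selkie. The Node class and the functions all_nodes and maximal_subtrees below are hypothetical stand-ins written for this illustration; they only mimic the behavior described above and are not the selkie.tree implementations.

```python
class Node:
    """Hypothetical minimal stand-in for a Tree node."""
    def __init__(self, cat, children=None, word=None):
        self.cat = cat
        self.children = children
        self.word = word

def all_nodes(tree):
    """Yield every node in preorder, like nodes() as described above."""
    yield tree
    for child in tree.children or []:
        yield from all_nodes(child)

def maximal_subtrees(tree, test):
    """Return matching subtrees without descending into a match."""
    pred = test if callable(test) else (lambda n: n.cat == test)
    if pred(tree):
        return [tree]            # stop here: do not recurse inside a match
    found = []
    for child in tree.children or []:
        found.extend(maximal_subtrees(child, pred))
    return found

# The same shape as the xml1 tree printed earlier.
doc = Node('html', [
    Node('#CDATA', word='\n'),
    Node('body', [
        Node('#CDATA', word='\nA "little" '),
        Node('b', [Node('#CDATA', word='example')]),
        Node('#CDATA', word='.\n'),
    ]),
    Node('#CDATA', word='\n'),
])

print([n.cat for n in maximal_subtrees(doc, lambda n: not n.word)])  # ['html']
print([n.cat for n in all_nodes(doc) if not n.word])                 # ['html', 'body', 'b']
print([n.cat for n in maximal_subtrees(doc, 'b')])                   # ['b']
```

Stopping at the first match on each branch is what makes the first call return only the root, while the exhaustive walk reports every element node.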
It is worth noting that the iter_xml_trees() function handles
ill-formed XML gracefully. For example:
>>> print(contents(ex('bad.html')), end='')
<html>
<body>
<p>
This is an example with lots
of missing end tags.
<table>
<tr><th>Name <th>Rank <th>SerialNo</tr>
<tr><td>Smith <td>Corporal <td>1234567</tr>
<tr><td>Jones <td>Private <td>7654321</tr>
<tr><td>Howard <td>Major General <td>0000001</tr>
</table>
</body>
The missing end tags are inserted automatically:
>>> bad = load_xml(ex('bad.html'))
>>> rows = subtrees(bad, 'tr')
>>> for row in rows: print(terminal_string(row))
...
Name Rank SerialNo
Smith Corporal 1234567
Jones Private 7654321
Howard Major General 0000001
End tags are paired with the nearest matching start tag. If there is no start tag with the same label, the end tag is silently ignored.
Within the region spanned by a matching tag pair, there may be
unmatched start tags. They are dealt with in a right-to-left pass, as follows.
Every category has a “precedence” assigned to it.
(The precedence is only meaningful for HTML categories; it is
a fixed constant for any non-HTML categories.) If the precedence is
zero (for categories br and img), the unmatched start tag
is converted to an empty tag. Otherwise, the “span” of the
unmatched start tag is grown to
cover as much material as possible, until it reaches the end of the
region defined by the encompassing explicit tag pair, or until it
encounters a node of equal or higher precedence. In other words, no
element created from an unmatched start tag will contain a child whose
precedence is equal to or greater than its own.
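The pairing and span-growing rules above can be sketched as follows. This is an illustration under stated assumptions, not the selkie code: the token format, the names Open, resolve, and build, the tuple representation of nodes, and all precedence values are hypothetical. Only the facts that br and img have precedence zero and that non-HTML categories share one constant come from the text.

```python
# Hypothetical precedence values; the real table is not given in the text.
PRECEDENCE = {'br': 0, 'img': 0, 'b': 1, 'td': 2, 'th': 2, 'tr': 3, 'p': 3,
              'table': 4, 'body': 5, 'html': 6}
DEFAULT_PRECEDENCE = 1   # assumed constant for non-HTML categories

class Open:
    """Marker for a start tag that has not been matched yet."""
    def __init__(self, label):
        self.label = label

def precedence(item):
    if isinstance(item, str):        # text never blocks a growing span
        return -1
    label = item.label if isinstance(item, Open) else item[0]
    return PRECEDENCE.get(label, DEFAULT_PRECEDENCE)

def resolve(region):
    """Close unmatched start tags in one right-to-left pass."""
    items = list(region)
    for j in range(len(items) - 1, -1, -1):
        if not isinstance(items[j], Open):
            continue
        label, own = items[j].label, precedence(items[j])
        k = j + 1
        if own > 0:                  # precedence zero becomes an empty element
            while k < len(items) and precedence(items[k]) < own:
                k += 1               # grow until equal or higher precedence
        items[j:k] = [(label, items[j + 1:k])]
    return items

def build(tokens):
    """Turn ('start'|'end'|'text', value) tokens into (label, children) nodes."""
    items = []
    for kind, value in tokens:
        if kind == 'text':
            items.append(value)
        elif kind == 'start':
            items.append(Open(value))
        else:
            # Pair the end tag with the nearest matching start tag.
            for i in range(len(items) - 1, -1, -1):
                if isinstance(items[i], Open) and items[i].label == value:
                    items[i:] = [(value, resolve(items[i + 1:]))]
                    break
            # If no start tag matched, the end tag is silently ignored.
    return resolve(items)            # close whatever is still open at the end

tokens = [('start', 'tr'), ('start', 'th'), ('text', 'Name '),
          ('start', 'th'), ('text', 'Rank '), ('start', 'br'),
          ('end', 'tr'), ('end', 'td')]
print(build(tokens))
# [('tr', [('th', ['Name ']), ('th', ['Rank ', ('br', [])])])]
```

In this run the first th stops growing when it reaches the second th, because the two have equal precedence; br becomes an empty element, and the stray td end tag is dropped.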
To get the value for an attribute, use the function getvalue().
>>> from selkie.xml import getvalue
>>> getvalue(body, 'foo')
'hi&bye'
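Since node.ftrs is a list of (att, value) pairs, an attribute lookup amounts to a linear scan. The sketch below shows that idea only; lookup_attribute is a hypothetical helper, the string form of the bar value is a guess, and returning a default for a missing attribute is an assumption, since the text does not say what getvalue() does in that case.

```python
def lookup_attribute(ftrs, att, default=None):
    """Return the value paired with `att`, or `default` if it is absent."""
    for name, value in ftrs or []:   # ftrs may be None or an empty list
        if name == att:
            return value
    return default

# Pairs shaped like the attributes of the body element in xml1 (values assumed).
body_ftrs = [('foo', 'hi&bye'), ('bar', '16')]
print(lookup_attribute(body_ftrs, 'foo'))       # hi&bye
print(lookup_attribute(body_ftrs, 'missing'))   # None
```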
Tidy
When there are unmatched start tags, iter_xml_trees() calls the
function tidy() to decide where the end tags should go. It
uses a table encoding intuitive operator precedences for HTML tags.