HTMLElement class

This class can be used for parsing or for creating a DOM manually.

DOM building

If you want to create a DOM from HTMLElement objects, you can use one of these four constructor forms:

HTMLElement()
HTMLElement("<tag>")
HTMLElement("<tag>", {"param": "value"})
HTMLElement("tag", {"param": "value"}, [HTMLElement("<tag1>"), ...])

Tag or parameter specification parts can be omitted:

HTMLElement("<root>", [HTMLElement("<tag1>"), ...])
HTMLElement(
   [HTMLElement("<tag1>"), ...]
)

Examples

Blank element

>>> from dhtmlparser import HTMLElement
>>> e = HTMLElement()
>>> e
<dhtmlparser.HTMLElement instance at 0x7fb2b39ca170>
>>> print e

>>>

Usually, it is better to use HTMLElement("").

Nonpair tag

>>> e = HTMLElement("<br>")
>>> e.isNonPairTag()
True
>>> e.isOpeningTag()
False
>>> print e
<br>

Note that the closing tag wasn’t created automatically.

Pair tag

>>> e = HTMLElement("<tag>")
>>> e.isOpeningTag() # this doesn't check if tag actually is paired, just if it looks like opening tag
True
>>> e.isPairTag()    # this does check if element is actually paired
False
>>> e.endtag = HTMLElement("</tag>")
>>> e.isOpeningTag()
True
>>> e.isPairTag()
True
>>> print e
<tag></tag>

In short:

>>> e = HTMLElement("<tag>")
>>> e.endtag = HTMLElement("</tag>")

Or you can always use the string parser:

>>> import dhtmlparser as d
>>> e = d.parseString("<tag></tag>")
>>> print e
<tag></tag>

But don’t forget that elements returned from parseString() are wrapped in a blank “root” element:

>>> e = d.parseString("<tag></tag>")
>>> e.getTagName()
''
>>> e.childs[0].tagToString()
'<tag>'
>>> e.childs[0].endtag.tagToString() # referenced through the .endtag property
'</tag>'
>>> e.childs[1].tagToString() # end tag picked manually from childs - don't use this
'</tag>'

Tags with parameters

A tag (with or without <>) can take a dictionary of parameters as the second argument.

>>> e = HTMLElement("tag", {"param":"value"})  # without <>, because normal text can't have parameters
>>> print e
<tag param="value">
>>> print e.params  # parameters are accessed through the .params property
{'param': 'value'}

Tags with content

You can create content manually:

>>> e = HTMLElement("<tag>")
>>> e.childs.append(HTMLElement("content"))
>>> e.endtag = HTMLElement("</tag>")
>>> print e
<tag>content</tag>

But there is also an easier way:

>>> print HTMLElement("tag", [HTMLElement("content")])
<tag>content</tag>

or:

>>> print HTMLElement("tag", {"some": "parameter"}, [HTMLElement("content")])
<tag some="parameter">content</tag>

HTMLElement class API

HTMLElement class used in DOM representation.

dhtmlparser.htmlelement.NONPAIR_TAGS = ['br', 'hr', 'img', 'input', 'meta', 'spacer', 'frame', 'base']

List of non-pair tags. Set this to an empty list if you wish to parse XML.

class dhtmlparser.htmlelement.HTMLElement(tag='', second=None, third=None)

This class is used to represent a singly linked DOM (see makeDoubleLinked() for a doubly linked one).

childs

list

List of child nodes.

params

dict

SpecialDict instance holding tag parameters.

endtag

obj

Reference to the ending HTMLElement or None.

openertag

obj

Reference to the opening HTMLElement or None.

find(tag_name, params=None, fn=None, case_sensitive=False)

Same as findAll(), but without endtags.

You can always get them from the endtag property.

findB(tag_name, params=None, fn=None, case_sensitive=False)

Same as findAllB(), but without endtags.

You can always get them from the endtag property.

findAll(tag_name, params=None, fn=None, case_sensitive=False)

Search for elements matching the given parameters, using a depth-first algorithm.

Parameters:
  • tag_name (str) – Name of the tag you are looking for. Set to “” if you wish to use only fn parameter.
  • params (dict, default None) – Parameters which have to be present in tag to be considered matching.
  • fn (function, default None) – Use this function to match tags. Function expects one parameter which is HTMLElement instance.
  • case_sensitive (bool, default False) – Use case sensitive search.
Returns:

List of HTMLElement instances matching your criteria.

Return type:

list
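
The depth-first order can be illustrated without the library. The sketch below is not dhtmlparser's implementation: Node, label and find_all_depth_first are hypothetical stand-ins, and matching is reduced to comparing names.

```python
class Node(object):
    """Hypothetical stand-in for an element: a name, a label and children."""

    def __init__(self, name, label=None, childs=None):
        self.name = name
        self.label = label
        self.childs = childs or []


def find_all_depth_first(node, name):
    """Pre-order search: test the node itself, then each subtree in turn."""
    found = [node] if node.name == name else []
    for child in node.childs:
        found.extend(find_all_depth_first(child, name))
    return found


# Tree: root -> [a -> [x1 -> [x2]], x3]
x2 = Node("x", "x2")
x1 = Node("x", "x1", [x2])
x3 = Node("x", "x3")
root = Node("root", childs=[Node("a", childs=[x1]), x3])

depth_first_labels = [n.label for n in find_all_depth_first(root, "x")]
print(depth_first_labels)  # ['x1', 'x2', 'x3']
```

Here x1 and its descendant x2 are reported before the sibling x3, because each subtree is exhausted before the next one is visited.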

findAllB(tag_name, params=None, fn=None, case_sensitive=False)

Simple search engine using a breadth-first algorithm.

Parameters:
  • tag_name (str) – Name of the tag you are looking for. Set to “” if you wish to use only fn parameter.
  • params (dict, default None) – Parameters which have to be present in tag to be considered matching.
  • fn (function, default None) – Use this function to match tags. Function expects one parameter which is HTMLElement instance.
  • case_sensitive (bool, default False) – Use case sensitive search.
Returns:

List of HTMLElement instances matching your criteria.

Return type:

list
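
A breadth-first search visits elements level by level, so shallow matches come first. The sketch below shows the idea on a small tree; Node, label and find_all_breadth_first are hypothetical stand-ins, not the library's code, and matching is reduced to comparing names.

```python
from collections import deque


class Node(object):
    """Hypothetical stand-in for an element: a name, a label and children."""

    def __init__(self, name, label=None, childs=None):
        self.name = name
        self.label = label
        self.childs = childs or []


def find_all_breadth_first(root, name):
    """Level-order search driven by a FIFO queue."""
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.name == name:
            found.append(node)
        # Children wait behind everything already queued on this level.
        queue.extend(node.childs)
    return found


# Tree: root -> [a -> [x1 -> [x2]], x3]
x2 = Node("x", "x2")
x1 = Node("x", "x1", [x2])
x3 = Node("x", "x3")
root = Node("root", childs=[Node("a", childs=[x1]), x3])

breadth_first_labels = [n.label for n in find_all_breadth_first(root, "x")]
print(breadth_first_labels)  # ['x3', 'x1', 'x2']
```

The shallow x3 is reported before the deeper x1 and x2, whereas a depth-first search on this tree would reach x1 and x2 first.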

wfind(tag_name, params=None, fn=None, case_sensitive=False)

This method works the same as find(), but searches only one level of the childs.

This allows wfind() calls to be chained:

>>> dom = dhtmlparser.parseString('''
... <root>
...     <some>
...         <something>
...             <xe id="wanted xe" />
...         </something>
...         <something>
...             asd
...         </something>
...         <xe id="another xe" />
...     </some>
...     <some>
...         else
...         <xe id="yet another xe" />
...     </some>
... </root>
... ''')
>>> xe = dom.wfind("root").wfind("some").wfind("something").find("xe")
>>> xe
[<dhtmlparser.htmlelement.HTMLElement object at 0x8a979ac>]
>>> str(xe[0])
'<xe id="wanted xe" />'
Parameters:
  • tag_name (str) – Name of the tag you are looking for. Set to “” if you wish to use only fn parameter.
  • params (dict, default None) – Parameters which have to be present in tag to be considered matching.
  • fn (function, default None) – Use this function to match tags. Function expects one parameter which is HTMLElement instance.
  • case_sensitive (bool, default False) – Use case sensitive search.
Returns:

Blank HTMLElement with all matches in childs property.

Return type:

obj

Note

The returned element also has its _container property set to True.
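
The container idea can be sketched without the library. In the example below, Node and wfind_sketch are hypothetical stand-ins, and the step that looks one level below every element held by a container is an assumption about how chaining could work, not a description of the actual implementation.

```python
class Node(object):
    """Hypothetical stand-in for an element, with a container flag."""

    def __init__(self, name="", childs=None, container=False):
        self.name = name
        self.childs = childs or []
        self.container = container


def wfind_sketch(node, name):
    """Search a single level and wrap the matches in a blank container."""
    if node.container:
        # A container holds earlier matches: look one level below each of them.
        candidates = [c for held in node.childs for c in held.childs]
    else:
        candidates = node.childs
    matches = [c for c in candidates if c.name == name]
    return Node(childs=matches, container=True)


dom = Node(childs=[
    Node("root", [
        Node("some", [Node("something", [Node("xe")]), Node("xe")]),
        Node("some", [Node("xe")]),
    ]),
])

somethings = wfind_sketch(wfind_sketch(wfind_sketch(dom, "root"), "some"), "something")
wanted = wfind_sketch(somethings, "xe")
```

Because every call returns a container, the next call in the chain only ever sees the children of the previous matches. That is why only the xe nested under something is found here.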

match(*args, **kwargs)

wfind() is a nice function, but it is still rather long to use, because you have to chain all the calls together manually and in the end you get an HTMLElement container instance.

This function recursively calls wfind() for you and in the end you get a list of matching elements:

xe = dom.match("root", "some", "something", "xe")

is an alternative to:

xe = dom.wfind("root").wfind("some").wfind("something").wfind("xe")

You can use all arguments used in wfind():

dom = dhtmlparser.parseString('''
    <root>
        <div id="1">
            <div id="5">
                <xe id="wanted xe" />
            </div>
            <div id="10">
                <xe id="another wanted xe" />
            </div>
            <xe id="another xe" />
        </div>
        <div id="2">
            <div id="20">
                <xe id="last wanted xe" />
            </div>
        </div>
    </root>
''')

xe = dom.match(
    "root",
    {"tag_name": "div", "params": {"id": "1"}},
    ["div", {"id": "5"}],
    "xe"
)

assert len(xe) == 1
assert xe[0].params["id"] == "wanted xe"

Parameters: absolute (bool, default None) – If True, the first element will be searched from the root of the DOM. If None, the _container attribute will be used to decide the value of this argument. If False, find() will be called first to find the first element, then wfind() will be used to progress to the next arguments.
Returns: List of matching elements (empty if no matching element is found).
Return type: list
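
The example above passes path steps in three shapes: a plain string, a dict of keyword arguments, and a list of positional arguments. One way such steps could be normalized before being handed to a wfind()-like call is sketched below. normalize_step is a hypothetical helper, and how the library itself dispatches these forms is assumed here, not taken from its source.

```python
def normalize_step(step):
    """Turn one path step into an (args, kwargs) pair."""
    if isinstance(step, dict):
        # {"tag_name": ..., "params": ...} maps to keyword arguments.
        return (), dict(step)
    if isinstance(step, (list, tuple)):
        # ["div", {"id": "5"}] maps to positional arguments.
        return tuple(step), {}
    # A bare string is just the tag name.
    return (step,), {}


path = [
    "root",
    {"tag_name": "div", "params": {"id": "1"}},
    ["div", {"id": "5"}],
    "xe",
]
calls = [normalize_step(step) for step in path]
```

Each resulting pair could then be applied as some_wfind(*args, **kwargs), one step after another.
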

isTag()

Returns: True if the element is considered to be an HTML tag.
Return type: bool

isEndTag()

Returns: True if the element is an end tag (</endtag>).
Return type: bool

isNonPairTag(isnonpair=None)

True if the element is listed in the nonpair tag table (br for example) or if it ends with /> (<hr /> for example).

You can also change the state from pair to nonpair if you use this method as a setter.

Parameters: isnonpair (bool, default None) – If set, the internal nonpair state is changed.
Returns: True if the tag is nonpair.
Return type: bool
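
The two conditions named above (membership in the nonpair table, or a trailing />) can be sketched as a standalone check on a raw tag string. looks_nonpair is a hypothetical helper. Lowercasing the tag name and the simple name extraction are assumptions, not the library's parsing logic.

```python
# Same values as the NONPAIR_TAGS list documented above.
NONPAIR_TAGS = ["br", "hr", "img", "input", "meta", "spacer", "frame", "base"]


def looks_nonpair(tag):
    """Return True for tags such as '<br>' or '<xe id="1" />'."""
    tag = tag.strip()
    if tag.endswith("/>"):
        return True
    # Drop the angle brackets, then take the first word as the tag name.
    inner = tag.strip("<>/ ")
    name = inner.split()[0].lower() if inner else ""
    return name in NONPAIR_TAGS


print(looks_nonpair("<br>"))    # True
print(looks_nonpair("<hr />"))  # True
print(looks_nonpair("<tag>"))   # False
```

With NONPAIR_TAGS set to an empty list, as suggested for XML, only the trailing /> would mark a tag as nonpair in this sketch.
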
isPairTag()

Returns: True if this is a pair tag - <body> .. </body> for example.
Return type: bool

isOpeningTag()

Detect whether this tag is opening or not.

Returns: True if it is opening.
Return type: bool

isEndTagTo(opener)

Parameters: opener (obj) – HTMLElement instance.
Returns: True if this element is the endtag to opener.
Return type: bool

isComment()

Returns: True if this element is encapsulating an HTML comment.
Return type: bool

tagToString()

Get the HTML representation of the tag, but only the tag, not the childs or endtag.

Returns: HTML representation.
Return type: str

toString()

Returns an almost original string (use original = True if you want an exact copy).

If you want a prettified string, try prettify().

Returns: Complete representation of the element with childs, endtag and so on.
Return type: str

getTagName()

Returns: Tag name, or the whole element in case of normal text (not isTag()).
Return type: str

getContent()

Returns: Content of the tag (everything between opener and endtag).
Return type: str

prettify(depth=0, separator=' ', last=True, pre=False, inline=False)

Same as toString(), but returns prettified element with content.

Note

This method is partially broken, and can sometimes create unexpected results.

Returns: Prettified string.
Return type: str

containsParamSubset(params)

Test whether this element contains at least all of the given params (it may contain more).

Parameters: params (dict/SpecialDict) – Subset of parameters.
Returns: True if all params are contained in this element.
Return type: bool
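
The subset test can be sketched on plain dictionaries. contains_param_subset is a hypothetical helper that assumes exact key and value equality. Any special key handling done by SpecialDict is not modelled.

```python
def contains_param_subset(element_params, wanted):
    """True if every wanted key is present with an equal value."""
    return all(
        key in element_params and element_params[key] == value
        for key, value in wanted.items()
    )


element_params = {"id": "1", "class": "box"}

print(contains_param_subset(element_params, {"id": "1"}))                # True
print(contains_param_subset(element_params, {"id": "2"}))                # False
print(contains_param_subset(element_params, {"id": "1", "href": "#"}))   # False
```

An empty wanted dictionary matches any element in this sketch, since there is nothing left to check.
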
isAlmostEqual(tag_name, params=None, fn=None, case_sensitive=False)

Compare the element with the given tag_name, params and/or lambda function fn.

The lambda function is the same as in find().

Parameters:
  • tag_name (str) – Compare just name of the element.
  • params (dict, default None) – Compare also parameters.
  • fn (function, default None) – Function which will be used for matching.
  • case_sensitive (bool, default False) – Use case sensitive matching of the tag_name.
Returns:

True if two elements are almost equal.

Return type:

bool

replaceWith(el)

Replace the values in this element with the values from el.

This is useful when you don’t want to change all references to the object.

Parameters: el (obj) – HTMLElement instance.

removeChild(child, end_tag_too=True)

Remove subelement (child) specified by reference.

Note

This can’t be used for removing subelements by value! If you want to do that, try:

for e in dom.find("value"):
    dom.removeChild(e)
Parameters:
  • child (obj) – HTMLElement instance which will be removed from this element.
  • end_tag_too (bool, default True) – Remove also child endtag.
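
The difference between removing by reference and removing by value can be shown with plain Python objects. remove_by_reference is a hypothetical helper, and two equal lists stand in for two distinct elements with the same content. This is only an illustration of identity versus equality, not the library's code.

```python
def remove_by_reference(childs, child):
    """Return a new list without the exact object `child` (identity test)."""
    return [c for c in childs if c is not child]


first = ["content"]
second = ["content"]   # equal to `first`, but a different object
childs = [first, second]

remaining = remove_by_reference(childs, first)

print(first == second)         # True
print(first is second)         # False
print(remaining[0] is second)  # True
```

Only the object that was passed in is dropped. An equal but distinct element stays in place, which is why a search has to come first when removing by value.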
