Module orchid :: Class OrchidExtractor

Class OrchidExtractor


A class responsible for parsing and analyzing HTML content and extracting various forms of links from it. The goal is to provide the user with a data structure representing the parsed HTML and the set of links that appear on the given page. Supported link types are:
  1. simple <a href="...">link</a> type links (done)
  2. relative links
  3. BASE tag (done)
  4. LINK tag (done)
  5. FRAMESET links (done)
  6. IMG links (done)
  7. client side image maps (done)
  8. server side image maps (unsupported)
  9. Object links
  10. JavaScript links
  11. Meta refresh tags
  12. iframes
  13. form links
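
Link types 2 and 3 imply resolving a relative URL against either the page URL or a BASE href. The following is a minimal sketch of that resolution using the standard library's urljoin. The resolve_link helper and the example URLs are illustrative assumptions and are not taken from the orchid source.

```python
from urllib.parse import urljoin


def resolve_link(page_url, link, base_href=None):
    """Resolve `link` against the BASE href if one was found, else the page URL."""
    # Assumption: a BASE href may itself be relative to the page URL.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)


page = "http://example.com/dir/page.html"
relative = resolve_link(page, "img/a.png")
rooted = resolve_link(page, "/top")
with_base = resolve_link(page, "img/a.png", base_href="http://cdn.example.com/assets/")
```

Absolute links pass through urljoin unchanged, so the same helper can be applied to every extracted URL.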

Method Summary
  __init__(self)
Creates a new link extractor.
  extract(self)
Extracts all the links in the page according to the patterns specified in LINK_PATTERNS.
  getLinks(self)
Returns a map from link type to a list of links of that type that appeared in the page.
  getParsedContent(self)
Returns the BeautifulSoup data structure for the HTML of the site that was set using setSite.
  getRawContent(self)
  setSite(self, stringUrl, content)
Sets the current site URL and content for the extractor.

Class Variable Summary
list LINK_PATTERNS = [('regular', <_sre.SRE_Pattern object at...

Method Details

__init__(self)
(Constructor)

Creates a new link extractor. Should be followed by a call to setSite.

extract(self)

Extracts all the links in the page according to the patterns specified in LINK_PATTERNS. The links are stored in a map (link type -> URL list) called links, accessible as extractor.links, where extractor is an instance of OrchidExtractor.

getLinks(self)

Returns a map from link type to a list of links of that type that appeared in the page.

getParsedContent(self)

Returns the BeautifulSoup data structure for the HTML of the site that was set using setSite.

setSite(self, stringUrl, content)

Sets the current site URL and content for the extractor.
Parameters:
stringUrl - The URL of the site being analyzed.
content - The HTML content of the site.
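
The methods above suggest a call order of constructor, setSite, extract, then getLinks. The sketch below illustrates that order with a hypothetical stand-in class, ExtractorSketch, which mirrors the documented method names only. Its single regular expression and the ValueError raised when setSite was skipped are assumptions, not the behavior of OrchidExtractor.

```python
import re


class ExtractorSketch:
    """Hypothetical stand-in mirroring the documented method names."""

    # Assumption: one crude pattern covering <a ... href="..."> links only.
    _HREF = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)

    def __init__(self):
        self.url = None
        self.content = None
        self.links = {}

    def setSite(self, stringUrl, content):
        # Store the page URL and raw HTML, and reset any earlier results.
        self.url = stringUrl
        self.content = content
        self.links = {}

    def extract(self):
        if self.content is None:
            raise ValueError("call setSite() before extract()")
        # Map from link type to the list of URLs of that type.
        self.links = {"regular": self._HREF.findall(self.content)}

    def getLinks(self):
        return self.links


extractor = ExtractorSketch()
extractor.setSite(
    "http://example.com/",
    '<a href="/about">About</a> <a class="x" href="news.html">News</a>',
)
extractor.extract()
links = extractor.getLinks()
```

With the real class, the same four calls would presumably be made on an OrchidExtractor instance, and the resulting map would contain one key per supported link type.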

Class Variable Details

LINK_PATTERNS

Type:
list
Value:
[('regular',
  <_sre.SRE_Pattern object at 0xb7b7fb80>,
  {'href': <_sre.SRE_Pattern object at 0xb79d43e0>},
  'href'),
 ('base',
  <_sre.SRE_Pattern object at 0xb7b5ccc8>,
  {'href': <_sre.SRE_Pattern object at 0xb79d43e0>},
  'href'),
...                                                                    
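
Each entry shown above appears to be a 4-tuple: a link type name, a compiled pattern, a dictionary mapping an attribute name to a compiled pattern, and an attribute name. One plausible reading is (link type, tag-name pattern, required attributes, attribute to read). The sketch below applies tuples of that shape using only the standard library's html.parser. The patterns, the PatternLinkParser class, and the extract_links function are illustrative assumptions, since the actual regular expressions are not visible in the truncated value.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns following the 4-tuple layout of LINK_PATTERNS:
# (link type, tag-name regex, {attribute: value regex}, attribute to read)
ANY_VALUE = re.compile(r".+")
SKETCH_PATTERNS = [
    ("regular", re.compile(r"^a$"), {"href": ANY_VALUE}, "href"),
    ("base", re.compile(r"^base$"), {"href": ANY_VALUE}, "href"),
    ("image", re.compile(r"^img$"), {"src": ANY_VALUE}, "src"),
]


class PatternLinkParser(HTMLParser):
    """Collects attribute values from tags that match a pattern tuple."""

    def __init__(self, patterns):
        super().__init__()
        self.patterns = patterns
        self.links = {name: [] for name, _, _, _ in patterns}

    def handle_starttag(self, tag, attrs):
        # Drop valueless attributes such as <a href>.
        attr_map = {k: v for k, v in attrs if v is not None}
        for name, tag_re, attr_res, target in self.patterns:
            if not tag_re.match(tag):
                continue
            # Every required attribute must be present and match its regex.
            matched = all(
                k in attr_map and r.match(attr_map[k])
                for k, r in attr_res.items()
            )
            if matched and target in attr_map:
                self.links[name].append(attr_map[target])


def extract_links(html, patterns=SKETCH_PATTERNS):
    parser = PatternLinkParser(patterns)
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<base href="http://example.com/">'
    '<a href="/a">A</a><img src="x.png"><a name="top">t</a>'
)
found = extract_links(sample)
```

Keeping the patterns in a list of tuples means a new link type can be supported by adding one entry rather than new parsing code.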

Generated by Epydoc 2.1 on Mon Dec 12 14:30:34 2005 http://epydoc.sf.net