Module orchid :: Class NaiveAnalyzer

Type NaiveAnalyzer

object --+        
         |        
  _Verbose --+    
             |    
        Thread --+
                 |
                NaiveAnalyzer

Known Subclasses:
Malcontent

This is the interface for analyzers. To use the Ugrah crawler, subclass this class with your own logic for analyzing the gathered data and choosing which links to crawl.
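For example, a subclass might look roughly like the following sketch (the overridden method names come from this interface; the keyword filter, the site.url attribute, and the call back into the base class are illustrative assumptions only):

    from orchid import NaiveAnalyzer

    class KeywordAnalyzer(NaiveAnalyzer):
        """Example analyzer that only follows links containing a keyword."""

        def analyzeSite(self, db, site):
            # db follows the (siteDb, siteDbLock) convention described in __init__.
            siteDb, siteDbLock = db
            siteDbLock.acquire()
            try:
                siteDb[site.url] = site          # 'url' attribute is assumed
            finally:
                siteDbLock.release()

        def addSiteToFetchQueue(self, lfs):
            # Keep only links that mention the keyword, then queue them via the
            # default behaviour of the base class.
            interesting = [link for link in lfs if 'wiki' in link]
            NaiveAnalyzer.addSiteToFetchQueue(self, interesting)

        def report(self):
            print('Processed %d sites' % self.getNumSitesProcessed())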
Method Summary
  __init__(self, linksToFetchAndCond, siteQueueAndCond, db)
Creates a new analyzer.
  addSiteToFetchQueue(self, lfs)
Adds links to the fetch queue.
  analyzeSite(self, db, site)
Processes the site and adds it to the db.
  getNumSitesProcessed(self)
Returns the number of sites this analyzer has processed.
  reorganizeByDomain(self, listOfLinks)
Returns a map which maps domain names to links inside the domain.
  report(self)
A real analyzer should override this method.
  run(self)
Performs the main function of the analyzer.
  selectNextUrl(self)
Chooses the next url to crawl to.
  setStopCondition(self, val)
Sets the stop condition to the specified value.
    Inherited from Thread
  __repr__(self)
  getName(self)
  isAlive(self)
  isDaemon(self)
  join(self, timeout)
  setDaemon(self, daemonic)
  setName(self, name)
  start(self)
    Inherited from object
  __delattr__(...)
x.__delattr__('name') <==> del x.name
  __getattribute__(...)
x.__getattribute__('name') <==> x.name
  __hash__(x)
x.__hash__() <==> hash(x)
  __new__(T, S, ...)
T.__new__(S, ...) -> a new object with type S, a subtype of T
  __reduce__(...)
helper for pickle
  __reduce_ex__(...)
helper for pickle
  __setattr__(...)
x.__setattr__('name', value) <==> x.name = value
  __str__(x)
x.__str__() <==> str(x)

Method Details

__init__(self, linksToFetchAndCond, siteQueueAndCond, db)
(Constructor)

Creates a new analyzer. There can be as many analyzers as you like, depending on the type of processing of data you wish to do.
Parameters:
linksToFetchAndCond - A tuple of the form (map, Condition), where map maps domains to the links to be fetched (the analyzer updates it) and Condition is a threading.Condition object used to synchronize access to the map.
siteQueueAndCond - A tuple (siteQueue, siteCond). The siteQueue is a queue into which new sites (with their links) are inserted. The siteCond is a Condition object used to lock the queue and to signal the analyzer to wake up.
db - A tuple of the form (siteDb, siteDbLock), where siteDb is the global site database and siteDbLock synchronizes access both to it and to the global link database, linkDb.
Overrides:
threading.Thread.__init__
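A minimal sketch of how these three arguments might be wired together and the thread started (the exact queue and database types are not specified by this documentation; plain dicts and lists guarded by the given locks are assumed here):

    import threading
    from orchid import NaiveAnalyzer

    linksToFetchAndCond = ({}, threading.Condition())   # domain -> list of links, plus its lock
    siteQueueAndCond = ([], threading.Condition())      # crawled sites waiting for analysis
    db = ({}, threading.Lock())                          # (siteDb, siteDbLock)

    analyzer = NaiveAnalyzer(linksToFetchAndCond, siteQueueAndCond, db)
    analyzer.setDaemon(True)
    analyzer.start()

    # Later: ask the analyzer to stop and wake it so it can notice the flag.
    analyzer.setStopCondition(True)
    siteQueueAndCond[1].acquire()
    siteQueueAndCond[1].notify()
    siteQueueAndCond[1].release()
    analyzer.join()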

addSiteToFetchQueue(self, lfs)

Adds links to the fetch queue. A real analyzer should override this method.

analyzeSite(self, db, site)

Processes the site and adds it to the db. Any real analyzer should override this method with its own logic.

getNumSitesProcessed(self)

Returns the number of sites this analyzer has processed.

reorganizeByDomain(self, listOfLinks)

Returns a map which maps domain names to links inside the domain.
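A rough equivalent of what this helper does, assuming the links are URL strings (the actual implementation in orchid may differ):

    import urlparse

    def reorganizeByDomain(listOfLinks):
        byDomain = {}
        for link in listOfLinks:
            domain = urlparse.urlsplit(link)[1]          # network location, e.g. 'example.org'
            byDomain.setdefault(domain, []).append(link)
        return byDomain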

report(self)

A real analyzer should override this method. Outputs the results of the analysis so far.

run(self)

Performs the main function of the analyzer. In this case, just adds all the hyperlinks to the toFetch queue.
Overrides:
threading.Thread.run
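The behaviour described above roughly corresponds to a consumer loop of the following shape (attribute names such as self.siteQueueAndCond, self.db, self.stop and site.links are assumed for illustration and are not taken from the orchid source):

    def run(self):
        siteQueue, siteCond = self.siteQueueAndCond
        while not self.stop:                       # flag toggled by setStopCondition()
            siteCond.acquire()
            while not siteQueue and not self.stop:
                siteCond.wait()                    # sleep until the crawler enqueues a site
            site = None
            if siteQueue:
                site = siteQueue.pop(0)
            siteCond.release()
            if site is not None:
                self.analyzeSite(self.db, site)
                self.addSiteToFetchQueue(site.links)   # push its hyperlinks to the toFetch map
        self.report()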

selectNextUrl(self)

Chooses the next url to crawl to. This implementation will select a random domain and then crawl to the first link in that domain's queue.
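For illustration, the strategy described above might look like the sketch below (self.linksToFetchAndCond is an assumed attribute name holding the (map, Condition) pair passed to the constructor):

    import random

    def selectNextUrl(self):
        toFetch, cond = self.linksToFetchAndCond
        cond.acquire()
        try:
            if not toFetch:
                return None
            domain = random.choice(list(toFetch.keys()))   # pick a random domain
            url = toFetch[domain].pop(0)                   # first queued link in that domain
            if not toFetch[domain]:
                del toFetch[domain]                        # drop emptied domain entries
            return url
        finally:
            cond.release()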

setStopCondition(self, val)

Sets the stop condition to the specified value. Should be True to stop the analyzer thread.
