Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/15 22:58:06 UTC
[Nutch Wiki] Update of "CrawlDatumStates" by AndrzejBialecki
The "CrawlDatumStates" page has been changed by AndrzejBialecki.
http://wiki.apache.org/nutch/CrawlDatumStates
--------------------------------------------------
New page:
Note: the information here is specific to Nutch 1.x. Conceptually the state machine should be identical in Nutch 2.0, but the implementation details are different.
Nutch 1.x maintains the state of pages in CrawlDb, which is updated by several tools:
* Injector - to populate CrawlDb with new URLs
* Generator - to generate new fetchlists, and optionally mark those URLs in CrawlDb as "being in the process of fetching"
* CrawlDb update - to update CrawlDb with new information about URLs it already contains, and to add new URLs discovered from page outlinks.
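The division of labor among these three tools can be sketched as a toy model. This is an illustration only, not Nutch's Java API: the status labels, the function names (inject, generate, update) and the dict-based "CrawlDb" are all assumptions made for this sketch, and real Nutch statuses are more numerous than the three shown here.

```python
# Toy model of the three tools that modify CrawlDb.
# All names below are hypothetical and chosen for this sketch only.
UNFETCHED = "unfetched"
GENERATED = "generated"   # "being in the process of fetching"
FETCHED = "fetched"


def inject(crawldb, seed_urls):
    """Injector: add URLs that are not yet known, as unfetched."""
    for url in seed_urls:
        crawldb.setdefault(url, UNFETCHED)


def generate(crawldb, top_n, mark=True):
    """Generator: build a fetchlist and optionally mark its URLs."""
    fetchlist = sorted(u for u, s in crawldb.items() if s == UNFETCHED)[:top_n]
    if mark:
        for url in fetchlist:
            crawldb[url] = GENERATED
    return fetchlist


def update(crawldb, fetched):
    """CrawlDb update: 'fetched' maps each fetched URL to its outlinks."""
    for url, outlinks in fetched.items():
        crawldb[url] = FETCHED                  # new knowledge about a known URL
        for link in outlinks:
            crawldb.setdefault(link, UNFETCHED)  # newly discovered URL


db = {}
inject(db, ["http://example.com/"])
batch = generate(db, top_n=10)
update(db, {"http://example.com/": ["http://example.com/about"]})
```

After one such cycle the seed URL is recorded as fetched and the discovered outlink is waiting as unfetched, ready for the next generate step.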
Below is a state diagram of CrawlDatum, the class that holds this state in CrawlDb.