Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/15 22:58:06 UTC

[Nutch Wiki] Update of "CrawlDatumStates" by AndrzejBialecki

The "CrawlDatumStates" page has been changed by AndrzejBialecki.
http://wiki.apache.org/nutch/CrawlDatumStates

--------------------------------------------------

New page:
Note: the information here is specific to Nutch 1.x. Conceptually the state machine should be identical in Nutch 2.0, but the implementation details differ.

Nutch 1.x maintains the state of pages in CrawlDb, which is updated by several tools:

* Injector - populates CrawlDb with new URLs.
* Generator - generates new fetchlists, and optionally marks those URLs in CrawlDb as "being in the process of fetching".
* CrawlDb update - updates CrawlDb with new information about URLs it already contains, and adds new URLs discovered from page outlinks.
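
The division of labour between these three tools can be sketched with a toy model. This is only an illustration, not Nutch code: a plain dict stands in for CrawlDb, the status strings are made-up labels rather than the real CrawlDatum constants, and the function names and the example.org URLs are hypothetical.

```python
# Toy model only: a dict stands in for CrawlDb, and the status strings are
# illustrative labels, not Nutch's actual CrawlDatum status constants.
UNFETCHED = "unfetched"
GENERATED = "generated"  # hypothetical marker for "being in the process of fetching"
FETCHED = "fetched"


def inject(crawldb, seed_urls):
    """Injector: add new URLs; URLs that are already known keep their state."""
    for url in seed_urls:
        crawldb.setdefault(url, UNFETCHED)


def generate(crawldb, limit, mark=True):
    """Generator: select up to `limit` unfetched URLs as a fetchlist.

    When `mark` is true, the selected URLs are also flagged in the db.
    """
    fetchlist = sorted(u for u, s in crawldb.items() if s == UNFETCHED)[:limit]
    if mark:
        for url in fetchlist:
            crawldb[url] = GENERATED
    return fetchlist


def update(crawldb, fetch_results):
    """CrawlDb update: record fetch outcomes and add newly discovered outlinks.

    `fetch_results` maps each fetched URL to the outlinks found on that page.
    """
    for url, outlinks in fetch_results.items():
        crawldb[url] = FETCHED
        for link in outlinks:
            crawldb.setdefault(link, UNFETCHED)


# One inject / generate / update cycle.
crawldb = {}
inject(crawldb, ["http://a.example.org/", "http://b.example.org/"])
fetchlist = generate(crawldb, limit=1)
update(crawldb, {fetchlist[0]: ["http://c.example.org/"]})
```

After this cycle the first seed URL is marked as fetched, while the second seed and the newly discovered outlink are both waiting to be fetched. The real tools track considerably more per-URL information than a single label.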

Below is a state diagram of CrawlDatum, the class that holds this state in CrawlDb.