Posted to dev@nutch.apache.org by Mehmet Tan <me...@agmlab.com> on 2005/09/14 09:34:39 UTC

Depth notion

 
Hi,

I want to ask a general question about Nutch.

Nutch performs a crawl in distinct steps. In a normal crawl
you first run the generate step, then the fetch step, and
then the updatedb step. (I use the word 'depth' here to mean
one pass through this three-step procedure.) But when other
concepts, for example re-visiting of sites, are added to Nutch,
the concept of depth not only becomes meaningless, it also
gets in the way of implementing a re-visit policy easily,
because you cannot re-visit sites before the current fetch
step is over, and a fetch step can last forever.

So what is the rationale behind this design? Is it really that
important to keep the steps distinct?
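
To make the question concrete, here is a minimal sketch of the cycle as
I picture it. It is a toy model only: WEB, generate, fetch, updatedb and
crawl are hypothetical names for illustration, not Nutch's actual API,
and the "web" is a small in-memory link graph.

```python
# Toy model of the stepwise crawl. All names are hypothetical stand-ins,
# not Nutch's real API; WEB is an assumed in-memory link graph.
WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": [],
}


def generate(crawl_db):
    """Build the fetch list: every known URL that is not fetched yet."""
    return sorted(url for url, fetched in crawl_db.items() if not fetched)


def fetch(fetch_list):
    """Fetch the whole list; nothing is visible until the step ends."""
    return {url: WEB.get(url, []) for url in fetch_list}


def updatedb(crawl_db, segment):
    """Mark fetched URLs and record newly discovered links as unfetched."""
    for url, outlinks in segment.items():
        crawl_db[url] = True
        for link in outlinks:
            crawl_db.setdefault(link, False)


def crawl(seeds, depth):
    """Run `depth` rounds of generate -> fetch -> updatedb."""
    crawl_db = {url: False for url in seeds}
    for _ in range(depth):
        segment = fetch(generate(crawl_db))
        updatedb(crawl_db, segment)
    return crawl_db
```

With the single seed "a", crawl(["a"], 2) fetches a in the first round
and b and c in the second; d is discovered but stays unfetched until a
third round. In this model, a page can only be selected again at the
next generate call, which is the restriction I am asking about.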

Thanks for comments,

Mehmet