Posted to dev@nutch.apache.org by Anton Potehin <an...@orbita1.ru> on 2005/11/23 11:44:34 UTC

mapred crawl

We have been using Nutch for whole-web crawling.

We run the following tasks in an infinite loop:

1) bin/nutch generate db <segmentsPath> -topN 10000

2) bin/nutch fetch <segment name>

3) bin/nutch updatedb db <segment name>  

4) bin/nutch analyze db <segment name>

5) bin/nutch index <segment name>

6) bin/nutch dedup segments dedup.tmp

 

Each iteration produces a new segment, which we can use for search
right away.
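As a rough sketch only (not our exact script), one pass of this loop could be wrapped in a shell function like the one below. The variable names NUTCH, DB and SEGMENTS, the function name crawl_iteration, and the way the newest segment is picked (last entry of a sorted directory listing) are assumptions made for illustration; the bin/nutch command lines are the six listed above.

```shell
#!/bin/sh
# Sketch: one pass of the generate/fetch/updatedb/analyze/index/dedup cycle.
# NUTCH, DB and SEGMENTS are hypothetical names, overridable from the environment.
NUTCH=${NUTCH:-bin/nutch}
DB=${DB:-db}
SEGMENTS=${SEGMENTS:-segments}

crawl_iteration() {
    # 1) generate a new segment holding up to 10000 URLs
    "$NUTCH" generate "$DB" "$SEGMENTS" -topN 10000 || return 1

    # Assumption: segment directory names sort chronologically,
    # so the newest segment is the last entry of the listing.
    segment=$(ls -d "$SEGMENTS"/* 2>/dev/null | tail -n 1)
    [ -n "$segment" ] || return 1

    # 2) to 5) fetch, update the db, analyze, and index that segment
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$DB" "$segment" || return 1
    "$NUTCH" analyze "$DB" "$segment" || return 1
    "$NUTCH" index "$segment" || return 1

    # 6) remove duplicates across all segments
    "$NUTCH" dedup "$SEGMENTS" dedup.tmp || return 1
}
```

Calling it as `while crawl_iteration; do :; done` would repeat the cycle until one of the steps fails; after every successful pass the newest segment is indexed and available for search.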

 

Now we are trying mapred. How can we use crawl in a similar way? We need
results while the crawl is in progress, not only at the end of it, since
a full crawl is a very long process (weeks).