Posted to dev@nutch.apache.org by Ken van Mulder <ke...@wavefire.com> on 2005/10/25 00:19:42 UTC

mapred questions

Hey folks,

I'm using the latest (or close to) development version of nutch, and I'm 
running into a couple of problems with crawling.

Using:

$ bin/nutch fetch segments/2005***

First is that the fetcher slows down over time and continues to use more 
and more memory as it goes (which I think is eventually hanging the 
process). Over the course of 2 hours it went from 20 pages/s to 13 
pages/s and from ~66M to ~340M resident memory. Before the process hung, 
it had grabbed 150k pages with about 28k errors. Is the slowdown and 
increased usage of memory expected? Is there a method for dealing with this?

Second problem is trying to use the crawl. I've tried with a seeds/url 
file containing 4, 2000 and then 100k urls in it. Using:

$ bin/nutch crawl seeds

Which goes through its processing and completes, but doesn't visit any 
of the urls in the seeds file. What am I missing to get it to actually 
do the crawl?

Thanks,

-- 
Ken van Mulder
Wavefire Technologies Corporation

http://www.wavefire.com
250.717.0200 (ext 113)

Re: mapred questions

Posted by Doug Cutting <cu...@nutch.org>.
Ken van Mulder wrote:
> First is that the fetcher slows down over time and continues to use more 
> and more memory as it goes (which I think is eventually hanging the 
> process).

What parser plugins do you have enabled?  These are usually the culprit.
Try sending 'kill -QUIT' to the fetcher's Java process to get a thread
dump showing what the various threads are doing, both at the start and
later, when it slows and grows.
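
A minimal sketch of that comparison, assuming the fetcher's process ID is already known. On Unix JVMs, SIGQUIT makes the process print a stack trace of every thread to its own standard output without stopping it. The helper name `snapshot` and the default log file name are made up for illustration; neither is part of Nutch.

```shell
# Hypothetical helper, not part of Nutch: record resident memory for a
# process and ask it for a thread dump. Assumes the PID given is the
# fetcher's JVM.
snapshot() {
    pid="$1"
    log="${2:-fetcher-rss.log}"
    # Resident set size in kilobytes, as reported by ps.
    rss=$(ps -o rss= -p "$pid" | tr -d ' ')
    echo "$(date '+%Y-%m-%d %H:%M:%S') pid=$pid rss_kb=$rss" >> "$log"
    # SIGQUIT asks a JVM to print all thread stacks to its own stdout.
    kill -QUIT "$pid"
}
```

Calling `snapshot <pid>` once near the start of the fetch and again after the slowdown gives two memory readings in the log and two thread dumps to compare. The dumps appear wherever the fetcher's stdout is going, not in the log file.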

> Second problem is trying to use the crawl. I've tried with a seeds/url 
> file containing 4, 2000 and then 100k urls in it. Using:
> 
> $ bin/nutch crawl seeds
> 
> Which goes through its processing and completes, but doesn't visit any 
> of the urls in the seeds file. What am I missing to get it to actually 
> do the crawl?

Are you using NDFS?  If so, the seeds directory needs to be stored in 
NDFS.  Use 'bin/nutch ndfs -put seeds seeds'.
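
Put together, the sequence might look like the sketch below. It only strings together the two commands already mentioned in this thread and applies only when NDFS is in use. The function name `seed_and_crawl` and the `NUTCH` variable are assumptions for illustration, not part of Nutch.

```shell
# Hypothetical wrapper, not part of Nutch. NUTCH points at the launcher.
NUTCH="${NUTCH:-bin/nutch}"

seed_and_crawl() {
    seeds_dir="$1"
    # Refuse to continue if the local seeds directory is missing or empty.
    if [ ! -d "$seeds_dir" ] || [ -z "$(ls -A "$seeds_dir")" ]; then
        echo "no seed files in $seeds_dir" >&2
        return 1
    fi
    # Copy the local directory into NDFS under the same name, then crawl.
    "$NUTCH" ndfs -put "$seeds_dir" "$seeds_dir" &&
    "$NUTCH" crawl "$seeds_dir"
}
```

Running `seed_and_crawl seeds` from the Nutch directory stops early if the local directory has no seed files, and skips the crawl if the copy into NDFS fails.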

Doug