Posted to dev@nutch.apache.org by Ken van Mulder <ke...@wavefire.com> on 2005/10/25 00:19:42 UTC
mapred questions
Hey folks,
I'm using the latest (or close to it) development version of Nutch, and I'm
running into a couple of problems with crawling.
Using:
$ bin/nutch fetch segments/2005***
First is that the fetcher slows down over time and continues to use more
and more memory as it goes (which I think is eventually hanging the
process). Over the course of 2 hours it went from 20 pages/s to 13
pages/s and from ~66M to ~340M resident memory. Before the process hung,
it had grabbed 150k pages with about 28k errors. Is the slowdown and
increased usage of memory expected? Is there a method for dealing with this?
The second problem is with the crawl command. I've tried it with a seeds/url
file containing 4, 2000 and then 100k URLs. Using:
$ bin/nutch crawl seeds
Which goes through its processing and completes, but doesn't visit any
of the urls in the seeds file. What am I missing to get it to actually
do the crawl?
Thanks,
--
Ken van Mulder
Wavefire Technologies Corporation
http://www.wavefire.com
250.717.0200 (ext 113)
Re: mapred questions
Posted by Doug Cutting <cu...@nutch.org>.
Ken van Mulder wrote:
> First is that the fetcher slows down over time and continues to use more
> and more memory as it goes (which I think is eventually hanging the
> process).
What parser plugins do you have enabled? These are usually the culprit.
Try using 'kill -QUIT' to see what various threads are doing, both at
the start and later, when it slows and grows.
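A minimal sketch of that start-versus-later comparison, assuming a Unix-like
system with a POSIX-style ps and a fetcher running in a JVM whose process ID
you already know. The helper names rss_kb and dump_threads are hypothetical
and are not part of Nutch.

```shell
# Hypothetical helper: print the resident set size of a process in kilobytes.
# Uses the POSIX "ps -o rss= -p PID" form; the empty header suppresses the title line.
rss_kb() {
    ps -o rss= -p "$1" | tr -d ' '
}

# Hypothetical helper: send SIGQUIT to a process.
# A HotSpot JVM responds by printing a stack trace of every thread to its
# standard output and then keeps running. Only point this at a JVM: most
# other programs terminate on SIGQUIT.
dump_threads() {
    kill -QUIT "$1"
}
```

With the fetcher's PID in a variable (for example FETCHER_PID, found with ps),
running 'rss_kb "$FETCHER_PID"' followed by 'dump_threads "$FETCHER_PID"' once
near the start and again after the slowdown gives two memory readings and two
thread dumps to compare. The dumps appear wherever the fetcher's standard
output is going, not in the terminal that sent the signal.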
> Second problem is trying to use the crawl. I've tried with a seeds/url
> file containing 4, 2000 and then 100k urls in it. Using:
>
> $ bin/nutch crawl seeds
>
> Which goes through its processing and completes, but doesn't visit any
> of the urls in the seeds file. What am I missing to get it to actually
> do the crawl?
Are you using NDFS? If so, the seeds directory needs to be stored in
NDFS. Use 'bin/nutch ndfs -put seeds seeds'.
Doug