Posted to user@nutch.apache.org by Kevin MacDonald <ke...@hautesecure.com> on 2008/09/14 00:53:07 UTC

Optimizing nutch

Hello,

I need to configure nutch to be as fast as possible while operating on a
single machine. My primary purpose is to dump the link database and analyze
links leading from each of the urls that I want crawled. I do not need the
indexing or searching capabilities of nutch right now.

What I see in the logs is rather strange. I am crawling approximately 3500
urls to a depth of 1 only. All of the fetching operations complete in just
over 3 minutes, which is about 1000 fetches per minute. That seems very
reasonable. However, following that there are long periods of inactivity.
From the last fetch to when I see "fetcher.Fetcher - Fetcher: done" about 10
minutes elapse with no log activity and the CPU sitting at zero
utilization! It then takes about an additional 5 minutes to update the
CrawlDb. I have tried this using 10 threads and 100 threads. The results are
similar.

Can anyone explain what is happening here? What would cause nutch to sit for
so long doing nothing?

Kevin
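
To put the numbers above side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in the message (about 3500 URLs, roughly 3 minutes of fetching, about 10 idle minutes, about 5 minutes for the CrawlDb update); the variable and function names are made up for illustration, and "just over 3 minutes" is rounded down to 3, so treat the results as approximate.

```python
# Rough timing arithmetic using only the approximate figures quoted in the post.
# All names here are illustrative; none of them come from Nutch itself.

def pages_per_minute(pages: int, minutes: float) -> float:
    """Average throughput over a time window."""
    return pages / minutes

URLS = 3500          # "approximately 3500 urls"
FETCH_MIN = 3.0      # "just over 3 minutes" (rounded down, so the rate is an upper bound)
IDLE_MIN = 10.0      # silence before "Fetcher: done"
UPDATE_MIN = 5.0     # CrawlDb update

total_min = FETCH_MIN + IDLE_MIN + UPDATE_MIN

fetch_rate = pages_per_minute(URLS, FETCH_MIN)     # rate while actually fetching
overall_rate = pages_per_minute(URLS, total_min)   # rate over the whole run
idle_share = IDLE_MIN / total_min                  # fraction of the run spent idle

print(f"fetch phase : {fetch_rate:.0f} pages/min")
print(f"end to end  : {overall_rate:.0f} pages/min")
print(f"idle share  : {idle_share:.0%} of {total_min:.0f} minutes")
```

With these rounded inputs the idle period accounts for more than half of the roughly 18-minute run, which is why the end-to-end rate comes out far below the fetch-phase rate.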

Re: Optimizing nutch

Posted by Kevin MacDonald <ke...@hautesecure.com>.

Nutch performs a great many operations like the ones shown below, taken from
the logs. These seem to eat up a lot of time after fetching is done.

mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes read at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes written at 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$Counte
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input records at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output records at 1
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input bytes at 2
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output bytes at 3
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine input records at 4
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine output records at 5
apred.LocalJobRunner (LocalJobRunner.java:statusUpdate(258)) - 2613 pages, 613 errors, 6.7 pages/s, 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes read at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes written at 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$Counte
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input records at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output records at 1
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input bytes at 2
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output bytes at 3
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine input records at 4
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine output records at 5
apred.LocalJobRunner (LocalJobRunner.java:statusUpdate(258)) - 2613 pages, 613 errors, 6.7 pages/s, 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes read at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes written at 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$Counte
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input records at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output records at 1
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input bytes at 2
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output bytes at 3
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine input records at 4
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine output records at 5
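
The only lines in that excerpt carrying crawl progress are the LocalJobRunner status updates, so one way to cut through the counter noise is to pull just those out. Below is a small Python sketch that does this with a regular expression. The pattern is inferred purely from the status line shown above (which is cut off after "pages/s, 1"), not from the Nutch source, so it may need adjusting for other versions; the function name and dictionary keys are made up for illustration.

```python
import re

# Pattern inferred from the status line in the log excerpt above; the tail of
# that line is truncated in the post, so only the first three fields are parsed.
STATUS_RE = re.compile(
    r"(?P<pages>\d+) pages,\s+(?P<errors>\d+) errors,\s+"
    r"(?P<rate>\d+(?:\.\d+)?) pages/s"
)

def parse_status(line: str):
    """Return the numbers from a status line, or None for any other log line."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    pages = int(match.group("pages"))
    errors = int(match.group("errors"))
    return {
        "pages": pages,
        "errors": errors,
        "pages_per_sec": float(match.group("rate")),
        # Ratio of reported errors to reported pages; 0.0 guards against division by zero.
        "error_ratio": errors / pages if pages else 0.0,
    }

# Sample lines copied from the excerpt above.
sample = ("apred.LocalJobRunner (LocalJobRunner.java:statusUpdate(258)) - "
          "2613 pages, 613 errors, 6.7 pages/s, 1")
noise = "mapred.Counters (Counters.java:getCounter(190)) - Adding Map input records at 0"

status = parse_status(sample)
print(status)
print(parse_status(noise))
```

For what it is worth, the 6.7 pages/s in that status line works out to roughly 400 pages per minute, and 613 errors against 2613 pages is a ratio of about 23%.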

RE: Optimizing nutch

Posted by zhengping deng <de...@hotmail.com>.

I have 4 machines for testing nutch speed, but found it too slow. I am as eager as you are to improve it.
 