Posted to user@nutch.apache.org by Matteo Simoncini <si...@gmail.com> on 2012/09/11 14:37:32 UTC

Parallelize Fetching Phase

Hi everyone,

I'm running Nutch 1.5.1 using a script I created, but there is
a significant slowdown in the fetching phase.
My script uses 20 threads to fetch. Here is the fetch instruction:

bin/nutch fetch $segment -threads 20

It works, but it seems they are all fetching the same URL. Here is the log:

fetching http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
fetching http://www.eclap.eu/drupal/?q=en-US/node/2867/og/forum&sort=asc&order=Created
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
...

Is there a way to make each thread crawl a different URL?
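
A rough way to quantify what this log shows is to parse the status lines and compute what fraction of the active threads are spin-waiting on average. The sketch below is illustrative only: the `summarize` helper and the regular expression are assumptions based on the line format shown above, not part of Nutch.

```python
import re

# Matches the fetcher status lines as they appear in the log above.
STATUS = re.compile(
    r"-activeThreads=(\d+), spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)


def summarize(log_text):
    """Return (number of status lines, mean fraction of threads spin-waiting)."""
    fractions = []
    for match in STATUS.finditer(log_text):
        active = int(match.group(1))
        waiting = int(match.group(2))
        if active:
            fractions.append(waiting / active)
    if not fractions:
        return 0, 0.0
    return len(fractions), sum(fractions) / len(fractions)


# Two status lines copied from the log, used as sample input.
sample = """\
fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
"""

count, idle = summarize(sample)
print(f"{count} status lines, {idle:.1%} of threads waiting on average")
```

A value close to 100% means that almost every thread is idle most of the time, which matches the slowdown described here.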

Re: Hadoop and Nutch

Posted by Walter Tietze <ti...@neofonie.de>.
Hi,


I have the same problem using Nutch 1.5.1 with
Cloudera CDH4 and YARN.

There are already entries in the mail archive for
that:

http://www.mail-archive.com/user@nutch.apache.org/msg07182.html
http://www.mail-archive.com/user@nutch.apache.org/msg07184.html


I thought the problem had to do with CDH4 and YARN.


Please let me know if you have found a solution for that!



cheers, Walter





-- 

--------------------------------
Walter Tietze
Senior Software Engineer
Research

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T +49.30 24627 318
F +49.30 24627 120

Walter.Tietze@neofonie.de
http://www.neofonie.de

Commercial Register
Berlin-Charlottenburg: HRB 67460

Management:
Thomas Kitlitschko
--------------------------------


Re: Hadoop and Nutch

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hey,
Thank you for the reply.
The error stays the same :(


-- 
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Phone: + 49 (0) 351 21590834
Email: sscheffler@avantgarde-labs.de


Re: Hadoop and Nutch

Posted by Julien Nioche <li...@gmail.com>.
Hi Stefan,

You don't need to set HADOOP_CLASSPATH; just use the scripts provided
in runtime/deploy/bin

ant job => runtime/deploy/bin  => nutch crawl

J.



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Hadoop and Nutch

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

On Wed, Sep 12, 2012 at 1:41 PM, Stefan Scheffler
<ss...@avantgarde-labs.de> wrote:

> I try to run nutch 2.0 on a hadoop cluster and get the following exception.

>
> HADOOP_CLASSPATH=lib/apache-nutch-1.6-SNAPSHOT.jar hadoop
> org.apache.nutch.crawl.Crawl urls -dir test -depth 2 -topN 5

The Nutch versions are not consistent: you mention Nutch 2.0, but the
jar on your classpath is apache-nutch-1.6-SNAPSHOT. Please check and
get back to us.

Lewis

Hadoop and Nutch

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hi,
I am trying to run Nutch 2.0 on a Hadoop cluster and get the following
exception. I compiled Nutch from source and start it with:

HADOOP_CLASSPATH=lib/apache-nutch-1.6-SNAPSHOT.jar hadoop org.apache.nutch.crawl.Crawl urls -dir test -depth 2 -topN 5

12/09/12 14:34:20 INFO mapred.JobClient: Task Id : attempt_201208141240_0593_m_000001_2, Status : FAILED
java.lang.RuntimeException: Error in configuring object
     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
     at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.reflect.InvocationTargetException
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav

The error occurs when Hadoop starts the inject step. I have no idea
where the error comes from.
Does anyone have a clue?

Kind regards
Stefan Scheffler



RE: Parallelize Fetching Phase

Posted by Markus Jelsma <ma...@openindex.io>.
Please share the relevant part of the log.
 
-----Original message-----
> From:Matteo Simoncini <si...@gmail.com>
> Sent: Wed 12-Sep-2012 11:23
> To: user@nutch.apache.org
> Subject: Re: Parallelize Fetching Phase
> 
> I've got another issue.
> 
> I'm crawling a single domain and I have fetcher.queue.mode set to "byHost".
> I'm using 20 threads, so I set fetcher.threads.per.queue to 20, but I get a
> NullPointerException.
> 
> Setting it to 10 threads works fine.
> 
> Can someone explain why Nutch behaves this way?
> 
> Sorry for bothering you, and thanks very much for your help.
> 
> Matteo
> 

Re: Parallelize Fetching Phase

Posted by Matteo Simoncini <si...@gmail.com>.
I've got another issue.

I'm crawling a single domain and I have fetcher.queue.mode set to "byHost".
I'm using 20 threads, so I set fetcher.threads.per.queue to 20, but I get a
NullPointerException.

Setting it to 10 threads works fine.

Can someone explain why Nutch behaves this way?

Sorry for bothering you, and thanks very much for your help.

Matteo


RE: Parallelize Fetching Phase

Posted by Markus Jelsma <ma...@openindex.io>.
Hi
 
-----Original message-----
> From:Matteo Simoncini <si...@gmail.com>
> Sent: Tue 11-Sep-2012 14:41
> To: user@nutch.apache.org
> Subject: Parallelize Fetching Phase
> 
> Hi everyone,
> 
> I'm running Nutch 1.5.1 using a script I created, but there is
> a significant slowdown in the fetching phase.
> My script uses 20 threads to fetch. Here is the fetch instruction:
> 
> bin/nutch fetch $segment -threads 20
> 
> It works, but it seems they are all fetching the same URL. Here is the log:

Not the same URL, but the same host or domain. The fetcher uses host, domain, or IP queues. If you have only one domain or host, then setting fetcher.queue.mode is useless. Instead, you would have to increase fetcher.threads.per.queue.
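
The effect described in this reply can be illustrated with a deliberately simplified model, assuming URLs are grouped into one queue per host and each queue allows at most `threads_per_queue` fetches in flight at a time. The function name `max_busy_threads` and the model itself are hypothetical. The real fetcher also applies politeness delays and other settings that this sketch ignores.

```python
from collections import Counter
from urllib.parse import urlsplit


def max_busy_threads(urls, threads, threads_per_queue):
    """Upper bound on threads fetching at once in a simplified per-host model.

    Assumption: one queue per host name, each limited to
    `threads_per_queue` concurrent fetches.
    """
    per_host = Counter(urlsplit(url).hostname for url in urls)
    # Each host contributes at most `threads_per_queue` usable slots.
    slots = sum(min(count, threads_per_queue) for count in per_host.values())
    return min(threads, slots)


# 414 URLs on a single host, mirroring fetchQueues.totalSize=414 in the log.
urls = [f"http://www.eclap.eu/drupal/?q=en-US/node/{i}" for i in range(414)]

print(max_busy_threads(urls, threads=20, threads_per_queue=1))   # 1
print(max_busy_threads(urls, threads=20, threads_per_queue=20))  # 20
```

Under this model, with a single host and a per-queue limit of 1, only one of the 20 threads can be busy, which is consistent with the spinWaiting=19 lines in the log. Raising the per-queue limit is what lets the other threads do useful work.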


> 