Posted to user@nutch.apache.org by Matteo Simoncini <si...@gmail.com> on 2012/09/11 14:37:32 UTC
Parallelize Fetching Phase
Hi everyone,
I'm running Nutch 1.5.1 using a script I created, but there is
a significant slowdown in the fetching phase.
My script uses 20 threads to fetch. Here is the fetch instruction:
bin/nutch fetch $segment -threads 20
It works, but it seems they are all fetching the same URL. Here is the log:
fetching http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
fetching http://www.eclap.eu/drupal/?q=en-US/node/2867/og/forum&sort=asc&order=Created
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
...
Is there a way to make each thread crawl a different URL?
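The status lines in the log above can be read mechanically. The sketch below is in Python, which is not part of Nutch, and it assumes that spinWaiting counts threads that are idle because no queue would hand them a URL, so that activeThreads minus spinWaiting approximates the number of threads actually fetching. The names STATUS_RE and fetching_threads are made up for this illustration.

```python
import re

# Pattern for the fetcher status lines quoted in this thread.
STATUS_RE = re.compile(
    r"-activeThreads=(\d+), spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)


def fetching_threads(line):
    """Return activeThreads - spinWaiting for a status line, else None.

    Assumption: spinWaiting counts threads waiting for a fetchable URL.
    """
    match = STATUS_RE.search(line)
    if match is None:
        return None  # e.g. a "fetching <url>" line
    active, waiting, _queued = (int(group) for group in match.groups())
    return active - waiting


SAMPLE_LOG = [
    "fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996",
    "-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412",
    "-activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412",
]

for log_line in SAMPLE_LOG:
    busy = fetching_threads(log_line)
    if busy is not None:
        print(busy, "thread(s) fetching:", log_line)
```

For the log above this gives 1 or 0 on every status line, which suggests that at most one of the 20 threads is fetching at any moment.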
Re: Hadoop and Nutch
Posted by Walter Tietze <ti...@neofonie.de>.
Hi,
I have the same problem using Nutch 1.5.1 with
Cloudera CDH4 and YARN.
There are already entries about it in the mail archive:
http://www.mail-archive.com/user@nutch.apache.org/msg07182.html
http://www.mail-archive.com/user@nutch.apache.org/msg07184.html
I thought the problem had to do with CDH4 and YARN.
Please let me know if you find a solution!
cheers, Walter
On 12.09.2012 17:21, Stefan Scheffler wrote:
> Hey,
> Thank you for the reply.
> The error stays the same :(
> On 12.09.2012 15:10, Julien Nioche wrote:
>> Hi Stefan,
>>
>> You don't need to set HADOOP_CLASSPATH; just use the scripts provided
>> in runtime/deploy/bin
>>
>> ant job => runtime/deploy/bin => nutch crawl
>>
>> J.
>>
>> On 12 September 2012 13:41, Stefan Scheffler
>> <ss...@avantgarde-labs.de> wrote:
>>
>>> Hi,
>>> I am trying to run Nutch 2.0 on a Hadoop cluster and get the following
>>> exception. I compiled Nutch from source and start it with:
>>>
>>> HADOOP_CLASSPATH=lib/apache-nutch-1.6-SNAPSHOT.jar hadoop org.apache.nutch.crawl.Crawl urls -dir test -depth 2 -topN 5
>>>
>>> 12/09/12 14:34:20 INFO mapred.JobClient: Task Id : attempt_201208141240_0593_m_000001_2, Status : FAILED
>>> java.lang.RuntimeException: Error in configuring object
>>> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>>> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>>> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>>> at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:396)
>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
>>> at org.apache.hadoop.mapred.Child.main(Child.java:264)
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
>>>
>>> The error occurs when Hadoop starts the inject step. I have no idea where
>>> the error comes from.
>>> Does anyone have a clue?
>>>
>>> With friendly regards
>>> Stefan Scheffler
>>>
>>>
>>>
>>
>
>
--
--------------------------------
Walter Tietze
Senior Softwareengineer
Research
Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin
T +49.30 24627 318
F +49.30 24627 120
Walter.Tietze@neofonie.de
http://www.neofonie.de
Handelsregister
Berlin-Charlottenburg: HRB 67460
Geschäftsführung:
Thomas Kitlitschko
--------------------------------
Re: Hadoop and Nutch
Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hey,
Thank you for the reply.
The error stays the same :(
On 12.09.2012 15:10, Julien Nioche wrote:
> Hi Stefan,
>
> You don't need to set HADOOP_CLASSPATH; just use the scripts provided
> in runtime/deploy/bin
>
> ant job => runtime/deploy/bin => nutch crawl
>
> J.
>
> On 12 September 2012 13:41, Stefan Scheffler
> <ss...@avantgarde-labs.de> wrote:
>
>> Hi,
>> I am trying to run Nutch 2.0 on a Hadoop cluster and get the following
>> exception. I compiled Nutch from source and start it with:
>>
>> HADOOP_CLASSPATH=lib/apache-nutch-1.6-SNAPSHOT.jar hadoop org.apache.nutch.crawl.Crawl urls -dir test -depth 2 -topN 5
>>
>> 12/09/12 14:34:20 INFO mapred.JobClient: Task Id : attempt_201208141240_0593_m_000001_2, Status : FAILED
>> java.lang.RuntimeException: Error in configuring object
>> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>> at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:396)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
>> at org.apache.hadoop.mapred.Child.main(Child.java:264)
>> Caused by: java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
>>
>> The error occurs when Hadoop starts the inject step. I have no idea where
>> the error comes from.
>> Does anyone have a clue?
>>
>> With friendly regards
>> Stefan Scheffler
>>
>>
>>
>
--
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheffler@avantgarde-labs.de
Re: Hadoop and Nutch
Posted by Julien Nioche <li...@gmail.com>.
Hi Stefan,
You don't need to set HADOOP_CLASSPATH; just use the scripts provided
in runtime/deploy/bin
ant job => runtime/deploy/bin => nutch crawl
J.
On 12 September 2012 13:41, Stefan Scheffler
<ss...@avantgarde-labs.de> wrote:
> Hi,
> I am trying to run Nutch 2.0 on a Hadoop cluster and get the following
> exception. I compiled Nutch from source and start it with:
>
> HADOOP_CLASSPATH=lib/apache-nutch-1.6-SNAPSHOT.jar hadoop org.apache.nutch.crawl.Crawl urls -dir test -depth 2 -topN 5
>
> 12/09/12 14:34:20 INFO mapred.JobClient: Task Id : attempt_201208141240_0593_m_000001_2, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
> at org.apache.hadoop.mapred.Child.main(Child.java:264)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
>
> The error occurs when Hadoop starts the inject step. I have no idea where
> the error comes from.
> Does anyone have a clue?
>
> With friendly regards
> Stefan Scheffler
>
>
>
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Hadoop and Nutch
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
On Wed, Sep 12, 2012 at 1:41 PM, Stefan Scheffler
<ss...@avantgarde-labs.de> wrote:
> I am trying to run Nutch 2.0 on a Hadoop cluster and get the following exception.
>
> HADOOP_CLASSPATH=lib/apache-nutch-1.6-SNAPSHOT.jar hadoop
> org.apache.nutch.crawl.Crawl urls -dir test -depth 2 -topN 5
The Nutch versions are not consistent: you mention Nutch 2.0, but the command uses the 1.6-SNAPSHOT jar. Please check and get back to us.
Lewis
Hadoop and Nutch
Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hi,
I am trying to run Nutch 2.0 on a Hadoop cluster and get the following
exception. I compiled Nutch from source and start it with:
HADOOP_CLASSPATH=lib/apache-nutch-1.6-SNAPSHOT.jar hadoop org.apache.nutch.crawl.Crawl urls -dir test -depth 2 -topN 5
12/09/12 14:34:20 INFO mapred.JobClient: Task Id : attempt_201208141240_0593_m_000001_2, Status : FAILED
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
The error occurs when Hadoop starts the inject step. I have no idea
where the error comes from.
Does anyone have a clue?
With friendly regards
Stefan Scheffler
RE: Parallelize Fetching Phase
Posted by Markus Jelsma <ma...@openindex.io>.
Please share the relevant part of the log.
-----Original message-----
> From:Matteo Simoncini <si...@gmail.com>
> Sent: Wed 12-Sep-2012 11:23
> To: user@nutch.apache.org
> Subject: Re: Parallelize Fetching Phase
>
> I've got another issue.
>
> I'm crawling a single domain and I have fetcher.queue.mode set to "byHost".
> I'm using 20 threads, so I set fetcher.threads.per.queue to 20, but I get a
> NullPointerException.
>
> Setting it to 10 threads works fine.
>
> Can someone explain why Nutch behaves this way?
>
> Sorry for bothering you, and thanks very much for your help.
>
> Matteo
>
> 2012/9/11 Markus Jelsma <ma...@openindex.io>
>
> > Hi
> >
> > -----Original message-----
> > > From:Matteo Simoncini <si...@gmail.com>
> > > Sent: Tue 11-Sep-2012 14:41
> > > To: user@nutch.apache.org
> > > Subject: Parallelize Fetching Phase
> > >
> > > Hi everyone,
> > >
> > > I'm running Nutch 1.5.1 using a script I created, but there is
> > > a significant slowdown in the fetching phase.
> > > My script uses 20 threads to fetch. Here is the fetch instruction:
> > >
> > > bin/nutch fetch $segment -threads 20
> > >
> > > It works, but it seems they are all fetching the same URL. Here is the log:
> >
> > Not the same URL, but the same host or domain. The fetcher uses either host, domain
> > or IP queues. If you have only one domain or host, then setting
> > fetcher.queue.mode is useless. Instead, you would have to increase
> > fetcher.threads.per.queue.
> >
> >
> > >
> > > fetching http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > fetching http://www.eclap.eu/drupal/?q=en-US/node/2867/og/forum&sort=asc&order=Created
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > > ...
> > >
> > > Is there a way to make each thread crawl a different URL?
> > >
> >
>
Re: Parallelize Fetching Phase
Posted by Matteo Simoncini <si...@gmail.com>.
I've got another issue.
I'm crawling a single domain and I have fetcher.queue.mode set to "byHost".
I'm using 20 threads, so I set fetcher.threads.per.queue to 20, but I get a
NullPointerException.
Setting it to 10 threads works fine.
Can someone explain why Nutch behaves this way?
Sorry for bothering you, and thanks very much for your help.
Matteo
2012/9/11 Markus Jelsma <ma...@openindex.io>
> Hi
>
> -----Original message-----
> > From:Matteo Simoncini <si...@gmail.com>
> > Sent: Tue 11-Sep-2012 14:41
> > To: user@nutch.apache.org
> > Subject: Parallelize Fetching Phase
> >
> > Hi everyone,
> >
> > I'm running Nutch 1.5.1 using a script I created, but there is
> > a significant slowdown in the fetching phase.
> > My script uses 20 threads to fetch. Here is the fetch instruction:
> >
> > bin/nutch fetch $segment -threads 20
> >
> > It works, but it seems they are all fetching the same URL. Here is the log:
>
> Not the same URL, but the same host or domain. The fetcher uses either host, domain
> or IP queues. If you have only one domain or host, then setting
> fetcher.queue.mode is useless. Instead, you would have to increase
> fetcher.threads.per.queue.
>
>
> >
> > fetching http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > fetching http://www.eclap.eu/drupal/?q=en-US/node/2867/og/forum&sort=asc&order=Created
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996
> > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > ...
> >
> > Is there a way to make each thread crawl a different URL?
> >
>
RE: Parallelize Fetching Phase
Posted by Markus Jelsma <ma...@openindex.io>.
Hi
-----Original message-----
> From:Matteo Simoncini <si...@gmail.com>
> Sent: Tue 11-Sep-2012 14:41
> To: user@nutch.apache.org
> Subject: Parallelize Fetching Phase
>
> Hi everyone,
>
> I'm running Nutch 1.5.1 using a script I created, but there is
> a significant slowdown in the fetching phase.
> My script uses 20 threads to fetch. Here is the fetch instruction:
>
> bin/nutch fetch $segment -threads 20
>
> It works, but it seems they are all fetching the same URL. Here is the log:
Not the same URL, but the same host or domain. The fetcher uses either host, domain or IP queues. If you have only one domain or host, then setting fetcher.queue.mode is useless. Instead, you would have to increase fetcher.threads.per.queue.
>
> fetching http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> fetching http://www.eclap.eu/drupal/?q=en-US/node/2867/og/forum&sort=asc&order=Created
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> ...
>
> Is there a way to make each thread crawl a different URL?
>
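The advice in this reply can be illustrated with a toy model of per-host queueing. This is a simplified Python sketch, not Nutch's fetcher code: it assumes URLs are grouped into one queue per host (as in byHost mode), that fetcher.threads.per.queue caps how many threads may work on one queue at once, and it ignores crawl delays entirely. The function name max_parallel_fetches is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def max_parallel_fetches(urls, threads, threads_per_queue):
    """Upper bound on simultaneous fetches in a simplified per-host model.

    Assumptions (not taken from Nutch's source): one queue per host,
    at most `threads_per_queue` threads per queue, no politeness delay.
    """
    queue_sizes = Counter(urlsplit(url).hostname for url in urls)
    allowed = sum(min(size, threads_per_queue) for size in queue_sizes.values())
    return min(threads, allowed)


# URLs taken from the log quoted in this thread; all share one host.
URLS = [
    "http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic",
    "http://www.eclap.eu/drupal/?q=en-US/node/2867/og/forum&sort=asc&order=Created",
    "http://www.eclap.eu/drupal/?q=zh-hans/node/103996",
]

print(max_parallel_fetches(URLS, threads=20, threads_per_queue=1))
print(max_parallel_fetches(URLS, threads=20, threads_per_queue=10))
```

With a single host and a per-queue limit of 1 (assumed here to be the default), the model allows only one fetch at a time regardless of -threads 20, which matches the spinWaiting=19 and spinWaiting=20 lines in the log. Raising the per-queue limit raises that ceiling, at the cost of putting more load on the one host being crawled.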