Posted to user@nutch.apache.org by peterbarretto <pe...@gmail.com> on 2013/01/27 11:50:07 UTC

increase the number of fetches at agiven time on nutch 1.6 or 2.1

I want to increase the number of URLs fetched at a time in Nutch. I have
around 10 websites to crawl, so how can I crawl all the sites at the same time?
Right now I am fetching 1 site with a fetch delay of 2 seconds, but it is too
slow. How can I fetch concurrently from different domains?



--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Markus Jelsma <ma...@openindex.io>.
Try setting -numFetchers N on the generator. 
 
-----Original message-----
> From:Sourajit Basak <so...@gmail.com>
> Sent: Mon 28-Jan-2013 11:57
> To: user@nutch.apache.org
> Subject: Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1
> 
> A higher number of per host threads, etc might not be useful if the
> bandwidth doesn't scale out. I have a different observation though.
> 
> We run nutch on a hadoop cluster. Even as we added new machines to the
> cluster, the fetch phase only creates two tasks. (the original number of
> nodes when we started) Why is it so ? I have checked that the tasks do get
> spawned in the newly added nodes.
> We have this setting in hadoop mapred-site.xml
>  <property>
>    <name>mapred.tasktracker.map.tasks.maximum</name>
>    <value>20</value>
>  </property>
> 
> We have planned to double the number of websites and see if it still
> doesn't spawn tasks on each node. I will keep this forum updated with out
> results. In the meantime, can anyone point out if we have missed any
> particular configuration ?
> 
> Thanks,
> Sourajit
> 
> 
> 
> On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <te...@gmail.com>wrote:
> 
> > Hey Peter,
> >
> > I am guessing that you have just increased the global thread count. Have
> > you even increased "fetcher.threads.per.host" ? This will improve the crawl
> > rate as multiple threads can attack the same site. Dont make it too high or
> > else the system will get overloaded. The nutch wiki has an article [0]
> > about the potential reasons for slow crawls and some good suggestions.
> >
> > [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> >
> > Thanks,
> > Tejas Patil
> >
> >
> > On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <peterbarretto08@gmail.com
> > >wrote:
> >
> > > I tried increasing the numbers of threads to 50 but the speed is not
> > > affected
> > >
> > >
> > > I tried changing the partition.url.mode value to byDomain and
> > > fetcher.queue.mode to byDomain but still it does not help the speed.
> > > It seems to get urls from 2 domains now and the other domains are not
> > > getting crawled. Is this due to the url score? if so how do i crawl urls
> > > from all the domains?
> > >
> > >
> > > lewis john mcgibbney wrote
> > > > Increase number of threads when fetching
> > > > > Also please see nutch-default.xml for partitioning of URLs; if you know your
> > > > > target domains you may wish to adapt the policy.
> > > > Lewis
> > > >
> > > > > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> > > > >> I want to increase the number of urls fetched at a time in nutch. I have
> > > > >> around 10 websites to crawl. so how can i crawl all the sites at a time ?
> > > > >> right now i am fetching 1 site with a fetch delay of 2 second but it is too
> > > > >> slow. How to concurrently fetch from different domain?
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> View this message in context:
> > > >
> > >
> > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > >>
> > > >
> > > > --
> > > > *Lewis*
> > >
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > >
> > http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> >
> 

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Tejas Patil <te...@gmail.com>.
Hi Sourajit,

I strongly feel that having more hosts / sites will not affect the number of
part files formed. The number of part files created is bounded by (number of
nodes) * (max number of reducers per node). That is the maximum; the actual
value is always less than that, as the cluster won't allocate all the reducer
slots on all the nodes to one particular job.
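
To make the bound above concrete, here is a small Python sketch of the arithmetic. The function name and the sample cluster size are illustrative assumptions, not values from this thread, and the sketch only encodes the bound as stated; it does not model a reducer count that is set explicitly (for example through numFetchers).

```python
def max_part_files(nodes: int, reducer_slots_per_node: int) -> int:
    """Bound described above: one part file per reducer slot in the cluster."""
    if nodes < 0 or reducer_slots_per_node < 0:
        raise ValueError("counts must be non-negative")
    return nodes * reducer_slots_per_node


# Hypothetical cluster: 4 nodes with 2 reducer slots each.
print(max_part_files(4, 2))
```

With these assumed numbers the bound is 8 part files, no matter how many sites are in the crawl.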

Thanks,
Tejas Patil


On Mon, Jan 28, 2013 at 4:24 AM, Sourajit Basak <so...@gmail.com>wrote:

> I anticipated.
>
> If the #of sites crawled increases, have you seen if nutch generates more
> part files than the number of nodes ? Maybe we will wait till we see the
> results from doubling sites before forcing a non-default behavior.
>
> Thanks,
> Sourajit

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Sourajit Basak <so...@gmail.com>.
I anticipated that.

If the number of sites crawled increases, have you seen whether Nutch generates
more part files than the number of nodes? Maybe we will wait until we see the
results from doubling the sites before forcing a non-default behavior.

Thanks,
Sourajit

On Mon, Jan 28, 2013 at 5:47 PM, Tejas Patil <te...@gmail.com>wrote:

> Hey Sourajit,
>
> I don't think that it can be passed with the crawl command. You will have
> to use individual commands for that.
> I personally felt messy while running a full crawl with all these bunch of
> commands, so I had created a script to automate things.
>
> Thanks,
> Tejas Patil

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Tejas Patil <te...@gmail.com>.
Hey Sourajit,

I don't think that it can be passed with the crawl command. You will have
to use the individual commands for that.
I personally found it messy to run a full crawl with this whole bunch of
commands, so I created a script to automate things.
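
For readers who want a starting point, here is a minimal Python sketch of what such a script might look like for one crawl round on Nutch 1.x. It is not the script mentioned above. The paths, the segment name, and the function name are hypothetical, and the sketch only builds and prints the commands (a dry run) rather than executing them. In a real run the segment directory is created by the generate step, so its name would have to be looked up afterwards.

```python
from typing import List


def crawl_round(crawldb: str, segments_dir: str, segment: str,
                num_fetchers: int, threads: int) -> List[List[str]]:
    """Return the Nutch 1.x commands for one generate/fetch/parse/update round."""
    nutch = "bin/nutch"  # assumed to be run from the Nutch runtime directory
    return [
        # numFetchers sets how many part files (fetch lists) generate writes.
        [nutch, "generate", crawldb, segments_dir,
         "-numFetchers", str(num_fetchers)],
        [nutch, "fetch", segment, "-threads", str(threads)],
        [nutch, "parse", segment],
        [nutch, "updatedb", crawldb, segment],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # The segment name below is a made-up placeholder.
    for cmd in crawl_round("crawl/crawldb", "crawl/segments",
                           "crawl/segments/20130128000000", 4, 50):
        print(" ".join(cmd))
```

To actually execute the steps, each list could be handed to `subprocess.run(cmd, check=True)` so that the round stops at the first failing command.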

Thanks,
Tejas Patil


On Mon, Jan 28, 2013 at 3:46 AM, Sourajit Basak <so...@gmail.com>wrote:

> I will try this out.
> How do I pass this parameter if we are doing a one step crawl ?

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Sourajit Basak <so...@gmail.com>.
I will try this out.
How do I pass this parameter if we are doing a one-step crawl?

On Mon, Jan 28, 2013 at 4:28 PM, Tejas Patil <te...@gmail.com>wrote:

> Hey Sourajit,
>
> I had seen such thing when running crawls over hadoop cluster. After some
> experiments, I came to following conclusion:
> The number of mappers spawned is governed by the no of part files created
> by the generator (and not the #nodes in the cluster). And this is nothing
> but the reducers for the last job in the generate phase. There is a param
> passed to generate named numFetchers to control its #reducers.
>
> Thanks,
> Tejas Patil

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Tejas Patil <te...@gmail.com>.
Hey Sourajit,

I have seen the same thing when running crawls on a Hadoop cluster. After some
experiments, I came to the following conclusion:
The number of mappers spawned is governed by the number of part files created
by the generator (and not the number of nodes in the cluster). That is simply
the number of reducers of the last job in the generate phase. There is a
parameter passed to generate, named numFetchers, that controls its number of
reducers.
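
As an illustration of the point above, here is a small Python sketch that builds a Nutch 1.x generate command line with numFetchers set. The helper name and the crawl paths are hypothetical; `-topN` and `-numFetchers` are the option names I know from the Nutch 1.x Generator usage message, so please check them against your own version.

```python
from typing import List, Optional


def build_generate_cmd(crawldb: str, segments_dir: str, num_fetchers: int,
                       top_n: Optional[int] = None) -> List[str]:
    """Build the argument list for the Nutch 1.x generate step."""
    if num_fetchers < 1:
        raise ValueError("num_fetchers must be at least 1")
    cmd = ["bin/nutch", "generate", crawldb, segments_dir]
    if top_n is not None:
        cmd += ["-topN", str(top_n)]
    # One part file per reducer of the last generate job; the fetch job
    # then gets one map task per part file.
    cmd += ["-numFetchers", str(num_fetchers)]
    return cmd


# Hypothetical paths; asks for 4 fetch lists, hence 4 fetch map tasks.
print(" ".join(build_generate_cmd("crawl/crawldb", "crawl/segments", 4)))
```

Note that, as far as I know, the generator falls back to a single partition when Hadoop runs in local mode, so the setting only shows an effect on a real cluster.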

Thanks,
Tejas Patil


On Mon, Jan 28, 2013 at 2:49 AM, Sourajit Basak <so...@gmail.com>wrote:

> A higher number of per host threads, etc might not be useful if the
> bandwidth doesn't scale out. I have a different observation though.
>
> We run nutch on a hadoop cluster. Even as we added new machines to the
> cluster, the fetch phase only creates two tasks. (the original number of
> nodes when we started) Why is it so ? I have checked that the tasks do get
> spawned in the newly added nodes.
> We have this setting in hadoop mapred-site.xml
>  <property>
>    <name>mapred.tasktracker.map.tasks.maximum</name>
>    <value>20</value>
>  </property>
>
> We have planned to double the number of websites and see if it still
> doesn't spawn tasks on each node. I will keep this forum updated with out
> results. In the meantime, can anyone point out if we have missed any
> particular configuration ?
>
> Thanks,
> Sourajit

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Sourajit Basak <so...@gmail.com>.
A higher number of per-host threads, etc. might not be useful if the
bandwidth doesn't scale out. I have a different observation, though.

We run Nutch on a Hadoop cluster. Even as we added new machines to the
cluster, the fetch phase only creates two tasks (the original number of
nodes when we started). Why is that? I have checked that tasks do get
spawned on the newly added nodes.
We have this setting in Hadoop's mapred-site.xml:
 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>20</value>
 </property>

We plan to double the number of websites and see whether it still
doesn't spawn tasks on each node. I will keep this forum updated with our
results. In the meantime, can anyone point out whether we have missed any
particular configuration?

Thanks,
Sourajit



On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil <te...@gmail.com>wrote:

> Hey Peter,
>
> I am guessing that you have just increased the global thread count. Have
> you even increased "fetcher.threads.per.host" ? This will improve the crawl
> rate as multiple threads can attack the same site. Dont make it too high or
> else the system will get overloaded. The nutch wiki has an article [0]
> about the potential reasons for slow crawls and some good suggestions.
>
> [0] : https://wiki.apache.org/nutch/OptimizingCrawls
>
> Thanks,
> Tejas Patil

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by peterbarretto <pe...@gmail.com>.
Hi Lewis,

>You are not getting very many URLs!
Should I increase fetcher.server.delay from 2 to 5 seconds?
I did not understand what you meant by that.

I want a roughly equal number of URLs from every domain in the fetch list,
so that I can fetch more URLs at a time.
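
One way to get a roughly even fetch list, sketched under stated assumptions:
generate.max.count caps how many URLs a single host or domain may contribute
to one fetch list, and generate.count.mode selects whether that cap is
counted per host or per domain. A value of -1 disables the cap, so a positive
value is needed for balancing. The numbers below assume a topN of 500 spread
over 10 sites (500 / 10 = 50); both figures are illustrative.

```xml
<!-- nutch-site.xml: cap each domain's share of a generated fetch list.
     Assumption: topN = 500 and 10 sites, hence 50 URLs per domain. -->
<property>
  <name>generate.max.count</name>
  <value>50</value>
</property>

<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>
```

If the cap is larger than topN itself, it never takes effect and the
highest-scoring domains can still fill the whole list.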





Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Tejas Patil <te...@gmail.com>.
Hey Peter,
My hardware was a cluster of high-end production machines (RAM and CPU
specs were 100 times better than a normal desktop PC). I think if you
procure EC2 instances of at least type "medium", you can expect better
performance.

I have no idea which is faster, Nutch 2.1 or 1.6. I want to know
that too :) Can anyone from @dev or @user comment on that?

Thanks,
Tejas Patil


On Thu, Jan 31, 2013 at 12:09 AM, peterbarretto
<pe...@gmail.com>wrote:

> Hi Tejas,
>
> I am currently running Nutch 1.6 on Windows 7, Pentium dual core 2.8 GHz, 2
> GB RAM.
> I will be using Amazon EC2 servers later for crawling.
>
> What was your hardware when you ran 4 million URLs with 80 GB of data?
>
> Will Nutch 2.1 give a faster crawl speed than 1.6?

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by peterbarretto <pe...@gmail.com>.
Hi Tejas,

I am currently running Nutch 1.6 on Windows 7, Pentium dual core 2.8 GHz, 2
GB RAM.
I will be using Amazon EC2 servers later for crawling.

What was your hardware when you ran 4 million URLs with 80 GB of data?

Will Nutch 2.1 give a faster crawl speed than 1.6?


Tejas Patil wrote
> I have run crawls with topN as large as 4 million with a crawldb of
> ~80 GB. They worked fine without any such issue.
> Maybe the hardware / cluster you have cannot handle a topN above
> 500. Note that if topN is low, then no matter how many fetcher threads you
> create, you won't be able to increase the number of URLs fetched per round.
> Also, as a considerable amount of time is spent in the generate and update
> phases, the overall crawl rate will be low. If you are planning to use the
> same machine, you will have to work with lower values (and thus expect a
> lower crawl rate).
> 
> thanks,
> Tejas Patil

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Tejas Patil <te...@gmail.com>.
I have run crawls with topN as large as 4 million with a crawldb of
~80 GB. They worked fine without any such issue.
Maybe the hardware / cluster you have cannot handle a topN above
500. Note that if topN is low, then no matter how many fetcher threads you
create, you won't be able to increase the number of URLs fetched per round.
Also, as a considerable amount of time is spent in the generate and update
phases, the overall crawl rate will be low. If you are planning to use the
same machine, you will have to work with lower values (and thus expect a
lower crawl rate).

thanks,
Tejas Patil


On Wed, Jan 30, 2013 at 8:06 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> You are not getting very many URLs!

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Lewis John Mcgibbney <le...@gmail.com>.
You are not getting very many URLs!

On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto <pe...@gmail.com>wrote:

>
> 2013-01-29 08:44:35,014 INFO  crawl.CrawlDbReader - TOTAL urls: 96404
>
> 2013-01-29 08:44:35,018 INFO  crawl.CrawlDbReader - status 1 (db_unfetched): 85672
>

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by peterbarretto <pe...@gmail.com>.
Hi Tejas,

If I use a larger topN value such as 1000, I get a "job failed" error at the
end of fetching.
500 seems to be the optimal value; fetching completes with this value
without any issue.

I am using Nutch 1.6 right now; I will also install 2.1 once I have
installed HBase on my Windows machine.

Below is some of the content of the log file:

2013-01-29 08:44:21,902 INFO  crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb
2013-01-29 08:44:25,338 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-01-29 08:44:35,014 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb
2013-01-29 08:44:35,014 INFO  crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:44:35,016 INFO  crawl.CrawlDbReader - retry 0: 96030
2013-01-29 08:44:35,016 INFO  crawl.CrawlDbReader - retry 1: 293
2013-01-29 08:44:35,016 INFO  crawl.CrawlDbReader - retry 2: 80
2013-01-29 08:44:35,016 INFO  crawl.CrawlDbReader - retry 3: 1
2013-01-29 08:44:35,017 INFO  crawl.CrawlDbReader - min score: 0.0
2013-01-29 08:44:35,017 INFO  crawl.CrawlDbReader - avg score: 2.8775778E-4
2013-01-29 08:44:35,017 INFO  crawl.CrawlDbReader - max score: 3.071
2013-01-29 08:44:35,018 INFO  crawl.CrawlDbReader - status 1 (db_unfetched): 85672
2013-01-29 08:44:35,018 INFO  crawl.CrawlDbReader - status 2 (db_fetched): 7598
2013-01-29 08:44:35,019 INFO  crawl.CrawlDbReader - status 3 (db_gone): 17
2013-01-29 08:44:35,020 INFO  crawl.CrawlDbReader - status 4 (db_redir_temp): 449
2013-01-29 08:44:35,021 INFO  crawl.CrawlDbReader - status 5 (db_redir_perm): 1115
2013-01-29 08:44:35,024 INFO  crawl.CrawlDbReader - status 6 (db_notmodified): 1553
2013-01-29 08:44:35,055 INFO  crawl.CrawlDbReader - CrawlDb statistics: done
2013-01-29 08:48:09,474 INFO  crawl.Generator - Generator: starting at 2013-01-29 08:48:09
2013-01-29 08:48:09,475 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2013-01-29 08:48:09,475 INFO  crawl.Generator - Generator: filtering: true
2013-01-29 08:48:09,476 INFO  crawl.Generator - Generator: normalizing: true
2013-01-29 08:48:09,476 INFO  crawl.Generator - Generator: topN: 50
2013-01-29 08:48:09,478 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2013-01-29 08:48:10,646 INFO  plugin.PluginRepository - Plugins: looking in: C:\apache-nutch-1.6\plugins
2013-01-29 08:48:11,273 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - Registered Plugins:
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	the nutch core extension points (nutch-extensionpoints)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Regex URL Normalizer (urlnormalizer-regex)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	CyberNeko HTML Parser (lib-nekohtml)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	OPIC Scoring Plug-in (scoring-opic)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Basic URL Normalizer (urlnormalizer-basic)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Tika Parser Plug-in (parse-tika)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Basic Indexing Filter (index-basic)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Html Parse Plug-in (parse-html)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Anchor Indexing Filter (index-anchor)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	HTTP Framework (lib-http)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Regex URL Filter (urlfilter-regex)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Regex URL Filter Framework (lib-regex-filter)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Pass-through URL Normalizer (urlnormalizer-pass)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Http Protocol Plug-in (protocol-http)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - Registered Extension-Points:
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-01-29 08:48:11,275 INFO  plugin.PluginRepository - 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-01-29 08:48:11,275 INFO  plugin.PluginRepository - 	HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-01-29 08:48:11,275 INFO  plugin.PluginRepository - 	Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-01-29 08:48:11,275 INFO  plugin.PluginRepository - 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-01-29 08:48:11,502 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-29 08:48:11,502 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-01-29 08:48:11,502 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-29 08:48:26,968 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2013-01-29 08:49:00,769 INFO  crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb
2013-01-29 08:49:01,292 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-01-29 08:49:04,221 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-29 08:49:04,221 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-01-29 08:49:04,221 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-29 08:49:04,223 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-01-29 08:49:05,395 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
2013-01-29 08:49:05,594 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - retry 0: 96030
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - retry 1: 293
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - retry 2: 80
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - retry 3: 1
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - min score: 0.0
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - avg score: 2.8775778E-4
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - max score: 3.071
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - status 1 (db_unfetched): 85672
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - status 2 (db_fetched): 7598
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - status 3 (db_gone): 17
2013-01-29 08:49:05,597 INFO  crawl.CrawlDbReader - status 4 (db_redir_temp): 449
2013-01-29 08:49:05,601 INFO  crawl.CrawlDbReader - status 5 (db_redir_perm): 1115
2013-01-29 08:49:05,604 INFO  crawl.CrawlDbReader - status 6 (db_notmodified): 1553
2013-01-29 08:49:05,622 INFO  crawl.CrawlDbReader - CrawlDb statistics: done
2013-01-29 08:49:06,396 INFO  crawl.Generator - Generator: segment: crawl/segments/20130129084906
2013-01-29 08:49:07,350 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2013-01-29 08:49:08,163 INFO  crawl.Generator - Generator: finished at 2013-01-29 08:49:08, elapsed: 00:00:58
2013-01-29 08:49:32,971 INFO  fetcher.Fetcher - Fetcher: starting at 2013-01-29 08:49:32
2013-01-29 08:49:32,972 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20130129084906
2013-01-29 08:49:34,341 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,341 INFO  fetcher.Fetcher - Fetcher: threads: 10
2013-01-29 08:49:34,342 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
2013-01-29 08:49:34,357 INFO  plugin.PluginRepository - Plugins: looking in: C:\apache-nutch-1.6\plugins
2013-01-29 08:49:34,361 INFO  fetcher.Fetcher - QueueFeeder finished: total 50 records + hit by time limit :0
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - Registered Plugins:
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	the nutch core extension points (nutch-extensionpoints)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Regex URL Normalizer (urlnormalizer-regex)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	CyberNeko HTML Parser (lib-nekohtml)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	OPIC Scoring Plug-in (scoring-opic)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Basic URL Normalizer (urlnormalizer-basic)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Tika Parser Plug-in (parse-tika)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Basic Indexing Filter (index-basic)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Html Parse Plug-in (parse-html)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Anchor Indexing Filter (index-anchor)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	HTTP Framework (lib-http)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Regex URL Filter (urlfilter-regex)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Regex URL Filter Framework (lib-regex-filter)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Pass-through URL Normalizer (urlnormalizer-pass)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Http Protocol Plug-in (protocol-http)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - Registered Extension-Points:
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository - 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository - 	Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository - 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository - 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository - 	HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository - 	Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository - 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-01-29 08:49:34,546 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO  fetcher.Fetcher - fetching http://www.example.com
2013-01-29 08:49:34,549 INFO  fetcher.Fetcher - Using queue mode : byHost
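
Two values in the log above are worth checking against the intended
settings. "Fetcher: threads: 10" is the stock default for
fetcher.threads.fetch, so the change to 50 threads mentioned earlier in the
thread does not appear to have been in effect for this run, and
"Generator: topN: 50" means each round only hands 50 URLs to the fetcher. A
minimal sketch of the thread override, assuming it goes in the
conf/nutch-site.xml of the Nutch 1.6 install that is actually being run (the
value 50 comes from the thread and is not a recommendation):

```xml
<!-- nutch-site.xml: total number of fetcher threads for a fetch job.
     Assumption: this file is on the classpath of the running install;
     50 is the value tried earlier in the thread. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
```

If the override is picked up, the "Fetcher: threads:" log line should report
the new value on the next run.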



Tejas Patil wrote
> Hey Peter,
> 
> Give a bigger value for topN parameter. Also, use:
> <property>
>   <name>generate.max.count</name>
>   <value>-1</value>
> </property>
> 
> <property>
>   <name>generate.count.mode</name>
>   <value>domain</value>
> </property>
> Not sure why you see queue mode as byhost and not by domain. Did it print
> that in the logs ?
> I should have asked you this before : Are you using nutch 1.X or 2.x ?
> 
> thanks,
> Tejas Patil

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Tejas Patil <te...@gmail.com>.
Hey Peter,

Give a bigger value for the topN parameter. Also, use:

<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>

<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>

Not sure why you see the queue mode as byHost and not byDomain. Did it print
that in the logs?
I should have asked you this before: are you using Nutch 1.x or 2.x?

thanks,
Tejas Patil


On Tue, Jan 29, 2013 at 12:08 AM, peterbarretto
<pe...@gmail.com>wrote:

> Hi Tejas,
>
> I changed the generate.count.mode to domain and generate.max.count to 100
> but still it shows queue mode as byhost and not by domain.

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by peterbarretto <pe...@gmail.com>.
Hi Tejas,

I changed generate.count.mode to domain and generate.max.count to 100,
but the log still shows the queue mode as byHost and not byDomain.
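
A likely reason, stated with some caution: generate.count.mode only controls
how the generator counts URLs against generate.max.count. The
"Using queue mode" line printed by the fetcher reflects a different property,
fetcher.queue.mode, and the partitioning of fetch lists is governed by
partition.url.mode. A sketch of both overrides for nutch-site.xml, assuming
Nutch 1.x property names and the value spelling used in the log (byHost,
byDomain, byIP):

```xml
<!-- nutch-site.xml: group fetch queues and fetch-list partitions by domain.
     Assumption: Nutch 1.x property names; values are taken to be
     case-sensitive, matching the "byHost" spelling seen in the log. -->
<property>
  <name>fetcher.queue.mode</name>
  <value>byDomain</value>
</property>

<property>
  <name>partition.url.mode</name>
  <value>byDomain</value>
</property>
```

If the log keeps printing byHost after this change, the edited file is
probably not the one the running job loads.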



peterbarretto wrote
> Hi Tejas
> 
> The fetcher.threads.per.host property has been deprecated and replaced
> with fetcher.threads.per.queue.
> I am not sure if fetcher.threads.per.queue will help the fetching, as the
> generator only generates the fetch list from 2 or 3 domains. How can I tell
> the generator to create a fetch list with an equal number of URLs from all
> domains?
> 
> I am sure there are URLs from the other domains, but I guess that since
> their URL scores are lower it fetches from only 2 domains.
> 
> I will try increasing fetcher.threads.per.queue to 5, see if the fetch
> speed increases, and let you know.

Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by peterbarretto <pe...@gmail.com>.
Hi Tejas

The fetcher.threads.per.host property has been deprecated and replaced with
fetcher.threads.per.queue.
I am not sure if fetcher.threads.per.queue will help the fetching, as the
generator only generates the fetch list from 2 or 3 domains. How can I tell
the generator to create a fetch list with an equal number of URLs from all
domains?

I am sure there are URLs from the other domains, but I guess that since their
URL scores are lower it fetches from only 2 domains.

I will try increasing fetcher.threads.per.queue to 5, see if the fetch
speed increases, and let you know.
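
For that experiment, a minimal sketch of the override, assuming Nutch 1.x
property names. One caveat: once fetcher.threads.per.queue is greater than 1,
the fetcher no longer applies fetcher.server.delay between requests to the
same queue and uses fetcher.server.min.delay instead, which defaults to 0.
Setting a non-zero minimum keeps the crawl polite; the 1.0 second figure
below is an illustrative choice, not a tested value.

```xml
<!-- nutch-site.xml: allow several threads per host/domain queue.
     Assumption: 5 is the value proposed in the message above;
     1.0 second is an illustrative politeness floor. -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>5</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>1.0</value>
</property>
```

More threads per queue only help when the fetch list holds enough URLs for
that queue, so the gain depends on topN and on how the list is balanced.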


Tejas Patil wrote
> Hey Peter,
> 
> I am guessing that you have only increased the global thread count. Have
> you also increased "fetcher.threads.per.host"? This will improve the crawl
> rate as multiple threads can attack the same site. Don't make it too high
> or else the system will get overloaded. The Nutch wiki has an article [0]
> about the potential reasons for slow crawls and some good suggestions.
> 
> [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> 
> Thanks,
> Tejas Patil
> 
> 
> On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto &lt;

> peterbarretto08@

> &gt;wrote:
> 
>> I tried increasing the numbers of threads to 50 but the speed is not
>> affected
>>
>>
>> I tried changing the partition.url.mode value to byDomain and
>> fetcher.queue.mode to byDomain but still it does not help the speed.
>> It seems to get urls from 2 domains now and the other domains are not
>> getting crawled. Is this due to the url score? if so how do i crawl urls
>> from all the domains?
>>
>>
>> lewis john mcgibbney wrote
>> > Increase number of threads when fetching
>> > Also please see nutch-deault.xml for paritioning of urls, if you know
>> your
>> > target domains you may wish to adapt the policy.
>> > Lewis
>> >
>> > On Sunday, January 27, 2013, peterbarretto &lt;
>>
>> > peterbarretto08@
>>
>> > &gt;
>> > wrote:
>> >> I want to increase the number of urls fetched at a time in nutch. I
>> have
>> >> around 10 websites to crawl. so how can i crawl all the sites at a
>> time
>> ?
>> >> right now i am fetching 1 site with a fetch delay of 2 second but it
>> is
>> > too
>> >> slow. How to concurrently fetch from different domain?
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >
>> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >>
>> >
>> > --
>> > *Lewis*
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>






Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Tejas Patil <te...@gmail.com>.
Hey Peter,

I am guessing that you have only increased the global thread count. Have
you also increased "fetcher.threads.per.host"? This will improve the crawl
rate, as multiple threads can fetch from the same site. Don't set it too high,
or else the system will get overloaded. The Nutch wiki has an article [0]
about the potential reasons for slow crawls and some good suggestions.

[0] : https://wiki.apache.org/nutch/OptimizingCrawls
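
As a rough illustration of the per-site thread setting, a conf/nutch-site.xml override might look like the sketch below. It uses fetcher.threads.per.queue, which is the name this property carries in recent releases. The fetcher.server.min.delay property is an assumption added here: as far as I know it takes over from fetcher.server.delay once more than one thread per queue is allowed, so check nutch-default.xml for your version. Both values are illustrative only:

```xml
<configuration>
  <!-- Illustrative value: allow 2 threads to fetch from the same queue (site) at once -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
  <!-- Assumption: minimum delay in seconds between requests to the same server,
       used when fetcher.threads.per.queue is greater than 1 -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>1.0</value>
  </property>
</configuration>
```

Keeping some minimum delay is one way to raise per-site parallelism without flooding a single server.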

Thanks,
Tejas Patil



Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by peterbarretto <pe...@gmail.com>.
I tried increasing the number of threads to 50, but the speed is not affected.

I tried changing the partition.url.mode value to byDomain and
fetcher.queue.mode to byDomain, but it still does not help the speed.
It seems to get urls from only 2 domains now, and the other domains are not
getting crawled. Is this due to the url score? If so, how do I crawl urls
from all the domains?
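
For reference, the two mode settings described above would look roughly like this in conf/nutch-site.xml. This is only a sketch of the configuration as described in this message, assuming the overrides go in nutch-site.xml rather than being edited directly in nutch-default.xml:

```xml
<configuration>
  <!-- Partition the generated fetchlists by domain instead of by host -->
  <property>
    <name>partition.url.mode</name>
    <value>byDomain</value>
  </property>
  <!-- Group urls into fetch queues by domain, so politeness limits apply per domain -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byDomain</value>
  </property>
</configuration>
```

Note that these settings only change how urls already selected by the generator are grouped. They do not by themselves change which domains make it into the fetchlist.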








Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Increase the number of threads when fetching.
Also, please see nutch-default.xml for the partitioning of urls; if you know your
target domains, you may wish to adapt the policy.
Lewis
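
A minimal sketch of raising the fetcher thread count, assuming the override is placed in conf/nutch-site.xml and that fetcher.threads.fetch is the relevant property in your version (the property name is not given in this message, and the value 50 is only an example):

```xml
<configuration>
  <!-- Assumption: total number of fetcher threads per fetch task; 50 is an example value -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
</configuration>
```

More threads only help when the fetchlist contains urls from enough different hosts or domains to keep them busy, since each queue is still rate-limited on its own.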


-- 
*Lewis*