Posted to user@nutch.apache.org by Bharat Goyal <bh...@shiksha.com> on 2012/02/21 09:19:26 UTC

Optimising the speed of Nutch.

Hi,

I have a list of around 1000 seed URLs, which I crawl to depth 2 or 3.
This is done on a local machine (with no other resource-hungry
processes running) with the following configuration:
Dual Core (2.4 GHz),
4 GB RAM

It takes around 14-15 hours to crawl this seed list, which yields
around 21k pages of content. Is there any way this can be optimized to
take less time? The Nutch (1.2) settings are all at their defaults.
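
For reference, a crawl like this would typically have been launched with
the one-shot crawl command that shipped with Nutch 1.2 (the Crawl class,
removed in later releases). The exact invocation isn't given in the
thread, so the seed directory and flag values below are assumptions
matching the setup described above:

    # ~1000 seed URLs in the urls/ directory, depth 3, and the default
    # 10 fetcher threads discussed later in the thread:
    bin/nutch crawl urls -dir crawl -depth 3 -threads 10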

Thanks for the help.

Regards,

Bharat Goyal

DISCLAIMER
This email is intended only for the person or the entity to whom it is addressed and may contain information which is confidential and privileged. Any review, retransmission, dissemination or any other use of the said information by person or entities other than intended recipient is unauthorized and prohibited. If you are not the intended recipient, please delete this email and contact the sender.

Re: Optimising the speed of Nutch.

Posted by remi tassing <ta...@gmail.com>.
Try decreasing the number of fetcher threads instead...
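
As a minimal sketch, that change goes in conf/nutch-site.xml, which
overrides nutch-default.xml. The property name is the real one from the
Nutch 1.x line; the value below is only illustrative, not a
recommendation from the thread:

    <!-- conf/nutch-site.xml -->
    <configuration>
      <!-- Fewer fetcher threads so that inline parsing does not
           saturate the CPU (the default is 10; the poster had
           raised it to 30). -->
      <property>
        <name>fetcher.threads.fetch</name>
        <value>5</value>
      </property>
    </configuration>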

On Wed, Feb 22, 2012 at 2:33 PM, Bharat Goyal <bh...@shiksha.com> wrote:

> Went through the checklist and made some changes: increased the number
> of fetcher threads from the default 10 to 30, but I still see Nutch
> eating up all the resources; the CPU usage is as high as 100%.
>
> -Bharat

Re: Optimising the speed of Nutch.

Posted by Bharat Goyal <bh...@shiksha.com>.
Went through the checklist and made some changes: increased the number
of fetcher threads from the default 10 to 30, but I still see Nutch
eating up all the resources; the CPU usage is as high as 100%.

-Bharat

On Tuesday 21 February 2012 04:45 PM, Julien Nioche wrote:
> See http://wiki.apache.org/nutch/OptimizingCrawls for a checklist

Re: Optimising the speed of Nutch.

Posted by Julien Nioche <li...@gmail.com>.
See http://wiki.apache.org/nutch/OptimizingCrawls for a checklist

On 21 February 2012 10:47, Bharat Goyal <bh...@shiksha.com> wrote:

> The number of fetcher threads is at the default value (10). What is the
> optimum number of threads? Also, the fetching and parsing are not separate.
>
> -Bharat



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Optimising the speed of Nutch.

Posted by Bharat Goyal <bh...@shiksha.com>.
The number of fetcher threads is at the default value (10). What is the
optimum number of threads? Also, the fetching and parsing are not separate.

-Bharat

On Tuesday 21 February 2012 04:11 PM, Lewis John Mcgibbney wrote:
> How many fetcher threads do you have at play?
> Also, are you separating fetching and parsing?
>
> These are (generally speaking) places to get started.



Re: Optimising the speed of Nutch.

Posted by Lewis John Mcgibbney <le...@gmail.com>.
How many fetcher threads do you have at play?
Also, are you separating fetching and parsing?

These are (generally speaking) places to get started.
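
A minimal sketch of the second point, assuming Nutch 1.x: set
fetcher.parse to false in conf/nutch-site.xml so the fetcher only
downloads content, then run parsing as its own step. The segment path
below is illustrative:

    # 1) In conf/nutch-site.xml, disable parsing inside the fetcher:
    #      <property>
    #        <name>fetcher.parse</name>
    #        <value>false</value>
    #      </property>
    # 2) In the step-by-step workflow, fetch and parse then run as
    #    separate commands against each segment:
    bin/nutch fetch crawl/segments/20120221101500
    bin/nutch parse crawl/segments/20120221101500

This keeps the mostly I/O-bound fetch and the CPU-bound parse from
competing for the same threads, and a failed parse no longer forces a
re-fetch of the segment.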


-- 
Lewis