Posted to user@nutch.apache.org by Alberto Ramos <al...@gmail.com> on 2014/02/18 15:37:49 UTC

Crawling slow and fast sites in parallel

Hi,
I use Nutch 2 on Hadoop to crawl a few sites.
One of them is deep and fast; the others are shallow and slow.
In the first fetch cycles the fast site finishes after about 2 minutes and
then waits for the slow sites, which finish after about 40 minutes. After
Nutch is done crawling the slow sites, the fast site is still being fetched
(because it is deeper). I don't want to use fetcher.max.crawl.delay, since I
do want to crawl both kinds of sites. My temporary workaround is to run a
separate Nutch process for each site, which is obviously ugly and doesn't
take advantage of the Hadoop architecture.
Any suggestions for performance improvement?
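
(For reference, the per-site workaround amounts to roughly the following;
the seed directories, crawl IDs, and round counts are made up, and the
exact bin/crawl arguments vary between Nutch 2.x versions:)

  # One fully independent crawl per site, each with its own seed list and
  # crawl ID, so the fast site never waits on the slow sites' fetch queues.
  bin/crawl urls/fast-site fastCrawl 10 &
  bin/crawl urls/slow-sites slowCrawl 5 &
  wait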

Re: Crawling slow and fast sites in parallel

Posted by Sebastian Nagel <wa...@googlemail.com>.
In addition, you could use generate.max.count
to limit the number of URLs per host and cycle
to a fixed maximum. That may help to keep
the balance between hosts / sites.
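
For example, in nutch-site.xml (the value 100 is purely illustrative):

  <!-- Cap how many URLs a single host may contribute per generate cycle. -->
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>

  <!-- Apply the cap per host; 'domain' and 'ip' are the other modes. -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>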

On 02/18/2014 04:01 PM, Markus Jelsma wrote:
> Most of the time, reducing the number of URLs per cycle solves the problem. You can also limit the fetcher's run time; check the fetcher.* settings.
>  


RE: Crawling slow and fast sites in parallel

Posted by Markus Jelsma <ma...@openindex.io>.
Most of the time, reducing the number of URLs per cycle solves the problem. You can also limit the fetcher's run time; check the fetcher.* settings.
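
For example (values illustrative; fetcher.timelimit.mins stops each fetch
round after the given number of minutes, leaving the remaining URLs for
the next cycle):

  # Generate at most 1000 URLs per cycle (Nutch 2.x GeneratorJob)
  bin/nutch generate -topN 1000

And in nutch-site.xml:

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>30</value>
  </property>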
 