Posted to user@nutch.apache.org by AJ Chen <aj...@web2express.org> on 2010/09/01 01:24:38 UTC

Re: performance for small cluster

Thanks for suggesting the multiple-segments approach - it's the way to go for
further increasing crawling throughput.  I tried the -maxNumSegments 3
option in local mode, but it did not generate 3 segments.  Does the option
work, or does it only work in distributed mode?

I also observe that, when fetching a 1M-url segment, 99% is done in 4
hours, but the last 1% takes forever. For performance reasons, it makes sense
to drop the last 1% of urls.  One option is to set fetcher.timelimit.mins to an
appropriate time span, but estimating that time span may not be reliable. Is
there a smarter way to empty the queues toward the end of fetching
(before the Fetcher is done)?  This could potentially save several hours per
fetch operation.
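
For reference, the kind of setting I mean is roughly this in conf/nutch-site.xml
(a sketch - assuming the Fetcher honors fetcher.timelimit.mins as documented;
240 minutes is just an example value):

  <property>
    <!-- stop queuing new urls and drain the fetch queues roughly this many
         minutes after the fetch starts; the default -1 means no limit -->
    <name>fetcher.timelimit.mins</name>
    <value>240</value>
  </property>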

thanks,
-aj


On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2010-08-17 23:16, AJ Chen wrote:
>
>> Scott, thanks again for your insights. My 4 cheap linux boxes are now
>> crawling selected sites at about 1M pages per day. The fetch itself is
>> reasonably fast. But when the crawl db has >10M urls, a lot of time is spent
>> generating the segment (2-3 hours) and updating the crawldb (4-5 hours after
>> each segment).  I expect this non-fetching time to keep increasing as the
>> crawl db grows to 100M urls.  Is there any good way to reduce the
>> non-fetching time (i.e. generate segment and update crawldb)?
>>
>
> That's surprisingly long for this configuration... What do you think takes
> most time in e.g. updatedb job? map, shuffle, sort or reduce phase?
>
> One strategy to minimize the turnaround time is to overlap crawl cycles.
> E.g. you can generate multiple fetchlists in one go, then fetch one. Next,
> start fetching the next one, and in parallel you can start parsing/updatedb
> from the first segment. Note that you need to either generate multiple
> segments (there's an option in Generator to do so), or you need to turn on
> generate.update.crawldb, but you don't need both.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: performance for small cluster

Posted by AJ Chen <aj...@web2express.org>.
in distributed mode, "generate -topN 1000000 -maxNumSegments 3"  creates 3
segments, but the size is very uneven: 1.7M, 0.8M, 0.5M.

I also tried fetcher.timelimit.mins=240 in distributed mode. but the fetcher
did not stop after 4 hours. any idea?
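
For concreteness, the overlapped cycle I am trying to drive looks roughly like
this - a sketch using the standard bin/nutch commands; the crawldb path and the
segment names are illustrative, and since multiple segments are generated up
front, generate.update.crawldb is not needed (per Andrzej's note below):

  # generate several fetchlists in one go
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000000 -maxNumSegments 3
  SEG1=crawl/segments/20100901000001    # illustrative segment names
  SEG2=crawl/segments/20100901000002

  # fetch the first segment
  bin/nutch fetch $SEG1

  # fetch the second segment while parsing/updating from the first one
  bin/nutch fetch $SEG2 &
  bin/nutch parse $SEG1
  bin/nutch updatedb crawl/crawldb $SEG1
  wait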

-aj

On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <aj...@web2express.org> wrote:

> Thanks for suggesting multiple segments approach - it's the way to go for
> further increasing crawling throughput.  I tried the -maxNumSegments 3
> option in local mode, but it did not generate 3 segments.  Does the option
> work? It may be only work in distributed mode.
>
> I also observe that, when fetching a 1M urls segment, 99% is done in 4
> hours, but the last 1% takes forever. For performance reason, it makes sense
> to drop the last 1% urls.  One option is to set fetcher.timelimit.mins to an
> appropriate time span. But, estimating the time span may not be reliable. Is
> there another smarter way to empty the queues toward the end of fetching
> (before Fetcher is done)?  This could potentially save several hours per
> fetch operation.
>
> thanks,
> -aj
>
>
>
> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>> On 2010-08-17 23:16, AJ Chen wrote:
>>
>>> Scott, thanks again for your insights. My 4 cheap linux boxes is now
>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>> reasonable fast. But, when crawl db has>10M urls, lots of time is spend
>>> in
>>> generating segment (2-3 hours) and update crawldb (4-5 hours after each
>>> segment).  I expect these non-fetching time will be increasing as the
>>> crawl
>>> db grows to 100M urls.  Is there any good way to reduce the non-fetching
>>> time (i.e. generate segment and update crawldb)?
>>>
>>
>> That's surprisingly long for this configuration... What do you think takes
>> most time in e.g. updatedb job? map, shuffle, sort or reduce phase?
>>
>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>> E.g. you can generate multiple fetchlists in one go, then fetch one. Next,
>> start fetching the next one, and in parallel you can start parsing/updatedb
>> from the first segment. Note that you need to either generate multiple
>> segments (there's an option in Generator to do so), or you need to turn on
>> generate.update.crawldb, but you don't need both.
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: performance for small cluster

Posted by AJ Chen <aj...@web2express.org>.
makes sense. thank you. -aj

On Fri, Sep 3, 2010 at 11:25 AM, Ken Krugler <kk...@transpac.com> wrote:

>
> On Sep 3, 2010, at 11:07am, AJ Chen wrote:
>
>  I also try larger number of maps, e.g.
>> mapred.map.tasks=100
>> mapred.tasktracker.map.tasks.maximum=5
>> however, the hadoop console shows num map tasks = 40.  why is total map
>> tasks capped at 40?  maybe another config parameter overrides the
>> mapred.map.tasks?
>>
>
> The number of mappers (child JVMs launched by the TaskTracker on a slave)
> is controllable by you, in the hadoop configuration xml files.
>
> The number of map tasks for a given job is essentially out of your control
> - it's determined by the system, based on the number of splits calculated by
> the input format, for the specified input data. In Hadoop 0.20 they've
> removed this configuration, IIRC, since it was confusing for users to try to
> set this, and then have the value ignored.
>
> Typically splits are on a per-HDFS block basis, so if you need to get more
> mappers running you can configure your HDFS to use a smaller block size
> (default is 64MB). But typically the number of map tasks doesn't have a big
> impact on overall performance, other than the case of having an unsplittable
> input file (e.g. a .gz compressed file) which then means a single map task
> has to process the entire file.
>
> -- Ken
>
>
>  On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <aj...@web2express.org> wrote:
>>
>>  The other option for reducing time in fetching the last 1% urls may be
>>> using a smaller queue size, I think.
>>> In Fetcher class, the queue size is magically determined as threadCount *
>>> 50.
>>>   feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
>>> Is there any good reason for factor 50?  If using 100 threads, the queue
>>> size is 5000, which seems to cause long waiting time toward the end of
>>> fetch. I want to reduce the queue size to 100 regardless the number of
>>> threads.  Dos this make sense? Will smaller queue size has any other
>>> negative effect?
>>>
>>> -aj
>>>
>>> On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <aj...@web2express.org> wrote:
>>>
>>>  Thanks for suggesting multiple segments approach - it's the way to go
>>>> for
>>>> further increasing crawling throughput.  I tried the -maxNumSegments 3
>>>> option in local mode, but it did not generate 3 segments.  Does the
>>>> option
>>>> work? It may be only work in distributed mode.
>>>>
>>>> I also observe that, when fetching a 1M urls segment, 99% is done in 4
>>>> hours, but the last 1% takes forever. For performance reason, it makes
>>>> sense
>>>> to drop the last 1% urls.  One option is to set fetcher.timelimit.mins
>>>> to an
>>>> appropriate time span. But, estimating the time span may not be
>>>> reliable. Is
>>>> there another smarter way to empty the queues toward the end of fetching
>>>> (before Fetcher is done)?  This could potentially save several hours per
>>>> fetch operation.
>>>>
>>>> thanks,
>>>> -aj
>>>>
>>>>
>>>>
>>>> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <ab...@getopt.org>
>>>> wrote:
>>>>
>>>>  On 2010-08-17 23:16, AJ Chen wrote:
>>>>>
>>>>>  Scott, thanks again for your insights. My 4 cheap linux boxes is now
>>>>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>>>>> reasonable fast. But, when crawl db has>10M urls, lots of time is
>>>>>> spend
>>>>>> in
>>>>>> generating segment (2-3 hours) and update crawldb (4-5 hours after
>>>>>> each
>>>>>> segment).  I expect these non-fetching time will be increasing as the
>>>>>> crawl
>>>>>> db grows to 100M urls.  Is there any good way to reduce the
>>>>>> non-fetching
>>>>>> time (i.e. generate segment and update crawldb)?
>>>>>>
>>>>>>
>>>>> That's surprisingly long for this configuration... What do you think
>>>>> takes most time in e.g. updatedb job? map, shuffle, sort or reduce
>>>>> phase?
>>>>>
>>>>> One strategy to minimize the turnaround time is to overlap crawl
>>>>> cycles.
>>>>> E.g. you can generate multiple fetchlists in one go, then fetch one.
>>>>> Next,
>>>>> start fetching the next one, and in parallel you can start
>>>>> parsing/updatedb
>>>>> from the first segment. Note that you need to either generate multiple
>>>>> segments (there's an option in Generator to do so), or you need to turn
>>>>> on
>>>>> generate.update.crawldb, but you don't need both.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Andrzej Bialecki     <><
>>>>> ___. ___ ___ ___ _ _   __________________________________
>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> AJ Chen, PhD
>>>> Chair, Semantic Web SIG, sdforum.org
>>>> http://web2express.org
>>>> twitter @web2express
>>>> Palo Alto, CA, USA
>>>>
>>>>
>>>
>>>
>>> --
>>> AJ Chen, PhD
>>> Chair, Semantic Web SIG, sdforum.org
>>> http://web2express.org
>>> twitter @web2express
>>> Palo Alto, CA, USA
>>>
>>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: performance for small cluster

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 3, 2010, at 11:07am, AJ Chen wrote:

> I also try larger number of maps, e.g.
> mapred.map.tasks=100
> mapred.tasktracker.map.tasks.maximum=5
> however, the hadoop console shows num map tasks = 40.  why is total  
> map
> tasks capped at 40?  maybe another config parameter overrides the
> mapred.map.tasks?

The number of mappers (child JVMs launched by the TaskTracker on a  
slave) is controllable by you, in the hadoop configuration xml files.

The number of map tasks for a given job is essentially out of your  
control - it's determined by the system, based on the number of splits  
calculated by the input format, for the specified input data. In  
Hadoop 0.20 they've removed this configuration, IIRC, since it was  
confusing for users to try to set this, and then have the value ignored.

Typically splits are on a per-HDFS block basis, so if you need to get  
more mappers running you can configure your HDFS to use a smaller  
block size (default is 64MB). But typically the number of map tasks  
doesn't have a big impact on overall performance, other than the case  
of having an unsplittable input file (e.g. a .gz compressed file)  
which then means a single map task has to process the entire file.
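
E.g. a sketch of the hdfs-site.xml change, assuming stock Hadoop 0.20 property
names - the value is 32MB expressed in bytes, and note it only applies to files
written after the change:

  <property>
    <name>dfs.block.size</name>
    <value>33554432</value>  <!-- 32MB, vs. the 64MB default -->
  </property>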

-- Ken

> On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <aj...@web2express.org>  
> wrote:
>
>> The other option for reducing time in fetching the last 1% urls may  
>> be
>> using a smaller queue size, I think.
>> In Fetcher class, the queue size is magically determined as  
>> threadCount *
>> 50.
>>    feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
>> Is there any good reason for factor 50?  If using 100 threads, the  
>> queue
>> size is 5000, which seems to cause long waiting time toward the end  
>> of
>> fetch. I want to reduce the queue size to 100 regardless the number  
>> of
>> threads.  Dos this make sense? Will smaller queue size has any other
>> negative effect?
>>
>> -aj
>>
>> On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <aj...@web2express.org>  
>> wrote:
>>
>>> Thanks for suggesting multiple segments approach - it's the way to  
>>> go for
>>> further increasing crawling throughput.  I tried the - 
>>> maxNumSegments 3
>>> option in local mode, but it did not generate 3 segments.  Does  
>>> the option
>>> work? It may be only work in distributed mode.
>>>
>>> I also observe that, when fetching a 1M urls segment, 99% is done  
>>> in 4
>>> hours, but the last 1% takes forever. For performance reason, it  
>>> makes sense
>>> to drop the last 1% urls.  One option is to set  
>>> fetcher.timelimit.mins to an
>>> appropriate time span. But, estimating the time span may not be  
>>> reliable. Is
>>> there another smarter way to empty the queues toward the end of  
>>> fetching
>>> (before Fetcher is done)?  This could potentially save several  
>>> hours per
>>> fetch operation.
>>>
>>> thanks,
>>> -aj
>>>
>>>
>>>
>>> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <ab...@getopt.org>  
>>> wrote:
>>>
>>>> On 2010-08-17 23:16, AJ Chen wrote:
>>>>
>>>>> Scott, thanks again for your insights. My 4 cheap linux boxes is  
>>>>> now
>>>>> crawling selected sites at about 1M pages per day. The fetch  
>>>>> itself is
>>>>> reasonable fast. But, when crawl db has>10M urls, lots of time  
>>>>> is spend
>>>>> in
>>>>> generating segment (2-3 hours) and update crawldb (4-5 hours  
>>>>> after each
>>>>> segment).  I expect these non-fetching time will be increasing  
>>>>> as the
>>>>> crawl
>>>>> db grows to 100M urls.  Is there any good way to reduce the non- 
>>>>> fetching
>>>>> time (i.e. generate segment and update crawldb)?
>>>>>
>>>>
>>>> That's surprisingly long for this configuration... What do you  
>>>> think
>>>> takes most time in e.g. updatedb job? map, shuffle, sort or  
>>>> reduce phase?
>>>>
>>>> One strategy to minimize the turnaround time is to overlap crawl  
>>>> cycles.
>>>> E.g. you can generate multiple fetchlists in one go, then fetch  
>>>> one. Next,
>>>> start fetching the next one, and in parallel you can start  
>>>> parsing/updatedb
>>>> from the first segment. Note that you need to either generate  
>>>> multiple
>>>> segments (there's an option in Generator to do so), or you need  
>>>> to turn on
>>>> generate.update.crawldb, but you don't need both.
>>>>
>>>> --
>>>> Best regards,
>>>> Andrzej Bialecki     <><
>>>> ___. ___ ___ ___ _ _   __________________________________
>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>
>>>>
>>>
>>>
>>> --
>>> AJ Chen, PhD
>>> Chair, Semantic Web SIG, sdforum.org
>>> http://web2express.org
>>> twitter @web2express
>>> Palo Alto, CA, USA
>>>
>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
>
>
> -- 
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: performance for small cluster

Posted by AJ Chen <aj...@web2express.org>.
I also tried a larger number of maps, e.g.
mapred.map.tasks=100
mapred.tasktracker.map.tasks.maximum=5
However, the hadoop console shows num map tasks = 40.  Why is the total number
of map tasks capped at 40?  Maybe another config parameter overrides
mapred.map.tasks?
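
For reference, this is roughly how those two values are set - a sketch of the
mapred-site.xml entries, assuming a stock Hadoop 0.20 setup:

  <property>
    <name>mapred.map.tasks</name>   <!-- only a hint to the framework -->
    <value>100</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>   <!-- map slots per node -->
    <value>5</value>
  </property>
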
-aj

On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <aj...@web2express.org> wrote:

> The other option for reducing time in fetching the last 1% urls may be
> using a smaller queue size, I think.
> In Fetcher class, the queue size is magically determined as threadCount *
> 50.
>     feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
> Is there any good reason for factor 50?  If using 100 threads, the queue
> size is 5000, which seems to cause long waiting time toward the end of
> fetch. I want to reduce the queue size to 100 regardless the number of
> threads.  Dos this make sense? Will smaller queue size has any other
> negative effect?
>
> -aj
>
> On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <aj...@web2express.org> wrote:
>
>> Thanks for suggesting multiple segments approach - it's the way to go for
>> further increasing crawling throughput.  I tried the -maxNumSegments 3
>> option in local mode, but it did not generate 3 segments.  Does the option
>> work? It may be only work in distributed mode.
>>
>> I also observe that, when fetching a 1M urls segment, 99% is done in 4
>> hours, but the last 1% takes forever. For performance reason, it makes sense
>> to drop the last 1% urls.  One option is to set fetcher.timelimit.mins to an
>> appropriate time span. But, estimating the time span may not be reliable. Is
>> there another smarter way to empty the queues toward the end of fetching
>> (before Fetcher is done)?  This could potentially save several hours per
>> fetch operation.
>>
>> thanks,
>> -aj
>>
>>
>>
>> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>>
>>> On 2010-08-17 23:16, AJ Chen wrote:
>>>
>>>> Scott, thanks again for your insights. My 4 cheap linux boxes is now
>>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>>> reasonable fast. But, when crawl db has>10M urls, lots of time is spend
>>>> in
>>>> generating segment (2-3 hours) and update crawldb (4-5 hours after each
>>>> segment).  I expect these non-fetching time will be increasing as the
>>>> crawl
>>>> db grows to 100M urls.  Is there any good way to reduce the non-fetching
>>>> time (i.e. generate segment and update crawldb)?
>>>>
>>>
>>> That's surprisingly long for this configuration... What do you think
>>> takes most time in e.g. updatedb job? map, shuffle, sort or reduce phase?
>>>
>>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>>> E.g. you can generate multiple fetchlists in one go, then fetch one. Next,
>>> start fetching the next one, and in parallel you can start parsing/updatedb
>>> from the first segment. Note that you need to either generate multiple
>>> segments (there's an option in Generator to do so), or you need to turn on
>>> generate.update.crawldb, but you don't need both.
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: performance for small cluster

Posted by AJ Chen <aj...@web2express.org>.
The other option for reducing the time spent fetching the last 1% of urls may
be using a smaller queue size, I think.
In the Fetcher class, the queue size is magically determined as threadCount *
50:
    feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
Is there any good reason for the factor of 50?  If using 100 threads, the queue
size is 5000, which seems to cause a long wait toward the end of the
fetch. I want to reduce the queue size to 100 regardless of the number of
threads.  Does this make sense? Would a smaller queue size have any other
negative effects?
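
Something like this change in Fetcher is what I have in mind - just a sketch,
and the property name below is illustrative, not an existing Nutch setting:

    // in Fetcher.run(), replacing the hardcoded capacity of threadCount * 50;
    // "fetcher.queue.feeder.size" is an illustrative property name, and getConf()
    // assumes the job Configuration is available at this point in Fetcher
    int feederCapacity = getConf().getInt("fetcher.queue.feeder.size", threadCount * 50);
    feeder = new QueueFeeder(input, fetchQueues, feederCapacity);

That would make the smaller queue a config change rather than a code change.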

-aj

On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <aj...@web2express.org> wrote:

> Thanks for suggesting multiple segments approach - it's the way to go for
> further increasing crawling throughput.  I tried the -maxNumSegments 3
> option in local mode, but it did not generate 3 segments.  Does the option
> work? It may be only work in distributed mode.
>
> I also observe that, when fetching a 1M urls segment, 99% is done in 4
> hours, but the last 1% takes forever. For performance reason, it makes sense
> to drop the last 1% urls.  One option is to set fetcher.timelimit.mins to an
> appropriate time span. But, estimating the time span may not be reliable. Is
> there another smarter way to empty the queues toward the end of fetching
> (before Fetcher is done)?  This could potentially save several hours per
> fetch operation.
>
> thanks,
> -aj
>
>
>
> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>> On 2010-08-17 23:16, AJ Chen wrote:
>>
>>> Scott, thanks again for your insights. My 4 cheap linux boxes is now
>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>> reasonable fast. But, when crawl db has>10M urls, lots of time is spend
>>> in
>>> generating segment (2-3 hours) and update crawldb (4-5 hours after each
>>> segment).  I expect these non-fetching time will be increasing as the
>>> crawl
>>> db grows to 100M urls.  Is there any good way to reduce the non-fetching
>>> time (i.e. generate segment and update crawldb)?
>>>
>>
>> That's surprisingly long for this configuration... What do you think takes
>> most time in e.g. updatedb job? map, shuffle, sort or reduce phase?
>>
>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>> E.g. you can generate multiple fetchlists in one go, then fetch one. Next,
>> start fetching the next one, and in parallel you can start parsing/updatedb
>> from the first segment. Note that you need to either generate multiple
>> segments (there's an option in Generator to do so), or you need to turn on
>> generate.update.crawldb, but you don't need both.
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA