Posted to hdfs-user@hadoop.apache.org by Pavan Sudheendra <pa...@gmail.com> on 2013/08/25 22:36:50 UTC

Mapper and Reducer take longer than usual for an HBase table aggregation task

Hi all,

My mapper function processes and aggregates data from 3 HBase tables and
writes it to the reducer for further operations.

However, all 3 tables have a small number of rows, not in the order of
millions. Still, my job takes this long to complete:

16:07:29,632  INFO JobClient:1435 - Running job: job_201308231255_0057
16:07:30,640  INFO JobClient:1448 -  map 0% reduce 0%
16:42:02,778  INFO JobClient:1448 -  map 100% reduce 0%
16:42:11,793  INFO JobClient:1448 -  map 100% reduce 67%
16:43:51,959  INFO JobClient:1448 -  map 100% reduce 68%
16:46:28,278  INFO JobClient:1448 -  map 100% reduce 69%
16:48:44,497  INFO JobClient:1448 -  map 100% reduce 70%
16:50:51,698  INFO JobClient:1448 -  map 100% reduce 71%
16:52:55,885  INFO JobClient:1448 -  map 100% reduce 72%
16:55:42,141  INFO JobClient:1448 -  map 100% reduce 73%
16:58:24,384  INFO JobClient:1448 -  map 100% reduce 74%
17:00:58,614  INFO JobClient:1448 -  map 100% reduce 75%
17:03:36,849  INFO JobClient:1448 -  map 100% reduce 100%
17:03:38,853  INFO JobClient:1503 - Job complete: job_201308231255_0057
17:03:38,869  INFO JobClient:566 - Counters: 32
17:03:38,873  INFO JobClient:568 -   File System Counters
17:03:38,876  INFO JobClient:570 -     FILE: Number of bytes read=2253157
17:03:38,876  INFO JobClient:570 -     FILE: Number of bytes written=4936116
17:03:38,877  INFO JobClient:570 -     FILE: Number of read operations=0
17:03:38,877  INFO JobClient:570 -     FILE: Number of large read operations=0
17:03:38,877  INFO JobClient:570 -     FILE: Number of write operations=0
17:03:38,877  INFO JobClient:570 -     HDFS: Number of bytes read=116
17:03:38,877  INFO JobClient:570 -     HDFS: Number of bytes written=0
17:03:38,878  INFO JobClient:570 -     HDFS: Number of read operations=1
17:03:38,878  INFO JobClient:570 -     HDFS: Number of large read operations=0
17:03:38,878  INFO JobClient:570 -     HDFS: Number of write operations=0
17:03:38,881  INFO JobClient:568 -   Job Counters
17:03:38,882  INFO JobClient:570 -     Launched map tasks=1
17:03:38,882  INFO JobClient:570 -     Launched reduce tasks=1
17:03:38,882  INFO JobClient:570 -     Data-local map tasks=1
17:03:38,882  INFO JobClient:570 -     Total time spent by all maps in occupied slots (ms)=2066262
17:03:38,882  INFO JobClient:570 -     Total time spent by all reduces in occupied slots (ms)=1293243
17:03:38,883  INFO JobClient:570 -     Total time spent by all maps waiting after reserving slots (ms)=0
17:03:38,883  INFO JobClient:570 -     Total time spent by all reduces waiting after reserving slots (ms)=0
17:03:38,886  INFO JobClient:568 -   Map-Reduce Framework
17:03:38,886  INFO JobClient:570 -     Map input records=82818
17:03:38,886  INFO JobClient:570 -     Map output records=82818
17:03:38,886  INFO JobClient:570 -     Map output bytes=8504915
17:03:38,886  INFO JobClient:570 -     Input split bytes=116
17:03:38,887  INFO JobClient:570 -     Combine input records=0
17:03:38,887  INFO JobClient:570 -     Combine output records=0
17:03:38,887  INFO JobClient:570 -     Reduce input groups=82706
17:03:38,887  INFO JobClient:570 -     Reduce shuffle bytes=2253153
17:03:38,887  INFO JobClient:570 -     Reduce input records=82818
17:03:38,888  INFO JobClient:570 -     Reduce output records=82706
17:03:38,888  INFO JobClient:570 -     Spilled Records=165636
17:03:38,888  INFO JobClient:570 -     CPU time spent (ms)=3201360
17:03:38,888  INFO JobClient:570 -     Physical memory (bytes) snapshot=1090387968
17:03:38,888  INFO JobClient:570 -     Virtual memory (bytes) snapshot=6683607040
17:03:38,889  INFO JobClient:570 -     Total committed heap usage (bytes)=487325696
17:03:38,890  INFO ActionDataInterpret:595 - Map Job is Completed


This is a lot longer than I expected; 1 hour is just too slow. Can I
improve it? We have a 6-node cluster running on EC2 at the moment.

Another question: why does it indicate the number of mappers as 1? Can I
change it so that multiple mappers perform the computation?

2.) If my table is in the order of millions of rows, the number of mappers
increases to 5. How does Hadoop know how many mappers to run for a
specific job?

-- 
Regards-
Pavan

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by Ted Yu <yu...@gmail.com>.
Pavan:
Did you use TableInputFormat or one of its variants?
If so, take a look at TableSplit and how it is used in
TableInputFormatBase#getSplits().

Cheers
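
As a quick check of that point, here is a hedged sketch (the table name is a
placeholder, not from Pavan's job): TableInputFormatBase#getSplits() builds
roughly one TableSplit per region, so a single-region table gives the job a
single map task.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Pair;

    public class RegionCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "first_table" is a placeholder; use the table your job scans.
        HTable table = new HTable(conf, "first_table");
        // One start/end key pair per region; TableInputFormatBase makes one split per region.
        Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
        System.out.println("regions (~ map tasks) = " + keys.getFirst().length);
        table.close();
      }
    }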


On Sun, Aug 25, 2013 at 2:36 PM, Jens Scheidtmann <
jens.scheidtmann@gmail.com> wrote:

> Hi Pavan,
>
>
>> 2. ) If my table is in the order of millions, the number of mappers is
>> increased to 5.. How does Hadoop know how many mappers to run for a
>> specific job?
>>
>> The number of input splits determines the number of mappers. Usually (in
> the default case) your source is split into hdfs blocks (usually 64 MB) and
> for each block, there will be a mapper.
>
> Best regards,
>
> Jens
>
>

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by Jens Scheidtmann <je...@gmail.com>.
Hi Pavan,


> 2. ) If my table is in the order of millions, the number of mappers is
> increased to 5.. How does Hadoop know how many mappers to run for a
> specific job?
>
The number of input splits determines the number of mappers. Usually (in
the default case) your source is split into HDFS blocks (usually 64 MB), and
for each block there will be a mapper.
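
As a rough sketch of that default for the mapreduce-API FileInputFormat
(the values shown are the usual defaults, not read from this cluster):

    public class SplitSizeSketch {
      public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // dfs.block.size, 64 MB by default
        long minSize   = 1L;                  // mapred.min.split.size default
        long maxSize   = Long.MAX_VALUE;      // mapred.max.split.size default
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        // One mapper per split, so: map tasks ~= ceil(total input bytes / splitSize)
        System.out.println("split size = " + splitSize + " bytes");
      }
    }

With HBase's TableInputFormat, however, splits come from table regions rather
than HDFS blocks, so these file-split settings do not apply to a table scan.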

Best regards,

Jens

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by Pavan Sudheendra <pa...@gmail.com>.
Ted and lhztop, here is a gist of my code: http://pastebin.com/mxY4AqBA

Can you suggest a few ways of optimizing it? I know I am re-initializing the
conf object in the map function every time it's called; I'll change that.
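
For example, here is a minimal sketch of moving that initialization into
setup() so it runs once per map task instead of once per row; the class,
table, and method bodies are made up, not taken from the pastebin:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.Text;

    public class JoinMapper extends TableMapper<Text, Text> {
      private HTable lookupTable;   // side table opened once per task, reused by every map() call

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();   // reuse the job's conf, don't rebuild it
        lookupTable = new HTable(conf, "lookup_table");     // placeholder table name
      }

      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // ... do the per-row join/aggregation here using lookupTable.get(...) ...
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        lookupTable.close();   // release the connection once, at task end
      }
    }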

Anil Gupta, it's a 6-node cluster, so 6 region servers. I am basically trying
to do a partial join across 3 tables, perform some computation on the result,
and dump it into another table.

The first table is somewhere around 19m rows, the 2nd one 1m rows, and the
3rd table is 2.5m rows. I know we could use Hive/Pig for this, but I am
supposed to write this as a map/reduce application. For the first table, I
created a smaller subset of 100,000 rows and ran it; the output was my first
thread message, which completed in one hour. For 19m rows, I cannot imagine
it finishing in a reasonable time. Please suggest something.


On Mon, Aug 26, 2013 at 12:03 PM, Pavan Sudheendra <pa...@gmail.com> wrote:

> Jens, can i set a smaller value in my application?
> Is this valid?
> conf.setInt("mapred.max.split.size", 50);
>
> This is our mapred-site.xml:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>ip-10-10-100170.eu-east-1.compute.internal:8021</value>
>   </property>
>   <property>
>     <name>mapred.job.tracker.http.address</name>
>     <value>0.0.0.0:50030</value>
>   </property>
>   <property>
>     <name>mapreduce.job.counters.max</name>
>     <value>120</value>
>   </property>
>   <property>
>     <name>mapred.output.compress</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.output.compression.type</name>
>     <value>BLOCK</value>
>   </property>
>   <property>
>     <name>mapred.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.DefaultCodec</value>
>   </property>
>   <property>
>     <name>mapred.map.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.SnappyCodec</value>
>   </property>
>   <property>
>     <name>mapred.compress.map.output</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>zlib.compress.level</name>
>     <value>DEFAULT_COMPRESSION</value>
>   </property>
>   <property>
>     <name>io.sort.factor</name>
>     <value>64</value>
>   </property>
>   <property>
>     <name>io.sort.record.percent</name>
>     <value>0.05</value>
>   </property>
>   <property>
>     <name>io.sort.spill.percent</name>
>     <value>0.8</value>
>   </property>
>   <property>
>     <name>mapred.reduce.parallel.copies</name>
>     <value>10</value>
>   </property>
>   <property>
>     <name>mapred.submit.replication</name>
>     <value>2</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>6</value>
>   </property>
>   <property>
>     <name>mapred.userlog.retain.hours</name>
>     <value>24</value>
>   </property>
>   <property>
>     <name>io.sort.mb</name>
>     <value>112</value>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value> -Xmx471075479</value>
>   </property>
>   <property>
>     <name>mapred.job.reuse.jvm.num.tasks</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.map.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.slowstart.completed.maps</name>
>     <value>0.8</value>
>   </property></configuration>
>
>
> Suggest ways to overwrite the default value please.
>
>
> On Mon, Aug 26, 2013 at 9:38 AM, anil gupta <an...@gmail.com> wrote:
>
>> Hi Pavan,
>>
>> Standalone cluster? How many RS you are running?What are you trying to
>> achieve in MR? Have you tried increasing scanner caching?
>> Slow is very theoretical unless we know some more details of your stuff.
>>
>> ~Anil
>>
>>
>>
>> On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:
>>
>>> You need release your map code here to analyze the question. generally,
>>> when map/reduce hbase,  scanner with filter(s) is used. so the mapper count
>>> is the hbase region count in your hbase table.
>>> As the reason why you reduce so slow, I guess, you have an disaster join
>>> on the three tables, which cause too many rows.
>>>
>>> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>>>
>>>  Another Question, why does it indicate number of mappers as 1? Can i
>>>> change it so that multiple mappers perform computation?
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>
>
>
> --
> Regards-
> Pavan
>



-- 
Regards-
Pavan

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by Pavan Sudheendra <pa...@gmail.com>.
Jens, can I set a smaller value in my application?
Is this valid?
conf.setInt("mapred.max.split.size", 50);

This is our mapred-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ip-10-10-100170.eu-east-1.compute.internal:8021</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>0.0.0.0:50030</value>
  </property>
  <property>
    <name>mapreduce.job.counters.max</name>
    <value>120</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>zlib.compress.level</name>
    <value>DEFAULT_COMPRESSION</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>64</value>
  </property>
  <property>
    <name>io.sort.record.percent</name>
    <value>0.05</value>
  </property>
  <property>
    <name>io.sort.spill.percent</name>
    <value>0.8</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.submit.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
  </property>
  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>24</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>112</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value> -Xmx471075479</value>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.8</value>
  </property></configuration>


Please suggest ways to override the default value.
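
For what it's worth, a hedged sketch of setting things per job from the driver
rather than in mapred-site.xml (the table, class, and job names are
placeholders): with TableInputFormat the number of map tasks comes from the
table's regions, not from mapred.max.split.size (which is in bytes anyway), so
scanner caching is usually the more effective knob for a table scan.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class AggregationDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml / mapred-site.xml
        Job job = new Job(conf, "hbase-aggregation");        // job name is illustrative
        job.setJarByClass(AggregationDriver.class);
        job.setNumReduceTasks(6);                            // per-job override of a site-file default

        Scan scan = new Scan();
        scan.setCaching(500);        // rows per RPC; the default of 1 makes full scans very slow
        scan.setCacheBlocks(false);  // don't churn the block cache during an MR scan

        TableMapReduceUtil.initTableMapperJob(
            "first_table",           // placeholder source table
            scan,
            JoinMapper.class,        // the hypothetical mapper sketched earlier in the thread
            Text.class, Text.class,  // map output key/value classes
            job);
        // reducer setup omitted in this sketch

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }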


On Mon, Aug 26, 2013 at 9:38 AM, anil gupta <an...@gmail.com> wrote:

> Hi Pavan,
>
> Standalone cluster? How many RS you are running?What are you trying to
> achieve in MR? Have you tried increasing scanner caching?
> Slow is very theoretical unless we know some more details of your stuff.
>
> ~Anil
>
>
>
> On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:
>
>> You need release your map code here to analyze the question. generally,
>> when map/reduce hbase,  scanner with filter(s) is used. so the mapper count
>> is the hbase region count in your hbase table.
>> As the reason why you reduce so slow, I guess, you have an disaster join
>> on the three tables, which cause too many rows.
>>
>> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>>
>>  Another Question, why does it indicate number of mappers as 1? Can i
>>> change it so that multiple mappers perform computation?
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>



-- 
Regards-
Pavan

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by anil gupta <an...@gmail.com>.
Hi Pavan,

Standalone cluster? How many RS are you running? What are you trying to
achieve in MR? Have you tried increasing scanner caching?
"Slow" is very theoretical unless we know some more details of your setup.

~Anil



On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:

> You need release your map code here to analyze the question. generally,
> when map/reduce hbase,  scanner with filter(s) is used. so the mapper count
> is the hbase region count in your hbase table.
> As the reason why you reduce so slow, I guess, you have an disaster join
> on the three tables, which cause too many rows.
>
> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>
>  Another Question, why does it indicate number of mappers as 1? Can i
>> change it so that multiple mappers perform computation?
>>
>
>


-- 
Thanks & Regards,
Anil Gupta

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by 李洪忠 <lh...@hotmail.com>.
You need to post your map code here so we can analyze the question. Generally,
when you map/reduce over HBase, a scanner with filter(s) is used, so the
mapper count is the region count of your HBase table.
As for why your reduce is so slow, my guess is that you have a runaway join
across the three tables, which produces too many rows.
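
On the mapper-count side, here is a hedged sketch of one way to get more
regions, and therefore more map tasks, by creating the table pre-split; the
table name, column family, and split keys are purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("first_table_presplit"); // placeholder name
        desc.addFamily(new HColumnDescriptor("cf"));                          // placeholder family

        // Three split keys -> four regions -> roughly four map tasks with TableInputFormat.
        byte[][] splitKeys = {
            Bytes.toBytes("05000000"),
            Bytes.toBytes("10000000"),
            Bytes.toBytes("15000000"),
        };
        admin.createTable(desc, splitKeys);
      }
    }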

On 2013/8/26 4:36, Pavan Sudheendra wrote:
> Another Question, why does it indicate number of mappers as 1? Can i 
> change it so that multiple mappers perform computation?

