Posted to hdfs-user@hadoop.apache.org by Pavan Sudheendra <pa...@gmail.com> on 2013/08/25 22:36:50 UTC

Mapper and Reducer take longer than usual for an HBase table aggregation task

Hi all,

My mapper function processes and aggregates data from 3 HBase tables and
writes it to the reducer for further operations.

However, all 3 tables have a small number of rows, not in the order of
millions. Still, my job takes this long to complete:

16:07:29,632  INFO JobClient:1435 - Running job: job_201308231255_0057
16:07:30,640  INFO JobClient:1448 -  map 0% reduce 0%
16:42:02,778  INFO JobClient:1448 -  map 100% reduce 0%
16:42:11,793  INFO JobClient:1448 -  map 100% reduce 67%
16:43:51,959  INFO JobClient:1448 -  map 100% reduce 68%
16:46:28,278  INFO JobClient:1448 -  map 100% reduce 69%
16:48:44,497  INFO JobClient:1448 -  map 100% reduce 70%
16:50:51,698  INFO JobClient:1448 -  map 100% reduce 71%
16:52:55,885  INFO JobClient:1448 -  map 100% reduce 72%
16:55:42,141  INFO JobClient:1448 -  map 100% reduce 73%
16:58:24,384  INFO JobClient:1448 -  map 100% reduce 74%
17:00:58,614  INFO JobClient:1448 -  map 100% reduce 75%
17:03:36,849  INFO JobClient:1448 -  map 100% reduce 100%
17:03:38,853  INFO JobClient:1503 - Job complete: job_201308231255_0057
17:03:38,869  INFO JobClient:566 - Counters: 32
17:03:38,873  INFO JobClient:568 -   File System Counters
17:03:38,876  INFO JobClient:570 -     FILE: Number of bytes read=2253157
17:03:38,876  INFO JobClient:570 -     FILE: Number of bytes written=4936116
17:03:38,877  INFO JobClient:570 -     FILE: Number of read operations=0
17:03:38,877  INFO JobClient:570 -     FILE: Number of large read operations=0
17:03:38,877  INFO JobClient:570 -     FILE: Number of write operations=0
17:03:38,877  INFO JobClient:570 -     HDFS: Number of bytes read=116
17:03:38,877  INFO JobClient:570 -     HDFS: Number of bytes written=0
17:03:38,878  INFO JobClient:570 -     HDFS: Number of read operations=1
17:03:38,878  INFO JobClient:570 -     HDFS: Number of large read operations=0
17:03:38,878  INFO JobClient:570 -     HDFS: Number of write operations=0
17:03:38,881  INFO JobClient:568 -   Job Counters
17:03:38,882  INFO JobClient:570 -     Launched map tasks=1
17:03:38,882  INFO JobClient:570 -     Launched reduce tasks=1
17:03:38,882  INFO JobClient:570 -     Data-local map tasks=1
17:03:38,882  INFO JobClient:570 -     Total time spent by all maps in occupied slots (ms)=2066262
17:03:38,882  INFO JobClient:570 -     Total time spent by all reduces in occupied slots (ms)=1293243
17:03:38,883  INFO JobClient:570 -     Total time spent by all maps waiting after reserving slots (ms)=0
17:03:38,883  INFO JobClient:570 -     Total time spent by all reduces waiting after reserving slots (ms)=0
17:03:38,886  INFO JobClient:568 -   Map-Reduce Framework
17:03:38,886  INFO JobClient:570 -     Map input records=82818
17:03:38,886  INFO JobClient:570 -     Map output records=82818
17:03:38,886  INFO JobClient:570 -     Map output bytes=8504915
17:03:38,886  INFO JobClient:570 -     Input split bytes=116
17:03:38,887  INFO JobClient:570 -     Combine input records=0
17:03:38,887  INFO JobClient:570 -     Combine output records=0
17:03:38,887  INFO JobClient:570 -     Reduce input groups=82706
17:03:38,887  INFO JobClient:570 -     Reduce shuffle bytes=2253153
17:03:38,887  INFO JobClient:570 -     Reduce input records=82818
17:03:38,888  INFO JobClient:570 -     Reduce output records=82706
17:03:38,888  INFO JobClient:570 -     Spilled Records=165636
17:03:38,888  INFO JobClient:570 -     CPU time spent (ms)=3201360
17:03:38,888  INFO JobClient:570 -     Physical memory (bytes) snapshot=1090387968
17:03:38,888  INFO JobClient:570 -     Virtual memory (bytes) snapshot=6683607040
17:03:38,889  INFO JobClient:570 -     Total committed heap usage (bytes)=487325696
17:03:38,890  INFO ActionDataInterpret:595 - Map Job is Completed


This is a lot longer than I expected; 1 hour is just too slow. Can I
improve it? We have a 6-node cluster running on EC2 at the moment.

Another question: why does it indicate the number of mappers as 1? Can I
change it so that multiple mappers perform the computation?

2.) If my table is in the order of millions of rows, the number of mappers
increases to 5. How does Hadoop know how many mappers to run for a
specific job?

-- 
Regards-
Pavan

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by Ted Yu <yu...@gmail.com>.
Pavan:
Did you use TableInputFormat or one of its variants?
If so, take a look at TableSplit and how it is used in
TableInputFormatBase#getSplits().

Cheers
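
As a quick check of that point, here is a hedged sketch (the table name is a
placeholder, not from Pavan's job): TableInputFormatBase#getSplits() builds
roughly one TableSplit per region, so a single-region table gives the job a
single map task.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Pair;

    public class RegionCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "first_table" is a placeholder; use the table your job scans.
        HTable table = new HTable(conf, "first_table");
        // One start/end key pair per region; TableInputFormatBase makes one split per region.
        Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
        System.out.println("regions (~ map tasks) = " + keys.getFirst().length);
        table.close();
      }
    }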


On Sun, Aug 25, 2013 at 2:36 PM, Jens Scheidtmann <
jens.scheidtmann@gmail.com> wrote:

> Hi Pavan,
>
>
>> 2. ) If my table is in the order of millions, the number of mappers is
>> increased to 5.. How does Hadoop know how many mappers to run for a
>> specific job?
>>
>> The number of input splits determines the number of mappers. Usually (in
> the default case) your source is split into hdfs blocks (usually 64 MB) and
> for each block, there will be a mapper.
>
> Best regards,
>
> Jens
>
>

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by Jens Scheidtmann <je...@gmail.com>.
Hi Pavan,


> 2. ) If my table is in the order of millions, the number of mappers is
> increased to 5.. How does Hadoop know how many mappers to run for a
> specific job?
>
The number of input splits determines the number of mappers. Usually (in
the default case) your source is split into HDFS blocks (usually 64 MB), and
for each block there will be a mapper.
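
As a rough sketch of that default for the mapreduce-API FileInputFormat
(the values shown are the usual defaults, not read from this cluster):

    public class SplitSizeSketch {
      public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // dfs.block.size, 64 MB by default
        long minSize   = 1L;                  // mapred.min.split.size default
        long maxSize   = Long.MAX_VALUE;      // mapred.max.split.size default
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        // One mapper per split, so: map tasks ~= ceil(total input bytes / splitSize)
        System.out.println("split size = " + splitSize + " bytes");
      }
    }

With HBase's TableInputFormat, however, splits come from table regions rather
than HDFS blocks, so these file-split settings do not apply to a table scan.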

Best regards,

Jens

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by Pavan Sudheendra <pa...@gmail.com>.
Ted and lhztop, here is a gist of my code: http://pastebin.com/mxY4AqBA

Can you suggest a few ways of optimizing it? I know I am re-initializing the
conf object in the map function every time it's called; I'll change that.
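
For example, here is a minimal sketch of moving that initialization into
setup() so it runs once per map task instead of once per row; the class,
table, and method bodies are made up, not taken from the pastebin:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.Text;

    public class JoinMapper extends TableMapper<Text, Text> {
      private HTable lookupTable;   // side table opened once per task, reused by every map() call

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();   // reuse the job's conf, don't rebuild it
        lookupTable = new HTable(conf, "lookup_table");     // placeholder table name
      }

      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // ... do the per-row join/aggregation here using lookupTable.get(...) ...
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        lookupTable.close();   // release the connection once, at task end
      }
    }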

Anil Gupta, it's a 6-node cluster, so 6 region servers. I am basically trying
to do a partial join across 3 tables, perform some computation on the result,
and dump it into another table.

The first table is somewhere around 19m rows, the 2nd one 1m rows, and the
3rd table is 2.5m rows. I know we could use Hive/Pig for this, but I am
supposed to write this as a map/reduce application. For the first table, I
created a smaller subset of 100,000 rows and ran it; the output was my first
thread message, which completed in one hour. For 19m rows, I cannot imagine
it finishing in a reasonable time. Please suggest something.


On Mon, Aug 26, 2013 at 12:03 PM, Pavan Sudheendra <pa...@gmail.com> wrote:

> Jens, can i set a smaller value in my application?
> Is this valid?
> conf.setInt("mapred.max.split.size", 50);
>
> This is our mapred-site.xml:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>ip-10-10-100170.eu-east-1.compute.internal:8021</value>
>   </property>
>   <property>
>     <name>mapred.job.tracker.http.address</name>
>     <value>0.0.0.0:50030</value>
>   </property>
>   <property>
>     <name>mapreduce.job.counters.max</name>
>     <value>120</value>
>   </property>
>   <property>
>     <name>mapred.output.compress</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.output.compression.type</name>
>     <value>BLOCK</value>
>   </property>
>   <property>
>     <name>mapred.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.DefaultCodec</value>
>   </property>
>   <property>
>     <name>mapred.map.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.SnappyCodec</value>
>   </property>
>   <property>
>     <name>mapred.compress.map.output</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>zlib.compress.level</name>
>     <value>DEFAULT_COMPRESSION</value>
>   </property>
>   <property>
>     <name>io.sort.factor</name>
>     <value>64</value>
>   </property>
>   <property>
>     <name>io.sort.record.percent</name>
>     <value>0.05</value>
>   </property>
>   <property>
>     <name>io.sort.spill.percent</name>
>     <value>0.8</value>
>   </property>
>   <property>
>     <name>mapred.reduce.parallel.copies</name>
>     <value>10</value>
>   </property>
>   <property>
>     <name>mapred.submit.replication</name>
>     <value>2</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>6</value>
>   </property>
>   <property>
>     <name>mapred.userlog.retain.hours</name>
>     <value>24</value>
>   </property>
>   <property>
>     <name>io.sort.mb</name>
>     <value>112</value>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value> -Xmx471075479</value>
>   </property>
>   <property>
>     <name>mapred.job.reuse.jvm.num.tasks</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.map.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.slowstart.completed.maps</name>
>     <value>0.8</value>
>   </property></configuration>
>
>
> Suggest ways to overwrite the default value please.
>
>
> On Mon, Aug 26, 2013 at 9:38 AM, anil gupta <an...@gmail.com> wrote:
>
>> Hi Pavan,
>>
>> Standalone cluster? How many RS you are running?What are you trying to
>> achieve in MR? Have you tried increasing scanner caching?
>> Slow is very theoretical unless we know some more details of your stuff.
>>
>> ~Anil
>>
>>
>>
>> On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:
>>
>>> You need release your map code here to analyze the question. generally,
>>> when map/reduce hbase,  scanner with filter(s) is used. so the mapper count
>>> is the hbase region count in your hbase table.
>>> As the reason why you reduce so slow, I guess, you have an disaster join
>>> on the three tables, which cause too many rows.
>>>
>>> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>>>
>>>  Another Question, why does it indicate number of mappers as 1? Can i
>>>> change it so that multiple mappers perform computation?
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>
>
>
> --
> Regards-
> Pavan
>



-- 
Regards-
Pavan

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by Pavan Sudheendra <pa...@gmail.com>.
Jens, can I set a smaller value in my application?
Is this valid?
conf.setInt("mapred.max.split.size", 50);

This is our mapred-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ip-10-10-100170.eu-east-1.compute.internal:8021</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>0.0.0.0:50030</value>
  </property>
  <property>
    <name>mapreduce.job.counters.max</name>
    <value>120</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>zlib.compress.level</name>
    <value>DEFAULT_COMPRESSION</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>64</value>
  </property>
  <property>
    <name>io.sort.record.percent</name>
    <value>0.05</value>
  </property>
  <property>
    <name>io.sort.spill.percent</name>
    <value>0.8</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.submit.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
  </property>
  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>24</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>112</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value> -Xmx471075479</value>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.8</value>
  </property></configuration>


Please suggest ways to override the default value.
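
For what it's worth, a hedged sketch of setting things per job from the driver
rather than in mapred-site.xml (the table, class, and job names are
placeholders): with TableInputFormat the number of map tasks comes from the
table's regions, not from mapred.max.split.size (which is in bytes anyway), so
scanner caching is usually the more effective knob for a table scan.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class AggregationDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml / mapred-site.xml
        Job job = new Job(conf, "hbase-aggregation");        // job name is illustrative
        job.setJarByClass(AggregationDriver.class);
        job.setNumReduceTasks(6);                            // per-job override of a site-file default

        Scan scan = new Scan();
        scan.setCaching(500);        // rows per RPC; the default of 1 makes full scans very slow
        scan.setCacheBlocks(false);  // don't churn the block cache during an MR scan

        TableMapReduceUtil.initTableMapperJob(
            "first_table",           // placeholder source table
            scan,
            JoinMapper.class,        // the hypothetical mapper sketched earlier in the thread
            Text.class, Text.class,  // map output key/value classes
            job);
        // reducer setup omitted in this sketch

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }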


On Mon, Aug 26, 2013 at 9:38 AM, anil gupta <an...@gmail.com> wrote:

> Hi Pavan,
>
> Standalone cluster? How many RS you are running?What are you trying to
> achieve in MR? Have you tried increasing scanner caching?
> Slow is very theoretical unless we know some more details of your stuff.
>
> ~Anil
>
>
>
> On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:
>
>> You need release your map code here to analyze the question. generally,
>> when map/reduce hbase,  scanner with filter(s) is used. so the mapper count
>> is the hbase region count in your hbase table.
>> As the reason why you reduce so slow, I guess, you have an disaster join
>> on the three tables, which cause too many rows.
>>
>> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>>
>>  Another Question, why does it indicate number of mappers as 1? Can i
>>> change it so that multiple mappers perform computation?
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>



-- 
Regards-
Pavan

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by anil gupta <an...@gmail.com>.
Hi Pavan,

Standalone cluster? How many RS are you running? What are you trying to
achieve in MR? Have you tried increasing scanner caching?
"Slow" is very theoretical unless we know some more details of your setup.

~Anil



On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:

> You need release your map code here to analyze the question. generally,
> when map/reduce hbase,  scanner with filter(s) is used. so the mapper count
> is the hbase region count in your hbase table.
> As the reason why you reduce so slow, I guess, you have an disaster join
> on the three tables, which cause too many rows.
>
> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>
>  Another Question, why does it indicate number of mappers as 1? Can i
>> change it so that multiple mappers perform computation?
>>
>
>


-- 
Thanks & Regards,
Anil Gupta

Re: Mapper and Reducer take longer than usual for an HBase table aggregation task

Posted by 李洪忠 <lh...@hotmail.com>.
You need to post your map code here so we can analyze the question. Generally,
when you map/reduce over HBase, a scanner with filter(s) is used, so the
mapper count is the region count of your HBase table.
As for why your reduce is so slow, my guess is that you have a runaway join
across the three tables, which produces too many rows.
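
On the mapper-count side, here is a hedged sketch of one way to get more
regions, and therefore more map tasks, by creating the table pre-split; the
table name, column family, and split keys are purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("first_table_presplit"); // placeholder name
        desc.addFamily(new HColumnDescriptor("cf"));                          // placeholder family

        // Three split keys -> four regions -> roughly four map tasks with TableInputFormat.
        byte[][] splitKeys = {
            Bytes.toBytes("05000000"),
            Bytes.toBytes("10000000"),
            Bytes.toBytes("15000000"),
        };
        admin.createTable(desc, splitKeys);
      }
    }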

On 2013/8/26 4:36, Pavan Sudheendra wrote:
> Another Question, why does it indicate number of mappers as 1? Can i 
> change it so that multiple mappers perform computation?

