Posted to hdfs-user@hadoop.apache.org by Pavan Sudheendra <pa...@gmail.com> on 2013/08/25 22:36:50 UTC
Mapper and Reducer take longer than usual for an HBase table
aggregation task
Hi all,
My mapper function processes and aggregates data from three HBase tables and
writes it to the reducer for further operations.
However, all three tables have a small number of rows, nowhere near
millions. Still, my job log looks like this:
16:07:29,632 INFO JobClient:1435 - Running job: job_201308231255_0057
16:07:30,640 INFO JobClient:1448 - map 0% reduce 0%
16:42:02,778 INFO JobClient:1448 - map 100% reduce 0%
16:42:11,793 INFO JobClient:1448 - map 100% reduce 67%
16:43:51,959 INFO JobClient:1448 - map 100% reduce 68%
16:46:28,278 INFO JobClient:1448 - map 100% reduce 69%
16:48:44,497 INFO JobClient:1448 - map 100% reduce 70%
16:50:51,698 INFO JobClient:1448 - map 100% reduce 71%
16:52:55,885 INFO JobClient:1448 - map 100% reduce 72%
16:55:42,141 INFO JobClient:1448 - map 100% reduce 73%
16:58:24,384 INFO JobClient:1448 - map 100% reduce 74%
17:00:58,614 INFO JobClient:1448 - map 100% reduce 75%
17:03:36,849 INFO JobClient:1448 - map 100% reduce 100%
17:03:38,853 INFO JobClient:1503 - Job complete: job_201308231255_0057
17:03:38,869 INFO JobClient:566 - Counters: 32
17:03:38,873 INFO JobClient:568 - File System Counters
17:03:38,876 INFO JobClient:570 -   FILE: Number of bytes read=2253157
17:03:38,876 INFO JobClient:570 -   FILE: Number of bytes written=4936116
17:03:38,877 INFO JobClient:570 -   FILE: Number of read operations=0
17:03:38,877 INFO JobClient:570 -   FILE: Number of large read operations=0
17:03:38,877 INFO JobClient:570 -   FILE: Number of write operations=0
17:03:38,877 INFO JobClient:570 -   HDFS: Number of bytes read=116
17:03:38,877 INFO JobClient:570 -   HDFS: Number of bytes written=0
17:03:38,878 INFO JobClient:570 -   HDFS: Number of read operations=1
17:03:38,878 INFO JobClient:570 -   HDFS: Number of large read operations=0
17:03:38,878 INFO JobClient:570 -   HDFS: Number of write operations=0
17:03:38,881 INFO JobClient:568 - Job Counters
17:03:38,882 INFO JobClient:570 -   Launched map tasks=1
17:03:38,882 INFO JobClient:570 -   Launched reduce tasks=1
17:03:38,882 INFO JobClient:570 -   Data-local map tasks=1
17:03:38,882 INFO JobClient:570 -   Total time spent by all maps in occupied slots (ms)=2066262
17:03:38,882 INFO JobClient:570 -   Total time spent by all reduces in occupied slots (ms)=1293243
17:03:38,883 INFO JobClient:570 -   Total time spent by all maps waiting after reserving slots (ms)=0
17:03:38,883 INFO JobClient:570 -   Total time spent by all reduces waiting after reserving slots (ms)=0
17:03:38,886 INFO JobClient:568 - Map-Reduce Framework
17:03:38,886 INFO JobClient:570 -   Map input records=82818
17:03:38,886 INFO JobClient:570 -   Map output records=82818
17:03:38,886 INFO JobClient:570 -   Map output bytes=8504915
17:03:38,886 INFO JobClient:570 -   Input split bytes=116
17:03:38,887 INFO JobClient:570 -   Combine input records=0
17:03:38,887 INFO JobClient:570 -   Combine output records=0
17:03:38,887 INFO JobClient:570 -   Reduce input groups=82706
17:03:38,887 INFO JobClient:570 -   Reduce shuffle bytes=2253153
17:03:38,887 INFO JobClient:570 -   Reduce input records=82818
17:03:38,888 INFO JobClient:570 -   Reduce output records=82706
17:03:38,888 INFO JobClient:570 -   Spilled Records=165636
17:03:38,888 INFO JobClient:570 -   CPU time spent (ms)=3201360
17:03:38,888 INFO JobClient:570 -   Physical memory (bytes) snapshot=1090387968
17:03:38,888 INFO JobClient:570 -   Virtual memory (bytes) snapshot=6683607040
17:03:38,889 INFO JobClient:570 -   Total committed heap usage (bytes)=487325696
17:03:38,890 INFO ActionDataInterpret:595 - Map Job is Completed
This is a lot longer than I expected; one hour is just too slow. Can I
improve it? We have a 6-node cluster running on EC2 at the moment.
Another question: why does it indicate the number of mappers as 1? Can I
change it so that multiple mappers perform the computation?
2.) If my table is in the order of millions of rows, the number of mappers
increases to 5. How does Hadoop know how many mappers to run for a
specific job?
--
Regards-
Pavan
Re: Mapper and Reducer take longer than usual for an HBase table
aggregation task
Posted by Ted Yu <yu...@gmail.com>.
Pavan:
Did you use TableInputFormat or one of its variants?
If so, take a look at TableSplit and how it is used in
TableInputFormatBase#getSplits().
Cheers
On Sun, Aug 25, 2013 at 2:36 PM, Jens Scheidtmann <
jens.scheidtmann@gmail.com> wrote:
> Hi Pavan,
>
>
>> 2. ) If my table is in the order of millions, the number of mappers is
>> increased to 5.. How does Hadoop know how many mappers to run for a
>> specific job?
>>
> The number of input splits determines the number of mappers. Usually (in
> the default case) your source is split into hdfs blocks (usually 64 MB) and
> for each block, there will be a mapper.
>
> Best regards,
>
> Jens
>
>
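The behavior Ted points at can be sketched in isolation. The following is a simplified, hypothetical model of what TableInputFormatBase#getSplits() does by default (one input split per HBase region, delimited by region boundary keys), not the real Hadoop/HBase code; the region keys are made-up illustration values:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: by default, one input split is created per HBase
// region, so the mapper count tracks the region count, not the row count.
public class RegionSplits {

    // boundaries holds N+1 keys delimiting N regions,
    // e.g. ["", "g", ""] describes 2 regions.
    static List<String[]> getSplits(List<String> boundaries) {
        List<String[]> splits = new ArrayList<>();
        for (int i = 0; i < boundaries.size() - 1; i++) {
            // Each region [startKey, endKey) becomes exactly one split.
            splits.add(new String[] { boundaries.get(i), boundaries.get(i + 1) });
        }
        return splits;
    }

    public static void main(String[] args) {
        // A table that has never split has a single region -> a single mapper.
        System.out.println(getSplits(List.of("", "")).size()); // 1
        // A table spread over 5 regions -> 5 mappers.
        System.out.println(getSplits(List.of("", "e", "j", "o", "t", "")).size()); // 5
    }
}
```

Under this model, a small table whose data fits in one region yields one mapper no matter how many rows it has; pre-splitting the table is what raises the mapper count.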
Re: Mapper and Reducer take longer than usual for an HBase table
aggregation task
Posted by Jens Scheidtmann <je...@gmail.com>.
Hi Pavan,
> 2. ) If my table is in the order of millions, the number of mappers is
> increased to 5.. How does Hadoop know how many mappers to run for a
> specific job?
>
The number of input splits determines the number of mappers. Usually (in
the default case) your source is split into HDFS blocks (typically 64 MB), and
for each block there will be one mapper.
Best regards,
Jens
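For a plain HDFS input (not HBase), the arithmetic Jens describes can be sketched as follows; the file sizes are made-up examples and this is only a rough model of the default FileInputFormat behavior:

```java
public class MapperCount {
    // Rough model: one mapper per HDFS block,
    // i.e. ceil(fileSize / blockSize) input splits.
    static long mappers(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // the old 64 MB default block size
        // A 200 MB file spans 4 blocks -> 4 mappers.
        System.out.println(mappers(200L * 1024 * 1024, blockSize)); // 4
        // A 10 MB file fits in one block -> 1 mapper.
        System.out.println(mappers(10L * 1024 * 1024, blockSize));  // 1
    }
}
```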
Re: Mapper and Reducer take longer than usual for an HBase table
aggregation task
Posted by Pavan Sudheendra <pa...@gmail.com>.
Ted and lhztop, here is a gist of my code: http://pastebin.com/mxY4AqBA
Can you suggest a few ways of optimizing it? I know I am re-initializing the
conf object in the map function every time it's called; I'll change that.
Anil Gupta: a 6-node cluster, so 6 region servers. I am basically trying to
do a partial join across 3 tables, perform some computation on the result, and
dump it into another table.
The first table is somewhere around 19m rows, the 2nd one 1m rows, and the 3rd
table is 2.5m rows. I know we could use Hive/Pig for this, but I am to write it
as a map/reduce application. For the first table, I created a smaller
subset of 100,000 rows and ran it. The output was my first thread message,
which completed in one hour. For 19m rows, I cannot imagine it finishing in
a reasonable time. Please suggest something.
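One common pattern for this shape of job (one big table, two small ones) is a map-side hash join. The sketch below is an editorial illustration, not code from the thread: it assumes the two small tables fit in memory, loads them once (in a real job, inside Mapper#setup()), and enriches each big-table row with in-memory lookups instead of per-row RPCs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical map-side hash join sketch. small1/small2 stand in for the
// two small HBase tables, loaded once up front; join() stands in for the
// per-row work done in map().
public class MapSideJoin {
    static Map<String, String> small1 = new HashMap<>();
    static Map<String, String> small2 = new HashMap<>();

    static String join(String rowKey, String value) {
        // O(1) lookups; nothing is fetched over the network per row.
        return value + "," + small1.getOrDefault(rowKey, "-")
                     + "," + small2.getOrDefault(rowKey, "-");
    }

    public static void main(String[] args) {
        small1.put("k1", "a");
        small2.put("k1", "b");
        System.out.println(join("k1", "v")); // v,a,b
        System.out.println(join("k2", "v")); // v,-,-
    }
}
```

The design point is that the 19m-row table is only ever streamed through once, while the 1m- and 2.5m-row tables are read once each rather than probed per input row.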
On Mon, Aug 26, 2013 at 12:03 PM, Pavan Sudheendra <pa...@gmail.com> wrote:
> Jens, can i set a smaller value in my application?
> Is this valid?
> conf.setInt("mapred.max.split.size", 50);
>
> This is our mapred-site.xml:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <configuration>
> <property>
> <name>mapred.job.tracker</name>
> <value>ip-10-10-100170.eu-east-1.compute.internal:8021</value>
> </property>
> <property>
> <name>mapred.job.tracker.http.address</name>
> <value>0.0.0.0:50030</value>
> </property>
> <property>
> <name>mapreduce.job.counters.max</name>
> <value>120</value>
> </property>
> <property>
> <name>mapred.output.compress</name>
> <value>false</value>
> </property>
> <property>
> <name>mapred.output.compression.type</name>
> <value>BLOCK</value>
> </property>
> <property>
> <name>mapred.output.compression.codec</name>
> <value>org.apache.hadoop.io.compress.DefaultCodec</value>
> </property>
> <property>
> <name>mapred.map.output.compression.codec</name>
> <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
> <property>
> <name>mapred.compress.map.output</name>
> <value>true</value>
> </property>
> <property>
> <name>zlib.compress.level</name>
> <value>DEFAULT_COMPRESSION</value>
> </property>
> <property>
> <name>io.sort.factor</name>
> <value>64</value>
> </property>
> <property>
> <name>io.sort.record.percent</name>
> <value>0.05</value>
> </property>
> <property>
> <name>io.sort.spill.percent</name>
> <value>0.8</value>
> </property>
> <property>
> <name>mapred.reduce.parallel.copies</name>
> <value>10</value>
> </property>
> <property>
> <name>mapred.submit.replication</name>
> <value>2</value>
> </property>
> <property>
> <name>mapred.reduce.tasks</name>
> <value>6</value>
> </property>
> <property>
> <name>mapred.userlog.retain.hours</name>
> <value>24</value>
> </property>
> <property>
> <name>io.sort.mb</name>
> <value>112</value>
> </property>
> <property>
> <name>mapred.child.java.opts</name>
> <value> -Xmx471075479</value>
> </property>
> <property>
> <name>mapred.job.reuse.jvm.num.tasks</name>
> <value>1</value>
> </property>
> <property>
> <name>mapred.map.tasks.speculative.execution</name>
> <value>false</value>
> </property>
> <property>
> <name>mapred.reduce.tasks.speculative.execution</name>
> <value>false</value>
> </property>
> <property>
> <name>mapred.reduce.slowstart.completed.maps</name>
> <value>0.8</value>
> </property></configuration>
>
>
> Suggest ways to overwrite the default value please.
>
>
> On Mon, Aug 26, 2013 at 9:38 AM, anil gupta <an...@gmail.com> wrote:
>
>> Hi Pavan,
>>
>> Standalone cluster? How many region servers are you running? What are you
>> trying to achieve in MR? Have you tried increasing scanner caching?
>> "Slow" is very theoretical unless we know some more details of your setup.
>>
>> ~Anil
>>
>>
>>
>> On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:
>>
>>> You need to post your map code here so we can analyze the question. Generally,
>>> when map/reducing over HBase, a scanner with filter(s) is used, so the mapper
>>> count is the HBase region count of your table.
>>> As for why your reduce is so slow, I guess you have a disastrous join
>>> across the three tables, which produces too many rows.
>>>
>>> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>>>
>>> Another Question, why does it indicate number of mappers as 1? Can i
>>>> change it so that multiple mappers perform computation?
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>
>
>
> --
> Regards-
> Pavan
>
--
Regards-
Pavan
Re: Mapper and Reducer take longer than usual for an HBase table
aggregation task
Posted by Pavan Sudheendra <pa...@gmail.com>.
Jens, can I set a smaller value in my application?
Is this valid?
conf.setInt("mapred.max.split.size", 50);
This is our mapred-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ip-10-10-100170.eu-east-1.compute.internal:8021</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>0.0.0.0:50030</value>
  </property>
  <property>
    <name>mapreduce.job.counters.max</name>
    <value>120</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>zlib.compress.level</name>
    <value>DEFAULT_COMPRESSION</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>64</value>
  </property>
  <property>
    <name>io.sort.record.percent</name>
    <value>0.05</value>
  </property>
  <property>
    <name>io.sort.spill.percent</name>
    <value>0.8</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.submit.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
  </property>
  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>24</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>112</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx471075479</value>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.8</value>
  </property>
</configuration>
Please suggest ways to override the default value.
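One caveat worth checking here (an editorial note, not from the thread): mapred.max.split.size is interpreted in bytes, so a value of 50 would request 50-byte splits, and when reading through TableInputFormat the splits follow HBase region boundaries, so the property may have no effect at all. A small hypothetical helper for expressing the value in the byte units it expects:

```java
public class SplitSizeValue {
    // Hypothetical helper: convert megabytes to the byte count that
    // mapred.max.split.size expects.
    static long mb(long megabytes) {
        return megabytes * 1024L * 1024L;
    }

    public static void main(String[] args) {
        // A 32 MB split ceiling would be written as:
        System.out.println(mb(32)); // 33554432
        // e.g. conf.setLong("mapred.max.split.size", mb(32));
        // where conf is the job's org.apache.hadoop.conf.Configuration.
        // Note: with TableInputFormat, splits come from region boundaries,
        // so this setting may be ignored entirely.
    }
}
```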
On Mon, Aug 26, 2013 at 9:38 AM, anil gupta <an...@gmail.com> wrote:
> Hi Pavan,
>
> Standalone cluster? How many region servers are you running? What are you
> trying to achieve in MR? Have you tried increasing scanner caching?
> "Slow" is very theoretical unless we know some more details of your setup.
>
> ~Anil
>
>
>
> On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:
>
>> You need to post your map code here so we can analyze the question. Generally,
>> when map/reducing over HBase, a scanner with filter(s) is used, so the mapper
>> count is the HBase region count of your table.
>> As for why your reduce is so slow, I guess you have a disastrous join
>> across the three tables, which produces too many rows.
>>
>> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>>
>> Another Question, why does it indicate number of mappers as 1? Can i
>>> change it so that multiple mappers perform computation?
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
--
Regards-
Pavan
Re: Mapper and Reducer takes longer than usual for a HBase table
aggregation task
Posted by Pavan Sudheendra <pa...@gmail.com>.
Jens, can i set a smaller value in my application?
Is this valid?
conf.setInt("mapred.max.split.size", 50);
This is our mapred-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>ip-10-10-100170.eu-east-1.compute.internal:8021</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>0.0.0.0:50030</value>
</property>
<property>
<name>mapreduce.job.counters.max</name>
<value>120</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>false</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>zlib.compress.level</name>
<value>DEFAULT_COMPRESSION</value>
</property>
<property>
<name>io.sort.factor</name>
<value>64</value>
</property>
<property>
<name>io.sort.record.percent</name>
<value>0.05</value>
</property>
<property>
<name>io.sort.spill.percent</name>
<value>0.8</value>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>10</value>
</property>
<property>
<name>mapred.submit.replication</name>
<value>2</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>6</value>
</property>
<property>
<name>mapred.userlog.retain.hours</name>
<value>24</value>
</property>
<property>
<name>io.sort.mb</name>
<value>112</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value> -Xmx471075479</value>
</property>
<property>
<name>mapred.job.reuse.jvm.num.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.reduce.slowstart.completed.maps</name>
<value>0.8</value>
</property></configuration>
Suggest ways to overwrite the default value please.
On Mon, Aug 26, 2013 at 9:38 AM, anil gupta <an...@gmail.com> wrote:
> Hi Pavan,
>
> Standalone cluster? How many RS you are running?What are you trying to
> achieve in MR? Have you tried increasing scanner caching?
> Slow is very theoretical unless we know some more details of your stuff.
>
> ~Anil
>
>
>
> On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:
>
>> You need to share your map code here so we can analyze the question. Generally,
>> when you map/reduce over HBase, a scanner with filter(s) is used, so the mapper
>> count is the HBase region count of your table.
>> As for why your reduce is so slow, I guess you have a disastrous join
>> on the three tables, which produces too many rows.
>>
>> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>>
>> Another Question, why does it indicate number of mappers as 1? Can i
>>> change it so that multiple mappers perform computation?
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
--
Regards-
Pavan
Re: Mapper and Reducer takes longer than usual for a HBase table
aggregation task
Posted by anil gupta <an...@gmail.com>.
Hi Pavan,
Standalone cluster? How many RegionServers are you running? What are you trying to
achieve in MR? Have you tried increasing scanner caching?
"Slow" is very subjective unless we know some more details of your setup.
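For reference, cluster-wide scanner caching can be raised in hbase-site.xml (a hedged sketch in the same XML conventions as the mapred-site.xml in this thread; the property name and its default of 1 row per RPC are from that era's HBase, so verify against your version — the value below is an example only):

```xml
<!-- Rows fetched per scanner RPC. The old default of 1 makes full-table
     MR scans extremely chatty; a few hundred is a common starting point. -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>500</value>
</property>
```

Per-job, calling setCaching(int) on the Scan object passed to TableMapReduceUtil.initTableMapperJob achieves the same without touching cluster config.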
~Anil
On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <lh...@hotmail.com> wrote:
> You need to share your map code here so we can analyze the question. Generally,
> when you map/reduce over HBase, a scanner with filter(s) is used, so the mapper count
> is the HBase region count of your table.
> As for why your reduce is so slow, I guess you have a disastrous join
> on the three tables, which produces too many rows.
>
> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>
> Another Question, why does it indicate number of mappers as 1? Can i
>> change it so that multiple mappers perform computation?
>>
>
>
--
Thanks & Regards,
Anil Gupta
Re: Mapper and Reducer takes longer than usual for a HBase table
aggregation task
Posted by 李洪忠 <lh...@hotmail.com>.
You need to share your map code here so we can analyze the question. Generally,
when you map/reduce over HBase, a scanner with filter(s) is used, so the mapper
count is the HBase region count of your table.
As for why your reduce is so slow, I guess you have a disastrous join
on the three tables, which produces too many rows.
On 2013/8/26 4:36, Pavan Sudheendra wrote:
> Another Question, why does it indicate number of mappers as 1? Can i
> change it so that multiple mappers perform computation?
Re: Mapper and Reducer takes longer than usual for a HBase table
aggregation task
Posted by Jens Scheidtmann <je...@gmail.com>.
Hi Pavan,
> 2. ) If my table is in the order of millions, the number of mappers is
> increased to 5.. How does Hadoop know how many mappers to run for a
> specific job?
>
The number of input splits determines the number of mappers. Usually (in
the default case) your source is split into HDFS blocks (usually 64 MB), and
for each block there will be a mapper.
Best regards,
Jens
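Jens's rule of thumb can be sketched as plain arithmetic (a toy illustration only, not Hadoop's actual split planner, which also honors min/max split size and record boundaries):

```java
public class SplitCount {
    // Rough mapper-count estimate: one map task per HDFS block of input.
    static long estimateMappers(long inputBytes, long blockSizeBytes) {
        if (inputBytes == 0) {
            return 1; // an empty input still yields a single (empty) mapper
        }
        // Ceiling division: a partial trailing block still needs a mapper.
        return (inputBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, the old default block size
        System.out.println(estimateMappers(200L * 1024 * 1024, blockSize));
        System.out.println(estimateMappers(10L * 1024 * 1024, blockSize));
    }
}
```

This also matches the HBase case in this thread: TableInputFormat produces one split per region, so a small table living in a single region gets exactly one mapper regardless of block size.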