Posted to common-user@hadoop.apache.org by Arindam Choudhury <ar...@gmail.com> on 2013/02/26 12:09:29 UTC

Running terasort with 1 map task

Hi all,

I am trying to run terasort with one map and one reduce, so I generated
the input data using:

hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1 \
    -Dmapred.reduce.tasks=1 32000000 /user/hadoop/input32mb1map

Then I launched the Hadoop terasort job using:

hadoop jar hadoop-examples-1.0.4.jar terasort -Dmapred.map.tasks=1 \
    -Dmapred.reduce.tasks=1 /user/hadoop/input32mb1map /user/hadoop/output1

I thought it would run the job with 1 map and 1 reduce, but when I
inspected the job statistics I found:

hadoop job -history /user/hadoop/output1

Task Summary
============================
Kind      Total  Successful  Failed  Killed  StartTime             FinishTime

Setup     1      1           0       0       26-Feb-2013 10:57:47  26-Feb-2013 10:57:55 (8sec)
Map       24     24          0       0       26-Feb-2013 10:57:57  26-Feb-2013 11:05:37 (7mins, 40sec)
Reduce    1      1           0       0       26-Feb-2013 10:58:21  26-Feb-2013 11:08:31 (10mins, 10sec)
Cleanup   1      1           0       0       26-Feb-2013 11:08:32  26-Feb-2013 11:08:36 (4sec)
============================

So, although I asked for one map task, 24 of them were launched.

How can I solve this problem? How do I tell Hadoop to launch only one map?

Thanks,
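
A note on the numbers in this message: the first argument to teragen is a row count, not a byte count, and TeraGen writes fixed 100-byte rows. The sketch below works through that arithmetic. It assumes the 128 MB block size (134217728 bytes) quoted later in the thread; the function names are illustrative only and are not part of Hadoop.

```python
ROW_BYTES = 100            # TeraGen writes fixed-size 100-byte rows
BLOCK_SIZE = 134217728     # 128 MB, the dfs.block.size quoted in the thread


def teragen_bytes(rows: int) -> int:
    """Approximate size of the data TeraGen produces for a given row count."""
    return rows * ROW_BYTES


def blocks_needed(size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks needed to hold size_bytes (ceiling division)."""
    return -(-size_bytes // block_size)


if __name__ == "__main__":
    # 32000000 rows is about 3.2 GB, not 32 MB
    print(teragen_bytes(32000000))                  # 3200000000
    print(blocks_needed(teragen_bytes(32000000)))   # 24
    # 320000 rows is about 32 MB and fits in one block
    print(teragen_bytes(320000))                    # 32000000
    print(blocks_needed(teragen_bytes(320000)))     # 1
```

Under these assumptions, 32000000 rows span 24 blocks, which matches the 24 map tasks in the task summary above.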

Re: Running terasort with 1 map task

Posted by Mahesh Balija <ba...@gmail.com>.

Did passing dfs.block.size=134217728 resolve your issue, or did
something else fix the problem?

On Tue, Feb 26, 2013 at 6:04 PM, Arindam Choudhury <
arindamchoudhury0@gmail.com> wrote:

> Sorry, my bad. It is solved.
>
>
> On Tue, Feb 26, 2013 at 1:22 PM, Arindam Choudhury <
> arindamchoudhury0@gmail.com> wrote:
>
>> In my $HADOOP_HOME/conf/hdfs-site.xml, I have set the data block
>> size:
>>
>> <property>
>>   <name>dfs.block.size</name>
>>   <value>134217728</value>
>>   <final>true</final>
>> </property>
>>
>> When running teragen I specify it again, to be sure:
>>
>> hadoop jar /opt/hadoop-1.0.4/hadoop-examples-1.0.4.jar teragen
>> -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 -Ddfs.block.size=134217728
>> 320000 /user/hadoop/input
>>
>> but it generates 3 blocks:
>>
>> hadoop fsck -blocks -files -locations /user/hadoop/input
>> Status: HEALTHY
>>  Total size:    32029543 B
>>  Total dirs:    3
>>  Total files:    4
>>  Total blocks (validated):    3 (avg. block size 10676514 B)
>>  Minimally replicated blocks:    3 (100.0 %)
>>
>> What am I doing wrong? How can I generate only one block?
>>
>>
>>
>> On Tue, Feb 26, 2013 at 12:52 PM, Arindam Choudhury <
>> arindamchoudhury0@gmail.com> wrote:
>>
>>> Thanks. As Julien said, I want to do a performance measurement.
>>>
>>> Actually,
>>>
>>> hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1
>>> -Dmapred.reduce.tasks=1 32000000 /user/hadoop/input32mb1map
>>>
>>> has generated:
>>> Total size:    3200029737 B
>>> Total dirs:    3
>>> Total files:    5
>>> Total blocks (validated):    27 (avg. block size 118519619 B)
>>>
>>> That's why there are so many maps.
>>>
>>>
>>> On Tue, Feb 26, 2013 at 12:46 PM, Julien Muller <julien.muller@ezako.com> wrote:
>>>
>>>> Maybe your goal is to have a baseline for performance measurement?
>>>> In that case, you might want to consider running only one TaskTracker.
>>>> You would still have multiple tasks, but they would run on only one
>>>> machine. You could also make the mappers run serially by configuring
>>>> only one map slot on your one-node cluster.
>>>>
>>>> Nevertheless I agree with Bertrand, this is not really a realistic use
>>>> case (or maybe you can give us more clues).
>>>>
>>>> Julien
>>>>
>>>>
>>>> 2013/2/26 Bertrand Dechoux <de...@gmail.com>
>>>>
>>>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>>>
>>>>> It is possible to have a single mapper if the input is not splittable
>>>>> BUT it is rarely seen as a feature.
>>>>> One could ask why you want to use a platform for distributed computing
>>>>> for a job that shouldn't be distributed.
>>>>>
>>>>> Regards
>>>>>
>>>>> Bertrand
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2013 at 12:09 PM, Arindam Choudhury <
>>>>> arindamchoudhury0@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am trying to run terasort with one map and one reduce, so I
>>>>>> generated the input data using:
>>>>>>
>>>>>> hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1
>>>>>> -Dmapred.reduce.tasks=1 32000000 /user/hadoop/input32mb1map
>>>>>>
>>>>>> Then I launched the Hadoop terasort job using:
>>>>>>
>>>>>> hadoop jar hadoop-examples-1.0.4.jar terasort -Dmapred.map.tasks=1
>>>>>> -Dmapred.reduce.tasks=1 /user/hadoop/input32mb1map /user/hadoop/output1
>>>>>>
>>>>>> I thought it would run the job with 1 map and 1 reduce, but when I
>>>>>> inspected the job statistics I found:
>>>>>>
>>>>>> hadoop job -history /user/hadoop/output1
>>>>>>
>>>>>> Task Summary
>>>>>> ============================
>>>>>> Kind      Total  Successful  Failed  Killed  StartTime             FinishTime
>>>>>>
>>>>>> Setup     1      1           0       0       26-Feb-2013 10:57:47  26-Feb-2013 10:57:55 (8sec)
>>>>>> Map       24     24          0       0       26-Feb-2013 10:57:57  26-Feb-2013 11:05:37 (7mins, 40sec)
>>>>>> Reduce    1      1           0       0       26-Feb-2013 10:58:21  26-Feb-2013 11:08:31 (10mins, 10sec)
>>>>>> Cleanup   1      1           0       0       26-Feb-2013 11:08:32  26-Feb-2013 11:08:36 (4sec)
>>>>>> ============================
>>>>>>
>>>>>> So, although I asked for one map task, 24 of them were launched.
>>>>>>
>>>>>> How can I solve this problem? How do I tell Hadoop to launch only one map?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
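
The thread does not spell out why -Dmapred.map.tasks=1 had no effect. As far as I understand the old-API FileInputFormat in Hadoop 1.x, the requested map count is only a hint: the split size is taken as max(minSize, min(goalSize, blockSize)), where goalSize is the total input size divided by the requested number of maps. The following is a rough sketch of that calculation, not the actual Hadoop code; the function names are made up for illustration, and the 1.1 "slop" factor and the role of mapred.min.split.size are stated from memory and should be checked against the Hadoop source.

```python
def split_size(total_bytes: int, requested_maps: int,
               block_size: int, min_split: int = 1) -> int:
    """Split size as (I believe) the old mapred FileInputFormat picks it."""
    goal = total_bytes // max(requested_maps, 1)
    return max(min_split, min(goal, block_size))


def count_splits(file_bytes: int, size: int, slop: float = 1.1) -> int:
    """Count splits for one file; the last split may be up to slop * size."""
    splits = 0
    remaining = file_bytes
    while remaining / size > slop:
        splits += 1
        remaining -= size
    if remaining > 0:
        splits += 1  # whatever is left becomes the final split
    return splits


if __name__ == "__main__":
    block = 134217728          # dfs.block.size from the thread
    big = 3200000000           # 32000000 rows * 100 bytes
    small = 32000000           # 320000 rows * 100 bytes

    # Asking for 1 map does not help: the split size is capped at the block size.
    print(count_splits(big, split_size(big, 1, block)))        # 24
    # A minimum split size at least as large as the file yields one split.
    print(count_splits(big, split_size(big, 1, block, big)))   # 1
    # The 32 MB input fits in a single split without any extra settings.
    print(count_splits(small, split_size(small, 1, block)))    # 1
```

If this model is right, the two ways to get a single map are the ones suggested by the thread and by the formula: generate an input that fits in one block (320000 rows), or raise the minimum split size (mapred.min.split.size in Hadoop 1.x) to at least the size of the input file.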
