Posted to mapreduce-user@hadoop.apache.org by Thamizhannal Paramasivam <th...@gmail.com> on 2012/02/16 07:40:36 UTC

num of reducer

Hi All,
I am using hadoop-0.19.2 and running a mapper-only job on a cluster. Its
input path has >1000 files of 100-200MB each. Since it is a mapper-only
job, I set the number of reducers to 0, yet it is using only 2 mappers to
process all the input files. If we do not set the number of mappers,
shouldn't it pick one mapper per input file? Or shouldn't the default pick
a reasonable number of mappers based on the number of input files?
Thanks,
tamil

Re: num of reducer

Posted by be...@gmail.com.
Hi Tamizh
         If your input comprises text files, then changing the input format to TextInputFormat should get things right: one mapper per HDFS block.


Regards
Bejoy K S

From handheld, Please excuse typos.
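
For reference, a minimal sketch of that change on the old (0.19-era) mapred API. This is a hedged configuration sketch, not code from the thread: the driver class name, MyMapper, and the input/output paths are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

// Sketch of a mapper-only driver on the old 0.19-era mapred API.
// MyMapper and the paths are illustrative placeholders.
public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlyDriver.class);
        conf.setJobName("map-only");
        conf.setInputFormat(TextInputFormat.class); // one mapper per HDFS block
        conf.setMapperClass(MyMapper.class);
        conf.setNumReduceTasks(0); // mapper-only: map output goes straight to HDFS
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```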


Re: num of reducer

Posted by Thamizhannal Paramasivam <th...@gmail.com>.
It worked for me. Thanks a lot, Bejoy.

Thanks
Thamizh


Re: num of reducer

Posted by Bejoy Ks <be...@gmail.com>.
Hi Tamizh
         MultiFileInputFormat / CombineFileInputFormat is typically used
where the input files are relatively small (typically less than a block
size). When you use these, there is some loss of data locality, as the
splits a mapper processes won't all be on the same node.
       TextInputFormat spawns one mapper per block by default (not one
per file), so you retain data locality much better than with
MultiFileInputFormat.
      If your mappers are not very short-lived and do a decent amount of
processing, then you can go with TextInputFormat. The one consideration
you need to make is that, on your stated input, this job may spawn a
large number of map tasks, thereby occupying almost all the map slots in
your cluster. If other jobs need to be triggered, they may have to wait
for free map slots. You may need to consider using a scheduler to give a
fair share of slots to other parallel jobs as well, if any.

Regards
Bejoy.K.S
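
A back-of-envelope estimate of the map-task counts described above, assuming the old 64 MB default block size and this thread's ~1350 files of ~150 MB each (both figures illustrative):

```java
// Rough estimate of map tasks per input format on 0.19-era defaults.
// Block size, file size, and file count are assumptions for illustration.
public class MapTaskEstimate {
    // Ceiling division: number of block-sized splits covering one file.
    static long splitsPerFile(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L << 20;  // 64 MB, the old dfs.block.size default
        long fileSize  = 150L << 20; // ~150 MB per file (thread says 100-200 MB)
        int files = 1350;            // thread says 1300-1400 files

        long perFile = splitsPerFile(fileSize, blockSize);   // 3 splits/file
        long textInputFormatTasks = perFile * files;         // ~4050 map tasks
        System.out.println("TextInputFormat map tasks ~ " + textInputFormatTasks);

        // MultiFileInputFormat instead packs all files into roughly
        // numSplits splits, taken from the mapred.map.tasks hint, whose
        // default was 2 -- matching the "2 mappers" observed in this thread.
        System.out.println("MultiFileInputFormat map tasks ~ 2");
    }
}
```

This is why switching to TextInputFormat can swing from 2 map tasks to thousands, and why Bejoy warns about map slots and scheduling.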



Re: num of reducer

Posted by Thamizhannal Paramasivam <th...@gmail.com>.
Thank you so much to Joey & Bejoy for your suggestions.

The job's input path has 1300-1400 text files, each of 100-200MB.

I thought TextInputFormat spawns a single mapper per file, while
MultiFileInputFormat spawns fewer mappers (<1300-1400), each processing
many input files.

Which input format do you think would be most appropriate in my case,
and why?

Looking forward to your reply.

Thanks,
Thamizh



Re: num of reducer

Posted by Joey Echeverria <jo...@cloudera.com>.
Is your data size 100-200MB *total*?

If so, then this is the expected behavior for MultiFileInputFormat. As
Bejoy says, you can switch to TextInputFormat to get one mapper per block
(at minimum, one mapper per file).

-Joey



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: num of reducer

Posted by Thamizhannal Paramasivam <th...@gmail.com>.
Here is the input format setup for the mapper:
Input Format: MultiFileInputFormat
MapperOutputKey: Text
MapperOutputValue: CustomWritable

I am not in a position to upgrade from hadoop-0.19.2, for certain reasons.

I checked the number of mappers on the JobTracker.

Thanks,
Thamizh
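
(The thread mentions a CustomWritable map output value but never shows it. For readers, here is a hedged sketch of the shape such a class takes: in a real job it would implement org.apache.hadoop.io.Writable, whose contract is exactly the two methods below. The fields are invented for illustration.)

```java
import java.io.*;

// Minimal shape of a custom value type for the old mapred API.
// In a real job this class would declare
//   implements org.apache.hadoop.io.Writable
// The fields here are illustrative; the thread does not show the
// actual CustomWritable.
class CustomWritable {
    private long count;
    private String tag;

    // Writable types need a no-arg constructor so the framework
    // can instantiate them via reflection.
    CustomWritable() {}

    CustomWritable(long count, String tag) {
        this.count = count;
        this.tag = tag;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeUTF(tag);
    }

    // Deserialize in the same order as write().
    public void readFields(DataInput in) throws IOException {
        count = in.readLong();
        tag = in.readUTF();
    }

    long getCount() { return count; }
    String getTag() { return tag; }
}
```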


Re: num of reducer

Posted by Joey Echeverria <jo...@cloudera.com>.
Hi Tamil,

I'd recommend upgrading to a newer release, as 0.19.2 is very old. As for
your question, most input formats should set the number of mappers
correctly. What input format are you using? Where did you see the number
of tasks assigned to the job?

-Joey



