Posted to common-user@hadoop.apache.org by "Goel, Ankur" <an...@corp.aol.com> on 2008/07/11 13:15:17 UTC
MultiFileInputFormat - Not enough mappers
Hi Folks,
I am using Hadoop to process some temporal data which is
split into a lot of small files (~3-4 MB each).
Using TextInputFormat resulted in too many mappers (one per file), creating
a lot of overhead, so I switched to
MultiFileInputFormat - (MultiFileWordCount.MyInputFormat) - which resulted
in just one mapper.
I was hoping to set the number of mappers to 1 so that Hadoop automatically
takes care of generating the right
number of map tasks.
It looks like when using MultiFileInputFormat one has to rely on the
application to specify the right number of mappers,
or am I missing something? Please advise.
Thanks
-Ankur
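To make the overhead concrete, here is a small arithmetic sketch of the two split strategies described above (the file count and the 64 MB block size are illustrative assumptions, not figures from the thread):

```java
public class MapperCount {
    // Ceiling division: how many block-sized splits cover totalSize bytes.
    static long groupedMappers(long totalSize, long blockSize) {
        return (totalSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long fileSize = 4L * 1024 * 1024;   // ~4 MB per file, as in the post
        int numFiles = 1000;                // assumed file count, for illustration
        long blockSize = 64L * 1024 * 1024; // default HDFS block size at the time

        // TextInputFormat: one split (and hence one mapper) per small file.
        System.out.println("one per file: " + numFiles);
        // Packing files together up to the block size instead:
        System.out.println("grouped: " + groupedMappers(fileSize * numFiles, blockSize));
    }
}
```

With these assumed numbers, grouping turns 1000 mappers into 63, which is the kind of reduction the original poster is after.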
Re: MultiFileInputFormat - Not enough mappers
Posted by Enis Soztutar <en...@gmail.com>.
Yes, please open a JIRA for this. We should ensure that
avgLengthPerSplit in MultiFileInputFormat does not exceed the default file
block size. Note, however, that unlike FileInputFormat, the files in a
split will each come from a different block.
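The fix suggested above amounts to bounding the average split length by the block size. A minimal sketch of that logic (simplified for illustration; this is not the actual patch or the real MultiFileInputFormat code):

```java
public class SplitCount {
    // If splitting totalSize bytes into requestedSplits pieces would make the
    // average split larger than blockSize, raise the split count so that each
    // split fits within one block's worth of data.
    static int boundedSplitCount(long totalSize, int requestedSplits, long blockSize) {
        long avgLengthPerSplit = totalSize / Math.max(requestedSplits, 1);
        if (avgLengthPerSplit > blockSize) {
            // ceil(totalSize / blockSize)
            return (int) ((totalSize + blockSize - 1) / blockSize);
        }
        return requestedSplits;
    }

    public static void main(String[] args) {
        // 4000 MB of input, 1 requested split, 64 MB blocks:
        System.out.println(boundedSplitCount(4000L << 20, 1, 64L << 20));
    }
}
```

With this bound in place, a job that requests one map over 4000 MB of small files would get 63 splits instead of one, without the application having to compute the count itself.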
RE: MultiFileInputFormat - Not enough mappers
Posted by "Goel, Ankur" <an...@corp.aol.com>.
In this case I have to compute the number of map tasks in the
application - (totalSize / blockSize) - which is what I am doing as a
work-around.
I think this should be the default behaviour in MultiFileInputFormat.
Should a JIRA be opened for this?
-Ankur
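The work-around described above can be sketched as follows (a self-contained illustration; in the real job driver the result would be passed to JobConf#setNumMapTasks, which is shown only in a comment here so the sketch needs no Hadoop dependency):

```java
public class WorkAround {
    // Sum the input file sizes and divide by the block size, rounding up,
    // with a floor of 1 so tiny inputs still get one map task.
    static int numMapTasks(long[] fileSizes, long blockSize) {
        long total = 0;
        for (long s : fileSizes) total += s;
        return (int) Math.max(1, (total + blockSize - 1) / blockSize);
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        long[] sizes = {3 * mb, 4 * mb, 3 * mb, 4 * mb}; // a few ~3-4 MB files
        System.out.println(numMapTasks(sizes, 64 * mb));
        // In the driver: jobConf.setNumMapTasks(numMapTasks(sizes, blockSize));
    }
}
```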
-----Original Message-----
From: Enis Soztutar [mailto:enis.soz.nutch@gmail.com]
Sent: Friday, July 11, 2008 7:21 PM
To: core-user@hadoop.apache.org
Subject: Re: MultiFileInputFormat - Not enough mappers
MultiFileSplit currently does not support automatic map task count
computation. You can manually set the number of maps via
jobConf#setNumMapTasks() or via command line arg -D
mapred.map.tasks=<number>
Re: MultiFileInputFormat - Not enough mappers
Posted by Enis Soztutar <en...@gmail.com>.
MultiFileSplit currently does not support automatic map task count
computation. You can manually
set the number of maps via JobConf#setNumMapTasks() or via the command-line
arg -D mapred.map.tasks=<number>
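For reference, the two options above as a configuration sketch (class and path names are placeholders; the calls are from the old 0.x mapred API, and setNumMapTasks is only a hint that the input format may override):

```java
// Option 1: in the job driver, before submission.
//   JobConf conf = new JobConf(MyDriver.class);
//   conf.setNumMapTasks(63);
//
// Option 2: on the command line, provided the driver parses generic
// options (e.g. runs through ToolRunner):
//   hadoop jar myjob.jar MyDriver -D mapred.map.tasks=63 <in> <out>
```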