Posted to common-user@hadoop.apache.org by "Goel, Ankur" <an...@corp.aol.com> on 2008/07/11 13:15:17 UTC

MultiFileInputFormat - Not enough mappers

Hi Folks,
              I am using hadoop to process some temporal data which is
split in lot of small files (~ 3 - 4 MB)
Using TextInputFormat resulted in too many mappers (1 per file) creating
a lot of overhead so I switched to
MultiFileInputFormat - (MutiFileWordCount.MyInputFormat) which resulted
in just 1 mapper.
 
I was hoping to just set the number of mappers to 1 and have Hadoop
automatically take care of generating the right
number of map tasks.
 
It looks like when using MultiFileInputFormat one has to rely on the
application to specify the right number of mappers,
or am I missing something? Please advise.
 
Thanks
-Ankur
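
For reference, a job driver along these lines would reproduce the setup
described above. This is only a sketch against the old mapred API of that
era; the class name TemporalDataJob and the positional arguments are
placeholders, and MultiFileWordCount.MyInputFormat is the example input
format that ships in the Hadoop examples jar.

import org.apache.hadoop.examples.MultiFileWordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TemporalDataJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TemporalDataJob.class);
    conf.setJobName("temporal-data");

    // MyInputFormat extends MultiFileInputFormat and packs many small files
    // into a few MultiFileSplits, instead of one split (and one mapper) per
    // file as TextInputFormat does. Its record reader reads the files line
    // by line.
    conf.setInputFormat(MultiFileWordCount.MyInputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // conf.setMapperClass(...);  // mapper/reducer setup omitted in this sketch

    JobClient.runJob(conf);
  }
}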

Re: MultiFileInputFormat - Not enough mappers

Posted by Enis Soztutar <en...@gmail.com>.
Yes, please open a JIRA for this. We should ensure that
avgLengthPerSplit in MultiFileInputFormat does not exceed the default file
block size. Note, however, that unlike FileInputFormat, the files in a split
will each come from a different block.
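
One way the cap could look, purely as an illustration rather than the
eventual patch: a MultiFileInputFormat subclass (BlockSizedMultiFileInputFormat
is a made-up name) that raises the requested split count until the average
split length fits in a block, assuming the old mapred API.

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;

public abstract class BlockSizedMultiFileInputFormat<K, V>
    extends MultiFileInputFormat<K, V> {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Sum the length of every input file.
    long totalLength = 0;
    FileSystem fs = FileSystem.get(job);
    for (Path dir : FileInputFormat.getInputPaths(job)) {
      for (FileStatus file : fs.listStatus(dir)) {
        totalLength += file.getLen();
      }
    }
    // Raise the requested split count so that the average bytes per split
    // (avgLengthPerSplit) never exceeds the default block size.
    long blockSize = fs.getDefaultBlockSize();
    int minSplits = (int) Math.max(1, totalLength / blockSize);
    return super.getSplits(job, Math.max(numSplits, minSplits));
  }
  // getRecordReader() is still left to the concrete subclass.
}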


Goel, Ankur wrote:
> In this case I have to compute the number of map tasks in the
> application - (totalSize / blockSize), which is what I am doing as a
> work-around.
> I think this should be the default behaviour in MultiFileInputFormat.
> Should a JIRA be opened for the same ?
>
> -Ankur
>
>
> -----Original Message-----
> From: Enis Soztutar [mailto:enis.soz.nutch@gmail.com] 
> Sent: Friday, July 11, 2008 7:21 PM
> To: core-user@hadoop.apache.org
> Subject: Re: MultiFileInputFormat - Not enough mappers
>
> MultiFileSplit currently does not support automatic map task count
> computation. You can manually set the number of maps via
> jobConf#setNumMapTasks() or via command line arg -D
> mapred.map.tasks=<number>
>
>
> Goel, Ankur wrote:
>   
>> Hi Folks,
>>               I am using hadoop to process some temporal data which is
>> split in lot of small files (~ 3 - 4 MB) Using TextInputFormat 
>> resulted in too many mappers (1 per file) creating a lot of overhead 
>> so I switched to MultiFileInputFormat - 
>> (MultiFileWordCount.MyInputFormat) which resulted in just 1 mapper.
>>  
>> I was hoping to set the no of mappers to 1 so that hadoop 
>> automatically takes care of generating the right number of map tasks.
>>  
>> Looks like when using MultiFileInputFormat one has to rely on the 
>> application to specify the right number of mappers or am I missing 
>> something ? Please advise.
>>  
>> Thanks
>> -Ankur
>>

RE: MultiFileInputFormat - Not enough mappers

Posted by "Goel, Ankur" <an...@corp.aol.com>.
In this case I have to compute the number of map tasks in the
application (totalSize / blockSize), which is what I am doing as a
workaround.
I think this should be the default behaviour in MultiFileInputFormat.
Should a JIRA be opened for this?

-Ankur
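
For reference, the workaround in the job driver might look roughly like
this. It is only a sketch: MapCountWorkaround, the single input directory,
and the argument position are assumptions, not taken from the thread.

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class MapCountWorkaround {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapCountWorkaround.class);
    Path input = new Path(args[0]);              // directory of small input files
    FileSystem fs = input.getFileSystem(conf);

    // Workaround: compute totalSize / blockSize in the application and pass
    // it to the framework as the desired number of map tasks.
    long totalSize = 0;
    for (FileStatus file : fs.listStatus(input)) {
      if (!file.isDir()) {
        totalSize += file.getLen();
      }
    }
    long blockSize = fs.getDefaultBlockSize();
    int numMaps = (int) Math.max(1, totalSize / blockSize);
    conf.setNumMapTasks(numMaps);

    // ... set the input format, mapper, output path, etc., and submit as usual.
  }
}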


-----Original Message-----
From: Enis Soztutar [mailto:enis.soz.nutch@gmail.com] 
Sent: Friday, July 11, 2008 7:21 PM
To: core-user@hadoop.apache.org
Subject: Re: MultiFileInputFormat - Not enough mappers

MultiFileSplit currently does not support automatic map task count
computation. You can manually set the number of maps via
jobConf#setNumMapTasks() or via command line arg -D
mapred.map.tasks=<number>


Goel, Ankur wrote:
> Hi Folks,
>               I am using hadoop to process some temporal data which is
> split in lot of small files (~ 3 - 4 MB) Using TextInputFormat 
> resulted in too many mappers (1 per file) creating a lot of overhead 
> so I switched to MultiFileInputFormat - 
> (MultiFileWordCount.MyInputFormat) which resulted in just 1 mapper.
>  
> I was hoping to set the no of mappers to 1 so that hadoop 
> automatically takes care of generating the right number of map tasks.
>  
> Looks like when using MultiFileInputFormat one has to rely on the 
> application to specify the right number of mappers or am I missing 
> something ? Please advise.
>  
> Thanks
> -Ankur
>
>   

Re: MultiFileInputFormat - Not enough mappers

Posted by Enis Soztutar <en...@gmail.com>.
MultiFileSplit currently does not support automatic map task count
computation. You can manually
set the number of maps via jobConf#setNumMapTasks() or via the command-line
argument -D mapred.map.tasks=<number>.
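
Concretely, the two options look roughly like this (a minimal sketch; the
class name SetMapCount and the value 40 are arbitrary placeholders, and the
-D form requires the driver to parse generic options via ToolRunner):

import org.apache.hadoop.mapred.JobConf;

public class SetMapCount {
  public static void main(String[] args) {
    // Option 1: set the map task count programmatically when building the
    // job. The framework treats this value as a hint.
    JobConf conf = new JobConf(SetMapCount.class);
    conf.setNumMapTasks(40);   // 40 is an arbitrary example value

    // Option 2: pass it on the command line instead, e.g.:
    //   hadoop jar myjob.jar SetMapCount -D mapred.map.tasks=40 <input> <output>
  }
}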


Goel, Ankur wrote:
> Hi Folks,
>               I am using hadoop to process some temporal data which is
> split in lot of small files (~ 3 - 4 MB)
> Using TextInputFormat resulted in too many mappers (1 per file) creating
> a lot of overhead so I switched to
> MultiFileInputFormat - (MultiFileWordCount.MyInputFormat) which resulted
> in just 1 mapper.
>  
> I was hoping to set the no of mappers to 1 so that hadoop automatically
> takes care of generating the right
> number of map tasks.
>  
> Looks like when using MultiFileInputFormat one has to rely on the
> application to specify the right number of mappers
> or am I missing something ? Please advise.
>  
> Thanks
> -Ankur
>
>