Posted to mapreduce-user@hadoop.apache.org by Mapred Learn <ma...@gmail.com> on 2011/05/25 02:17:55 UTC

how to use mapred.min.split.size option ?

Hi,
I have a few input splits that are each only a few MB in size.
I want to submit 1 GB of input to every mapper. How can I do that?
Currently each mapper gets one input split, which results in many small
map-output files.

I tried setting -Dmapred.map.min.split.size=<number>, but it still does not
take effect.

Thanks,
-JJ

Re: how to use mapred.min.split.size option ?

Posted by Mapred Learn <ma...@gmail.com>.
Sorry, it is working. I was not giving the right value with
-Dmapred.max.split.size.

Thanks for your help !
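
For readers following the thread, the effect of a maximum split size on a
combining input format can be sketched roughly as below. This is a simplified
model in Python, not Hadoop's code: it ignores the node and rack locality that
CombineFileInputFormat takes into account, and the function names
(file_blocks, combine_splits) are made up for illustration. It packs blocks in
order and closes a split once it reaches the maximum; a maximum of 0 is
treated as "no limit".

```python
MB = 1024 * 1024


def file_blocks(file_len, block_size):
    """Break one file length into block lengths (the last block may be short)."""
    blocks = []
    remaining = file_len
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= blocks[-1]
    return blocks


def combine_splits(file_lens, block_size, max_split_size):
    """Pack blocks from all files into splits (hypothetical, locality ignored).

    A split is closed as soon as its size reaches max_split_size.
    max_split_size == 0 means no limit, so everything lands in one split.
    """
    splits = []
    current = 0
    for file_len in file_lens:
        for block in file_blocks(file_len, block_size):
            current += block
            if max_split_size > 0 and current >= max_split_size:
                splits.append(current)
                current = 0
    if current > 0:
        splits.append(current)
    return splits


# The numbers from this thread: 10 files of 233 MB, 64 MB blocks.
files = [233 * MB] * 10
no_limit = combine_splits(files, 64 * MB, 0)
one_gb = combine_splits(files, 64 * MB, 1024 * MB)
print(len(no_limit), len(one_gb))
```

Under these assumptions, no maximum gives a single split (the one mapper
reported in this thread), while a 1 GB maximum gives three splits. Because a
split is closed only after it reaches the limit, a split can end up slightly
larger than the configured maximum in this model.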

On Wed, May 25, 2011 at 11:34 AM, Mapred Learn <ma...@gmail.com> wrote:

> Hi Harsh,
> I just implemented a combineFile InputFormat and its record reader for my
> case.
>
> Now my input has 10 files each of 233 MB and by using this, My job just
> runs 1 mapper that processes  them.
>
> How can I control it by split size i.e. if i say make every split 1 GB i.e.
> run 3 mappers for these 10 files not 1 ?
>
> Thanks,
> -JJ
>
>
> On Wed, May 25, 2011 at 10:05 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> This is the correct behavior. Regular FileInputFormat derivatives
>> would transform, at the least, one file == one mapper. You need to
>> look at CombineFileInputFormat/etc. to have multiple files per map
>> task.
>>
>> On Wed, May 25, 2011 at 10:28 PM, Mapred Learn <ma...@gmail.com>
>> wrote:
>> > I gave mapred.min.size=1000000000L i.e. 1 GB and each input file is 233
>> MB
>> > and block size = 64 MB.
>> > With all these values, i thought my split size would work and 4 input
>> files
>> > would be combined to get 1 GB input split but somehow this does not
>> happen
>> > and I get 10 mappers , each corresponding to 233 MB file.
>> >
>> > On Wed, May 25, 2011 at 7:59 AM, Mapred Learn <ma...@gmail.com>
>> > wrote:
>> >>
>> >> Thanks Juwei !
>> >> I will go through this..
>> >>
>> >> Sent from my iPhone
>> >> On May 25, 2011, at 7:51 AM, Juwei Shi <sh...@gmail.com> wrote:
>> >>
>> >> The following are suitable for hadoop 0.20.2.
>> >>
>> >> 2011/5/25 Juwei Shi <sh...@gmail.com>
>> >>>
>> >>> The input split size is determined by mapred.min.split.size,
>> >>> dfs.block.size and mapred.map.tasks.
>> >>>
>> >>> goalSize = totalSize / mapred.map.tasks
>> >>> minSize = max {mapred.min.split.size, minSplitSize}
>> >>> splitSize= max (minSize, min(goalSize, dfs.block.size))
>> >>>
>> >>> minSplitSize is determined by each InputFormat such as
>> >>> SequenceFileInputFormat.
>> >>>
>> >>> You may want to refer to FileInputFormat.java for more details.
>> >>>
>> >>>
>> >>> 2011/5/25 Mapred Learn <ma...@gmail.com>
>> >>>>
>> >>>> Resending ====>
>> >>>>
>> >>>>
>> >>>> > Hi,
>> >>>> > I have few input splits that are few MB in size.
>> >>>> > I want to submit 1 GB of input to every mapper. Does anyone know
>> how
>> >>>> > can I do it ?
>> >>>> > Currently each mapper gets one input split that results in many
>> small
>> >>>> > map-output files.
>> >>>> >
>> >>>> > I tried setting -Dmapred.map.min.split.size=<number> , but still it
>> >>>> > does not take effect.
>> >>>> >
>> >>>> > Thanks,
>> >>>> > -JJ
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> - Juwei Shi
>> >>
>> >>
>> >>
>> >> --
>> >> - Juwei Shi (史巨伟)
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: how to use mapred.min.split.size option ?

Posted by Mapred Learn <ma...@gmail.com>.
Hi Harsh,
I just implemented a CombineFileInputFormat subclass and its record reader
for my case.

Now my input has 10 files of 233 MB each, and with this input format my job
runs just 1 mapper that processes all of them.

How can I control this through the split size? For example, if I ask for
1 GB splits, I want 3 mappers for these 10 files instead of 1.

Thanks,
-JJ


On Wed, May 25, 2011 at 10:05 AM, Harsh J <ha...@cloudera.com> wrote:

> This is the correct behavior. Regular FileInputFormat derivatives
> would transform, at the least, one file == one mapper. You need to
> look at CombineFileInputFormat/etc. to have multiple files per map
> task.
>
> On Wed, May 25, 2011 at 10:28 PM, Mapred Learn <ma...@gmail.com>
> wrote:
> > I gave mapred.min.size=1000000000L i.e. 1 GB and each input file is 233
> MB
> > and block size = 64 MB.
> > With all these values, i thought my split size would work and 4 input
> files
> > would be combined to get 1 GB input split but somehow this does not
> happen
> > and I get 10 mappers , each corresponding to 233 MB file.
> >
> > On Wed, May 25, 2011 at 7:59 AM, Mapred Learn <ma...@gmail.com>
> > wrote:
> >>
> >> Thanks Juwei !
> >> I will go through this..
> >>
> >> Sent from my iPhone
> >> On May 25, 2011, at 7:51 AM, Juwei Shi <sh...@gmail.com> wrote:
> >>
> >> The following are suitable for hadoop 0.20.2.
> >>
> >> 2011/5/25 Juwei Shi <sh...@gmail.com>
> >>>
> >>> The input split size is determined by mapred.min.split.size,
> >>> dfs.block.size and mapred.map.tasks.
> >>>
> >>> goalSize = totalSize / mapred.map.tasks
> >>> minSize = max {mapred.min.split.size, minSplitSize}
> >>> splitSize= max (minSize, min(goalSize, dfs.block.size))
> >>>
> >>> minSplitSize is determined by each InputFormat such as
> >>> SequenceFileInputFormat.
> >>>
> >>> You may want to refer to FileInputFormat.java for more details.
> >>>
> >>>
> >>> 2011/5/25 Mapred Learn <ma...@gmail.com>
> >>>>
> >>>> Resending ====>
> >>>>
> >>>>
> >>>> > Hi,
> >>>> > I have few input splits that are few MB in size.
> >>>> > I want to submit 1 GB of input to every mapper. Does anyone know how
> >>>> > can I do it ?
> >>>> > Currently each mapper gets one input split that results in many
> small
> >>>> > map-output files.
> >>>> >
> >>>> > I tried setting -Dmapred.map.min.split.size=<number> , but still it
> >>>> > does not take effect.
> >>>> >
> >>>> > Thanks,
> >>>> > -JJ
> >>>
> >>>
> >>>
> >>> --
> >>> - Juwei Shi
> >>
> >>
> >>
> >> --
> >> - Juwei Shi (史巨伟)
> >
> >
>
>
>
> --
> Harsh J
>

Re: how to use mapred.min.split.size option ?

Posted by Mapred Learn <ma...@gmail.com>.
I gave mapred.min.size=1000000000L, i.e. 1 GB. Each input file is 233 MB
and the block size is 64 MB.
With these values, I thought the split size would take effect and 4 input
files would be combined into one 1 GB input split, but that does not happen:
I get 10 mappers, each corresponding to one 233 MB file.

On Wed, May 25, 2011 at 7:59 AM, Mapred Learn <ma...@gmail.com> wrote:

>  Thanks Juwei !
> I will go through this..
>
> Sent from my iPhone
>
> On May 25, 2011, at 7:51 AM, Juwei Shi <sh...@gmail.com> wrote:
>
> The following are suitable for hadoop 0.20.2.
>
> 2011/5/25 Juwei Shi <sh...@gmail.com>
>
>> The input split size is determined by mapred.min.split.size, dfs.block.size
>> and mapred.map.tasks.
>>
>> goalSize = totalSize / mapred.map.tasks
>> minSize = max {mapred.min.split.size, minSplitSize}
>> splitSize= max (minSize, min(goalSize, dfs.block.size))
>>
>> minSplitSize is determined by each InputFormat such as
>> SequenceFileInputFormat.
>>
>> You may want to refer to FileInputFormat.java for more details.
>>
>>
>> 2011/5/25 Mapred Learn <ma...@gmail.com>
>>
>>> Resending ====>
>>>
>>>
>>> > Hi,
>>> > I have few input splits that are few MB in size.
>>> > I want to submit 1 GB of input to every mapper. Does anyone know how
>>> can I do it ?
>>>  > Currently each mapper gets one input split that results in many small
>>> map-output files.
>>> >
>>> > I tried setting -Dmapred.map.min.split.size=<number> , but still it
>>> does not take effect.
>>> >
>>> > Thanks,
>>> > -JJ
>>>
>>
>>
>>
>> --
>> - Juwei Shi
>>
>
>
>
> --
> - Juwei Shi (史巨伟)
>
>

Re: how to use mapred.min.split.size option ?

Posted by Mapred Learn <ma...@gmail.com>.
Thanks, Juwei!
I will go through this.

Sent from my iPhone

On May 25, 2011, at 7:51 AM, Juwei Shi <sh...@gmail.com> wrote:

> The following are suitable for hadoop 0.20.2. 
> 
> 2011/5/25 Juwei Shi <sh...@gmail.com>
> The input split size is determined by mapred.min.split.size, dfs.block.size and mapred.map.tasks.
> 
> goalSize = totalSize / mapred.map.tasks 
> minSize = max {mapred.min.split.size, minSplitSize}
> splitSize= max (minSize, min(goalSize, dfs.block.size))
> 
> minSplitSize is determined by each InputFormat such as SequenceFileInputFormat. 
> 
> You may want to refer to FileInputFormat.java for more details. 
> 
> 
> 2011/5/25 Mapred Learn <ma...@gmail.com>
> Resending ====>
> 
> 
> > Hi,
> > I have few input splits that are few MB in size.
> > I want to submit 1 GB of input to every mapper. Does anyone know how can I do it ?
> > Currently each mapper gets one input split that results in many small map-output files.
> >
> > I tried setting -Dmapred.map.min.split.size=<number> , but still it does not take effect.
> >
> > Thanks,
> > -JJ
> 
> 
> 
> -- 
> - Juwei Shi
> 
> 
> 
> -- 
> - Juwei Shi (史巨伟)

Re: how to use mapred.min.split.size option ?

Posted by Juwei Shi <sh...@gmail.com>.
The following applies to Hadoop 0.20.2.

2011/5/25 Juwei Shi <sh...@gmail.com>

> The input split size is determined by mapred.min.split.size, dfs.block.size
> and mapred.map.tasks.
>
> goalSize = totalSize / mapred.map.tasks
> minSize = max {mapred.min.split.size, minSplitSize}
> splitSize= max (minSize, min(goalSize, dfs.block.size))
>
> minSplitSize is determined by each InputFormat such as
> SequenceFileInputFormat.
>
> You may want to refer to FileInputFormat.java for more details.
>
>
> 2011/5/25 Mapred Learn <ma...@gmail.com>
>
>> Resending ====>
>>
>>
>> > Hi,
>> > I have few input splits that are few MB in size.
>> > I want to submit 1 GB of input to every mapper. Does anyone know how can
>> I do it ?
>> > Currently each mapper gets one input split that results in many small
>> map-output files.
>> >
>> > I tried setting -Dmapred.map.min.split.size=<number> , but still it does
>> not take effect.
>> >
>> > Thanks,
>> > -JJ
>>
>
>
>
> --
> - Juwei Shi
>



-- 
- Juwei Shi (史巨伟)

Re: how to use mapred.min.split.size option ?

Posted by Juwei Shi <sh...@gmail.com>.
The input split size is determined by mapred.min.split.size, dfs.block.size
and mapred.map.tasks.

goalSize  = totalSize / mapred.map.tasks
minSize   = max(mapred.min.split.size, minSplitSize)
splitSize = max(minSize, min(goalSize, dfs.block.size))

minSplitSize is determined by each InputFormat such as
SequenceFileInputFormat.

You may want to refer to FileInputFormat.java for more details.
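
As a rough illustration of the formulas above, here is a small Python sketch
(not Hadoop's code) that applies them to the numbers reported elsewhere in
this thread. The function names, the value of 2 for mapred.map.tasks, and the
1.1 slack factor used when cutting a file into splits are assumptions made
for this example. The important point it models is that splits are cut per
file, so a large minimum split size never merges several files into one
split.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed slack: the last split of a file may be up to 10% larger


def compute_split_size(total_size, num_map_tasks, min_split_size,
                       format_min_split_size, block_size):
    """Apply the three formulas quoted above (all sizes in bytes)."""
    goal_size = total_size // max(num_map_tasks, 1)
    min_size = max(min_split_size, format_min_split_size)
    return max(min_size, min(goal_size, block_size))


def splits_for_file(file_len, split_size):
    """Cut a single file into splits; a split never spans two files."""
    splits = []
    remaining = file_len
    while remaining / split_size > SPLIT_SLOP:
        splits.append(split_size)
        remaining -= split_size
    if remaining > 0:
        splits.append(remaining)
    return splits


def count_splits(file_lens, num_map_tasks, min_split_size, block_size):
    """Total number of splits (and therefore map tasks) for a set of files."""
    split_size = compute_split_size(sum(file_lens), num_map_tasks,
                                    min_split_size, 1, block_size)
    return sum(len(splits_for_file(n, split_size)) for n in file_lens)


# 10 files of 233 MB, 64 MB blocks, mapred.map.tasks assumed to be 2.
files = [233 * MB] * 10
print(count_splits(files, 2, 1, 64 * MB))              # minimum left at 1 byte
print(count_splits(files, 2, 1_000_000_000, 64 * MB))  # minimum set to about 1 GB
```

Under these assumptions the first call yields 40 splits (four per file) and
the second yields 10 (one per file): raising the minimum stops files from
being cut at block boundaries, but each file still gets its own mapper.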


2011/5/25 Mapred Learn <ma...@gmail.com>

> Resending ====>
>
>
> > Hi,
> > I have few input splits that are few MB in size.
> > I want to submit 1 GB of input to every mapper. Does anyone know how can
> I do it ?
> > Currently each mapper gets one input split that results in many small
> map-output files.
> >
> > I tried setting -Dmapred.map.min.split.size=<number> , but still it does
> not take effect.
> >
> > Thanks,
> > -JJ
>



-- 
- Juwei Shi

Re: how to use mapred.min.split.size option ?

Posted by Mapred Learn <ma...@gmail.com>.
Resending ====>


> Hi,
> I have few input splits that are few MB in size.
> I want to submit 1 GB of input to every mapper. Does anyone know how can I do it ?
> Currently each mapper gets one input split that results in many small map-output files.
>  
> I tried setting -Dmapred.map.min.split.size=<number> , but still it does not take effect.
>  
> Thanks,
> -JJ
