Posted to hdfs-user@hadoop.apache.org by Florin P <fl...@yahoo.com> on 2011/07/20 08:41:27 UTC

Order of files in Map class

Hello!
  Suppose the files F1, F2, ..., Fk are given by the input splitter to the map class. In what order will their key-value pairs arrive when the map function is applied?
  What interests me is whether the map function can receive mixed key-value pairs from different files. Will the keys arrive grouped by their source file, until no more keys are left from that file, or can one key arrive from F1, then one from Fk, and so on?
  Example:
   Mixed key-value pairs at the map function:
    K1 from F1
    K5 from F5
    K7 from F8
  etc.

 ordered key-value pairs:
    K1 from F1
   ..
    K_end_F1 from F1
    K5 from F5
..  
  K_end_F5 from F5
  and so on.

I look forward to your answer.
  Regards,
 Florin  
    

Re: Order of files in Map class

Posted by Harsh J <ha...@cloudera.com>.
Missed link: [1] -
http://wiki.apache.org/hadoop/FAQ#If_a_block_size_of_64MB_is_used_and_a_file_is_written_that_uses_less_than_64MB.2C_will_64MB_of_disk_space_be_consumed.3F




-- 
Harsh J

Re: Order of files in Map class

Posted by Harsh J <ha...@cloudera.com>.
Florin,

On Wed, Jul 20, 2011 at 2:03 PM, Florin P <fl...@yahoo.com> wrote:
> Hello, Harsh!
>  Thank you for your quick response. I have a couple more questions:
> 1. You are saying that each map task will take one file as its input, but when the files are smaller than the block size, is it possible for a map task to take more than one file?

If you have a 2 MB file on DFS with a configured block size of 256 MB,
the file still takes up only 2 MB. See [1]. The block size is merely an
upper bound on how a file is split into blocks, not a fixed allocation
that gets filled up. No two files can reside in the same 'block'.
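To make the accounting concrete, here is a small sketch (plain Python as a simulation of HDFS block accounting, not real HDFS API calls; `blocks_for` is an invented helper) of why a 2 MB file with a 256 MB block size still consumes only 2 MB:

```python
# Simulation of HDFS block accounting: a file smaller than the block
# size still occupies only its actual length, and no two files ever
# share a block (each file gets its own block list).
BLOCK_SIZE = 256 * 1024 * 1024  # 256 MB

def blocks_for(file_len, block_size=BLOCK_SIZE):
    """Return the (offset, length) block list for a single file."""
    blocks = []
    offset = 0
    while offset < file_len:
        length = min(block_size, file_len - offset)  # last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks

two_mb = 2 * 1024 * 1024
blocks = blocks_for(two_mb)
print(blocks)                       # one 2 MB block, not a 256 MB one
print(sum(l for _, l in blocks))    # total space used equals the file length
```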

> 2. In that case, will the same behavior apply (i.e., each file processed to the end before the next one starts)?

Unless you pack more blocks per split with an input format like
CombineFileInputFormat, this does not happen.

If you do use CombineFileInputFormat, then yes, each file in a split is
processed to the end before the next one starts.

Of course, you can also write your own custom InputFormat+RecordReader
that mixes files' records however you want (the mapjoin example, for
instance, reads from multiple files at a time in order to join them).
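A rough sketch of the packing idea behind CombineFileInputFormat (a simplified Python simulation; the function name and the greedy packing rule are illustrative assumptions only -- the real implementation also groups blocks by node and rack locality, which this ignores):

```python
# Simplified sketch of CombineFileInputFormat-style packing: small
# files are accumulated into one split until a maximum split size
# would be exceeded, then a new split is started.
def combine_splits(file_sizes, max_split_size):
    """file_sizes: list of (name, size); returns lists of file names per split."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > max_split_size:
            splits.append(current)          # close the full split
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)              # flush the last partial split
    return splits

files = [("f1", 40), ("f2", 30), ("f3", 50), ("f4", 20)]
print(combine_splits(files, 100))  # [['f1', 'f2'], ['f3', 'f4']]
```

Within one combined split, the single map task then reads f1's records to the end before starting on f2's, which is the "each file processed till the end, then the next one" behavior described above.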




-- 
Harsh J

Re: Order of files in Map class

Posted by Florin P <fl...@yahoo.com>.
Hello, Harsh!
  Thank you for your quick response. I have a couple more questions:
1. You are saying that each map task will take one file as its input, but when the files are smaller than the block size, is it possible for a map task to take more than one file?
2. In that case, will the same behavior apply (i.e., each file processed to the end before the next one starts)?
   Regards,
   Florin


Re: Order of files in Map class

Posted by Harsh J <ha...@cloudera.com>.
Florin,

Your second example is how it happens in Hadoop, but there's more here
to understand.

To start with, your InputFormat (the input splitter) computes and
publishes a set of InputSplits. The total number of input splits will
be the total number of 'Map Tasks' in Hadoop as the job proceeds. The
input splits are generally block splits, i.e., start-and-stop offsets
within a single file.

Each 'MapTask' is assigned one split from this list of splits. So
every map task initializes separately, in its own JVM (no shared
resources -- again, it's a different mapper instance per file or
block!) and reads its input split alone, feeding records into its
map(key, value, context) function.

So to summarize, your second example is what will happen, but in
parallel instead, such as:

map1  | map2  | …
file1 | file2 | …
row1  | row1  | …
row2  | row2  | …

P.S. What I've explained here is the default behavior. Of course,
things can be tweaked heavily to achieve other behaviors, like your
first example, but those probably come with greater read costs
attached. The 'Hadoop' way is data-local and one-file-per-task.
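The per-task behavior can be simulated in a few lines of Python (a toy stand-in, not Hadoop code; `run_map_task` and the record layout are invented for illustration -- the real TextInputFormat keys records by byte offset rather than line number):

```python
# Simulation of the default one-split-per-map-task behavior: each
# "map task" is handed exactly one file (split) and sees that file's
# records in order. Records from different files never interleave
# within a single task; the tasks merely run in parallel.
def run_map_task(task_id, records):
    """Stand-in for one MapTask: yield (key, value) pairs in file order."""
    for line_no, line in enumerate(records):
        yield (line_no, line)  # key: position in file; value: the record

splits = {"map1": ["f1-row1", "f1-row2"], "map2": ["f2-row1", "f2-row2"]}
for task, records in splits.items():   # in reality these run in parallel
    print(task, list(run_map_task(task, records)))
```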

On Wed, Jul 20, 2011 at 12:11 PM, Florin P <fl...@yahoo.com> wrote:
> Hello!
>  Suppose that we have the files F1, F2,..Fk given by the input splitter to the map class, what is the order in which they will arrive when map function  is applied?
>  What is interesting me  if  it is possible that in the map function to arrive mixed key-value pairs from different files? They keys will arrive related with their file, till no more keys are left from source file or they can arrive one key from F1 one key from Fk and so on.
>  Example:
>   Mixed key value pairs at the map function:
>    K1 from F1
>    K5 from F5
>    K7 from F8
>  etc
>
>  ordered key-value pairs:
>    K1 from F1
>   ..
>    K_end_F1 from F1
>    K5 from F5
> ..
>  K_end_F5 from F5
>  and so on.
>
> I'll look forward for your answer.
>  Regards,
>  Florin
>
>



-- 
Harsh J