Posted to common-user@hadoop.apache.org by Sugandha Naolekar <su...@gmail.com> on 2014/02/25 06:57:08 UTC

Mappers vs. Map tasks

Hello,

As per the various articles I have read to date, files are split into
chunks/blocks. On that note, I would like to ask a few things:


   1. The number of mappers is decided as Total_File_Size / Max_Block_Size.
   Thus, if the file is smaller than the block size, only one mapper will be
   invoked. Right?
   2. If yes, that means map() will be called only once. Right? In this
   case, if there are two datanodes with a replication factor of 1, only one
   datanode (the mapper machine) will perform the task. Right?
   3. The map() function is called by all the datanodes/slaves, right? If
   the number of mappers is greater than the number of slaves, what happens?

--
Thanks & Regards,
Sugandha Naolekar

Re: Mappers vs. Map tasks

Posted by shashwat shriparv <dw...@gmail.com>.
You are really confused :) Please read these:

http://developer.yahoo.com/hadoop/tutorial/module4.html#closer
http://wiki.apache.org/hadoop/HowManyMapsAndReduces



Warm Regards,
Shashwat Shriparv



On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar
<su...@gmail.com> wrote:

> Hello,
>
> As per the various articles I have read to date, files are split into
> chunks/blocks. On that note, I would like to ask a few things:
>
>
>    1. The number of mappers is decided as Total_File_Size / Max_Block_Size.
>    Thus, if the file is smaller than the block size, only one mapper will be
>    invoked. Right?
>    2. If yes, that means map() will be called only once. Right? In this
>    case, if there are two datanodes with a replication factor of 1, only one
>    datanode (the mapper machine) will perform the task. Right?
>    3. The map() function is called by all the datanodes/slaves, right? If
>    the number of mappers is greater than the number of slaves, what happens?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
The code that generates the splits is part of the InputFormat, namely its
getSplits(org.apache.hadoop.mapred.JobConf, int) method; see
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/InputFormat.html
for the details.
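
For illustration, here is a minimal sketch (old mapred API; the class name
and the idea of running it standalone are mine, not from this thread) that
prints the splits an InputFormat computes, i.e. the number of map tasks
that would run:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitInspector {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    TextInputFormat format = new TextInputFormat();
    format.configure(conf); // let the format read its config keys
    // The second argument is only a hint; file-based InputFormats derive
    // the actual split count from total input size and block size.
    InputSplit[] splits = format.getSplits(conf, 1);
    System.out.println("splits (= map tasks): " + splits.length);
    for (InputSplit split : splits) {
      System.out.println(split);
    }
  }
}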

Warm Regards,
Tariq
cloudfront.blogspot.com


On Thu, Feb 27, 2014 at 9:58 AM, Rajesh Nagaraju
<ra...@gmail.com> wrote:

> Hi
>
> I have worked on a similar project: I had an M/R job for preprocessing
> that would remove the NL chars and ensure that one line has one JSON
> object. The subsequent MRs then do the processing. You can have a
> workflow defined for binding the MRs.
>
> Thanks and Regards
> Rajesh Nagaraju
>
>
> On Thu, Feb 27, 2014 at 9:51 AM, Sugandha Naolekar
> <sugandha.n87@gmail.com> wrote:
>
>> Joao Paulo,
>>
>> Your suggestion is appreciated. Although, on a side note, which is more
>> tedious: writing a custom InputFormat, or changing the code that
>> generates the input splits?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com> wrote:
>>
>>> If I understood your problem correctly, you have one huge JSON, which is
>>> basically a JSONArray, and you want to process one JSONObject of the array
>>> at a time.
>>>
>>> I faced the same issue some time ago, and instead of changing the
>>> input format, I changed the code that was generating this input to
>>> generate lots of JSONObjects, one per line. Hence, using the default
>>> TextInputFormat, each map call received one complete JSONObject.
>>>
>>> A JSONArray is not good as mapreduce input since it has an opening [, a
>>> closing ], and commas between the JSONs of the array. The array itself
>>> can be represented by the file that the JSONObjects belong to.
>>>
>>> Of course, this approach works only if you can modify what is generating
>>> the input you're talking about.
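>>>
>>> As a minimal sketch of that preprocessing step (assuming Jackson 2.x on
>>> the classpath; the class name and arguments are illustrative, and a
>>> streaming JsonParser would be preferable for arrays too large for memory):
>>>
>>> import java.io.File;
>>> import java.io.PrintWriter;
>>> import com.fasterxml.jackson.databind.JsonNode;
>>> import com.fasterxml.jackson.databind.ObjectMapper;
>>>
>>> public class ArrayToLines {
>>>   public static void main(String[] args) throws Exception {
>>>     ObjectMapper mapper = new ObjectMapper();
>>>     JsonNode array = mapper.readTree(new File(args[0])); // the JSONArray
>>>     try (PrintWriter out = new PrintWriter(args[1])) {
>>>       for (JsonNode obj : array) {
>>>         // one complete JSONObject per output line
>>>         out.println(mapper.writeValueAsString(obj));
>>>       }
>>>     }
>>>   }
>>> }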
>>>
>>>
>>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>>>
>>>> In that case you have to convert your JSON data into SequenceFiles
>>>> first and then do the processing.
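>>>>
>>>> A minimal sketch of that conversion (assuming each JSON object already
>>>> occupies one line of the input; paths and class name are illustrative):
>>>>
>>>> import java.io.BufferedReader;
>>>> import java.io.FileReader;
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.LongWritable;
>>>> import org.apache.hadoop.io.SequenceFile;
>>>> import org.apache.hadoop.io.Text;
>>>>
>>>> public class JsonToSeqFile {
>>>>   public static void main(String[] args) throws Exception {
>>>>     Configuration conf = new Configuration();
>>>>     FileSystem fs = FileSystem.get(conf);
>>>>     SequenceFile.Writer writer = SequenceFile.createWriter(
>>>>         fs, conf, new Path(args[1]), LongWritable.class, Text.class);
>>>>     try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
>>>>       long lineNo = 0;
>>>>       String line;
>>>>       while ((line = in.readLine()) != null) {
>>>>         // key = line number, value = one complete JSON object
>>>>         writer.append(new LongWritable(lineNo++), new Text(line));
>>>>       }
>>>>     } finally {
>>>>       writer.close();
>>>>     }
>>>>   }
>>>> }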
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Can I use SequenceFileInputFormat to do the same?
>>>>>
>>>>>  --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>>>>
>>>>>> Since there is no OOTB feature that allows this, you have to write
>>>>>> a custom InputFormat to handle JSON data. Alternatively, you could
>>>>>> make use of Pig or Hive, as they have built-in JSON support.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>>>> rajeshnagaraju@gmail.com> wrote:
>>>>>>
>>>>>>> One simple way is to remove the newline characters, so that the
>>>>>>> default record reader and the default way the block is read will take
>>>>>>> care of the input splits, and the JSON will not be affected by the
>>>>>>> removal of the NL characters.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ok, got it. Now I have a single file of 129 MB; thus, it will be
>>>>>>>> split into two blocks. Since my file is a JSON file, I cannot use
>>>>>>>> TextInputFormat, as every record would then be a single line of the
>>>>>>>> JSON file, which I don't want. In this case, can I write a custom
>>>>>>>> input format and a custom record reader so that every (logical) input
>>>>>>>> split will have only that part of the data which I require?
>>>>>>>>
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>>>> 1458938.970140 ] ] } }
>>>>>>>> ,
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>>>> 1458829.592709 ] ] } }
>>>>>>>>
>>>>>>>> *Here I want every input split to consist of one entire "Feature"
>>>>>>>> object, so that I can process it accordingly by passing relevant K,V
>>>>>>>> pairs to the map function.*
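>>>>>>>>
>>>>>>>> For illustration, one minimal sketch in that direction (new
>>>>>>>> mapreduce API; the class name is illustrative, and a real per-Feature
>>>>>>>> record reader would still have to parse object boundaries itself):
>>>>>>>> mark the file non-splittable, so a single map task sees the whole
>>>>>>>> file and no JSON object is cut at a block boundary.
>>>>>>>>
>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>> import org.apache.hadoop.mapreduce.JobContext;
>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>>>>>>>>
>>>>>>>> public class NonSplittableJsonInputFormat extends TextInputFormat {
>>>>>>>>   @Override
>>>>>>>>   protected boolean isSplitable(JobContext context, Path file) {
>>>>>>>>     return false; // one split per file, regardless of block count
>>>>>>>>   }
>>>>>>>> }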
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq
>>>>>>>> <dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sugandha,
>>>>>>>>>
>>>>>>>>> Please find my comments embedded below:
>>>>>>>>>
>>>>>>>>> Q: No. of mappers are decided as Total_File_Size / Max_Block_Size.
>>>>>>>>> Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>> will be invoked. Right?
>>>>>>>>> A: This is true, but not always. The basic criterion behind map
>>>>>>>>> creation is the logic inside the *getSplits* method of the
>>>>>>>>> *InputFormat* being used in your MR job. It is the behavior of *file
>>>>>>>>> based InputFormats*, typically sub-classes of *FileInputFormat*, to
>>>>>>>>> split the input data into splits based on the total size, in bytes,
>>>>>>>>> of the input files. See
>>>>>>>>> http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html
>>>>>>>>> for more details. And yes, if the file is smaller than the block
>>>>>>>>> size, then only 1 mapper will be created.
>>>>>>>>>
>>>>>>>>> Q: If yes, it means map() will be called only once. Right? In this
>>>>>>>>> case, if there are two datanodes with a replication factor of 1, only
>>>>>>>>> one datanode (mapper machine) will perform the task. Right?
>>>>>>>>> A: A mapper is called for each split. Don't confuse MR's split with
>>>>>>>>> HDFS's block; the two are different (they may overlap, though, as in
>>>>>>>>> the case of FileInputFormat). HDFS blocks are a physical partitioning
>>>>>>>>> of your data, while an InputSplit is just a logical partitioning. If
>>>>>>>>> you have a file smaller than the HDFS block size, then only one split
>>>>>>>>> will be created, hence only 1 mapper will be called. And this will
>>>>>>>>> happen on the node where this file resides.
>>>>>>>>>
>>>>>>>>> Q: The map() function is called by all the datanodes/slaves, right?
>>>>>>>>> If the no. of mappers is more than the no. of slaves, what happens?
>>>>>>>>> A: map() doesn't get called by anybody. Rather, a map task gets
>>>>>>>>> created on the node where the chunk of data to be processed resides.
>>>>>>>>> A slave node can run multiple mappers, based on the availability of
>>>>>>>>> CPU slots.
>>>>>>>>>
>>>>>>>>> Q: One more thing to ask: no. of blocks = no. of mappers. Thus, the
>>>>>>>>> map() function will be called that many times, right?
>>>>>>>>> A: No. of blocks = no. of splits = no. of mappers. A map task runs
>>>>>>>>> only once per split, on a node where that split is present.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Bertrand,
>>>>>>>>>>
>>>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this
>>>>>>>>>> is only true when you set isSplitable() as false or when your input
>>>>>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>>>>>> split, which can then be read by the RecordReader in K,V format.
>>>>>>>>>>
>>>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>>>>>>
>>>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>>>
>>>>>>>>>>> Mapper is the name of the abstract class/interface, so it does not
>>>>>>>>>>> really make sense to talk about a number of mappers.
>>>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>>>>>> instantiated.
>>>>>>>>>>>
>>>>>>>>>>> Always:
>>>>>>>>>>> number of input splits = number of map tasks
>>>>>>>>>>>
>>>>>>>>>>> And generally:
>>>>>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Bertrand
>>>>>>>>>>>
>>>>>>>>>>> PS: I don't know if it is only my client, but avoid red text when
>>>>>>>>>>> writing a mail.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>>>> slot hosts a mapper. A mapper executes map tasks. If there are more map
>>>>>>>>>>>> tasks than slots, there will obviously be multiple rounds of mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>>>> is typically 64MB and can contain a multitude of records; therefore, a
>>>>>>>>>>>> map task = running the map() function on all records in the block.
>>>>>>>>>>>>
>>>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore, you have to make a distinction between the two
>>>>>>>>>>>> layers. You have a layer for computation, which consists of a jobtracker
>>>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for storage:
>>>>>>>>>>>> HDFS has a namenode and a set of datanodes.
>>>>>>>>>>>>
>>>>>>>>>>>> In mapreduce, the code is executed where the data is. So if a
>>>>>>>>>>>> block is on datanodes 1, 2 and 3, then the map task associated with this
>>>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>>>> tasktracker 1, 2 or 3. But this is not required; things can be rearranged.
>>>>>>>>>>>>
>>>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards, Dieter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>>>>> the map() function will be called that many times, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As per the various articles I have read to date, files are
>>>>>>>>>>>>>> split into chunks/blocks. On that note, I would like to ask a
>>>>>>>>>>>>>> few things:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. The number of mappers is decided as Total_File_Size /
>>>>>>>>>>>>>>    Max_Block_Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>>>    2. If yes, that means map() will be called only once. Right?
>>>>>>>>>>>>>>    In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>>>    of 1, only one datanode (the mapper machine) will perform the task. Right?
>>>>>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves,
>>>>>>>>>>>>>>    right? If the number of mappers is greater than the number of
>>>>>>>>>>>>>>    slaves, what happens?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
The code which generates split is part of the InputFormat, which is the
getSplits<http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/InputFormat.html#getSplits(org.apache.hadoop.mapred.JobConf,
int)> method.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Thu, Feb 27, 2014 at 9:58 AM, Rajesh Nagaraju
<ra...@gmail.com>wrote:

> Hi
>
> I have worked on a similiar project, I had a M/R for preprocessing that
> would remove the NL chars and ensure that 1 line has a json object.
> Then the consequent MRs do the processing. You can have a work flow
> defined for binding the MRs
>
> Thanks and Regards
> Rajesh Nagaraju
>
>
> On Thu, Feb 27, 2014 at 9:51 AM, Sugandha Naolekar <sugandha.n87@gmail.com
> > wrote:
>
>> Joao Paulo,
>>
>> Your suggestion is appreciated. Although, on a side note, what is more
>> tedious: Writing a custom InputFormat or changing the code which is
>> generating the input splits.?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com>wrote:
>>
>>> If I understood your problem correctly, you have one huge JSON, which is
>>> basically a JSONArray, and you want to process one JSONObject of the array
>>> at a time.
>>>
>>> I have faced the same issue some time ago and instead of changing the
>>> input format, I changed the code that was generating this input, to
>>> generate lots of JSONObjects, one per line. Hence, using the default
>>> TextInputFormat, the map function was getting called with the entire JSON.
>>>
>>> A JSONArray is not good for a mapreduce input since it has a first [ and
>>> a last ] and commas between the JSONs of the array. The array can be
>>> represented as the file that the JSONs belong.
>>>
>>> Of course, this approach works only if you can modify what is generating
>>> the input you're talking about.
>>>
>>>
>>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>>>
>>> In that case you have to convert your JSON data into seq files first and
>>>> then do the processing.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Can I use SequenceFileInputFormat to do the same?
>>>>>
>>>>>  --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Since there is no OOTB feature that allows this, you have to write
>>>>>> your custom InputFormat to handle JSON data. Alternatively you could make
>>>>>> use of Pig or Hive as they have builtin JSON support.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>>>> rajeshnagaraju@gmail.com> wrote:
>>>>>>
>>>>>>> 1 simple way is to remove the new line characters so that the
>>>>>>> default record reader and default way the block is read will take care of
>>>>>>> the input splits and JSON will not get affected by the removal of NL
>>>>>>> character
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it
>>>>>>>> will be split into two blocks. Now, since my file is a json file, I cannot
>>>>>>>> use textinputformat. As, every input split(logical) will be a single line
>>>>>>>> of the json file. Which I dont want. Thus, in this case, can I write a
>>>>>>>> custom input format and a custom record reader so that, every input
>>>>>>>> split(logical) will have only that part of data which I require.
>>>>>>>>
>>>>>>>> For. e.g:
>>>>>>>>
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>>>> 1458938.970140 ] ] } }
>>>>>>>> ,
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>>>> 1458829.592709 ] ] } }
>>>>>>>>
>>>>>>>> *I want here the every input split to consist of entire type data
>>>>>>>> and thus, I can process it accordingly by giving relevant k,V pairs to the
>>>>>>>> map function.*
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <dontariq@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi Sugandha,
>>>>>>>>>
>>>>>>>>> Please find my comments embedded below :
>>>>>>>>>
>>>>>>>>>                   No. of mappers are decided as:
>>>>>>>>> Total_File_Size/Max. Block Size. Thus, if the file is smaller than the
>>>>>>>>> block size, only one mapper will be                               invoked.
>>>>>>>>> Right?
>>>>>>>>>                   This is true(but not always). The basic
>>>>>>>>> criteria behind map creation is the logic inside *getSplits*method of
>>>>>>>>> *InputFormat* being used in your                     MR job. It
>>>>>>>>> is the behavior of *file based InputFormats*, typically
>>>>>>>>> sub-classes of *FileInputFormat*, to split the input data into
>>>>>>>>> splits based                     on the total size, in bytes, of the input
>>>>>>>>> files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>>>>>>>>> only 1 mapper will                     be created.
>>>>>>>>>
>>>>>>>>>                   If yes, it means, the map() will be called only
>>>>>>>>> once. Right? In this case, if there are two datanodes with a replication
>>>>>>>>> factor as 1: only one                               datanode(mapper
>>>>>>>>> machine) will perform the task. Right?
>>>>>>>>>                   A mapper is called for each split. Don't get
>>>>>>>>> confused with the MR's split and HDFS's block. Both are different(They may
>>>>>>>>> overlap though, as in                     case of FileInputFormat). HDFS
>>>>>>>>> blocks are physical partitioning of your data, while an InputSplit is just
>>>>>>>>> a logical partitioning. If you have a                       file which is
>>>>>>>>> smaller than the HDFS blocksize then only one split will be created, hence
>>>>>>>>> only 1 mapper will be called. And this will happen on
>>>>>>>>> the node where this file resides.
>>>>>>>>>
>>>>>>>>>                   The map() function is called by all the
>>>>>>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>> slaves, what happens?
>>>>>>>>>                   map() doesn't get called by anybody. It rather
>>>>>>>>> gets created on the node where the chunk of data to be processed resides. A
>>>>>>>>> slave node can run                       multiple mappers based on the
>>>>>>>>> availability of CPU slots.
>>>>>>>>>
>>>>>>>>>                  One more thing to ask: No. of blocks = no. of
>>>>>>>>> mappers. Thus, those many no. of times the map() function will be called
>>>>>>>>> right?
>>>>>>>>>                  No. of blocks = no. of splits = no. of mappers. A
>>>>>>>>> map is called only once per split per node where that split is present.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Bertrand,
>>>>>>>>>>
>>>>>>>>>> As you said, no. of HDFS blocks =  no. of input splits. But this
>>>>>>>>>> is only true when you set isSplittable() as false or when your input file
>>>>>>>>>> size is less than the block size. Also, when it comes to text files, the
>>>>>>>>>> default textinputformat considers each line as one input split which can be
>>>>>>>>>> then read by RecordReader in K,V format.
>>>>>>>>>>
>>>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>>>>>>>>
>>>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>>>
>>>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>>>> really make sense to talk about number of mappers.
>>>>>>>>>>> A task is a jvm that can be launched only if there is a free
>>>>>>>>>>> slot ie for a given slot, at a given time, there will be at maximum only a
>>>>>>>>>>> single task. During the task, the configured Mapper will be instantiated.
>>>>>>>>>>>
>>>>>>>>>>> Always :
>>>>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>>>>
>>>>>>>>>>> And generally :
>>>>>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Bertrand
>>>>>>>>>>>
>>>>>>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>>>>>>> writting a mail.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>>>> slot hosts as mapper. A mapper executes map tasks. If there are more map
>>>>>>>>>>>> tasks than slots obviously there will be multiple rounds of mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>>>> is typically 64MB and can contain a multitude of record, therefore a map
>>>>>>>>>>>> task = run the map() function on all records in the block.
>>>>>>>>>>>>
>>>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore you have to make a distinction between the two
>>>>>>>>>>>> layers. You have a layer for computations which consists of a jobtracker
>>>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for storage. The
>>>>>>>>>>>> HDFS has a namenode and a set of datanodes.
>>>>>>>>>>>>
>>>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>>>> block is in datanode 1, 2 and 3, then the map task associated with this
>>>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>>>> tasktracker 1, 2 or 3. But this is not necessary, thing can be rearranged.
>>>>>>>>>>>>
>>>>>>>>>>>> Hopefully this gives you a little more insigth.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards, Dieter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>>>>> those many no. of times the map() function will be called right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>>>>> things:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max.
>>>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>>>>    3. The map() function is called by all the
>>>>>>>>>>>>>>    datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>>>>>>>    slaves, what happens?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
The code which generates split is part of the InputFormat, which is the
getSplits<http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/InputFormat.html#getSplits(org.apache.hadoop.mapred.JobConf,
int)> method.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Thu, Feb 27, 2014 at 9:58 AM, Rajesh Nagaraju
<ra...@gmail.com>wrote:

> Hi
>
> I have worked on a similiar project, I had a M/R for preprocessing that
> would remove the NL chars and ensure that 1 line has a json object.
> Then the consequent MRs do the processing. You can have a work flow
> defined for binding the MRs
>
> Thanks and Regards
> Rajesh Nagaraju
>
>
> On Thu, Feb 27, 2014 at 9:51 AM, Sugandha Naolekar <sugandha.n87@gmail.com
> > wrote:
>
>> Joao Paulo,
>>
>> Your suggestion is appreciated. Although, on a side note, what is more
>> tedious: Writing a custom InputFormat or changing the code which is
>> generating the input splits.?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com>wrote:
>>
>>> If I understood your problem correctly, you have one huge JSON, which is
>>> basically a JSONArray, and you want to process one JSONObject of the array
>>> at a time.
>>>
>>> I have faced the same issue some time ago and instead of changing the
>>> input format, I changed the code that was generating this input, to
>>> generate lots of JSONObjects, one per line. Hence, using the default
>>> TextInputFormat, the map function was getting called with the entire JSON.
>>>
>>> A JSONArray is not good for a mapreduce input since it has a first [ and
>>> a last ] and commas between the JSONs of the array. The array can be
>>> represented as the file that the JSONs belong.
>>>
>>> Of course, this approach works only if you can modify what is generating
>>> the input you're talking about.
>>>
>>>
>>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>>>
>>> In that case you have to convert your JSON data into seq files first and
>>>> then do the processing.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Can I use SequenceFileInputFormat to do the same?
>>>>>
>>>>>  --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Since there is no OOTB feature that allows this, you have to write
>>>>>> your custom InputFormat to handle JSON data. Alternatively you could make
>>>>>> use of Pig or Hive as they have builtin JSON support.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>>>> rajeshnagaraju@gmail.com> wrote:
>>>>>>
>>>>>>> 1 simple way is to remove the new line characters so that the
>>>>>>> default record reader and default way the block is read will take care of
>>>>>>> the input splits and JSON will not get affected by the removal of NL
>>>>>>> character
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it
>>>>>>>> will be split into two blocks. Now, since my file is a json file, I cannot
>>>>>>>> use textinputformat. As, every input split(logical) will be a single line
>>>>>>>> of the json file. Which I dont want. Thus, in this case, can I write a
>>>>>>>> custom input format and a custom record reader so that, every input
>>>>>>>> split(logical) will have only that part of data which I require.
>>>>>>>>
>>>>>>>> For. e.g:
>>>>>>>>
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>>>> 1458938.970140 ] ] } }
>>>>>>>> ,
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>>>> 1458829.592709 ] ] } }
>>>>>>>>
>>>>>>>> *I want here the every input split to consist of entire type data
>>>>>>>> and thus, I can process it accordingly by giving relevant k,V pairs to the
>>>>>>>> map function.*
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <dontariq@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi Sugandha,
>>>>>>>>>
>>>>>>>>> Please find my comments embedded below :
>>>>>>>>>
>>>>>>>>>                   No. of mappers are decided as:
>>>>>>>>> Total_File_Size/Max. Block Size. Thus, if the file is smaller than the
>>>>>>>>> block size, only one mapper will be                               invoked.
>>>>>>>>> Right?
>>>>>>>>>                   This is true(but not always). The basic
>>>>>>>>> criteria behind map creation is the logic inside *getSplits*method of
>>>>>>>>> *InputFormat* being used in your                     MR job. It
>>>>>>>>> is the behavior of *file based InputFormats*, typically
>>>>>>>>> sub-classes of *FileInputFormat*, to split the input data into
>>>>>>>>> splits based                     on the total size, in bytes, of the input
>>>>>>>>> files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>>>>>>>>> only 1 mapper will                     be created.
>>>>>>>>>
>>>>>>>>>                   If yes, it means, the map() will be called only
>>>>>>>>> once. Right? In this case, if there are two datanodes with a replication
>>>>>>>>> factor as 1: only one                               datanode(mapper
>>>>>>>>> machine) will perform the task. Right?
>>>>>>>>>                   A mapper is called for each split. Don't get
>>>>>>>>> confused with the MR's split and HDFS's block. Both are different(They may
>>>>>>>>> overlap though, as in                     case of FileInputFormat). HDFS
>>>>>>>>> blocks are physical partitioning of your data, while an InputSplit is just
>>>>>>>>> a logical partitioning. If you have a                       file which is
>>>>>>>>> smaller than the HDFS blocksize then only one split will be created, hence
>>>>>>>>> only 1 mapper will be called. And this will happen on
>>>>>>>>> the node where this file resides.
>>>>>>>>>
>>>>>>>>>                   The map() function is called by all the
>>>>>>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>> slaves, what happens?
>>>>>>>>>                   map() doesn't get called by anybody. It rather
>>>>>>>>> gets created on the node where the chunk of data to be processed resides. A
>>>>>>>>> slave node can run                       multiple mappers based on the
>>>>>>>>> availability of CPU slots.
>>>>>>>>>
>>>>>>>>>                  One more thing to ask: No. of blocks = no. of
>>>>>>>>> mappers. Thus, those many no. of times the map() function will be called
>>>>>>>>> right?
>>>>>>>>>                  No. of blocks = no. of splits = no. of mappers. A
>>>>>>>>> map is called only once per split per node where that split is present.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Bertrand,
>>>>>>>>>>
>>>>>>>>>> As you said, no. of HDFS blocks =  no. of input splits. But this
>>>>>>>>>> is only true when you set isSplittable() as false or when your input file
>>>>>>>>>> size is less than the block size. Also, when it comes to text files, the
>>>>>>>>>> default textinputformat considers each line as one input split which can be
>>>>>>>>>> then read by RecordReader in K,V format.
>>>>>>>>>>
>>>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>>>>>>>>
>>>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>>>
>>>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>>>> really make sense to talk about number of mappers.
>>>>>>>>>>> A task is a jvm that can be launched only if there is a free
>>>>>>>>>>> slot ie for a given slot, at a given time, there will be at maximum only a
>>>>>>>>>>> single task. During the task, the configured Mapper will be instantiated.
>>>>>>>>>>>
>>>>>>>>>>> Always :
>>>>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>>>>
>>>>>>>>>>> And generally :
>>>>>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Bertrand
>>>>>>>>>>>
>>>>>>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>>>>>>> writting a mail.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>>>> slot hosts as mapper. A mapper executes map tasks. If there are more map
>>>>>>>>>>>> tasks than slots obviously there will be multiple rounds of mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>>>> is typically 64MB and can contain a multitude of record, therefore a map
>>>>>>>>>>>> task = run the map() function on all records in the block.
>>>>>>>>>>>>
>>>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore you have to make a distinction between the two
>>>>>>>>>>>> layers. You have a layer for computations which consists of a jobtracker
>>>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for storage. The
>>>>>>>>>>>> HDFS has a namenode and a set of datanodes.
>>>>>>>>>>>>
>>>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>>>> block is in datanode 1, 2 and 3, then the map task associated with this
>>>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>>>> tasktracker 1, 2 or 3. But this is not necessary, thing can be rearranged.
>>>>>>>>>>>>
>>>>>>>>>>>> Hopefully this gives you a little more insigth.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards, Dieter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>>>>> those many no. of times the map() function will be called right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>>>>> things:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max.
>>>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>>>>    3. The map() function is called by all the
>>>>>>>>>>>>>>    datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>>>>>>>    slaves, what happens?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
The code which generates split is part of the InputFormat, which is the
getSplits<http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/InputFormat.html#getSplits(org.apache.hadoop.mapred.JobConf,
int)> method.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Thu, Feb 27, 2014 at 9:58 AM, Rajesh Nagaraju
<ra...@gmail.com>wrote:

> Hi
>
> I have worked on a similiar project, I had a M/R for preprocessing that
> would remove the NL chars and ensure that 1 line has a json object.
> Then the consequent MRs do the processing. You can have a work flow
> defined for binding the MRs
>
> Thanks and Regards
> Rajesh Nagaraju
>
>
> On Thu, Feb 27, 2014 at 9:51 AM, Sugandha Naolekar <sugandha.n87@gmail.com
> > wrote:
>
>> Joao Paulo,
>>
>> Your suggestion is appreciated. Although, on a side note, what is more
>> tedious: Writing a custom InputFormat or changing the code which is
>> generating the input splits.?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com>wrote:
>>
>>> If I understood your problem correctly, you have one huge JSON, which is
>>> basically a JSONArray, and you want to process one JSONObject of the array
>>> at a time.
>>>
>>> I have faced the same issue some time ago and instead of changing the
>>> input format, I changed the code that was generating this input, to
>>> generate lots of JSONObjects, one per line. Hence, using the default
>>> TextInputFormat, the map function was getting called with the entire JSON.
>>>
>>> A JSONArray is not good for a mapreduce input since it has a first [ and
>>> a last ] and commas between the JSONs of the array. The array can be
>>> represented as the file that the JSONs belong.
>>>
>>> Of course, this approach works only if you can modify what is generating
>>> the input you're talking about.
>>>
>>>
>>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>>>
>>> In that case you have to convert your JSON data into seq files first and
>>>> then do the processing.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Can I use SequenceFileInputFormat to do the same?
>>>>>
>>>>>  --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Since there is no OOTB feature that allows this, you have to write
>>>>>> your custom InputFormat to handle JSON data. Alternatively you could make
>>>>>> use of Pig or Hive as they have builtin JSON support.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>>>> rajeshnagaraju@gmail.com> wrote:
>>>>>>
>>>>>>> 1 simple way is to remove the new line characters so that the
>>>>>>> default record reader and default way the block is read will take care of
>>>>>>> the input splits and JSON will not get affected by the removal of NL
>>>>>>> character
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it
>>>>>>>> will be split into two blocks. Now, since my file is a json file, I cannot
>>>>>>>> use textinputformat. As, every input split(logical) will be a single line
>>>>>>>> of the json file. Which I dont want. Thus, in this case, can I write a
>>>>>>>> custom input format and a custom record reader so that, every input
>>>>>>>> split(logical) will have only that part of data which I require.
>>>>>>>>
>>>>>>>> For. e.g:
>>>>>>>>
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>>>> 1458938.970140 ] ] } }
>>>>>>>> ,
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>>>> 1458829.592709 ] ] } }
>>>>>>>>
>>>>>>>> *I want here the every input split to consist of entire type data
>>>>>>>> and thus, I can process it accordingly by giving relevant k,V pairs to the
>>>>>>>> map function.*
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <dontariq@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi Sugandha,
>>>>>>>>>
>>>>>>>>> Please find my comments embedded below :
>>>>>>>>>
>>>>>>>>>                   No. of mappers are decided as:
>>>>>>>>> Total_File_Size/Max. Block Size. Thus, if the file is smaller than the
>>>>>>>>> block size, only one mapper will be                               invoked.
>>>>>>>>> Right?
>>>>>>>>>                   This is true(but not always). The basic
>>>>>>>>> criteria behind map creation is the logic inside *getSplits*method of
>>>>>>>>> *InputFormat* being used in your                     MR job. It
>>>>>>>>> is the behavior of *file based InputFormats*, typically
>>>>>>>>> sub-classes of *FileInputFormat*, to split the input data into
>>>>>>>>> splits based                     on the total size, in bytes, of the input
>>>>>>>>> files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>>>>>>>>> only 1 mapper will                     be created.
>>>>>>>>>
>>>>>>>>>                   If yes, it means, the map() will be called only
>>>>>>>>> once. Right? In this case, if there are two datanodes with a replication
>>>>>>>>> factor as 1: only one                               datanode(mapper
>>>>>>>>> machine) will perform the task. Right?
>>>>>>>>>                   A mapper is called for each split. Don't get
>>>>>>>>> confused with the MR's split and HDFS's block. Both are different(They may
>>>>>>>>> overlap though, as in                     case of FileInputFormat). HDFS
>>>>>>>>> blocks are physical partitioning of your data, while an InputSplit is just
>>>>>>>>> a logical partitioning. If you have a                       file which is
>>>>>>>>> smaller than the HDFS blocksize then only one split will be created, hence
>>>>>>>>> only 1 mapper will be called. And this will happen on
>>>>>>>>> the node where this file resides.
>>>>>>>>>
>>>>>>>>>                   The map() function is called by all the
>>>>>>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>> slaves, what happens?
>>>>>>>>>                   map() doesn't get called by anybody. It rather
>>>>>>>>> gets created on the node where the chunk of data to be processed resides. A
>>>>>>>>> slave node can run                       multiple mappers based on the
>>>>>>>>> availability of CPU slots.
>>>>>>>>>
>>>>>>>>>                  One more thing to ask: No. of blocks = no. of
>>>>>>>>> mappers. Thus, those many no. of times the map() function will be called
>>>>>>>>> right?
>>>>>>>>>                  No. of blocks = no. of splits = no. of mappers. A
>>>>>>>>> map is called only once per split per node where that split is present.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Bertrand,
>>>>>>>>>>
>>>>>>>>>> As you said, no. of HDFS blocks =  no. of input splits. But this
>>>>>>>>>> is only true when you set isSplittable() as false or when your input file
>>>>>>>>>> size is less than the block size. Also, when it comes to text files, the
>>>>>>>>>> default textinputformat considers each line as one input split which can be
>>>>>>>>>> then read by RecordReader in K,V format.
>>>>>>>>>>
>>>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>>>>>>>>
>>>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>>>
>>>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>>>> really make sense to talk about number of mappers.
>>>>>>>>>>> A task is a jvm that can be launched only if there is a free
>>>>>>>>>>> slot ie for a given slot, at a given time, there will be at maximum only a
>>>>>>>>>>> single task. During the task, the configured Mapper will be instantiated.
>>>>>>>>>>>
>>>>>>>>>>> Always :
>>>>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>>>>
>>>>>>>>>>> And generally :
>>>>>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Bertrand
>>>>>>>>>>>
>>>>>>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>>>>>>> writting a mail.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>>>> slot hosts as mapper. A mapper executes map tasks. If there are more map
>>>>>>>>>>>> tasks than slots obviously there will be multiple rounds of mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>>>> is typically 64MB and can contain a multitude of record, therefore a map
>>>>>>>>>>>> task = run the map() function on all records in the block.
>>>>>>>>>>>>
>>>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore you have to make a distinction between the two
>>>>>>>>>>>> layers. You have a layer for computations which consists of a jobtracker
>>>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for storage. The
>>>>>>>>>>>> HDFS has a namenode and a set of datanodes.
>>>>>>>>>>>>
>>>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>>>> block is in datanode 1, 2 and 3, then the map task associated with this
>>>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>>>> tasktracker 1, 2 or 3. But this is not guaranteed; things can be rearranged.
>>>>>>>>>>>>
>>>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards, Dieter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>>>>> the map() function will be called that many times, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>>>>> things:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max.
>>>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>>>>    3. The map() function is called by all the
>>>>>>>>>>>>>>    datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>>>>>>>    slaves, what happens?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
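
To make the "number of input splits = number of map tasks" point above
concrete, here is a small sketch against the Hadoop 2 mapreduce API that
simply asks an InputFormat how many splits (and hence map tasks) a given
input would produce. The class name SplitCount and the path argument are
illustrative, not something from this thread.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-count");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileInputFormat derives one split per HDFS block by default,
        // so a file smaller than one block yields a single split.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("input splits (= map tasks): " + splits.size());
    }
}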

Re: Mappers vs. Map tasks

Posted by Rajesh Nagaraju <ra...@gmail.com>.
Hi

I have worked on a similar project; I had an M/R job for preprocessing that
would remove the NL chars and ensure that one line holds one JSON object.
The subsequent MRs then do the processing. You can have a workflow defined
for chaining the MRs together.

Thanks and Regards
Rajesh Nagaraju

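As a rough illustration of the preprocessing step described above, here is a
map-only sketch that buffers lines until one JSON object's braces balance and
then emits the object as a single line. It assumes every object's braces
balance within one input split and it ignores braces inside string values;
the class name JsonFlattenMapper is illustrative, not from the thread.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JsonFlattenMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final StringBuilder buffer = new StringBuilder();
    private int depth = 0; // net count of unmatched '{' seen so far

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty() || line.equals(",")) {
            return; // skip blank lines and the commas between array elements
        }
        buffer.append(line);
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '{') depth++;
            else if (c == '}') depth--;
        }
        if (depth == 0 && buffer.length() > 0) {
            // Braces balanced: emit the whole object as one line.
            context.write(NullWritable.get(), new Text(buffer.toString()));
            buffer.setLength(0);
        }
    }
}

Run with zero reducers, the output is the flattened one-object-per-line file
that the follow-up jobs can read with the default TextInputFormat.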

On Thu, Feb 27, 2014 at 9:51 AM, Sugandha Naolekar
<su...@gmail.com> wrote:

> Joao Paulo,
>
> Your suggestion is appreciated. Although, on a side note, what is more
> tedious: Writing a custom InputFormat or changing the code which is
> generating the input splits?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com> wrote:
>
>> If I understood your problem correctly, you have one huge JSON, which is
>> basically a JSONArray, and you want to process one JSONObject of the array
>> at a time.
>>
>> I have faced the same issue some time ago and instead of changing the
>> input format, I changed the code that was generating this input, to
>> generate lots of JSONObjects, one per line. Hence, using the default
>> TextInputFormat, the map function was getting called with one entire
>> JSON object at a time.
>>
>> A JSONArray is not good for a mapreduce input since it has a first [ and
>> a last ] and commas between the JSONs of the array. The array itself can
>> be represented by the file that the JSON objects belong to.
>>
>> Of course, this approach works only if you can modify what is generating
>> the input you're talking about.
>>
>>
>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>>
>> In that case you have to convert your JSON data into seq files first and
>>> then do the processing.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Can I use SequenceFileInputFormat to do the same?
>>>>
>>>>  --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>>>
>>>>> Since there is no OOTB feature that allows this, you have to write
>>>>> your custom InputFormat to handle JSON data. Alternatively you could make
>>>>> use of Pig or Hive as they have built-in JSON support.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>>> rajeshnagaraju@gmail.com> wrote:
>>>>>
>>>>>> One simple way is to remove the newline characters, so that the
>>>>>> default record reader and the default way the block is read will take
>>>>>> care of the input splits, and the JSON will not be affected by the
>>>>>> removal of the NL characters.
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>
>>>>>>> Ok. Got it. Now I have a single file which is 129MB. Thus, it
>>>>>>> will be split into two blocks. Now, since my file is a JSON file, I
>>>>>>> cannot use TextInputFormat, as every input split (logical) will be a
>>>>>>> single line of the JSON file, which I don't want. Thus, in this case,
>>>>>>> can I write a custom input format and a custom record reader so that
>>>>>>> every input split (logical) will have only that part of the data which
>>>>>>> I require?
>>>>>>>
>>>>>>> For. e.g:
>>>>>>>
>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>>> 1458938.970140 ] ] } }
>>>>>>> ,
>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>>> 1458829.592709 ] ] } }
>>>>>>>
>>>>>>> *What I want here is for every input split to consist of one entire
>>>>>>> "Feature" object, so that I can process it accordingly by passing
>>>>>>> relevant K,V pairs to the map function.*
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sugandha,
>>>>>>>>
>>>>>>>> Please find my comments embedded below:
>>>>>>>>
>>>>>>>>                   No. of mappers are decided as:
>>>>>>>> Total_File_Size/Max. Block Size. Thus, if the file is smaller than the
>>>>>>>> block size, only one mapper will be invoked. Right?
>>>>>>>>                   This is true (but not always). The basic criterion
>>>>>>>> behind map creation is the logic inside the *getSplits* method of the
>>>>>>>> *InputFormat* being used in your MR job. It is the behavior of *file
>>>>>>>> based InputFormats*, typically sub-classes of *FileInputFormat*, to
>>>>>>>> split the input data into splits based on the total size, in bytes, of
>>>>>>>> the input files. See
>>>>>>>> *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>>>>> for more details. And yes, if the file is smaller than the block size
>>>>>>>> then only 1 mapper will be created.
>>>>>>>>
>>>>>>>>                   If yes, it means the map() will be called only
>>>>>>>> once. Right? In this case, if there are two datanodes with a replication
>>>>>>>> factor as 1: only one datanode (mapper machine) will perform the task.
>>>>>>>> Right?
>>>>>>>>                   A mapper is called for each split. Don't confuse
>>>>>>>> MR's split with HDFS's block. Both are different (they may overlap,
>>>>>>>> though, as in the case of FileInputFormat). HDFS blocks are a physical
>>>>>>>> partitioning of your data, while an InputSplit is just a logical
>>>>>>>> partitioning. If you have a file which is smaller than the HDFS block
>>>>>>>> size then only one split will be created, hence only 1 mapper will be
>>>>>>>> called. And this will happen on the node where this file resides.
>>>>>>>>
>>>>>>>>                   The map() function is called by all the
>>>>>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>> slaves, what happens?
>>>>>>>>                   map() doesn't get called by anybody. Rather, a map
>>>>>>>> task gets launched on the node where the chunk of data to be processed
>>>>>>>> resides. A slave node can run multiple mappers based on the
>>>>>>>> availability of CPU slots.
>>>>>>>>
>>>>>>>>                  One more thing to ask: No. of blocks = no. of
>>>>>>>> mappers. Thus, the map() function will be called that many times,
>>>>>>>> right?
>>>>>>>>                  No. of blocks = no. of splits = no. of mappers. A
>>>>>>>> map task runs only once per split, on a node where that split's data
>>>>>>>> is present.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Bertrand,
>>>>>>>>>
>>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this
>>>>>>>>> is only true when you set isSplittable() as false or when your input
>>>>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>>>>> split which can then be read by the RecordReader in K,V format.
>>>>>>>>>
>>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The wiki (or Hadoop The Definitive Guide) are good resources.
>>>>>>>>>>
>>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>>
>>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>>> really make sense to talk about number of mappers.
>>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>>>> single task. During the task, the configured Mapper will be instantiated.
>>>>>>>>>>
>>>>>>>>>> Always:
>>>>>>>>>> number of input splits = number of map tasks
>>>>>>>>>>
>>>>>>>>>> And generally:
>>>>>>>>>> number of HDFS blocks = number of input splits
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Bertrand
>>>>>>>>>>
>>>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>>>> writing a mail.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>>> slot hosts a mapper. A mapper executes map tasks. If there are more map
>>>>>>>>>>> tasks than slots obviously there will be multiple rounds of mapping.
>>>>>>>>>>>
>>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>>> is typically 64MB and can contain a multitude of records, therefore a map
>>>>>>>>>>> task = run the map() function on all records in the block.
>>>>>>>>>>>
>>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>>
>>>>>>>>>>> Furthermore you have to make a distinction between the two
>>>>>>>>>>> layers. You have a layer for computations which consists of a jobtracker
>>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for storage. The
>>>>>>>>>>> HDFS has a namenode and a set of datanodes.
>>>>>>>>>>>
>>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>>> block is in datanode 1, 2 and 3, then the map task associated with this
>>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>>> tasktracker 1, 2 or 3. But this is not guaranteed; things can be rearranged.
>>>>>>>>>>>
>>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>>
>>>>>>>>>>> Regards, Dieter
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>>>> the map() function will be called that many times, right?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>>>> things:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max.
>>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>>>    3. The map() function is called by all the
>>>>>>>>>>>>>    datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>>>>>>    slaves, what happens?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

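A side note on the isSplittable() point that keeps coming up in the quoted
thread: in the mapreduce API the hook is FileInputFormat's protected
isSplitable() method (note the single "t"), and a minimal sketch of forcing
one split per file looks like this (class name illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // The whole file becomes a single split, hence a single map task.
        return false;
    }
}
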
Re: Mappers vs. Map tasks

Posted by Shekhar Sharma <sh...@gmail.com>.
Writing a custom input format would be much easier, and you would have
better control.
You might be tempted to use the Jackson lib to process the JSON, but that
requires knowing the shape of your JSON data up front, and that assumption
would break if your data format changes.

I would suggest writing a custom record reader where you parse the JSON and
create your own key-value pairs.
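
In that spirit, here is a minimal sketch of such a record reader: it wraps the
standard LineRecordReader and glues lines together until one object's braces
balance, so each map() call sees one whole JSON object as its value. The names
JsonInputFormat and JsonRecordReader are illustrative; braces inside string
values and objects spanning split boundaries are deliberately ignored, so
treat this as a starting point rather than a finished implementation.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class JsonInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new JsonRecordReader();
    }

    public static class JsonRecordReader
            extends RecordReader<LongWritable, Text> {

        private final LineRecordReader lines = new LineRecordReader();
        private LongWritable key;
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder object = new StringBuilder();
            int depth = 0;
            boolean started = false;
            while (lines.nextKeyValue()) {
                String line = lines.getCurrentValue().toString();
                if (!started) {
                    key = new LongWritable(lines.getCurrentKey().get());
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (c == '{') { depth++; started = true; }
                    else if (c == '}') depth--;
                }
                if (started) {
                    object.append(line.trim());
                }
                if (started && depth == 0) {
                    value.set(object.toString());
                    return true; // one complete JSON object per record
                }
            }
            return false; // no more complete objects in this split
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException {
            return lines.getProgress();
        }
        @Override public void close() throws IOException { lines.close(); }
    }
}

A job would opt in with job.setInputFormatClass(JsonInputFormat.class), after
which the record boundaries follow the JSON objects instead of the lines.
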
On 27 Feb 2014 09:52, "Sugandha Naolekar" <su...@gmail.com> wrote:

> Joao Paulo,
>
> Your suggestion is appreciated. Although, on a side note, what is more
> tedious: Writing a custom InputFormat or changing the code which is
> generating the input splits?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com> wrote:
>
>> If I understood your problem correctly, you have one huge JSON, which is
>> basically a JSONArray, and you want to process one JSONObject of the array
>> at a time.
>>
>> I have faced the same issue some time ago and instead of changing the
>> input format, I changed the code that was generating this input, to
>> generate lots of JSONObjects, one per line. Hence, using the default
>> TextInputFormat, the map function was getting called with one entire
>> JSON object at a time.
>>
>> A JSONArray is not good for a mapreduce input since it has a first [ and
>> a last ] and commas between the JSONs of the array. The array itself can
>> be represented by the file that the JSON objects belong to.
>>
>> Of course, this approach works only if you can modify what is generating
>> the input you're talking about.
>>
>>
>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>>
>> In that case you have to convert your JSON data into seq files first and
>>> then do the processing.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Can I use SequenceFileInputFormat to do the same?
>>>>
>>>>  --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>>>
>>>>> Since there is no OOTB feature that allows this, you have to write
>>>>> your custom InputFormat to handle JSON data. Alternatively you could make
>>>>> use of Pig or Hive as they have built-in JSON support.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>>> rajeshnagaraju@gmail.com> wrote:
>>>>>
>>>>>> One simple way is to remove the newline characters, so that the
>>>>>> default record reader and the default way the block is read will take
>>>>>> care of the input splits, and the JSON will not be affected by the
>>>>>> removal of the NL characters.
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>
>>>>>>> Ok. Got it. Now I have a single file which is 129MB. Thus, it
>>>>>>> will be split into two blocks. Now, since my file is a JSON file, I
>>>>>>> cannot use TextInputFormat, as every input split (logical) will be a
>>>>>>> single line of the JSON file, which I don't want. Thus, in this case,
>>>>>>> can I write a custom input format and a custom record reader so that
>>>>>>> every input split (logical) will have only that part of the data which
>>>>>>> I require?
>>>>>>>
>>>>>>> For. e.g:
>>>>>>>
>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>>> 1458938.970140 ] ] } }
>>>>>>> ,
>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>>> 1458829.592709 ] ] } }
>>>>>>>
>>>>>>> *What I want here is for every input split to consist of one entire
>>>>>>> "Feature" object, so that I can process it accordingly by passing
>>>>>>> relevant K,V pairs to the map function.*
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sugandha,
>>>>>>>>
>>>>>>>> Please find my comments embedded below:
>>>>>>>>
>>>>>>>>                   No. of mappers are decided as:
>>>>>>>> Total_File_Size/Max. Block Size. Thus, if the file is smaller than the
>>>>>>>> block size, only one mapper will be invoked. Right?
>>>>>>>>                   This is true (but not always). The basic criterion
>>>>>>>> behind map creation is the logic inside the *getSplits* method of the
>>>>>>>> *InputFormat* being used in your MR job. It is the behavior of *file
>>>>>>>> based InputFormats*, typically sub-classes of *FileInputFormat*, to
>>>>>>>> split the input data into splits based on the total size, in bytes, of
>>>>>>>> the input files. See
>>>>>>>> *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>>>>> for more details. And yes, if the file is smaller than the block size
>>>>>>>> then only 1 mapper will be created.
>>>>>>>>
>>>>>>>>                   If yes, it means the map() will be called only
>>>>>>>> once. Right? In this case, if there are two datanodes with a replication
>>>>>>>> factor as 1: only one datanode (mapper machine) will perform the task.
>>>>>>>> Right?
>>>>>>>>                   A mapper is called for each split. Don't confuse
>>>>>>>> MR's split with HDFS's block. Both are different (they may overlap,
>>>>>>>> though, as in the case of FileInputFormat). HDFS blocks are a physical
>>>>>>>> partitioning of your data, while an InputSplit is just a logical
>>>>>>>> partitioning. If you have a file which is smaller than the HDFS block
>>>>>>>> size then only one split will be created, hence only 1 mapper will be
>>>>>>>> called. And this will happen on the node where this file resides.
>>>>>>>>
>>>>>>>>                   The map() function is called by all the
>>>>>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>> slaves, what happens?
>>>>>>>>                   map() doesn't get called by anybody. Rather, a map
>>>>>>>> task gets launched on the node where the chunk of data to be processed
>>>>>>>> resides. A slave node can run multiple mappers based on the
>>>>>>>> availability of CPU slots.
>>>>>>>>
>>>>>>>>                  One more thing to ask: No. of blocks = no. of
>>>>>>>> mappers. Thus, the map() function will be called that many times,
>>>>>>>> right?
>>>>>>>>                  No. of blocks = no. of splits = no. of mappers. A
>>>>>>>> map task runs only once per split, on a node where that split's data
>>>>>>>> is present.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Bertrand,
>>>>>>>>>
>>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this
>>>>>>>>> is only true when you set isSplittable() as false or when your input
>>>>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>>>>> split which can then be read by the RecordReader in K,V format.
>>>>>>>>>
>>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The wiki (or Hadoop The Definitive Guide) are good resources.
>>>>>>>>>>
>>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>>
>>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>>> really make sense to talk about number of mappers.
>>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>>>> single task. During the task, the configured Mapper will be instantiated.
>>>>>>>>>>
>>>>>>>>>> Always:
>>>>>>>>>> number of input splits = number of map tasks
>>>>>>>>>>
>>>>>>>>>> And generally:
>>>>>>>>>> number of HDFS blocks = number of input splits
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Bertrand
>>>>>>>>>>
>>>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>>>> writing a mail.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>>> slot hosts a mapper. A mapper executes map tasks. If there are more map
>>>>>>>>>>> tasks than slots obviously there will be multiple rounds of mapping.
>>>>>>>>>>>
>>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>>> is typically 64MB and can contain a multitude of records, therefore a map
>>>>>>>>>>> task = run the map() function on all records in the block.
>>>>>>>>>>>
>>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>>
>>>>>>>>>>> Furthermore you have to make a distinction between the two
>>>>>>>>>>> layers. You have a layer for computations which consists of a jobtracker
>>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for storage. The
>>>>>>>>>>> HDFS has a namenode and a set of datanodes.
>>>>>>>>>>>
>>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>>> block is in datanode 1, 2 and 3, then the map task associated with this
>>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>>> tasktracker 1, 2 or 3. But this is not guaranteed; things can be rearranged.
>>>>>>>>>>>
>>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>>
>>>>>>>>>>> Regards, Dieter
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>>>> the map() function will be called that many times, right?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>>>> things:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max.
>>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>>>    3. The map() function is called by all the
>>>>>>>>>>>>>    datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>>>>>>    slaves, what happens?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
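
For the seq-file route Tariq mentions in the quoted thread, here is one hedged
sketch of the conversion step, assuming the JSON has already been flattened to
one object per line; the class name JsonToSequenceFile and the argument layout
are illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class JsonToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);   // text file, one JSON object per line
        Path out = new Path(args[1]);  // sequence file to write
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(in)));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(out),
                     SequenceFile.Writer.keyClass(LongWritable.class),
                     SequenceFile.Writer.valueClass(Text.class))) {
            String line;
            long n = 0;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    // Key is a running record number; value is the object.
                    writer.append(new LongWritable(n++), new Text(line));
                }
            }
        }
    }
}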

Re: Mappers vs. Map tasks

Posted by Shekhar Sharma <sh...@gmail.com>.
Writing a custom input format would be much easier and you would have
better control
You might be tempted to use Jackson lib to do the process json  but that
requires that you need to know your Json data and this assumption would
break if your format of data changes

I would suggest write a custom record reader where you parse the json and
create your own key value pairs
On 27 Feb 2014 09:52, "Sugandha Naolekar" <su...@gmail.com> wrote:

> Joao Paulo,
>
> Your suggestion is appreciated. Although, on a side note, what is more
> tedious: Writing a custom InputFormat or changing the code which is
> generating the input splits.?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com>wrote:
>
>> If I understood your problem correctly, you have one huge JSON, which is
>> basically a JSONArray, and you want to process one JSONObject of the array
>> at a time.
>>
>> I have faced the same issue some time ago and instead of changing the
>> input format, I changed the code that was generating this input, to
>> generate lots of JSONObjects, one per line. Hence, using the default
>> TextInputFormat, the map function was getting called with the entire JSON.
>>
>> A JSONArray is not good for a mapreduce input since it has a first [ and
>> a last ] and commas between the JSONs of the array. The array can be
>> represented as the file that the JSONs belong.
>>
>> Of course, this approach works only if you can modify what is generating
>> the input you're talking about.
>>
>>
>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>>
>> In that case you have to convert your JSON data into seq files first and
>>> then do the processing.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Can I use SequenceFileInputFormat to do the same?
>>>>
>>>>  --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> Since there is no OOTB feature that allows this, you have to write
>>>>> your custom InputFormat to handle JSON data. Alternatively you could make
>>>>> use of Pig or Hive as they have builtin JSON support.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>>> rajeshnagaraju@gmail.com> wrote:
>>>>>
>>>>>> 1 simple way is to remove the new line characters so that the default
>>>>>> record reader and default way the block is read will take care of the input
>>>>>> splits and JSON will not get affected by the removal of NL character
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>
>>>>>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it
>>>>>>> will be split into two blocks. Now, since my file is a json file, I cannot
>>>>>>> use textinputformat. As, every input split(logical) will be a single line
>>>>>>> of the json file. Which I dont want. Thus, in this case, can I write a
>>>>>>> custom input format and a custom record reader so that, every input
>>>>>>> split(logical) will have only that part of data which I require.
>>>>>>>
>>>>>>> For. e.g:
>>>>>>>
>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>>> 1458938.970140 ] ] } }
>>>>>>> ,
>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>>> 1458829.592709 ] ] } }
>>>>>>>
>>>>>>> *I want here the every input split to consist of entire type data
>>>>>>> and thus, I can process it accordingly by giving relevant k,V pairs to the
>>>>>>> map function.*
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Hi Sugandha,
>>>>>>>>
>>>>>>>> Please find my comments embedded below :
>>>>>>>>
>>>>>>>>                   No. of mappers are decided as:
>>>>>>>> Total_File_Size/Max. Block Size. Thus, if the file is smaller than the
>>>>>>>> block size, only one mapper will be                               invoked.
>>>>>>>> Right?
>>>>>>>>                   This is true(but not always). The basic criteria
>>>>>>>> behind map creation is the logic inside *getSplits* method of
>>>>>>>> *InputFormat* being used in your                     MR job. It is
>>>>>>>> the behavior of *file based InputFormats*, typically sub-classes
>>>>>>>> of *FileInputFormat*, to split the input data into splits based
>>>>>>>>                   on the total size, in bytes, of the input files. See
>>>>>>>> *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>>>>>>>> only 1 mapper will                     be created.
>>>>>>>>
>>>>>>>>                   If yes, it means, the map() will be called only
>>>>>>>> once. Right? In this case, if there are two datanodes with a replication
>>>>>>>> factor as 1: only one                               datanode(mapper
>>>>>>>> machine) will perform the task. Right?
>>>>>>>>                   A mapper is called for each split. Don't get
>>>>>>>> confused with the MR's split and HDFS's block. Both are different(They may
>>>>>>>> overlap though, as in                     case of FileInputFormat). HDFS
>>>>>>>> blocks are physical partitioning of your data, while an InputSplit is just
>>>>>>>> a logical partitioning. If you have a                       file which is
>>>>>>>> smaller than the HDFS blocksize then only one split will be created, hence
>>>>>>>> only 1 mapper will be called. And this will happen on
>>>>>>>> the node where this file resides.
>>>>>>>>
>>>>>>>>                   The map() function is called by all the
>>>>>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>> slaves, what happens?
>>>>>>>>                   map() doesn't get called by anybody. It rather
>>>>>>>> gets created on the node where the chunk of data to be processed resides. A
>>>>>>>> slave node can run                       multiple mappers based on the
>>>>>>>> availability of CPU slots.
>>>>>>>>
>>>>>>>>                  One more thing to ask: No. of blocks = no. of
>>>>>>>> mappers. Thus, those many no. of times the map() function will be called
>>>>>>>> right?
>>>>>>>>                  No. of blocks = no. of splits = no. of mappers. A
>>>>>>>> map is called only once per split per node where that split is present.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Bertrand,
>>>>>>>>>
>>>>>>>>> As you said, no. of HDFS blocks =  no. of input splits. But this
>>>>>>>>> is only true when you set isSplittable() as false or when your input file
>>>>>>>>> size is less than the block size. Also, when it comes to text files, the
>>>>>>>>> default textinputformat considers each line as one input split which can be
>>>>>>>>> then read by RecordReader in K,V format.
>>>>>>>>>
>>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>>>>>>>
>>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>>
>>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>>> really make sense to talk about number of mappers.
>>>>>>>>>> A task is a jvm that can be launched only if there is a free slot
>>>>>>>>>> ie for a given slot, at a given time, there will be at maximum only a
>>>>>>>>>> single task. During the task, the configured Mapper will be instantiated.
>>>>>>>>>>
>>>>>>>>>> Always :
>>>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>>>
>>>>>>>>>> And generally :
>>>>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Bertrand
>>>>>>>>>>
>>>>>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>>>>>> writting a mail.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>>> slot hosts as mapper. A mapper executes map tasks. If there are more map
>>>>>>>>>>> tasks than slots obviously there will be multiple rounds of mapping.
>>>>>>>>>>>
>>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>>> is typically 64MB and can contain a multitude of record, therefore a map
>>>>>>>>>>> task = run the map() function on all records in the block.
>>>>>>>>>>>
>>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>>
>>>>>>>>>>> Furthermore you have to make a distinction between the two
>>>>>>>>>>> layers. You have a layer for computations which consists of a jobtracker
>>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for storage. The
>>>>>>>>>>> HDFS has a namenode and a set of datanodes.
>>>>>>>>>>>
>>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>>> block is in datanode 1, 2 and 3, then the map task associated with this
>>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>>> tasktracker 1, 2 or 3. But this is not necessary, thing can be rearranged.
>>>>>>>>>>>
>>>>>>>>>>> Hopefully this gives you a little more insigth.
>>>>>>>>>>>
>>>>>>>>>>> Regards, Dieter
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>>>> those many no. of times the map() function will be called right?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>>>> things:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max.
>>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>>>    3. The map() function is called by all the
>>>>>>>>>>>>>    datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>>>>>>>>>>    slaves, what happens?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Rajesh Nagaraju <ra...@gmail.com>.
Hi

I have worked on a similiar project, I had a M/R for preprocessing that
would remove the NL chars and ensure that 1 line has a json object.
Then the consequent MRs do the processing. You can have a work flow defined
for binding the MRs

Thanks and Regards
Rajesh Nagaraju


On Thu, Feb 27, 2014 at 9:51 AM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Joao Paulo,
>
> Your suggestion is appreciated. Although, on a side note, what is more
> tedious: Writing a custom InputFormat or changing the code which is
> generating the input splits.?
>
> --
> Thanks & Regards,
> Sugandha Naolekar

Re: Mappers vs. Map tasks

Posted by Shekhar Sharma <sh...@gmail.com>.
Writing a custom input format would be much easier, and you would have
better control. You might be tempted to use the Jackson lib to process the
JSON, but that requires knowing the structure of your JSON data up front,
and that assumption would break if your data format changes.

I would suggest writing a custom record reader where you parse the JSON and
create your own key/value pairs.
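
A minimal sketch of what I mean is below. This is illustration only, not
production code: the brace counting is naive (it would break on braces
inside string values), the format is marked non-splittable just to keep
the sketch short, the class names are made up, and it assumes the new
org.apache.hadoop.mapreduce API.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// An InputFormat whose record reader emits one balanced {...} JSON
// object per record, regardless of how the objects are spread across
// lines in the file.
public class JsonObjectInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one split per file, so no object straddles a split
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new JsonObjectRecordReader();
    }

    public static class JsonObjectRecordReader
            extends RecordReader<LongWritable, Text> {

        private FSDataInputStream in;
        private long length;
        private long pos;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext ctx)
                throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            length = split.getLength(); // whole file, since non-splittable
            FileSystem fs = split.getPath().getFileSystem(ctx.getConfiguration());
            in = fs.open(split.getPath());
            pos = 0;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder current = new StringBuilder();
            int depth = 0;
            int b;
            while (pos < length && (b = in.read()) != -1) {
                pos++;
                char c = (char) b;
                if (c == '{') {
                    depth++;                       // entering a (nested) object
                }
                if (depth > 0) {
                    current.append(c);             // keep only bytes inside {...}
                }
                if (c == '}' && depth > 0 && --depth == 0) {
                    key.set(pos);                  // byte offset past the object
                    value.set(current.toString()); // one complete JSON object
                    return true;
                }
            }
            return false;                          // no complete object left
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return length == 0 ? 1.0f : Math.min(1.0f, pos / (float) length);
        }

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }
}

With this in place the driver would just call
job.setInputFormatClass(JsonObjectInputFormat.class), and every map()
call would receive one complete object (one "Feature") as its value.
Here the value is still the raw object text; you would parse it in the
reader or in the mapper to build your own key/value pairs.
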
On 27 Feb 2014 09:52, "Sugandha Naolekar" <su...@gmail.com> wrote:

> Joao Paulo,
>
> Your suggestion is appreciated. Although, on a side note, what is more
> tedious: writing a custom InputFormat, or changing the code which is
> generating the input?
>
> --
> Thanks & Regards,
> Sugandha Naolekar

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
Joao Paulo,

Your suggestion is appreciated. Although, on a side note, what is more
tedious: writing a custom InputFormat, or changing the code which is
generating the input?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com> wrote:

> If I understood your problem correctly, you have one huge JSON, which is
> basically a JSONArray, and you want to process one JSONObject of the array
> at a time.
>
> I have faced the same issue some time ago, and instead of changing the
> input format, I changed the code that was generating this input to
> generate lots of JSONObjects, one per line. Hence, using the default
> TextInputFormat, the map function was getting called with one entire
> JSONObject at a time.
>
> A JSONArray is not good for a mapreduce input, since it has a first [ and a
> last ], and commas between the JSONs of the array. The array can instead be
> represented by the file that the JSON objects belong to.
>
> Of course, this approach works only if you can modify what is generating
> the input you're talking about.
>
>
> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>
> In that case you have to convert your JSON data into seq files first and
>> then do the processing.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Can I use SequenceFileInputFormat to do the same?
>>>
>>>  --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Since there is no OOTB feature that allows this, you have to write your
>>>> custom InputFormat to handle JSON data. Alternatively you could make use of
>>>> Pig or Hive as they have built-in JSON support.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>> rajeshnagaraju@gmail.com> wrote:
>>>>
>>>>> One simple way is to remove the newline characters, so that the default
>>>>> record reader and the default way the block is read will take care of the
>>>>> input splits, and the JSON will not get affected by the removal of the NL
>>>>> characters.
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Ok, got it. Now I have a single file which is 129MB; thus it will be
>>>>>> split into two blocks. Now, since my file is a JSON file, I cannot use
>>>>>> TextInputFormat, as every (logical) input split would be a single line
>>>>>> of the JSON file, which I don't want. Thus, in this case, can I write a
>>>>>> custom input format and a custom record reader so that every (logical)
>>>>>> input split will have only that part of the data which I require?
>>>>>>
>>>>>> For. e.g:
>>>>>>
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>> 1458938.970140 ] ] } }
>>>>>> ,
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>> 1458829.592709 ] ] } }
>>>>>>
>>>>>> *I want every input split here to consist of one entire "Feature"
>>>>>> object, so that I can process it accordingly by giving relevant K,V
>>>>>> pairs to the map function.*
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> Hi Sugandha,
>>>>>>>
>>>>>>> Please find my comments embedded below :
>>>>>>>
>>>>>>> No. of mappers are decided as: Total_File_Size/Max. Block Size.
>>>>>>> Thus, if the file is smaller than the block size, only one mapper
>>>>>>> will be invoked. Right?
>>>>>>>
>>>>>>>     This is true (but not always). The basic criterion behind map
>>>>>>>     creation is the logic inside the *getSplits* method of the
>>>>>>>     *InputFormat* being used in your MR job. It is the behavior of
>>>>>>>     *file based InputFormats*, typically sub-classes of
>>>>>>>     *FileInputFormat*, to split the input data into splits based on
>>>>>>>     the total size, in bytes, of the input files. See *this*
>>>>>>>     <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>>>>     for more details. And yes, if the file is smaller than the
>>>>>>>     block size then only 1 mapper will be created.
>>>>>>>
>>>>>>> If yes, it means, the map() will be called only once. Right? In
>>>>>>> this case, if there are two datanodes with a replication factor as
>>>>>>> 1: only one datanode (mapper machine) will perform the task. Right?
>>>>>>>
>>>>>>>     A mapper is called for each split. Don't get confused between
>>>>>>>     the MR split and the HDFS block. The two are different (they may
>>>>>>>     overlap though, as in the case of FileInputFormat). HDFS blocks
>>>>>>>     are a physical partitioning of your data, while an InputSplit is
>>>>>>>     just a logical partitioning. If you have a file which is smaller
>>>>>>>     than the HDFS block size then only one split will be created,
>>>>>>>     hence only 1 mapper will be called. And this will happen on the
>>>>>>>     node where this file resides.
>>>>>>>
>>>>>>> The map() function is called by all the datanodes/slaves right? If
>>>>>>> the no. of mappers is more than the no. of slaves, what happens?
>>>>>>>
>>>>>>>     map() doesn't get called by anybody. The map task rather gets
>>>>>>>     run on the node where the chunk of data to be processed resides.
>>>>>>>     A slave node can run multiple mappers based on the availability
>>>>>>>     of CPU slots.
>>>>>>>
>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>>> many no. of times the map() function will be called, right?
>>>>>>>
>>>>>>>     No. of blocks = no. of splits = no. of mappers. A map is called
>>>>>>>     only once per split per node where that split is present.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Bertrand,
>>>>>>>>
>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>>>> only true when you set isSplitable() to false or when your input file
>>>>>>>> size is less than the block size. Also, when it comes to text files,
>>>>>>>> the default TextInputFormat considers each line as one input split,
>>>>>>>> which can then be read by the RecordReader in K,V format.
>>>>>>>>
>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The wiki (or Hadoop: The Definitive Guide) are good resources.
>>>>>>>>>
>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>
>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>> really make sense to talk about the number of mappers.
>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>>>> instantiated.
>>>>>>>>>
>>>>>>>>> Always:
>>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>>
>>>>>>>>> And generally:
>>>>>>>>> Number of HDFS blocks = number of input splits
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Bertrand
>>>>>>>>>
>>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>>> writing a mail.
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>>>>> hosts a mapper, and a mapper executes map tasks. If there are more
>>>>>>>>>> map tasks than slots, obviously there will be multiple rounds of
>>>>>>>>>> mapping.
>>>>>>>>>>
>>>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>>>> typically 64MB and can contain a multitude of records; therefore a
>>>>>>>>>> map task = running the map() function on all records in the block.
>>>>>>>>>>
>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>
>>>>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>>>>> You have a layer for computations, which consists of a jobtracker
>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for
>>>>>>>>>> storage: the HDFS has a namenode and a set of datanodes.
>>>>>>>>>>
>>>>>>>>>> In mapreduce the code is executed where the data is. So if a block
>>>>>>>>>> is on datanodes 1, 2 and 3, then the map task associated with this
>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>> tasktracker 1, 2 or 3. But this is not necessarily so; things can
>>>>>>>>>> be rearranged.
>>>>>>>>>>
>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>
>>>>>>>>>> Regards, Dieter
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>>>>>> map() function will be called that many times, right?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>>> things:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max.
>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
Joao Paulo,

Your suggestion is appreciated. Although, on a side note, which is more
tedious: writing a custom InputFormat, or changing the code that is
generating the input?
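
(For a rough sense of the first option: a minimal custom InputFormat usually
follows the classic "whole file as one record" pattern sketched below. This
is only an illustrative sketch against the new mapreduce API, not code from
this thread; the class names are invented, and error handling is elided.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hands each whole file to a single map() call as one record.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one split, and hence one map task, per file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire split (the whole file) into one BytesWritable value.
    static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}

(The job setup would then just call
job.setInputFormatClass(WholeFileInputFormat.class); a record reader that
finds JSON object boundaries inside a split takes noticeably more code than
this.)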

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jp...@gmail.com> wrote:

> If I understood your problem correctly, you have one huge JSON, which is
> basically a JSONArray, and you want to process one JSONObject of the array
> at a time.
>
> I have faced the same issue some time ago and instead of changing the
> input format, I changed the code that was generating this input, to
> generate lots of JSONObjects, one per line. Hence, using the default
> TextInputFormat, the map function was getting called with the entire JSON.
>
> A JSONArray is not good for a mapreduce input since it has a first [ and a
> last ] and commas between the JSONs of the array. The array can be
> represented as the file that the JSONs belong.
>
> Of course, this approach works only if you can modify what is generating
> the input you're talking about.
>
>
> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:
>
> In that case you have to convert your JSON data into seq files first and
>> then do the processing.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Can I use SequenceFileInputFormat to do the same?
>>>
>>>  --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Since there is no OOTB feature that allows this, you have to write your
>>>> custom InputFormat to handle JSON data. Alternatively you could make use of
>>>> Pig or Hive as they have builtin JSON support.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>> rajeshnagaraju@gmail.com> wrote:
>>>>
>>>>> One simple way is to remove the newline characters, so that the default
>>>>> record reader and the default way the block is read will take care of
>>>>> the input splits, and the JSON will not get affected by the removal of
>>>>> the NL characters.
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Ok. Got it. Now I have a single file which is 129MB, so it will be
>>>>>> split into two blocks. Since my file is a JSON file, I cannot use
>>>>>> TextInputFormat, as every (logical) input split would be a single
>>>>>> line of the JSON file, which I don't want. In this case, can I write
>>>>>> a custom input format and a custom record reader so that every
>>>>>> (logical) input split will have only that part of the data which I
>>>>>> require?
>>>>>>
>>>>>> For. e.g:
>>>>>>
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>> 1458938.970140 ] ] } }
>>>>>> ,
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>> 1458829.592709 ] ] } }
>>>>>>
>>>>>> *I want every input split here to consist of one entire Feature
>>>>>> record, so that I can process it accordingly by passing the relevant
>>>>>> K,V pairs to the map function.*
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>>
>>>>>>> Hi Sugandha,
>>>>>>>
>>>>>>> Please find my comments embedded below:
>>>>>>>
>>>>>>> Q: No. of mappers are decided as Total_File_Size/Max. Block Size. Thus,
>>>>>>> if the file is smaller than the block size, only one mapper will be
>>>>>>> invoked. Right?
>>>>>>> A: This is true (but not always). The basic criterion behind map
>>>>>>> creation is the logic inside the *getSplits* method of the *InputFormat*
>>>>>>> being used in your MR job. It is the behavior of *file based
>>>>>>> InputFormats*, typically sub-classes of *FileInputFormat*, to split the
>>>>>>> input data into splits based on the total size, in bytes, of the input
>>>>>>> files. See *this*
>>>>>>> <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>>>> for more details. And yes, if the file is smaller than the block size
>>>>>>> then only 1 mapper will be created.
>>>>>>>
>>>>>>> Q: If yes, it means the map() will be called only once. Right? In this
>>>>>>> case, if there are two datanodes with a replication factor of 1, only
>>>>>>> one datanode (mapper machine) will perform the task. Right?
>>>>>>> A: A mapper is called for each split. Don't get confused between MR's
>>>>>>> split and HDFS's block. The two are different (they may overlap, though,
>>>>>>> as in the case of FileInputFormat). HDFS blocks are a physical
>>>>>>> partitioning of your data, while an InputSplit is just a logical
>>>>>>> partitioning. If you have a file which is smaller than the HDFS block
>>>>>>> size then only one split will be created, hence only 1 mapper will be
>>>>>>> called. And this will happen on the node where this file resides.
>>>>>>>
>>>>>>> Q: The map() function is called by all the datanodes/slaves, right? If
>>>>>>> the no. of mappers is more than the no. of slaves, what happens?
>>>>>>> A: map() doesn't get called by anybody; rather, a map task gets created
>>>>>>> on the node where the chunk of data to be processed resides. A slave
>>>>>>> node can run multiple mappers based on the availability of CPU slots.
>>>>>>>
>>>>>>> Q: One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>> map() function will be called that many times. Right?
>>>>>>> A: No. of blocks = no. of splits = no. of mappers. A map is called only
>>>>>>> once per split, on a node where that split's data is present.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Bertrand,
>>>>>>>>
>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>>>> only true when you set isSplitable() to false or when your input file
>>>>>>>> size is less than the block size. Also, when it comes to text files,
>>>>>>>> the default TextInputFormat considers each line as one input split,
>>>>>>>> which can then be read by the RecordReader in K,V format.
>>>>>>>>
>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The wiki (or Hadoop: The Definitive Guide) are good resources.
>>>>>>>>>
>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>
>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>> really make sense to talk about the number of mappers.
>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>>>> instantiated.
>>>>>>>>>
>>>>>>>>> Always:
>>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>>
>>>>>>>>> And generally:
>>>>>>>>> Number of HDFS blocks = number of input splits
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Bertrand
>>>>>>>>>
>>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>>> writing a mail.
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>>>>> hosts a mapper, and a mapper executes map tasks. If there are more
>>>>>>>>>> map tasks than slots, obviously there will be multiple rounds of
>>>>>>>>>> mapping.
>>>>>>>>>>
>>>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>>>> typically 64MB and can contain a multitude of records; therefore a
>>>>>>>>>> map task = running the map() function on all records in the block.
>>>>>>>>>>
>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>
>>>>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>>>>> You have a layer for computations, which consists of a jobtracker
>>>>>>>>>> and a set of tasktrackers. The other layer is responsible for
>>>>>>>>>> storage: the HDFS has a namenode and a set of datanodes.
>>>>>>>>>>
>>>>>>>>>> In mapreduce the code is executed where the data is. So if a block
>>>>>>>>>> is on datanodes 1, 2 and 3, then the map task associated with this
>>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>>> tasktracker 1, 2 or 3. But this is not necessarily so; things can
>>>>>>>>>> be rearranged.
>>>>>>>>>>
>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>
>>>>>>>>>> Regards, Dieter
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>>>>>> map() function will be called that many times, right?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>>> things:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max.
>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size, only one
>>>>>>>>>>>>    mapper will be invoked. Right?
>>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by João Paulo Forny <jp...@gmail.com>.
If I understood your problem correctly, you have one huge JSON, which is
basically a JSONArray, and you want to process one JSONObject of the array
at a time.

I faced the same issue some time ago, and instead of changing the input
format, I changed the code that was generating this input so that it emits
lots of JSONObjects, one per line. Hence, using the default TextInputFormat,
the map function was called once with each complete JSONObject.

A JSONArray is not good as a mapreduce input, since it has an opening [ and
a closing ] and commas between the JSONs of the array. The array itself can
be represented simply by the file to which the JSONs belong.

Of course, this approach works only if you can modify whatever is generating
the input you're talking about.
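
(As an illustration of that preprocessing step only: a minimal one-off
converter in this spirit might look like the sketch below. It assumes the
org.json library and an array small enough to hold in memory, and the file
names are invented for the example.)

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.json.JSONArray;

public class JsonArrayToLines {
    public static void main(String[] args) throws IOException {
        // Load the whole JSONArray; fine for a one-off preprocessing run.
        String text = new String(
                Files.readAllBytes(Paths.get("features.json")), "UTF-8");
        JSONArray array = new JSONArray(text);
        try (PrintWriter out = new PrintWriter(new FileWriter("features.jsonl"))) {
            for (int i = 0; i < array.length(); i++) {
                // toString() emits each object compactly on one line,
                // which is exactly what TextInputFormat expects.
                out.println(array.getJSONObject(i).toString());
            }
        }
    }
}

(After that, each map() call sees one complete Feature object via the
default TextInputFormat, and no custom InputFormat is needed.)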


2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:

> In that case you have to convert your JSON data into seq files first and
> then do the processing.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <sugandha.n87@gmail.com
> > wrote:
>
>> Can I use SequenceFileInputFormat to do the same?
>>
>>  --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> Since there is no OOTB feature that allows this, you have to write your
>>> custom InputFormat to handle JSON data. Alternatively you could make use of
>>> Pig or Hive as they have builtin JSON support.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>> rajeshnagaraju@gmail.com> wrote:
>>>
>>>> One simple way is to remove the newline characters, so that the default
>>>> record reader and the default way the block is read will take care of
>>>> the input splits, and the JSON will not get affected by the removal of
>>>> the NL characters.
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Ok. Got it. Now I have a single file which is 129MB, so it will be
>>>>> split into two blocks. Since my file is a JSON file, I cannot use
>>>>> TextInputFormat, as every (logical) input split would be a single
>>>>> line of the JSON file, which I don't want. In this case, can I write
>>>>> a custom input format and a custom record reader so that every
>>>>> (logical) input split will have only that part of the data which I
>>>>> require?
>>>>>
>>>>> For. e.g:
>>>>>
>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>> 1458938.970140 ] ] } }
>>>>> ,
>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>> 1458829.592709 ] ] } }
>>>>>
>>>>> *I want every input split here to consist of one entire Feature
>>>>> record, so that I can process it accordingly by passing the relevant
>>>>> K,V pairs to the map function.*
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Hi Sugandha,
>>>>>>
>>>>>> Please find my comments embedded below:
>>>>>>
>>>>>> Q: No. of mappers are decided as Total_File_Size/Max. Block Size. Thus,
>>>>>> if the file is smaller than the block size, only one mapper will be
>>>>>> invoked. Right?
>>>>>> A: This is true (but not always). The basic criterion behind map
>>>>>> creation is the logic inside the *getSplits* method of the *InputFormat*
>>>>>> being used in your MR job. It is the behavior of *file based
>>>>>> InputFormats*, typically sub-classes of *FileInputFormat*, to split the
>>>>>> input data into splits based on the total size, in bytes, of the input
>>>>>> files. See *this*
>>>>>> <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>>> for more details. And yes, if the file is smaller than the block size
>>>>>> then only 1 mapper will be created.
>>>>>>
>>>>>> Q: If yes, it means the map() will be called only once. Right? In this
>>>>>> case, if there are two datanodes with a replication factor of 1, only
>>>>>> one datanode (mapper machine) will perform the task. Right?
>>>>>> A: A mapper is called for each split. Don't get confused between MR's
>>>>>> split and HDFS's block. The two are different (they may overlap, though,
>>>>>> as in the case of FileInputFormat). HDFS blocks are a physical
>>>>>> partitioning of your data, while an InputSplit is just a logical
>>>>>> partitioning. If you have a file which is smaller than the HDFS block
>>>>>> size then only one split will be created, hence only 1 mapper will be
>>>>>> called. And this will happen on the node where this file resides.
>>>>>>
>>>>>> Q: The map() function is called by all the datanodes/slaves, right? If
>>>>>> the no. of mappers is more than the no. of slaves, what happens?
>>>>>> A: map() doesn't get called by anybody; rather, a map task gets created
>>>>>> on the node where the chunk of data to be processed resides. A slave
>>>>>> node can run multiple mappers based on the availability of CPU slots.
>>>>>>
>>>>>> Q: One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>> map() function will be called that many times. Right?
>>>>>> A: No. of blocks = no. of splits = no. of mappers. A map is called only
>>>>>> once per split, on a node where that split's data is present.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Bertrand,
>>>>>>>
>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>>> only true when you set isSplitable() to false or when your input file
>>>>>>> size is less than the block size. Also, when it comes to text files,
>>>>>>> the default TextInputFormat considers each line as one input split,
>>>>>>> which can then be read by the RecordReader in K,V format.
>>>>>>>
>>>>>>> Please correct me if I don't make sense.
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>
>>>>>>>> The wiki (or Hadoop: The Definitive Guide) are good resources.
>>>>>>>>
>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>
>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>> really make sense to talk about the number of mappers.
>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>>> instantiated.
>>>>>>>>
>>>>>>>> Always:
>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>
>>>>>>>> And generally:
>>>>>>>> Number of HDFS blocks = number of input splits
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Bertrand
>>>>>>>>
>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>> writing a mail.
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>>>> hosts a mapper, and a mapper executes map tasks. If there are more
>>>>>>>>> map tasks than slots, obviously there will be multiple rounds of
>>>>>>>>> mapping.
>>>>>>>>>
>>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>>> typically 64MB and can contain a multitude of records; therefore a
>>>>>>>>> map task = running the map() function on all records in the block.
>>>>>>>>>
>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>
>>>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>>>> You have a layer for computations, which consists of a jobtracker
>>>>>>>>> and a set of tasktrackers. The other layer is responsible for
>>>>>>>>> storage: the HDFS has a namenode and a set of datanodes.
>>>>>>>>>
>>>>>>>>> In mapreduce the code is executed where the data is. So if a block
>>>>>>>>> is on datanodes 1, 2 and 3, then the map task associated with this
>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>> tasktracker 1, 2 or 3. But this is not necessarily so; things can
>>>>>>>>> be rearranged.
>>>>>>>>>
>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>
>>>>>>>>> Regards, Dieter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>
>>>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>>>>> map() function will be called that many times, right?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>> things:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>>>    will be invoked. Right?
>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by João Paulo Forny <jp...@gmail.com>.
If I understood your problem correctly, you have one huge JSON, which is
basically a JSONArray, and you want to process one JSONObject of the array
at a time.

I have faced the same issue some time ago and instead of changing the input
format, I changed the code that was generating this input, to generate lots
of JSONObjects, one per line. Hence, using the default TextInputFormat, the
map function was getting called with the entire JSON.

A JSONArray is not good for a mapreduce input since it has a first [ and a
last ] and commas between the JSONs of the array. The array can be
represented as the file that the JSONs belong.

Of course, this approach works only if you can modify what is generating
the input you're talking about.


2014-02-26 8:25 GMT-03:00 Mohammad Tariq <do...@gmail.com>:

> In that case you have to convert your JSON data into seq files first and
> then do the processing.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <sugandha.n87@gmail.com
> > wrote:
>
>> Can I use SequenceFileInputFormat to do the same?
>>
>>  --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> Since there is no OOTB feature that allows this, you have to write your
>>> custom InputFormat to handle JSON data. Alternatively you could make use of
>>> Pig or Hive as they have builtin JSON support.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>> rajeshnagaraju@gmail.com> wrote:
>>>
>>>> 1 simple way is to remove the new line characters so that the default
>>>> record reader and default way the block is read will take care of the input
>>>> splits and JSON will not get affected by the removal of NL character
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it will
>>>>> be split into two blocks. Now, since my file is a json file, I cannot use
>>>>> textinputformat. As, every input split(logical) will be a single line of
>>>>> the json file. Which I dont want. Thus, in this case, can I write a custom
>>>>> input format and a custom record reader so that, every input split(logical)
>>>>> will have only that part of data which I require.
>>>>>
>>>>> For. e.g:
>>>>>
>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>> 1458938.970140 ] ] } }
>>>>> ,
>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>> 1458829.592709 ] ] } }
>>>>>
>>>>> *I want every input split here to consist of one entire "Feature"
>>>>> record, so that I can process it accordingly by giving relevant K,V
>>>>> pairs to the map function.*
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>>
>>>>>> Hi Sugandha,
>>>>>>
>>>>>> Please find my comments embedded below :
>>>>>>
>>>>>> Q: No. of mappers are decided as: Total_File_Size/Max. Block Size.
>>>>>> Thus, if the file is smaller than the block size, only one mapper
>>>>>> will be invoked. Right?
>>>>>>
>>>>>> A: This is true (but not always). The basic criterion behind map
>>>>>> creation is the logic inside the *getSplits* method of the
>>>>>> *InputFormat* being used in your MR job. It is the behavior of *file
>>>>>> based InputFormats*, typically sub-classes of *FileInputFormat*, to
>>>>>> split the input data into splits based on the total size, in bytes,
>>>>>> of the input files. See *this*
>>>>>> <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>>> for more details. And yes, if the file is smaller than the block
>>>>>> size then only 1 mapper will be created.
>>>>>>
>>>>>> Q: If yes, it means the map() will be called only once. Right? In
>>>>>> this case, if there are two datanodes with a replication factor of
>>>>>> 1, only one datanode (mapper machine) will perform the task. Right?
>>>>>>
>>>>>> A: A mapper is called for each split. Don't confuse MR's split with
>>>>>> HDFS's block. The two are different (they may overlap though, as in
>>>>>> the case of FileInputFormat). HDFS blocks are a physical
>>>>>> partitioning of your data, while an InputSplit is just a logical
>>>>>> partitioning. If you have a file which is smaller than the HDFS
>>>>>> block size then only one split will be created, hence only 1 mapper
>>>>>> will be called. And this will happen on the node where this file
>>>>>> resides.
>>>>>>
>>>>>> Q: The map() function is called by all the datanodes/slaves, right?
>>>>>> If the no. of mappers is more than the no. of slaves, what happens?
>>>>>>
>>>>>> A: map() doesn't get called by anybody. Rather, a map task gets
>>>>>> created on the node where the chunk of data to be processed
>>>>>> resides. A slave node can run multiple mappers based on the
>>>>>> availability of CPU slots.
>>>>>>
>>>>>> Q: One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>> map() function will be called that many times. Right?
>>>>>>
>>>>>> A: No. of blocks = no. of splits = no. of mappers. A map is called
>>>>>> only once per split per node where that split is present.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Bertrand,
>>>>>>>
>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>>> only true when you set isSplittable() as false or when your input
>>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>>> split, which can then be read by the RecordReader in K,V format.
>>>>>>>
>>>>>>> Please correct me if I don't make sense.
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>
>>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>>>
>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>
>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>> really make sense to talk about the number of mappers.
>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>>> instantiated.
>>>>>>>>
>>>>>>>> Always :
>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>
>>>>>>>> And generally :
>>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Bertrand
>>>>>>>>
>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>> writing a mail.
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>>>> hosts a mapper. A mapper executes map tasks. If there are more map
>>>>>>>>> tasks than slots, obviously there will be multiple rounds of
>>>>>>>>> mapping.
>>>>>>>>>
>>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>>> typically 64MB and can contain a multitude of records; therefore a
>>>>>>>>> map task = running the map() function on all records in the block.
>>>>>>>>>
>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>
>>>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>>>>> namenode and a set of datanodes.
>>>>>>>>>
>>>>>>>>> In MapReduce the code is executed where the data is. So if a block
>>>>>>>>> is on datanodes 1, 2 and 3, then the map task associated with this
>>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>>> tasktracker 1, 2 or 3. But this is not guaranteed; things can be
>>>>>>>>> rearranged.
>>>>>>>>>
>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>
>>>>>>>>> Regards, Dieter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>
>>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>> those many no. of times the map() function will be called right?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>> File(s) are split in chunks/blocks. On the same note, would like to ask few
>>>>>>>>>>> things:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>>>    will be invoked. Right?
>>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
In that case you have to convert your JSON data into seq files first and
then do the processing.
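
A rough sketch of that conversion, assuming the classic
SequenceFile.createWriter(fs, conf, path, keyClass, valueClass) API; the
output path and the jsonObjects() helper below are hypothetical stand-ins:

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class JsonToSeqFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/hadoop/features.seq");

        // One record per JSON object: a running id as key, the JSON as value.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, LongWritable.class, Text.class);
        try {
            long id = 0;
            for (String json : jsonObjects()) {
                writer.append(new LongWritable(id++), new Text(json));
            }
        } finally {
            writer.close();
        }
    }

    // Stand-in for however you pull individual JSON objects out of the source.
    private static List<String> jsonObjects() {
        return Arrays.asList(
                "{ \"type\": \"Feature\" }",
                "{ \"type\": \"Feature\" }");
    }
}

Once the data is in this form, SequenceFileInputFormat will hand one JSON
object per record to each map() call, regardless of newlines inside the
objects.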

Warm Regards,
Tariq
cloudfront.blogspot.com


On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Can I use SequenceFileInputFormat to do the same?
>
>  --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Since there is no OOTB feature that allows this, you have to write your
>> custom InputFormat to handle JSON data. Alternatively you could make use of
>> Pig or Hive as they have built-in JSON support.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>> rajeshnagaraju@gmail.com> wrote:
>>
>>> One simple way is to remove the newline characters, so that the default
>>> record reader and the default way the block is read take care of the
>>> input splits, and the JSON is not affected by the removal of the
>>> newline characters.
>>>
>>>
>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Ok. Got it. Now I have a single file which is 129MB. Thus, it will be
>>>> split into two blocks. Now, since my file is a JSON file, I cannot use
>>>> TextInputFormat, as every input split (logical) will be a single line
>>>> of the JSON file, which I don't want. Thus, in this case, can I write
>>>> a custom input format and a custom record reader so that every input
>>>> split (logical) will have only that part of the data which I require?
>>>>
>>>> For. e.g:
>>>>
>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>>> ,
>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>>
>>>> *I want every input split here to consist of one entire "Feature"
>>>> record, so that I can process it accordingly by giving relevant K,V
>>>> pairs to the map function.*
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> Hi Sugandha,
>>>>>
>>>>> Please find my comments embedded below :
>>>>>
>>>>> Q: No. of mappers are decided as: Total_File_Size/Max. Block Size.
>>>>> Thus, if the file is smaller than the block size, only one mapper
>>>>> will be invoked. Right?
>>>>>
>>>>> A: This is true (but not always). The basic criterion behind map
>>>>> creation is the logic inside the *getSplits* method of the
>>>>> *InputFormat* being used in your MR job. It is the behavior of *file
>>>>> based InputFormats*, typically sub-classes of *FileInputFormat*, to
>>>>> split the input data into splits based on the total size, in bytes,
>>>>> of the input files. See *this*
>>>>> <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>> for more details. And yes, if the file is smaller than the block
>>>>> size then only 1 mapper will be created.
>>>>>
>>>>> Q: If yes, it means the map() will be called only once. Right? In
>>>>> this case, if there are two datanodes with a replication factor of
>>>>> 1, only one datanode (mapper machine) will perform the task. Right?
>>>>>
>>>>> A: A mapper is called for each split. Don't confuse MR's split with
>>>>> HDFS's block. The two are different (they may overlap though, as in
>>>>> the case of FileInputFormat). HDFS blocks are a physical
>>>>> partitioning of your data, while an InputSplit is just a logical
>>>>> partitioning. If you have a file which is smaller than the HDFS
>>>>> block size then only one split will be created, hence only 1 mapper
>>>>> will be called. And this will happen on the node where this file
>>>>> resides.
>>>>>
>>>>> Q: The map() function is called by all the datanodes/slaves, right?
>>>>> If the no. of mappers is more than the no. of slaves, what happens?
>>>>>
>>>>> A: map() doesn't get called by anybody. Rather, a map task gets
>>>>> created on the node where the chunk of data to be processed
>>>>> resides. A slave node can run multiple mappers based on the
>>>>> availability of CPU slots.
>>>>>
>>>>> Q: One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>> map() function will be called that many times. Right?
>>>>>
>>>>> A: No. of blocks = no. of splits = no. of mappers. A map is called
>>>>> only once per split per node where that split is present.
>>>>>
>>>>> HTH
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Hi Bertrand,
>>>>>>
>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>> only true when you set isSplittable() as false or when your input
>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>> split, which can then be read by the RecordReader in K,V format.
>>>>>>
>>>>>> Please correct me if I don't make sense.
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>>
>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>
>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>> really make sense to talk about the number of mappers.
>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>> instantiated.
>>>>>>>
>>>>>>> Always :
>>>>>>> Number of input splits = no. of map tasks
>>>>>>>
>>>>>>> And generally :
>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Bertrand
>>>>>>>
>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>> writing a mail.
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>>> hosts a mapper. A mapper executes map tasks. If there are more map
>>>>>>>> tasks than slots, obviously there will be multiple rounds of
>>>>>>>> mapping.
>>>>>>>>
>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>> typically 64MB and can contain a multitude of records; therefore a
>>>>>>>> map task = running the map() function on all records in the block.
>>>>>>>>
>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>
>>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>>>> namenode and a set of datanodes.
>>>>>>>>
>>>>>>>> In MapReduce the code is executed where the data is. So if a block
>>>>>>>> is on datanodes 1, 2 and 3, then the map task associated with this
>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>> tasktracker 1, 2 or 3. But this is not guaranteed; things can be
>>>>>>>> rearranged.
>>>>>>>>
>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>
>>>>>>>> Regards, Dieter
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com
>>>>>>>> >:
>>>>>>>>
>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>>>>> many no. of times the map() function will be called right?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> As per the various articles I went through till date, the File(s)
>>>>>>>>>> are split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>>    will be invoked. Right?
>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
In that case you have to convert your JSON data into seq files first and
then do the processing.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Can I use SequenceFileInputFormat to do the same?
>
>  --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Since there is no OOTB feature that allows this, you have to write your
>> custom InputFormat to handle JSON data. Alternatively you could make use of
>> Pig or Hive as they have builtin JSON support.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>> rajeshnagaraju@gmail.com> wrote:
>>
>>> 1 simple way is to remove the new line characters so that the default
>>> record reader and default way the block is read will take care of the input
>>> splits and JSON will not get affected by the removal of NL character
>>>
>>>
>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it will
>>>> be split into two blocks. Now, since my file is a json file, I cannot use
>>>> textinputformat. As, every input split(logical) will be a single line of
>>>> the json file. Which I dont want. Thus, in this case, can I write a custom
>>>> input format and a custom record reader so that, every input split(logical)
>>>> will have only that part of data which I require.
>>>>
>>>> For. e.g:
>>>>
>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>>> ,
>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>>
>>>> *I want here the every input split to consist of entire type data and
>>>> thus, I can process it accordingly by giving relevant k,V pairs to the map
>>>> function.*
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> Hi Sugandha,
>>>>>
>>>>> Please find my comments embedded below :
>>>>>
>>>>>                   No. of mappers are decided as: Total_File_Size/Max.
>>>>> Block Size. Thus, if the file is smaller than the block size, only one
>>>>> mapper will be                               invoked. Right?
>>>>>                   This is true(but not always). The basic criteria
>>>>> behind map creation is the logic inside *getSplits* method of
>>>>> *InputFormat* being used in your                     MR job. It is
>>>>> the behavior of *file based InputFormats*, typically sub-classes of
>>>>> *FileInputFormat*, to split the input data into splits based
>>>>>             on the total size, in bytes, of the input files. See
>>>>> *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>>>>> only 1 mapper will                     be created.
>>>>>
>>>>>                   If yes, it means, the map() will be called only
>>>>> once. Right? In this case, if there are two datanodes with a replication
>>>>> factor as 1: only one                               datanode(mapper
>>>>> machine) will perform the task. Right?
>>>>>                   A mapper is called for each split. Don't get
>>>>> confused with the MR's split and HDFS's block. Both are different(They may
>>>>> overlap though, as in                     case of FileInputFormat). HDFS
>>>>> blocks are physical partitioning of your data, while an InputSplit is just
>>>>> a logical partitioning. If you have a                       file which is
>>>>> smaller than the HDFS blocksize then only one split will be created, hence
>>>>> only 1 mapper will be called. And this will happen on
>>>>> the node where this file resides.
>>>>>
>>>>>                   The map() function is called by all the
>>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>>> slaves, what happens?
>>>>>                   map() doesn't get called by anybody. It rather gets
>>>>> created on the node where the chunk of data to be processed resides. A
>>>>> slave node can run                       multiple mappers based on the
>>>>> availability of CPU slots.
>>>>>
>>>>>                  One more thing to ask: No. of blocks = no. of
>>>>> mappers. Thus, those many no. of times the map() function will be called
>>>>> right?
>>>>>                  No. of blocks = no. of splits = no. of mappers. A map
>>>>> is called only once per split per node where that split is present.
>>>>>
>>>>> HTH
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Hi Bertrand,
>>>>>>
>>>>>> As you said, no. of HDFS blocks =  no. of input splits. But this is
>>>>>> only true when you set isSplittable() as false or when your input file size
>>>>>> is less than the block size. Also, when it comes to text files, the default
>>>>>> textinputformat considers each line as one input split which can be then
>>>>>> read by RecordReader in K,V format.
>>>>>>
>>>>>> Please correct me if I don't make sense.
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>>>>
>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>
>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>> really make sense to talk about number of mappers.
>>>>>>> A task is a jvm that can be launched only if there is a free slot ie
>>>>>>> for a given slot, at a given time, there will be at maximum only a single
>>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>>
>>>>>>> Always :
>>>>>>> Number of input splits = no. of map tasks
>>>>>>>
>>>>>>> And generally :
>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Bertrand
>>>>>>>
>>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>>> writting a mail.
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>>> hosts as mapper. A mapper executes map tasks. If there are more map tasks
>>>>>>>> than slots obviously there will be multiple rounds of mapping.
>>>>>>>>
>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>> typically 64MB and can contain a multitude of record, therefore a map task
>>>>>>>> = run the map() function on all records in the block.
>>>>>>>>
>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>
>>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>>>> namenode and a set of datanodes.
>>>>>>>>
>>>>>>>> In mapreduce the code is executed where the data is. So if a block
>>>>>>>> is in datanode 1, 2 and 3, then the map task associated with this block
>>>>>>>> will likely be executed on one of those physical nodes, by tasktracker 1, 2
>>>>>>>> or 3. But this is not necessary, thing can be rearranged.
>>>>>>>>
>>>>>>>> Hopefully this gives you a little more insigth.
>>>>>>>>
>>>>>>>> Regards, Dieter
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com
>>>>>>>> >:
>>>>>>>>
>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>>>>> many no. of times the map() function will be called right?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> As per the various articles I went through till date, the File(s)
>>>>>>>>>> are split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>>    will be invoked. Right?
>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
In that case you have to convert your JSON data into seq files first and
then do the processing.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Can I use SequenceFileInputFormat to do the same?
>
>  --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Since there is no OOTB feature that allows this, you have to write your
>> custom InputFormat to handle JSON data. Alternatively you could make use of
>> Pig or Hive as they have builtin JSON support.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>> rajeshnagaraju@gmail.com> wrote:
>>
>>> 1 simple way is to remove the new line characters so that the default
>>> record reader and default way the block is read will take care of the input
>>> splits and JSON will not get affected by the removal of NL character
>>>
>>>
>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it will
>>>> be split into two blocks. Now, since my file is a json file, I cannot use
>>>> textinputformat. As, every input split(logical) will be a single line of
>>>> the json file. Which I dont want. Thus, in this case, can I write a custom
>>>> input format and a custom record reader so that, every input split(logical)
>>>> will have only that part of data which I require.
>>>>
>>>> For. e.g:
>>>>
>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>>> ,
>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>>
>>>> *I want every input split here to consist of one entire "Feature"
>>>> record, so that I can process it accordingly by giving relevant K,V
>>>> pairs to the map function.*
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>>
>>>>> Hi Sugandha,
>>>>>
>>>>> Please find my comments embedded below :
>>>>>
>>>>> Q: No. of mappers are decided as: Total_File_Size / Max. Block Size.
>>>>> Thus, if the file is smaller than the block size, only one mapper will
>>>>> be invoked. Right?
>>>>>
>>>>> A: This is true (but not always). The basic criterion behind map
>>>>> creation is the logic inside the *getSplits* method of the *InputFormat*
>>>>> being used in your MR job. It is the behavior of *file based
>>>>> InputFormats*, typically sub-classes of *FileInputFormat*, to split the
>>>>> input data into splits based on the total size, in bytes, of the input
>>>>> files. See *this*
>>>>> <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>> for more details. And yes, if the file is smaller than the block size
>>>>> then only 1 mapper will be created.
>>>>>
>>>>> Q: If yes, it means the map() will be called only once. Right? In this
>>>>> case, if there are two datanodes with a replication factor of 1: only
>>>>> one datanode (mapper machine) will perform the task. Right?
>>>>>
>>>>> A: A mapper is called for each split. Don't confuse MR's split with
>>>>> HDFS's block. Both are different (they may overlap though, as in the
>>>>> case of FileInputFormat). HDFS blocks are a physical partitioning of
>>>>> your data, while an InputSplit is just a logical partitioning. If you
>>>>> have a file which is smaller than the HDFS blocksize then only one split
>>>>> will be created, hence only 1 mapper will be called. And this will
>>>>> happen on the node where this file resides.
>>>>>
>>>>> Q: The map() function is called by all the datanodes/slaves, right? If
>>>>> the no. of mappers is more than the no. of slaves, what happens?
>>>>>
>>>>> A: map() doesn't get called by anybody. It rather gets run on the node
>>>>> where the chunk of data to be processed resides. A slave node can run
>>>>> multiple mappers based on the availability of CPU slots.
>>>>>
>>>>> Q: One more thing to ask: No. of blocks = no. of mappers. Thus, that
>>>>> many times the map() function will be called, right?
>>>>>
>>>>> A: No. of blocks = no. of splits = no. of mappers. A map is called only
>>>>> once per split per node where that split is present.
>>>>>
>>>>> HTH
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Hi Bertrand,
>>>>>>
>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>> only true when you set isSplitable() as false or when your input file
>>>>>> size is less than the block size. Also, when it comes to text files,
>>>>>> the default TextInputFormat considers each line as one input split,
>>>>>> which can then be read by the RecordReader in K,V format.
>>>>>>
>>>>>> Please correct me if I don't make sense.
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>>
>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>
>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>> really make sense to talk about the number of mappers.
>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>> i.e. for a given slot, at a given time, there will be at most a single
>>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>>
>>>>>>> Always :
>>>>>>> Number of input splits = no. of map tasks
>>>>>>>
>>>>>>> And generally :
>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Bertrand
>>>>>>>
>>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>>> writing a mail.
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>>> hosts a mapper. A mapper executes map tasks. If there are more map tasks
>>>>>>>> than slots, obviously there will be multiple rounds of mapping.
>>>>>>>>
>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>> typically 64MB and can contain a multitude of records; therefore a map
>>>>>>>> task = running the map() function on all records in the block.
>>>>>>>>
>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>
>>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>>>> namenode and a set of datanodes.
>>>>>>>>
>>>>>>>> In mapreduce the code is executed where the data is. So if a block
>>>>>>>> is on datanodes 1, 2 and 3, then the map task associated with this block
>>>>>>>> will likely be executed on one of those physical nodes, by tasktracker 1, 2
>>>>>>>> or 3. But this is not guaranteed; things can be rearranged.
>>>>>>>>
>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>
>>>>>>>> Regards, Dieter
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com
>>>>>>>> >:
>>>>>>>>
>>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>>>>> many no. of times the map() function will be called right?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> As per the various articles I went through till date, the File(s)
>>>>>>>>>> are split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>>    will be invoked. Right?
>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
Can I use SequenceFileInputFormat to do the same?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Since there is no OOTB feature that allows this, you have to write your
> own custom InputFormat to handle JSON data. Alternatively, you could make
> use of Pig or Hive, as they have built-in JSON support.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
> rajeshnagaraju@gmail.com> wrote:
>
>> One simple way is to remove the newline characters, so that the default
>> record reader and the default way the block is read will take care of the
>> input splits, and the JSON will not be affected by the removal of the NL
>> characters.
>>
>>
>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Ok. Got it. Now I have a single file which is 129MB. Thus, it will be
>>> split into two blocks. Now, since my file is a JSON file, I cannot use
>>> TextInputFormat, as every (logical) input split would then be a single
>>> line of the JSON file, which I don't want. Thus, in this case, can I
>>> write a custom input format and a custom record reader so that every
>>> (logical) input split will have only that part of the data which I
>>> require?
>>>
>>> For example:
>>>
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>> ,
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>
>>> *I want every input split here to consist of one entire "Feature"
>>> record, so that I can process it accordingly by giving relevant K,V
>>> pairs to the map function.*
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Hi Sugandha,
>>>>
>>>> Please find my comments embedded below :
>>>>
>>>> Q: No. of mappers are decided as: Total_File_Size / Max. Block Size.
>>>> Thus, if the file is smaller than the block size, only one mapper will
>>>> be invoked. Right?
>>>>
>>>> A: This is true (but not always). The basic criterion behind map
>>>> creation is the logic inside the *getSplits* method of the *InputFormat*
>>>> being used in your MR job. It is the behavior of *file based
>>>> InputFormats*, typically sub-classes of *FileInputFormat*, to split the
>>>> input data into splits based on the total size, in bytes, of the input
>>>> files. See *this*
>>>> <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>> for more details. And yes, if the file is smaller than the block size
>>>> then only 1 mapper will be created.
>>>>
>>>> Q: If yes, it means the map() will be called only once. Right? In this
>>>> case, if there are two datanodes with a replication factor of 1: only
>>>> one datanode (mapper machine) will perform the task. Right?
>>>>
>>>> A: A mapper is called for each split. Don't confuse MR's split with
>>>> HDFS's block. Both are different (they may overlap though, as in the
>>>> case of FileInputFormat). HDFS blocks are a physical partitioning of
>>>> your data, while an InputSplit is just a logical partitioning. If you
>>>> have a file which is smaller than the HDFS blocksize then only one split
>>>> will be created, hence only 1 mapper will be called. And this will
>>>> happen on the node where this file resides.
>>>>
>>>> Q: The map() function is called by all the datanodes/slaves, right? If
>>>> the no. of mappers is more than the no. of slaves, what happens?
>>>>
>>>> A: map() doesn't get called by anybody. It rather gets run on the node
>>>> where the chunk of data to be processed resides. A slave node can run
>>>> multiple mappers based on the availability of CPU slots.
>>>>
>>>> Q: One more thing to ask: No. of blocks = no. of mappers. Thus, that
>>>> many times the map() function will be called, right?
>>>>
>>>> A: No. of blocks = no. of splits = no. of mappers. A map is called only
>>>> once per split per node where that split is present.
>>>>
>>>> HTH
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Hi Bertrand,
>>>>>
>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>> only true when you set isSplitable() as false or when your input file
>>>>> size is less than the block size. Also, when it comes to text files,
>>>>> the default TextInputFormat considers each line as one input split,
>>>>> which can then be read by the RecordReader in K,V format.
>>>>>
>>>>> Please correct me if I don't make sense.
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>>>
>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>
>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>
>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>> really make sense to talk about the number of mappers.
>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>> i.e. for a given slot, at a given time, there will be at most a single
>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>
>>>>>> Always :
>>>>>> Number of input splits = no. of map tasks
>>>>>>
>>>>>> And generally :
>>>>>> number of hdfs blocks = number of input splits
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Bertrand
>>>>>>
>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>> writing a mail.
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>>>>
>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>> hosts a mapper. A mapper executes map tasks. If there are more map tasks
>>>>>>> than slots, obviously there will be multiple rounds of mapping.
>>>>>>>
>>>>>>> The map function is called once for each input record. A block is
>>>>>>> typically 64MB and can contain a multitude of records; therefore a map
>>>>>>> task = running the map() function on all records in the block.
>>>>>>>
>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>
>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>>> namenode and a set of datanodes.
>>>>>>>
>>>>>>> In mapreduce the code is executed where the data is. So if a block
>>>>>>> is on datanodes 1, 2 and 3, then the map task associated with this block
>>>>>>> will likely be executed on one of those physical nodes, by tasktracker 1, 2
>>>>>>> or 3. But this is not guaranteed; things can be rearranged.
>>>>>>>
>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>
>>>>>>> Regards, Dieter
>>>>>>>
>>>>>>>
>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>
>>>>>>> :
>>>>>>>
>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>>>> many no. of times the map() function will be called right?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> As per the various articles I went through till date, the File(s)
>>>>>>>>> are split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>    will be invoked. Right?
>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
Can I use SequenceFileInputFormat to do the same?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Since there is no OOTB feature that allows this, you have to write your
> custom InputFormat to handle JSON data. Alternatively you could make use of
> Pig or Hive as they have builtin JSON support.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
> rajeshnagaraju@gmail.com> wrote:
>
>> 1 simple way is to remove the new line characters so that the default
>> record reader and default way the block is read will take care of the input
>> splits and JSON will not get affected by the removal of NL character
>>
>>
>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it will be
>>> split into two blocks. Now, since my file is a json file, I cannot use
>>> textinputformat. As, every input split(logical) will be a single line of
>>> the json file. Which I dont want. Thus, in this case, can I write a custom
>>> input format and a custom record reader so that, every input split(logical)
>>> will have only that part of data which I require.
>>>
>>> For. e.g:
>>>
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>> ,
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>
>>> *I want here the every input split to consist of entire type data and
>>> thus, I can process it accordingly by giving relevant k,V pairs to the map
>>> function.*
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Hi Sugandha,
>>>>
>>>> Please find my comments embedded below :
>>>>
>>>>                   No. of mappers are decided as: Total_File_Size/Max.
>>>> Block Size. Thus, if the file is smaller than the block size, only one
>>>> mapper will be                               invoked. Right?
>>>>                   This is true(but not always). The basic criteria
>>>> behind map creation is the logic inside *getSplits* method of
>>>> *InputFormat* being used in your                     MR job. It is the
>>>> behavior of *file based InputFormats*, typically sub-classes of
>>>> *FileInputFormat*, to split the input data into splits based
>>>>           on the total size, in bytes, of the input files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>>>> only 1 mapper will                     be created.
>>>>
>>>>                   If yes, it means, the map() will be called only once.
>>>> Right? In this case, if there are two datanodes with a replication factor
>>>> as 1: only one                               datanode(mapper machine) will
>>>> perform the task. Right?
>>>>                   A mapper is called for each split. Don't get
>>>> confused with the MR's split and HDFS's block. Both are different(They may
>>>> overlap though, as in                     case of FileInputFormat). HDFS
>>>> blocks are physical partitioning of your data, while an InputSplit is just
>>>> a logical partitioning. If you have a                       file which is
>>>> smaller than the HDFS blocksize then only one split will be created, hence
>>>> only 1 mapper will be called. And this will happen on
>>>> the node where this file resides.
>>>>
>>>>                   The map() function is called by all the
>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>> slaves, what happens?
>>>>                   map() doesn't get called by anybody. It rather gets
>>>> created on the node where the chunk of data to be processed resides. A
>>>> slave node can run                       multiple mappers based on the
>>>> availability of CPU slots.
>>>>
>>>>                  One more thing to ask: No. of blocks = no. of mappers.
>>>> Thus, those many no. of times the map() function will be called right?
>>>>                  No. of blocks = no. of splits = no. of mappers. A map
>>>> is called only once per split per node where that split is present.
>>>>
>>>> HTH
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Hi Bertrand,
>>>>>
>>>>> As you said, no. of HDFS blocks =  no. of input splits. But this is
>>>>> only true when you set isSplittable() as false or when your input file size
>>>>> is less than the block size. Also, when it comes to text files, the default
>>>>> textinputformat considers each line as one input split which can be then
>>>>> read by RecordReader in K,V format.
>>>>>
>>>>> Please correct me if I don't make sense.
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>>>
>>>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>>>
>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>
>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>> really make sense to talk about number of mappers.
>>>>>> A task is a jvm that can be launched only if there is a free slot ie
>>>>>> for a given slot, at a given time, there will be at maximum only a single
>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>
>>>>>> Always :
>>>>>> Number of input splits = no. of map tasks
>>>>>>
>>>>>> And generally :
>>>>>> number of hdfs blocks = number of input splits
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Bertrand
>>>>>>
>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>> writting a mail.
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>>>>
>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>> hosts as mapper. A mapper executes map tasks. If there are more map tasks
>>>>>>> than slots obviously there will be multiple rounds of mapping.
>>>>>>>
>>>>>>> The map function is called once for each input record. A block is
>>>>>>> typically 64MB and can contain a multitude of record, therefore a map task
>>>>>>> = run the map() function on all records in the block.
>>>>>>>
>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>
>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>>> namenode and a set of datanodes.
>>>>>>>
>>>>>>> In mapreduce the code is executed where the data is. So if a block
>>>>>>> is in datanode 1, 2 and 3, then the map task associated with this block
>>>>>>> will likely be executed on one of those physical nodes, by tasktracker 1, 2
>>>>>>> or 3. But this is not necessary, thing can be rearranged.
>>>>>>>
>>>>>>> Hopefully this gives you a little more insigth.
>>>>>>>
>>>>>>> Regards, Dieter
>>>>>>>
>>>>>>>
>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>
>>>>>>> :
>>>>>>>
>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>>>> many no. of times the map() function will be called right?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> As per the various articles I went through till date, the File(s)
>>>>>>>>> are split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>    will be invoked. Right?
>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
Can I use SequenceFileInputFormat to do the same?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Since there is no OOTB feature that allows this, you have to write your
> custom InputFormat to handle JSON data. Alternatively you could make use of
> Pig or Hive as they have builtin JSON support.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
> rajeshnagaraju@gmail.com> wrote:
>
>> 1 simple way is to remove the new line characters so that the default
>> record reader and default way the block is read will take care of the input
>> splits and JSON will not get affected by the removal of NL character
>>
>>
>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it will be
>>> split into two blocks. Now, since my file is a json file, I cannot use
>>> textinputformat. As, every input split(logical) will be a single line of
>>> the json file. Which I dont want. Thus, in this case, can I write a custom
>>> input format and a custom record reader so that, every input split(logical)
>>> will have only that part of data which I require.
>>>
>>> For. e.g:
>>>
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>> ,
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>
>>> *I want here the every input split to consist of entire type data and
>>> thus, I can process it accordingly by giving relevant k,V pairs to the map
>>> function.*
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Hi Sugandha,
>>>>
>>>> Please find my comments embedded below :
>>>>
>>>>                   No. of mappers are decided as: Total_File_Size/Max.
>>>> Block Size. Thus, if the file is smaller than the block size, only one
>>>> mapper will be                               invoked. Right?
>>>>                   This is true(but not always). The basic criteria
>>>> behind map creation is the logic inside *getSplits* method of
>>>> *InputFormat* being used in your                     MR job. It is the
>>>> behavior of *file based InputFormats*, typically sub-classes of
>>>> *FileInputFormat*, to split the input data into splits based
>>>>           on the total size, in bytes, of the input files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>>>> only 1 mapper will                     be created.
>>>>
>>>>                   If yes, it means, the map() will be called only once.
>>>> Right? In this case, if there are two datanodes with a replication factor
>>>> as 1: only one                               datanode(mapper machine) will
>>>> perform the task. Right?
>>>>                   A mapper is called for each split. Don't get
>>>> confused with the MR's split and HDFS's block. Both are different(They may
>>>> overlap though, as in                     case of FileInputFormat). HDFS
>>>> blocks are physical partitioning of your data, while an InputSplit is just
>>>> a logical partitioning. If you have a                       file which is
>>>> smaller than the HDFS blocksize then only one split will be created, hence
>>>> only 1 mapper will be called. And this will happen on
>>>> the node where this file resides.
>>>>
>>>>                   The map() function is called by all the
>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>> slaves, what happens?
>>>>                   map() doesn't get called by anybody. It rather gets
>>>> created on the node where the chunk of data to be processed resides. A
>>>> slave node can run                       multiple mappers based on the
>>>> availability of CPU slots.
>>>>
>>>>                  One more thing to ask: No. of blocks = no. of mappers.
>>>> Thus, those many no. of times the map() function will be called right?
>>>>                  No. of blocks = no. of splits = no. of mappers. A map
>>>> is called only once per split per node where that split is present.
>>>>
>>>> HTH
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Hi Bertrand,
>>>>>
>>>>> As you said, no. of HDFS blocks =  no. of input splits. But this is
>>>>> only true when you set isSplittable() as false or when your input file size
>>>>> is less than the block size. Also, when it comes to text files, the default
>>>>> textinputformat considers each line as one input split which can be then
>>>>> read by RecordReader in K,V format.
>>>>>
>>>>> Please correct me if I don't make sense.
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>>>
>>>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>>>
>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>
>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>> really make sense to talk about number of mappers.
>>>>>> A task is a jvm that can be launched only if there is a free slot ie
>>>>>> for a given slot, at a given time, there will be at maximum only a single
>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>
>>>>>> Always :
>>>>>> Number of input splits = no. of map tasks
>>>>>>
>>>>>> And generally :
>>>>>> number of hdfs blocks = number of input splits
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Bertrand
>>>>>>
>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>> writting a mail.
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>>>>
>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>> hosts as mapper. A mapper executes map tasks. If there are more map tasks
>>>>>>> than slots obviously there will be multiple rounds of mapping.
>>>>>>>
>>>>>>> The map function is called once for each input record. A block is
>>>>>>> typically 64MB and can contain a multitude of record, therefore a map task
>>>>>>> = run the map() function on all records in the block.
>>>>>>>
>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>
>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>>> namenode and a set of datanodes.
>>>>>>>
>>>>>>> In mapreduce the code is executed where the data is. So if a block
>>>>>>> is in datanode 1, 2 and 3, then the map task associated with this block
>>>>>>> will likely be executed on one of those physical nodes, by tasktracker 1, 2
>>>>>>> or 3. But this is not necessary, thing can be rearranged.
>>>>>>>
>>>>>>> Hopefully this gives you a little more insigth.
>>>>>>>
>>>>>>> Regards, Dieter
>>>>>>>
>>>>>>>
>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>
>>>>>>> :
>>>>>>>
>>>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>>>> many no. of times the map() function will be called right?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> As per the various articles I went through till date, the File(s)
>>>>>>>>> are split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>    will be invoked. Right?
>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>    as 1: only one datanode(mapper machine) will perform the task. Right?
>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
Can I use SequenceFileInputFormat to do the same?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Since there is no OOTB feature that allows this, you have to write your
> custom InputFormat to handle JSON data. Alternatively you could make use of
> Pig or Hive as they have builtin JSON support.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
> rajeshnagaraju@gmail.com> wrote:
>
>> 1 simple way is to remove the new line characters so that the default
>> record reader and default way the block is read will take care of the input
>> splits and JSON will not get affected by the removal of NL character
>>
>>
>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it will be
>>> split into two blocks. Now, since my file is a json file, I cannot use
>>> textinputformat. As, every input split(logical) will be a single line of
>>> the json file. Which I dont want. Thus, in this case, can I write a custom
>>> input format and a custom record reader so that, every input split(logical)
>>> will have only that part of data which I require.
>>>
>>> For. e.g:
>>>
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>> ,
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>
>>> *I want here the every input split to consist of entire type data and
>>> thus, I can process it accordingly by giving relevant k,V pairs to the map
>>> function.*
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>>
>>>> Hi Sugandha,
>>>>
>>>> Please find my comments embedded below :
>>>>
>>>>                   No. of mappers are decided as: Total_File_Size/Max.
>>>> Block Size. Thus, if the file is smaller than the block size, only one
>>>> mapper will be                               invoked. Right?
>>>>                   This is true(but not always). The basic criteria
>>>> behind map creation is the logic inside *getSplits* method of
>>>> *InputFormat* being used in your                     MR job. It is the
>>>> behavior of *file based InputFormats*, typically sub-classes of
>>>> *FileInputFormat*, to split the input data into splits based
>>>>           on the total size, in bytes, of the input files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>>>> only 1 mapper will                     be created.
>>>>
>>>>                   If yes, it means, the map() will be called only once.
>>>> Right? In this case, if there are two datanodes with a replication factor
>>>> as 1: only one                               datanode(mapper machine) will
>>>> perform the task. Right?
>>>>                   A mapper is called for each split. Don't get
>>>> confused with the MR's split and HDFS's block. Both are different(They may
>>>> overlap though, as in                     case of FileInputFormat). HDFS
>>>> blocks are physical partitioning of your data, while an InputSplit is just
>>>> a logical partitioning. If you have a                       file which is
>>>> smaller than the HDFS blocksize then only one split will be created, hence
>>>> only 1 mapper will be called. And this will happen on
>>>> the node where this file resides.
>>>>
>>>>                   The map() function is called by all the
>>>> datanodes/slaves right? If the no. of mappers are more than the no. of
>>>> slaves, what happens?
>>>>                   map() doesn't get called by anybody. It rather gets
>>>> created on the node where the chunk of data to be processed resides. A
>>>> slave node can run                       multiple mappers based on the
>>>> availability of CPU slots.
>>>>
>>>>                  One more thing to ask: No. of blocks = no. of mappers.
>>>> Thus, those many no. of times the map() function will be called right?
>>>>                  No. of blocks = no. of splits = no. of mappers. A map
>>>> is called only once per split per node where that split is present.
>>>>
>>>> HTH
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Hi Bertrand,
>>>>>
>>>>> As you said, no. of HDFS blocks =  no. of input splits. But this is
>>>>> only true when you set isSplittable() as false or when your input file size
>>>>> is less than the block size. Also, when it comes to text files, the default
>>>>> textinputformat considers each line as one input split which can be then
>>>>> read by RecordReader in K,V format.
>>>>>
>>>>> Please correct me if I don't make sense.
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>>>
>>>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>>>
>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>
>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>> really make sense to talk about number of mappers.
>>>>>> A task is a jvm that can be launched only if there is a free slot ie
>>>>>> for a given slot, at a given time, there will be at maximum only a single
>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>
>>>>>> Always :
>>>>>> Number of input splits = no. of map tasks
>>>>>>
>>>>>> And generally :
>>>>>> number of hdfs blocks = number of input splits
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Bertrand
>>>>>>
>>>>>> PS : I don't know if it is only my client, but avoid red when
>>>>>> writting a mail.
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>>>>
>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>> hosts as mapper. A mapper executes map tasks. If there are more map tasks
>>>>>>> than slots obviously there will be multiple rounds of mapping.
>>>>>>>
>>>>>>> The map function is called once for each input record. A block is
>>>>>>> typically 64MB and can contain a multitude of record, therefore a map task
>>>>>>> = run the map() function on all records in the block.
>>>>>>>
>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>
>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>>> namenode and a set of datanodes.
>>>>>>>
>>>>>>> In MapReduce the code is executed where the data is. So if a block
>>>>>>> is in datanodes 1, 2 and 3, then the map task associated with this block
>>>>>>> will likely be executed on one of those physical nodes, by tasktracker 1,
>>>>>>> 2 or 3. But this is not necessarily so; things can be rearranged.
>>>>>>>
>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>
>>>>>>> Regards, Dieter
>>>>>>>
>>>>>>>
>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>
>>>>>>> :
>>>>>>>
>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>>> map() function will be called that many times, right?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> As per the various articles I went through till date, the file(s)
>>>>>>>>> are split into chunks/blocks. On the same note, I would like to ask a
>>>>>>>>> few things:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. The no. of mappers is decided as: Total_File_Size/Max. Block
>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>    will be invoked. Right?
>>>>>>>>>    2. If yes, it means the map() will be called only once.
>>>>>>>>>    Right? In this case, if there are two datanodes with a replication factor
>>>>>>>>>    of 1: only one datanode (mapper machine) will perform the task. Right?
>>>>>>>>>    3. The map() function is called by all the datanodes/slaves,
>>>>>>>>>    right? If the no. of mappers is more than the no. of slaves, what happens?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
Since there is no OOTB feature that allows this, you have to write your
own custom InputFormat to handle JSON data. Alternatively you could make
use of Pig or Hive, as they have built-in JSON support.
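
For concreteness, below is a bare-bones sketch of such a custom pair (an
editor's illustration, not code from this thread or from any library). It
marks the file as non-splittable and brace-counts its way to one top-level
JSON object per record; a real implementation would also have to handle
braces inside quoted strings, multi-byte encodings, and splittability for
large files.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class JsonObjectInputFormat
            extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // keep every JSON object inside a single split
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new JsonObjectRecordReader();
        }

        public static class JsonObjectRecordReader
                extends RecordReader<LongWritable, Text> {
            private FSDataInputStream in;
            private long start, end, pos;
            private final LongWritable key = new LongWritable();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit genericSplit,
                    TaskAttemptContext ctx) throws IOException {
                FileSplit split = (FileSplit) genericSplit;
                start = split.getStart();
                end = start + split.getLength();
                FileSystem fs =
                        split.getPath().getFileSystem(ctx.getConfiguration());
                in = fs.open(split.getPath());
                in.seek(start);
                pos = start;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                StringBuilder sb = new StringBuilder();
                int depth = 0;
                boolean started = false;
                int c;
                while (pos < end && (c = in.read()) != -1) {
                    pos++;
                    if (c == '{') { depth++; started = true; }
                    if (started) sb.append((char) c); // assumes ASCII JSON
                    if (c == '}' && started && --depth == 0) {
                        key.set(pos);        // byte offset just past the object
                        value.set(sb.toString());
                        return true;         // one complete top-level object
                    }
                }
                return false; // no further object in this split
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue()       { return value; }
            @Override public float getProgress() {
                return end == start ? 1f
                        : (pos - start) / (float) (end - start);
            }
            @Override public void close() throws IOException { in.close(); }
        }
    }

Wired in with job.setInputFormatClass(JsonObjectInputFormat.class), each
map() call would then see one whole "Feature" object as its value, which is
what the question quoted below is after.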

Warm Regards,
Tariq
cloudfront.blogspot.com


On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju
<ra...@gmail.com>wrote:

> One simple way is to remove the newline characters, so that the default
> record reader and the default way the block is read will take care of the
> input splits, and the JSON will not get affected by the removal of the NL
> characters.
>
>
> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
> sugandha.n87@gmail.com> wrote:
>
>> Ok, got it. Now I have a single file of 129MB. Thus, it will be
>> split into two blocks. Now, since my file is a JSON file, I cannot use
>> TextInputFormat, as every (logical) input split will be a single line of
>> the JSON file, which I don't want. Thus, in this case, can I write a custom
>> input format and a custom record reader so that every (logical) input split
>> will have only that part of the data which I require?
>>
>> For example:
>>
>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>> ,
>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>
>> *What I want here is for every input split to consist of one entire
>> "Feature" object, so that I can process it accordingly by giving relevant
>> K,V pairs to the map function.*
>>
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> Hi Sugandha,
>>>
>>> Please find my comments embedded below :
>>>
>>>                   No. of mappers are decided as: Total_File_Size/Max.
>>> Block Size. Thus, if the file is smaller than the block size, only one
>>> mapper will be invoked. Right?
>>>                   This is true (but not always). The basic criterion
>>> behind map creation is the logic inside the *getSplits* method of the
>>> *InputFormat* being used in your MR job. It is the behavior of *file
>>> based InputFormats*, typically sub-classes of *FileInputFormat*, to
>>> split the input data into splits based on the total size, in bytes, of
>>> the input files. See *this*
>>> <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>> for more details. And yes, if the file is smaller than the block size
>>> then only 1 mapper will be created.
>>>
>>>                   If yes, it means the map() will be called only once.
>>> Right? In this case, if there are two datanodes with a replication factor
>>> of 1: only one datanode (mapper machine) will perform the task. Right?
>>>                   A mapper is called for each split. Don't get confused
>>> between MR's split and HDFS's block. Both are different (they may overlap
>>> though, as in the case of FileInputFormat). HDFS blocks are a physical
>>> partitioning of your data, while an InputSplit is just a logical
>>> partitioning. If you have a file which is smaller than the HDFS blocksize
>>> then only one split will be created, hence only 1 mapper will be called.
>>> And this will happen on the node where this file resides.
>>>
>>>                   The map() function is called by all the
>>> datanodes/slaves right? If the no. of mappers is more than the no. of
>>> slaves, what happens?
>>>                   map() doesn't get called by anybody. It rather gets
>>> run on the node where the chunk of data to be processed resides. A
>>> slave node can run multiple mappers based on the availability of CPU
>>> slots.
>>>
>>>                  One more thing to ask: No. of blocks = no. of mappers.
>>> Thus, the map() function will be called that many times, right?
>>>                  No. of blocks = no. of splits = no. of mappers. A map
>>> is called only once per split, on the node where that split is present.
>>>
>>> HTH
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Hi Bertrand,
>>>>
>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>> only true when you set isSplitable() to false or when your input file
>>>> size is less than the block size. Also, when it comes to text files, the
>>>> default TextInputFormat considers each line as one input split, which can
>>>> then be read by the RecordReader in K,V format.
>>>>
>>>> Please correct me if I don't make sense.
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>>
>>>>> The wiki (or Hadoop: The Definitive Guide) are good resources.
>>>>>
>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>
>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>> really make sense to talk about the number of mappers.
>>>>> A task is a JVM that can be launched only if there is a free slot, i.e.
>>>>> for a given slot, at a given time, there will be at most a single
>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>
>>>>> Always:
>>>>> Number of input splits = no. of map tasks
>>>>>
>>>>> And generally:
>>>>> number of hdfs blocks = number of input splits
>>>>>
>>>>> Regards
>>>>>
>>>>> Bertrand
>>>>>
>>>>> PS: I don't know if it is only my client, but avoid red when writing
>>>>> a mail.
>>>>>
>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>>>
>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>> hosts a mapper. A mapper executes map tasks. If there are more map tasks
>>>>>> than slots, obviously there will be multiple rounds of mapping.
>>>>>>
>>>>>> The map function is called once for each input record. A block is
>>>>>> typically 64MB and can contain a multitude of records; therefore a map
>>>>>> task = running the map() function on all records in the block.
>>>>>>
>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>
>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>> You have a layer for computations which consists of a jobtracker and a set
>>>>>> of tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>>> namenode and a set of datanodes.
>>>>>>
>>>>>> In MapReduce the code is executed where the data is. So if a block is
>>>>>> in datanodes 1, 2 and 3, then the map task associated with this block will
>>>>>> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
>>>>>> 3. But this is not necessarily so; things can be rearranged.
>>>>>>
>>>>>> Hopefully this gives you a little more insight.
>>>>>>
>>>>>> Regards, Dieter
>>>>>>
>>>>>>
>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>>>>>
>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>> map() function will be called that many times, right?
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> As per the various articles I went through till date, the file(s)
>>>>>>>> are split into chunks/blocks. On the same note, I would like to ask a
>>>>>>>> few things:
>>>>>>>>
>>>>>>>>
>>>>>>>>    1. The no. of mappers is decided as: Total_File_Size/Max. Block
>>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>    will be invoked. Right?
>>>>>>>>    2. If yes, it means the map() will be called only once. Right?
>>>>>>>>    In this case, if there are two datanodes with a replication factor of 1:
>>>>>>>>    only one datanode (mapper machine) will perform the task. Right?
>>>>>>>>    3. The map() function is called by all the datanodes/slaves,
>>>>>>>>    right? If the no. of mappers is more than the no. of slaves, what happens?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Rajesh Nagaraju <ra...@gmail.com>.
One simple way is to remove the newline characters, so that the default
record reader and the default way the block is read will take care of the
input splits, and the JSON will not get affected by the removal of the NL
characters.
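
A minimal sketch of that preprocessing step (an editor's illustration, not
code from this thread; the file names are hypothetical, the brace counting
is naive, and braces inside quoted strings are not handled): it drops the NL
characters and rewrites the input so each top-level object lands on its own
line, after which the stock TextInputFormat delivers one object per map()
call.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    // Usage (hypothetical): java FlattenJson features.json features-flat.json
    public class FlattenJson {
        public static void main(String[] args) throws IOException {
            try (BufferedReader r = new BufferedReader(new FileReader(args[0]));
                 PrintWriter w = new PrintWriter(new FileWriter(args[1]))) {
                StringBuilder obj = new StringBuilder();
                int depth = 0, c;
                while ((c = r.read()) != -1) {
                    if (c == '\n' || c == '\r') continue; // strip NL characters
                    if (c == '{') depth++;
                    if (depth > 0) obj.append((char) c);
                    if (c == '}' && depth > 0 && --depth == 0) {
                        w.println(obj);       // one complete object per line
                        obj.setLength(0);
                    }
                }
            }
        }
    }
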


On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Ok, got it. Now I have a single file of 129MB. Thus, it will be
> split into two blocks. Now, since my file is a JSON file, I cannot use
> TextInputFormat, as every (logical) input split will be a single line of
> the JSON file, which I don't want. Thus, in this case, can I write a custom
> input format and a custom record reader so that every (logical) input split
> will have only that part of the data which I require?
>
> For example:
>
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
> ,
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>
> *What I want here is for every input split to consist of one entire
> "Feature" object, so that I can process it accordingly by giving relevant
> K,V pairs to the map function.*
>
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Hi Sugandha,
>>
>> Please find my comments embedded below :
>>
>>                   No. of mappers are decided as: Total_File_Size/Max.
>> Block Size. Thus, if the file is smaller than the block size, only one
>> mapper will be invoked. Right?
>>                   This is true (but not always). The basic criterion
>> behind map creation is the logic inside the *getSplits* method of the
>> *InputFormat* being used in your MR job. It is the behavior of *file
>> based InputFormats*, typically sub-classes of *FileInputFormat*, to
>> split the input data into splits based on the total size, in bytes, of
>> the input files. See *this*
>> <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>> for more details. And yes, if the file is smaller than the block size
>> then only 1 mapper will be created.
>>
>>                   If yes, it means the map() will be called only once.
>> Right? In this case, if there are two datanodes with a replication factor
>> of 1: only one datanode (mapper machine) will perform the task. Right?
>>                   A mapper is called for each split. Don't get confused
>> between MR's split and HDFS's block. Both are different (they may overlap
>> though, as in the case of FileInputFormat). HDFS blocks are a physical
>> partitioning of your data, while an InputSplit is just a logical
>> partitioning. If you have a file which is smaller than the HDFS blocksize
>> then only one split will be created, hence only 1 mapper will be called.
>> And this will happen on the node where this file resides.
>>
>>                   The map() function is called by all the
>> datanodes/slaves right? If the no. of mappers is more than the no. of
>> slaves, what happens?
>>                   map() doesn't get called by anybody. It rather gets
>> run on the node where the chunk of data to be processed resides. A
>> slave node can run multiple mappers based on the availability of CPU
>> slots.
>>
>>                  One more thing to ask: No. of blocks = no. of mappers.
>> Thus, the map() function will be called that many times, right?
>>                  No. of blocks = no. of splits = no. of mappers. A map is
>> called only once per split, on the node where that split is present.
>>
>> HTH
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Hi Bertrand,
>>>
>>> As you said, no. of HDFS blocks = no. of input splits. But this is only
>>> true when you set isSplitable() to false or when your input file size is
>>> less than the block size. Also, when it comes to text files, the default
>>> TextInputFormat considers each line as one input split, which can then be
>>> read by the RecordReader in K,V format.
>>>
>>> Please correct me if I don't make sense.
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>
>>>> The wiki (or Hadoop: The Definitive Guide) are good resources.
>>>>
>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>
>>>> Mapper is the name of the abstract class/interface. It does not
>>>> really make sense to talk about the number of mappers.
>>>> A task is a JVM that can be launched only if there is a free slot, i.e.
>>>> for a given slot, at a given time, there will be at most a single
>>>> task. During the task, the configured Mapper will be instantiated.
>>>>
>>>> Always:
>>>> Number of input splits = no. of map tasks
>>>>
>>>> And generally:
>>>> number of hdfs blocks = number of input splits
>>>>
>>>> Regards
>>>>
>>>> Bertrand
>>>>
>>>> PS: I don't know if it is only my client, but avoid red when writing
>>>> a mail.
>>>>
>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>>
>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>> hosts a mapper. A mapper executes map tasks. If there are more map tasks
>>>>> than slots, obviously there will be multiple rounds of mapping.
>>>>>
>>>>> The map function is called once for each input record. A block is
>>>>> typically 64MB and can contain a multitude of records; therefore a map
>>>>> task = running the map() function on all records in the block.
>>>>>
>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>
>>>>> Furthermore you have to make a distinction between the two layers. You
>>>>> have a layer for computations which consists of a jobtracker and a set of
>>>>> tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>> namenode and a set of datanodes.
>>>>>
>>>>> In MapReduce the code is executed where the data is. So if a block is
>>>>> in datanodes 1, 2 and 3, then the map task associated with this block will
>>>>> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
>>>>> 3. But this is not necessarily so; things can be rearranged.
>>>>>
>>>>> Hopefully this gives you a little more insight.
>>>>>
>>>>> Regards, Dieter
>>>>>
>>>>>
>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>>>>
>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>> map() function will be called that many times, right?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> As per the various articles I went through till date, the file(s)
>>>>>>> are split into chunks/blocks. On the same note, I would like to ask a
>>>>>>> few things:
>>>>>>>
>>>>>>>
>>>>>>>    1. The no. of mappers is decided as: Total_File_Size/Max. Block
>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>    will be invoked. Right?
>>>>>>>    2. If yes, it means the map() will be called only once. Right?
>>>>>>>    In this case, if there are two datanodes with a replication factor of 1:
>>>>>>>    only one datanode (mapper machine) will perform the task. Right?
>>>>>>>    3. The map() function is called by all the datanodes/slaves,
>>>>>>>    right? If the no. of mappers is more than the no. of slaves, what happens?
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Rajesh Nagaraju <ra...@gmail.com>.
1 simple way is to remove the new line characters so that the default
record reader and default way the block is read will take care of the input
splits and JSON will not get affected by the removal of NL character


On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Ok. Got it. Now I have a single file which is of 129MB. Thus, it will be
> split into two blocks. Now, since my file is a json file, I cannot use
> textinputformat. As, every input split(logical) will be a single line of
> the json file. Which I dont want. Thus, in this case, can I write a custom
> input format and a custom record reader so that, every input split(logical)
> will have only that part of data which I require.
>
> For. e.g:
>
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
> ,
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>
> *I want here the every input split to consist of entire type data and
> thus, I can process it accordingly by giving relevant k,V pairs to the map
> function.*
>
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Hi Sugandha,
>>
>> Please find my comments embedded below :
>>
>>                   No. of mappers are decided as: Total_File_Size/Max.
>> Block Size. Thus, if the file is smaller than the block size, only one
>> mapper will be                               invoked. Right?
>>                   This is true(but not always). The basic criteria
>> behind map creation is the logic inside *getSplits* method of
>> *InputFormat* being used in your                     MR job. It is the
>> behavior of *file based InputFormats*, typically sub-classes of
>> *FileInputFormat*, to split the input data into splits based
>>         on the total size, in bytes, of the input files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for more details. And yes, if the file is smaller than the block size then
>> only 1 mapper will                     be created.
>>
>>                   If yes, it means, the map() will be called only once.
>> Right? In this case, if there are two datanodes with a replication factor
>> as 1: only one                               datanode(mapper machine) will
>> perform the task. Right?
>>                   A mapper is called for each split. Don't get confused
>> with the MR's split and HDFS's block. Both are different(They may overlap
>> though, as in                     case of FileInputFormat). HDFS blocks are
>> physical partitioning of your data, while an InputSplit is just a logical
>> partitioning. If you have a                       file which is smaller
>> than the HDFS blocksize then only one split will be created, hence only 1
>> mapper will be called. And this will happen on                     the node
>> where this file resides.
>>
>>                   The map() function is called by all the
>> datanodes/slaves right? If the no. of mappers are more than the no. of
>> slaves, what happens?
>>                   map() doesn't get called by anybody. It rather gets
>> created on the node where the chunk of data to be processed resides. A
>> slave node can run                       multiple mappers based on the
>> availability of CPU slots.
>>
>>                  One more thing to ask: No. of blocks = no. of mappers.
>> Thus, those many no. of times the map() function will be called right?
>>                  No. of blocks = no. of splits = no. of mappers. A map is
>> called only once per split per node where that split is present.
>>
>> HTH
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Hi Bertrand,
>>>
>>> As you said, no. of HDFS blocks =  no. of input splits. But this is only
>>> true when you set isSplittable() as false or when your input file size is
>>> less than the block size. Also, when it comes to text files, the default
>>> textinputformat considers each line as one input split which can be then
>>> read by RecordReader in K,V format.
>>>
>>> Please correct me if I don't make sense.
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>
>>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>>
>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>
>>>> Mapper is the name of the abstract class/interface. It does not really
>>>> make sense to talk about number of mappers.
>>>> A task is a jvm that can be launched only if there is a free slot ie
>>>> for a given slot, at a given time, there will be at maximum only a single
>>>> task. During the task, the configured Mapper will be instantiated.
>>>>
>>>> Always :
>>>> Number of input splits = no. of map tasks
>>>>
>>>> And generally :
>>>> number of hdfs blocks = number of input splits
>>>>
>>>> Regards
>>>>
>>>> Bertrand
>>>>
>>>> PS : I don't know if it is only my client, but avoid red when writting
>>>> a mail.
>>>>
>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>>
>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>> hosts as mapper. A mapper executes map tasks. If there are more map tasks
>>>>> than slots obviously there will be multiple rounds of mapping.
>>>>>
>>>>> The map function is called once for each input record. A block is
>>>>> typically 64MB and can contain a multitude of record, therefore a map task
>>>>> = run the map() function on all records in the block.
>>>>>
>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>
>>>>> Furthermore you have to make a distinction between the two layers. You
>>>>> have a layer for computations which consists of a jobtracker and a set of
>>>>> tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>> namenode and a set of datanodes.
>>>>>
>>>>> In mapreduce the code is executed where the data is. So if a block is
>>>>> on datanodes 1, 2 and 3, then the map task associated with this block
>>>>> will likely be executed on one of those physical nodes, by tasktracker
>>>>> 1, 2 or 3. But this is not strictly necessary; things can be rearranged.
>>>>>
>>>>> Hopefully this gives you a little more insight.
>>>>>
>>>>> Regards, Dieter
>>>>>
>>>>>
>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>>>>
>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>> many no. of times the map() function will be called right?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> As per the various articles I went through till date, the File(s)
>>>>>>> are split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>>
>>>>>>>
>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>    will be invoked. Right?
>>>>>>>    2. If yes, it means, the map() will be called only once. Right?
>>>>>>>    In this case, if there are two datanodes with a replication factor as 1:
>>>>>>>    only one datanode(mapper machine) will perform the task. Right?
>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Rajesh Nagaraju <ra...@gmail.com>.
One simple way is to remove the newline characters, so that the default
record reader and the default way the block is read take care of the input
splits, and the JSON is not affected by the removal of the NL characters.
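
Read as putting each JSON record on a single line, here is a rough local
pre-processing sketch of that idea (the class name and the brace-counting
heuristic are mine, and the heuristic is naive about braces inside string
literals):

    import java.io.*;

    public class OneRecordPerLine {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
                StringBuilder record = new StringBuilder();
                int depth = 0, c;
                while ((c = in.read()) != -1) {
                    if (c == '\n' || c == '\r') c = ' ';  // drop NLs inside a record
                    if (c == '{') depth++;
                    if (depth > 0) record.append((char) c);
                    if (c == '}' && --depth == 0) {       // top-level object closed
                        out.println(record);              // emit it as one line
                        record.setLength(0);
                    }
                }
            }
        }
    }

After this, TextInputFormat's line-oriented record reader hands each
"Feature" to map() whole, whichever split it lands in.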


On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Ok. Got it. Now I have a single file which is 129MB, so it will be split
> into two blocks. Now, since my file is a JSON file, I cannot use
> TextInputFormat, as every (logical) input split will be a single line of
> the JSON file, which I don't want. In this case, can I write a custom
> input format and a custom record reader so that every (logical) input
> split will have only that part of the data which I require?
>
> For example:
>
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
> ,
> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>
> *I want every input split here to consist of entire "Feature" records, so
> that I can process them accordingly by giving relevant K,V pairs to the map
> function.*
>
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Hi Sugandha,
>>
>> Please find my comments embedded below :
>>
>>                   No. of mappers are decided as: Total_File_Size/Max.
>> Block Size. Thus, if the file is smaller than the block size, only one
>> mapper will be invoked. Right?
>>                   This is true (but not always). The basic criterion
>> behind map creation is the logic inside the *getSplits* method of the
>> *InputFormat* being used in your MR job. It is the behavior of *file
>> based InputFormats*, typically sub-classes of *FileInputFormat*, to split
>> the input data into splits based on the total size, in bytes, of the
>> input files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>> for more details. And yes, if the file is smaller than the block size
>> then only 1 mapper will be created.
>>
>>                   If yes, it means, the map() will be called only once.
>> Right? In this case, if there are two datanodes with a replication factor
>> as 1: only one datanode (mapper machine) will perform the task. Right?
>>                   A mapper is called for each split. Don't get confused
>> with MR's split and HDFS's block. Both are different (they may overlap
>> though, as in the case of FileInputFormat). HDFS blocks are a physical
>> partitioning of your data, while an InputSplit is just a logical
>> partitioning. If you have a file which is smaller than the HDFS block
>> size then only one split will be created, hence only 1 mapper will be
>> called. And this will happen on the node where this file resides.
>>
>>                   The map() function is called by all the
>> datanodes/slaves right? If the no. of mappers are more than the no. of
>> slaves, what happens?
>>                   map() doesn't get called by anybody. Rather, the mapper
>> gets created on the node where the chunk of data to be processed
>> resides. A slave node can run multiple mappers based on the
>> availability of CPU slots.
>>
>>                  One more thing to ask: No. of blocks = no. of mappers.
>> Thus, those many no. of times the map() function will be called right?
>>                  No. of blocks = no. of splits = no. of mappers. A map is
>> called only once per split per node where that split is present.
>>
>> HTH
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Hi Bertrand,
>>>
>>> As you said, no. of HDFS blocks = no. of input splits. But this is only
>>> true when you set isSplittable() as false or when your input file size is
>>> less than the block size. Also, when it comes to text files, the default
>>> TextInputFormat considers each line as one input split which can then be
>>> read by the RecordReader in K,V format.
>>>
>>> Please correct me if I don't make sense.
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>
>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>
>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>
>>>> Mapper is the name of the abstract class/interface. It does not really
>>>> make sense to talk about the number of mappers.
>>>> A task is a JVM that can be launched only if there is a free slot, i.e.
>>>> for a given slot, at a given time, there will be at most a single
>>>> task. During the task, the configured Mapper will be instantiated.
>>>>
>>>> Always :
>>>> Number of input splits = no. of map tasks
>>>>
>>>> And generally :
>>>> number of hdfs blocks = number of input splits
>>>>
>>>> Regards
>>>>
>>>> Bertrand
>>>>
>>>> PS : I don't know if it is only my client, but avoid red when writing
>>>> a mail.
>>>>
>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>>
>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>> hosts a mapper. A mapper executes map tasks. If there are more map tasks
>>>>> than slots, obviously there will be multiple rounds of mapping.
>>>>>
>>>>> The map function is called once for each input record. A block is
>>>>> typically 64MB and can contain a multitude of records; therefore a map
>>>>> task = running the map() function on all records in the block.
>>>>>
>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>
>>>>> Furthermore you have to make a distinction between the two layers. You
>>>>> have a layer for computations which consists of a jobtracker and a set of
>>>>> tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>>> namenode and a set of datanodes.
>>>>>
>>>>> In mapreduce the code is executed where the data is. So if a block is
>>>>> on datanodes 1, 2 and 3, then the map task associated with this block
>>>>> will likely be executed on one of those physical nodes, by tasktracker
>>>>> 1, 2 or 3. But this is not strictly necessary; things can be rearranged.
>>>>>
>>>>> Hopefully this gives you a little more insight.
>>>>>
>>>>> Regards, Dieter
>>>>>
>>>>>
>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>>>>
>>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>>> many no. of times the map() function will be called right?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> As per the various articles I went through till date, the File(s)
>>>>>>> are split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>>
>>>>>>>
>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>>    will be invoked. Right?
>>>>>>>    2. If yes, it means, the map() will be called only once. Right?
>>>>>>>    In this case, if there are two datanodes with a replication factor as 1:
>>>>>>>    only one datanode(mapper machine) will perform the task. Right?
>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>> Sugandha Naolekar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
Ok. Got it. Now I have a single file which is 129MB, so it will be split
into two blocks. Now, since my file is a JSON file, I cannot use
TextInputFormat, as every (logical) input split will be a single line of
the JSON file, which I don't want. In this case, can I write a custom
input format and a custom record reader so that every (logical) input
split will have only that part of the data which I require?

For example:

{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
"OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
"REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
"F" }, "geometry": { "type": "LineString", "coordinates": [ [
8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
,
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
"OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
"REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
"START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
"REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
"geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }

*I want every input split here to consist of entire "Feature" records, so
that I can process them accordingly by giving relevant K,V pairs to the map
function.*
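
Roughly what I have in mind, as a bare-bones sketch (the class names are
mine; splitting is disabled so that a record never straddles two splits, at
the cost of reading each file with a single mapper, and the brace counting
is naive about braces inside string literals):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class JsonFeatureInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // keep every JSON object intact: one split per file
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new JsonFeatureRecordReader();
        }

        public static class JsonFeatureRecordReader
                extends RecordReader<LongWritable, Text> {
            private FSDataInputStream in;
            private long pos, end;
            private final LongWritable key = new LongWritable();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext ctx)
                    throws IOException {
                FileSplit fileSplit = (FileSplit) split;
                Configuration conf = ctx.getConfiguration();
                in = fileSplit.getPath().getFileSystem(conf).open(fileSplit.getPath());
                pos = fileSplit.getStart();
                end = pos + fileSplit.getLength();
                in.seek(pos);
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                // Accumulate one brace-balanced top-level object per call,
                // skipping the commas and whitespace between objects.
                StringBuilder record = new StringBuilder();
                int depth = 0, c;
                while (pos < end && (c = in.read()) != -1) {
                    pos++;
                    if (c == '{') depth++;
                    if (depth > 0) record.append((char) c);
                    if (c == '}' && --depth == 0) {
                        key.set(pos);                  // byte offset as the key
                        value.set(record.toString());  // one whole "Feature"
                        return true;
                    }
                }
                return false;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() {
                return end == 0 ? 1f : Math.min(1f, pos / (float) end);
            }
            @Override public void close() throws IOException { in.close(); }
        }
    }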


--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <do...@gmail.com> wrote:

> Hi Sugandha,
>
> Please find my comments embedded below :
>
>                   No. of mappers are decided as: Total_File_Size/Max.
> Block Size. Thus, if the file is smaller than the block size, only one
> mapper will be invoked. Right?
>                   This is true (but not always). The basic criterion
> behind map creation is the logic inside the *getSplits* method of the
> *InputFormat* being used in your MR job. It is the behavior of *file
> based InputFormats*, typically sub-classes of *FileInputFormat*, to split
> the input data into splits based on the total size, in bytes, of the
> input files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
> for more details. And yes, if the file is smaller than the block size
> then only 1 mapper will be created.
>
>                   If yes, it means, the map() will be called only once.
> Right? In this case, if there are two datanodes with a replication factor
> as 1: only one datanode (mapper machine) will perform the task. Right?
>                   A mapper is called for each split. Don't get confused
> with MR's split and HDFS's block. Both are different (they may overlap
> though, as in the case of FileInputFormat). HDFS blocks are a physical
> partitioning of your data, while an InputSplit is just a logical
> partitioning. If you have a file which is smaller than the HDFS block
> size then only one split will be created, hence only 1 mapper will be
> called. And this will happen on the node where this file resides.
>
>                   The map() function is called by all the datanodes/slaves
> right? If the no. of mappers are more than the no. of slaves, what happens?
>                   map() doesn't get called by anybody. Rather, the mapper
> gets created on the node where the chunk of data to be processed resides.
> A slave node can run multiple mappers based on the availability of CPU
> slots.
>
>                  One more thing to ask: No. of blocks = no. of mappers.
> Thus, those many no. of times the map() function will be called right?
>                  No. of blocks = no. of splits = no. of mappers. A map is
> called only once per split per node where that split is present.
>
> HTH
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <sugandha.n87@gmail.com
> > wrote:
>
>> Hi Bertrand,
>>
>> As you said, no. of HDFS blocks = no. of input splits. But this is only
>> true when you set isSplittable() as false or when your input file size is
>> less than the block size. Also, when it comes to text files, the default
>> TextInputFormat considers each line as one input split which can then be
>> read by the RecordReader in K,V format.
>>
>> Please correct me if I don't make sense.
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>
>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>
>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>
>>> Mapper is the name of the abstract class/interface. It does not really
>>> make sense to talk about the number of mappers.
>>> A task is a JVM that can be launched only if there is a free slot, i.e.
>>> for a given slot, at a given time, there will be at most a single task.
>>> During the task, the configured Mapper will be instantiated.
>>>
>>> Always :
>>> Number of input splits = no. of map tasks
>>>
>>> And generally :
>>> number of hdfs blocks = number of input splits
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> PS : I don't know if it is only my client, but avoid red when writing a
>>> mail.
>>>
>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>>
>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>> hosts a mapper. A mapper executes map tasks. If there are more map tasks
>>>> than slots, obviously there will be multiple rounds of mapping.
>>>>
>>>> The map function is called once for each input record. A block is
>>>> typically 64MB and can contain a multitude of records; therefore a map
>>>> task = running the map() function on all records in the block.
>>>>
>>>> Number of blocks = no. of map tasks (not mappers)
>>>>
>>>> Furthermore you have to make a distinction between the two layers. You
>>>> have a layer for computations which consists of a jobtracker and a set of
>>>> tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>> namenode and a set of datanodes.
>>>>
>>>> In mapreduce the code is executed where the data is. So if a block is
>>>> on datanodes 1, 2 and 3, then the map task associated with this block
>>>> will likely be executed on one of those physical nodes, by tasktracker
>>>> 1, 2 or 3. But this is not strictly necessary; things can be rearranged.
>>>>
>>>> Hopefully this gives you a little more insight.
>>>>
>>>> Regards, Dieter
>>>>
>>>>
>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>>>
>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>> many no. of times the map() function will be called right?
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> As per the various articles I went through till date, the File(s) are
>>>>>> split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>
>>>>>>
>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>    Size. Thus, if the file is smaller than the block size, only one mapper
>>>>>>    will be invoked. Right?
>>>>>>    2. If yes, it means, the map() will be called only once. Right?
>>>>>>    In this case, if there are two datanodes with a replication factor as 1:
>>>>>>    only one datanode(mapper machine) will perform the task. Right?
>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>    right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Mohammad Tariq <do...@gmail.com>.
Hi Sugandha,

Please find my comments embedded below :

                  No. of mappers are decided as: Total_File_Size/Max. Block
Size. Thus, if the file is smaller than the block size, only one mapper
will be invoked. Right?
                  This is true (but not always). The basic criterion behind
map creation is the logic inside the *getSplits* method of the *InputFormat*
being used in your MR job. It is the behavior of *file based
InputFormats*, typically sub-classes of *FileInputFormat*, to split the
input data into splits based on the total size, in bytes, of the input
files. See
*this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
for more details. And yes, if the file is smaller than the block size then
only 1 mapper will be created.
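
In code terms, the arithmetic that those file based InputFormats apply
boils down to something like this (a paraphrase of the usual logic, not the
actual Hadoop source; the configuration knobs and their defaults vary
across versions):

    // Under default min/max split settings this returns the HDFS block
    // size, which is why "no. of blocks = no. of splits" holds in the
    // common case.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }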

                  If yes, it means, the map() will be called only once.
Right? In this case, if there are two datanodes with a replication factor
as 1: only one                               datanode(mapper machine) will
perform the task. Right?
                  A mapper is called for each split. Don't get confused
with the MR's split and HDFS's block. Both are different(They may overlap
though, as in                     case of FileInputFormat). HDFS blocks are
physical partitioning of your data, while an InputSplit is just a logical
partitioning. If you have a                       file which is smaller
than the HDFS blocksize then only one split will be created, hence only 1
mapper will be called. And this will happen on                     the node
where this file resides.

                  The map() function is called by all the datanodes/slaves
right? If the no. of mappers are more than the no. of slaves, what happens?
                  map() doesn't get called by anybody. It rather gets
created on the node where the chunk of data to be processed resides. A
slave node can run                       multiple mappers based on the
availability of CPU slots.

                 One more thing to ask: No. of blocks = no. of mappers.
Thus, those many no. of times the map() function will be called right?
                 No. of blocks = no. of splits = no. of mappers. A map is
called only once per split per node where that split is present.
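
To make the split arithmetic concrete, here is a minimal sketch of the
file-based split computation described above. This is an illustration only,
not the actual Hadoop source; the SplitSketch class and its helpers are
made up for the example. In Hadoop's FileInputFormat the split size is
computed roughly as max(minSize, min(maxSize, blockSize)); here it is just
a parameter.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of FileInputFormat-style split computation.
class SplitSketch {

    // Chop a file of fileLength bytes into splits of at most splitSize
    // bytes; each split is recorded as {startOffset, length}.
    static List<long[]> getSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            long len = Math.min(splitSize, remaining);
            splits.add(new long[] { fileLength - remaining, len });
            remaining -= len;
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64MB, the old default
        // A 10MB file is smaller than the block size -> 1 split -> 1 map task.
        System.out.println(getSplits(10L * 1024 * 1024, blockSize).size());
        // A 200MB file -> ceil(200/64) = 4 splits -> 4 map tasks.
        System.out.println(getSplits(200L * 1024 * 1024, blockSize).size());
    }
}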

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com


On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Hi Bertrand,
>
> As you said, no. of HDFS blocks =  no. of input splits. But this is only
> true when you set isSplittable() as false or when your input file size is
> less than the block size. Also, when it comes to text files, the default
> textinputformat considers each line as one input split which can be then
> read by RecordReader in K,V format.
>
> Please correct me if I don't make sense.
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>
>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>
>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>
>> Mapper is the name of the abstract class/interface. It does not really
>> make sense to talk about number of mappers.
>> A task is a jvm that can be launched only if there is a free slot ie for
>> a given slot, at a given time, there will be at maximum only a single task.
>> During the task, the configured Mapper will be instantiated.
>>
>> Always :
>> Number of input splits = no. of map tasks
>>
>> And generally :
>> number of hdfs blocks = number of input splits
>>
>> Regards
>>
>> Bertrand
>>
>> PS : I don't know if it is only my client, but avoid red when writting a
>> mail.
>>
>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>>
>>> Each node has a tasktracker with a number of map slots. A map slot hosts
>>> as mapper. A mapper executes map tasks. If there are more map tasks than
>>> slots obviously there will be multiple rounds of mapping.
>>>
>>> The map function is called once for each input record. A block is
>>> typically 64MB and can contain a multitude of record, therefore a map task
>>> = run the map() function on all records in the block.
>>>
>>> Number of blocks = no. of map tasks (not mappers)
>>>
>>> Furthermore you have to make a distinction between the two layers. You
>>> have a layer for computations which consists of a jobtracker and a set of
>>> tasktrackers. The other layer is responsible for storage. The HDFS has a
>>> namenode and a set of datanodes.
>>>
>>> In mapreduce the code is executed where the data is. So if a block is in
>>> datanode 1, 2 and 3, then the map task associated with this block will
>>> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
>>> 3. But this is not necessary, thing can be rearranged.
>>>
>>> Hopefully this gives you a little more insigth.
>>>
>>> Regards, Dieter
>>>
>>>
>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>>
>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those many
>>>> no. of times the map() function will be called right?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> As per the various articles I went through till date, the File(s) are
>>>>> split in chunks/blocks. On the same note, would like to ask few things:
>>>>>
>>>>>
>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block Size.
>>>>>    Thus, if the file is smaller than the block size, only one mapper will be
>>>>>    invoked. Right?
>>>>>    2. If yes, it means, the map() will be called only once. Right? In
>>>>>    this case, if there are two datanodes with a replication factor as 1: only
>>>>>    one datanode(mapper machine) will perform the task. Right?
>>>>>    3. The map() function is called by all the datanodes/slaves right?
>>>>>    If the no. of mappers are more than the no. of slaves, what happens?
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
Hi Bertrand,

As you said, no. of HDFS blocks = no. of input splits. But this is only
true when you set isSplittable() as false or when your input file size is
less than the block size. Also, when it comes to text files, the default
TextInputFormat considers each line as one input split which can then be
read by the RecordReader in K,V format.

Please correct me if I don't make sense.

--
Thanks & Regards,
Sugandha Naolekar





On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <de...@gmail.com>wrote:

> The wiki (or Hadoop The Definitive Guide) are good ressources.
>
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>
> Mapper is the name of the abstract class/interface. It does not really
> make sense to talk about number of mappers.
> A task is a jvm that can be launched only if there is a free slot ie for a
> given slot, at a given time, there will be at maximum only a single task.
> During the task, the configured Mapper will be instantiated.
>
> Always :
> Number of input splits = no. of map tasks
>
> And generally :
> number of hdfs blocks = number of input splits
>
> Regards
>
> Bertrand
>
> PS : I don't know if it is only my client, but avoid red when writting a
> mail.
>
> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com>wrote:
>
>> Each node has a tasktracker with a number of map slots. A map slot hosts
>> as mapper. A mapper executes map tasks. If there are more map tasks than
>> slots obviously there will be multiple rounds of mapping.
>>
>> The map function is called once for each input record. A block is
>> typically 64MB and can contain a multitude of record, therefore a map task
>> = run the map() function on all records in the block.
>>
>> Number of blocks = no. of map tasks (not mappers)
>>
>> Furthermore you have to make a distinction between the two layers. You
>> have a layer for computations which consists of a jobtracker and a set of
>> tasktrackers. The other layer is responsible for storage. The HDFS has a
>> namenode and a set of datanodes.
>>
>> In mapreduce the code is executed where the data is. So if a block is in
>> datanode 1, 2 and 3, then the map task associated with this block will
>> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
>> 3. But this is not necessary, thing can be rearranged.
>>
>> Hopefully this gives you a little more insigth.
>>
>> Regards, Dieter
>>
>>
>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>
>> One more thing to ask: No. of blocks = no. of mappers. Thus, those many
>>> no. of times the map() function will be called right?
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> As per the various articles I went through till date, the File(s) are
>>>> split in chunks/blocks. On the same note, would like to ask few things:
>>>>
>>>>
>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block Size.
>>>>    Thus, if the file is smaller than the block size, only one mapper will be
>>>>    invoked. Right?
>>>>    2. If yes, it means, the map() will be called only once. Right? In
>>>>    this case, if there are two datanodes with a replication factor as 1: only
>>>>    one datanode(mapper machine) will perform the task. Right?
>>>>    3. The map() function is called by all the datanodes/slaves right?
>>>>    If the no. of mappers are more than the no. of slaves, what happens?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Bertrand Dechoux <de...@gmail.com>.
The wiki (or Hadoop The Definitive Guide) are good resources.
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats

Mapper is the name of the abstract class/interface. It does not really make
sense to talk about the number of mappers.
A task is a JVM that can be launched only if there is a free slot, i.e. for
a given slot, at a given time, there will be at most a single task.
During the task, the configured Mapper will be instantiated.

Always :
Number of input splits = no. of map tasks

And generally :
number of hdfs blocks = number of input splits
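
If you want to check the first equality yourself, here is a minimal sketch
that prints how many map tasks an input path would produce. It assumes the
Hadoop 2.x mapreduce API is on the classpath; CountSplits is just an
illustrative name.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CountSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Each InputSplit returned here corresponds to exactly one map task.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("input splits = map tasks: " + splits.size());
    }
}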

Regards

Bertrand

PS : I don't know if it is only my client, but please avoid red when
writing an email.

On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <dr...@gmail.com> wrote:

> Each node has a tasktracker with a number of map slots. A map slot hosts
> as mapper. A mapper executes map tasks. If there are more map tasks than
> slots obviously there will be multiple rounds of mapping.
>
> The map function is called once for each input record. A block is
> typically 64MB and can contain a multitude of record, therefore a map task
> = run the map() function on all records in the block.
>
> Number of blocks = no. of map tasks (not mappers)
>
> Furthermore you have to make a distinction between the two layers. You
> have a layer for computations which consists of a jobtracker and a set of
> tasktrackers. The other layer is responsible for storage. The HDFS has a
> namenode and a set of datanodes.
>
> In mapreduce the code is executed where the data is. So if a block is in
> datanode 1, 2 and 3, then the map task associated with this block will
> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
> 3. But this is not necessary, thing can be rearranged.
>
> Hopefully this gives you a little more insigth.
>
> Regards, Dieter
>
>
> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>
> One more thing to ask: No. of blocks = no. of mappers. Thus, those many
>> no. of times the map() function will be called right?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> As per the various articles I went through till date, the File(s) are
>>> split in chunks/blocks. On the same note, would like to ask few things:
>>>
>>>
>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block Size.
>>>    Thus, if the file is smaller than the block size, only one mapper will be
>>>    invoked. Right?
>>>    2. If yes, it means, the map() will be called only once. Right? In
>>>    this case, if there are two datanodes with a replication factor as 1: only
>>>    one datanode(mapper machine) will perform the task. Right?
>>>    3. The map() function is called by all the datanodes/slaves right?
>>>    If the no. of mappers are more than the no. of slaves, what happens?
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>
>

Re: Mappers vs. Map tasks

Posted by Dieter De Witte <dr...@gmail.com>.
Each node has a tasktracker with a number of map slots. A map slot hosts a
mapper. A mapper executes map tasks. If there are more map tasks than
slots, there will obviously be multiple rounds of mapping.

The map function is called once for each input record. A block is typically
64MB and can contain a multitude of records; therefore a map task = running
the map() function on all records in the block.

Number of blocks = no. of map tasks (not mappers)
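
To make "called once for each input record" concrete, here is a minimal
Mapper sketch (assuming the Hadoop 2.x mapreduce API; LineLengthMapper is
an illustrative name) in which map() runs once per line of the split:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat a record is one line, keyed by its byte offset.
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Called once per line; a 64MB block of text means many such
        // calls inside a single map task.
        context.write(new Text("length"), new IntWritable(line.getLength()));
    }
}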

Furthermore you have to make a distinction between the two layers. You have
a layer for computations which consists of a jobtracker and a set of
tasktrackers. The other layer is responsible for storage. The HDFS has a
namenode and a set of datanodes.

In mapreduce the code is executed where the data is. So if a block is on
datanodes 1, 2 and 3, then the map task associated with this block will
likely be executed on one of those physical nodes, by tasktracker 1, 2 or
3. But this is not necessary; things can be rearranged.

Hopefully this gives you a little more insight.

Regards, Dieter


2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:

> One more thing to ask: No. of blocks = no. of mappers. Thus, those many
> no. of times the map() function will be called right?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
> sugandha.n87@gmail.com> wrote:
>
>> Hello,
>>
>> As per the various articles I went through till date, the File(s) are
>> split in chunks/blocks. On the same note, would like to ask few things:
>>
>>
>>    1. No. of mappers are decided as: Total_File_Size/Max. Block Size.
>>    Thus, if the file is smaller than the block size, only one mapper will be
>>    invoked. Right?
>>    2. If yes, it means, the map() will be called only once. Right? In
>>    this case, if there are two datanodes with a replication factor as 1: only
>>    one datanode(mapper machine) will perform the task. Right?
>>    3. The map() function is called by all the datanodes/slaves right? If
>>    the no. of mappers are more than the no. of slaves, what happens?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>

Re: Mappers vs. Map tasks

Posted by Sugandha Naolekar <su...@gmail.com>.
One more thing to ask: no. of blocks = no. of mappers, so the map() function
will be called that many times, right?

--
Thanks & Regards,
Sugandha Naolekar





On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar
<su...@gmail.com>wrote:

> Hello,
>
> As per the various articles I went through till date, the File(s) are
> split in chunks/blocks. On the same note, would like to ask few things:
>
>
>    1. No. of mappers are decided as: Total_File_Size/Max. Block Size.
>    Thus, if the file is smaller than the block size, only one mapper will be
>    invoked. Right?
>    2. If yes, it means, the map() will be called only once. Right? In
>    this case, if there are two datanodes with a replication factor as 1: only
>    one datanode(mapper machine) will perform the task. Right?
>    3. The map() function is called by all the datanodes/slaves right? If
>    the no. of mappers are more than the no. of slaves, what happens?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
