Posted to common-user@hadoop.apache.org by Marcelo Elias Del Valle <mv...@gmail.com> on 2013/01/28 16:54:00 UTC

number of mapper tasks

Hello,

    I am using Hadoop with TextInputFormat, a mapper, and no reducers. I am
running my jobs on Amazon EMR. When I run my job, I set both of the following
options:
-s,mapred.tasktracker.map.tasks.maximum=10
-jobconf,mapred.map.tasks=10
    When I run my job with just 1 instance, I see it only creates 1 mapper.
When I run my job with 5 instances (1 master and 4 cores), I can see only 2
mapper slots are used and 6 stay open.

     I am trying to figure out why I am not able to run more mappers in
parallel. When I look at the logs, I find some messages like these:

INFO org.apache.hadoop.mapred.ReduceTask (main):
attempt_201301281437_0001_r_000003_0 Scheduled 0 outputs (0 slow hosts
and 0 dup hosts)
INFO org.apache.hadoop.mapred.ReduceTask (main):
attempt_201301281437_0001_r_000003_0 Need another 1 map output(s)
where 0 is already in progress

    Any hints? They would be highly appreciated.

Best regards,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: number of mapper tasks

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Tried looking at your code; it's a bit involved. Instead of trying to run
the job, try unit-testing your input format. Test getSplits(): whatever
number of splits that method returns will be the number of mappers that
run.

You can also use LocalJobRunner for this - set mapred.job.tracker to
local and run your job locally on your machine instead of on a
cluster.
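[Editor's note: as a rough aid for such a unit test, the relationship between input size and mapper count in the default FileInputFormat can be approximated in plain Java. This is a simplified sketch of the split arithmetic, not the actual Hadoop implementation, and the method names are mine:]

```java
public class SplitMath {
    // Simplified sketch (an approximation, not the Hadoop source) of
    // FileInputFormat's split sizing: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // The number of splits - and hence mappers - is roughly
    // ceil(fileLength / splitSize).
    static long estimateSplits(long fileLength, long splitSize) {
        if (fileLength == 0) {
            return 1;
        }
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1, Long.MAX_VALUE);
        // A 100 MB file with 64 MB splits yields 2 map tasks.
        System.out.println(estimateSplits(100 * mb, splitSize));
    }
}
```

[In this model, changing the min/max split size settings moves splitSize and therefore the mapper count, which is why those settings matter while mapred.map.tasks is only a hint.]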

HTH,
+Vinod



On Tue, Jan 29, 2013 at 4:53 AM, Marcelo Elias Del Valle <mvallebr@gmail.com
> wrote:

> Hello,
>
>     I have been able to make this work. I don't know why, but when the
> input file is zipped (read as an input stream) it creates only 1 mapper.
> However, when it's not zipped, it creates more mappers (running 3 instances
> it created 4 mappers and running 5 instances, it created 8 mappers).
>     I really would like to know why this happens and even with this number
> of mappers, I would like to know why more mappers aren't created. I was
> reading part of the book "Hadoop - The definitive guide" (
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats)
> which says:
>
> "The JobClient calls the getSplits() method, passing the desired number
> of map tasks as the numSplits argument. This number is treated as a hint,
> as InputFormat implementations are free to return a different number of
> splits to the number specified in numSplits. Having calculated the
> splits, the client sends them to the jobtracker, which uses their storage
> locations to schedule map tasks to process them on the tasktrackers. ..."
>
>      I am not sure on how to get more info.
>
>      Would you recommend me to try to find the answer on the book? Or
> should I read hadoop source code directly?
>
> Best regards,
> Marcelo.
>
>
> 2013/1/29 Marcelo Elias Del Valle <mv...@gmail.com>
>
>> I implemented my custom input format. Here is how I used it:
>>
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java
>>
>> As you can see, I do:
>> importerJob.setInputFormatClass(CSVNLineInputFormat.class);
>>
>> And here is the Input format and the linereader:
>>
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>>
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java
>>
>> In this input format, I completely ignore these other parameters and get
>> the splits by the number of lines. The amount of lines per map can be
>> controlled by the same parameter used in NLineInputFormat:
>>
>> public static final String LINES_PER_MAP =
>> "mapreduce.input.lineinputformat.linespermap";
>> However, it has really no effect on the number of maps.
>>
>>
>>
>> 2013/1/29 Vinod Kumar Vavilapalli <vi...@hortonworks.com>
>>
>>>
>>> Regarding your original question, you can use the min and max split
>>> settings to control the number of maps:
>>> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html. See #setMinInputSplitSize and #setMaxInputSplitSize. Or
>>> use mapred.min.split.size directly.
>>>
>>> W.r.t your custom InputFormat, are you sure your job is using this
>>> InputFormat and not the default one?
>>>
>>>  HTH,
>>> +Vinod Kumar Vavilapalli
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>>>
>>> Just to complement the last question, I have implemented the getSplits
>>> method in my input format:
>>>
>>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>>>
>>> However, it still doesn't create more than 2 map tasks. Is there
>>> something I could do about it to assure more map tasks are created?
>>>
>>> Thanks
>>> Marcelo.
>>>
>>>
>>> 2013/1/28 Marcelo Elias Del Valle <mv...@gmail.com>
>>>
>>>> Sorry for asking too many questions, but the answers are really
>>>> helping.
>>>>
>>>>
>>>> 2013/1/28 Harsh J <ha...@cloudera.com>
>>>>
>>>>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>>>>
>>>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>>> .
>>>>> This should let you spawn more maps as well, based on your N factor.
>>>>>
>>>>
>>>> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
>>>> Actually, I wrote my own InputFormat, to be able to process multiline
>>>> CSVs: https://github.com/mvallebr/CSVInputFormat
>>>> I could change it to read several lines at a time, but would this alone
>>>> allow more tasks running in parallel?
>>>>
>>>>
>>>>> Not really - "Slots" are capacities, rather than split factors
>>>>> themselves. You can have N slots always available, but your job has to
>>>>> supply as many map tasks (based on its input/needs/etc.) to use them
>>>>> up.
>>>>>
>>>>
>>>> But how can I do that (supply map tasks) in my job? By changing its
>>>> code? The Hadoop config?
>>>>
>>>>
>>>>> Unless your job sets the number of reducers to 0 manually, 1 default
>>>>> reducer is always run that waits to see if it has any outputs from
>>>>> maps. If it does not receive any outputs after maps have all
>>>>> completed, it dies out with behavior equivalent to a NOP.
>>>>>
>>>> OK, I did job.setNumReduceTasks(0); I guess this will solve this part,
>>>> thanks!
>>>>
>>>>
>>>> --
>>>> Marcelo Elias Del Valle
>>>> http://mvalle.com - @mvallebr
>>>>
>>>
>>>
>>>
>>> --
>>> Marcelo Elias Del Valle
>>> http://mvalle.com - @mvallebr
>>>
>>>
>>>
>>
>>
>> --
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
>>
>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
+Vinod
Hortonworks Inc.
http://hortonworks.com/

Re: number of mapper tasks

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
Hello,

    I have been able to make this work. I don't know why, but when the
input file is zipped (read as an input stream) it creates only 1 mapper.
However, when it's not zipped, it creates more mappers (running 3 instances
it created 4 mappers and running 5 instances, it created 8 mappers).
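[Editor's note: the one-mapper behaviour with a zipped input is expected for gzip, which is not a splittable compression format - the whole file has to be read as one stream, so it becomes a single split. A toy sketch of that decision, with hypothetical helper names rather than the real Hadoop code:]

```java
public class GzipSplits {
    // Hypothetical helper mirroring the behaviour observed above: a gzip
    // stream cannot be split mid-file, so a .gz input yields exactly one
    // split (one mapper) no matter how large it is.
    static long numSplits(String fileName, long fileLength, long splitSize) {
        if (fileName.endsWith(".gz")) {
            return 1; // non-splittable codec: the whole file is one split
        }
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(numSplits("input.csv.gz", 512 * mb, 64 * mb)); // 1
        System.out.println(numSplits("input.csv", 512 * mb, 64 * mb));    // 8
    }
}
```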
    I really would like to know why this happens and even with this number
of mappers, I would like to know why more mappers aren't created. I was
reading part of the book "Hadoop - The definitive guide" (
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats)
which says:

"The JobClient calls the getSplits() method, passing the desired number of
map tasks as the numSplits argument. This number is treated as a hint, as
InputFormat implementations are free to return a different number of splits
to the number specified in numSplits. Having calculated the splits, the
client sends them to the jobtracker, which uses their storage locations to
schedule map tasks to process them on the tasktrackers. ..."

     I am not sure on how to get more info.

     Would you recommend me to try to find the answer on the book? Or
should I read hadoop source code directly?

Best regards,
Marcelo.


2013/1/29 Marcelo Elias Del Valle <mv...@gmail.com>

> I implemented my custom input format. Here is how I used it:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java
>
> As you can see, I do:
> importerJob.setInputFormatClass(CSVNLineInputFormat.class);
>
> And here is the Input format and the linereader:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java
>
> In this input format, I completely ignore these other parameters and get
> the splits by the number of lines. The amount of lines per map can be
> controlled by the same parameter used in NLineInputFormat:
>
> public static final String LINES_PER_MAP =
> "mapreduce.input.lineinputformat.linespermap";
> However, it has really no effect on the number of maps.
>
>
>
> 2013/1/29 Vinod Kumar Vavilapalli <vi...@hortonworks.com>
>
>>
>> Regarding your original question, you can use the min and max split
>> settings to control the number of maps:
>> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html. See #setMinInputSplitSize and #setMaxInputSplitSize. Or
>> use mapred.min.split.size directly.
>>
>> W.r.t your custom InputFormat, are you sure your job is using this
>> InputFormat and not the default one?
>>
>>  HTH,
>> +Vinod Kumar Vavilapalli
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>>
>> Just to complement the last question, I have implemented the getSplits
>> method in my input format:
>>
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>>
>> However, it still doesn't create more than 2 map tasks. Is there
>> something I could do about it to assure more map tasks are created?
>>
>> Thanks
>> Marcelo.
>>
>>
>> 2013/1/28 Marcelo Elias Del Valle <mv...@gmail.com>
>>
>>> Sorry for asking too many questions, but the answers are really
>>> helping.
>>>
>>>
>>> 2013/1/28 Harsh J <ha...@cloudera.com>
>>>
>>>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>>>
>>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>> .
>>>> This should let you spawn more maps as well, based on your N factor.
>>>>
>>>
>>> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
>>> Actually, I wrote my own InputFormat, to be able to process multiline
>>> CSVs: https://github.com/mvallebr/CSVInputFormat
>>> I could change it to read several lines at a time, but would this alone
>>> allow more tasks running in parallel?
>>>
>>>
>>>> Not really - "Slots" are capacities, rather than split factors
>>>> themselves. You can have N slots always available, but your job has to
>>>> supply as many map tasks (based on its input/needs/etc.) to use them
>>>> up.
>>>>
>>>
>>> But how can I do that (supply map tasks) in my job? By changing its
>>> code? The Hadoop config?
>>>
>>>
>>>> Unless your job sets the number of reducers to 0 manually, 1 default
>>>> reducer is always run that waits to see if it has any outputs from
>>>> maps. If it does not receive any outputs after maps have all
>>>> completed, it dies out with behavior equivalent to a NOP.
>>>>
>>> OK, I did job.setNumReduceTasks(0); I guess this will solve this part,
>>> thanks!
>>>
>>>
>>> --
>>> Marcelo Elias Del Valle
>>> http://mvalle.com - @mvallebr
>>>
>>
>>
>>
>> --
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
>>
>>
>>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
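[Editor's note: for line-based splitting like the LINES_PER_MAP parameter ("mapreduce.input.lineinputformat.linespermap") quoted in this post, the mapper count follows directly from the lines-per-map setting. A minimal sketch - my own simplified model, not the NLineInputFormat source:]

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitter {
    // Simplified model of NLineInputFormat-style splitting: every
    // linesPerMap consecutive lines form one split, so the job runs
    // ceil(totalLines / linesPerMap) mappers.
    static List<int[]> splitsByLines(int totalLines, int linesPerMap) {
        List<int[]> splits = new ArrayList<>();
        for (int start = 0; start < totalLines; start += linesPerMap) {
            int end = Math.min(start + linesPerMap, totalLines);
            splits.add(new int[] {start, end}); // half-open range [start, end)
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10 input lines with 3 lines per map -> 4 splits -> 4 mappers.
        System.out.println(splitsByLines(10, 3).size());
    }
}
```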

Re: number of mapper tasks

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
Hello,

    I have been able to make this work. I don't know why, but when but
input file is zipped (read as a input stream) it creates only 1 mapper.
However, when it's not zipped, it creates more mappers (running 3 instances
it created 4 mappers and running 5 instances, it created 8 mappers).
    I really would like to know why this happens and even with this number
of mappers, I would like to know why more mappers aren't created. I was
reading part of the book "Hadoop - The definitive guide" (
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats)
which says:

"The JobClient calls the getSplits() method, passing the desired number of
map tasks as the numSplits argument. This number is treated as a hint, as
InputFormat implementations are free to return a different number of splits
to the number specified in numSplits. Having calculated the splits, the
client sends them to the jobtracker, which uses their storage locations to
schedule map tasks to process them on the tasktrackers. ..."

     I am not sure on how to get more info.

     Would you recommend me to try to find the answer on the book? Or
should I read hadoop source code directly?

Best regards,
Marcelo.


2013/1/29 Marcelo Elias Del Valle <mv...@gmail.com>

> I implemented my custom input format. Here is how I used it:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java
>
> As you can see, I do:
> importerJob.setInputFormatClass(CSVNLineInputFormat.class);
>
> And here is the Input format and the linereader:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java
>
> In this input format, I completely ignore these other parameters and get
> the splits by the number of lines. The amount of lines per map can be
> controlled by the same parameter used in NLineInputFormat:
>
> public static final String LINES_PER_MAP =
> "mapreduce.input.lineinputformat.linespermap";
> However, it has really no effect on the number of maps.
>
>
>
> 2013/1/29 Vinod Kumar Vavilapalli <vi...@hortonworks.com>
>
>>
>> Regarding your original question, you can use the min and max split
>> settings to control the number of maps:
>> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html. See #setMinInputSplitSize and #setMaxInputSplitSize. Or
>> use mapred.min.split.size directly.
>>
>> W.r.t your custom inputformat, are you sure your job is using this
>> InputFormat and not the default one?
>>
>>  HTH,
>> +Vinod Kumar Vavilapalli
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>>
>> Just to complement the last question, I have implemented the getSplits
>> method in my input format:
>>
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>>
>> However, it still doesn't create more than 2 map tasks. Is there
>> something I could do about it to assure more map tasks are created?
>>
>> Thanks
>> Marcelo.
>>
>>
>> 2013/1/28 Marcelo Elias Del Valle <mv...@gmail.com>
>>
>>> Sorry for asking too many questions, but the answers are really
>>> helping.
>>>
>>>
>>> 2013/1/28 Harsh J <ha...@cloudera.com>
>>>
>>>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>>>
>>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>> .
>>>> This should let you spawn more maps as well, based on your N factor.
>>>>
>>>
>>> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
>>> Actually, I wrote my own InputFormat, to be able to process multiline
>>> CSVs: https://github.com/mvallebr/CSVInputFormat
>>> I could change it to read several lines at a time, but would this alone
>>> allow more tasks running in parallel?
>>>
>>>
>>>> Not really - "Slots" are capacities, rather than split factors
>>>> themselves. You can have N slots always available, but your job has to
>>>> supply as many map tasks (based on its input/needs/etc.) to use them
>>>> up.
>>>>
>>>
>>> But how can I do that (supply map tasks) in my job? changing its code?
>>> hadoop config?
>>>
>>>
>>>> Unless your job sets the number of reducers to 0 manually, 1 default
>>>> reducer is always run that waits to see if it has any outputs from
>>>> maps. If it does not receive any outputs after maps have all
>>>> completed, it dies out with behavior equivalent to a NOP.
>>>>
>>> Ok, I did job.setNumReduceTasks(0); I guess this will solve this part,
>>> thanks!
>>>
>>>
>>> --
>>> Marcelo Elias Del Valle
>>> http://mvalle.com - @mvallebr
>>>
>>
>>
>>
>> --
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
>>
>>
>>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: number of mapper tasks

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
I implemented my custom input format. Here is how I used it:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java

As you can see, I do:
importerJob.setInputFormatClass(CSVNLineInputFormat.class);

And here is the Input format and the linereader:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java

In this input format, I completely ignore these other parameters and get
the splits by the number of lines. The amount of lines per map can be
controlled by the same parameter used in NLineInputFormat:

public static final String LINES_PER_MAP =
"mapreduce.input.lineinputformat.linespermap";
However, it has really no effect on the number of maps.
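When the format does honor linespermap, the resulting number of map tasks is just the line count divided by that setting, rounded up — a quick sanity check, with illustrative numbers:

```java
public class LinesPerMap {
    // NLineInputFormat-style splitting: one split per N lines, so
    // expected map tasks = ceil(totalLines / linesPerMap).
    static long expectedMapTasks(long totalLines, long linesPerMap) {
        return (totalLines + linesPerMap - 1) / linesPerMap;
    }

    public static void main(String[] args) {
        System.out.println(expectedMapTasks(1_000_000L, 100_000L)); // 10
        System.out.println(expectedMapTasks(101L, 100L));           // 2
    }
}
```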



2013/1/29 Vinod Kumar Vavilapalli <vi...@hortonworks.com>

>
> Regarding your original question, you can use the min and max split
> settings to control the number of maps:
> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html. See #setMinInputSplitSize and #setMaxInputSplitSize. Or
> use mapred.min.split.size directly.
>
> W.r.t your custom inputformat, are you sure your job is using this
> InputFormat and not the default one?
>
> HTH,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>
> Just to complement the last question, I have implemented the getSplits
> method in my input format:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>
> However, it still doesn't create more than 2 map tasks. Is there something
> I could do about it to assure more map tasks are created?
>
> Thanks
> Marcelo.
>
>
> 2013/1/28 Marcelo Elias Del Valle <mv...@gmail.com>
>
>> Sorry for asking too many questions, but the answers are really helping.
>>
>>
>> 2013/1/28 Harsh J <ha...@cloudera.com>
>>
>>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>>
>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>> .
>>> This should let you spawn more maps as well, based on your N factor.
>>>
>>
>> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
>> Actually, I wrote my own InputFormat, to be able to process multiline
>> CSVs: https://github.com/mvallebr/CSVInputFormat
>> I could change it to read several lines at a time, but would this alone
>> allow more tasks running in parallel?
>>
>>
>>> Not really - "Slots" are capacities, rather than split factors
>>> themselves. You can have N slots always available, but your job has to
>>> supply as many map tasks (based on its input/needs/etc.) to use them
>>> up.
>>>
>>
>> But how can I do that (supply map tasks) in my job? changing its code?
>> hadoop config?
>>
>>
>>> Unless your job sets the number of reducers to 0 manually, 1 default
>>> reducer is always run that waits to see if it has any outputs from
>>> maps. If it does not receive any outputs after maps have all
>>> completed, it dies out with behavior equivalent to a NOP.
>>>
>> Ok, I did job.setNumReduceTasks(0); I guess this will solve this part,
>> thanks!
>>
>>
>> --
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
>>
>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>
>
>


-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: number of mapper tasks

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Regarding your original question, you can use the min and max split settings to control the number of maps: http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html. See #setMinInputSplitSize and #setMaxInputSplitSize. Or use mapred.min.split.size directly.
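A sketch of that suggestion as job setup code, in the old (Hadoop 1.x era) API — the 16 MB figure and the job name are illustrative, not from this thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Cap the split size so FileInputFormat emits more, smaller splits
// (and hence more map tasks) for a splittable input file.
Configuration conf = new Configuration();
Job job = new Job(conf, "csv-import");
FileInputFormat.setMinInputSplitSize(job, 1L);
FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024); // 16 MB max
// Equivalent raw property (names vary across Hadoop versions):
conf.setLong("mapred.max.split.size", 16L * 1024 * 1024);
```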

W.r.t your custom inputformat, are you sure your job is using this InputFormat and not the default one?

HTH,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:

> Just to complement the last question, I have implemented the getSplits method in my input format:
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
> 
> However, it still doesn't create more than 2 map tasks. Is there something I could do about it to assure more map tasks are created?
> 
> Thanks
> Marcelo.
> 
> 
> 2013/1/28 Marcelo Elias Del Valle <mv...@gmail.com>
> Sorry for asking too many questions, but the answers are really helping.
> 
> 
> 2013/1/28 Harsh J <ha...@cloudera.com>
> This seems CPU-oriented. You probably want the NLineInputFormat? See
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html.
> This should let you spawn more maps as well, based on your N factor.
> 
> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
> Actually, I wrote my own InputFormat, to be able to process multiline CSVs: https://github.com/mvallebr/CSVInputFormat
> I could change it to read several lines at a time, but would this alone allow more tasks running in parallel?
>  
> Not really - "Slots" are capacities, rather than split factors
> themselves. You can have N slots always available, but your job has to
> supply as many map tasks (based on its input/needs/etc.) to use them
> up.
> 
> But how can I do that (supply map tasks) in my job? changing its code? hadoop config?
>  
> Unless your job sets the number of reducers to 0 manually, 1 default
> reducer is always run that waits to see if it has any outputs from
> maps. If it does not receive any outputs after maps have all
> completed, it dies out with behavior equivalent to a NOP.
> Ok, I did job.setNumReduceTasks(0); I guess this will solve this part, thanks!
> 
> 
> -- 
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
> 
> 
> 
> -- 
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr


Re: number of mapper tasks

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Regarding your original question, you can use the min and max split settings to control the number of maps: http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html . See #setMinInputSplitSize and #setMaxInputSplitSize. Or use mapred.min.split.size directly.

W.r.t your custom inputformat, are you sure you job is using this InputFormat and not the default one?

HTH,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:

> Just to complement the last question, I have implemented the getSplits method in my input format:
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
> 
> However, it still doesn't create more than 2 map tasks. Is there something I could do about it to assure more map tasks are created?
> 
> Thanks
> Marcelo.
> 
> 
> 2013/1/28 Marcelo Elias Del Valle <mv...@gmail.com>
> Sorry for asking too many questions, but the answers are really helping.
> 
> 
> 2013/1/28 Harsh J <ha...@cloudera.com>
> This seems CPU-oriented. You probably want the NLineInputFormat? See
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html.
> This should let you spawn more maps as needed, based on your N factor.
> 
> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
> Actually, I wrote my own InputFormat, to be able to process multiline CSVs: https://github.com/mvallebr/CSVInputFormat
> I could change it to read several lines at a time, but would this alone allow more tasks running in parallel?
>  
> Not really - "Slots" are capacities, rather than split factors
> themselves. You can have N slots always available, but your job has to
> supply as many map tasks (based on its input/needs/etc.) to use them
> up.
> 
> But how can I do that (supply map tasks) in my job? changing its code? hadoop config?
>  
> Unless your job sets the number of reducers to 0 manually, 1 default
> reducer is always run that waits to see if it has any outputs from
> maps. If it does not receive any outputs after maps have all
> completed, it dies out with behavior equivalent to a NOP.
> Ok, I did job.setNumReduceTasks(0); , guess this will solve this part, thanks!
> 
> 
> -- 
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
> 
> 
> 
> -- 
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr



Re: number of mapper tasks

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
Just to complement the last question, I have implemented the getSplits
method in my input format:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java

However, it still doesn't create more than 2 map tasks. Is there something
I could do about it to ensure more map tasks are created?
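For what it's worth, the core of an N-line getSplits() can be exercised outside Hadoop, as Vinod suggests. The following is a hypothetical, Hadoop-free sketch (plain Java, no InputSplit types): given the byte offset at which each line starts, it emits one (start, length) range per group of N lines, and each range would correspond to one map task.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Hadoop-free sketch of the arithmetic an N-line
// getSplits() has to perform: given the byte offset where each line
// starts, emit one (start, length) range per group of N lines. Each
// emitted range would become one InputSplit, and thus one map task.
public class GetSplitsSketch {

    static List<long[]> splitEveryNLines(long[] lineStarts, long fileLen, int n) {
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < lineStarts.length; i += n) {
            long start = lineStarts[i];
            // The split ends where the next group starts, or at EOF.
            long end = (i + n < lineStarts.length) ? lineStarts[i + n] : fileLen;
            splits.add(new long[] { start, end - start });
        }
        return splits;
    }

    public static void main(String[] args) {
        // Seven 10-byte lines, two lines per split -> 4 splits -> 4 mappers.
        long[] starts = { 0, 10, 20, 30, 40, 50, 60 };
        List<long[]> splits = splitEveryNLines(starts, 70, 2);
        System.out.println(splits.size()); // 4
    }
}
```

If the real getSplits() only ever returns one or two elements, the job cannot launch more than one or two mappers no matter how many slots are free.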

Thanks
Marcelo.


2013/1/28 Marcelo Elias Del Valle <mv...@gmail.com>

> Sorry for asking too many questions, but the answers are really helping.
>
>
> 2013/1/28 Harsh J <ha...@cloudera.com>
>
>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>> .
>> This should let you spawn more maps as needed, based on your N factor.
>>
>
> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
> Actually, I wrote my own InputFormat, to be able to process multiline
> CSVs: https://github.com/mvallebr/CSVInputFormat
> I could change it to read several lines at a time, but would this alone
> allow more tasks running in parallel?
>
>
>> Not really - "Slots" are capacities, rather than split factors
>> themselves. You can have N slots always available, but your job has to
>> supply as many map tasks (based on its input/needs/etc.) to use them
>> up.
>>
>
> But how can I do that (supply map tasks) in my job? changing its code?
> hadoop config?
>
>
>> Unless your job sets the number of reducers to 0 manually, 1 default
>> reducer is always run that waits to see if it has any outputs from
>> maps. If it does not receive any outputs after maps have all
>> completed, it dies out with behavior equivalent to a NOP.
>>
> Ok, I did job.setNumReduceTasks(0); I guess this will solve this part,
> thanks!
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: number of mapper tasks

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
Sorry for asking too many questions, but the answers are really helping.


2013/1/28 Harsh J <ha...@cloudera.com>

> This seems CPU-oriented. You probably want the NLineInputFormat? See
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
> .
> This should let you spawn more maps as needed, based on your N factor.
>

Indeed, CPU is my bottleneck. That's why I want more things in parallel.
Actually, I wrote my own InputFormat, to be able to process multiline CSVs:
https://github.com/mvallebr/CSVInputFormat
I could change it to read several lines at a time, but would this alone
allow more tasks running in parallel?


> Not really - "Slots" are capacities, rather than split factors
> themselves. You can have N slots always available, but your job has to
> supply as many map tasks (based on its input/needs/etc.) to use them
> up.
>

But how can I do that (supply map tasks) in my job? changing its code?
hadoop config?


> Unless your job sets the number of reducers to 0 manually, 1 default
> reducer is always run that waits to see if it has any outputs from
> maps. If it does not receive any outputs after maps have all
> completed, it dies out with behavior equivalent to a NOP.
>
Ok, I did job.setNumReduceTasks(0); I guess this will solve this part,
thanks!

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: number of mapper tasks

Posted by Harsh J <ha...@cloudera.com>.
Hi again,

(Inline)

On Mon, Jan 28, 2013 at 10:01 PM, Marcelo Elias Del Valle
<mv...@gmail.com> wrote:
> Hello Harsh,
>
>     First of all, thanks for the answer!
>
>
> 2013/1/28 Harsh J <ha...@cloudera.com>
>>
>> So depending on your implementation of the job here, you may or may
> not see it take effect. Hope this helps.
>
>
> Is there anything I can do in my job, my code or in my inputFormat so that
> hadoop would choose to run more mappers? My text file has 10 million lines
> and each mapper task processes one line at a time, very fast. I would like to
> have 40 threads in parallel or even more processing those lines.

This seems CPU-oriented. You probably want the NLineInputFormat? See
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html.
This should let you spawn more maps as needed, based on your N factor.
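To make the N factor concrete: the number of map tasks an NLineInputFormat-style format yields is just the ceiling of total lines over N. A minimal sketch (the 10-million-line figure is from this thread; the N values are illustrative, not recommendations):

```java
// Minimal sketch of the N factor: an NLineInputFormat-style format
// produces ceiling(totalLines / N) splits, i.e. that many map tasks.
public class NLineMath {

    static long numMaps(long totalLines, long linesPerSplit) {
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        System.out.println(numMaps(10_000_000L, 10_000_000L)); // 1 mapper (one split)
        System.out.println(numMaps(10_000_000L, 250_000L));    // 40 mappers
    }
}
```

So to keep 40 mappers busy on a 10-million-line file, N would need to be around 250,000 lines per split or lower.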

>>
>> >     When I run my job with just 1 instance, I see it only creates 1
>> > mapper.
>> > When I run my job with 5 instances (1 master and 4 cores), I can see
>> > only 2
>> > mapper slots are used and 6 stay open.
>>
>> Perhaps the job itself launched with 2 total map tasks? You can check
>> this on the JobTracker UI or whatever EMR offers as a job viewer.
>
>
> I am trying to figure this out. Here is what I have from EMR:
> http://mvalle.com/downloads/hadoop_monitor.png
> I will try to get their support to understand this, but I didn't understand
> what you said about the job being launched with 2 total map tasks... if I
> have 8 slots, shouldn't all of them be filled always?

Not really - "Slots" are capacities, rather than split factors
themselves. You can have N slots always available, but your job has to
supply as many map tasks (based on its input/needs/etc.) to use them
up.

>>
>>
>> This is a typical waiting reduce task log, what are you asking here
>> specifically?
>
>
> I have no reduce tasks. My map does the job without putting anything in the
> output. Is it happening because reduce tasks receive nothing as input?

Unless your job sets the number of reducers to 0 manually, 1 default
reducer is always run that waits to see if it has any outputs from
maps. If it does not receive any outputs after maps have all
completed, it dies out with behavior equivalent to a NOP.

Hope this helps!

--
Harsh J


Re: number of mapper tasks

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
Hello Harsh,

    First of all, thanks for the answer!


2013/1/28 Harsh J <ha...@cloudera.com>
>
> So depending on your implementation of the job here, you may or may
> not see it take effect. Hope this helps.
>

Is there anything I can do in my job, my code or in my inputFormat so that
hadoop would choose to run more mappers? My text file has 10 million lines
and each mapper task processes one line at a time, very fast. I would like to
have 40 threads in parallel or even more processing those lines.


> >     When I run my job with just 1 instance, I see it only creates 1
> mapper.
> > When I run my job with 5 instances (1 master and 4 cores), I can see
> only 2
> > mapper slots are used and 6 stay open.
>
> Perhaps the job itself launched with 2 total map tasks? You can check
> this on the JobTracker UI or whatever EMR offers as a job viewer.
>

I am trying to figure this out. Here is what I have from EMR:
http://mvalle.com/downloads/hadoop_monitor.png
I will try to get their support to understand this, but I didn't understand
what you said about the job being launched with 2 total map tasks... if I
have 8 slots, shouldn't all of them be filled always?


>
> This is a typical waiting reduce task log, what are you asking here
> specifically?
>

I have no reduce tasks. My map does the job without putting anything in the
output. Is it happening because reduce tasks receive nothing as input?

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: number of mapper tasks

Posted by Harsh J <ha...@cloudera.com>.
I'm unfamiliar with EMR myself (perhaps the question fits EMR's own
boards) but here's my take anyway:

On Mon, Jan 28, 2013 at 9:24 PM, Marcelo Elias Del Valle
<mv...@gmail.com> wrote:
> Hello,
>
>     I am using hadoop with TextInputFormat, a mapper and no reducers. I am
> running my jobs at Amazon EMR. When I run my job, I set both following
> options:
> -s,mapred.tasktracker.map.tasks.maximum=10
> -jobconf,mapred.map.tasks=10

The first property you've given, refers to a single tasktracker's
maximum concurrency. This means, if you have 4 TaskTrackers, with this
property at each of them, then you have 40 total concurrent map slots
available in all - perhaps more than you intended to configure?

Again, this may be EMR-specific and I may be wrong, since I haven't
seen anyone pass this via the CLI before; it is generally configured
at the service level.

The second property is closer to your problem. MR typically decides the
number of map tasks it requires for a job based on the input size. In the
stable API (the org.apache.hadoop.mapred one), mapred.map.tasks can be
passed the way you are passing it above, and the input format takes it as
a 'hint' when deciding how many map splits to carve out of the input, even
if the input isn't large enough to necessitate that many maps.
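To see why the hint works at all in the old API, the split-size math used
by FileInputFormat is roughly the following. This is a simplified sketch
(the real code also rounds at block boundaries and tracks block locations),
and the sizes below are purely illustrative:

```java
public class SplitMath {
    // Old-API FileInputFormat logic (simplified): goalSize is
    // totalSize / (the mapred.map.tasks hint), so a bigger hint
    // shrinks the split size, yielding more map tasks.
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits needed to cover the file (ceiling division).
    static long numSplits(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long block = 64L << 20;   // 64 MB HDFS block (illustrative)
        long total = 200L << 20;  // 200 MB of input (illustrative)
        long goal  = total / 10;  // hint of 10 maps -> 20 MB goal size
        long split = splitSize(goal, 1, block);
        System.out.println(numSplits(total, split)); // prints 10
    }
}
```

Note this only works on splittable input; a gzipped text file always comes
out as a single split regardless of the hint.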

However, the new API code accepts no such config-based hints (and such
logic changes need to be done in the programs' own code).
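In the new API, with millions of fast one-line records, one practical way
to force many small splits is NLineInputFormat. This is a hedged
job-configuration sketch, not tested on EMR; it assumes the
org.apache.hadoop.mapreduce.lib.input package and a map-only job like the
one in this thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LinesPerMapper {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "n-line-splits");
        job.setJarByClass(LinesPerMapper.class);

        // One split (and hence one map task) per 250,000 input lines;
        // 10M lines would then yield around 40 map tasks.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 250000);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // job.setMapperClass(...); // plug in your own mapper here
        job.setNumReduceTasks(0);   // map-only, as in this thread
        job.waitForCompletion(true);
    }
}
```

The trade-off is per-task startup overhead: too many tiny splits can make
the job slower overall, so the lines-per-split value needs tuning.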

So depending on your implementation of the job here, you may or may
not see it act in effect. Hope this helps.

>     When I run my job with just 1 instance, I see it only creates 1 mapper.
> When I run my job with 5 instances (1 master and 4 cores), I can see only 2
> mapper slots are used and 6 stay open.

Perhaps the job itself launched with 2 total map tasks? You can check
this on the JobTracker UI or whatever EMR offers as a job viewer.

>      I am trying to figure why I am not being able to run more mappers in
> parallel. When I see the logs, I find some messages like these:
>
> INFO org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Scheduled 0 outputs (0 slow hosts
> and 0 dup hosts)
> org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Need another 1 map output(s) where 0 is
> already in progress

This is a typical waiting reduce task log, what are you asking here
specifically?

--
Harsh J
