Posted to common-user@hadoop.apache.org by jeremy p <at...@gmail.com> on 2013/03/19 21:55:56 UTC

What happens when you have fewer input files than mapper slots?

Short version: let's say you have 20 nodes, and each node has 10 mapper
slots.  You start a job with 20 very small input files.  How is the work
distributed to the cluster?  Will it be even, with each node spawning one
mapper task?  Is there any way of predicting or controlling how the work
will be distributed?

Long version: My cluster is currently used for two different jobs.  The
cluster is currently optimized for Job A, so each node has a maximum of 18
mapper slots.  However, I also need to run Job B.  Job B is VERY
cpu-intensive, so we really only want one mapper to run on a node at any
given time.  I've done a bunch of research, and it doesn't seem like Hadoop
gives you any way to set the maximum number of mappers per node on a
per-job basis.  I'm at my wit's end here, and considering some rather
egregious workarounds.  If you can think of anything that can help me, I'd
very much appreciate it.

Thanks!

--Jeremy

Re: What happens when you have fewer input files than mapper slots?

Posted by Harsh J <ha...@cloudera.com>.
Correction to my previous post: I completely missed
https://issues.apache.org/jira/browse/MAPREDUCE-4520, which already
covers the MR config side in 2.0.3. My bad :)

On Wed, Mar 20, 2013 at 5:34 AM, Harsh J <ha...@cloudera.com> wrote:
> You can leverage YARN's CPU core scheduling feature for this purpose.
> It was added in the 2.0.3 release via
> https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
> need exactly. However, looking at that patch, it seems that
> param-config support for MR apps wasn't added, so it may
> require some work before you can easily leverage it in MRv2.
>
> On MRv1, you can achieve the per-node memory-supply-vs.-requirement
> hack Rahul suggested by using the CapacityScheduler instead, though it
> does not have CPU-core-based scheduling directly.
>
> On Wed, Mar 20, 2013 at 4:08 AM, jeremy p
> <at...@gmail.com> wrote:
>> The job we need to run executes some third-party code that utilizes multiple
>> cores.  The only way the job will get done in a timely fashion is if we give
>> it all the cores available on the machine.  This is not a task that can be
>> split up.
>>
>> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>>
>>
>> On Tue, Mar 19, 2013 at 3:15 PM, hari <ha...@gmail.com> wrote:
>>>
>>> This may not be what you were looking for, but I was curious when you
>>> mentioned that you would only want to run one map task because it is
>>> CPU-intensive. Map tasks are supposed to be CPU-intensive, aren't they?
>>> If the maximum number of map slots is 10, that suggests you have close
>>> to 10 cores available on each node. If you run only one map task, no
>>> matter how CPU-intensive it is, it can only max out one core, and the
>>> remaining 9 cores go underutilized. You could therefore still run 9
>>> more map tasks on that machine.
>>>
>>> Or maybe your node's core count is much less than 10, in which case you
>>> might be better off setting the mapper slots to a lower value anyway.
>>>
>>>
>>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <at...@gmail.com>
>>> wrote:
>>>>
>>>> Thank you for your help.
>>>>
>>>> We're using MRv1.  I've tried setting
>>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one
>>>> helped me at all.
>>>>
>>>> Per-job control is definitely what I need.  I need to be able to say,
>>>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>>>> node".  I have not found any way to do this.
>>>>
>>>> I will definitely look into schedulers.  Are there any examples you can
>>>> point me to where someone does what I'm needing to do?
>>>>
>>>> --Jeremy
>>>>
>>>>
>>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:
>>>>>
>>>>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>>>>
>>>>> For MRv2 (YARN), you can pretty much achieve this using:
>>>>>
>>>>> yarn.nodemanager.resource.memory-mb (system-wide setting)
>>>>> and
>>>>> mapreduce.map.memory.mb (job-level setting)
>>>>>
>>>>> E.g., if yarn.nodemanager.resource.memory-mb=100
>>>>> and mapreduce.map.memory.mb=40,
>>>>> a maximum of two mappers can run on a node at any time.
>>>>>
>>>>> For MRv1, the equivalent is to control mapper slots on each machine
>>>>> via mapred.tasktracker.map.tasks.maximum; of course, this does not
>>>>> give you 'per job' control over mappers.
>>>>>
>>>>> In both cases, you can additionally use a scheduler with 'pools /
>>>>> queues' capability to restrict overall use of grid resources. Do read
>>>>> the fair scheduler and capacity scheduler documentation.
>>>>>
>>>>>
>>>>> -Rahul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>>> <at...@gmail.com> wrote:
>>>>>>
>>>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>>>> mapper slots.  You start a job with 20 very small input files.  How is the
>>>>>> work distributed to the cluster?  Will it be even, with each node spawning
>>>>>> one mapper task?  Is there any way of predicting or controlling how the work
>>>>>> will be distributed?
>>>>>>
>>>>>> Long version : My cluster is currently used for two different jobs.
>>>>>> The cluster is currently optimized for Job A, so each node has a maximum of
>>>>>> 18 mapper slots.  However, I also need to run Job B.  Job B is VERY
>>>>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>>>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>>>>> gives you any way to set the maximum number of mappers per node on a per-job
>>>>>> basis.  I'm at my wit's end here, and considering some rather egregious
>>>>>> workarounds.  If you can think of anything that can help me, I'd very much
>>>>>> appreciate it.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --Jeremy
>>>>>
>>>>>
>>>>
>>>
>>
>
>
>
> --
> Harsh J



-- 
Harsh J
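
Rahul's YARN memory arithmetic from the quoted reply above can be
sketched as follows. This is a minimal illustration of the packing rule,
not Hadoop API; the function name and memory values are made up for the
example.

```python
# Sketch of Rahul's YARN arithmetic: the number of concurrent map
# containers per node is bounded by how many map-sized memory requests
# (mapreduce.map.memory.mb) fit into the NodeManager's advertised
# memory (yarn.nodemanager.resource.memory-mb).

def max_concurrent_mappers(node_memory_mb: int, map_memory_mb: int) -> int:
    """How many map containers fit on one node, by memory alone."""
    return node_memory_mb // map_memory_mb

# Rahul's example: 100 MB per node, 40 MB per map -> at most 2 mappers.
print(max_concurrent_mappers(100, 40))   # 2

# Jeremy's goal for Job B: request more than half a node's memory per
# map task, so only one mapper fits on a node at a time.
print(max_concurrent_mappers(100, 60))   # 1
```

In other words, a job can pin itself to one mapper per node simply by
declaring a per-map memory requirement larger than half the node's
capacity, without touching any other job's settings.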


Re: What happens when you have fewer input files than mapper slots?

Posted by Harsh J <ha...@cloudera.com>.
You can leverage YARN's CPU core scheduling feature for this purpose.
It was added in the 2.0.3 release via
https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
need exactly. However, looking at that patch, it seems that
param-config support for MR apps wasn't added, so it may
require some work before you can easily leverage it in MRv2.

On MRv1, you can achieve the per-node memory-supply-vs.-requirement
hack Rahul suggested by using the CapacityScheduler instead, though it
does not have CPU-core-based scheduling directly.

On Wed, Mar 20, 2013 at 4:08 AM, jeremy p
<at...@gmail.com> wrote:
> The job we need to run executes some third-party code that utilizes multiple
> cores.  The only way the job will get done in a timely fashion is if we give
> it all the cores available on the machine.  This is not a task that can be
> split up.
>
> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>
>
> On Tue, Mar 19, 2013 at 3:15 PM, hari <ha...@gmail.com> wrote:
>>
>> This may not be what you were looking for, but I was curious when you
>> mentioned that you would only want to run one map task because it is
>> CPU-intensive. Map tasks are supposed to be CPU-intensive, aren't they?
>> If the maximum number of map slots is 10, that suggests you have close
>> to 10 cores available on each node. If you run only one map task, no
>> matter how CPU-intensive it is, it can only max out one core, and the
>> remaining 9 cores go underutilized. You could therefore still run 9
>> more map tasks on that machine.
>>
>> Or maybe your node's core count is much less than 10, in which case you
>> might be better off setting the mapper slots to a lower value anyway.
>>
>>
>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <at...@gmail.com>
>> wrote:
>>>
>>> Thank you for your help.
>>>
>>> We're using MRv1.  I've tried setting
>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one
>>> helped me at all.
>>>
>>> Per-job control is definitely what I need.  I need to be able to say,
>>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>>> node".  I have not found any way to do this.
>>>
>>> I will definitely look into schedulers.  Are there any examples you can
>>> point me to where someone does what I'm needing to do?
>>>
>>> --Jeremy
>>>
>>>
>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:
>>>>
>>>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>>>
>>>> For MRv2 (YARN), you can pretty much achieve this using:
>>>>
>>>> yarn.nodemanager.resource.memory-mb (system-wide setting)
>>>> and
>>>> mapreduce.map.memory.mb (job-level setting)
>>>>
>>>> E.g., if yarn.nodemanager.resource.memory-mb=100
>>>> and mapreduce.map.memory.mb=40,
>>>> a maximum of two mappers can run on a node at any time.
>>>>
>>>> For MRv1, the equivalent is to control mapper slots on each machine
>>>> via mapred.tasktracker.map.tasks.maximum; of course, this does not
>>>> give you 'per job' control over mappers.
>>>>
>>>> In both cases, you can additionally use a scheduler with 'pools /
>>>> queues' capability to restrict overall use of grid resources. Do read
>>>> the fair scheduler and capacity scheduler documentation.
>>>>
>>>>
>>>> -Rahul
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>> <at...@gmail.com> wrote:
>>>>>
>>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>>> mapper slots.  You start a job with 20 very small input files.  How is the
>>>>> work distributed to the cluster?  Will it be even, with each node spawning
>>>>> one mapper task?  Is there any way of predicting or controlling how the work
>>>>> will be distributed?
>>>>>
>>>>> Long version : My cluster is currently used for two different jobs.
>>>>> The cluster is currently optimized for Job A, so each node has a maximum of
>>>>> 18 mapper slots.  However, I also need to run Job B.  Job B is VERY
>>>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>>>> gives you any way to set the maximum number of mappers per node on a per-job
>>>>> basis.  I'm at my wit's end here, and considering some rather egregious
>>>>> workarounds.  If you can think of anything that can help me, I'd very much
>>>>> appreciate it.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --Jeremy
>>>>
>>>>
>>>
>>
>



-- 
Harsh J
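
The MRv1 "memory supply vs. requirement" hack Harsh mentions works along
the same lines: under the CapacityScheduler, a job whose declared per-task
memory requirement (mapred.job.map.memory.mb) exceeds the cluster-wide
per-slot size (mapred.cluster.map.memory.mb) occupies multiple slots per
task, so fewer of its tasks run per node. The sketch below illustrates
that idea under a deliberately simplified packing model; the property
names appear as comments only, and the numbers are hypothetical.

```python
import math

# Simplified model of CapacityScheduler high-RAM-job behavior on MRv1:
# a map task asking for more memory than one slot provides is charged
# multiple slots, which caps how many of that job's maps run per node.

def concurrent_maps_per_node(slots_per_node: int,
                             slot_memory_mb: int,       # mapred.cluster.map.memory.mb
                             job_map_memory_mb: int) -> int:
    # Each task occupies enough whole slots to cover its memory request.
    slots_per_task = math.ceil(job_map_memory_mb / slot_memory_mb)
    return slots_per_node // slots_per_task

# Job A: one slot's worth of memory per map -> all 18 slots usable.
print(concurrent_maps_per_node(18, 2048, 2048))      # 18

# Job B: declare roughly the whole node's memory per map -> each task
# occupies all 18 slots, so only one mapper runs per node.
print(concurrent_maps_per_node(18, 2048, 18 * 2048)) # 1
```

This gives the per-job control Jeremy asked for without changing the
node-wide slot count: Job A keeps its 18 slots, while Job B's inflated
memory request squeezes it down to one mapper per node.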

Re: What happens when you have fewer input files than mapper slots?

Posted by Harsh J <ha...@cloudera.com>.
You can leverage YARN's CPU Core scheduling feature for this purpose.
It was added to the 2.0.3 release via
https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
need exactly. However, looking at that patch, it seems like
param-config support for MR apps wasn't added by this so it may
require some work before you can easily leverage it in MRv2.

On MRv1, you can achieve the per-node memory supply vs. requirement
hack Rahul suggested by using the CapacityScheduler instead. It does
not have CPU Core based scheduling directly though.

On Wed, Mar 20, 2013 at 4:08 AM, jeremy p
<at...@gmail.com> wrote:
> The job we need to run executes some third-party code that utilizes multiple
> cores.  The only way the job will get done in a timely fashion is if we give
> it all the cores available on the machine.  This is not a task that can be
> split up.
>
> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>
>
> On Tue, Mar 19, 2013 at 3:15 PM, hari <ha...@gmail.com> wrote:
>>
>> This may not be what you were looking for, but I was just curious when you
>> mentioned that
>>  you would only want to run only one map task because it was cpu
>> intensive. Well, the map
>> tasks are supposed to be cpu intensive, isn't it. If the maximum map slots
>> are 10 then that
>> would mean you have close to 10 cores available in each node. So, if you
>> run only one
>> map task, no matter how much cpu intensive it is, it will only be able to
>> max out one core, so the
>> rest of the  9 cores would go under utilized. So, you can still run 9 more
>> map tasks on that machine.
>>
>> Or, maybe your node's core count is way less than 10, in which case you
>> might be better off setting
>> the mapper slots to a lower value anyway.
>>
>>
>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <at...@gmail.com>
>> wrote:
>>>
>>> Thank you for your help.
>>>
>>> We're using MRv1.  I've tried setting
>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one
>>> helped me at all.
>>>
>>> Per-job control is definitely what I need.  I need to be able to say,
>>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>>> node".  I have not found any way to do this.
>>>
>>> I will definitely look into schedulers.  Are there any examples you can
>>> point me to where someone does what I'm needing to do?
>>>
>>> --Jeremy
>>>
>>>
>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:
>>>>
>>>> Which version of hadoop are you using ? MRV1 or MRV2 (yarn) ??
>>>>
>>>> For MRv2 (yarn): you can pretty much achieve this using:
>>>>
>>>> yarn.nodemanager.resource.memory-mb (system wide setting)
>>>> and
>>>> mapreduce.map.memory.mb  (job level setting)
>>>>
>>>> e.g. if yarn.nodemanager.resource.memory-mb=100
>>>> and mapreduce.map.memory.mb= 40
>>>> a maximum of two mapper can run on a node at any time.
>>>>
>>>> For MRv1, The equivalent way will be to control mapper slots on each
>>>> machine:
>>>> mapred.tasktracker.map.tasks.maximum,  of course this does not give you
>>>> 'per job' control. on mappers.
>>>>
>>>> In addition in both cases, you can use a scheduler with 'pools / queues'
>>>> capability in addition to restrict the overall use of grid resource. Do read
>>>> fair scheduler and capacity scheduler documentation...
>>>>
>>>>
>>>> -Rahul
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>> <at...@gmail.com> wrote:
>>>>>
>>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>>> mapper slots.  You start a job with 20 very small input files.  How is the
>>>>> work distributed to the cluster?  Will it be even, with each node spawning
>>>>> one mapper task?  Is there any way of predicting or controlling how the work
>>>>> will be distributed?
>>>>>
>>>>> Long version : My cluster is currently used for two different jobs.
>>>>> The cluster is currently optimized for Job A, so each node has a maximum of
>>>>> 18 mapper slots.  However, I also need to run Job B.  Job B is VERY
>>>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>>>> gives you any way to set the maximum number of mappers per node on a per-job
>>>>> basis.  I'm at my wit's end here, and considering some rather egregious
>>>>> workarounds.  If you can think of anything that can help me, I'd very much
>>>>> appreciate it.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --Jeremy
>>>>
>>>>
>>>
>>
>



-- 
Harsh J

Re: What happens when you have fewer input files than mapper slots?

Posted by Harsh J <ha...@cloudera.com>.
You can leverage YARN's CPU Core scheduling feature for this purpose.
It was added to the 2.0.3 release via
https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
need exactly. However, looking at that patch, it seems like
param-config support for MR apps wasn't added by this so it may
require some work before you can easily leverage it in MRv2.

On MRv1, you can achieve the per-node memory supply vs. requirement
hack Rahul suggested by using the CapacityScheduler instead. It does
not have CPU Core based scheduling directly though.

On Wed, Mar 20, 2013 at 4:08 AM, jeremy p
<at...@gmail.com> wrote:
> The job we need to run executes some third-party code that utilizes multiple
> cores.  The only way the job will get done in a timely fashion is if we give
> it all the cores available on the machine.  This is not a task that can be
> split up.
>
> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>
>
> On Tue, Mar 19, 2013 at 3:15 PM, hari <ha...@gmail.com> wrote:
>>
>> This may not be what you were looking for, but I was just curious when you
>> mentioned that you would only want to run one map task because it is cpu
>> intensive. Well, map tasks are supposed to be cpu intensive, aren't they?
>> If the maximum map slots are 10, that would mean you have close to 10 cores
>> available on each node. So if you run only one map task, no matter how cpu
>> intensive it is, it will only be able to max out one core, and the rest of
>> the 9 cores would go underutilized. So you can still run 9 more map tasks
>> on that machine.
>>
>> Or maybe your node's core count is way less than 10, in which case you
>> might be better off setting the mapper slots to a lower value anyway.
>>
>>
>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <at...@gmail.com>
>> wrote:
>>>
>>> Thank you for your help.
>>>
>>> We're using MRv1.  I've tried setting
>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one
>>> helped me at all.
>>>
>>> Per-job control is definitely what I need.  I need to be able to say,
>>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>>> node".  I have not found any way to do this.
>>>
>>> I will definitely look into schedulers.  Are there any examples you can
>>> point me to where someone does what I'm needing to do?
>>>
>>> --Jeremy
>>>
>>>
>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:
>>>>
>>>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>>>
>>>> For MRv2 (yarn): you can pretty much achieve this using:
>>>>
>>>> yarn.nodemanager.resource.memory-mb (system wide setting)
>>>> and
>>>> mapreduce.map.memory.mb  (job level setting)
>>>>
>>>> e.g. if yarn.nodemanager.resource.memory-mb=100
>>>> and mapreduce.map.memory.mb=40,
>>>> a maximum of two mappers can run on a node at any time.
>>>>
>>>> For MRv1, the equivalent way is to control mapper slots on each
>>>> machine via
>>>> mapred.tasktracker.map.tasks.maximum; of course, this does not give you
>>>> 'per job' control on mappers.
>>>>
>>>> In both cases, you can also use a scheduler with 'pools / queues'
>>>> capability to restrict the overall use of grid resources. Do read the
>>>> fair scheduler and capacity scheduler documentation...
>>>>
>>>>
>>>> -Rahul
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>> <at...@gmail.com> wrote:
>>>>>
>>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>>> mapper slots.  You start a job with 20 very small input files.  How is the
>>>>> work distributed to the cluster?  Will it be even, with each node spawning
>>>>> one mapper task?  Is there any way of predicting or controlling how the work
>>>>> will be distributed?
>>>>>
>>>>> Long version : My cluster is currently used for two different jobs.
>>>>> The cluster is currently optimized for Job A, so each node has a maximum of
>>>>> 18 mapper slots.  However, I also need to run Job B.  Job B is VERY
>>>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>>>> gives you any way to set the maximum number of mappers per node on a per-job
>>>>> basis.  I'm at my wit's end here, and considering some rather egregious
>>>>> workarounds.  If you can think of anything that can help me, I'd very much
>>>>> appreciate it.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --Jeremy
>>>>
>>>>
>>>
>>
>



-- 
Harsh J
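
For reference, the MRv2 memory settings Rahul describes in the quoted thread above might look like the following sketch. The property names are the standard YARN/MRv2 keys; the 16 GB figure is an assumed node size, purely illustrative:

```xml
<!-- yarn-site.xml (system-wide): memory each NodeManager offers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>

<!-- submitted with Job B: requesting the whole node means only one
     map container fits on a node at a time -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>16384</value>
</property>
```

Job A would instead request a small slice (say, mapreduce.map.memory.mb=1024) and could run many concurrent mappers per node.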

Re: What happens when you have fewer input files than mapper slots?

Posted by jeremy p <at...@gmail.com>.
The job we need to run executes some third-party code that utilizes
multiple cores.  The only way the job will get done in a timely fashion is
if we give it all the cores available on the machine.  This is not a task
that can be split up.

Yes, I know, it's not ideal, but this is the situation I have to deal with.

On Tue, Mar 19, 2013 at 3:15 PM, hari <ha...@gmail.com> wrote:

> This may not be what you were looking for, but I was just curious when you
> mentioned that you would only want to run one map task because it is cpu
> intensive. Well, map tasks are supposed to be cpu intensive, aren't they?
> If the maximum map slots are 10, that would mean you have close to 10 cores
> available on each node. So if you run only one map task, no matter how cpu
> intensive it is, it will only be able to max out one core, and the rest of
> the 9 cores would go underutilized. So you can still run 9 more map tasks
> on that machine.
>
> Or maybe your node's core count is way less than 10, in which case you
> might be better off setting the mapper slots to a lower value anyway.
>
>
> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <at...@gmail.com> wrote:
>
>> Thank you for your help.
>>
>> We're using MRv1.  I've tried
>> setting mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and
>> neither one helped me at all.
>>
>> Per-job control is definitely what I need.  I need to be able to say,
>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>> node".  I have not found any way to do this.
>>
>> I will definitely look into schedulers.  Are there any examples you can
>> point me to where someone does what I'm needing to do?
>>
>> --Jeremy
>>
>>
>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:
>>
>>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>>
>>> For MRv2 (yarn): you can pretty much achieve this using:
>>>
>>> yarn.nodemanager.resource.memory-mb (system wide setting)
>>> and
>>> mapreduce.map.memory.mb  (job level setting)
>>>
>>> e.g. if yarn.nodemanager.resource.memory-mb=100
>>> and mapreduce.map.memory.mb=40,
>>> a maximum of two mappers can run on a node at any time.
>>>
>>> For MRv1, the equivalent way is to control mapper slots on each
>>> machine via
>>> mapred.tasktracker.map.tasks.maximum; of course, this does not give you
>>> 'per job' control on mappers.
>>>
>>> In both cases, you can also use a scheduler with 'pools / queues'
>>> capability to restrict the overall use of grid resources. Do read the
>>> fair scheduler and capacity scheduler documentation...
>>>
>>>
>>> -Rahul
>>>
>>>
>>>
>>>
>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <
>>> athomewithagroovebox@gmail.com> wrote:
>>>
>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>> mapper slots.  You start a job with 20 very small input files.  How is the
>>>> work distributed to the cluster?  Will it be even, with each node spawning
>>>> one mapper task?  Is there any way of predicting or controlling how the
>>>> work will be distributed?
>>>>
>>>> Long version : My cluster is currently used for two different jobs.
>>>>  The cluster is currently optimized for Job A, so each node has a maximum
>>>> of 18 mapper slots.  However, I also need to run Job B.  Job B is VERY
>>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>>> gives you any way to set the maximum number of mappers per node on a
>>>> per-job basis.  I'm at my wit's end here, and considering some rather
>>>> egregious workarounds.  If you can think of anything that can help me, I'd
>>>> very much appreciate it.
>>>>
>>>> Thanks!
>>>>
>>>> --Jeremy
>>>>
>>>
>>>
>>
>


Re: What happens when you have fewer input files than mapper slots?

Posted by hari <ha...@gmail.com>.
This may not be what you were looking for, but I was just curious when you
mentioned that you would only want to run one map task because it is cpu
intensive. Well, map tasks are supposed to be cpu intensive, aren't they?
If the maximum map slots are 10, that would mean you have close to 10 cores
available on each node. So if you run only one map task, no matter how cpu
intensive it is, it will only be able to max out one core, and the rest of
the 9 cores would go underutilized. So you can still run 9 more map tasks
on that machine.

Or maybe your node's core count is way less than 10, in which case you
might be better off setting the mapper slots to a lower value anyway.


On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <at...@gmail.com> wrote:

> Thank you for your help.
>
> We're using MRv1.  I've tried setting mapred.tasktracker.map.tasks.maximum
> and mapred.map.tasks, and neither one helped me at all.
>
> Per-job control is definitely what I need.  I need to be able to say, "For
> Job A, only use one mapper per node, but for Job B, use 16 mappers per
> node".  I have not found any way to do this.
>
> I will definitely look into schedulers.  Are there any examples you can
> point me to where someone does what I'm needing to do?
>
> --Jeremy
>
>
> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:
>
>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>
>> For MRv2 (yarn): you can pretty much achieve this using:
>>
>> yarn.nodemanager.resource.memory-mb (system wide setting)
>> and
>> mapreduce.map.memory.mb  (job level setting)
>>
>> e.g. if yarn.nodemanager.resource.memory-mb=100
>> and mapreduce.map.memory.mb=40,
>> a maximum of two mappers can run on a node at any time.
>>
>> For MRv1, the equivalent way is to control mapper slots on each
>> machine via
>> mapred.tasktracker.map.tasks.maximum; of course, this does not give you
>> 'per job' control on mappers.
>>
>> In both cases, you can also use a scheduler with 'pools / queues'
>> capability to restrict the overall use of grid resources. Do read the
>> fair scheduler and capacity scheduler documentation...
>>
>>
>> -Rahul
>>
>>
>>
>>
>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <athomewithagroovebox@gmail.com
>> > wrote:
>>
>>> Short version : let's say you have 20 nodes, and each node has 10 mapper
>>> slots.  You start a job with 20 very small input files.  How is the work
>>> distributed to the cluster?  Will it be even, with each node spawning one
>>> mapper task?  Is there any way of predicting or controlling how the work
>>> will be distributed?
>>>
>>> Long version : My cluster is currently used for two different jobs.  The
>>> cluster is currently optimized for Job A, so each node has a maximum of 18
>>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>> gives you any way to set the maximum number of mappers per node on a
>>> per-job basis.  I'm at my wit's end here, and considering some rather
>>> egregious workarounds.  If you can think of anything that can help me, I'd
>>> very much appreciate it.
>>>
>>> Thanks!
>>>
>>> --Jeremy
>>>
>>
>>
>

Re: What happens when you have fewer input files than mapper slots?

Posted by hari <ha...@gmail.com>.
This may not be what you were looking for, but I was just curious when you
mentioned that
 you would only want to run only one map task because it was cpu intensive.
Well, the map
tasks are supposed to be cpu intensive, isn't it. If the maximum map slots
are 10 then that
would mean you have close to 10 cores available in each node. So, if you
run only one
map task, no matter how much cpu intensive it is, it will only be able to
max out one core, so the
rest of the  9 cores would go under utilized. So, you can still run 9 more
map tasks on that machine.

Or, maybe your node's core count is way less than 10, in which case you
might be better off setting
the mapper slots to a lower value anyway.


On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <at...@gmail.com>wrote:

> Thank you for your help.
>
> We're using MRv1.  I've tried setting mapred.tasktracker.map.tasks.maximum
> and mapred.map.tasks, and neither one helped me at all.
>
> Per-job control is definitely what I need.  I need to be able to say, "For
> Job A, only use one mapper per node, but for Job B, use 16 mappers per
> node".  I have not found any way to do this.
>
> I will definitely look into schedulers.  Are there any examples you can
> point me to where someone does what I'm needing to do?
>
> --Jeremy
>
>
> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:
>
>> Which version of hadoop are you using ? MRV1 or MRV2 (yarn) ??
>>
>> For MRv2 (yarn): you can pretty much achieve this using:
>>
>> yarn.nodemanager.resource.memory-mb (system wide setting)
>> and
>> mapreduce.map.memory.mb  (job level setting)
>>
>> e.g. if yarn.nodemanager.resource.memory-mb=100
>> and mapreduce.map.memory.mb= 40
>> a maximum of two mapper can run on a node at any time.
>>
>> For MRv1, The equivalent way will be to control mapper slots on each
>> machine:
>> mapred.tasktracker.map.tasks.maximum,  of course this does not give you
>> 'per job' control. on mappers.
>>
>> In addition in both cases, you can use a scheduler with 'pools / queues'
>> capability in addition to restrict the overall use of grid resource. Do
>> read fair scheduler and capacity scheduler documentation...
>>
>>
>> -Rahul
>>
>>
>>
>>
>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <athomewithagroovebox@gmail.com
>> > wrote:
>>
>>> Short version : let's say you have 20 nodes, and each node has 10 mapper
>>> slots.  You start a job with 20 very small input files.  How is the work
>>> distributed to the cluster?  Will it be even, with each node spawning one
>>> mapper task?  Is there any way of predicting or controlling how the work
>>> will be distributed?
>>>
>>> Long version : My cluster is currently used for two different jobs.  The
>>> cluster is currently optimized for Job A, so each node has a maximum of 18
>>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>> gives you any way to set the maximum number of mappers per node on a
>>> per-job basis.  I'm at my wit's end here, and considering some rather
>>> egregious workarounds.  If you can think of anything that can help me, I'd
>>> very much appreciate it.
>>>
>>> Thanks!
>>>
>>> --Jeremy
>>>
>>
>>
>

Re: What happens when you have fewer input files than mapper slots?

Posted by hari <ha...@gmail.com>.
This may not be what you were looking for, but I was just curious when you
mentioned that
 you would only want to run only one map task because it was cpu intensive.
Well, the map
tasks are supposed to be cpu intensive, isn't it. If the maximum map slots
are 10 then that
would mean you have close to 10 cores available in each node. So, if you
run only one
map task, no matter how much cpu intensive it is, it will only be able to
max out one core, so the
rest of the  9 cores would go under utilized. So, you can still run 9 more
map tasks on that machine.

Or, maybe your node's core count is way less than 10, in which case you
might be better off setting
the mapper slots to a lower value anyway.


On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <at...@gmail.com>wrote:

> Thank you for your help.
>
> We're using MRv1.  I've tried setting mapred.tasktracker.map.tasks.maximum
> and mapred.map.tasks, and neither one helped me at all.
>
> Per-job control is definitely what I need.  I need to be able to say, "For
> Job A, only use one mapper per node, but for Job B, use 16 mappers per
> node".  I have not found any way to do this.
>
> I will definitely look into schedulers.  Are there any examples you can
> point me to where someone does what I'm needing to do?
>
> --Jeremy
>
>
> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:
>
>> Which version of hadoop are you using ? MRV1 or MRV2 (yarn) ??
>>
>> For MRv2 (yarn): you can pretty much achieve this using:
>>
>> yarn.nodemanager.resource.memory-mb (system wide setting)
>> and
>> mapreduce.map.memory.mb  (job level setting)
>>
>> e.g. if yarn.nodemanager.resource.memory-mb=100
>> and mapreduce.map.memory.mb= 40
>> a maximum of two mapper can run on a node at any time.
>>
>> For MRv1, The equivalent way will be to control mapper slots on each
>> machine:
>> mapred.tasktracker.map.tasks.maximum,  of course this does not give you
>> 'per job' control. on mappers.
>>
>> In addition in both cases, you can use a scheduler with 'pools / queues'
>> capability in addition to restrict the overall use of grid resource. Do
>> read fair scheduler and capacity scheduler documentation...
>>
>>
>> -Rahul
>>
>>
>>
>>
>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <athomewithagroovebox@gmail.com
>> > wrote:
>>
>>> Short version : let's say you have 20 nodes, and each node has 10 mapper
>>> slots.  You start a job with 20 very small input files.  How is the work
>>> distributed to the cluster?  Will it be even, with each node spawning one
>>> mapper task?  Is there any way of predicting or controlling how the work
>>> will be distributed?
>>>
>>> Long version : My cluster is currently used for two different jobs.  The
>>> cluster is currently optimized for Job A, so each node has a maximum of 18
>>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>> gives you any way to set the maximum number of mappers per node on a
>>> per-job basis.  I'm at my wit's end here, and considering some rather
>>> egregious workarounds.  If you can think of anything that can help me, I'd
>>> very much appreciate it.
>>>
>>> Thanks!
>>>
>>> --Jeremy
>>>
>>
>>
>

Re: What happens when you have fewer input files than mapper slots?

Posted by jeremy p <at...@gmail.com>.
Thank you for your help.

We're using MRv1.  I've tried setting mapred.tasktracker.map.tasks.maximum
and mapred.map.tasks, and neither one helped me at all.

Per-job control is definitely what I need.  I need to be able to say, "For
Job A, only use one mapper per node, but for Job B, use 16 mappers per
node".  I have not found any way to do this.

I will definitely look into schedulers.  Are there any examples you can
point me to where someone does what I need to do?

--Jeremy

On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rj...@gmail.com> wrote:

> Which version of hadoop are you using ? MRV1 or MRV2 (yarn) ??
>
> For MRv2 (yarn): you can pretty much achieve this using:
>
> yarn.nodemanager.resource.memory-mb (system wide setting)
> and
> mapreduce.map.memory.mb  (job level setting)
>
> e.g. if yarn.nodemanager.resource.memory-mb=100
> and mapreduce.map.memory.mb= 40
> a maximum of two mapper can run on a node at any time.
>
> For MRv1, The equivalent way will be to control mapper slots on each
> machine:
> mapred.tasktracker.map.tasks.maximum,  of course this does not give you
> 'per job' control. on mappers.
>
> In addition in both cases, you can use a scheduler with 'pools / queues'
> capability in addition to restrict the overall use of grid resource. Do
> read fair scheduler and capacity scheduler documentation...
>
>
> -Rahul
>
>
>
>
> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <at...@gmail.com>wrote:
>
>> Short version : let's say you have 20 nodes, and each node has 10 mapper
>> slots.  You start a job with 20 very small input files.  How is the work
>> distributed to the cluster?  Will it be even, with each node spawning one
>> mapper task?  Is there any way of predicting or controlling how the work
>> will be distributed?
>>
>> Long version : My cluster is currently used for two different jobs.  The
>> cluster is currently optimized for Job A, so each node has a maximum of 18
>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>> cpu-intensive, so we really only want one mapper to run on a node at any
>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>> gives you any way to set the maximum number of mappers per node on a
>> per-job basis.  I'm at my wit's end here, and considering some rather
>> egregious workarounds.  If you can think of anything that can help me, I'd
>> very much appreciate it.
>>
>> Thanks!
>>
>> --Jeremy
>>
>
>

Re: What happens when you have fewer input files than mapper slots?

Posted by Rahul Jain <rj...@gmail.com>.
Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?

For MRv2 (YARN), you can pretty much achieve this using:

yarn.nodemanager.resource.memory-mb (system wide setting)
and
mapreduce.map.memory.mb  (job level setting)

e.g., if yarn.nodemanager.resource.memory-mb=100
and mapreduce.map.memory.mb=40,
a maximum of two mappers can run on a node at any time.

For MRv1, the equivalent approach is to control the mapper slots on each
machine via mapred.tasktracker.map.tasks.maximum; of course, this does
not give you per-job control over mappers.

In both cases, you can additionally use a scheduler with 'pools / queues'
capability to restrict the overall use of grid resources. Do read the
fair scheduler and capacity scheduler documentation...
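The MRv2 memory math above reduces to integer division of the
NodeManager's memory allotment by the per-job map container request.
A minimal sketch using the example numbers (it ignores vcores, the
ApplicationMaster's own container, and minimum-allocation rounding):

```python
# Sketch of the YARN arithmetic above: the number of concurrent map
# containers a node can host is floor(node allotment / per-map request).

def max_concurrent_mappers(node_memory_mb, map_memory_mb):
    # node_memory_mb corresponds to yarn.nodemanager.resource.memory-mb
    # map_memory_mb  corresponds to mapreduce.map.memory.mb
    return node_memory_mb // map_memory_mb

print(max_concurrent_mappers(100, 40))   # -> 2 (the example above)
print(max_concurrent_mappers(100, 100))  # -> 1: "one mapper per node"
```

So a job that requests containers as large as the node's allotment gets,
in effect, one mapper per node for that job only.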


-Rahul




On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <at...@gmail.com>wrote:

> Short version : let's say you have 20 nodes, and each node has 10 mapper
> slots.  You start a job with 20 very small input files.  How is the work
> distributed to the cluster?  Will it be even, with each node spawning one
> mapper task?  Is there any way of predicting or controlling how the work
> will be distributed?
>
> Long version : My cluster is currently used for two different jobs.  The
> cluster is currently optimized for Job A, so each node has a maximum of 18
> mapper slots.  However, I also need to run Job B.  Job B is VERY
> cpu-intensive, so we really only want one mapper to run on a node at any
> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
> gives you any way to set the maximum number of mappers per node on a
> per-job basis.  I'm at my wit's end here, and considering some rather
> egregious workarounds.  If you can think of anything that can help me, I'd
> very much appreciate it.
>
> Thanks!
>
> --Jeremy
>

Re: Unsubscribe

Posted by Mohammad Tariq <do...@gmail.com>.
You need to go here :
user-unsubscribe@hadoop.apache.org

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Mar 20, 2013 at 3:54 AM, John Conwell <jo...@iamjohn.me> wrote:

> No!
>
>
> On Tue, Mar 19, 2013 at 3:23 PM, Bruce Perttunen <bruceperttunen@gmail.com
> > wrote:
>
>> Unsubscribe
>>
>>
>
>
> --
>
> Thanks,
> John C
>

Re: Unsubscribe

Posted by John Conwell <jo...@iamjohn.me>.
No!


On Tue, Mar 19, 2013 at 3:23 PM, Bruce Perttunen
<br...@gmail.com>wrote:

> Unsubscribe
>
>


-- 

Thanks,
John C

Unsubscribe

Posted by Bruce Perttunen <br...@gmail.com>.
Unsubscribe


Re: What happens when you have fewer input files than mapper slots?

Posted by jeremy p <at...@gmail.com>.
Is there a way to force an even spread of data?

On Fri, Mar 22, 2013 at 2:14 PM, jeremy p <at...@gmail.com>wrote:

> Apologies -- I don't understand this advice : "If the evenness is the goal
> you can also write your own input format that return empty locations for
> each split and read the small files in map task directly."  How would
> manually reading the files into the map task help me?  Hadoop would still
> spawn multiple mappers per machine, which is what I'm trying to avoid.  I'm
> trying to get one mapper per machine for this job.
>
> --Jeremy
>
>
> On Thu, Mar 21, 2013 at 11:44 AM, Luke Lu <ll...@apache.org> wrote:
>
>>
>> Short version : let's say you have 20 nodes, and each node has 10 mapper
>>> slots.  You start a job with 20 very small input files.  How is the work
>>> distributed to the cluster?  Will it be even, with each node spawning one
>>> mapper task?  Is there any way of predicting or controlling how the work
>>> will be distributed?
>>
>>
>> You're right in expecting that the tasks of the small job will likely be
>> evenly distributed among 20 nodes, if the 20 files are evenly distributed
>> among the nodes and that there are free slots on every node.
>>
>>
>>> Long version : My cluster is currently used for two different jobs.  The
>>> cluster is currently optimized for Job A, so each node has a maximum of 18
>>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>> gives you any way to set the maximum number of mappers per node on a
>>> per-job basis.  I'm at my wit's end here, and considering some rather
>>> egregious workarounds.  If you can think of anything that can help me, I'd
>>> very much appreciate it.
>>>
>>
>> Are you seeing Job B tasks are not being evenly distributed to each node?
>> You can check the locations of the files by hadoop fsck. If the evenness is
>> the goal you can also write your own input format that return empty
>> locations for each split and read the small files in map task directly. If
>> you're using Hadoop 1.0.x and fair scheduler, you might need to set
>> mapred.fairscheduler.assignmultiple to false in mapred-site.xml (JT restart
>> required) to work around a bug in fairscheduler (MAPREDUCE-2905) that
>> causes the tasks be assigned unevenly. The bug is fixed in Hadoop 1.1+.
>>
>> __Luke
>>
>
>

Re: What happens when you have fewer input files than mapper slots?

Posted by jeremy p <at...@gmail.com>.
Apologies -- I don't understand this advice : "If the evenness is the goal
you can also write your own input format that return empty locations for
each split and read the small files in map task directly."  How would
manually reading the files into the map task help me?  Hadoop would still
spawn multiple mappers per machine, which is what I'm trying to avoid.  I'm
trying to get one mapper per machine for this job.

--Jeremy

On Thu, Mar 21, 2013 at 11:44 AM, Luke Lu <ll...@apache.org> wrote:

>
> Short version : let's say you have 20 nodes, and each node has 10 mapper
>> slots.  You start a job with 20 very small input files.  How is the work
>> distributed to the cluster?  Will it be even, with each node spawning one
>> mapper task?  Is there any way of predicting or controlling how the work
>> will be distributed?
>
>
> You're right in expecting that the tasks of the small job will likely be
> evenly distributed among 20 nodes, if the 20 files are evenly distributed
> among the nodes and that there are free slots on every node.
>
>
>> Long version : My cluster is currently used for two different jobs.  The
>> cluster is currently optimized for Job A, so each node has a maximum of 18
>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>> cpu-intensive, so we really only want one mapper to run on a node at any
>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>> gives you any way to set the maximum number of mappers per node on a
>> per-job basis.  I'm at my wit's end here, and considering some rather
>> egregious workarounds.  If you can think of anything that can help me, I'd
>> very much appreciate it.
>>
>
> Are you seeing that Job B tasks are not being evenly distributed to each
> node? You can check the locations of the files with hadoop fsck. If the
> evenness is the goal, you can also write your own input format that
> returns empty locations for each split and read the small files in the map
> task directly. If you're using Hadoop 1.0.x and the fair scheduler, you
> might need to set mapred.fairscheduler.assignmultiple to false in
> mapred-site.xml (JT restart required) to work around a bug in the fair
> scheduler (MAPREDUCE-2905) that causes tasks to be assigned unevenly. The
> bug is fixed in Hadoop 1.1+.
>
> __Luke
>

Re: What happens when you have fewer input files than mapper slots?

Posted by Luke Lu <ll...@apache.org>.
> Short version : let's say you have 20 nodes, and each node has 10 mapper
> slots.  You start a job with 20 very small input files.  How is the work
> distributed to the cluster?  Will it be even, with each node spawning one
> mapper task?  Is there any way of predicting or controlling how the work
> will be distributed?


You're right in expecting that the tasks of the small job will likely be
evenly distributed among 20 nodes, if the 20 files are evenly distributed
among the nodes and that there are free slots on every node.


> Long version : My cluster is currently used for two different jobs.  The
> cluster is currently optimized for Job A, so each node has a maximum of 18
> mapper slots.  However, I also need to run Job B.  Job B is VERY
> cpu-intensive, so we really only want one mapper to run on a node at any
> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
> gives you any way to set the maximum number of mappers per node on a
> per-job basis.  I'm at my wit's end here, and considering some rather
> egregious workarounds.  If you can think of anything that can help me, I'd
> very much appreciate it.
>

Are you seeing that Job B tasks are not being evenly distributed to each
node? You can check the locations of the files with hadoop fsck. If the
evenness is the goal, you can also write your own input format that returns
empty locations for each split and read the small files in the map task
directly. If you're using Hadoop 1.0.x and the fair scheduler, you might
need to set mapred.fairscheduler.assignmultiple to false in mapred-site.xml
(JT restart required) to work around a bug in the fair scheduler
(MAPREDUCE-2905) that causes tasks to be assigned unevenly. The bug is
fixed in Hadoop 1.1+.
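Roughly, the input-format idea above might look like the following sketch
(untested; the class name is illustrative, and it assumes the
org.apache.hadoop.mapreduce API):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// An input format whose splits report no preferred hosts, so the
// scheduler has no data-locality reason to pile several tasks onto
// the one node that happens to hold the small files' blocks.
public class NoLocalityTextInputFormat extends TextInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (InputSplit s : super.getSplits(job)) {
      FileSplit f = (FileSplit) s;
      // Re-create each split with an empty host array instead of the
      // block locations HDFS reported; the map task still reads the
      // file contents from HDFS as usual.
      splits.add(new FileSplit(f.getPath(), f.getStart(),
                               f.getLength(), new String[0]));
    }
    return splits;
  }
}
```

Note this only removes the locality hint; it does not by itself cap the
number of concurrent mappers per node.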

__Luke

Re: What happens when you have fewer input files than mapper slots?

Posted by Rahul Jain <rj...@gmail.com>.
Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?

For MRv2 (YARN), you can pretty much achieve this using:

yarn.nodemanager.resource.memory-mb (system-wide setting)
and
mapreduce.map.memory.mb (job-level setting)

e.g. if yarn.nodemanager.resource.memory-mb=100
and mapreduce.map.memory.mb=40,
a maximum of two mappers can run on a node at any time.

For MRv1, the equivalent is to control the mapper slots on each machine
via mapred.tasktracker.map.tasks.maximum; of course, this does not give
you per-job control over mappers.

In both cases, you can also use a scheduler with 'pools / queues'
capability to restrict the overall use of grid resources. Do read the fair
scheduler and capacity scheduler documentation...
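For reference, a sketch of where these knobs live (the values are
illustrative, not recommendations):

```xml
<!-- yarn-site.xml, per node: total memory the NodeManager may hand out -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>

<!-- Per job, e.g. in the job's configuration or on the command line:
       hadoop jar job.jar SomeJob -D mapreduce.map.memory.mb=8192 ...
     Requesting (nearly) the whole node's memory per map task limits a
     node to one concurrent mapper for that job only. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>8192</value>
</property>
```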


-Rahul




On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <at...@gmail.com>wrote:

> Short version : let's say you have 20 nodes, and each node has 10 mapper
> slots.  You start a job with 20 very small input files.  How is the work
> distributed to the cluster?  Will it be even, with each node spawning one
> mapper task?  Is there any way of predicting or controlling how the work
> will be distributed?
>
> Long version : My cluster is currently used for two different jobs.  The
> cluster is currently optimized for Job A, so each node has a maximum of 18
> mapper slots.  However, I also need to run Job B.  Job B is VERY
> cpu-intensive, so we really only want one mapper to run on a node at any
> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
> gives you any way to set the maximum number of mappers per node on a
> per-job basis.  I'm at my wit's end here, and considering some rather
> egregious workarounds.  If you can think of anything that can help me, I'd
> very much appreciate it.
>
> Thanks!
>
> --Jeremy
>

Unsubscribe

Posted by Bruce Perttunen <br...@gmail.com>.
Unsubscribe

