Posted to hdfs-user@hadoop.apache.org by Marco Zühlke <mz...@gmail.com> on 2012/10/30 16:49:29 UTC

Memory based scheduling

Hi,

on our cluster our jobs are usually satisfied with less than 2 GB of heap space,
so we have a maximum of 3 maps on our 8 GB computers and a maximum of 4 maps
on our 16 GB computers (we only have quad-core CPUs and want to leave memory
for the reducers). This works very well.

But now we have a new kind of job. Each mapper requires at least 4 GB
of heap space.

Is it possible to limit the number of tasks (mappers) per computer to 1 or 2
for these kinds of jobs?

Regards,
Marco

Re: Memory based scheduling

Posted by Arun C Murthy <ac...@hortonworks.com>.
Yes, use the CapacityScheduler and ask for multiple slots per map or reduce task:
http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling
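To make this concrete: once the memory-aware CapacityScheduler settings described later in this thread are in place, "asking for multiple slots" just means submitting the job with a larger per-task memory request. A minimal sketch (the jar, class and paths are placeholders; assumes the job implements Tool so the -D options are picked up, and a 2048 MB slot size so a 4096 MB map reserves two slots):

    hadoop jar big-mem-job.jar com.example.BigMemJob \
        -D mapred.job.map.memory.mb=4096 \
        -D mapred.job.reduce.memory.mb=4096 \
        -D mapred.child.java.opts=-Xmx3584m \
        /input /output

The -Xmx is kept a little below the 4096 MB request because the request is meant to cover the whole task process (JVM overhead included), not just the Java heap.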

Arun

On Oct 30, 2012, at 8:49 AM, Marco Zühlke wrote:

> Hi,
> 
> on our cluster our jobs are usually satisfied with less than 2 GB of heap space,
> so we have a maximum of 3 maps on our 8 GB computers and a maximum of 4 maps
> on our 16 GB computers (we only have quad-core CPUs and want to leave memory
> for the reducers). This works very well.
> 
> But now we have a new kind of job. Each mapper requires at least 4 GB
> of heap space.
> 
> Is it possible to limit the number of tasks (mappers) per computer to 1 or 2
> for these kinds of jobs?
> 
> Regards,
> Marco
> 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: Memory based scheduling

Posted by Harsh J <ha...@cloudera.com>.
Arun is correct: on the 0.20.x and 1.x line, the CS allows "slot
reservation" based on the memory requests provided by each job.

You'd have to set your cluster's maximum allowed per-task memory
request [0], define the per-"slot" memory unit [1], and then set,
on a per-job basis, the real memory request you need [2].
While the framework itself allows for memory usage monitoring in
general, the CS further allows "slot management" based on the requested
resources, so if you request a 4 GB map task memory resource [2] on
a cluster slot definition of 2 GB [1], two slots get reserved to run
such a task JVM. Arun's link has more info on setting up the whole CS.

Btw, you may also want
https://issues.apache.org/jira/browse/MAPREDUCE-4001 and
https://issues.apache.org/jira/browse/MAPREDUCE-3789 in your Hadoop
release/distribution, since your environment is heterogeneous, and the
1.x/0.20.x CS without these fixes applied might end up wasting some
cluster resources unnecessarily.

I'd also recommend looking at YARN, which is driven purely based on
resource requests (memory currently, but soon CPU and others).

[0] - mapred.cluster.max.map.memory.mb and mapred.cluster.max.reduce.memory.mb
[1] - mapred.cluster.map.memory.mb and mapred.cluster.reduce.memory.mb
[2] - mapred.job.map.memory.mb and mapred.job.reduce.memory.mb
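
To make the above concrete, a minimal mapred-site.xml sketch (values are only illustrative for Marco's 8 GB quad-core nodes; assumes the 1.x CapacityTaskScheduler is the configured scheduler):

    <!-- enable the CapacityScheduler -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
    </property>
    <!-- [1] memory represented by one map/reduce slot -->
    <property>
      <name>mapred.cluster.map.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapred.cluster.reduce.memory.mb</name>
      <value>2048</value>
    </property>
    <!-- [0] largest memory a single task may request -->
    <property>
      <name>mapred.cluster.max.map.memory.mb</name>
      <value>4096</value>
    </property>
    <property>
      <name>mapred.cluster.max.reduce.memory.mb</name>
      <value>4096</value>
    </property>

A job that then sets mapred.job.map.memory.mb=4096 [2] would have each of its map tasks reserve two of the 2048 MB slots.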

On Tue, Oct 30, 2012 at 10:54 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
> Not true, take a look at my prev. response.
>
> On Oct 30, 2012, at 9:08 AM, lohit wrote:
>
> As far as I recall this is not possible. Per-job or per-user configurations
> like these are a little difficult in the existing version.
> What you could try is to set the maximum maps per job to, say, half of the
> cluster capacity. (This is possible with the FairScheduler; I do not know
> about the CapacityScheduler.)
> For example, if you have 10 nodes with 4 slots each, you would create a pool
> and set max maps to 20.
> The JobTracker will try its best to spread tasks across nodes provided there
> are empty slots. But again, this is not guaranteed.
>
>
> 2012/10/30 Marco Zühlke <mz...@gmail.com>
>>
>> Hi,
>>
>> on our cluster our jobs are usually satisfied with less than 2 GB of heap
>> space, so we have a maximum of 3 maps on our 8 GB computers and a maximum
>> of 4 maps on our 16 GB computers (we only have quad-core CPUs and want to
>> leave memory for the reducers). This works very well.
>>
>> But now we have a new kind of job. Each mapper requires at least 4 GB
>> of heap space.
>>
>> Is it possible to limit the number of tasks (mappers) per computer to 1 or 2
>> for these kinds of jobs?
>>
>> Regards,
>> Marco
>>
>
>
>
> --
> Have a Nice Day!
> Lohit
>
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
>



-- 
Harsh J

Re: Memory based scheduling

Posted by lohit <lo...@gmail.com>.
I did not know about this feature in the CapacityScheduler.
Just to get more information, let's assume a node in the cluster has 4 slots
and 8 GB (as mentioned by Marco). Now if 2 slots are occupied by a 4 GB task,
what happens when another job is scheduled that asks for only 2 GB per slot?
Does it get scheduled? And if so, does it get killed instead of the 4 GB
task?


2012/10/30 Arun C Murthy <ac...@hortonworks.com>

> Not true, take a look at my prev. response.
>
> On Oct 30, 2012, at 9:08 AM, lohit wrote:
>
> As far as I recall this is not possible. Per-job or per-user configurations
> like these are a little difficult in the existing version.
> What you could try is to set the maximum maps per job to, say, half of the
> cluster capacity. (This is possible with the FairScheduler; I do not know
> about the CapacityScheduler.)
> For example, if you have 10 nodes with 4 slots each, you would create a pool
> and set max maps to 20.
> The JobTracker will try its best to spread tasks across nodes provided there
> are empty slots. But again, this is not guaranteed.
>
>
> 2012/10/30 Marco Zühlke <mz...@gmail.com>
>
>> Hi,
>>
>> on our cluster our jobs are usually satisfied with less than 2 GB of heap
>> space, so we have a maximum of 3 maps on our 8 GB computers and a maximum
>> of 4 maps on our 16 GB computers (we only have quad-core CPUs and want to
>> leave memory for the reducers). This works very well.
>>
>> But now we have a new kind of job. Each mapper requires at least 4 GB
>> of heap space.
>>
>> Is it possible to limit the number of tasks (mappers) per computer to 1 or 2
>> for these kinds of jobs?
>>
>> Regards,
>> Marco
>>
>>
>
>
> --
> Have a Nice Day!
> Lohit
>
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
>
>


-- 
Have a Nice Day!
Lohit

Re: Memory based scheduling

Posted by Arun C Murthy <ac...@hortonworks.com>.
Not true, take a look at my prev. response.

On Oct 30, 2012, at 9:08 AM, lohit wrote:

> As far as I recall this is not possible. Per-job or per-user configurations like these are a little difficult in the existing version.
> What you could try is to set the maximum maps per job to, say, half of the cluster capacity. (This is possible with the FairScheduler; I do not know about the CapacityScheduler.)
> For example, if you have 10 nodes with 4 slots each, you would create a pool and set max maps to 20.
> The JobTracker will try its best to spread tasks across nodes provided there are empty slots. But again, this is not guaranteed.
> 
> 
> 2012/10/30 Marco Zühlke <mz...@gmail.com>
> Hi,
> 
> on our cluster our jobs are usually satisfied with less than 2 GB of heap space,
> so we have a maximum of 3 maps on our 8 GB computers and a maximum of 4 maps
> on our 16 GB computers (we only have quad-core CPUs and want to leave memory
> for the reducers). This works very well.
> 
> But now we have a new kind of job. Each mapper requires at least 4 GB
> of heap space.
> 
> Is it possible to limit the number of tasks (mappers) per computer to 1 or 2
> for these kinds of jobs?
> 
> Regards,
> Marco
> 
> 
> 
> 
> -- 
> Have a Nice Day!
> Lohit

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: Memory based scheduling

Posted by lohit <lo...@gmail.com>.
As far as I recall this is not possible. Per-job or per-user configurations
like these are a little difficult in the existing version.
What you could try is to set the maximum maps per job to, say, half of the
cluster capacity. (This is possible with the FairScheduler; I do not know
about the CapacityScheduler.)
For example, if you have 10 nodes with 4 slots each, you would create a pool
and set max maps to 20.
The JobTracker will try its best to spread tasks across nodes provided there
are empty slots. But again, this is not guaranteed.
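
To make that concrete, the cap would go into the FairScheduler allocation file. A minimal sketch (pool name and limits are only examples; assumes the Hadoop 1.x FairScheduler and its maxMaps/maxReduces pool elements):

    <?xml version="1.0"?>
    <!-- fair-scheduler.xml, pointed to by mapred.fairscheduler.allocation.file -->
    <allocations>
      <pool name="bigmem">
        <!-- roughly half of a 10-node x 4-slot cluster -->
        <maxMaps>20</maxMaps>
        <maxReduces>10</maxReduces>
      </pool>
    </allocations>

Jobs would then be submitted into that pool (for example via mapred.fairscheduler.pool=bigmem). Note this caps the pool's total running maps across the cluster, not the number per node.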


2012/10/30 Marco Zühlke <mz...@gmail.com>

> Hi,
>
> on our cluster our jobs usually satisfied with less than 2 GB of heap
> space.
> so we have on our 8 GB computers 3 maps maximum and on our 16 GB
> computers 4 maps maximum (we only have quad core CPUs and to have
> memory left for reducers). This works very well.
>
> But now we have a new kind of jobs. Each mapper requires at lest 4 GB
> of heap space.
>
> Is it possible to limit the number of tasks (mapper) per computer to 1 or
> 2 for
> these kinds of jobs ?
>
> Regards,
> Marco
>
>


-- 
Have a Nice Day!
Lohit

RE: Memory based scheduling

Posted by "Kaczmarek, Eric" <er...@intel.com>.
Someone might correct me if I am wrong, but isn't the number of mappers determined by your input size and HDFS block size?

For example, if your input per system is 1 MB, setting your block size to 512 KB should result in only 2 mappers executing on that system?

-Eric

From: Marco Zühlke [mailto:mzuehlke@gmail.com]
Sent: Tuesday, October 30, 2012 8:49 AM
To: user@hadoop.apache.org
Subject: Memory based scheduling

Hi,

on our cluster our jobs are usually satisfied with less than 2 GB of heap space,
so we have a maximum of 3 maps on our 8 GB computers and a maximum of 4 maps
on our 16 GB computers (we only have quad-core CPUs and want to leave memory
for the reducers). This works very well.

But now we have a new kind of job. Each mapper requires at least 4 GB
of heap space.

Is it possible to limit the number of tasks (mappers) per computer to 1 or 2
for these kinds of jobs?

Regards,
Marco
