Posted to hdfs-user@hadoop.apache.org by Shaojun Zhao <sh...@gmail.com> on 2013/01/18 21:05:57 UTC

config for high memory jobs does not work, please help.

Dear all,

I know it is best to use a small amount of memory in the mapper and
reducer. However, sometimes that is hard to do. For example, in machine
learning algorithms it is common to load the model into memory in the
mapper step. When the model is big, I have to allocate a lot of memory
for the mapper.

Here is my question: how can I configure Hadoop so that it does not fork
too many mappers and run out of physical memory?

My machines have 24 GB of RAM, and I have 100 of them. Every time, Hadoop
forks 6 mappers on each machine, no matter what config I use. I really
want to reduce that to whatever number I want, for example just 1
mapper per machine.

Here are the configs I tried (I use streaming, and I pass the configs
on the command line):

-Dmapred.child.java.opts=-Xmx8000m  <-- did not bring down the number of mappers

-Dmapred.cluster.map.memory.mb=32000 <-- did not bring down the number
of mappers
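
For context, the full Hadoop Streaming invocation being described is along
these lines (the streaming jar path, input/output paths, and mapper/reducer
scripts are illustrative only; note that the -D generic options must come
before the streaming-specific options):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
      -Dmapred.child.java.opts=-Xmx8000m \
      -Dmapred.cluster.map.memory.mb=32000 \
      -input /data/input \
      -output /data/output \
      -mapper mapper.py \
      -reducer reducer.py \
      -file mapper.py -file reducer.py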

Am I missing something here?
I use Hadoop 0.20.205

Thanks a lot in advance!
-Shaojun

Re: config for high memory jobs does not work, please help.

Posted by Arun C Murthy <ac...@hortonworks.com>.
Not sure about EMR, but if you install your own cluster on EC2 you can use the configs mentioned here:

>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
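
Concretely, on a self-managed 0.20.x cluster that means pointing the JobTracker
at the CapacityScheduler in mapred-site.xml, roughly like this (a sketch: the
capacity-scheduler contrib jar also has to be on the JobTracker's classpath,
the JobTracker needs a restart, and queue capacities plus memory limits then go
in conf/capacity-scheduler.xml as the linked doc describes):

  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
  </property>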

Arun

On Jan 18, 2013, at 2:50 PM, Shaojun Zhao wrote:

> I am using Amazon EC2/EMR.
> jps give this
> 16600 JobTracker
> 2732 RunJar
> 2504 StatePusher
> 31902 instance-controller.jar
> 23553 Jps
> 22444 RunJar
> 2077 NameNode
> 
> I am not sure how I can impose capacityscheduler on ec2/emr machines.
> -Shaojun
> 
> On Fri, Jan 18, 2013 at 1:18 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
>> Take a look at the CapacityScheduler and 'High RAM' jobs where-by you can run M map slots per node and request, per-job, that you want N (where N = max(1, N, M)).
>> 
>> Some more info:
>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling
>> http://hortonworks.com/blog/understanding-apache-hadoops-capacity-scheduler/
>> 
>> hth,
>> Arun
>> 
>> On Jan 18, 2013, at 12:05 PM, Shaojun Zhao wrote:
>> 
>>> Dear all,
>>> 
>>> I know it is best to use small amount of mem in mapper and reduce.
>>> However, sometimes it is hard to do so. For example, in machine
>>> learning algorithms, it is common to load the model into mem in the
>>> mapper step. When the model is big, I have to allocate a lot of mem
>>> for the mapper.
>>> 
>>> Here is my question: how can I config hadoop so that it does not fork
>>> too many mappers and run out of physical memory?
>>> 
>>> My machines have 24G, and I have 100 of them. Each time, hadoop will
>>> fork 6 mappers on each machine, no matter what config I used. I really
>>> want to reduce it to what ever number I want, for example, just 1
>>> mapper per machine.
>>> 
>>> Here are the config I tried. (I use streaming, and I pass the config
>>> in the command line)
>>> 
>>> -Dmapred.child.java.opts=-Xmx8000m  <-- did not bring down the number of mappers
>>> 
>>> -Dmapred.cluster.map.memory.mb=32000 <-- did not bring down the number
>>> of mappers
>>> 
>>> Am I missing something here?
>>> I use Hadoop 0.20.205
>>> 
>>> Thanks a lot in advance!
>>> -Shaojun
>> 
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>> 
>> 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: config for high memory jobs does not work, please help.

Posted by Shaojun Zhao <sh...@gmail.com>.
I am using Amazon EC2/EMR.
jps gives this:
16600 JobTracker
2732 RunJar
2504 StatePusher
31902 instance-controller.jar
23553 Jps
22444 RunJar
2077 NameNode

I am not sure how I can enable the CapacityScheduler on EC2/EMR machines.
-Shaojun

On Fri, Jan 18, 2013 at 1:18 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
> Take a look at the CapacityScheduler and 'High RAM' jobs where-by you can run M map slots per node and request, per-job, that you want N (where N = max(1, N, M)).
>
> Some more info:
> http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling
> http://hortonworks.com/blog/understanding-apache-hadoops-capacity-scheduler/
>
> hth,
> Arun
>
> On Jan 18, 2013, at 12:05 PM, Shaojun Zhao wrote:
>
>> Dear all,
>>
>> I know it is best to use small amount of mem in mapper and reduce.
>> However, sometimes it is hard to do so. For example, in machine
>> learning algorithms, it is common to load the model into mem in the
>> mapper step. When the model is big, I have to allocate a lot of mem
>> for the mapper.
>>
>> Here is my question: how can I config hadoop so that it does not fork
>> too many mappers and run out of physical memory?
>>
>> My machines have 24G, and I have 100 of them. Each time, hadoop will
>> fork 6 mappers on each machine, no matter what config I used. I really
>> want to reduce it to what ever number I want, for example, just 1
>> mapper per machine.
>>
>> Here are the config I tried. (I use streaming, and I pass the config
>> in the command line)
>>
>> -Dmapred.child.java.opts=-Xmx8000m  <-- did not bring down the number of mappers
>>
>> -Dmapred.cluster.map.memory.mb=32000 <-- did not bring down the number
>> of mappers
>>
>> Am I missing something here?
>> I use Hadoop 0.20.205
>>
>> Thanks a lot in advance!
>> -Shaojun
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
>

Re: config for high memory jobs does not work, please help.

Posted by Arun C Murthy <ac...@hortonworks.com>.
Take a look at the CapacityScheduler and 'High RAM' jobs, whereby you can configure M map slots per node and request, per job, that each task use N slots (where 1 <= N <= M).

Some more info:
http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling
http://hortonworks.com/blog/understanding-apache-hadoops-capacity-scheduler/
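
As a sketch of how the pieces fit together on 0.20.x (the values below are only
illustrative), the cluster side defines the per-node slot count and the slot
size, e.g. in mapred-site.xml (shown here as name = value for brevity):

  mapred.tasktracker.map.tasks.maximum = 6
  mapred.cluster.map.memory.mb         = 4000
  mapred.cluster.max.map.memory.mb     = 24000

and a high-RAM job then asks for several slots' worth of memory per task at
submit time:

  -Dmapred.job.map.memory.mb=24000

With these numbers each map task occupies 24000/4000 = 6 slots, i.e. one mapper
per node for this job, while other jobs on the cluster are unaffected.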

hth,
Arun

On Jan 18, 2013, at 12:05 PM, Shaojun Zhao wrote:

> Dear all,
> 
> I know it is best to use small amount of mem in mapper and reduce.
> However, sometimes it is hard to do so. For example, in machine
> learning algorithms, it is common to load the model into mem in the
> mapper step. When the model is big, I have to allocate a lot of mem
> for the mapper.
> 
> Here is my question: how can I config hadoop so that it does not fork
> too many mappers and run out of physical memory?
> 
> My machines have 24G, and I have 100 of them. Each time, hadoop will
> fork 6 mappers on each machine, no matter what config I used. I really
> want to reduce it to what ever number I want, for example, just 1
> mapper per machine.
> 
> Here are the config I tried. (I use streaming, and I pass the config
> in the command line)
> 
> -Dmapred.child.java.opts=-Xmx8000m  <-- did not bring down the number of mappers
> 
> -Dmapred.cluster.map.memory.mb=32000 <-- did not bring down the number
> of mappers
> 
> Am I missing something here?
> I use Hadoop 0.20.205
> 
> Thanks a lot in advance!
> -Shaojun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: config for high memory jobs does not work, please help.

Posted by Shaojun Zhao <sh...@gmail.com>.
I do have this in my command line, and it did not work.
-Dmapred.tasktracker.map.tasks.maximum=2

I also tried changing mapred-site.xml and restarting the tasktracker, but it
did not work either. I am sure it will work if I restart everything,
but I really do not want to lose my data on HDFS, so I have not tried
restarting everything.

Best regards,
-Shaojun


On Fri, Jan 18, 2013 at 12:23 PM, Jeffrey Buell <jb...@vmware.com> wrote:
> Try:
>
> -Dmapred.tasktracker.map.tasks.maximum=1
>
> Although I usually put this parameter in mapred-site.xml.
>
> Jeff
>
>
> Dear all,
>
> I know it is best to use small amount of mem in mapper and reduce.
> However, sometimes it is hard to do so. For example, in machine
> learning algorithms, it is common to load the model into mem in the
> mapper step. When the model is big, I have to allocate a lot of mem
> for the mapper.
>
> Here is my question: how can I config hadoop so that it does not fork
> too many mappers and run out of physical memory?
>
> My machines have 24G, and I have 100 of them. Each time, hadoop will
> fork 6 mappers on each machine, no matter what config I used. I really
> want to reduce it to what ever number I want, for example, just 1
> mapper per machine.
>
> Here are the config I tried. (I use streaming, and I pass the config
> in the command line)
>
> -Dmapred.child.java.opts=-Xmx8000m  <-- did not bring down the number of mappers
>
> -Dmapred.cluster.map.memory.mb=32000 <-- did not bring down the number
> of mappers
>
> Am I missing something here?
> I use Hadoop 0.20.205
>
> Thanks a lot in advance!
> -Shaojun

Re: config for high memory jobs does not work, please help.

Posted by Jeffrey Buell <jb...@vmware.com>.
Try:

-Dmapred.tasktracker.map.tasks.maximum=1

Although I usually put this parameter in mapred-site.xml.
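
This is a tasktracker (daemon) setting rather than a per-job one: each
tasktracker reads it from its own config when it starts, so passing it with -D
at job submission has no effect. A minimal mapred-site.xml snippet for every
worker node would look roughly like this (the value 1 matches the
one-mapper-per-node goal here):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

Picking the change up does not require a full cluster restart, and HDFS data is
not affected; restarting only the MapReduce daemons is enough on a stock 0.20.x
tarball install (EMR's layout and init scripts differ):

  $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker     # run on each worker node
  $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker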

Jeff


Dear all,

I know it is best to use small amount of mem in mapper and reduce.
However, sometimes it is hard to do so. For example, in machine
learning algorithms, it is common to load the model into mem in the
mapper step. When the model is big, I have to allocate a lot of mem
for the mapper.

Here is my question: how can I config hadoop so that it does not fork
too many mappers and run out of physical memory?

My machines have 24G, and I have 100 of them. Each time, hadoop will
fork 6 mappers on each machine, no matter what config I used. I really
want to reduce it to what ever number I want, for example, just 1
mapper per machine.

Here are the config I tried. (I use streaming, and I pass the config
in the command line)

-Dmapred.child.java.opts=-Xmx8000m  <-- did not bring down the number of mappers

-Dmapred.cluster.map.memory.mb=32000 <-- did not bring down the number
of mappers

Am I missing something here?
I use Hadoop 0.20.205

Thanks a lot in advance!
-Shaojun
