Posted to hdfs-user@hadoop.apache.org by Jakub Stransky <st...@gmail.com> on 2014/09/12 17:51:23 UTC

CPU utilization

Hello experienced hadoop users,

I have a beginner's question regarding CPU utilization on datanodes when
running an MR job. The cluster has 5 machines, 2 NN + 3 DN, on really
inexpensive hardware, using the following parameters:
# hadoop - yarn-site.xml
yarn.nodemanager.resource.memory-mb  : 2048
yarn.scheduler.minimum-allocation-mb : 256
yarn.scheduler.maximum-allocation-mb : 2048

# hadoop - mapred-site.xml
mapreduce.map.memory.mb              : 768
mapreduce.map.java.opts              : -Xmx512m
mapreduce.reduce.memory.mb           : 1024
mapreduce.reduce.java.opts           : -Xmx768m
mapreduce.task.io.sort.mb            : 100
yarn.app.mapreduce.am.resource.mb    : 1024
yarn.app.mapreduce.am.command-opts   : -Xmx768m

and I have a map-only job which uses 3 mappers that are essentially
distributed across the cluster - 1 task per DN. What I see on the cluster
nodes is that CPU utilization doesn't exceed 30%.

Am I right that Hadoop really limits all the resources on a per-container
basis? I wasn't able to find any command/setting which would prove this
theory. ulimit for YARN was unlimited, etc.

Not sure if I am missing something here

Thanks for providing more insight into resource planning and utilization
Jakub

Re: CPU utilization

Posted by Adam Kawa <ka...@gmail.com>.
> Adam, how did you come to the conclusion that it is memory bound?
>

I mean the number of containers running on your NodeManager, not the job
itself.

Re: CPU utilization

Posted by Jakub Stransky <st...@gmail.com>.
Adam, how did you come to the conclusion that it is memory bound? I
haven't found any such sign; even though the map tasks were assigned 768MB,
the job counters reported that only around 600MB was used, with no
significant GC time.

To be more specific about the job: in essence it loads data out of
Kafka messaging in Protocol Buffers format, deserializes it, and remaps it
to the Avro data format. That is performed on a per-record basis, except for
the Kafka reader, which performs bulk reads via a buffer. Increasing the
buffer size and the fetch size didn't have any significant impact.

Maybe a completely silly question: how do I recognize that I have a
memory-bound job? Having a ~600MB heap and GC time of around 30 sec out
of a 60 min job doesn't seem to me like a sign of insufficient memory.
I don't see any apparent bound except the one I mentioned on CPU per task
process via the top command.
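
For what it's worth, the check I have been doing so far is roughly the sketch
below (it reads the standard org.apache.hadoop.mapreduce.TaskCounter counters;
the interpretation of the ratio is just my own rule of thumb):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// Rough heuristic: if the tasks spend a noticeable fraction of their CPU time
// in GC, memory pressure is a likely suspect; otherwise the bound is elsewhere.
public class GcCheck {
    public static void report(Job job) throws Exception {
        Counters counters = job.getCounters();
        long gcMillis  = counters.findCounter(TaskCounter.GC_TIME_MILLIS).getValue();
        long cpuMillis = counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();
        double gcFraction = cpuMillis > 0 ? (double) gcMillis / cpuMillis : 0.0;
        System.out.printf("GC: %d ms, CPU: %d ms, GC fraction: %.1f%%%n",
                gcMillis, cpuMillis, gcFraction * 100);
        // In my case: ~30 s of GC over a 60 min job is only a few percent at most,
        // which does not look like a memory-bound job to me.
    }
}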


On 12 September 2014 20:57, Adam Kawa <ka...@gmail.com> wrote:

> Your NodeManager can use 2048 MB (yarn.nodemanager.resource.memory-mb) for
> allocating containers.
>
> If you run a map task, you need 768 MB (mapreduce.map.memory.mb).
> If you run a reduce task, you need 1024 MB (mapreduce.reduce.memory.mb).
> If you run the MapReduce app master, you need 1024 MB
> (yarn.app.mapreduce.am.resource.mb).
>
> Therefore, when you run a MapReduce job, you can run only 2 containers per
> NodeManager (3 x 768 = 2304 > 2048) on your setup.
>
> 2014-09-12 20:37 GMT+02:00 Jakub Stransky <st...@gmail.com>:
>
>
>>  I thought that the memory assigned has to be a multiple of
>> yarn.scheduler.minimum-allocation-mb and is rounded up accordingly.
>>
>
> That's right. It also specifies the minimum size of a container, to prevent
> requests for unreasonably small containers (which are likely to cause
> task failures).
>
>>
>> I am not aware of any others. Are there any additional parameters like
>> the ones you mentioned which should be set?
>>
>
> There are also settings related to vcores in mapred-site.xml and
> yarn-site.xml. But they don't change anything in your case (as you are
> limited by the memory, not vcores).
>
>
>> The job wasn't the smallest, but it wasn't PBs of data either. It was run on
>> 1.5GB of data and ran for 60 min. I wasn't able to make any significant
>> improvement. It is a map-only job, and I wasn't able to achieve more than 30%
>> of total machine CPU utilization. However, the top command was displaying
>> 100 %CPU for the task process running on the data node, which is why I was
>> thinking about a per-container process limit. I didn't find any other
>> bottleneck like I/O, network, or memory.
>>
>
> CPU utilization depends on the type of your jobs (e.g. doing complex math
> operations or just counting words) and the number of containers you run. If
> you want to play with this, you can run more CPU-bound jobs or increase the
> number of containers running on a node.
>



-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky

Re: CPU utilization

Posted by Adam Kawa <ka...@gmail.com>.
Your NodeManager can use 2048 MB (yarn.nodemanager.resource.memory-mb) for
allocating containers.

If you run a map task, you need 768 MB (mapreduce.map.memory.mb).
If you run a reduce task, you need 1024 MB (mapreduce.reduce.memory.mb).
If you run the MapReduce app master, you need 1024 MB
(yarn.app.mapreduce.am.resource.mb).

Therefore, when you run a MapReduce job, you can run only 2 containers per
NodeManager (3 x 768 = 2304 > 2048) on your setup.
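
Just to illustrate the arithmetic (a throw-away sketch, nothing YARN-specific -
the scheduler does this bookkeeping itself):

// Back-of-the-envelope container math for the settings above.
public class ContainerMath {
    public static void main(String[] args) {
        int nodeMemoryMb   = 2048;  // yarn.nodemanager.resource.memory-mb
        int mapContainerMb = 768;   // mapreduce.map.memory.mb
        int amContainerMb  = 1024;  // yarn.app.mapreduce.am.resource.mb

        // A map-only job: on the node that also hosts the AM, even less room is left.
        int mapsPerPlainNode = nodeMemoryMb / mapContainerMb;                   // 2048 / 768 = 2
        int mapsNextToAm     = (nodeMemoryMb - amContainerMb) / mapContainerMb; // 1024 / 768 = 1

        System.out.println("Map containers per NodeManager:     " + mapsPerPlainNode);
        System.out.println("Map containers on the node with AM: " + mapsNextToAm);
    }
}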

2014-09-12 20:37 GMT+02:00 Jakub Stransky <st...@gmail.com>:


>  I thought that the memory assigned has to be a multiple of
> yarn.scheduler.minimum-allocation-mb and is rounded up accordingly.
>

That's right. It also specifies the minimum size of a container, to prevent
requests for unreasonably small containers (which are likely to cause
task failures).
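
A rough sketch of that rounding (assuming the usual behaviour of normalizing a
request up to the nearest multiple of the minimum allocation - the exact rule
depends on the scheduler and its resource calculator):

// Requests are bumped up to a multiple of yarn.scheduler.minimum-allocation-mb.
public class AllocationRounding {
    static int roundUp(int requestedMb, int minAllocationMb) {
        return ((requestedMb + minAllocationMb - 1) / minAllocationMb) * minAllocationMb;
    }

    public static void main(String[] args) {
        int minAlloc = 256;                          // yarn.scheduler.minimum-allocation-mb
        System.out.println(roundUp(768, minAlloc));  // 768  - already a multiple of 256
        System.out.println(roundUp(1024, minAlloc)); // 1024 - already a multiple of 256
        System.out.println(roundUp(800, minAlloc));  // 1024 - rounded up
    }
}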

>
> I am not aware of any others. Are there any additional parameters like the
> ones you mentioned which should be set?
>

There are also settings related to vcores in mapred-site.xml and
yarn-site.xml. But they don't change anything in your case (as you are
limited by the memory, not vcores).


> The job wasn't the smallest, but it wasn't PBs of data either. It was run on
> 1.5GB of data and ran for 60 min. I wasn't able to make any significant
> improvement. It is a map-only job, and I wasn't able to achieve more than 30%
> of total machine CPU utilization. However, the top command was displaying
> 100 %CPU for the task process running on the data node, which is why I was
> thinking about a per-container process limit. I didn't find any other
> bottleneck like I/O, network, or memory.
>

CPU utilization depends on the type of your jobs (e.g. doing complex math
operations or just counting words) and the number of containers you run. If
you want to play with this, you can run more CPU-bound jobs or increase the
number of containers running on a node.
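
For example (just a sketch, and assuming the boxes have no spare RAM beyond the
2048 MB already handed to YARN), shrinking the map containers would let more of
them run per node:

# hadoop - mapred-site.xml
mapreduce.map.memory.mb              : 512
mapreduce.map.java.opts              : -Xmx384m

That would allow up to 4 map containers per NodeManager (4 x 512 = 2048) instead
of 2, at the cost of less heap per task - whether that helps depends on how much
memory a single map task really needs.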

Re: CPU utilization

Posted by Jakub Stransky <st...@gmail.com>.
Hi Adam,

thanks for your response. I thought that the memory assigned has to be a
multiple of yarn.scheduler.minimum-allocation-mb and is rounded up accordingly.

I am setting just the properties mentioned, which means:
# hadoop - yarn-site.xml
yarn.nodemanager.resource.memory-mb  : 2048
yarn.scheduler.minimum-allocation-mb : 256
yarn.scheduler.maximum-allocation-mb : 2048

# hadoop - mapred-site.xml
mapreduce.map.memory.mb              : 768
mapreduce.map.java.opts              : -Xmx512m
mapreduce.reduce.memory.mb           : 1024
mapreduce.reduce.java.opts           : -Xmx768m
mapreduce.task.io.sort.mb            : 100
yarn.app.mapreduce.am.resource.mb    : 1024
yarn.app.mapreduce.am.command-opts   : -Xmx768m

I am not aware of any others. Are there any additional parameters like the ones
you mentioned which should be set? The job wasn't the smallest, but it wasn't
PBs of data either. It was run on 1.5GB of data and ran for 60 min. I wasn't
able to make any significant improvement. It is a map-only job, and I wasn't
able to achieve more than 30% of total machine CPU utilization. However, the
top command was displaying 100 %CPU for the task process running on the data
node, which is why I was thinking about a per-container process limit. I didn't
find any other bottleneck like I/O, network, or memory.

Thanks for any help or clarification
Jakub


On 12 September 2014 18:23, Adam Kawa <ka...@gmail.com> wrote:

> Hi,
>
> With these settings, you are able to start at most 2 containers per
> NodeManager (yarn.nodemanager.resource.memory-mb = 2048). The size of
> your containers is between 768 and 1024 MB (not sure what your value of
> yarn.nodemanager.resource.cpu-vcores is).
> Have you tried to run more (or bigger) jobs on the cluster concurrently?
> Then you might see higher CPU utilization than 30%.
>
> Cheers!
> Adam
>
> 2014-09-12 17:51 GMT+02:00 Jakub Stransky <st...@gmail.com>:
>
>> Hello experienced hadoop users,
>>
>> I have a beginner's question regarding CPU utilization on datanodes when
>> running an MR job. The cluster has 5 machines, 2 NN + 3 DN, on really
>> inexpensive hardware, using the following parameters:
>> # hadoop - yarn-site.xml
>> yarn.nodemanager.resource.memory-mb  : 2048
>> yarn.scheduler.minimum-allocation-mb : 256
>> yarn.scheduler.maximum-allocation-mb : 2048
>>
>> # hadoop - mapred-site.xml
>> mapreduce.map.memory.mb              : 768
>> mapreduce.map.java.opts              : -Xmx512m
>> mapreduce.reduce.memory.mb           : 1024
>> mapreduce.reduce.java.opts           : -Xmx768m
>> mapreduce.task.io.sort.mb            : 100
>> yarn.app.mapreduce.am.resource.mb    : 1024
>> yarn.app.mapreduce.am.command-opts   : -Xmx768m
>>
>> and I have a map-only job which uses 3 mappers that are essentially
>> distributed across the cluster - 1 task per DN. What I see on the cluster
>> nodes is that CPU utilization doesn't exceed 30%.
>>
>> Am I right that Hadoop really limits all the resources on a per-container
>> basis? I wasn't able to find any command/setting which would prove this
>> theory. ulimit for YARN was unlimited, etc.
>>
>> Not sure if I am missing something here
>>
>> Thanks for providing more insight into resource planning and utilization
>> Jakub
>>
>>
>>
>>
>>
>>
>


-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky

Re: CPU utilization

Posted by Adam Kawa <ka...@gmail.com>.
Hi,

With these settings, you are able to start at most 2 containers per
NodeManager (yarn.nodemanager.resource.memory-mb = 2048). The size of your
containers is between 768 and 1024 MB (not sure what your value of
yarn.nodemanager.resource.cpu-vcores is).
Have you tried to run more (or bigger) jobs on the cluster concurrently?
Then you might see higher CPU utilization than 30%.

Cheers!
Adam

2014-09-12 17:51 GMT+02:00 Jakub Stransky <st...@gmail.com>:

> Hello experienced hadoop users,
>
> I have a beginner's question regarding CPU utilization on datanodes when
> running an MR job. The cluster has 5 machines, 2 NN + 3 DN, on really
> inexpensive hardware, using the following parameters:
> # hadoop - yarn-site.xml
> yarn.nodemanager.resource.memory-mb  : 2048
> yarn.scheduler.minimum-allocation-mb : 256
> yarn.scheduler.maximum-allocation-mb : 2048
>
> # hadoop - mapred-site.xml
> mapreduce.map.memory.mb              : 768
> mapreduce.map.java.opts              : -Xmx512m
> mapreduce.reduce.memory.mb           : 1024
> mapreduce.reduce.java.opts           : -Xmx768m
> mapreduce.task.io.sort.mb            : 100
> yarn.app.mapreduce.am.resource.mb    : 1024
> yarn.app.mapreduce.am.command-opts   : -Xmx768m
>
> and I have a map-only job which uses 3 mappers that are essentially
> distributed across the cluster - 1 task per DN. What I see on the cluster
> nodes is that CPU utilization doesn't exceed 30%.
>
> Am I right that Hadoop really limits all the resources on a per-container
> basis? I wasn't able to find any command/setting which would prove this
> theory. ulimit for YARN was unlimited, etc.
>
> Not sure if I am missing something here
>
> Thanks for providing more insight into resource planning and utilization
> Jakub
>
>
>
>
>
>
