Posted to dev@samza.apache.org by Lukas Steiblys <lu...@doubledutch.me> on 2014/09/18 19:13:08 UTC

Samza Memory Usage on YARN

Hello,

I’m trying to use Samza for our new data processing pipeline, with YARN for job scheduling, and I’ve noticed that it consumes an incredibly large amount of memory. Running the Application Master, which in my opinion should be a very lightweight application, consumes ~1.4GB of virtual memory and ~200MB of physical memory. The same goes for the actual tasks.

Is this behavior common, or could this be some misconfiguration? As I understand it, one of the problems is that each container has its own VM instance and has to load all the libraries. Could there be some other issues? Maybe it’s possible to split the application master package from the task package so it’s more lightweight?

Lukas

Re: Samza Memory Usage on YARN

Posted by Chris Riccomini <cr...@linkedin.com.INVALID>.
Hey Lukas,

> In my case for some simple tasks that just process stateless messages,
> the AM container is taking essentially half of the resources - one
> container for the AM and one for the task instances, which again seem to
> request >1GB of virtual memory.

Yes, this is just something that we live with. An interesting JIRA might
be to consolidate the AM and SamzaContainer into just one container in
cases where you don't want to waste resources, but we haven't bothered
with this thus far.

Cheers,
Chris



Re: Samza Memory Usage on YARN

Posted by Lukas Steiblys <lu...@doubledutch.me>.
Ok, I have underestimated how much the AM is doing.

> All of our containers run with at least 1G, and the AM becomes completely
> negligible compared to the total amount of resources a job uses.

In my case for some simple tasks that just process stateless messages, the 
AM container is taking essentially half of the resources - one container for 
the AM and one for the task instances, which again seem to request >1GB of 
virtual memory. Shouldn't the task container be a little more lightweight 
then?

Thank you for the detailed explanation. Maybe my problem size is not big 
enough and it will all make sense when we have to process orders of 
magnitude more data.

Lukas




Re: Samza Memory Usage on YARN

Posted by Chris Riccomini <cr...@linkedin.com.INVALID>.
Hey Lukas,

As a preamble, I have to say, if you consider 200MB of memory usage an
incredibly large amount of memory, you're probably either working with the
wrong system, or worrying about optimizing the wrong thing. Your
SamzaContainers are likely not going to be able to run without a few
hundred megabytes of space. All of our containers run with at least 1G,
and the AM becomes completely negligible compared to the total amount of
resources a job uses.

The defaults for the AM and the SamzaContainer are the same:

  -Xmx768M
  1000MB containers

This means that YARN will kill your process (AM or SamzaContainer) if it
goes over the 1G container limit, and the JVM will throw an
OutOfMemoryError (OOME) if it goes over 768MB of heap usage.

First, I'll address the AM's heap. There are two main reasons why we want
a 768MB heap.

 * The AM runs a Scalatra webapp, which requires significant heap when it
runs. We tried other -Xmx settings, but 768 seemed to be the lowest stable
setting for all jobs.
 * Samza's core code is implemented in Scala, which can bloat the JVM. A
quick glance shows about 12% of heap used for random scala.reflect classes.

The 1G container limit (vs. 768MB heap) is to give the AM extra space for
things like:

 * perm gen
 * off-heap space
 * page cache
 * thread stacks
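
For illustration only (this is not Samza code), the arithmetic behind the gap between the container limit and the heap can be sketched as below. The two numbers are the defaults quoted above; the class and method names are made up for the example.

```java
// Sketch: how much room the default container leaves outside the Java heap.
// MemoryBudget and headroomMb are hypothetical names, not part of Samza or YARN.
final class MemoryBudget {

    /** Memory (in MB) left for perm gen, off-heap space, thread stacks, etc. */
    static int headroomMb(int containerMb, int heapMb) {
        if (heapMb > containerMb) {
            throw new IllegalArgumentException("heap exceeds container limit");
        }
        return containerMb - heapMb;
    }

    public static void main(String[] args) {
        int containerMb = 1000; // YARN container size quoted above
        int heapMb = 768;       // -Xmx768M
        System.out.println("non-heap headroom: "
                + headroomMb(containerMb, heapMb) + " MB");
        // prints "non-heap headroom: 232 MB"
    }
}
```

If you lower -Xmx, the same subtraction tells you roughly how far the container size could come down while keeping the same non-heap allowance.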

> Is this behavior common or could this be some misconfiguration?

It is common. I took a look at some of our jobs. They're running between
150MB and 250MB in steady state. When I load the AM webpage, the heap
spikes up to ~300MB.
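
If you want to look at numbers like these yourself, one option is the standard JVM memory MXBean. The snippet below is only a sketch: it reports the heap of the JVM it runs in, so it would have to be run inside the process you care about (or you could use a JDK tool such as jstat against the running process instead). The class name HeapSnapshot is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Sketch: read heap usage of the current JVM through the standard MXBean.
// HeapSnapshot is a made-up name; nothing here is Samza-specific.
final class HeapSnapshot {

    static long toMb(long bytes) {
        return bytes / (1024 * 1024);
    }

    static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    public static void main(String[] args) {
        MemoryUsage heap = heap();
        System.out.println("heap used:      " + toMb(heap.getUsed()) + " MB");
        System.out.println("heap committed: " + toMb(heap.getCommitted()) + " MB");
        // getMax() returns -1 when the maximum is undefined.
        System.out.println("heap max:       "
                + (heap.getMax() < 0 ? "undefined" : toMb(heap.getMax()) + " MB"));
    }
}
```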

> As I understand, one of the problems is that each container has its own
> VM instance and has to load all the libraries. Could there be some other
> issues?

There is a little bit of inefficiency from this, but it should be
negligible. The 200MB of heap usage that you're seeing are actual objects
being used by the AM. Don't forget that the AM is running a YARN client, a
web service, a MetricsReporter, etc.

If you're unhappy with the amount of memory that the AM is taking up, the
first thing that you can do is to tune these two settings:

  yarn.am.opts (to set -Xmx)
  yarn.am.container.memory.mb (to lower YARN container memory mb)


You can experiment to see how low you can get the heap and container
settings.
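
As a rough sketch of what such an experiment might check, the snippet below reads the two settings from a properties-style config and verifies that the heap still fits inside the container with some room to spare. The values (-Xmx512M, 768) and the 200MB headroom threshold are placeholders I picked for the example, not recommendations, and AmMemoryCheck is a hypothetical helper, not something Samza ships.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: sanity-check yarn.am.opts against yarn.am.container.memory.mb.
// AmMemoryCheck and all numeric values are illustrative assumptions.
final class AmMemoryCheck {

    // Only handles -Xmx values given in megabytes (m/M) or gigabytes (g/G).
    private static final Pattern XMX = Pattern.compile("-Xmx(\\d+)([mMgG])");

    /** Extracts the -Xmx value in MB from a JVM options string, or -1 if absent. */
    static int xmxMb(String jvmOpts) {
        Matcher m = XMX.matcher(jvmOpts);
        if (!m.find()) {
            return -1;
        }
        int value = Integer.parseInt(m.group(1));
        return m.group(2).equalsIgnoreCase("g") ? value * 1024 : value;
    }

    /** True when the heap fits in the container with at least minHeadroomMb to spare. */
    static boolean fits(Properties config, int minHeadroomMb) {
        int heapMb = xmxMb(config.getProperty("yarn.am.opts", ""));
        int containerMb =
                Integer.parseInt(config.getProperty("yarn.am.container.memory.mb", "0"));
        return heapMb > 0 && containerMb - heapMb >= minHeadroomMb;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical job config lines; only the two keys come from this thread.
        String text = "yarn.am.opts=-Xmx512M\n"
                + "yarn.am.container.memory.mb=768\n";
        Properties config = new Properties();
        config.load(new StringReader(text));

        System.out.println("heap MB: " + xmxMb(config.getProperty("yarn.am.opts")));
        System.out.println("fits with 200 MB headroom: " + fits(config, 200));
    }
}
```

With the placeholder values this reports a 512MB heap that fits, since 768 - 512 leaves 256MB outside the heap. Whether a given AM is actually stable at those settings is something only testing the job will show.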

Cheers,
Chris
