Posted to common-user@hadoop.apache.org by Scott Carey <sc...@richrelevance.com> on 2010/04/01 02:06:45 UTC

Re: swapping on hadoop

On Linux, check out the 'swappiness' OS tunable -- you can turn this down from the default to reduce swapping at the expense of some system file cache.  
However, you want a decent chunk of RAM left for the system to cache files -- if it is all allocated and used by Hadoop there will be extra I/O.
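
For example, one way to set it (the value itself is a judgment call; this only shows the mechanism):

# Lower the kernel's preference for swapping process pages out in favor of file cache.
# Takes effect immediately but does not survive a reboot on its own.
sysctl -w vm.swappiness=10

# Persist the setting across reboots.
echo "vm.swappiness = 10" >> /etc/sysctl.conf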

For Java GC, if your -Xmx is above 600MB or so, try either changing -XX:NewRatio to a smaller number (default is 2 for Sun JDK 6) or setting the -XX:MaxNewSize parameter to around 150MB to 250MB.
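
For example, here is a sketch of passing those flags to every task JVM through mapred.child.java.opts (the 0.20-era property; the same value can simply live in mapred-site.xml).  This assumes the job's driver uses ToolRunner so -D options are honored; the jar, class, and paths are placeholders.  The compressed-oops flag mentioned further down is included as well:

# Cap the young generation and enable compressed oops for each task JVM.
hadoop jar my-job.jar MyJob \
  -D mapred.child.java.opts="-Xmx700m -XX:MaxNewSize=200m -XX:+UseCompressedOops" \
  input/ output/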

An example of Hadoop memory use scaling as -Xmx grows:

Let's say you have a Hadoop job with a 100MB map-side join and 250MB of Hadoop sort space.

Both of these chunks of data will eventually get pushed to the tenured generation.   So, the actual heap required will end up close to:

(Size of young generation) + 100MB + 250MB + misc.  
The default size of the young generation is 1/3 of the heap.  So, at -Xmx750M this job will probably use a minimum of 600MB of Java heap, plus about 50MB non-heap if this is a pure Java job.

Now, perhaps due to some other jobs you want to set -Xmx1200M.  The above job will end up using about 150MB more now, because the new space has grown, although the footprint is the same.   A larger new space can improve performance, but with most typical hadoop jobs it won't.   Making sure it does not grow larger just because -Xmx is larger can help save a lot of memory.  Additionally, a job that would have failed with an OOME at -Xmx1200M might pass at -Xmx1000M if the young generation takes 150MB instead of 400MB of the space.
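
As a back-of-envelope check of those numbers (treating the young generation as exactly 1/(NewRatio+1) = 1/3 of the heap, which the JVM does not strictly guarantee):

# Rough young/tenured split for a few -Xmx values with the default NewRatio=2.
for xmx in 750 1000 1200; do
  young=$(( xmx / 3 ))
  echo "-Xmx${xmx}M -> young gen ~${young}MB, room for tenured data ~$(( xmx - young ))MB"
done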

If you are using a 64 bit JRE, you can also save space with the -XX:+UseCompressedOops option -- sometimes quite a bit of space.

On Mar 30, 2010, at 10:15 AM, Vasilis Liaskovitis wrote:

> Hi all,
> 
> I 've noticed swapping for a single terasort job on a small 8-node
> cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I
> can have back to back runs of the same job from the same hdfs input
> data and get swapping only on 1 out of 4 identical runs. I 've noticed
> this swapping behaviour on both terasort jobs and hive query jobs.
> 
> - Focusing on a single job config, Is there a rule of thumb about how
> much node memory should be left for use outside of Child JVMs?
> I make sure that per Node, there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
> JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation
> currently account 65%-75% of the node's memory. (I 've tried
> allocating a riskier 90% of the node's memory, with similar swapping
> observations).
> 
> - Could there be an issue with HDFS data or metadata taking up memory?
> I am not cleaning output or intermediate outputs from HDFS between
> runs. Is this possible?
> 
> - Do people use any specific java flags (particularly garbage
> collection flags) for production environments where one job runs (or
> possibly more jobs run simultaneously) ?
> 
> - What are the memory requirements for the jobtracker,namenode and
> tasktracker,datanode JVMs?
> 
> - I am setting io.sort.mb to about half of the JVM heap size (half of
> -Xmx in javaopts). Should this be set to a different ratio? (this
> setting doesn't sound like it should be causing swapping in the first
> place).
> 
> - The buffer cache is cleaned before each run (flush and echo 3 >
> /proc/sys/vm/drop_caches)
> 
> any empirical advice and suggestions  to solve this are appreciated.
> thanks,
> 
> - Vasilis


Re: swapping on hadoop

Posted by Ted Yu <yu...@gmail.com>.
For multiple independent Hadoop jobs, there is no easy way to have them
reuse JVMs.
The JobTracker (as of Hadoop 0.20.2) doesn't directly keep track of JVMs.
JvmManager creates a mapJvmManager that manages map-task JVMs and a reduceJvmManager
that manages reduce-task JVMs.

A quick search over
http://hadoop.apache.org/common/docs/current/capacity_scheduler.html doesn't
yield anything on JVM reuse either.
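
For what it's worth, within a single job the reuse knob is mapred.job.reuse.jvm.num.tasks (0.20-era name); a minimal sketch, assuming the job's driver uses ToolRunner so -D is honored (jar, class, and paths are placeholders):

# Reuse each task JVM for up to 10 tasks of the same job (-1 means unlimited).
# This does not share JVMs across different jobs.
hadoop jar my-job.jar MyJob \
  -D mapred.job.reuse.jvm.num.tasks=10 \
  input/ output/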

FYI

On Thu, Apr 1, 2010 at 8:38 AM, Vasilis Liaskovitis <vl...@gmail.com> wrote:

> All,
>
> thanks for your suggestions everyone, these are valuable.
> Some comments:
>
> On Wed, Mar 31, 2010 at 6:06 PM, Scott Carey <sc...@richrelevance.com>
> wrote:
> > On Linux, check out the 'swappiness' OS tunable -- you can turn this down
> from the default to reduce swapping at the expense of some system file
> cache.
> > However, you want a decent chunk of RAM left for the system to cache
> files -- if it is all allocated and used by Hadoop there will be extra I/O.
>
> I have set /proc/sys/vm/swappiness to 1, though I haven't tried 0.
>
> > For Java GC, if your -Xmx is above 600MB or so, try either changing
> -XX:NewRatio to a smaller number (default is 2 for Sun JDK 6) or setting the
> -XX:MaxNewSize parameter to around 150MB to 250MB.
> > An example of Hadoop memory use scaling as -Xmx grows:
> > Lets say you have a Hadoop job with a 100MB map side join, and 250MB of
> hadoop sort space.
>
> In this example, what hadoop config parameters do the above 2 buffers
> refer to? io.sort.mb=250, but which parameter does the "map side join"
> 100MB refer to? Are you referring to the split size of the input data
> handled by a single map task? Apart from that question, the example is
> clear to me and useful, thanks.
>
> > Now, perhaps due to some other jobs you want to set -Xmx1200M.  The above
> job will end up using about 150MB more now, because the new space has grown,
> although the footprint is the same.   A larger new space can improve
> performance, but with most typical hadoop jobs it won't.   Making sure it
> does not grow larger just because -Xmx is larger can help save a lot of
> memory.  Additionally, a job that would have failed with an OOME at
> -Xmx1200M might pass at -Xmx1000M if the young generation takes 150MB
> instead of 400MB of the space.
> >
> Indeed for my jobs I haven't noticed better performance going to
> XMx900 or 1000. I normally use -XMx700. I haven't tried the
> -XX:MaxNewSize or -XX:NewRatio but I will.
>
> > If you are using a 64 bit JRE, you can also save space with the
> -XX:+UseCompressedOops option -- sometimes quite a bit of space.
> >
> I am using this already, thanks.
>
> Quoting Allen: "Java takes more RAM than just the heap size.
> Sometimes 2-3x as much."
> Is there a clear indication that Java memory usage extends so far
> beyond its allocated heap? E.g. would java thread stacks really
> account for such a big increase 2x to 3x? Tasks seem to be heavily
> threaded. What are the relevant config options to control number of
> threads within a task?
>
> "My general rule of thumb for general purpose grids is to plan on having
> 3-4gb of free VM (swap+physical) space for the OS, monitoring, datanode,
> and
> task tracker processes.  After that, you can carve it up however you want."
>
> I now have 4-6GB of free space, when taking into account the full heap
> space of all child JVMs. Does that sound reasonable for all the other
> node needs (file caching, datanode, task tracker)? Having ~1G for
> tasktracker and ~1G for datanode leaves 2-4GB for file caching. Even
> with this setup on a terasort run of 64GB data across 7 nodes
> (separate node for namenode/jobtracker), I run low on memory, though
> there is no swapping for the majority of cases. I assume I am running
> low on memory mainly due to file/disk caching or thread stacks? Any
> other possible reasons?
>
> With this new setup, I don't normally get swapping for a single job
> e.g. terasort or hive job. However, the problem in general is
> exacerbated if one spawns multiple indepenendent hadoop jobs
> simultaneously. I 've noticed that JVMs are not re-used across jobs,
> in an earlier post:
> http://www.mail-archive.com/common-dev@hadoop.apache.org/msg01174.html
> This implies that Java memory usage would blow up when submitting
> multiple independent jobs. So this multiple job scenario sounds more
> susceptible to swapping
>
> A relevant question is: in production environments, do people run jobs
> in parallel? Or is it that the majority of jobs is a serial pipeline /
> cascade of jobs being run back to back?
>
> thanks,
>
> - Vasilis
>

Re: swapping on hadoop

Posted by Scott Carey <sc...@richrelevance.com>.
On Apr 1, 2010, at 5:04 PM, Vasilis Liaskovitis wrote:
>> 
> 
> ok. Now, considering a map side space buffer and a sort buffer, do
> both account for tenured space for both map and reduce JVMs? I 'd
> think the map side buffer gets used and tenured for map tasks and the
> sort space gets used and tenured for the reduce task during sort/merge
> phase. Would both spaces really be used in both kinds of tasks?
> 

It is my understanding that a JVM used for a map won't also be used for a reduce.  JVM reuse runs multiple maps or multiple reduces in one process, but not a mix of both.
The mapper does the majority of the sorting; the reducer mostly merges pre-sorted data.  Each kind of task tends to have a different memory footprint, depending on the job and data.

>> The maximum number of map and reduce tasks per node applies no matter how many jobs are running.
> 
> RIght. But depending on your job scheduler, isn't it possible that you
> may be swapping the different jobs' JVM space in and out of physical
> memory while scheduling all the parallel jobs? Especially if nodes
> don't have huge amounts of memory, this scenario sounds likely.
> 

To be more precise, the max number of map and reduce tasks corresponds to the maximum number of active JVMs of each type at the same time.  When a job finishes all of its tasks, the JVMs for it are killed.  A new job gets new JVMs.  Running concurrent jobs means that each job has some fraction of these JVM slots occupied.
So there should be no swapping of different jobs' JVMs in and out of RAM.  The same number of active JVMs exists for one large job as it does for 4 concurrent jobs.
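
One way to sanity-check the per-node budget is to multiply the slot counts by the child heap; a rough sketch (the slot properties are the 0.20-era names, and the numbers are placeholders to adjust for your own nodes):

# Per-node memory committed by Hadoop alone: task slots * child heap + daemon heaps.
MAP_SLOTS=6              # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS=2           # mapred.tasktracker.reduce.tasks.maximum
CHILD_HEAP_MB=700        # -Xmx in mapred.child.java.opts
DAEMON_RESERVE_MB=2048   # tasktracker + datanode heaps (assumed)
echo "Hadoop commits ~$(( (MAP_SLOTS + REDUCE_SLOTS) * CHILD_HEAP_MB + DAEMON_RESERVE_MB ))MB; the rest of physical RAM is OS + file cache."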

>> 
> 
> Back to a single job running and assuming all heap space being used,
> what percentage of a node's memory would you leave for other functions
> esp. disk cache? I currently only have 25% of memory (~4GB) for
> non-heapJVM data; I guess there should be a sweet-spot, probably
> dependent on the job I/O characteristics.
> 

It will depend on the job, its I/O, and the OS tuning.  But 25% to 33% of memory for the system file cache has worked for me (remember, the nodes aren't just for tasks, but also for HDFS).  A small amount of swap-out isn't bad, since the JVMs expire and never swap back in.
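
If you want to confirm that on a node, watching vmstat's si/so columns while a job runs is usually enough:

# Report memory and swap activity every 5 seconds.
# so > 0 with si staying near 0 matches the harmless "expired JVM pages swapped out" case;
# sustained si > 0 (pages swapped back in) is the pattern that actually hurts task latency.
vmstat 5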


> - Vasilis


Re: swapping on hadoop

Posted by Vasilis Liaskovitis <vl...@gmail.com>.
Hi,

On Thu, Apr 1, 2010 at 2:02 PM, Scott Carey <sc...@richrelevance.com> wrote:
>> In this example, what hadoop config parameters do the above 2 buffers
>> refer to? io.sort.mb=250, but which parameter does the "map side join"
>> 100MB refer to? Are you referring to the split size of the input data
>> handled by a single map task?
>
> "Map side join" in just an example of one of many possible use cases where a particular map implementation may hold on to some semi-permanent data for the whole task.
> It could be anything that takes 100MB of heap and holds the data across individual calls to map().
>

OK. Now, considering a map-side space buffer and a sort buffer, do
both occupy tenured space in both map and reduce JVMs? I'd think
the map-side buffer gets used and tenured in map tasks, and the sort
space gets used and tenured in the reduce task during the sort/merge
phase. Would both spaces really be used in both kinds of tasks?

>
> Java typically uses 5MB to 60MB for classloader data (statics, classes) and some space for threads, etc.  The default thread stack on most OS's is about 1MB, and the number of threads for a task process is on the order of a dozen.
> Getting 2-3x the space in a java process outside the heap would require either a huge thread count, a large native library loaded, or perhaps a non-java hadoop job using pipes.
> It would be rather obvious in 'top' if you sort by memory (shift-M on linux), or vmstat, etc.  To get the current size of the heap of a process, you can use jstat or 'kill -3' to create a stack dump and heap summary.
>
thanks, good to know.

>>
>> With this new setup, I don't normally get swapping for a single job
>> e.g. terasort or hive job. However, the problem in general is
>> exacerbated if one spawns multiple indepenendent hadoop jobs
>> simultaneously. I 've noticed that JVMs are not re-used across jobs,
>> in an earlier post:
>> http://www.mail-archive.com/common-dev@hadoop.apache.org/msg01174.html
>> This implies that Java memory usage would blow up when submitting
>> multiple independent jobs. So this multiple job scenario sounds more
>> susceptible to swapping
>>
> The maximum number of map and reduce tasks per node applies no matter how many jobs are running.

Right. But depending on your job scheduler, isn't it possible that you
may be swapping the different jobs' JVM space in and out of physical
memory while scheduling all the parallel jobs? Especially if nodes
don't have huge amounts of memory, this scenario sounds likely.

>
>> A relevant question is: in production environments, do people run jobs
>> in parallel? Or is it that the majority of jobs is a serial pipeline /
>> cascade of jobs being run back to back?
>>
> Jobs are absolutely run in parallel.  I recommend using the fair scheduler with no config parameters other than 'assignmultiple = true' as the 'baseline' scheduler, and adjust from there accordingly.  The Capacity Scheduler has more tuning knobs for dealing with memory constraints if jobs have drastically different memory needs.  The out-of-the-box FIFO scheduler tends to have a hard time keeping the cluster utilization high when there are multiple jobs to run.

thanks, I 'll try this.

Back to a single job running and assuming all heap space is being used,
what percentage of a node's memory would you leave for other functions,
especially disk cache? I currently only have 25% of memory (~4GB) for
non-JVM-heap data; I guess there should be a sweet spot, probably
dependent on the job's I/O characteristics.

- Vasilis

Re: swapping on hadoop

Posted by Scott Carey <sc...@richrelevance.com>.
On Apr 1, 2010, at 8:38 AM, Vasilis Liaskovitis wrote:

> 
> In this example, what hadoop config parameters do the above 2 buffers
> refer to? io.sort.mb=250, but which parameter does the "map side join"
> 100MB refer to? Are you referring to the split size of the input data
> handled by a single map task? Apart from that question, the example is
> clear to me and useful, thanks.
> 

"Map side join" in just an example of one of many possible use cases where a particular map implementation may hold on to some semi-permanent data for the whole task.
It could be anything that takes 100MB of heap and holds the data across individual calls to map().

> 
> Quoting Allen: "Java takes more RAM than just the heap size.
> Sometimes 2-3x as much."
> Is there a clear indication that Java memory usage extends so far
> beyond its allocated heap? E.g. would java thread stacks really
> account for such a big increase 2x to 3x? Tasks seem to be heavily
> threaded. What are the relevant config options to control number of
> threads within a task?
> 

Java typically uses 5MB to 60MB for classloader data (statics, classes) and some space for threads, etc.  The default thread stack on most OSes is about 1MB, and the number of threads for a task process is on the order of a dozen.
Getting 2-3x the heap's space in a Java process outside the heap would require either a huge thread count, a large native library loaded, or perhaps a non-Java Hadoop job using pipes.
It would be rather obvious in 'top' if you sort by memory (shift-M on Linux), or in vmstat, etc.  To get the current size of a process's heap, you can use jstat, or 'kill -3' to create a stack dump and heap summary.
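
Concretely, something like the following (replace <pid> with the Child task's process id from 'jps' or 'top'):

# Resident (RSS) and virtual (VSZ) sizes of the largest processes, biggest RSS first.
ps -eo pid,rss,vsz,args --sort=-rss | head -20

# Per-generation heap usage of a running task JVM, sampled every 5 seconds (sizes in KB).
jstat -gc <pid> 5000

# Thread dump plus heap summary, written to the task's stdout log.
kill -3 <pid>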

> 
> With this new setup, I don't normally get swapping for a single job
> e.g. terasort or hive job. However, the problem in general is
> exacerbated if one spawns multiple indepenendent hadoop jobs
> simultaneously. I 've noticed that JVMs are not re-used across jobs,
> in an earlier post:
> http://www.mail-archive.com/common-dev@hadoop.apache.org/msg01174.html
> This implies that Java memory usage would blow up when submitting
> multiple independent jobs. So this multiple job scenario sounds more
> susceptible to swapping
> 
The maximum number of map and reduce tasks per node applies no matter how many jobs are running.


> A relevant question is: in production environments, do people run jobs
> in parallel? Or is it that the majority of jobs is a serial pipeline /
> cascade of jobs being run back to back?
> 
Jobs are absolutely run in parallel.  I recommend using the fair scheduler with no config parameters other than 'assignmultiple = true' as the 'baseline' scheduler, and adjusting from there.  The Capacity Scheduler has more tuning knobs for dealing with memory constraints if jobs have drastically different memory needs.  The out-of-the-box FIFO scheduler tends to have a hard time keeping cluster utilization high when there are multiple jobs to run.
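
A sketch of that baseline setup, using the 0.20-era property names (these belong in mapred-site.xml on the JobTracker, with the fair scheduler jar from contrib/ on the JobTracker classpath; the scratch file below is only there to show the shape of the config):

# Merge these two properties into mapred-site.xml on the JobTracker, then restart it.
cat > /tmp/fairscheduler-properties.xml <<'EOF'
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
EOF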

> thanks,
> 
> - Vasilis


Re: swapping on hadoop

Posted by Vasilis Liaskovitis <vl...@gmail.com>.
All,

thanks for your suggestions everyone, these are valuable.
Some comments:

On Wed, Mar 31, 2010 at 6:06 PM, Scott Carey <sc...@richrelevance.com> wrote:
> On Linux, check out the 'swappiness' OS tunable -- you can turn this down from the default to reduce swapping at the expense of some system file cache.
> However, you want a decent chunk of RAM left for the system to cache files -- if it is all allocated and used by Hadoop there will be extra I/O.

I have set /proc/sys/vm/swappiness to 1, though I haven't tried 0.

> For Java GC, if your -Xmx is above 600MB or so, try either changing -XX:NewRatio to a smaller number (default is 2 for Sun JDK 6) or setting the -XX:MaxNewSize parameter to around 150MB to 250MB.
> An example of Hadoop memory use scaling as -Xmx grows:
> Lets say you have a Hadoop job with a 100MB map side join, and 250MB of hadoop sort space.

In this example, what hadoop config parameters do the above 2 buffers
refer to? io.sort.mb=250, but which parameter does the "map side join"
100MB refer to? Are you referring to the split size of the input data
handled by a single map task? Apart from that question, the example is
clear to me and useful, thanks.

> Now, perhaps due to some other jobs you want to set -Xmx1200M.  The above job will end up using about 150MB more now, because the new space has grown, although the footprint is the same.   A larger new space can improve performance, but with most typical hadoop jobs it won't.   Making sure it does not grow larger just because -Xmx is larger can help save a lot of memory.  Additionally, a job that would have failed with an OOME at -Xmx1200M might pass at -Xmx1000M if the young generation takes 150MB instead of 400MB of the space.
>
Indeed for my jobs I haven't noticed better performance going to
-Xmx900 or 1000. I normally use -Xmx700. I haven't tried
-XX:MaxNewSize or -XX:NewRatio but I will.

> If you are using a 64 bit JRE, you can also save space with the -XX:+UseCompressedOops option -- sometimes quite a bit of space.
>
I am using this already, thanks.

Quoting Allen: "Java takes more RAM than just the heap size.
Sometimes 2-3x as much."
Is there a clear indication that Java memory usage extends so far
beyond its allocated heap? E.g. would java thread stacks really
account for such a big increase 2x to 3x? Tasks seem to be heavily
threaded. What are the relevant config options to control number of
threads within a task?

"My general rule of thumb for general purpose grids is to plan on having
3-4gb of free VM (swap+physical) space for the OS, monitoring, datanode, and
task tracker processes.  After that, you can carve it up however you want."

I now have 4-6GB of free space, when taking into account the full heap
space of all child JVMs. Does that sound reasonable for all the other
node needs (file caching, datanode, task tracker)? Having ~1G for
tasktracker and ~1G for datanode leaves 2-4GB for file caching. Even
with this setup on a terasort run of 64GB data across 7 nodes
(separate node for namenode/jobtracker), I run low on memory, though
there is no swapping for the majority of cases. I assume I am running
low on memory mainly due to file/disk caching or thread stacks? Any
other possible reasons?

With this new setup, I don't normally get swapping for a single job
e.g. terasort or hive job. However, the problem in general is
exacerbated if one spawns multiple independent Hadoop jobs
simultaneously. I 've noticed that JVMs are not re-used across jobs,
in an earlier post:
http://www.mail-archive.com/common-dev@hadoop.apache.org/msg01174.html
This implies that Java memory usage would blow up when submitting
multiple independent jobs. So this multiple job scenario sounds more
susceptible to swapping.

A relevant question is: in production environments, do people run jobs
in parallel? Or are most workloads a serial pipeline / cascade of jobs
run back to back?

thanks,

- Vasilis