Posted to common-dev@hadoop.apache.org by Vasilis Liaskovitis <vl...@gmail.com> on 2010/02/20 01:11:02 UTC

reuse JVMs across multiple jobs

Hi,

Is it possible (and does it make sense) to reuse JVMs across jobs?

The mapred.job.reuse.jvm.num.tasks config option is a job-specific
parameter, as its name implies. When running multiple independent jobs
simultaneously with mapred.job.reuse.jvm.num.tasks=-1 (i.e. a JVM may
be reused for an unlimited number of tasks), I see many distinct Java
PIDs over the duration of the job runs (far more than
mapred.tasktracker.map.tasks.maximum +
mapred.tasktracker.reduce.tasks.maximum), instead of the same Java
processes persisting. The number of live JVMs on a given
node/tasktracker at any instant never exceeds that sum, as expected,
but idle JVMs are torn down and new ones spawned quite often.

For example, here are the numbers of distinct Java PIDs observed when
submitting 1, 2, 4, and 32 copies of the same job in parallel:

  jobs    distinct PIDs
     1               28
     2               39
     4              106
    32              740
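
For concreteness, counts like these can be gathered by polling jps on a
tasktracker node while the jobs run and accumulating the distinct PIDs;
a minimal sketch (PidCounter is just an illustrative name, not a Hadoop
class):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.util.HashSet;
  import java.util.Set;

  public class PidCounter {
    public static void main(String[] args) throws Exception {
      Set<String> seen = new HashSet<String>();
      while (true) {
        // `jps -q` (a standard JDK tool) prints one JVM PID per line.
        Process p = new ProcessBuilder("jps", "-q").start();
        BufferedReader r = new BufferedReader(
            new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) {
          seen.add(line.trim());
        }
        p.waitFor();
        System.out.println("distinct JVM PIDs so far: " + seen.size());
        Thread.sleep(1000);
      }
    }
  }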

The relevant killing and spawning logic appears to be in
src/mapred/org/apache/hadoop/mapred/JvmManager.java, particularly the
reapJvm() method, but I haven't dug deeper. I am wondering whether it
would be possible, and worthwhile from a performance standpoint, to
reuse JVMs across jobs, i.e. to have a common JVM pool for all Hadoop
jobs.
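
My rough reading of the current policy, sketched below in much-simplified
form (JvmPoolSketch and everything in it are invented names for
illustration, not the real JvmManager types), is that an idle JVM is
reused only by the job that spawned it; when the pool is full, an idle
JVM belonging to another job gets killed so a fresh one can be spawned:

  import java.util.HashMap;
  import java.util.Map;

  class JvmPoolSketch {
    static class JvmRecord {
      String jobId;
      boolean busy;
    }

    private final Map<Integer, JvmRecord> jvms =
        new HashMap<Integer, JvmRecord>();
    private final int maxJvms;
    private int nextPid = 1000; // placeholder PIDs for the sketch

    JvmPoolSketch(int maxJvms) { this.maxJvms = maxJvms; }

    // Called when a new task of jobId needs a JVM slot.
    synchronized Integer assignJvm(String jobId) {
      // 1. Reuse an idle JVM only if it belongs to the SAME job;
      //    this is why JVMs never survive across jobs today.
      for (Map.Entry<Integer, JvmRecord> e : jvms.entrySet()) {
        JvmRecord r = e.getValue();
        if (!r.busy && r.jobId.equals(jobId)) {
          r.busy = true;
          return e.getKey();
        }
      }
      // 2. Room left in the pool: spawn a fresh JVM for this job.
      if (jvms.size() < maxJvms) {
        return spawnJvm(jobId);
      }
      // 3. Pool full: kill an idle JVM of a DIFFERENT job and respawn,
      //    which would explain the churn seen with concurrent jobs.
      Integer idle = null;
      for (Map.Entry<Integer, JvmRecord> e : jvms.entrySet()) {
        if (!e.getValue().busy) { idle = e.getKey(); break; }
      }
      if (idle != null) {
        jvms.remove(idle);
        return spawnJvm(jobId);
      }
      return null; // everything busy; caller must wait
    }

    private Integer spawnJvm(String jobId) {
      JvmRecord r = new JvmRecord();
      r.jobId = jobId;
      r.busy = true;
      Integer pid = nextPid++;
      jvms.put(pid, r);
      return pid;
    }
  }

If that reading is right, cross-job reuse would mainly mean dropping the
same-job check in step 1; presumably the per-job classpath and
distributed-cache setup baked into each child JVM is what makes that
non-trivial.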
thanks,

- Vasilis