Posted to common-user@hadoop.apache.org by Rahul Bhattacharjee <ra...@gmail.com> on 2013/04/16 10:34:47 UTC

VM reuse!

Hi,

I have a question about JVM reuse in Hadoop. I understand the purpose of
JVM reuse, but I am wondering how useful it is in practice.

For example, for JVM reuse to kick in, more than one map task of the same
job has to be assigned to a single node. Hadoop tries to place map tasks on
the nodes that actually hold the data, so it might only rarely happen that
multiple map tasks are allocated to a single TaskTracker. And even if a
single node does get to run multiple map tasks, they might as well run in
parallel in multiple JVMs rather than sequentially in a single JVM.

I am sure I am missing some link here; please help me find it.

Thanks,
Rahul

Re: VM reuse!

Posted by be...@gmail.com.
Hi Rahul

AFAIK there is no guarantee that one task would be on N1 and another on N2. Both can be on N1 as well.

The JT (JobTracker) has no notion of JVM reuse. It doesn't consider it for task scheduling.


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: Rahul Bhattacharjee <ra...@gmail.com>
Date: Tue, 16 Apr 2013 21:13:54 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: VM reuse!

Agreed.

I am not sure about the behaviour of the JT. Consider this situation.

N1 has split 1 and split 2 of a file and has two map slots. N2 has split 2
and one map slot. I think the JT would probably schedule one map task on N1
and another on N2, for better parallel I/O, rather than scheduling two map
tasks on N1 and none on N2.

I do not think the JT considers whether JVM reuse is enabled. However, it
could take this into account along with data locality. When a job writer
asks for JVM reuse, it would not be entirely wrong to assume that something
in the job takes a long time to initialize, which makes reuse worthwhile.
In those scenarios, the JT might consider allocating multiple map tasks to
a single node.

Thinking aloud, the situation that you have mentioned is quite possible in
an unbalanced cluster, where the data is concentrated on a small set of the
cluster's nodes.

Thanks,
Rahul

Re: VM reuse!

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Agreed.

I am not sure about the behaviour of the JT. Consider this situation.

N1 has split 1 and split 2 of a file and has two map slots. N2 has split 2
and one map slot. I think the JT would probably schedule one map task on N1
and another on N2, for better parallel I/O, rather than scheduling two map
tasks on N1 and none on N2.

I do not think the JT considers whether JVM reuse is enabled. However, it
could take this into account along with data locality. When a job writer
asks for JVM reuse, it would not be entirely wrong to assume that something
in the job takes a long time to initialize, which makes reuse worthwhile.
In those scenarios, the JT might consider allocating multiple map tasks to
a single node.

Thinking aloud, the situation that you have mentioned is quite possible in
an unbalanced cluster, where the data is concentrated on a small set of the
cluster's nodes.

Thanks,
Rahul




On Tue, Apr 16, 2013 at 8:14 PM, Bejoy Ks <be...@gmail.com> wrote:

> When you process larger data volumes, this is mostly the case. :)
>
>
> Say you have a job with a smaller input size and two of its blocks are on a
> single node. The JT may then schedule two tasks on the same TT if it has
> free slots, so those tasks can take advantage of JVM reuse.
>
> Which TT the JT assigns a task to depends entirely on data locality and
> the availability of task slots.
>
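
The N1/N2 situation in this message can be played through with a toy model.
It is only a sketch under loud assumptions: TaskTrackers ask for work one at
a time (heartbeat order), and the scheduler hands each one pending data-local
splits until its free slots run out. The function assign and the data layout
are hypothetical and are not the JobTracker's actual code.

```python
def assign(heartbeat_order, slots, split_locations):
    """Toy model: nodes ask for work in heartbeat_order; each node takes
    pending splits that are local to it until its free slots run out."""
    pending = list(split_locations)  # splits not yet assigned, in order
    placement = {}                   # split name -> node name
    for node in heartbeat_order:
        free = slots[node]
        for split in list(pending):  # iterate over a copy while removing
            if free == 0:
                break
            if node in split_locations[split]:
                placement[split] = node
                pending.remove(split)
                free -= 1
    return placement


# Scenario from the message: N1 holds split 1 and split 2 and has two map
# slots; N2 holds split 2 and has one map slot.
slots = {"N1": 2, "N2": 1}
split_locations = {"split1": {"N1"}, "split2": {"N1", "N2"}}

n1_first = assign(["N1", "N2"], slots, split_locations)
n2_first = assign(["N2", "N1"], slots, split_locations)

print(n1_first)  # both splits land on N1
print(n2_first)  # split2 lands on N2, split1 on N1
```

In this model the outcome depends only on which node asks first, which
matches the reply that there is no guarantee either way.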

Re: VM reuse!

Posted by Bejoy Ks <be...@gmail.com>.
When you process larger data volumes, this is mostly the case. :)


Say you have a job with a smaller input size and two of its blocks are on a
single node. The JT may then schedule two tasks on the same TT (TaskTracker)
if it has free slots, so those tasks can take advantage of JVM reuse.

Which TT the JT assigns a task to depends entirely on data locality and
the availability of task slots.


On Tue, Apr 16, 2013 at 5:03 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Ok, thanks Bejoy.
>
> So it is possible only in certain scenarios, like the one you have
> mentioned: many more map tasks than map slots.
>
> Regards,
> Rahul
>
>
> On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks <be...@gmail.com> wrote:
>
>> Hi Rahul
>>
>> If you look at a larger cluster and jobs that involve larger input data
>> sets, the data would be spread across the whole cluster, and a single node
>> might hold several blocks of that data set. Imagine you have a cluster
>> with 100 map slots and your job has 500 map tasks; in that case multiple
>> map tasks have to run on a single TaskTracker, based on slot availability.
>>
>> If you enable JVM reuse here, all tasks of a job on a single TaskTracker
>> would use the same JVM. The benefit is just the time you save in spawning
>> and cleaning up a JVM for each individual task.
>>
>>
>>
>>
>> On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a question about JVM reuse in Hadoop. I understand the purpose of
>>> JVM reuse, but I am wondering how useful it is in practice.
>>>
>>> For example, for JVM reuse to kick in, more than one map task of the same
>>> job has to be assigned to a single node. Hadoop tries to place map tasks on
>>> the nodes that actually hold the data, so it might only rarely happen that
>>> multiple map tasks are allocated to a single TaskTracker. And even if a
>>> single node does get to run multiple map tasks, they might as well run in
>>> parallel in multiple JVMs rather than sequentially in a single JVM.
>>>
>>> I am sure I am missing some link here; please help me find it.
>>>
>>> Thanks,
>>> Rahul
>>>
>>
>>
>

Re: VM reuse!

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Ok, thanks Bejoy.

So it is possible only in some typical scenarios, like the one you
mentioned: many more mappers than mapper slots.

Regards,
Rahul


On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks <be...@gmail.com> wrote:

> Hi Rahul
>
> If you look at larger cluster and jobs that involve larger input data
> sets. The data would be spread across the whole cluster, and a single node
> might have  various blocks of that entire data set. Imagine you have a
> cluster with 100 map slots and your job has 500 map tasks, now in that case
> there should be multiple map tasks in a single task tracker based on slot
> availability.
>
> Here if you enable jvm reuse, all tasks related to a job on a single
> TaskTracker would use the same jvm. The benefit here is just the time you
> are saving in spawning and cleaning up jvm for individual tasks.
>
>
>
>
> On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question related to VM reuse in Hadoop.I now understand the
>> purpose of VM reuse , but I am wondering how is it useful.
>>
>> Example. for VM reuse to be effective or kicked in , we need more than
>> one mapper task to be submitted to a single node (for the same job).Hadoop
>> would consider spawning mappers into nodes which actually contains the data
>> , it might rarely happen that multiple mappers are allocated to a single
>> task tracker. And even if a single task nodes gets to run multiple mappers
>> then it might as well run in parallel in multiple VM rather than
>> sequentially in a single VM.
>>
>> I am sure I am missing some link here , please help me find that.
>>
>> Thanks,
>> Rahul
>>
>
>

Re: VM reuse!

Posted by Bejoy Ks <be...@gmail.com>.
Hi Rahul

Consider a larger cluster and jobs that involve larger input data sets. The
data would be spread across the whole cluster, and a single node might hold
several blocks of that data set. Imagine you have a cluster with 100 map
slots and your job has 500 map tasks. In that case multiple map tasks of the
job will run on a single TaskTracker, depending on slot availability.

Here, if you enable JVM reuse, tasks of the same job that run one after
another on a TaskTracker can share a JVM instead of each getting a fresh one.
The benefit is just the time you save in spawning and cleaning up a JVM for
each individual task.
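
As a rough back-of-the-envelope check of the 100 slots / 500 tasks example,
the Python sketch below counts JVM launches with and without reuse. The
function name is made up for this illustration, and it assumes tasks spread
evenly over the slots and that each slot runs its tasks one after another.
The tasks_per_jvm argument mirrors the Hadoop 1.x job setting
mapred.job.reuse.jvm.num.tasks (1 = no reuse, which is the default, and
-1 = no limit).

```python
import math


def jvm_launches(num_tasks, num_slots, tasks_per_jvm):
    """Estimate how many JVMs are spawned for one job's map tasks.

    tasks_per_jvm: 1 means no reuse, -1 means unlimited reuse,
    n > 1 means a JVM is retired after running n tasks.
    Assumption: tasks are spread evenly over the slots.
    """
    if tasks_per_jvm == 0 or tasks_per_jvm < -1:
        raise ValueError("tasks_per_jvm must be -1 or a positive integer")
    base, extra = divmod(num_tasks, num_slots)
    # 'extra' slots run one more task than the others.
    tasks_in_slot = [base + 1] * extra + [base] * (num_slots - extra)
    launches = 0
    for n in tasks_in_slot:
        if n == 0:
            continue  # an idle slot never starts a JVM
        if tasks_per_jvm == -1:
            launches += 1  # one JVM serves every task in this slot
        else:
            launches += math.ceil(n / tasks_per_jvm)
    return launches


no_reuse = jvm_launches(500, 100, 1)    # every task gets a fresh JVM
unlimited = jvm_launches(500, 100, -1)  # one JVM per busy slot
capped = jvm_launches(500, 100, 3)      # a JVM is replaced after 3 tasks
```

Under these assumptions the job spawns 500 JVMs without reuse and 100 with
unlimited reuse, so the saving is the startup and teardown cost of roughly
400 JVMs. That matters most when the individual tasks are short.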




On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Hi,
>
> I have a question related to VM reuse in Hadoop.I now understand the
> purpose of VM reuse , but I am wondering how is it useful.
>
> Example. for VM reuse to be effective or kicked in , we need more than one
> mapper task to be submitted to a single node (for the same job).Hadoop
> would consider spawning mappers into nodes which actually contains the data
> , it might rarely happen that multiple mappers are allocated to a single
> task tracker. And even if a single task nodes gets to run multiple mappers
> then it might as well run in parallel in multiple VM rather than
> sequentially in a single VM.
>
> I am sure I am missing some link here , please help me find that.
>
> Thanks,
> Rahul
>
