Posted to user@spark.apache.org by Archit Thakur <ar...@gmail.com> on 2014/01/04 20:36:24 UTC

Will JVM be reused?

A question about JVM reuse.
Let's say I have a job with 5 stages. Each stage has 10 tasks (10 partitions),
and each task applies 3 transformations.
My cluster has 4 machines (1 master, 3 workers). How many JVMs will be launched?

1 master daemon + 3 worker daemons, plus:
JVMs = 1+3+10*3*5 (10 tasks run in parallel across the 3 machines, but the
transformations run sequentially, launching a new JVM for every
transformation in every stage)
OR
1+3+5*10 (10 tasks run in parallel across the 3 machines, but each stage
runs in a different set of JVMs)
OR
1+3+5*3 (a JVM is reused for different partitions on a single machine, but
each stage runs in a different set of JVMs)
OR
1+3+3 (one JVM per worker in any case)
OR
none of the above

Thx,
Archit_Thakur.

Re: Will JVM be reused?

Posted by Archit Thakur <ar...@gmail.com>.
Oh, you meant the main driver. Yes, correct.


On Sun, Jan 5, 2014 at 1:36 AM, Archit Thakur <ar...@gmail.com> wrote:

> Yeah, I believed that too.
>
> "The last being the jvm in which your driver runs"??? Isn't that one of
> the 3 worker daemons we have already considered?

Re: Will JVM be reused?

Posted by Archit Thakur <ar...@gmail.com>.
I am actually facing a more general problem, which seems to be related to
how many JVMs get launched.
In my map task I read a file and fill a map from it.
The data is static, and the map function is called for every record of the
RDD, but I want to read the file only once. So I made the map a static field
(in Java), so that at least within a single JVM I do not have to do the I/O
more than once. But making it static gives me an NPE, and sometimes an
exception from somewhere deep inside (it seems Spark serializes things here
and is not able to load static members). Not making it static runs
successfully.

I know I could read the file on the master and then broadcast it, but there
is a reason I want to do it this way.
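
For illustration, here is a minimal sketch of the lazy, per-JVM
initialization I have in mind. The LookupCache class name, the path
handling, and the tab-separated key/value format are all made up for this
example, and it assumes the file is readable at the same path on every
worker. A static field filled on the driver is not shipped with the
serialized closure, so it is null when the deserialized task runs on a
worker; initializing it on first use inside the worker JVM avoids that
while still doing only one read per JVM:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public final class LookupCache {
        private static volatile Map<String, String> lookup;

        private LookupCache() {}

        // Returns the lookup map, loading it the first time any task in
        // this JVM asks for it. Because the field is initialized here, on
        // the worker, rather than in a static initializer that only ran on
        // the driver, the deserialized task never sees a null field.
        public static Map<String, String> get(String path) throws IOException {
            Map<String, String> local = lookup;
            if (local == null) {
                synchronized (LookupCache.class) {
                    if (lookup == null) {
                        lookup = load(path);   // one file read per JVM
                    }
                    local = lookup;
                }
            }
            return local;
        }

        private static Map<String, String> load(String path) throws IOException {
            Map<String, String> m = new HashMap<String, String>();
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] kv = line.split("\t", 2);  // assumed tab-separated
                    if (kv.length == 2) {
                        m.put(kv[0], kv[1]);
                    }
                }
            } finally {
                in.close();
            }
            return m;
        }
    }

The map task would then call LookupCache.get(path) instead of touching a
field that was initialized on the driver.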

Re: Will JVM be reused?

Posted by Archit Thakur <ar...@gmail.com>.
Ya, ya, I had got that. Thx.


On Sun, Jan 5, 2014 at 1:41 AM, Roshan Nair <ro...@indix.com> wrote:

> The driver JVM is the JVM in which you create the SparkContext and launch
> your job. It's different from the master and worker daemons.
>
> Roshan

Re: Will JVM be reused?

Posted by Roshan Nair <ro...@indix.com>.
The driver JVM is the JVM in which you create the SparkContext and launch
your job. It's different from the master and worker daemons.

Roshan
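
For illustration, a minimal driver program looks something like the sketch
below (the master URL and app name are placeholders, and the Java API is
used only as an example; the Scala API is analogous). Everything in main()
runs in the driver JVM:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    import java.util.Arrays;

    public class DriverExample {
        public static void main(String[] args) {
            // Everything in main() runs in the driver JVM -- a separate
            // process from the master daemon and the worker daemons.
            SparkConf conf = new SparkConf()
                    .setMaster("spark://master-host:7077") // placeholder URL
                    .setAppName("DriverJvmExample");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // The job launched by count() is scheduled by the driver onto
            // the JVMs running on the workers.
            long n = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
            System.out.println("count = " + n);

            sc.stop();
        }
    }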

Re: Will JVM be reused?

Posted by Archit Thakur <ar...@gmail.com>.
Yeah, I believed that too.

"The last being the jvm in which your driver runs"??? Isn't that one of the
3 worker daemons we have already considered?


On Sun, Jan 5, 2014 at 1:28 AM, Roshan Nair <ro...@indix.com> wrote:

> I missed this. It's actually 1+3+3+1, the last being the JVM in which
> your driver runs.
>
> Roshan

Re: Will JVM be reused?

Posted by Roshan Nair <ro...@indix.com>.
I missed this. It's actually 1+3+3+1, the last being the JVM in which your
driver runs.

Roshan

Re: Will JVM be reused?

Posted by Roshan Nair <ro...@indix.com>.
Hi Archit,

I believe it's the last case: 1+3+3.

From what I've seen, it's one JVM per worker per Spark application.

You will have multiple threads within a worker JVM working on different
partitions concurrently. The number of partitions that a worker handles
concurrently appears to be determined by the number of cores you've set the
worker (or the app) to use.

If the JVM weren't reused across stages, you'd have to save each RDD to
disk and reload it into memory between stages, which is why Spark doesn't
do that.

Roshan
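
For a concrete (if approximate) sketch of what that means for Archit's
numbers, assuming the standalone mode this thread seems to be about -- the
master URL is a placeholder, and "spark.cores.max" simply caps the total
cores the application may use across the cluster:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class JvmCountExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setMaster("spark://master-host:7077") // placeholder URL
                    .setAppName("JvmCountExample")
                    // Cap the app at 6 cores total, i.e. roughly 2
                    // concurrent task threads inside each of the 3
                    // worker-side JVMs.
                    .set("spark.cores.max", "6");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // All 5 stages x 10 tasks run as threads inside the same 3
            // long-lived worker-side JVMs; no new JVM is launched per task,
            // per transformation, or per stage.
            // ... build and run the job here ...

            sc.stop();
        }
    }

With 3 workers you'd still have exactly 3 worker-side JVMs for the app;
raising or lowering spark.cores.max changes only how many task threads run
inside each of them.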