Posted to user@spark.apache.org by Manoj Samel <ma...@gmail.com> on 2014/01/19 07:34:22 UTC

Do RDD actions run only on the driver?

Are RDD actions such as count run only on the driver node, or can they be
parallelized?

Thanks,

Re: Do RDD actions run only on the driver?

Posted by Tathagata Das <ta...@gmail.com>.
Yes. However, those jobs will share the available cores on the N worker
nodes. Depending on each job's resource requirements, a job may run slower
than it would if it were not sharing the cluster with other jobs. To ensure
a fair share of resources among the concurrent jobs, you can turn on the
fair scheduler. Please see
http://spark.incubator.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
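
As a rough sketch of what turning on the fair scheduler involves: in the
Spark versions I know, it is controlled by the `spark.scheduler.mode`
property, and `spark.scheduler.allocation.file` can optionally point at a
pool definition file. The helper name `fair_scheduler_settings` and the
example file name are made up for illustration; the job-scheduling page
linked above is the authority for your version.

```python
def fair_scheduler_settings(pool_file=None):
    """Build the Spark properties that switch on fair scheduling.

    `pool_file` is an optional path to a pool definition file
    (hypothetical example: "fairscheduler.xml").
    """
    # FAIR replaces the default FIFO ordering of jobs within one application.
    settings = {"spark.scheduler.mode": "FAIR"}
    if pool_file is not None:
        # Only set the allocation file when the caller supplies one.
        settings["spark.scheduler.allocation.file"] = pool_file
    return settings


if __name__ == "__main__":
    for key, value in fair_scheduler_settings("fairscheduler.xml").items():
        print(key, value)
```

With PySpark, these pairs could be passed to `SparkConf().setAll(...)`
before the `SparkContext` is created.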

TD


On Sun, Jan 19, 2014 at 10:01 AM, Manoj Samel <ma...@gmail.com> wrote:

> So each action (called on the driver node) creates a job that can still be
> executed by 1 to N worker nodes?

Re: Do RDD actions run only on the driver?

Posted by Manoj Samel <ma...@gmail.com>.
So each action (called on the driver node) creates a job that can still be
executed by 1 to N worker nodes?


On Sat, Jan 18, 2014 at 10:56 PM, Tathagata Das <tathagata.das1565@gmail.com> wrote:

> Yes, RDD actions can be called only from the driver program, and therefore
> only on the driver node. However, they can be parallelized within the
> driver program by calling multiple actions from multiple threads. The jobs
> corresponding to those actions will run simultaneously on the Spark
> cluster, sharing the available resources.
>
> TD

Re: Do RDD actions run only on the driver?

Posted by Tathagata Das <ta...@gmail.com>.
Yes, RDD actions can be called only from the driver program, and therefore
only on the driver node. However, they can be parallelized within the
driver program by calling multiple actions from multiple threads. The jobs
corresponding to those actions will run simultaneously on the Spark
cluster, sharing the available resources.
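
A minimal sketch of the multi-threaded pattern, assuming each action is
wrapped as a zero-argument callable. The helper name
`run_actions_concurrently` and its parameters are made up for illustration,
and the demo uses plain-Python stand-ins rather than real RDDs so that it
runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def run_actions_concurrently(actions, max_threads=4):
    """Call each action from its own driver-side thread.

    `actions` maps a label to a zero-argument callable (for an RDD this
    could be a bound method such as `rdd.count`). Returns label -> result.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Submit everything first so the calls can overlap.
        futures = {label: pool.submit(action) for label, action in actions.items()}
        # result() blocks until that call finishes and re-raises any error.
        return {label: future.result() for label, future in futures.items()}


if __name__ == "__main__":
    # Stand-ins for RDD actions (assumption: no Spark installation here).
    data = list(range(1000))
    results = run_actions_concurrently({
        "count": lambda: len(data),
        "sum": lambda: sum(data),
    })
    print(results)
```

With PySpark, the callables could be something like `rdd1.count` and
`rdd2.count`; each call would then submit its own job from its own thread.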

TD



On Sat, Jan 18, 2014 at 10:34 PM, Manoj Samel <ma...@gmail.com> wrote:

> Are RDD actions such as count run only on the driver node, or can they be
> parallelized?
>
> Thanks,
>