Posted to user@spark.apache.org by Swapnil Shinde <sw...@gmail.com> on 2015/08/27 18:30:12 UTC

Spark driver locality

Hello
I am new to the Spark world and recently started exploring it in
standalone mode. It would be great if I could get clarification on the
doubts below:

1. Driver locality - It is mentioned in the documentation that "client"
deploy mode is not a good choice if the machine running "spark-submit" is
not co-located with the worker machines, and cluster mode is not available
with standalone clusters. Do we therefore have to submit all applications
on the master machine? (Assuming we don't have a separate co-located
gateway machine.)

2. How does the driver locality described above work with a Spark shell
running on a local machine?

3. I am a little confused about the role of the driver program. Does the
driver do any computation during the Spark app life cycle? For instance,
in a simple row-count app, the worker nodes calculate local row counts.
Does the driver sum up the local row counts? In short, where does the
reduce phase run in this case?

4. When accessing HDFS data over the network, do worker nodes read the
data in parallel? How is HDFS data accessed over the network in a Spark
application?

Sorry if these questions have already been discussed.

Thanks
Swapnil

Re: Spark driver locality

Posted by Swapnil Shinde <sw...@gmail.com>.
Thanks..

Re: Spark driver locality

Posted by Rishitesh Mishra <ri...@gmail.com>.
Hi Swapnil,

1. All task scheduling and retries happen from the driver, so you are right
that a lot of communication happens between the driver and the cluster. It
all depends on how you want to set up your Spark application: whether the
application has direct access to the Spark cluster or is routed through a
gateway machine. You can decide accordingly.

4. I am not familiar with the NFS layer's concurrency, but I think parallel
reads should be fine. Someone with knowledge of how NFS works should
correct me if I am wrong.


Re: Spark driver locality

Posted by Swapnil Shinde <sw...@gmail.com>.
Thanks, Rishitesh!
1. I get that the driver doesn't need to be on the master, but there is a
lot of communication between the driver and the cluster; that's why a
co-located gateway was recommended. How big is the impact of the driver
not being co-located with the cluster?

4. How does an HDFS split get assigned to a worker node when reading data
from a remote Hadoop cluster? I am more interested in knowing how the MapR
NFS layer is accessed in parallel.

-
Swapnil


Re: Spark driver locality

Posted by Rishitesh Mishra <ri...@gmail.com>.
Hi Swapnil,
Let me try to answer some of the questions. Answers inline. Hope it helps.

On Thursday, August 27, 2015, Swapnil Shinde <sw...@gmail.com>
wrote:

> Hello
> I am new to the Spark world and recently started exploring it in
> standalone mode. It would be great if I could get clarification on the
> doubts below:
>
> 1. Driver locality - It is mentioned in the documentation that "client"
> deploy mode is not a good choice if the machine running "spark-submit" is
> not co-located with the worker machines, and cluster mode is not available
> with standalone clusters. Do we therefore have to submit all applications
> on the master machine? (Assuming we don't have a separate co-located
> gateway machine.)
>

No. In standalone mode, too, your master and driver machines can be
different. The driver should have access to the master as well as the
worker machines.
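As a minimal sketch (the master hostname below is hypothetical), a driver
started on any gateway machine or laptop simply points at the standalone
master; the only requirement is network reachability between the driver,
the master, and the workers in both directions:

    // Minimal sketch, assuming a standalone master at
    // spark://master-host:7077 (hypothetical hostname). This JVM is the
    // driver; it can run on any machine that can reach the master and
    // the workers over the network.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("driver-locality-demo")
      .setMaster("spark://master-host:7077")
    val sc = new SparkContext(conf)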


> 2. How does the driver locality described above work with a Spark shell
> running on a local machine?
>

The Spark shell itself acts as the driver. This means your local machine
should have access to all the cluster machines.

>
> 3. I am a little confused about the role of the driver program. Does the
> driver do any computation during the Spark app life cycle? For instance,
> in a simple row-count app, the worker nodes calculate local row counts.
> Does the driver sum up the local row counts? In short, where does the
> reduce phase run in this case?
>

The role of the driver is to coordinate with the cluster manager for
initial resource allocation. After that, it schedules tasks on the
different executors assigned to it. It does not do any computation (unless
the application itself does something on its own). The reduce phase is
also a bunch of tasks, which get assigned to one or more executors.
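For the row-count example specifically, a sketch of where each piece runs
(the input path is hypothetical, and sc is the usual SparkContext):

    // count() launches one task per partition. Each executor counts the
    // rows of its own partitions; the driver only sums the per-partition
    // counts that come back, which is negligible work.
    val lines = sc.textFile("hdfs:///data/input")  // hypothetical path
    val total = lines.count()                      // tasks run on executors
    println(s"total rows: $total")                 // runs on the driver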

>
> 4. When accessing HDFS data over the network, do worker nodes read the
> data in parallel? How is HDFS data accessed over the network in a Spark
> application?
>


Yes. Each worker gets a split to read, and they read their own splits in
parallel. This means all worker nodes should have access to the Hadoop
file system.
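A short sketch of what that looks like from the API side (the NameNode URI
is hypothetical):

    // textFile creates roughly one partition per HDFS block. Each
    // partition is opened and read by the executor that runs its task,
    // so the blocks are read in parallel across the cluster.
    val events = sc.textFile("hdfs://namenode:8020/data/events")
    println(events.partitions.length)  // roughly the number of HDFS blocks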


> Sorry if these questions have already been discussed.
>
> Thanks
> Swapnil
>