Posted to user@spark.apache.org by joshuata <jo...@gmail.com> on 2016/07/18 21:57:42 UTC

Execute function once on each node

I am working on a Spark application that requires the ability to run a
function on each node in the cluster. This is used to read data from a
directory that is not globally accessible to the cluster. I have tried
creating an RDD with n elements and n partitions so that it is evenly
distributed among the n nodes, and then mapping a function over the RDD.
However, the runtime makes no guarantee that each partition will be computed
on a separate node, so the code can end up running multiple times on the same
node while never running on others.
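
For reference, a minimal sketch of the approach described above (the app name,
partition count, and local path are placeholders, not part of the actual setup):

import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

object PerNodeReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-node-read"))
    val n = 4 // intended to equal the number of worker nodes (assumption)

    // One element per partition, in the hope that each partition lands on a
    // different node. The scheduler gives no such guarantee, which is exactly
    // the problem described above.
    sc.parallelize(0 until n, n).foreachPartition { _ =>
      val host = java.net.InetAddress.getLocalHost.getHostName
      val files = Option(new File("/local/ssd/output").listFiles())
        .getOrElse(Array.empty[File])
      println(s"$host sees ${files.length} local files")
    }

    sc.stop()
  }
}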

I have looked through the documentation and source code for both RDDs and
the scheduler, but I haven't found anything that will do what I need. Does
anybody know of a solution I could use?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Execute-function-once-on-each-node-tp27351.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Execute function once on each node

Posted by Josh Asplund <jo...@gmail.com>.
Thank you for that advice. I have tried similar techniques, but not that
one.

On Mon, Jul 18, 2016 at 11:42 PM Aniket Bhatnagar <
aniket.bhatnagar@gmail.com> wrote:

> Thanks for the explanation. Try creating a custom RDD whose getPartitions
> returns an array of custom partition objects of size n (= the number of nodes).
> In each custom partition object, you can have the file path and the IP/hostname
> where the partition needs to be computed. Then have getPreferredLocations
> return the IP/hostname from the partition object, and in the compute function
> assert that you are on the right IP/hostname (or fail) and read the contents
> of the file.
>
> Not 100% sure it will work, though.

Re: Execute function once on each node

Posted by Aniket Bhatnagar <an...@gmail.com>.
Thanks for the explanation. Try creating a custom RDD whose getPartitions
returns an array of custom partition objects of size n (= the number of nodes).
In each custom partition object, you can have the file path and the IP/hostname
where the partition needs to be computed. Then have getPreferredLocations
return the IP/hostname from the partition object, and in the compute function
assert that you are on the right IP/hostname (or fail) and read the contents
of the file.

Not 100% sure it will work, though.
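
A rough sketch of that idea (NodeLocalRDD, NodeLocalPartition and the host/path
pairs below are illustrative names, not a tested implementation):

import java.net.InetAddress

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: one per node, carrying the host that should
// compute it and the local path to read on that host.
case class NodeLocalPartition(index: Int, host: String, path: String) extends Partition

// Custom RDD along the lines described above: getPartitions returns one
// partition per node, getPreferredLocations pins each partition to its host,
// and compute() checks that it is really running there before reading.
class NodeLocalRDD(sc: SparkContext, hostsAndPaths: Seq[(String, String)])
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hostsAndPaths.zipWithIndex.map { case ((host, path), i) =>
      NodeLocalPartition(i, host, path): Partition
    }.toArray

  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[NodeLocalPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[NodeLocalPartition]
    val here = InetAddress.getLocalHost.getHostName
    // Fail the task if the scheduler did not honor the locality preference.
    require(here == p.host, s"Partition ${p.index} expected host ${p.host}, got $here")
    scala.io.Source.fromFile(p.path).getLines()
  }
}

// Usage sketch (hosts and paths are placeholders):
// val rdd = new NodeLocalRDD(sc, Seq(("node01", "/local/ssd/out.dat"),
//                                    ("node02", "/local/ssd/out.dat")))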

On Tue, Jul 19, 2016, 2:54 AM Josh Asplund <jo...@gmail.com> wrote:

> The Spark workers are running side-by-side with scientific simulation
> code. The code writes output to local SSDs to keep latency low. Due to the
> volume of data being moved (tens of terabytes or more), it isn't really
> feasible to copy the data to a global filesystem. Executing a function on
> each node would allow us to read the data in situ without a copy.
>
> I understand that manually assigning tasks to nodes reduces fault
> tolerance, but the simulation codes already explicitly assign tasks, so a
> failure of any one node is already a full-job failure.

Re: Execute function once on each node

Posted by Josh Asplund <jo...@gmail.com>.
Technical limitations keep us from running another filesystem on the SSDs.
We are running on a very large HPC cluster without control over low-level
system components. We have tried setting up an ad-hoc HDFS cluster on the
nodes in our allocation, but we have had very little luck. It ends up being
very brittle and difficult for the simulation code to access.

On Tue, Jul 19, 2016 at 7:08 AM Koert Kuipers <ko...@tresata.com> wrote:

> The whole point of a well-designed global filesystem is to not move the data.
>
> On Jul 19, 2016 10:07, "Koert Kuipers" <ko...@tresata.com> wrote:
>
>> If you run HDFS on those SSDs (with a low replication factor), wouldn't it
>> also effectively write to local disk with low latency?

Re: Execute function once on each node

Posted by Koert Kuipers <ko...@tresata.com>.
The whole point of a well-designed global filesystem is to not move the data.

On Jul 19, 2016 10:07, "Koert Kuipers" <ko...@tresata.com> wrote:

> If you run HDFS on those SSDs (with a low replication factor), wouldn't it
> also effectively write to local disk with low latency?

Re: Execute function once on each node

Posted by Josh Asplund <jo...@gmail.com>.
The Spark workers are running side-by-side with scientific simulation code.
The code writes output to local SSDs to keep latency low. Due to the volume
of data being moved (tens of terabytes or more), it isn't really feasible to
copy the data to a global filesystem. Executing a function on each node would
allow us to read the data in situ without a copy.

I understand that manually assigning tasks to nodes reduces fault
tolerance, but the simulation codes already explicitly assign tasks, so a
failure of any one node is already a full-job failure.

On Mon, Jul 18, 2016 at 3:43 PM Aniket Bhatnagar <an...@gmail.com>
wrote:

> You can't assume that the number of nodes will be constant, as some may
> fail, hence you can't guarantee that a function will execute at most once
> or at least once on a given node. Can you explain your use case in a bit
> more detail?

Re: Execute function once on each node

Posted by Aniket Bhatnagar <an...@gmail.com>.
You can't assume that the number of nodes will be constant, as some may
fail, hence you can't guarantee that a function will execute at most once
or at least once on a given node. Can you explain your use case in a bit
more detail?


Re: Execute function once on each node

Posted by Rabin Banerjee <de...@gmail.com>.
" I am working on a spark application that requires the ability to run a
function on each node in the cluster
"
--
Use Apache Ignite instead of Spark. Trust me it's awesome for this use case.
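
A minimal sketch of how that can look with Ignite's compute grid, assuming an
Ignite node is already started on every host (the closure body is a placeholder):

import java.net.InetAddress

import org.apache.ignite.Ignition
import org.apache.ignite.lang.IgniteRunnable

object ReadOnEveryNode {
  def main(args: Array[String]): Unit = {
    val ignite = Ignition.start()
    // broadcast() runs the job once on every node in the cluster group.
    ignite.compute().broadcast(new IgniteRunnable {
      override def run(): Unit = {
        val host = InetAddress.getLocalHost.getHostName
        // Placeholder: read the node-local directory here.
        println(s"running once on $host")
      }
    })
    Ignition.stop(true)
  }
}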

Regards,
Rabin Banerjee