Posted to user@spark.apache.org by AlexG <sw...@gmail.com> on 2016/09/17 01:08:35 UTC

feasibility of ignite and alluxio for interfacing MPI and Spark

Do Ignite and Alluxio offer a reasonable means of transferring data, in
memory, from Spark to MPI? A straightforward way to transfer data is to use
piping, but unless you have MPI processes running in a one-to-one mapping
with the Spark partitions, this requires some complicated logic to get
working (you'll have to handle multiple tasks sending their data to one
process).
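
For concreteness, here is a minimal spark-shell style sketch of the piping
approach; sc is the shell's predefined SparkContext, and mpi_consumer is a
hypothetical wrapper script that forwards its stdin to a local MPI rank.

// One record per line, partitioned the way the consuming processes expect.
val data = sc.parallelize(1 to 1000000, 8).map(i => s"$i,${i * 2.0}")

// pipe() forks the external command once per partition, writes each record
// of that partition to the command's stdin, and returns whatever the command
// prints to stdout as an RDD of strings.
val acked = data.pipe("mpi_consumer --stdin")

// Force execution so every partition actually gets piped.
acked.count()

This only works cleanly when there is exactly one consuming process per
partition, which is the one-to-one mapping mentioned above.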

It seems that Ignite and Alluxio might allow you to pull the data you want
into each of your MPI processes without worrying about such a mapping, but
it's not clear to me from the high-level descriptions of the two systems
whether this can be readily realized. Is this the case?

Another issue is that with the piping solution, you only need to store two
copies of the data: one on the Spark side and one on the MPI side. With
Ignite and Alluxio, would you need three? It seems that they let you replace
the standard RDDs with RDDs backed by their memory stores, but do those
perform as efficiently as standard Spark RDDs persisted in memory?

More generally, I'd be interested to know if there are existing solutions to
this problem of transferring data between MPI and Spark. Thanks for any
insight you can offer!






Re: feasibility of ignite and alluxio for interfacing MPI and Spark

Posted by Calvin Jia <ji...@gmail.com>.
Hi,

Alluxio allows for data sharing between applications through a file system
API (the native Java Alluxio client, the Hadoop FileSystem interface, or
POSIX through FUSE). If your MPI applications can use any of these
interfaces, you should be able to use Alluxio for data sharing out of the
box.
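
For example, a Spark job can write through the Hadoop FileSystem interface
and the MPI side can read the same files via POSIX. This is only a sketch:
the master address alluxio://master:19998 and the FUSE mount point
/mnt/alluxio are placeholders for your own deployment.

// Spark side (spark-shell style, sc predefined): one output file per
// partition, written into Alluxio through the Hadoop FileSystem API
// (requires the Alluxio client jar on the Spark classpath).
val rows = sc.parallelize(1 to 100000, 16).map(i => s"$i ${math.sqrt(i.toDouble)}")
rows.saveAsTextFile("alluxio://master:19998/shared/matrix")

// MPI side: each rank opens whichever shards it needs with ordinary POSIX
// reads, e.g. /mnt/alluxio/shared/matrix/part-00000, so no one-to-one
// mapping between Spark partitions and MPI ranks is required.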

In terms of duplicating in-memory data, you should need only one copy in
Alluxio if you are able to stream your dataset. As for the performance of
using Alluxio to back your data compared to using Spark's native in-memory
representation, here is a blog post
<http://www.alluxio.com/2016/08/effective-spark-rdds-with-alluxio/> that
details the pros and cons of the two approaches. At a high level, Alluxio
performs better with larger datasets or if you plan to use your dataset in
more than one Spark job.
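
As a rough illustration of the two approaches (spark-shell style; the
Alluxio master address is again a placeholder):

import org.apache.spark.storage.StorageLevel

val rows = sc.parallelize(1 to 100000).map(i => s"row $i")

// Spark-native caching: fast for reuse within this one application, but the
// cached blocks live in the executors and disappear when the job exits.
rows.persist(StorageLevel.MEMORY_ONLY)
rows.count()

// Alluxio-backed storage: the copy in Alluxio memory survives the Spark
// application and is visible to other frameworks (including MPI through
// FUSE), at the cost of writing the records out as files once.
rows.saveAsTextFile("alluxio://master:19998/shared/rows")
val fromAlluxio = sc.textFile("alluxio://master:19998/shared/rows")
fromAlluxio.count()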

Hope this helps,
Calvin