Posted to user@spark.apache.org by Wei Da <xw...@gmail.com> on 2014/06/16 09:17:57 UTC

Is There Any Benchmarks Comparing C++ MPI with Spark

Hi guys,
We are choosing between C++ MPI and Spark. Is there any official
comparison between them? Thanks a lot!

Wei

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Larry,

I don't see any reference to Spark in particular there.

Additionally, the benchmark only scales up to datasets of roughly 10 GB
(though I realize they've picked some fairly computationally intensive
tasks), and they don't present results on more than 4 nodes. This can hide
issues such as a communication pattern that is O(n^2) in the number of
cluster nodes.

Obviously they've gotten some great performance out of SciDB, but I don't
think this answers the MPI vs. Spark question directly.

My own experience suggests that as long as your algorithm fits a BSP (bulk
synchronous parallel) programming model, with Spark you can achieve
performance comparable to a tuned C++/MPI codebase by leveraging the right
libraries locally and thinking carefully about what you have to
communicate, and when.
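
As a rough illustration of that pattern (a toy gradient-descent sketch, not
code from any benchmark or library): each superstep broadcasts the current
model, does the per-record work locally (which is where a tuned local
library would plug in), and sends back only one aggregated gradient.

    import org.apache.spark.{SparkConf, SparkContext}

    object BspSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("bsp-sketch").setMaster("local[4]"))

        // Toy data: 1,000 records of (features, label) spread over 8 partitions.
        val n = 1000
        val data = sc.parallelize(
          Seq.fill(n)((Array.fill(10)(math.random), math.random)), 8).cache()

        var w = Array.fill(10)(0.0)
        for (step <- 1 to 20) {
          val bw = sc.broadcast(w)                // one small broadcast per superstep
          val grad = data.map { case (x, y) =>    // heavy work stays local to each partition
            val err = x.zip(bw.value).map { case (xi, wi) => xi * wi }.sum - y
            x.map(_ * err)
          }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })  // only the aggregate moves
          w = w.zip(grad).map { case (wi, gi) => wi - 0.01 * gi / n }
          bw.unpersist()                          // drop the stale model copy
        }
        println(w.mkString(", "))
        sc.stop()
      }
    }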

- Evan


On Thu, Jun 19, 2014 at 8:48 AM, ldmtwo <la...@intel.com> wrote:

>
> Here is a partial comparison.
>
>
> http://dspace.mit.edu/bitstream/handle/1721.1/82517/MIT-CSAIL-TR-2013-028.pdf?sequence=2
>
> SciDB uses MPI with Intel hardware and libraries. Amazing performance, at
> the cost of more work.
>
> In case the link stops working:
> A Complex Analytics Genomics Benchmark, by Rebecca Taft, Manasi Vartak,
> Nadathur Rajagopalan Satish, Narayanan Sundaram, Samuel Madden, and Michael
> Stonebraker
>

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

Posted by ldmtwo <la...@intel.com>.
Here is a partial comparison.

http://dspace.mit.edu/bitstream/handle/1721.1/82517/MIT-CSAIL-TR-2013-028.pdf?sequence=2

SciDB uses MPI with Intel hardware and libraries. Amazing performance, at
the cost of more work.

In case the link stops working:
A Complex Analytics Genomics Benchmark, by Rebecca Taft, Manasi Vartak,
Nadathur Rajagopalan Satish, Narayanan Sundaram, Samuel Madden, and Michael
Stonebraker




Re: Is There Any Benchmarks Comparing C++ MPI with Spark

Posted by Tom Vacek <mi...@gmail.com>.
Spark gives you four of the classical collectives: broadcast, reduce,
scatter, and gather. There are also a few additional primitives, mostly
based on a join. Spark is certainly less optimized than MPI for these, but
maybe that isn't such a big deal. Spark has one theoretical disadvantage
compared to MPI: every collective operation requires the task closures to
be distributed, and, to my knowledge, that is an O(p) operation in the
number of workers p. (Perhaps there has been some progress on this?) That
O(p) term spoils any parallel isoefficiency analysis. In MPI, binaries are
distributed once, and wire-up is O(log p). In practice, this prevents
reasonable-looking strong scaling curves: with MPI the overall runtime
stops declining and levels off as p increases, but with Spark it can go up
sharply. So Spark is great for a small cluster. For a huge cluster, or a
job with a lot of collectives, it isn't so great.
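
For reference, the rough Spark equivalents of those collectives look
something like this (a toy sketch in local mode, not tuned code):

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkCollectives {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("collectives-sketch").setMaster("local[4]"))

        // "scatter": parallelize spreads the data over 8 partitions
        val data = sc.parallelize(1 to 1000000, 8)

        // broadcast: ship one read-only copy of a value to every executor
        val weights = sc.broadcast(Array.fill(10)(0.5))

        // reduce: combine per-partition partial results back at the driver
        val total = data.map(_.toLong).reduce(_ + _)

        // "gather": collect brings a (small!) result set back to the driver
        val gathered = data.filter(_ % 250000 == 0).collect()

        println(s"total=$total gathered=${gathered.mkString(",")} w0=${weights.value(0)}")
        sc.stop()
      }
    }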


On Mon, Jun 16, 2014 at 1:36 PM, Bertrand Dechoux <de...@gmail.com>
wrote:

> I guess you have to understand the architectural differences. I don't
> know much about C++ MPI, but it is basically MPI, whereas Spark is
> inspired by Hadoop MapReduce and optimised for reading/writing large
> amounts of data with a smart caching and locality strategy. Intuitively,
> if you have a high ratio of computation to communication then MPI might be
> better. But what that ratio is in practice is hard to say, and in the end
> it will depend on your specific application. Finally, in real life, this
> architectural difference in performance may not be the only or the most
> important factor in the choice, as Michael already explained.
>
> Bertrand
>
> On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler <mi...@tumra.com> wrote:
>
>> Hello Wei,
>>
>> I speak from experience, having written many distributed HPC applications
>> using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and
>> Parallel Virtual Machine (PVM) way before that, back in the '90s. I can
>> say with absolute certainty:
>>
>> *Any gains you believe there are because "C++ is faster than Java/Scala"
>> will be completely wiped out by the inordinate amount of time you spend
>> debugging your code and/or reinventing the wheel to do even basic tasks
>> like linear regression.*
>>
>>
>> There are undoubtedly some very specialised use cases where MPI and its
>> brethren still dominate for High Performance Computing tasks -- for
>> example, the nuclear decay simulations run by the US Department of Energy
>> on supercomputers, where they've invested billions solving that use case.
>>
>> Spark is part of the wider "Big Data" ecosystem, and its biggest
>> advantages are traction amongst internet-scale companies, hundreds of
>> developers contributing to it, and a community of thousands using it.
>>
>> Need a distributed fault-tolerant file system? Use HDFS.  Need a
>> distributed, fault-tolerant message queue? Use Kafka.  Need to co-ordinate
>> between your worker processes? Use ZooKeeper.  Need to run it on a flexible
>> grid of computing resources and handle failures? Run it on Mesos!
>>
>> The barrier to entry for getting going with Spark is very low: download
>> the latest distribution and start the Spark shell.  Language bindings for
>> Scala, Java and Python are excellent, meaning you spend less time writing
>> boilerplate code and more time solving problems.
>>
>> Even if you believe you *need* to use native code to do something
>> specific, like fetching HD video frames from satellite video capture cards
>> -- wrap it in a small native library and use the Java Native Access
>> interface to call it from your Java/Scala code.
>>
>> Have fun, and if you get stuck we're here to help!
>>
>> MC
>>
>>
>> On 16 June 2014 08:17, Wei Da <xw...@gmail.com> wrote:
>>
>>> Hi guys,
>>> We are choosing between C++ MPI and Spark. Is there any official
>>> comparison between them? Thanks a lot!
>>>
>>> Wei
>>>
>>
>>
>

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

Posted by Bertrand Dechoux <de...@gmail.com>.
I guess you have to understand the architectural differences. I don't know
much about C++ MPI, but it is basically MPI, whereas Spark is inspired by
Hadoop MapReduce and optimised for reading/writing large amounts of data
with a smart caching and locality strategy. Intuitively, if you have a high
ratio of computation to communication then MPI might be better. But what
that ratio is in practice is hard to say, and in the end it will depend on
your specific application. Finally, in real life, this architectural
difference in performance may not be the only or the most important factor
in the choice, as Michael already explained.

Bertrand

On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler <mi...@tumra.com> wrote:

> Hello Wei,
>
> I speak from experience, having written many distributed HPC applications
> using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and
> Parallel Virtual Machine (PVM) way before that, back in the '90s. I can
> say with absolute certainty:
>
> *Any gains you believe there are because "C++ is faster than Java/Scala"
> will be completely wiped out by the inordinate amount of time you spend
> debugging your code and/or reinventing the wheel to do even basic tasks
> like linear regression.*
>
>
> There are undoubtedly some very specialised use cases where MPI and its
> brethren still dominate for High Performance Computing tasks -- for
> example, the nuclear decay simulations run by the US Department of Energy
> on supercomputers, where they've invested billions solving that use case.
>
> Spark is part of the wider "Big Data" ecosystem, and its biggest
> advantages are traction amongst internet-scale companies, hundreds of
> developers contributing to it, and a community of thousands using it.
>
> Need a distributed fault-tolerant file system? Use HDFS.  Need a
> distributed, fault-tolerant message queue? Use Kafka.  Need to co-ordinate
> between your worker processes? Use ZooKeeper.  Need to run it on a flexible
> grid of computing resources and handle failures? Run it on Mesos!
>
> The barrier to entry for getting going with Spark is very low: download
> the latest distribution and start the Spark shell.  Language bindings for
> Scala, Java and Python are excellent, meaning you spend less time writing
> boilerplate code and more time solving problems.
>
> Even if you believe you *need* to use native code to do something
> specific, like fetching HD video frames from satellite video capture cards
> -- wrap it in a small native library and use the Java Native Access
> interface to call it from your Java/Scala code.
>
> Have fun, and if you get stuck we're here to help!
>
> MC
>
>
> On 16 June 2014 08:17, Wei Da <xw...@gmail.com> wrote:
>
>> Hi guys,
>> We are choosing between C++ MPI and Spark. Is there any official
>> comparison between them? Thanks a lot!
>>
>> Wei
>>
>
>

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

Posted by Michael Cutler <mi...@tumra.com>.
Hello Wei,

I speak from experience, having written many distributed HPC applications
using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and
Parallel Virtual Machine (PVM) way before that, back in the '90s. I can say
with absolute certainty:

*Any gains you believe there are because "C++ is faster than Java/Scala"
will be completely wiped out by the inordinate amount of time you spend
debugging your code and/or reinventing the wheel to do even basic tasks
like linear regression.*


There are undoubtedly some very specialised use cases where MPI and its
brethren still dominate for High Performance Computing tasks -- for
example, the nuclear decay simulations run by the US Department of Energy
on supercomputers, where they've invested billions solving that use case.

Spark is part of the wider "Big Data" ecosystem, and its biggest advantages
are traction amongst internet-scale companies, hundreds of developers
contributing to it, and a community of thousands using it.

Need a distributed fault-tolerant file system? Use HDFS.  Need a
distributed, fault-tolerant message queue? Use Kafka.  Need to co-ordinate
between your worker processes? Use ZooKeeper.  Need to run it on a flexible
grid of computing resources and handle failures? Run it on Mesos!

The barrier to entry for getting going with Spark is very low: download the
latest distribution and start the Spark shell.  Language bindings for
Scala, Java and Python are excellent, meaning you spend less time writing
boilerplate code and more time solving problems.
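
For example, once ./bin/spark-shell is running you already have a
SparkContext bound to sc, and a word count is a handful of lines (a quick
sketch, assuming you start the shell from the unpacked distribution
directory so a README.md is present):

    val lines  = sc.textFile("README.md")
    val counts = lines.flatMap(_.split("\\s+"))
                      .filter(_.nonEmpty)
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.take(5).foreach(println)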

Even if you believe you *need* to use native code to do something specific,
like fetching HD video frames from satellite video capture cards -- wrap it
in a small native library and use the Java Native Access interface to call
it from your Java/Scala code.
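
A minimal sketch of that approach using JNA (the library name "framegrab",
the function fg_frame_count, and the device path are made up purely for
illustration; it assumes the JNA jar is on the classpath):

    import com.sun.jna.{Library, Native}

    // JNA maps this interface onto the exported functions of the native library.
    trait FrameGrabLib extends Library {
      def fg_frame_count(device: String): Int   // hypothetical: int fg_frame_count(const char *device)
    }

    object NativeFrames {
      def main(args: Array[String]): Unit = {
        // Loads libframegrab.so / framegrab.dll from the system library search path.
        val lib = Native.loadLibrary("framegrab", classOf[FrameGrabLib])
                        .asInstanceOf[FrameGrabLib]
        println(s"frames available: ${lib.fg_frame_count("/dev/video0")}")
      }
    }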

Have fun, and if you get stuck we're here to help!

MC


On 16 June 2014 08:17, Wei Da <xw...@gmail.com> wrote:

> Hi guys,
> We are choosing between C++ MPI and Spark. Is there any official
> comparison between them? Thanks a lot!
>
> Wei
>