Posted to user@spark.apache.org by howard chen <ho...@gmail.com> on 2013/10/25 17:32:59 UTC

Compare with Storm

Hello,

We have some in-house real-time streaming jobs written for Storm and want
to explore the possibility of migrating to Spark Streaming in the future, as our
team all think Spark is a very promising technology (one platform to
execute both real-time and interactive jobs) with excellent documentation.

1. If we focus on the streaming capabilities, what are the main pros/cons
at the current moment? Is Spark Streaming suitable for production use now?

2. In terms of message reliability and transaction support, I assume both
need to rely on ZooKeeper, right?

3. In Storm, we use Topology/Spout/Bolt as the data model. How do we
translate them to Spark Streaming if we want to rewrite our system? Is
there a migration guide?

4. Can Spark do distributed RPC like Storm?



Thanks for any ideas.

Re: Compare with Storm

Posted by Matei Zaharia <ma...@gmail.com>.
Hey Howard,

Great to hear that you're looking at Spark Streaming!

> We have some in-house real-time streaming jobs written for Storm and want to explore the possibility of migrating to Spark Streaming in the future, as our team all think Spark is a very promising technology (one platform to execute both real-time and interactive jobs) with excellent documentation.
> 
> 1. If we focus on the streaming capabilities, what are the main pros/cons at the current moment? Is Spark Streaming suitable for production use now?

Spark Streaming provides all the facilities for continuous use, but currently master fault tolerance takes a bit more manual setup (in particular, you have to manually restart your application from a checkpoint if the master crashes). We plan to improve this later. Several groups are using it in production as far as I know, though, so it's worth a try as long as you're aware of this caveat.
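The restart-from-checkpoint pattern described above can be sketched in plain Python, outside of Spark, just to show the idea. The file name, state shape, and batch logic here are all made up for illustration; real Spark Streaming checkpointing persists DStream metadata and state to a reliable filesystem:

```python
import json
import os
import tempfile

# Hypothetical checkpoint location; a real deployment would use HDFS or similar.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "demo_checkpoint.json")

def save_checkpoint(state):
    # Persist enough state to resume the job after a crash.
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def start_or_recover():
    # On (re)start, resume from the last checkpoint if one exists,
    # instead of reprocessing the stream from the beginning.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_batch": 0, "total": 0}

state = start_or_recover()
for batch_id in range(state["next_batch"], 3):  # process batches 0..2
    state["total"] += batch_id
    state["next_batch"] = batch_id + 1
    save_checkpoint(state)                      # checkpoint after each batch

os.remove(CHECKPOINT)                           # clean up the demo file
```

The key point is that recovery is a fresh process re-reading durable state, which is why (at the time of this thread) the restart itself had to be triggered manually.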

> 2. In terms of message reliability and transaction support, I assume both need to rely on ZooKeeper, right?

As far as I know neither of them uses ZooKeeper for message reliability -- they implement this within the application, and maybe use ZooKeeper for leader election. Spark Streaming is designed to compute reliably (everything has "exactly-once" semantics by default) and does this using a mechanism called "discretized streams" (http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf). Storm only provides "at-least-once" delivery by default and does not provide fault tolerance for state, though Trident (built on top) gives you that and exactly-once semantics at the cost of using a database to do transactions. Storm also uses ZooKeeper for automatic leader election while Spark requires the manual setup of master restart as described above.

> 3. In Storm, we use Topology/Spout/Bolt as the data model. How do we translate them to Spark Streaming if we want to rewrite our system? Is there a migration guide?

This is probably the biggest difference -- Spark Streaming only works in terms of high-level operations, such as map() and groupBy(), and doesn't expose a lower-level "nodes exchanging messages" model. You can probably take the code you use within a bolt and run it inside a map(), or whatever operation matches the way you want to combine data.
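To illustrate the translation, here is a hedged pure-Python sketch (standing in for the Spark API, which would express the same thing over DStreams): the per-tuple logic of a hypothetical word-count bolt becomes the function passed to map(), and the topology wiring becomes a groupBy-and-reduce over the micro-batch:

```python
from itertools import groupby

# Storm-style: per-tuple logic that would live in a bolt's execute().
def bolt_execute(word):
    return (word.lower(), 1)

# Spark-Streaming-style: the same logic expressed as high-level
# operations over a whole micro-batch (map, then groupBy + reduce).
def word_counts(batch):
    pairs = sorted(map(bolt_execute, batch))        # map()
    return {k: sum(n for _, n in grp)               # groupBy + reduce
            for k, grp in groupby(pairs, key=lambda p: p[0])}

word_counts(["Spark", "storm", "spark"])  # {'spark': 2, 'storm': 1}
```

The migration exercise is therefore mostly about lifting per-tuple bolt code into functions passed to batch-level operators, rather than a mechanical Spout/Bolt mapping.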

> 4. Can Spark do distributed RPC like Storm?

Since Spark Streaming is just running on top of Spark, you can actually run code to grab the state of the computation any time as an RDD and then run Spark queries on it. So again while it's not exactly the same as the thing called distributed RPC in Storm, it is a way to do arbitrary parallel computations on your data, and it comes with a high-level API.
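The point that the computation's state is ordinary queryable data can be sketched in plain Python (again, no Spark; the names are invented). Each micro-batch folds into a running state, and because the state is just a value, an ad-hoc query can run against it at any moment, loosely analogous to grabbing a DStream's underlying RDD and running a Spark query on it:

```python
# Each micro-batch updates a running state (here, word counts).
def update_state(state, batch):
    for key in batch:
        state[key] = state.get(key, 0) + 1
    return state

state = {}
for batch in [["a", "b"], ["a", "a", "c"]]:
    state = update_state(state, batch)

# Ad-hoc "query" over the current state of the stream computation,
# playing the role that a distributed-RPC request plays in Storm.
top = max(state, key=state.get)  # 'a'
```

In real Spark the query itself would also be parallel, which is what makes it a substitute for Storm's distributed RPC rather than a single-machine lookup.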

Matei