Posted to user@storm.apache.org by Rajiv Onat <or...@gmail.com> on 2014/06/09 21:16:10 UTC

Apache Storm vs Apache Spark

I'm trying to figure out whether these are competing or complementary
technologies for stream processing. From my initial reading, both provide a
framework for scaling stream processing, while Spark has window constructs;
with Spark Streaming, Apache Spark also promises one platform for batch,
interactive, and stream processing.

Any comments or thoughts?

Re: Apache Storm vs Apache Spark

Posted by Ted Dunning <te...@gmail.com>.
The big difference that I see is that Spark Streaming inherently does
micro-batching.  Storm can either do it or not.
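That contrast can be sketched in plain Python (a toy simulation, not Storm or Spark API code; the `stream` and `handle` names are hypothetical stand-ins):

```python
def process_per_tuple(stream, handle):
    # One-at-a-time, Storm-core style: each record is handled as it
    # arrives, so per-record latency is just the handling time.
    for record in stream:
        handle([record])

def process_micro_batched(stream, handle, batch_size):
    # Micro-batched, Spark Streaming style: records are buffered and
    # handled together, so a record can wait up to one batch interval.
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) >= batch_size:
            handle(batch)
            batch = []
    if batch:
        handle(batch)  # flush the final partial batch

def count_handler(counts, key):
    def handle(batch):
        counts[key] += len(batch)
    return handle

counts = {"per_tuple": 0, "micro_batch": 0}
process_per_tuple(range(10), count_handler(counts, "per_tuple"))
process_micro_batched(range(10), count_handler(counts, "micro_batch"),
                      batch_size=3)
# Same totals either way; only latency and call granularity differ.
```

The totals come out identical; the operational difference is how long an individual record waits before being handled.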





Re: Apache Storm vs Apache Spark

Posted by Machiel Groeneveld <ma...@gmail.com>.
I know Storm fairly well; Spark is probably new to everyone, and I haven't
used it yet. Some thoughts on where Spark Streaming would be a more natural
fit than Storm:
Spark Streaming
+ Counting messages or computing other statistics on them
+ Sliding windows on streams
- Programming model (Spark seems to be one big procedure for the entire
process)
- Maturity?

Storm
+ More complex topologies (complexity is delegated to the bolts)
+ Always concurrent (Spark only for some operations)
+ Multiple receivers for one message
+ Maturity

Maybe it's the examples, but Spark seems to be geared towards ad-hoc
programming. I'm curious what a complex application would look like. Also
performance compared to Storm is a question mark. Storm has a great
programming model which at first glance looks more universal than Spark's.





Re: Apache Storm vs Apache Spark

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jun 9, 2014 at 2:27 PM, P. Taylor Goetz <pt...@gmail.com> wrote:

> There is one study that I’m aware of that claims Spark streaming is
> insanely faster than Storm.


I like your way of describing the two tools as starting from differing
extremes with a common territory around micro-batching.

As such, it is hardly surprising that either system looks insanely faster on
its home turf.  I can imagine that if Storm is compared on record-at-a-time
apps (with a few milliseconds of latency), it looks insanely faster than
Spark Streaming distorted to use micro-batches of, say, 100 ms.  Not only
would Spark grump about the tiny batch times, but it would look 100x slower
right out of the gate.  The same game can undoubtedly be played at the other
extreme to make Spark look really spiffy.  Such comparisons are essentially
vacuous and merely demonstrate that no one tool fits all applications.

Re: Apache Storm vs Apache Spark

Posted by Ted Dunning <te...@gmail.com>.
In left field.


On Mon, Jun 9, 2014 at 4:57 PM, Dan <dc...@hotmail.com> wrote:

> Where would Akka fit on the Storm/Spark spectrum?

RE: Apache Storm vs Apache Spark

Posted by Dan <dc...@hotmail.com>.
Where would Akka fit on the Storm/Spark spectrum?

Thanks,
Dan


Re: Apache Storm vs Apache Spark

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jun 9, 2014 at 3:48 PM, Rajiv Onat <or...@gmail.com> wrote:

> a) I have stream of orders (keyed on customerid, source is socket)
> b) I filter for those orders that is from my high value customers (I have
> to make sure I have this list of high value customers available on all bolt
> tasks in memory for fast correlation/projection), so  customer id in
> streams correlated to customer id in the list and if the customer type is
> in platinum and gold
> c) Count the orders/amount for last 5 minutes and group by products,
> customer type
>

If you want to join against a customer table, divide customers into low- and
high-value, and produce micro-batched 5-minute summaries for those
categories, then Spark Streaming is an ideal match.

Storm isn't a bad match either, especially since the number of orders is
likely small enough that none of these approaches would be stressed
performance-wise.
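The join-then-summarize step can be sketched in plain Python (hypothetical data and names, not Spark API code): look up each order's customer in a small in-memory table, keep only high-value customers, and aggregate one micro-batch by (product, customer type).

```python
from collections import defaultdict

# Hypothetical customer table: customer_id -> customer_type.
customers = {"c1": "platinum", "c2": "bronze", "c3": "gold"}
HIGH_VALUE = {"platinum", "gold"}

def summarize(batch):
    # Join each order against the customer table, keep only high-value
    # customers, and aggregate count/amount by (product, customer_type).
    summary = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for cust_id, product, amount in batch:
        ctype = customers.get(cust_id)
        if ctype in HIGH_VALUE:
            summary[(product, ctype)]["count"] += 1
            summary[(product, ctype)]["amount"] += amount
    return dict(summary)

# One 5-minute micro-batch of orders: (customer_id, product, amount).
batch = [
    ("c1", "widget", 10.0),
    ("c2", "widget", 99.0),  # bronze customer, filtered out
    ("c3", "gadget", 25.0),
    ("c1", "gadget", 5.0),
]
summary = summarize(batch)
```

In Spark Streaming this whole function would roughly correspond to built-in filter, map, and windowed reduce-by-key operators; in core Storm it would live inside a bolt holding the customer table in memory.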

Re: Apache Storm vs Apache Spark

Posted by Rajiv Onat <or...@gmail.com>.
Thanks, Taylor. Storm seems more flexible as a framework in that it provides
key primitives; the onus is on the developers to fine-tune it according to
their QoS needs. On the other hand, looking at the Lambda architecture,
Storm only fulfills the speed layer, while Spark could cover
batch/speed/serving (Spark SQL).  Based on the use case and the compromises
one is willing to make on throughput/latency/QoS, I guess you have to pick
the right one.

My simple use case is:
a) I have a stream of orders (keyed on customer id; the source is a socket)
b) I filter for orders from my high-value customers (I have to make sure
this list of high-value customers is available in memory on all bolt tasks
for fast correlation/projection), i.e. the customer id in the stream is
correlated to the customer id in the list, and the customer type is platinum
or gold
c) Count the orders/amount for the last 5 minutes, grouped by product and
customer type






Re: Apache Storm vs Apache Spark

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
The way I usually describe the difference is that Spark is a batch processing framework that also does micro-batching (Spark Streaming), while Storm is a stream processing framework that also does micro-batching (Trident). So architecturally they are very different, but have some similarity on the functional side.

With micro-batching you can achieve higher throughput at the cost of increased latency. With Spark this is unavoidable. With Storm you can use the core API (spouts and bolts) to do one-at-a-time processing to avoid the inherent latency overhead imposed by micro-batching. With Trident, you get state management out of the box, and sliding windows are supported as well.

In terms of adoption and production deployments, Storm has been around longer and there are a LOT of production deployments. I’m not aware of that many production Spark deployments, but I’d expect that to change over time.

In terms of performance, I can’t really point to any valid comparisons. When I say “valid” I mean open and independently verifiable. There is one study that I’m aware of that claims Spark streaming is insanely faster than Storm. The problem with that study is that none of the code or configurations used are publicly available (that I’m aware of). So without a way to independently verify those claims, I’d dismiss it as marketing fluff (the same goes for the IBM InfoStreams comparison). Storm is very tunable when it comes to performance, allowing it to be tuned to the use case at hand. However, it is also easy to cripple performance with the wrong config.

I can personally verify that it is possible to process 1.2+ million (relatively small) messages per second with a 10-15 node cluster — and that includes writing to HBase, and other components (I don’t have the hardware specs handy, but can probably dig them up).


- Taylor

 



Re: Apache Storm vs Apache Spark

Posted by Rajiv Onat <or...@gmail.com>.
Thanks. I'm not sure why you say they are different; from a stream-processing
use-case perspective both seem to accomplish the same thing, while the
implementations may take different approaches. If I want to aggregate and do
stats in Storm, I would have to micro-batch the tuples at some level. For
example, for a count of orders in the last 1 minute, in Storm I have to
write code for sliding windows and state management, while Spark seems to
provide operators to accomplish that. Tuple-level operations such as
enrichment, filters, etc. seem doable in both.
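The hand-rolled sliding-window state described above can be sketched in plain Python (a hypothetical class, not Storm API code; this is roughly the in-memory state a bolt would keep):

```python
from collections import deque

class SlidingWindowCounter:
    # Hand-rolled sliding-window count over the last `window_secs`
    # seconds, with eviction of events that fall out of the window.
    def __init__(self, window_secs=60):
        self.window_secs = window_secs
        self.events = deque()  # event timestamps, oldest first

    def add(self, ts):
        self.events.append(ts)
        self._evict(ts)

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        # Drop timestamps older than the window start.
        while self.events and self.events[0] <= now - self.window_secs:
            self.events.popleft()

counter = SlidingWindowCounter(window_secs=60)
for ts in [0, 10, 30, 55, 70]:  # order arrival times, in seconds
    counter.add(ts)
print(counter.count(now=75))    # prints 3: events at 30, 55, 70 remain
```

This is exactly the bookkeeping (plus fault-tolerant state management) that a framework-provided windowing operator saves you from writing.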



Re: Apache Storm vs Apache Spark

Posted by Ted Dunning <te...@gmail.com>.
They are different.

Storm allows one-tuple-at-a-time ("right now") processing.  Spark Streaming
requires micro-batching (though the batch interval may be really short).
Spark Streaming offers framework-supported checkpointing of partial results
in the stream; Storm says you should roll your own or use Trident.

Applications that fit one like a glove are likely to bind a bit on the
other.
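The "roll your own" side of that checkpointing remark can be sketched in plain Python (a toy class with hypothetical names; a real bolt would write its snapshots to durable storage rather than keep them in memory):

```python
import copy

class StatefulCounter:
    # Hand-rolled state plus periodic checkpointing: keep running
    # per-key counts and snapshot the state every `checkpoint_every`
    # tuples. On failure, recovery rolls back to the last snapshot,
    # losing whatever arrived since.
    def __init__(self, checkpoint_every=3):
        self.state = {}
        self.seen = 0
        self.checkpoint_every = checkpoint_every
        self.last_checkpoint = {}

    def update(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.seen += 1
        if self.seen % self.checkpoint_every == 0:
            # In a real system this would go to durable storage.
            self.last_checkpoint = copy.deepcopy(self.state)

    def recover(self):
        # Simulated crash recovery: restart from the last snapshot.
        self.state = copy.deepcopy(self.last_checkpoint)

counter = StatefulCounter(checkpoint_every=3)
for key in ["a", "b", "a", "a"]:
    counter.update(key)
counter.recover()     # rolls back to the snapshot taken after 3 tuples
print(counter.state)  # prints {'a': 2, 'b': 1}
```

Spark Streaming's checkpointing (and Trident's state abstraction) handles this snapshot-and-recover cycle for you; with core Storm, the logic above is the developer's responsibility.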



