You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Travis Brady <tr...@gmail.com> on 2013/09/01 22:15:55 UTC

Re: Using Samza to build a CEP (Complex Events Processing) system?

I looked at Druid when it was first open sourced and I believe it's quite
different from Samza.  I considered it as an alternative to Redshift, but
ultimately Redshift is just so easy to operate and scale that it's hard to
turn down.

It's primary goal (if memory serves) is to be a fast data warehousing
system that allows for low-latency ingestion of data.

If you look at the "Druid vs." section here:
https://github.com/metamx/druid/wiki you'll see they compare it against a
bunch of datastores and not against Storm, Esper or any other stream
processing systems.
So I believe the model in Druid is probably:
1. Make some tables
2. Set up a real-time ETL pipeline
3. Query

If you're looking at Storm-ish stream processing systems I think the list
is currently:
1. Storm
2. Samza
3. Mupd8 (https://github.com/walmartlabs/mupd8)
4. Esper (admittedly a different animal but relevant)
5. EEP (
http://blog.clojurewerkz.org/blog/2013/08/29/stream-processing-with-eep/)
6. RxJava (https://github.com/Netflix/RxJava) not exactly the same, but
applicable
7. S4 http://incubator.apache.org/s4/
8. Dempsy: http://dempsy.github.io/Dempsy/
9. Spark Streaming:
http://spark.incubator.apache.org/docs/0.7.3/streaming-programming-guide.html
10. Commercial stuff (StreamBase, Truviso, Aleri, etc etc)


On Sat, Aug 31, 2013 at 2:56 PM, Alex The Rocker <al...@gmail.com>wrote:

> Chris,
>
> Thanks you very much for your detailed.
> Another system for processing real-time data just came to my attention
> (thanks to Kafka mailing list, again).
> It's called Druid (more at: http://druid.io).
>
> While I now understand Samza advantages over Storm for building a CEP, I am
> wondering how Samza compares to Druid.
> I guess I may not alone wondering about Samza vs. Druid, so you may want to
> add a Samza vs. Druid" item in Samza documenation :)
>
> Thanks,
> Alex.
>
>
>
>
> On Sun, Aug 25, 2013 at 5:26 PM, Chris Riccomini <criccomini@linkedin.com
> >wrote:
>
> > Hey Alex,
> >
> > As I understand it, the CEP pattern you describing is, "look for a series
> > of events within some bounded time frame, and take an action based on the
> > combination of events." You use an example of three events arriving
> within
> > 10 minutes of each other, consecutively. Wikipedia uses a similar example
> > (wedding bell event + man in suit event + woman in white dress event +
> > rice thrown event = wedding) on their CEP page.
> >
> > This pattern can be implemented in Samza fairly easily using Samza's
> > key/value store (or some other StorageEngine, if you choose to implement
> > it). It's best to use a key/value store for this use case, since the
> > window might be quite long (10 minutes), and all events in the window
> > might not fit in memory. If you use Samza's key/value store, you can put
> > each message (and a timestamp) into the key/value store as the messages
> > arrive. You can then implement the WindowableTask interface along with
> the
> > StreamTask interface, and configure Samza to call window() on your task
> > every N seconds (say, task.window.ms=60000). The window method could
> then
> > do a range query on the key/value store, and check for message chains
> > (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago. If an
> > expected message was missing, you could then take some action (send an
> > alert, or whatever).
> >
> > In general, when I think CEP, I think Esper (http://esper.codehaus.org/
> ).
> > You should be able to implement a lot of CEP/SQL type commands (SELECT,
> > JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER, etc)
> > using Samza's StreamTask interface, and is state management facilities.
> >
> > Beyond state management, most features in Samza enable CEP processing, in
> > one way or another. From your perspective, you can look at Samza as the
> > underlying framework with which you might choose to implement a CEP type
> > system (think MapReduce is to Hive as Samza is to a CEP system). Specific
> > things that help are its WindowableTask interface, the partitioning model
> > (which lends itself to distributed joins and aggregation), and Samza's
> > state management features.
> >
> > One thing to be aware of right now is Samza's "at least once" messaging
> > guarantee when failures occur (inherited from Kafka). You might receive
> > duplicate messages. This means you can potentially double count, if
> you're
> > doing aggregation. In the example you give (E1, E2, E3), this shouldn¹t
> be
> > a problem. We have plans to provide exactly once messaging, but we
> haven't
> > implemented the feature yet.
> >
> > Cheers,
> > Chris
> >
> > On 8/24/13 12:05 PM, "Alex The Rocker" <al...@gmail.com> wrote:
> >
> > >Hello,
> > >
> > >I just began to read about Samza, and I very excited about it (I was
> > >warned
> > >of its existence by Jay Kreps' post in Kafka users list, BTW).
> > >
> > >My first reaction is: are you guys using it at LinkedIn for applications
> > >which lies in the CEP (Complex Event Processing) system domain?
> > >
> > >To be more specific, would stateful Samza tasks be used in order to
> > >compute
> > >complex states such as "event E1 is followed by E2 then by E3 with less
> > >than 10 minutes interval between each event" ?
> > >
> > >I was looking at Storm for CEP, but as pointed out in Samza Storm page,
> > >Storm leaves state management to the bolts code, whereas Samza has
> > >"something".
> > >
> > >Beyond state management, what else would make Samza a good building
> block
> > >for a CEP?  Or a bad one?
> > >
> > >Thanks,
> > >Alex.
> >
> >
>

RE: Using Samza to build a CEP (Complex Events Processing) system?

Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Guys,

Travis' answer is pretty spot on.

To get a little more nitty-gritty about Druid, my understanding is that it's an in-memory OLAP system with a real-time ingestion pipeline.

Samza is actually quite complimentary to Druid. For example, if you are ingesting events into Druid, you might need to transform them into a standard format first. You could use Samza for that part of the ingestion pipeline. We actually do some of this at LinkedIn for some of our ingestion pipelines. The pattern is:

1. Send event to Kafka
2. Samza process consumes event from Kafka, and transforms it into a "standard" event (e.g. a search update event), and sends it to a new Kafka topic.
3. Downstream system consumes transformed event and updates data stores.

Travis' list of stream processing systems is great. We've done a brief comparison to Storm and MUPD8, which are up on our docs page. See http://samza.incubator.apache.org/learn/documentation/0.7.0/comparisons/introduction.html for a bit of information on what makes Samza different from most other stream processing systems.

Cheers,
Chris
________________________________________
From: Travis Brady [travis.brady@gmail.com]
Sent: Sunday, September 01, 2013 1:15 PM
To: dev@samza.incubator.apache.org
Subject: Re: Using Samza to build a CEP (Complex Events Processing) system?

I looked at Druid when it was first open sourced and I believe it's quite
different from Samza.  I considered it as an alternative to Redshift, but
ultimately Redshift is just so easy to operate and scale that it's hard to
turn down.

It's primary goal (if memory serves) is to be a fast data warehousing
system that allows for low-latency ingestion of data.

If you look at the "Druid vs." section here:
https://github.com/metamx/druid/wiki you'll see they compare it against a
bunch of datastores and not against Storm, Esper or any other stream
processing systems.
So I believe the model in Druid is probably:
1. Make some tables
2. Set up a real-time ETL pipeline
3. Query

If you're looking at Storm-ish stream processing systems I think the list
is currently:
1. Storm
2. Samza
3. Mupd8 (https://github.com/walmartlabs/mupd8)
4. Esper (admittedly a different animal but relevant)
5. EEP (
http://blog.clojurewerkz.org/blog/2013/08/29/stream-processing-with-eep/)
6. RxJava (https://github.com/Netflix/RxJava) not exactly the same, but
applicable
7. S4 http://incubator.apache.org/s4/
8. Dempsy: http://dempsy.github.io/Dempsy/
9. Spark Streaming:
http://spark.incubator.apache.org/docs/0.7.3/streaming-programming-guide.html
10. Commercial stuff (StreamBase, Truviso, Aleri, etc etc)


On Sat, Aug 31, 2013 at 2:56 PM, Alex The Rocker <al...@gmail.com>wrote:

> Chris,
>
> Thanks you very much for your detailed.
> Another system for processing real-time data just came to my attention
> (thanks to Kafka mailing list, again).
> It's called Druid (more at: http://druid.io).
>
> While I now understand Samza advantages over Storm for building a CEP, I am
> wondering how Samza compares to Druid.
> I guess I may not alone wondering about Samza vs. Druid, so you may want to
> add a Samza vs. Druid" item in Samza documenation :)
>
> Thanks,
> Alex.
>
>
>
>
> On Sun, Aug 25, 2013 at 5:26 PM, Chris Riccomini <criccomini@linkedin.com
> >wrote:
>
> > Hey Alex,
> >
> > As I understand it, the CEP pattern you describing is, "look for a series
> > of events within some bounded time frame, and take an action based on the
> > combination of events." You use an example of three events arriving
> within
> > 10 minutes of each other, consecutively. Wikipedia uses a similar example
> > (wedding bell event + man in suit event + woman in white dress event +
> > rice thrown event = wedding) on their CEP page.
> >
> > This pattern can be implemented in Samza fairly easily using Samza's
> > key/value store (or some other StorageEngine, if you choose to implement
> > it). It's best to use a key/value store for this use case, since the
> > window might be quite long (10 minutes), and all events in the window
> > might not fit in memory. If you use Samza's key/value store, you can put
> > each message (and a timestamp) into the key/value store as the messages
> > arrive. You can then implement the WindowableTask interface along with
> the
> > StreamTask interface, and configure Samza to call window() on your task
> > every N seconds (say, task.window.ms=60000). The window method could
> then
> > do a range query on the key/value store, and check for message chains
> > (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago. If an
> > expected message was missing, you could then take some action (send an
> > alert, or whatever).
> >
> > In general, when I think CEP, I think Esper (http://esper.codehaus.org/
> ).
> > You should be able to implement a lot of CEP/SQL type commands (SELECT,
> > JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER, etc)
> > using Samza's StreamTask interface, and is state management facilities.
> >
> > Beyond state management, most features in Samza enable CEP processing, in
> > one way or another. From your perspective, you can look at Samza as the
> > underlying framework with which you might choose to implement a CEP type
> > system (think MapReduce is to Hive as Samza is to a CEP system). Specific
> > things that help are its WindowableTask interface, the partitioning model
> > (which lends itself to distributed joins and aggregation), and Samza's
> > state management features.
> >
> > One thing to be aware of right now is Samza's "at least once" messaging
> > guarantee when failures occur (inherited from Kafka). You might receive
> > duplicate messages. This means you can potentially double count, if
> you're
> > doing aggregation. In the example you give (E1, E2, E3), this shouldn¹t
> be
> > a problem. We have plans to provide exactly once messaging, but we
> haven't
> > implemented the feature yet.
> >
> > Cheers,
> > Chris
> >
> > On 8/24/13 12:05 PM, "Alex The Rocker" <al...@gmail.com> wrote:
> >
> > >Hello,
> > >
> > >I just began to read about Samza, and I very excited about it (I was
> > >warned
> > >of its existence by Jay Kreps' post in Kafka users list, BTW).
> > >
> > >My first reaction is: are you guys using it at LinkedIn for applications
> > >which lies in the CEP (Complex Event Processing) system domain?
> > >
> > >To be more specific, would stateful Samza tasks be used in order to
> > >compute
> > >complex states such as "event E1 is followed by E2 then by E3 with less
> > >than 10 minutes interval between each event" ?
> > >
> > >I was looking at Storm for CEP, but as pointed out in Samza Storm page,
> > >Storm leaves state management to the bolts code, whereas Samza has
> > >"something".
> > >
> > >Beyond state management, what else would make Samza a good building
> block
> > >for a CEP?  Or a bad one?
> > >
> > >Thanks,
> > >Alex.
> >
> >
>