Posted to dev@kafka.apache.org by Arvind Kalyan <ar...@gmail.com> on 2016/09/25 06:25:45 UTC

Help related to KStreams and Stateful event processing.

Hello there!

I read about Kafka Streams recently; it's pretty interesting how it
solves the stream processing problem in a cleaner way, with less
overhead and complexity.

I work as a Software Engineer at a startup, and we are in the design
stage of building a stream processing pipeline (if you will) for the
millions of events we get every day. We have been using a Kafka cluster
as the log aggregation layer in production for the past 5-6 months and
are very happy with it.

I went through a few Confluent blogs (by Jay, Neha) on how Kafka
Streams solves stateful event processing, and maybe I missed the whole
point in this regard, but I have some doubts.

We have use-cases like the following:

There is an event E1, which is sort of the base event, after which we
have a lot of sub-events E2, E3, ..., En enriching E1 with a lot of
extra properties (with considerable delay, say 30-40 mins).

Eg. 1: An order event comes in when the user orders an item on our
website (this is the base event). After say 30-40 minutes, we get
events like packaging_time, shipping_time, delivered_time or
cancelled_time etc. related to that order (these are the sub-events).

So before we push the whole record to a warehouse, we need to collect
all of these (ordered, packaged, shipped, cancelled/delivered), and
whenever I get a cancelled or delivered event for an order, I know that
completes the lifecycle for that order and can put it in the warehouse.

Eg. 2: User login events - if we capture events like User-Logged-In
and User-Logged-Out, I need them in the warehouse as a single row, e.g.
a row with the columns *user_id, login_time, logout_time*. So as and
when I receive a logout event (and if I have the login event stored in
some store), a trigger would combine both and send the result across to
the warehouse.

All of these involve storing the state of the events and acting as and
when another event (one that completes the lifecycle) occurs, after
which there is a trigger for further steps (warehouse or anything
else).

Does KStream help me do this? If not, how should I go about solving this
problem?

Also, I wanted some advice on whether it is standard practice to
aggregate like this and *then* store to the warehouse, or whether I
should append each event to the warehouse and do a sort of ELT on it
using the warehouse (run a query to restructure the data in the
database and store it off as a separate table).

Eagerly waiting for your reply,
Arvind

Re: Help related to KStreams and Stateful event processing.

Posted by Michael Noll <mi...@confluent.io>.
Arvind,

your use cases sound very similar to what Bobby Calderwood recently
described and presented at Strange Loop this year:

Commander: Better Distributed Applications through CQRS, Event Sourcing,
and Immutable Logs
https://speakerdeck.com/bobbycalderwood/commander-better-distributed-applications-through-cqrs-event-sourcing-and-immutable-logs
(The link also includes pointers to the recorded talk and example code.)

See also
http://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
for a higher-level introduction to such architectures.

Hope this helps!
Michael


Re: Help related to KStreams and Stateful event processing.

Posted by Michael Noll <mi...@confluent.io>.
PS: Arvind, I think typically such questions should be sent to the
kafka-user mailing list (not to kafka-dev).


Re: Help related to KStreams and Stateful event processing.

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Hi Arvind,

Short answer: Kafka Streams can definitely help you with this!

Long answer: Kafka Streams offers two layers for programming your
stream processing job, the low-level Processor API and the high-level
DSL. Please check the documentation for further details:
http://docs.confluent.io/3.0.1/streams/index.html

With the Processor API you are able to do anything -- at the cost of a
lower level of abstraction and thus more coding. I guess this would be
the best way to program your Kafka Streams application for your use
case.

The DSL is easier to use and provides higher-level abstractions --
however, I am not sure if it covers what you need in your use case. But
maybe it's worth trying it out before falling back to the Processor
API...

For your second question, I would recommend using the Processor API
and attaching a state store to a processor node, then writing to your
data warehouse whenever a state is "complete" (see
http://docs.confluent.io/3.0.1/streams/developer-guide.html#defining-a-state-store).
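
To make this concrete, here is a rough sketch (not tested, just to
illustrate the idea) of what such a processor plus state store could
look like with the Processor API. The topic names, the store name
"pending-orders", and the naive JSON-string merging are all made up
for illustration -- in a real application you would use proper serdes
and a structured value type:

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Buffers the partial state of each order in a local store and forwards
// the merged record once a terminal sub-event (delivered or cancelled)
// arrives.
public class OrderLifecycleProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> pendingOrders;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // the store is attached to this node in the topology wiring below
        this.pendingOrders =
            (KeyValueStore<String, String>) context.getStateStore("pending-orders");
    }

    @Override
    public void process(String orderId, String eventJson) {
        String soFar = pendingOrders.get(orderId);
        // naive merge, for illustration only
        String merged = (soFar == null) ? eventJson : soFar + "," + eventJson;

        if (eventJson.contains("delivered_time") || eventJson.contains("cancelled_time")) {
            // lifecycle complete -> forward downstream (e.g. to a sink topic
            // that feeds your warehouse)
            context.forward(orderId, merged);
            pendingOrders.delete(orderId);
        } else {
            // still waiting for the terminal event
            pendingOrders.put(orderId, merged);
        }
    }

    @Override
    public void punctuate(long timestamp) {
        // could be used to expire orders that never complete
    }

    @Override
    public void close() { }
}

The wiring could then look roughly like this (again, names are made up):

import org.apache.kafka.streams.processor.TopologyBuilder;
import org.apache.kafka.streams.state.Stores;

public class OrderLifecycleTopology {
    public static void main(String[] args) {
        TopologyBuilder topology = new TopologyBuilder();
        topology.addSource("OrderEvents", "order-events")
                .addProcessor("Lifecycle", OrderLifecycleProcessor::new, "OrderEvents")
                // attach the "pending-orders" store to the processor node
                .addStateStore(Stores.create("pending-orders")
                                     .withStringKeys()
                                     .withStringValues()
                                     .persistent()
                                     .build(),
                               "Lifecycle")
                // completed records end up here; e.g. Kafka Connect could
                // drain this topic into the warehouse
                .addSink("CompletedOrders", "completed-orders", "Lifecycle");
        // StreamsConfig and KafkaStreams startup omitted for brevity
    }
}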

One more hint: you can actually mix the DSL and the Processor API
(e.g. by using process() or transform() within the DSL).
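
For example, a rough fragment-only sketch of the DSL route (again with
made-up names; OrderLifecycleTransformer is not shown -- it would be a
Transformer wrapping the same buffer-and-forward logic as the processor
above):

// inside your topology setup, using
// org.apache.kafka.streams.kstream.KStreamBuilder and
// org.apache.kafka.common.serialization.Serdes
KStreamBuilder builder = new KStreamBuilder();

// register the store on the builder so transform() can look it up by name
builder.addStateStore(Stores.create("pending-orders")
                            .withStringKeys()
                            .withStringValues()
                            .persistent()
                            .build());

builder.stream(Serdes.String(), Serdes.String(), "order-events")
       .transform(OrderLifecycleTransformer::new, "pending-orders")
       .to(Serdes.String(), Serdes.String(), "completed-orders");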


Hope this gives you some initial pointers. Please follow up if you have
more questions.

-Matthias

