Posted to users@kafka.apache.org by Pavel Molchanov <pa...@infodesk.com> on 2019/05/08 17:06:23 UTC

Event Sourcing question

I have an architectural question.

I am planning to create a data pipeline for document transformation. Each
component will send processing events to the Kafka 'events' topic.

It will have the following steps:

1) Upload the data to the repository (S3 or other storage). Get a public URL
to the uploaded document. Create a 'received' event with the document URL and
send it to the Kafka 'events' topic.

2) A Transformer process will be listening to the Kafka 'events' topic. It
will react to each 'received' event, download the document, transform it,
push the transformed document to the repository (S3 or other storage), create
a 'transformed' event, and send it to the same 'events' topic.
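
Roughly, the producer side of these steps would look something like the sketch
below (the broker address, the JSON payload shape, and keying by document id
are placeholders -- nothing is decided yet):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReceivedEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");               // placeholder broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key by document id so all events for one document land in the
                // same partition and therefore keep their order.
                String documentId = "doc-123";                               // hypothetical id
                String receivedEvent = "{\"type\":\"received\","
                        + "\"documentId\":\"" + documentId + "\","
                        + "\"url\":\"https://storage.example.com/doc-123.pdf\"}";
                producer.send(new ProducerRecord<>("events", documentId, receivedEvent));
                producer.flush();
            }
        }
    }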

The Transformer process can break in the middle (exception, crash, etc.).
Upon startup, it needs to check the 'events' topic for documents that were
received but not transformed.

Should it read all events from the 'events' topic? Should it join
'received' and 'transformed' events somehow to understand what was received
but not transformed?

I don't have a clear idea of how it should behave.

Please help.

*Pavel Molchanov*

Re: Event Sourcing question

Posted by Pavel Molchanov <pa...@infodesk.com>.
Ryanne,

Thank you for the advice. It's exactly what we have now: different topics
for different processing steps.

However, I would like to build an extensible architecture with multiple
processing steps.

We will introduce other transformers later on. I was thinking that if all
of them send events on the same bus, it will be easier to replay the event
sequence later.

Say I will have 4 more steps later on in the pipeline. Should I create 4
more topics, or should all the other transformers listen to the same bus
and pick up the events they need?

*Pavel Molchanov*
Director of Software Development
InfoDesk
www.infodesk.com


Re: Event Sourcing question

Posted by Raman Gupta <ro...@gmail.com>.
If ordering of these events is important, then putting them in the
same topic is not only desired, it's necessary. See
https://www.confluent.io/blog/put-several-event-types-kafka-topic/.
However, think hard about whether ordering is actually important in
your use case or not, as things are certainly simpler when a topic
contains a single message type.

To the original question: your transformer process can run as a Kafka Streams
app, as Ryanne suggests, which should take care of most crash situations --
Kafka won't advance the stream's consumption offset unless your
transformer completes successfully. Make the transformer idempotent,
i.e. if the transformed record already exists in S3, it should just
overwrite it. If you do put both types of events in the same topic,
then the transformer stream can skip "transformed" events and just
execute its transformation on "received" events.
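
A rough sketch of that single-topic variant with Kafka Streams (the JSON type
check, the helper method, and keying by document id are my assumptions, not
something from your pipeline):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class SingleTopicTransformer {

        // Consumes the shared 'events' topic, acts only on "received" events,
        // and writes the resulting "transformed" event back to the same topic.
        // The filter is also what keeps the app from re-processing its own output.
        static void buildTopology(StreamsBuilder builder) {
            builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
                   .filter((documentId, json) -> json.contains("\"type\":\"received\""))
                   // Hypothetical, idempotent helper: download the document from the
                   // URL in the event, transform it, upload the result, and return
                   // the "transformed" event payload.
                   .mapValues(SingleTopicTransformer::transformDocument)
                   .to("events", Produced.with(Serdes.String(), Serdes.String()));
        }

        private static String transformDocument(String receivedEvent) {
            return receivedEvent.replace("\"received\"", "\"transformed\""); // placeholder
        }
    }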

Think hard about how you want to handle failures in the transformer
stream. Some failures are unrecoverable, and you don't want the stream
process to die when they occur; otherwise it won't be able to make
progress past the failing message. For these cases, you'll likely want
to send the data to a dead-letter queue or something similar, so you
can examine why the failure occurred, determine if it is fixable, and
reprocess the event if it is. In our case we actually don't send the
original data to the DLQ, but rather just the metadata of the failing
message (topic, partition, offset) and error information. For
temporary failures, e.g. S3 being unavailable, OutOfMemory errors, and so
forth, it's OK for the stream to die, so that the process will start
over at the current offset and retry. This is a great intro to DLQs
with Kafka: https://eng.uber.com/reliable-reprocessing/.
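
To make the metadata-only DLQ idea concrete, the publisher could be as simple
as this (the 'events.dlq' topic name and the JSON layout are made up; the
point is that only the coordinates and the error go to the DLQ, not the
payload):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DeadLetterPublisher {

        private final KafkaProducer<String, String> producer;

        public DeadLetterPublisher(KafkaProducer<String, String> producer) {
            this.producer = producer;
        }

        // Called when a record fails with an unrecoverable error; forwards only
        // the coordinates of the failing message plus the error, so it can be
        // inspected and reprocessed later if the cause turns out to be fixable.
        public void publish(String topic, int partition, long offset, Exception error) {
            String entry = String.format(
                    "{\"topic\":\"%s\",\"partition\":%d,\"offset\":%d,\"error\":\"%s\"}",
                    topic, partition, offset, error.getMessage());
            producer.send(new ProducerRecord<>("events.dlq", entry));
        }
    }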

Finally, if you need an aggregated "status" view, you can use KTables
to aggregate all the event information together.
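
For example (a sketch only, assuming events are keyed by document id and
carried as JSON strings):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;

    public class DocumentStatusView {

        // Builds a table keyed by document id whose value is the latest event seen
        // for that document; the type of that event ("received", "transformed", ...)
        // is effectively the document's current status.
        static KTable<String, String> buildStatusTable(StreamsBuilder builder) {
            return builder
                    .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
                    .groupByKey()
                    .reduce((previousEvent, latestEvent) -> latestEvent);
        }
    }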

Regards,
Raman

Re: Event Sourcing question

Posted by Ryanne Dolan <ry...@gmail.com>.
Pavel, one thing I'd recommend: don't jam multiple event types into a
single topic. You are better served with multiple topics, each with a
single schema and event type. In your case, you might have a received topic
and a transformed topic, with an app consuming received and producing
transformed.

If your transformer process consumes, produces, and commits in the right
order, your app can crash and restart without skipping records. Consider
using Kafka Streams for this purpose, as it takes care of the semantics you
need to do this correctly.
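
A rough sketch of that shape (the application id, broker address, and the
transform step are placeholders):

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class ReceivedToTransformedApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "received-to-transformed"); // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // placeholder

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("received", Consumed.with(Serdes.String(), Serdes.String()))
                   // Placeholder for the real work: download, convert, upload,
                   // and build the transformed event payload.
                   .mapValues(ReceivedToTransformedApp::transform)
                   .to("transformed", Produced.with(Serdes.String(), Serdes.String()));

            // Streams commits the source offset only after the corresponding output
            // has been handed off, so a crash means reprocessing, not skipping.
            new KafkaStreams(builder.build(), props).start();
        }

        private static String transform(String receivedEvent) {
            return receivedEvent; // placeholder transformation
        }
    }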

Ryanne
