Posted to users@kafka.apache.org by Tim Gent <ti...@gmail.com> on 2019/03/22 16:50:16 UTC

Tracking progress for messages generated by a batch process

Hi all,

We have a data processing system where a daily batch process publishes
some data to a Kafka topic. This then goes through several other
components that enrich the data; these are also integrated via Kafka.
So overall we have something like:

Batch job -> topic A -> streaming app 2 -> topic B -> streaming app 3

We would like to know when all the data generated onto topic A finally
gets processed by streaming app 3, as we may trigger some other
processes from this (e.g. notifying customers their data is processed
for that day). We've come up with a possible solution, and it would be
great to get feedback to see what we missed.

Assumptions:
- Consumers all track their offsets using Kafka, committing once
they've done all required processing for a message
- We have some "batch-monitor" component which will track progress,
described below
- We don't need to know exactly when the batch finished processing;
finding out shortly afterwards is good enough

Broad flow:
- Batch job reads some input data and publishes output to topic A
- Batch job sends data to our "batch-monitor" component about the
offsets on each partition at the time it finishes its processing
- "batch-monitor" subscribes to the topic containing the committed
offsets for topic A for streaming app 2 consumer
- "batch-monitor" can therefore see when streaming app 2 has committed
all the offsets that were in the batch
- Once "batch-monitor" detects that streaming app 2 has finished its
processing for the batch, it records the max offsets on all partitions
of topic B. These can be used to know when streaming app 3 has
finished processing the batch
- "batch-monitor" subscribes to the topic containing the committed
offsets for topic B for streaming app 3 consumer
- "batch-monitor" can therefore see when streaming app 3 has committed
all the offsets that were in the batch
- Once that happens "batch-monitor" can send some notification somewhere else
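
To make the flow above concrete, here is a minimal sketch of the two-stage
check in Python. It is only an illustration under some assumptions: the names
(batch_done, BatchMonitor, on_app2_commit, on_app3_commit) are hypothetical,
offsets are passed in as plain dicts rather than read from Kafka, and "end
offset" means the last batch offset + 1, which matches how committed offsets
point at the next message to consume.

```python
from typing import Dict, Optional

# Hypothetical alias: partition number -> offset.
Offsets = Dict[int, int]


def batch_done(batch_end: Offsets, committed: Offsets) -> bool:
    """True once every partition's committed offset has reached batch_end.

    Assumes batch_end holds, per partition, the offset just past the last
    batch message. A partition with no commit yet is treated as offset 0.
    """
    return all(committed.get(p, 0) >= end for p, end in batch_end.items())


class BatchMonitor:
    """Tracks one batch through topic A (app 2) and topic B (app 3)."""

    def __init__(self, topic_a_end: Offsets) -> None:
        # Reported by the batch job when it finishes publishing.
        self.topic_a_end: Offsets = dict(topic_a_end)
        # Snapshot of topic B end offsets, taken once app 2 has caught up.
        self.topic_b_end: Optional[Offsets] = None
        self.notified = False

    def on_app2_commit(self, committed_a: Offsets, current_b_end: Offsets) -> None:
        # Take the topic B snapshot only once, when app 2 first catches up.
        if self.topic_b_end is None and batch_done(self.topic_a_end, committed_a):
            self.topic_b_end = dict(current_b_end)

    def on_app3_commit(self, committed_b: Offsets) -> bool:
        """Returns True exactly once, when the notification should be sent."""
        if self.topic_b_end is None or self.notified:
            return False
        if batch_done(self.topic_b_end, committed_b):
            self.notified = True
            return True
        return False
```

The topic B snapshot is only a safe upper bound if streaming app 2 has
finished writing its output before it commits its offsets for topic A; the
sketch takes that as given.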

Any thoughts gratefully received

Tim

Re: Tracking progress for messages generated by a batch process

Posted by Harper Henn <ha...@datto.com>.
Assuming you know how many items are in a batch ahead of time, could you
just add a batch ID and position of a message within a batch to each
message you send to topic A? Then your end application (streaming app 3)
could check if every message in that batch has been processed, and trigger
events when that condition is true. This would require some kind of
tracking (perhaps in another database), but would get rid of the need for a
batch monitoring program that tracks offsets.
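
As a rough sketch of that tracking (hypothetical names, with an in-memory
dict standing in for the database), streaming app 3 could record each
message's batch ID and position and check whether the batch is complete.
Storing positions in a set means a redelivered message is not counted twice:

```python
from typing import Dict, Hashable, Set


class BatchTracker:
    """Remembers which positions of each batch have been processed."""

    def __init__(self) -> None:
        # batch_id -> set of positions seen so far (stand-in for a database)
        self._seen: Dict[Hashable, Set[int]] = {}

    def record(self, batch_id: Hashable, position: int, batch_size: int) -> bool:
        """Record one processed message; True once the whole batch is seen."""
        seen = self._seen.setdefault(batch_id, set())
        seen.add(position)
        return len(seen) == batch_size
```

This assumes every message carries the batch size (or it is known some
other way) and that positions run from 0 to batch_size - 1 with no gaps.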

Kafka seems like an awkward fit for batch processing. Is it possible
there's another datastore that's better suited for your use case?

Re: Tracking progress for messages generated by a batch process

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Sounds reasonable to me.

-Matthias
