Posted to users@kafka.apache.org by Pierre DOR <pd...@amadeus.com> on 2018/03/01 19:03:36 UTC

Detecting end of distributed batch processing within a streaming graph

Hello all,

This question is not strictly related to Kafka itself but rather to a streaming design built on Kafka. I hope it still falls within the scope of this list.

I would like to distribute the processing of a monolithic batch across a streaming DAG: multiple parallel branches, each branch composed of one or several streaming micro-services performing tasks in sequence, with communication between micro-services going through Kafka.
The problem is to detect, as soon as it happens, that all subtasks have been processed (successfully or not) within the DAG, something that is rather trivial in the monolithic design.

To solve this, I’m thinking about a special “control topic”, responsible for carrying end-of-processing statuses, and a dedicated “controller” micro-service responsible for handling those statuses and for proactively reporting the batch-end event.
Knowing that there will be a process, let’s call it “the feeder”, injecting all batch subtasks into the primary input topic of the DAG, the idea in a nutshell would be:
- To have the feeder first generate a START <batchID> event in the control topic, so that the “controller” micro-service initializes a state for the given batchID.
- To have the feeder then inject all subtasks, in sequence, into the primary input topic.
- To have the feeder inject, right after, a special END <batchID> event into the primary input topic. This END event is special because, instead of being sent over Kafka like any other message, i.e. to a single partition, it would be broadcast to every partition of the topic, and cascaded through the DAG in the same way, so that it is guaranteed to pass through absolutely all partitions (and all consumers) of the global streaming graph (a feeder-side sketch follows after this list).
- Each consumer micro-service receiving this END message (so possibly multiple times) reports to the “control topic” that the END message has been received, together with some IDs (batch ID and partition ID); see the worker sketch below.
- The “controller” micro-service increments a counter each time it receives an END event for a given batch ID. As the total number of partitions in the DAG is relatively static, this number can be pre-configured in the controller. When the number of received END events matches the total number of partitions, the batch has been fully processed; see the controller sketch below.
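To make the END broadcast concrete, here is a minimal feeder-side sketch using the plain Java producer client. The topic names (“batch-control” and “primary-input”), the batch ID and the plain-String payloads are placeholders I made up for illustration; a real feeder would have its own schema and serializers. The relevant part is the last loop, which uses the producer’s partitionsFor() metadata to address every partition of the input topic explicitly:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.PartitionInfo;

public class Feeder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String batchId = "batch-42"; // made-up batch identifier
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Announce the batch so the controller can initialize its state.
            producer.send(new ProducerRecord<>("batch-control", batchId, "START"));

            // 2. Inject the subtasks; these are partitioned by key as usual.
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("primary-input", "task-" + i, "payload-" + i));
            }

            // 3. Broadcast the END marker to every partition of the input topic
            //    by setting the partition number explicitly on each record.
            for (PartitionInfo p : producer.partitionsFor("primary-input")) {
                producer.send(new ProducerRecord<>("primary-input", p.partition(), batchId, "END"));
            }
            producer.flush();
        }
    }
}

Since the partition is set explicitly on the END records, no key-based routing is involved; the key only carries the batch ID for the downstream consumers.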
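On the worker side, a sketch (same made-up topic names, and assuming the END marker is the literal value "END") of how a consumer micro-service could detect the marker and report it to the control topic:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Worker {
    public static void main(String[] args) {
        Properties cprops = new Properties();
        cprops.put("bootstrap.servers", "localhost:9092");
        cprops.put("group.id", "worker-group");
        cprops.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cprops.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "localhost:9092");
        pprops.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pprops.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pprops)) {
            consumer.subscribe(Collections.singletonList("primary-input"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    if ("END".equals(record.value())) {
                        // The record key carries the batch ID; report which partition
                        // the marker was seen on. If there is a downstream topic, the
                        // END marker would also be re-broadcast to all its partitions here.
                        producer.send(new ProducerRecord<>("batch-control", record.key(),
                                "END_SEEN:" + record.topic() + ":" + record.partition()));
                    } else {
                        // normal subtask processing goes here
                    }
                }
            }
        }
    }
}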
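Finally, a sketch of the controller, assuming the total number of partitions across the DAG is pre-configured (the coupling mentioned below) and that workers emit "END_SEEN:<topic>:<partition>" values keyed by batch ID as in the previous sketch:

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class Controller {
    public static void main(String[] args) {
        int expectedEndEvents = 12; // pre-configured: total partitions across the whole DAG
        Map<String, Integer> endCounts = new HashMap<>();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "batch-controller");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("batch-control"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    String batchId = record.key();
                    if ("START".equals(record.value())) {
                        endCounts.put(batchId, 0); // initialize state for this batch
                    } else if (record.value().startsWith("END_SEEN")) {
                        int count = endCounts.merge(batchId, 1, Integer::sum);
                        if (count == expectedEndEvents) {
                            // every partition has seen the END marker: the batch is done
                            System.out.println("Batch " + batchId + " fully processed");
                        }
                    }
                }
            }
        }
    }
}

In a real implementation this state would of course have to survive controller restarts (e.g. a compacted state topic or a Kafka Streams state store) rather than live in an in-memory map.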

What do you think about this design?
Would you have more elegant ideas for solving this problem? I don’t much like the idea of broadcasting messages to all partitions, nor of coupling the configuration to the total number of partitions, even if it seems to work technically.

Many thanks,
- Pierre