Posted to user@beam.apache.org by "Sunny, Mani Kolbe" <Su...@DNB.com> on 2020/06/24 12:04:41 UTC

How to avoid data loss during streaming stops

Hello,

We are developing a Beam pipeline which runs on the SparkRunner in streaming mode. The pipeline reads from Kinesis, does some translation and filtering, and finally writes to S3 using the AvroIO writer. We are using fixed windows with triggers based on element count and processing-time intervals. The output path is partitioned by window start timestamp, and allowedLateness=0sec.
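
For context, here is a minimal sketch of the pipeline shape. The stream name, window size, trigger thresholds, record type and output path are placeholders rather than our real values, and AWS credential/region setup is omitted:

import java.io.Serializable;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.kinesis.KinesisIO;
import org.apache.beam.sdk.io.kinesis.KinesisRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;

public class KinesisToS3 {

  /** Stand-in for the real Avro record type the pipeline writes. */
  public static class Event implements Serializable {
    public String payload;
    public Event() {}
    public Event(String payload) { this.payload = payload; }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadKinesis",
            KinesisIO.read()
                .withStreamName("my-stream")                 // placeholder stream name
                .withInitialPositionInStream(InitialPositionInStream.LATEST))
                // AWS client/credentials configuration omitted in this sketch
        .apply("TranslateAndFilter",                         // stand-in for the real translation/filter ParDos
            MapElements.into(TypeDescriptor.of(Event.class))
                .via((KinesisRecord r) ->
                    new Event(new String(r.getDataAsBytes(), StandardCharsets.UTF_8))))
        .apply("Window",
            Window.<Event>into(FixedWindows.of(Duration.standardMinutes(5)))
                .triggering(Repeatedly.forever(AfterFirst.of(
                    AfterPane.elementCountAtLeast(10_000),
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1)))))
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes())
        .apply("WriteAvro",
            AvroIO.write(Event.class)
                .to("s3://my-bucket/output/part")  // placeholder; window-start partitioning uses a custom FilenamePolicy
                .withWindowedWrites()
                .withNumShards(1));

    p.run().waitUntilFinish();
  }
}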

This is working fine. But as we productionize it, we need to ensure there is no data loss during failures or when we stop the stream for planned downtime, etc. We are worried that already-read-but-not-yet-processed records will be lost during such events. Essentially we need a way to pause reading from the source, let the pipeline drain the records it has already read, perform some maintenance, and then resume streaming. Alternatively, we could move the Kinesis checkpointing to after the "processing" ParDos. Is there any way to implement either of these?
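
One option we have been looking at (only a sketch, and we are not sure it is sufficient) is to hand the SparkRunner a provided SparkContext with Spark's graceful-shutdown flag enabled, in the hope that already-received batches are processed before the streaming context stops:

import org.apache.beam.runners.spark.SparkContextOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class GracefulStopLauncher {
  public static void main(String[] args) {
    // Master/deploy settings are expected to come from spark-submit.
    SparkConf conf = new SparkConf()
        .setAppName("kinesis-to-s3")
        // Ask Spark Streaming to finish already-received batches before stopping the context.
        .set("spark.streaming.stopGracefullyOnShutdown", "true");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    SparkContextOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkContextOptions.class);
    options.setRunner(SparkRunner.class);
    options.setStreaming(true);
    options.setUsesProvidedSparkContext(true);
    options.setProvidedSparkContext(jsc);

    Pipeline p = Pipeline.create(options);
    // ... build the Kinesis -> window -> AvroIO pipeline here (as in the sketch above) ...
    p.run().waitUntilFinish();
  }
}

We are not sure this actually covers the Kinesis checkpointing side, so pointers to a better approach would be much appreciated.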

Regards,
Mani