You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Mateusz Rataj (Jira)" <ji...@apache.org> on 2021/05/26 12:29:00 UTC

[jira] [Updated] (BEAM-12406) Progressing watermark for not available Kinesis stream

     [ https://issues.apache.org/jira/browse/BEAM-12406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mateusz Rataj updated BEAM-12406:
---------------------------------
    Description: 
We use Dataflow with Apache Beam to read events from Kinesis streams. Recently, we've spotted that in a case when one of the streams was not available in the middle of events processing (due to removal or problem with the credentials), the data watermark for this stream was still being updated.  

Imagine scenario:
 # Permissions allow to read from stream A
 # Data is read from stream A
 # Permissions are changed and don’t allow to read from stream A
 # Watermark for stream A is progressing (but stream data is not read due to permissions issue)
 # Permissions are fixed to read stream A
 # Data is read from stream A but from the updated watermark

As a result, stream data between steps 3-5 is lost and the client doesn’t know that.

Additionally, it may be confusing from the Dataflow console perspective, as it suggests that events are still being read from the stream. It is hard to rely on the watermark as a source metric for alerting purposes as well.

Brief investigation suggests that maybe the _KinesisReader.getWatermark()_ logic doesn’t consider the state of the stream i.e. is it available or not, and it treats the removed stream as a stream without traffic. Watermark calculation should be adjusted to take that information into account.

  was:
We use Dataflow with Apache Beam to read events from Kinesis streams. Recently, we've spotted that in a case when one of the streams was not available in the middle of events processing (due to removal or problem with the credentials), the data watermark for this stream was still being updated. 

 

Imagine scenario:
 # Permissions allow to read from stream A
 # Data is read from stream A
 # Permissions are changed and don’t allow to read from stream A
 # Watermark for stream A is progressing (but stream data is not read due to permissions issue)
 # Permissions are fixed to read stream A
 # Data is read from stream A but from the updated watermark

 

As a result, stream data between steps 3-5 is lost and the client doesn’t know that.

 

Additionally, it may be confusing from the Dataflow console perspective, as it suggests that events are still being read from the stream. It is hard to rely on the watermark as a source metric for alerting purposes as well.

 

Brief investigation suggests that maybe the _KinesisReader.getWatermark()_ logic doesn’t consider the state of the stream i.e. is it available or not, and it treats the removed stream as a stream without traffic. Watermark calculation should be adjusted to take that information into account.


> Progressing watermark for not available Kinesis stream
> ------------------------------------------------------
>
>                 Key: BEAM-12406
>                 URL: https://issues.apache.org/jira/browse/BEAM-12406
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-kinesis
>    Affects Versions: 2.27.0
>            Reporter: Mateusz Rataj
>            Priority: P3
>
> We use Dataflow with Apache Beam to read events from Kinesis streams. Recently, we've spotted that in a case when one of the streams was not available in the middle of events processing (due to removal or problem with the credentials), the data watermark for this stream was still being updated.  
> Imagine scenario:
>  # Permissions allow to read from stream A
>  # Data is read from stream A
>  # Permissions are changed and don’t allow to read from stream A
>  # Watermark for stream A is progressing (but stream data is not read due to permissions issue)
>  # Permissions are fixed to read stream A
>  # Data is read from stream A but from the updated watermark
> As a result, stream data between steps 3-5 is lost and the client doesn’t know that.
> Additionally, it may be confusing from the Dataflow console perspective, as it suggests that events are still being read from the stream. It is hard to rely on the watermark as a source metric for alerting purposes as well.
> Brief investigation suggests that maybe the _KinesisReader.getWatermark()_ logic doesn’t consider the state of the stream i.e. is it available or not, and it treats the removed stream as a stream without traffic. Watermark calculation should be adjusted to take that information into account.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)