Posted to github@beam.apache.org by "kvudata (via GitHub)" <gi...@apache.org> on 2023/07/19 20:39:16 UTC

[GitHub] [beam] kvudata commented on issue #27156: [Bug]: unexpected duplicate outputs triggered by quota exceeded exception in finish_bundle()

kvudata commented on issue #27156:
URL: https://github.com/apache/beam/issues/27156#issuecomment-1642724845

   Yes, we are using TFRecordIO (specifically `beam.io.WriteToTFRecord`).
   
   > it sounds like your process and/or finish_bundle methods may be not idempotent. Idempotency is an important consideration when writing IO code.
   
   Note that we're only logging to Stackdriver aka Google Cloud Logging in our DoFn (we gather logs in `process()` and then log the batch in `finish_bundle()`), and we're ok with duplicate logs being generated. We're observing that Dataflow seems to perform some kind of retry if our `finish_bundle()` fails, and the later `WriteToTFRecord` (which doesn't depend on this DoFn in our pipeline) ends up writing duplicates.
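   
   For context, our DoFn follows the usual batching pattern: buffer entries in `process()`, then flush the batch in `finish_bundle()`. A minimal self-contained sketch of that pattern (plain Python, without the Beam dependency; the class name `LogBatchingDoFn` and the `send_batch` sink are hypothetical, standing in for our Cloud Logging call):
   
   ```python
   class LogBatchingDoFn:
       """Sketch of a DoFn that buffers log entries in process() and
       flushes them as a single batch in finish_bundle().

       If send_batch() raises (e.g. a quota-exceeded error), the runner
       fails and retries the whole bundle, so process() runs again for
       every element in it. Any side effects in the bundle must
       therefore tolerate replay.
       """

       def __init__(self, send_batch):
           # send_batch is a hypothetical sink, e.g. a Cloud Logging client call
           self._send_batch = send_batch
           self._buffer = []

       def start_bundle(self):
           self._buffer = []

       def process(self, element):
           self._buffer.append(f"processed {element}")
           yield element  # pass the element through unchanged

       def finish_bundle(self):
           if self._buffer:
               self._send_batch(self._buffer)  # a failure here fails the bundle
               self._buffer = []
   ```
   
   In our case the duplicates we are OK with would be the log entries from a replayed bundle; what we did not expect is the retry also producing duplicate records in the unrelated `WriteToTFRecord` output.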
   
   Another interesting observation: the element counts (items processed / written) shown in the Dataflow UI match what we would expect if no duplicates were written; the duplicates only become apparent when inspecting the output TFRecord file(s).

