Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 15:22:12 UTC

[GitHub] [beam] damccorm opened a new issue, #20056: How long does PubSubIO message deduplication last?

damccorm opened a new issue, #20056:
URL: https://github.com/apache/beam/issues/20056

   GCP documentation heavily [promotes](https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub) Beam's PubSubIO for Pub/Sub message deduplication. Yet nothing in the documentation, including the [source code](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java), tells users how long this deduplication is supposed to last. 
   
   In [`PubsubIO.java`](https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853):
   ```
       /**
        * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
        * message attributes, specifies the name of the attribute containing the unique identifier. The
        * value of the attribute can be any string that uniquely identifies this record.
        *
        * <p>Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
        * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
        * delivered, and deduplication of the stream will be strictly best effort.
        */
       public Read<T> withIdAttribute(String idAttribute) {
         return toBuilder().setIdAttribute(idAttribute).build();
       }
   ```
   
   This is not enough for users to know whether a second message, published with the same custom `idAttribute` value as a first message published `x` minutes earlier, would be deduplicated by the Dataflow runner. 
   
   Better documentation will help. I imagine a lot of users will wonder about this and may even ask how to configure this period, but that will probably need a separate ticket.
   
   Imported from Jira [BEAM-9354](https://issues.apache.org/jira/browse/BEAM-9354). Original Jira may contain additional context.
   Reported by: tianzi.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] anguillanneuf commented on issue #20056: How long does PubSubIO message deduplication last?

Posted by GitBox <gi...@apache.org>.
anguillanneuf commented on issue #20056:
URL: https://github.com/apache/beam/issues/20056#issuecomment-1148009357

   Thanks @damccorm. I know I reported this issue. The [documentation](https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#efficient_deduplication) has since been updated.
   
   > If PubsubIO is configured to use the Pub/Sub message attribute for deduplication instead of the message ID, Dataflow deduplicates messages published to Pub/Sub within 10 minutes of each other.
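   
   To make the quoted behavior concrete, here is a minimal sketch of a 10-minute deduplication window in plain Java. It is only a toy model and not how Dataflow implements deduplication: the class `DedupWindowSketch` and its `accept` method are hypothetical names, and it assumes that messages arrive in publish-time order, that each message is compared against the last accepted message with the same ID, and that a gap of exactly 10 minutes is no longer a duplicate. None of those details are confirmed by the documentation.
   
   ```java
   import java.time.Duration;
   import java.time.Instant;
   import java.util.HashMap;
   import java.util.Map;

   /**
    * Toy model of a time-bounded deduplication window.
    * NOT Dataflow's implementation; all names here are hypothetical.
    */
   final class DedupWindowSketch {
     // Window length taken from the documentation quoted above.
     static final Duration WINDOW = Duration.ofMinutes(10);

     // Publish time of the last accepted message for each ID attribute value.
     private final Map<String, Instant> lastAccepted = new HashMap<>();

     /**
      * Returns true if the message is emitted, false if it is treated as a duplicate.
      * Assumption: a message is a duplicate when an accepted message with the same ID
      * was published less than WINDOW earlier, and messages arrive in publish-time order.
      */
     boolean accept(String id, Instant publishTime) {
       Instant previous = lastAccepted.get(id);
       if (previous != null
           && Duration.between(previous, publishTime).compareTo(WINDOW) < 0) {
         return false; // Same ID inside the window: dropped.
       }
       lastAccepted.put(id, publishTime);
       return true;
     }

     public static void main(String[] args) {
       DedupWindowSketch dedup = new DedupWindowSketch();
       Instant t0 = Instant.parse("2022-06-04T15:00:00Z");
       System.out.println(dedup.accept("order-1", t0));                      // true
       System.out.println(dedup.accept("order-1", t0.plusSeconds(5 * 60)));  // false
       System.out.println(dedup.accept("order-1", t0.plusSeconds(15 * 60))); // true
     }
   }
   ```
   
   In this model, a message republished with the same ID 5 minutes later is dropped, while one republished 15 minutes later is delivered again, which is the practical consequence users should plan for when choosing an ID attribute.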
   
   Please feel free to close this bug.




[GitHub] [beam] damccorm closed issue #20056: How long does PubSubIO message deduplication last?

Posted by GitBox <gi...@apache.org>.
damccorm closed issue #20056: How long does PubSubIO message deduplication last?
URL: https://github.com/apache/beam/issues/20056

