Posted to issues@beam.apache.org by "Jeff Webb (Jira)" <ji...@apache.org> on 2021/09/14 22:50:00 UTC

[jira] [Updated] (BEAM-9354) How long does PubSubIO message deduplication last?

     [ https://issues.apache.org/jira/browse/BEAM-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Webb updated BEAM-9354:
----------------------------
    Status: Open  (was: Triage Needed)

> How long does PubSubIO message deduplication last?
> --------------------------------------------------
>
>                 Key: BEAM-9354
>                 URL: https://issues.apache.org/jira/browse/BEAM-9354
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Tianzi Cai
>            Priority: P3
>              Labels: gcp, pubsubio
>
> GCP documentation heavily [promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub] Beam's PubSubIO for Pub/Sub message deduplication. Yet nothing in the documentation, including the Javadoc in the [source code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java], tells users how long this deduplication is supposed to last.
> In [`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
> {code:java}
>     /**
>      * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
>      * message attributes, specifies the name of the attribute containing the unique identifier. The
>      * value of the attribute can be any string that uniquely identifies this record.
>      *
>      * <p>Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
>      * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
>      * delivered, and deduplication of the stream will be strictly best effort.
>      */
>     public Read<T> withIdAttribute(String idAttribute) {
>       return toBuilder().setIdAttribute(idAttribute).build();
>     }
> {code}
> This Javadoc alone does not tell users whether a second message, published with the same custom {{idAttribute}} value as a first message that was published `x` minutes earlier, would be deduplicated by the Dataflow runner.
> Better documentation will help. I imagine a lot of users will wonder about this and may even ask how to configure this period, but that will probably need a separate ticket.
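
To make the question above concrete, here is a minimal, self-contained sketch of time-bounded deduplication keyed on an ID attribute. It is an illustration only and is not Beam's or the Dataflow runner's implementation. The class name `IdWindowDeduplicator`, the method `shouldEmit`, and the 10-minute window are all hypothetical; the real retention period is exactly what this ticket asks to have documented.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustration only: NOT how Beam or Dataflow deduplicates. All names are hypothetical. */
class IdWindowDeduplicator {
  // Hypothetical retention period; the real value is the open question in this ticket.
  private final Duration window;
  // Maps each ID attribute value to the time it was first seen.
  private final Map<String, Instant> firstSeen = new HashMap<>();

  IdWindowDeduplicator(Duration window) {
    this.window = window;
  }

  /** Returns true if the message should be emitted, false if it is dropped as a duplicate. */
  boolean shouldEmit(String id, Instant now) {
    // Forget every ID whose window (measured from first sighting) has expired.
    firstSeen.values().removeIf(seenAt -> !seenAt.plus(window).isAfter(now));
    // putIfAbsent returns null only when the ID was not already being tracked.
    return firstSeen.putIfAbsent(id, now) == null;
  }

  public static void main(String[] args) {
    IdWindowDeduplicator dedup = new IdWindowDeduplicator(Duration.ofMinutes(10));
    Instant t0 = Instant.parse("2021-09-14T22:00:00Z");

    // First message with this ID is emitted.
    System.out.println(dedup.shouldEmit("order-42", t0)); // prints "true"
    // Same ID 5 minutes later falls inside the window and is dropped.
    System.out.println(dedup.shouldEmit("order-42", t0.plus(Duration.ofMinutes(5)))); // prints "false"
    // Same ID 15 minutes later falls outside the window and is emitted again.
    System.out.println(dedup.shouldEmit("order-42", t0.plus(Duration.ofMinutes(15)))); // prints "true"
  }
}
```

Under these assumptions, whether the second message is dropped depends entirely on how `x` compares with the window length, which is why users need that period to be documented.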


