Posted to issues@beam.apache.org by "Beam JIRA Bot (Jira)" <ji...@apache.org> on 2020/08/10 17:07:12 UTC

[jira] [Commented] (BEAM-9354) How long does PubSubIO message deduplication last?

    [ https://issues.apache.org/jira/browse/BEAM-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17174531#comment-17174531 ] 

Beam JIRA Bot commented on BEAM-9354:
-------------------------------------

This issue is P2 but has been unassigned without any comment for 60 days so it has been labeled "stale-P2". If this issue is still affecting you, we care! Please comment and remove the label. Otherwise, in 14 days the issue will be moved to P3.

Please see https://beam.apache.org/contribute/jira-priorities/ for a detailed explanation of what these priorities mean.


> How long does PubSubIO message deduplication last?
> --------------------------------------------------
>
>                 Key: BEAM-9354
>                 URL: https://issues.apache.org/jira/browse/BEAM-9354
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Tianzi Cai
>            Priority: P2
>              Labels: gcp, pubsubio, stale-P2
>
> GCP documentation heavily [promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub] Beam's PubSubIO for Pub/Sub message deduplication. Yet nothing in the documentation or in the [source code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java] tells users how long this deduplication is supposed to last.
> In [`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
> {code:java}
>     /**
>      * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
>      * message attributes, specifies the name of the attribute containing the unique identifier. The
>      * value of the attribute can be any string that uniquely identifies this record.
>      *
>      * <p>Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
>      * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
>      * delivered, and deduplication of the stream will be strictly best effort.
>      */
>     public Read<T> withIdAttribute(String idAttribute) {
>       return toBuilder().setIdAttribute(idAttribute).build();
>     }
> {code}
> This Javadoc isn't enough for users to know whether a second message, published with the same custom `idAttribute` value as a first message published `x` minutes earlier, would be deduplicated by the Dataflow runner.
> Better documentation will help. I imagine a lot of users will wonder about this and may even ask how to configure this period, but that will probably need a separate ticket.
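
To make the question concrete, here is a minimal sketch of ID-based deduplication with a bounded retention window. It is not Beam's or Dataflow's implementation, and nothing in this issue states what the actual window is. The class name `TimeBoundedDeduper` and the `retentionMillis` parameter are hypothetical, chosen only to show why the answer depends on how long IDs are remembered.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only: deduplicates by ID within a fixed retention window.
 * This is NOT how Beam or Dataflow is implemented; the window length is a
 * hypothetical parameter, not a documented value.
 */
final class TimeBoundedDeduper {
  private final long retentionMillis;
  // Maps each ID to the time (in milliseconds) at which it was first seen.
  private final Map<String, Long> firstSeen = new HashMap<>();

  TimeBoundedDeduper(long retentionMillis) {
    if (retentionMillis <= 0) {
      throw new IllegalArgumentException("retentionMillis must be positive");
    }
    this.retentionMillis = retentionMillis;
  }

  /** Returns true if {@code id} was already seen within the retention window. */
  boolean isDuplicate(String id, long nowMillis) {
    // Forget IDs whose retention window has elapsed.
    firstSeen.values().removeIf(seenAt -> nowMillis - seenAt >= retentionMillis);
    // putIfAbsent returns the earlier timestamp, or null if the ID was not tracked.
    return firstSeen.putIfAbsent(id, nowMillis) != null;
  }
}
```

With an assumed 10-minute window, a second message carrying the same ID 5 minutes after the first is reported as a duplicate, while one arriving 15 minutes after the first is treated as new. That difference is what the issue asks the documentation to spell out.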



--
This message was sent by Atlassian Jira
(v8.3.4#803005)