You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Aljoscha Krettek (Jira)" <ji...@apache.org> on 2020/12/17 08:31:00 UTC

[jira] [Commented] (FLINK-20625) Refactor Google Cloud PubSub Source in accordance with FLIP-27

    [ https://issues.apache.org/jira/browse/FLINK-20625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250898#comment-17250898 ] 

Aljoscha Krettek commented on FLINK-20625:
------------------------------------------

Does our current PubSub source also only support at-least-once delivery? I'm too lazy to look into it now, to see if we maybe to custom deduplication after the source. 😅

> Refactor Google Cloud PubSub Source in accordance with FLIP-27
> --------------------------------------------------------------
>
>                 Key: FLINK-20625
>                 URL: https://issues.apache.org/jira/browse/FLINK-20625
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Google Cloud PubSub
>            Reporter: Jakob Edding
>            Priority: Major
>
> The Source implementation for Google Cloud Pub/Sub should be refactored in accordance with [FLIP-27: Refactor Source Interface|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95653748].
> *Split Enumerator*
> Pub/Sub doesn't expose any partitions to consuming applications. Therefore, the implementation of the Pub/Sub Split Enumerator won't do any real work discovery. Instead, a static Source Split is handed to Source Readers which request a Source Split. This static Source Split merely contains details about the connection to Pub/Sub and the concrete Pub/Sub subscription to use but no Split-specific information like partitions/offsets because this information can not be obtained.
> *Source Reader*
> A Source Reader will use Pub/Sub's pull mechanism to read new messages from the Pub/Sub subscription specified in the SourceSplit. In the case of parallel-running Source Readers in Flink, every Source Reader will be passed the same Source Split from the Enumerator. Because of this, all Source Readers use the same connection details and the same Pub/Sub subscription to receive messages. In this case, Pub/Sub will automatically load-balance messages between all Source Readers pulling from the same subscription. This way, parallel processing can be achieved in the Source.
> *At-least-once guarantee*
> Pub/Sub itself guarantees at-least-once message delivery so it is the goal to keep up this guarantee in the Source as well. A mechanism that can be used to achieve this is that Pub/Sub expects a message to be acknowledged by the subscriber to signal that the message has been consumed successfully. Any message that has not been acknowledged yet will be automatically redelivered by Pub/Sub once an ack deadline has passed.
> After a certain time interval has elapsed...
>  # all pulled messages are checkpointed in the Source Reader
>  # same messages are acknowledged to Pub/Sub
>  # same messages are forwarded to downstream Flink tasks
> This should ensure at-least-once delivery in the Source because in the case of failure, non-checkpointed messages have not yet been acknowledged and will therefore be redelivered.
> Because of the static Source Split, it appears like checkpointing is not necessary in the Split Enumerator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)