Posted to issues@flink.apache.org by "Gyula Fora (Jira)" <ji...@apache.org> on 2022/03/01 08:30:00 UTC

[jira] [Commented] (FLINK-26370) Make Flink cluster communication asynchronous

    [ https://issues.apache.org/jira/browse/FLINK-26370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499391#comment-17499391 ] 

Gyula Fora commented on FLINK-26370:
------------------------------------

Thank you [~kelemensanyi] for the thorough assessment of the problem.

I think the original motivation of the ticket was that long sync calls in the main reconcile loop block other operations on a given resource. Since these user-triggered operations should not be too frequent, I don't see this as a huge pain point: as long as we make sure the operator has enough threads, we should be good (basically option 2).

I think your suggestion for option 3 is interesting. It represents the ideal-world scenario where operations execute in the background and progress is tracked through the status of the resource. As you outlined, this is quite a complex mechanism with a number of corner cases to guard against, so we have to decide together whether the added complexity is worth it.
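
As a rough illustration of the background-execution idea (not the actual proposal, and not operator code), a minimal Java sketch could look like the one below. Everything in it is hypothetical: the class name, the OpState enum, the in-memory map standing in for the resource status, and the Supplier standing in for a REST call to the Flink cluster. The only point is that reconcile() returns immediately and later passes read the progress from the status.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from the operator code base. */
class AsyncReconcileSketch {

    /** Progress of a background operation, as it might be mirrored into the resource status. */
    enum OpState { IN_PROGRESS, SUCCEEDED, FAILED }

    // Stand-in for the status sub-resource, keyed by resource name (assumption).
    private final Map<String, OpState> status = new ConcurrentHashMap<>();
    private final ExecutorService pool;

    AsyncReconcileSketch(ExecutorService pool) {
        this.pool = pool;
    }

    /**
     * One reconcile pass for one resource. It never waits for the cluster call:
     * the first pass schedules the call on the pool, later passes only report
     * the state recorded so far.
     */
    OpState reconcile(String resource, Supplier<Boolean> slowClusterCall) {
        OpState previous = status.putIfAbsent(resource, OpState.IN_PROGRESS);
        if (previous != null) {
            // The operation was already triggered; do not trigger it twice.
            return previous;
        }
        CompletableFuture.supplyAsync(slowClusterCall, pool)
            // Exceptions (for example a timeout) and a false result both count as failure.
            .handle((ok, err) ->
                err == null && Boolean.TRUE.equals(ok) ? OpState.SUCCEEDED : OpState.FAILED)
            .thenAccept(result -> status.put(resource, result));
        return OpState.IN_PROGRESS;
    }
}
```

Even this toy version already shows where the corner cases come from: it does not handle retries after FAILED, operator restarts while a call is in flight, or spec changes arriving during an operation, and a real implementation would have to persist the state in the resource status rather than in memory.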

I would love to hear the opinion of others on this now that we have a good description of the problem.

cc [~thw] [~wangyang0918] 

> Make Flink cluster communication asynchronous
> ---------------------------------------------
>
>                 Key: FLINK-26370
>                 URL: https://issues.apache.org/jira/browse/FLINK-26370
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Assignee: Sandor Kelemen
>            Priority: Major
>
> In the current architecture, calls to the Flink clusters (through the REST client) are made synchronously from the reconcile loop.
> These calls often take a long time for various (completely normal) reasons:
>  - Cluster is not ready -> long call + timeout exception
>  - Operation takes a long time -> cancel/savepoint operations are often expected to take seconds/minutes
> Both the observer and reconciler components make these calls.
> We should come up with a way to avoid making these sync calls from the main loop while still preserving the logic of the operator.


