Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/05/10 13:10:22 UTC

[GitHub] [flink-kubernetes-operator] gyfora opened a new pull request, #200: [FLINK-27495] Observe last savepoint status directly from cluster

gyfora opened a new pull request, #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200

   This improves the savepoint tracking logic by observing the last savepoint/checkpoint directly from the cluster for terminal job states.
   
   This is required for the correct functioning of LAST_STATE and SAVEPOINT upgrades in terminal states:
    - Job fatally failed
    - Job failed
    - Operator got restarted during a savepoint upgrade
   
   It also requires users to set a checkpoint directory for the savepoint/last-state upgrade modes, and it enables the externalized checkpoints config by default. The checkpoint directory is necessary to get externally addressable checkpoints.
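
As a rough illustration of the configuration requirement described above, the sketch below uses a plain `Map<String, String>` in place of Flink's `Configuration`. The class `CheckpointConfigSketch`, its methods `applyDefaults` and `validate`, and the local `UpgradeMode` enum are hypothetical names chosen for this example and are not taken from the PR. The two keys are the standard Flink options for the checkpoint directory and externalized checkpoint retention; the default value `RETAIN_ON_CANCELLATION` is an assumption about what "externalized checkpoints by default" would mean in practice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: a plain Map stands in for Flink's Configuration. */
class CheckpointConfigSketch {

    // Standard Flink option keys.
    static final String CHECKPOINT_DIR = "state.checkpoints.dir";
    static final String EXTERNALIZED_RETENTION =
            "execution.checkpointing.externalized-checkpoint-retention";

    /** Local stand-in for the upgrade modes discussed in the thread. */
    enum UpgradeMode { STATELESS, SAVEPOINT, LAST_STATE }

    /** Returns a copy of the user config with externalized checkpoints enabled unless the user set a value. */
    static Map<String, String> applyDefaults(Map<String, String> userConf) {
        Map<String, String> conf = new HashMap<>(userConf);
        // Assumed default value; a user-provided setting is left untouched.
        conf.putIfAbsent(EXTERNALIZED_RETENTION, "RETAIN_ON_CANCELLATION");
        return conf;
    }

    /** Returns an error message if a stateful upgrade mode has no checkpoint directory configured. */
    static Optional<String> validate(UpgradeMode mode, Map<String, String> conf) {
        boolean stateful = mode == UpgradeMode.SAVEPOINT || mode == UpgradeMode.LAST_STATE;
        String dir = conf.get(CHECKPOINT_DIR);
        if (stateful && (dir == null || dir.isBlank())) {
            return Optional.of(
                    "Checkpoint directory (" + CHECKPOINT_DIR + ") must be set for " + mode + " upgrades");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> conf =
                applyDefaults(Map.of(CHECKPOINT_DIR, "file:///flink-data/checkpoints"));
        System.out.println(conf.get(EXTERNALIZED_RETENTION));
        System.out.println(validate(UpgradeMode.LAST_STATE, Map.of()).orElse("ok"));
    }
}
```

Returning an `Optional<String>` error rather than throwing keeps the check easy to compose with other validation steps; the actual operator may well structure this differently.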


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] morhidi commented on a diff in pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
morhidi commented on code in PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#discussion_r872076090


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java:
##########
@@ -528,6 +538,71 @@ public void triggerSavepoint(
         }
     }
 
+    public Optional<Savepoint> getLastCheckpoint(JobID jobId, Configuration conf) throws Exception {
+        try (RestClusterClient<String> clusterClient =
+                (RestClusterClient<String>) getClusterClient(conf)) {
+
+            var headers = CustomCheckpointingStatisticsHeaders.getInstance();
+            var params = headers.getUnresolvedMessageParameters();
+            params.jobPathParameter.resolve(jobId);
+
+            CompletableFuture<CheckpointHistoryWrapper> response =
+                    clusterClient.sendRequest(headers, params, EmptyRequestBody.getInstance());
+
+            var checkpoints =
+                    response.get(
+                            configManager
+                                    .getOperatorConfiguration()
+                                    .getFlinkClientTimeout()
+                                    .getSeconds(),
+                            TimeUnit.SECONDS);
+
+            var latestCheckpointOpt =
+                    checkpoints.getHistory().stream()
+                            .filter(
+                                    cp ->
+                                            CheckpointStatsStatus.valueOf(
+                                                            cp.get(
+                                                                            CheckpointStatistics
+                                                                                    .FIELD_NAME_STATUS)
+                                                                    .asText())
+                                                    == CheckpointStatsStatus.COMPLETED)
+                            .filter(
+                                    cp ->
+                                            !cp.get(
+                                                            CheckpointStatistics
+                                                                    .CompletedCheckpointStatistics
+                                                                    .FIELD_NAME_EXTERNAL_PATH)
+                                                    .asText()
+                                                    .equals(
+                                                            NonPersistentMetadataCheckpointStorageLocation
+                                                                    .EXTERNAL_POINTER))
+                            .max(
+                                    Comparator.comparingLong(
+                                            cp ->
+                                                    cp.get(CheckpointStatistics.FIELD_NAME_ID)
+                                                            .asLong()))
+                            .map(
+                                    cp ->
+                                            new Savepoint(

Review Comment:
   I suggest we call it lastStateInfo, or add another field for the checkpoint. 
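
For readers following the quoted hunk, its selection logic can be sketched in isolation roughly as follows: keep only completed checkpoints, drop those without an externally addressable path, and take the one with the highest ID. The `Checkpoint` class and `latestRestorable` method are hypothetical stand-ins (the hunk works on JSON nodes from the REST response), and the `NOT_ADDRESSABLE` string is an assumption about the value of `NonPersistentMetadataCheckpointStorageLocation.EXTERNAL_POINTER`.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in types; the quoted hunk reads JSON nodes instead. */
class LastCheckpointSketch {

    /** Assumed marker for checkpoints that have no externally addressable path. */
    static final String NOT_ADDRESSABLE = "<checkpoint-not-externally-addressable>";

    /** Simplified view of one entry in the checkpoint history. */
    static final class Checkpoint {
        final long id;
        final String status;
        final String externalPath;

        Checkpoint(long id, String status, String externalPath) {
            this.id = id;
            this.status = status;
            this.externalPath = externalPath;
        }
    }

    /** Picks the completed, externally addressable checkpoint with the highest ID, if any. */
    static Optional<Checkpoint> latestRestorable(List<Checkpoint> history) {
        return history.stream()
                // Only completed checkpoints can be restored from.
                .filter(cp -> "COMPLETED".equals(cp.status))
                // Skip checkpoints whose metadata was not persisted externally.
                .filter(cp -> !NOT_ADDRESSABLE.equals(cp.externalPath))
                // Checkpoint IDs increase over time, so the maximum ID is the latest.
                .max(Comparator.comparingLong(cp -> cp.id));
    }

    public static void main(String[] args) {
        List<Checkpoint> history = List.of(
                new Checkpoint(1, "COMPLETED", "file:///cp/chk-1"),
                new Checkpoint(2, "COMPLETED", NOT_ADDRESSABLE),
                new Checkpoint(3, "FAILED", null));
        latestRestorable(history).ifPresent(cp -> System.out.println(cp.externalPath));
    }
}
```

Unlike the hunk, this sketch compares strings null-safely, which is a simplification for the example rather than a claim about the REST payload.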





[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
gyfora commented on code in PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#discussion_r872152706


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java:
##########
@@ -528,6 +538,71 @@ public void triggerSavepoint(
         }
     }
 
+    public Optional<Savepoint> getLastCheckpoint(JobID jobId, Configuration conf) throws Exception {
+        try (RestClusterClient<String> clusterClient =
+                (RestClusterClient<String>) getClusterClient(conf)) {
+
+            var headers = CustomCheckpointingStatisticsHeaders.getInstance();
+            var params = headers.getUnresolvedMessageParameters();
+            params.jobPathParameter.resolve(jobId);
+
+            CompletableFuture<CheckpointHistoryWrapper> response =
+                    clusterClient.sendRequest(headers, params, EmptyRequestBody.getInstance());
+
+            var checkpoints =
+                    response.get(
+                            configManager
+                                    .getOperatorConfiguration()
+                                    .getFlinkClientTimeout()
+                                    .getSeconds(),
+                            TimeUnit.SECONDS);
+
+            var latestCheckpointOpt =
+                    checkpoints.getHistory().stream()
+                            .filter(
+                                    cp ->
+                                            CheckpointStatsStatus.valueOf(
+                                                            cp.get(
+                                                                            CheckpointStatistics
+                                                                                    .FIELD_NAME_STATUS)
+                                                                    .asText())
+                                                    == CheckpointStatsStatus.COMPLETED)
+                            .filter(
+                                    cp ->
+                                            !cp.get(
+                                                            CheckpointStatistics
+                                                                    .CompletedCheckpointStatistics
+                                                                    .FIELD_NAME_EXTERNAL_PATH)
+                                                    .asText()
+                                                    .equals(
+                                                            NonPersistentMetadataCheckpointStorageLocation
+                                                                    .EXTERNAL_POINTER))
+                            .max(
+                                    Comparator.comparingLong(
+                                            cp ->
+                                                    cp.get(CheckpointStatistics.FIELD_NAME_ID)
+                                                            .asLong()))
+                            .map(
+                                    cp ->
+                                            new Savepoint(

Review Comment:
   After an offline discussion with @morhidi, we agreed to keep it as is and to work on moving this out of the status into a ConfigMap (after the release), as it is very much internal at this point and not really meant for users.





[GitHub] [flink-kubernetes-operator] gyfora commented on pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
gyfora commented on PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#issuecomment-1122366729

   cc @wangyang0918 @tweise @morhidi 




[GitHub] [flink-kubernetes-operator] wangyang0918 commented on a diff in pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
wangyang0918 commented on code in PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#discussion_r871937334


##########
examples/basic-checkpoint-ha.yaml:
##########
@@ -21,11 +21,13 @@ kind: FlinkDeployment
 metadata:
   name: basic-checkpoint-ha-example
 spec:
+  serviceAccount: flink

Review Comment:
   Duplicated service account.



##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java:
##########
@@ -528,6 +538,71 @@ public void triggerSavepoint(
         }
     }
 
+    public Optional<Savepoint> getLastCheckpoint(JobID jobId, Configuration conf) throws Exception {
+        try (RestClusterClient<String> clusterClient =
+                (RestClusterClient<String>) getClusterClient(conf)) {
+
+            var headers = CustomCheckpointingStatisticsHeaders.getInstance();
+            var params = headers.getUnresolvedMessageParameters();
+            params.jobPathParameter.resolve(jobId);
+
+            CompletableFuture<CheckpointHistoryWrapper> response =
+                    clusterClient.sendRequest(headers, params, EmptyRequestBody.getInstance());
+
+            var checkpoints =
+                    response.get(
+                            configManager
+                                    .getOperatorConfiguration()
+                                    .getFlinkClientTimeout()
+                                    .getSeconds(),
+                            TimeUnit.SECONDS);
+
+            var latestCheckpointOpt =
+                    checkpoints.getHistory().stream()
+                            .filter(
+                                    cp ->
+                                            CheckpointStatsStatus.valueOf(
+                                                            cp.get(
+                                                                            CheckpointStatistics
+                                                                                    .FIELD_NAME_STATUS)
+                                                                    .asText())
+                                                    == CheckpointStatsStatus.COMPLETED)
+                            .filter(
+                                    cp ->
+                                            !cp.get(
+                                                            CheckpointStatistics
+                                                                    .CompletedCheckpointStatistics
+                                                                    .FIELD_NAME_EXTERNAL_PATH)
+                                                    .asText()
+                                                    .equals(
+                                                            NonPersistentMetadataCheckpointStorageLocation
+                                                                    .EXTERNAL_POINTER))
+                            .max(
+                                    Comparator.comparingLong(
+                                            cp ->
+                                                    cp.get(CheckpointStatistics.FIELD_NAME_ID)
+                                                            .asLong()))
+                            .map(
+                                    cp ->
+                                            new Savepoint(

Review Comment:
   I am not quite sure whether we can wrap the checkpoint as a `Savepoint` and store it in `.status.SavepointInfo`. It is a little misleading, since a savepoint is usually triggered manually by the user and is consistent between Flink major versions.





[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
gyfora commented on code in PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#discussion_r872058500


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java:
##########
@@ -528,6 +538,71 @@ public void triggerSavepoint(
         }
     }
 
+    public Optional<Savepoint> getLastCheckpoint(JobID jobId, Configuration conf) throws Exception {
+        try (RestClusterClient<String> clusterClient =
+                (RestClusterClient<String>) getClusterClient(conf)) {
+
+            var headers = CustomCheckpointingStatisticsHeaders.getInstance();
+            var params = headers.getUnresolvedMessageParameters();
+            params.jobPathParameter.resolve(jobId);
+
+            CompletableFuture<CheckpointHistoryWrapper> response =
+                    clusterClient.sendRequest(headers, params, EmptyRequestBody.getInstance());
+
+            var checkpoints =
+                    response.get(
+                            configManager
+                                    .getOperatorConfiguration()
+                                    .getFlinkClientTimeout()
+                                    .getSeconds(),
+                            TimeUnit.SECONDS);
+
+            var latestCheckpointOpt =
+                    checkpoints.getHistory().stream()
+                            .filter(
+                                    cp ->
+                                            CheckpointStatsStatus.valueOf(
+                                                            cp.get(
+                                                                            CheckpointStatistics
+                                                                                    .FIELD_NAME_STATUS)
+                                                                    .asText())
+                                                    == CheckpointStatsStatus.COMPLETED)
+                            .filter(
+                                    cp ->
+                                            !cp.get(
+                                                            CheckpointStatistics
+                                                                    .CompletedCheckpointStatistics
+                                                                    .FIELD_NAME_EXTERNAL_PATH)
+                                                    .asText()
+                                                    .equals(
+                                                            NonPersistentMetadataCheckpointStorageLocation
+                                                                    .EXTERNAL_POINTER))
+                            .max(
+                                    Comparator.comparingLong(
+                                            cp ->
+                                                    cp.get(CheckpointStatistics.FIELD_NAME_ID)
+                                                            .asLong()))
+                            .map(
+                                    cp ->
+                                            new Savepoint(

Review Comment:
   I decided to do this because when we record this we are actually in a special scenario: the job is terminal, and it is only a checkpoint when the job failed/finished (otherwise it would be a savepoint due to stop-with-savepoint).
   
   I decided to put it in savepointInfo for simplicity on the operator side, to avoid introducing new status fields and to keep the logic simple.
   
   The savepoint info is in any case not the real source of truth, because anything can happen that prevents us from recording information, so I think this is fair. With the savepoint history feature this will be improved further, I believe.





[GitHub] [flink-kubernetes-operator] gyfora merged pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
gyfora merged PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200




[GitHub] [flink-kubernetes-operator] wangyang0918 commented on a diff in pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
wangyang0918 commented on code in PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#discussion_r871981862


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java:
##########
@@ -528,6 +538,71 @@ public void triggerSavepoint(
         }
     }
 
+    public Optional<Savepoint> getLastCheckpoint(JobID jobId, Configuration conf) throws Exception {
+        try (RestClusterClient<String> clusterClient =
+                (RestClusterClient<String>) getClusterClient(conf)) {
+
+            var headers = CustomCheckpointingStatisticsHeaders.getInstance();
+            var params = headers.getUnresolvedMessageParameters();
+            params.jobPathParameter.resolve(jobId);
+
+            CompletableFuture<CheckpointHistoryWrapper> response =
+                    clusterClient.sendRequest(headers, params, EmptyRequestBody.getInstance());
+
+            var checkpoints =
+                    response.get(
+                            configManager
+                                    .getOperatorConfiguration()
+                                    .getFlinkClientTimeout()
+                                    .getSeconds(),
+                            TimeUnit.SECONDS);
+
+            var latestCheckpointOpt =
+                    checkpoints.getHistory().stream()
+                            .filter(
+                                    cp ->
+                                            CheckpointStatsStatus.valueOf(
+                                                            cp.get(
+                                                                            CheckpointStatistics
+                                                                                    .FIELD_NAME_STATUS)
+                                                                    .asText())
+                                                    == CheckpointStatsStatus.COMPLETED)
+                            .filter(
+                                    cp ->
+                                            !cp.get(
+                                                            CheckpointStatistics
+                                                                    .CompletedCheckpointStatistics
+                                                                    .FIELD_NAME_EXTERNAL_PATH)
+                                                    .asText()
+                                                    .equals(
+                                                            NonPersistentMetadataCheckpointStorageLocation
+                                                                    .EXTERNAL_POINTER))
+                            .max(
+                                    Comparator.comparingLong(
+                                            cp ->
+                                                    cp.get(CheckpointStatistics.FIELD_NAME_ID)
+                                                            .asLong()))
+                            .map(
+                                    cp ->
+                                            new Savepoint(

Review Comment:
   I am not quite sure whether we can wrap the checkpoint as a `Savepoint` and store it in `.status.SavepointInfo`. It is a little misleading, since a savepoint is usually triggered manually by the user and is consistent between Flink major versions. A checkpoint is neither.





[GitHub] [flink-kubernetes-operator] wangyang0918 commented on a diff in pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
wangyang0918 commented on code in PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#discussion_r872073637


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java:
##########
@@ -528,6 +538,71 @@ public void triggerSavepoint(
         }
     }
 
+    public Optional<Savepoint> getLastCheckpoint(JobID jobId, Configuration conf) throws Exception {
+        try (RestClusterClient<String> clusterClient =
+                (RestClusterClient<String>) getClusterClient(conf)) {
+
+            var headers = CustomCheckpointingStatisticsHeaders.getInstance();
+            var params = headers.getUnresolvedMessageParameters();
+            params.jobPathParameter.resolve(jobId);
+
+            CompletableFuture<CheckpointHistoryWrapper> response =
+                    clusterClient.sendRequest(headers, params, EmptyRequestBody.getInstance());
+
+            var checkpoints =
+                    response.get(
+                            configManager
+                                    .getOperatorConfiguration()
+                                    .getFlinkClientTimeout()
+                                    .getSeconds(),
+                            TimeUnit.SECONDS);
+
+            var latestCheckpointOpt =
+                    checkpoints.getHistory().stream()
+                            .filter(
+                                    cp ->
+                                            CheckpointStatsStatus.valueOf(
+                                                            cp.get(
+                                                                            CheckpointStatistics
+                                                                                    .FIELD_NAME_STATUS)
+                                                                    .asText())
+                                                    == CheckpointStatsStatus.COMPLETED)
+                            .filter(
+                                    cp ->
+                                            !cp.get(
+                                                            CheckpointStatistics
+                                                                    .CompletedCheckpointStatistics
+                                                                    .FIELD_NAME_EXTERNAL_PATH)
+                                                    .asText()
+                                                    .equals(
+                                                            NonPersistentMetadataCheckpointStorageLocation
+                                                                    .EXTERNAL_POINTER))
+                            .max(
+                                    Comparator.comparingLong(
+                                            cp ->
+                                                    cp.get(CheckpointStatistics.FIELD_NAME_ID)
+                                                            .asLong()))
+                            .map(
+                                    cp ->
+                                            new Savepoint(

Review Comment:
   Given that `.status.jobStatus.savepointInfo` is not the real source of truth, it is reasonable to also store the checkpoint in this field.
   
   It would be great if we could also describe this in the documentation :)





[GitHub] [flink-kubernetes-operator] gyfora commented on pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
gyfora commented on PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#issuecomment-1122541205

   The CI failure seems to be related to a Flink base version mismatch. I will have to iterate on this after the 1.15 version bump is merged.




[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
gyfora commented on code in PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#discussion_r872061271


##########
examples/basic-checkpoint-ha.yaml:
##########
@@ -21,11 +21,13 @@ kind: FlinkDeployment
 metadata:
   name: basic-checkpoint-ha-example
 spec:
+  serviceAccount: flink

Review Comment:
   yea, I fixed this in parallel with someone else :) 





[GitHub] [flink-kubernetes-operator] wangyang0918 commented on a diff in pull request #200: [FLINK-27495] Observe last savepoint status directly from cluster

Posted by GitBox <gi...@apache.org>.
wangyang0918 commented on code in PR #200:
URL: https://github.com/apache/flink-kubernetes-operator/pull/200#discussion_r872090743


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java:
##########
@@ -528,6 +538,71 @@ public void triggerSavepoint(
         }
     }
 
+    public Optional<Savepoint> getLastCheckpoint(JobID jobId, Configuration conf) throws Exception {
+        try (RestClusterClient<String> clusterClient =
+                (RestClusterClient<String>) getClusterClient(conf)) {
+
+            var headers = CustomCheckpointingStatisticsHeaders.getInstance();
+            var params = headers.getUnresolvedMessageParameters();
+            params.jobPathParameter.resolve(jobId);
+
+            CompletableFuture<CheckpointHistoryWrapper> response =
+                    clusterClient.sendRequest(headers, params, EmptyRequestBody.getInstance());
+
+            var checkpoints =
+                    response.get(
+                            configManager
+                                    .getOperatorConfiguration()
+                                    .getFlinkClientTimeout()
+                                    .getSeconds(),
+                            TimeUnit.SECONDS);
+
+            var latestCheckpointOpt =
+                    checkpoints.getHistory().stream()
+                            .filter(
+                                    cp ->
+                                            CheckpointStatsStatus.valueOf(
+                                                            cp.get(
+                                                                            CheckpointStatistics
+                                                                                    .FIELD_NAME_STATUS)
+                                                                    .asText())
+                                                    == CheckpointStatsStatus.COMPLETED)
+                            .filter(
+                                    cp ->
+                                            !cp.get(
+                                                            CheckpointStatistics
+                                                                    .CompletedCheckpointStatistics
+                                                                    .FIELD_NAME_EXTERNAL_PATH)
+                                                    .asText()
+                                                    .equals(
+                                                            NonPersistentMetadataCheckpointStorageLocation
+                                                                    .EXTERNAL_POINTER))
+                            .max(
+                                    Comparator.comparingLong(
+                                            cp ->
+                                                    cp.get(CheckpointStatistics.FIELD_NAME_ID)
+                                                            .asLong()))
+                            .map(
+                                    cp ->
+                                            new Savepoint(

Review Comment:
   `lastStateInfo` looks better, and then we could have a `snapshotType` field (e.g. checkpoint, savepoint).
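
Purely to illustrate this suggestion (the thread elsewhere records a decision to keep the existing field), a status entry carrying a snapshot type might look roughly like the sketch below. Everything here is hypothetical: the `LastStateInfo` class, its fields, the `SnapshotType` enum, and the `classify` helper are not part of the PR. The classification rule follows the reasoning given in the thread, namely that a terminal job stopped with a savepoint leaves a savepoint, while a failed or finished job leaves only a checkpoint.

```java
/** Hypothetical shape of the suggested status field; none of these names exist in the PR. */
class LastStateInfoSketch {

    /** The kind of snapshot that the recorded location points to. */
    enum SnapshotType { CHECKPOINT, SAVEPOINT }

    /** Immutable record of the last observed snapshot. */
    static final class LastStateInfo {
        final String location;
        final SnapshotType snapshotType;
        final long timestampMillis;

        LastStateInfo(String location, SnapshotType snapshotType, long timestampMillis) {
            this.location = location;
            this.snapshotType = snapshotType;
            this.timestampMillis = timestampMillis;
        }
    }

    /** A job stopped with a savepoint leaves a savepoint; otherwise only a checkpoint is available. */
    static SnapshotType classify(boolean stoppedWithSavepoint) {
        return stoppedWithSavepoint ? SnapshotType.SAVEPOINT : SnapshotType.CHECKPOINT;
    }

    public static void main(String[] args) {
        LastStateInfo info = new LastStateInfo(
                "file:///cp/chk-5", classify(false), System.currentTimeMillis());
        System.out.println(info.snapshotType + " at " + info.location);
    }
}
```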


