Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/08/17 16:22:12 UTC

[GitHub] [flink] rkhachatryan opened a new pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

rkhachatryan opened a new pull request #13180:
URL: https://github.com/apache/flink/pull/13180


   
   ## What is the purpose of the change
   
   *Improve logging when a checkpoint is declined (e.g. target directory is not writable).*
   
   ## Verifying this change
   
   Tested manually on a local cluster.
   
   ### Job Manager ###
   
   ```
   2020-08-17 18:06:47,238 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline checkpoint 4 by task 5271c210329e73bd743f3227edfb3b71_0_0 of job 815316dcc17ceea916d2a4da019aca13 at 127.0.1.1:44861-e3cf2c @ roman-ThinkPad-L380 (dataPort=33273).
   org.apache.flink.util.SerializedThrowable: Could not materialize checkpoint 4 for operator ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (1/2).
   Caused by: org.apache.flink.util.SerializedThrowable: com.esotericsoftware.kryo.KryoException: java.io.IOException: Could not open output stream for state backend
   	... 
   Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException: Could not open output stream for state backend
   	...
   Caused by: org.apache.flink.util.SerializedThrowable: Could not open output stream for state backend
   	... 
   Caused by: org.apache.flink.util.SerializedThrowable: Mkdirs failed to create file:/tmp/forbidden/savepoint-e2e-test-chckpt-dir/815316dcc17ceea916d2a4da019aca13/chk-4
   	...
   ```
   
   ### Task Manager ###
   
   ```
   2020-08-17 18:06:45,239 INFO  org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (2/2) - asynchronous part of checkpoint 2 could not be completed.
   java.util.concurrent.ExecutionException: com.esotericsoftware.kryo.KryoException: java.io.IOException: Could not open output stream for state backend
   Caused by: com.esotericsoftware.kryo.KryoException: java.io.IOException: Could not open output stream for state backend
   	... 
   Caused by: java.io.IOException: Could not open output stream for state backend
   	...
   Caused by: java.io.IOException: Mkdirs failed to create file:/tmp/forbidden/savepoint-e2e-test-chckpt-dir/815316dcc17ceea916d2a4da019aca13/chk-2
   	... 
   ```
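
The "Caused by" entries in both logs form a chain that ends at the actual reason (here, a directory that could not be created). As a minimal sketch of how such a chain can be walked programmatically, the helper below flattens an exception's causes and picks the innermost one. `CauseChain`, its methods, and the example path are hypothetical and not part of Flink.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (not part of Flink): flattens an exception's cause chain. */
final class CauseChain {

    /** Returns "Type: message" entries from the outermost exception down to the root cause. */
    static List<String> messages(Throwable t) {
        List<String> result = new ArrayList<>();
        // Bound the walk so that a (rare) cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 64; depth++) {
            result.add(t.getClass().getSimpleName() + ": " + t.getMessage());
            t = t.getCause();
        }
        return result;
    }

    /** Returns the innermost cause, i.e. the last "Caused by" entry of a stack trace. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        int depth = 0;
        while (current.getCause() != null && depth++ < 64) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical failure mirroring the shape of the logs above; the path is made up.
        Throwable failure = new RuntimeException(
            "Could not materialize checkpoint",
            new IOException(
                "Could not open output stream for state backend",
                new IOException("Mkdirs failed to create file:/tmp/example/chk-1")));

        messages(failure).forEach(System.out::println);
        System.out.println("Root cause: " + rootCause(failure).getMessage());
    }
}
```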
   
   cc: @NicoK 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rkhachatryan commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473052428



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       Yes, it will be forwarded in most cases, but this forward RPC can also fail (I think it can be helpful to have it in both logs).
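
To illustrate the argument for keeping the message in both logs, here is a minimal sketch in plain Java (not Flink code). It records the failure locally before trying to forward it, so the cause survives even when forwarding fails. `LocalThenRemoteReporting`, `runAndReport`, and the `forwarder` callback are hypothetical names; the in-memory list stands in for a real logger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative sketch (not Flink code): record a failure locally before reporting it remotely. */
final class LocalThenRemoteReporting {

    /** Stand-in for a logger; collects the lines that would be written to the local log. */
    static final List<String> localLog = new ArrayList<>();

    /**
     * Runs the action; on failure, logs locally first and then tries to forward the failure.
     * The forwarder models an RPC that may itself throw.
     */
    static void runAndReport(
            String taskName, long checkpointId, Runnable action, Consumer<Exception> forwarder) {
        try {
            action.run();
        } catch (Exception e) {
            // Logged unconditionally (no debug-level guard), so the cause is kept locally ...
            localLog.add(taskName + " - asynchronous part of checkpoint " + checkpointId
                + " could not be completed: " + e);
            try {
                // ... even if forwarding the failure does not succeed.
                forwarder.accept(e);
            } catch (RuntimeException forwardFailure) {
                localLog.add("Could not forward failure of checkpoint " + checkpointId
                    + ": " + forwardFailure);
            }
        }
    }

    public static void main(String[] args) {
        runAndReport(
            "ExampleTask (1/2)",
            2L,
            () -> { throw new IllegalStateException("Could not open output stream"); },
            e -> { throw new IllegalStateException("RPC failed"); });
        localLog.forEach(System.out::println);
    }
}
```

With both the action and the forwarder failing, the local log still ends up with two entries: the original cause and the forwarding failure.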




----------------------------------------------------------------



[GitHub] [flink] klion26 commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
klion26 commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473623138



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       Honestly, I have not had this happen to me, and I’m not against this change here.
   
   After another look at `CheckpointFailureManager`: `CHECKPOINT_EXPIRED` is now counted as well (we now count both `CHECKPOINT_EXPIRED` and `CHECKPOINT_DECLINED`). Before `CHECKPOINT_EXPIRED` was counted, we could have expired checkpoints and abort the ongoing snapshot through the `notifyCheckpointAbort` RPC, which may print the log here.
   




----------------------------------------------------------------



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623",
       "triggerID" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626",
       "triggerID" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 00115a2e9b08249705cc69403d726d494bb60e1d Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------



[GitHub] [flink] pnowojski commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472234400



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
##########
@@ -887,7 +887,8 @@ public void receiveDeclineMessage(DeclineCheckpoint message, String taskManagerL
 					checkpointId,
 					message.getTaskExecutionId(),
 					job,
-					taskManagerLocationInfo);
+					taskManagerLocationInfo,
+					message.getReason());

Review comment:
       Good to know, I was not aware of that.




----------------------------------------------------------------



[GitHub] [flink] pnowojski commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473803693



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       > Honestly, I have not had this happen to me, and I’m not against this change here.
   
   Ok, in that case let's not overthink it and let's try this out :) Thanks for your inputs @NicoK and @klion26 .




----------------------------------------------------------------



[GitHub] [flink] klion26 commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
klion26 commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472979610



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       @rkhachatryan I think this would not happen in production (I assume that users always configure the checkpoint interval in minutes), but it may happen if we configure a very small checkpoint interval.




----------------------------------------------------------------



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623",
       "triggerID" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b823669c9164f5a6d12f2fa4f42621958a1bdcc4 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623) 
   * 00115a2e9b08249705cc69403d726d494bb60e1d UNKNOWN
   


----------------------------------------------------------------



[GitHub] [flink] flinkbot commented on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674978536


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit b823669c9164f5a6d12f2fa4f42621958a1bdcc4 (Mon Aug 17 16:23:46 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.
   
   The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623",
       "triggerID" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626",
       "triggerID" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633",
       "triggerID" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5662",
       "triggerID" : "675319870",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633",
       "triggerID" : "675319870",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "4aea992b6d30869cc742525b56225e3bcead1439",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4aea992b6d30869cc742525b56225e3bcead1439",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2a1887f46b6baf1b9c09e13691380de416302c01 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5662) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633) 
   * 4aea992b6d30869cc742525b56225e3bcead1439 UNKNOWN
   


----------------------------------------------------------------



[GitHub] [flink] NicoK commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
NicoK commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473012801



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       I would also be surprised if it would fail frequently on a regular basis.
   
   On the other hand: wouldn't this be forwarded to the checkpoint coordinator and reported there anyway?




----------------------------------------------------------------



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623",
       "triggerID" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626",
       "triggerID" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633",
       "triggerID" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2a1887f46b6baf1b9c09e13691380de416302c01 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633) 
   


----------------------------------------------------------------



[GitHub] [flink] rkhachatryan commented on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-675319870


   @NicoK, can you take a look at this PR?
   
   The AZP failure is unrelated (timeout), and the private build succeeded.
   
   @flinkbot run azure


----------------------------------------------------------------



[GitHub] [flink] klion26 commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
klion26 commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473613780



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       Honestly, I haven't had that happen to me. Two more things we may need to be aware of: 1) we only count `CHECKPOINT_DECLINED` and `CHECKPOINT_EXPIRED` in `CheckpointFailureManager`, and 2) we now have `notifyCheckpointAborted`, which will cancel the snapshot of tasks if a checkpoint can’t complete.




----------------------------------------------------------------



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623",
       "triggerID" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b823669c9164f5a6d12f2fa4f42621958a1bdcc4 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623) 
   


----------------------------------------------------------------



[GitHub] [flink] rkhachatryan commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473689643



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       I think everything said above about failure frequency is also true for expirations (in fact, there is only one counter for all types of failures).
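
A minimal sketch of the "one counter for all types of failures" idea, assuming a tolerated number of failures beyond which the job is failed. This is illustrative only and not the actual `CheckpointFailureManager`; the class name, the `OTHER` reason, and the reset-on-success behaviour are assumptions made for the example.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only; the real CheckpointFailureManager may differ in detail. */
final class FailureCounterSketch {

    /** Hypothetical subset of reasons; the first two names are taken from the discussion. */
    enum Reason { CHECKPOINT_DECLINED, CHECKPOINT_EXPIRED, OTHER }

    /** Only these reasons increase the counter. */
    private static final Set<Reason> COUNTED =
        EnumSet.of(Reason.CHECKPOINT_DECLINED, Reason.CHECKPOINT_EXPIRED);

    private final int tolerableFailures;

    /** A single counter shared by all counted reasons. */
    private int continuousFailures;

    FailureCounterSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a failure; returns true if the tolerated number is now exceeded. */
    boolean onFailure(Reason reason) {
        if (COUNTED.contains(reason)) {
            continuousFailures++;
        }
        return continuousFailures > tolerableFailures;
    }

    /** Assumption for this sketch: a completed checkpoint resets the counter. */
    void onSuccess() {
        continuousFailures = 0;
    }

    public static void main(String[] args) {
        FailureCounterSketch counter = new FailureCounterSketch(1);
        System.out.println(counter.onFailure(Reason.CHECKPOINT_DECLINED));
        System.out.println(counter.onFailure(Reason.CHECKPOINT_EXPIRED));
    }
}
```

Under these assumptions, frequent declines or expirations exceed the tolerated number quickly, which is why a long-running job cannot log such failures at a high rate for long.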




----------------------------------------------------------------



[GitHub] [flink] rkhachatryan commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473006654



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       Even with a small checkpoint interval, the failure frequency should not be high; otherwise, the job will fail.
   So I don't think it poses any problem.
   WDYT?




----------------------------------------------------------------



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623",
       "triggerID" : "b823669c9164f5a6d12f2fa4f42621958a1bdcc4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626",
       "triggerID" : "00115a2e9b08249705cc69403d726d494bb60e1d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633",
       "triggerID" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5662",
       "triggerID" : "675319870",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "2a1887f46b6baf1b9c09e13691380de416302c01",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633",
       "triggerID" : "675319870",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 2a1887f46b6baf1b9c09e13691380de416302c01 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5662) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633) 
   


----------------------------------------------------------------



[GitHub] [flink] NicoK commented on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
NicoK commented on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-676326703


   Thanks for this PR; this adds exactly that piece of information which was missing previously.


----------------------------------------------------------------



[GitHub] [flink] pnowojski commented on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
pnowojski commented on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-675508813


   > I think we can/should skip tests for things like logging, which is less important, not complex, and changes less frequently.
   
   Ok :( but sooner or later I will get tired of everyone avoiding the tests for loggers and I will make a stronger stand. Time is ticking :)





[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * 2a1887f46b6baf1b9c09e13691380de416302c01 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5662) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * b823669c9164f5a6d12f2fa4f42621958a1bdcc4 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623) 
   * 00115a2e9b08249705cc69403d726d494bb60e1d Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626) 
   



[GitHub] [flink] rkhachatryan commented on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-675496979


   > having tests is always better ;)
   
   Not denying that the more coverage the better; the more code the worse :)
   
   I think we can/should skip tests for things like logging, which is less important, not complex, and changes less frequently.
   
   > it looks like there is a bug in one of the formats
   
   Replied in code.





[GitHub] [flink] flinkbot commented on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * b823669c9164f5a6d12f2fa4f42621958a1bdcc4 UNKNOWN
   



[GitHub] [flink] NicoK commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
NicoK commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473012801



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       I would also be surprised if it failed frequently on a regular basis.
   
   On the other hand: wouldn't this be forwarded to the checkpoint coordinator and reported there anyway? (But maybe we need this in both logs, like many other things.)







[GitHub] [flink] klion26 commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
klion26 commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472703243



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       Yes, this was at `debug` level because logging it at `INFO` or above would flood the log. Currently, we support very frequent checkpoints.







[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * 00115a2e9b08249705cc69403d726d494bb60e1d Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626) 
   * 2a1887f46b6baf1b9c09e13691380de416302c01 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633) 
   



[GitHub] [flink] rkhachatryan commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472818246



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       But it will be reported only in exceptional situations; do they happen that frequently?







[GitHub] [flink] pnowojski merged pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
pnowojski merged pull request #13180:
URL: https://github.com/apache/flink/pull/13180


   





[GitHub] [flink] rkhachatryan commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472214667



##########
File path: flink-end-to-end-tests/test-scripts/common.sh
##########
@@ -387,6 +387,7 @@ function check_logs_for_exceptions {
    | grep -v  "WARN  org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
    | grep -v 'INFO.*AWSErrorCode' \
    | grep -v "RejectedExecutionException" \
+   | grep -v "CancellationException" \

Review comment:
       My change causes `CancellationException` to be logged at `INFO` level as any other exception (which I think is fine).
   
   In general, I think `CancellationException` in logs shouldn't fail the tests (unless it's caused by some other exception, which will be detected by script). So I made it a separate commit.
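
The effect of adding one more exclusion to the chain can be sketched in Java. This is a simplified model, not the actual `check_logs_for_exceptions` logic: the class `LogExceptionFilterSketch`, the assumption that any line containing `Exception` is a candidate, and the sample log lines are hypothetical. Only the two excluded names are taken from the diff above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LogExceptionFilterSketch {

    // Substrings tolerated in the logs, mirroring two of the `grep -v` entries in the diff.
    static final List<String> ALLOWED =
        Arrays.asList("RejectedExecutionException", "CancellationException");

    /** Returns the lines that mention an exception and contain none of the allowed substrings. */
    public static List<String> unexpected(List<String> logLines) {
        return logLines.stream()
            .filter(line -> line.contains("Exception"))                 // candidate lines (assumed rule)
            .filter(line -> ALLOWED.stream().noneMatch(line::contains)) // drop tolerated exceptions
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not real Flink output.
        List<String> lines = Arrays.asList(
            "INFO  checkpoint 4 completed",
            "INFO  java.util.concurrent.CancellationException",
            "ERROR java.io.IOException: Could not open output stream for state backend");
        unexpected(lines).forEach(System.out::println);
    }
}
```

With the sample input, only the `IOException` line survives the filter, which matches the intent stated in the comment: a bare `CancellationException` is tolerated, while an accompanying real failure would still be reported.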







[GitHub] [flink] pnowojski commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472234731



##########
File path: flink-end-to-end-tests/test-scripts/common.sh
##########
@@ -387,6 +387,7 @@ function check_logs_for_exceptions {
    | grep -v  "WARN  org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
    | grep -v 'INFO.*AWSErrorCode' \
    | grep -v "RejectedExecutionException" \
+   | grep -v "CancellationException" \

Review comment:
       In that case can you squash this with the commit that introduced the problem?







[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * 2a1887f46b6baf1b9c09e13691380de416302c01 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5662) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633) 
   * 4aea992b6d30869cc742525b56225e3bcead1439 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5685) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * 4aea992b6d30869cc742525b56225e3bcead1439 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5685) 
   



[GitHub] [flink] klion26 commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
klion26 commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r473623138



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       Honestly, I haven't had that happen to me. I agree that in most cases this would not happen (as my last reply said), and I’m not against this change, but it may happen when we set the checkpoint interval to a very small value (maybe in tests).  
   1) we only count `CHECKPOINT_DECLINED` and `CHECKPOINT_EXPIRED` in `CheckpointFailureManager`, 2) we'll abort the snapshot through `notifyCheckpointAbort` if some checkpoint can't complete
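
Point 1) can be pictured with a small counter. This is a hypothetical sketch, not Flink's `CheckpointFailureManager`: the enum `Reason`, the class `FailureCounterSketch`, the tolerance threshold, and the reset on success are assumptions made for illustration. Only the two reason names `CHECKPOINT_DECLINED` and `CHECKPOINT_EXPIRED` come from the comment above.

```java
import java.util.EnumSet;
import java.util.Set;

public class FailureCounterSketch {

    /** Illustrative failure reasons; only the first two names appear in the discussion. */
    public enum Reason { CHECKPOINT_DECLINED, CHECKPOINT_EXPIRED, OTHER }

    // Only these reasons contribute to the failure count.
    private static final Set<Reason> COUNTED =
        EnumSet.of(Reason.CHECKPOINT_DECLINED, Reason.CHECKPOINT_EXPIRED);

    private final int tolerable; // assumed threshold of tolerated failures
    private int failures;

    public FailureCounterSketch(int tolerable) {
        this.tolerable = tolerable;
    }

    /** Records a failure; returns true once the counted failures exceed the threshold. */
    public boolean onFailure(Reason reason) {
        if (COUNTED.contains(reason)) {
            failures++;
        }
        return failures > tolerable;
    }

    /** Assumption: a completed checkpoint resets the count. */
    public void onSuccess() {
        failures = 0;
    }

    public int failures() {
        return failures;
    }

    public static void main(String[] args) {
        FailureCounterSketch counter = new FailureCounterSketch(1);
        System.out.println(counter.onFailure(Reason.OTHER));               // not counted
        System.out.println(counter.onFailure(Reason.CHECKPOINT_DECLINED)); // counted, within tolerance
        System.out.println(counter.onFailure(Reason.CHECKPOINT_EXPIRED));  // counted, exceeds tolerance
    }
}
```

Under these assumptions, a failure that is logged is not necessarily one that counts toward failing the job, which is why the log level and the failure accounting can be discussed separately.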







[GitHub] [flink] rkhachatryan commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472212733



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
##########
@@ -887,7 +887,8 @@ public void receiveDeclineMessage(DeclineCheckpoint message, String taskManagerL
 					checkpointId,
 					message.getTaskExecutionId(),
 					job,
-					taskManagerLocationInfo);
+					taskManagerLocationInfo,
+					message.getReason());

Review comment:
       No, this is a vararg, and the last argument is interpreted as `Throwable`.
   The stacktrace is printed as I showed in the description.
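
The convention described in this reply can be modelled without any logging library. The sketch below is a simplified, hypothetical model (the class `ThrowableArgSketch` and its methods are invented for illustration and are not SLF4J or Log4j code): when the argument list ends in a `Throwable` that no `{}` placeholder consumes, that argument is set aside as the exception to print rather than formatted into the message. The exact rules differ between logging backends and versions, so treat this only as a picture of the idea.

```java
import java.io.IOException;

public class ThrowableArgSketch {

    /** Counts "{}" placeholders in a message pattern. */
    public static int countPlaceholders(String pattern) {
        int count = 0;
        int idx = pattern.indexOf("{}");
        while (idx >= 0) {
            count++;
            idx = pattern.indexOf("{}", idx + 2);
        }
        return count;
    }

    /** Returns the trailing Throwable if no placeholder consumes it, otherwise null. */
    public static Throwable extractThrowable(String pattern, Object... args) {
        if (args.length == 0) {
            return null;
        }
        Object last = args[args.length - 1];
        boolean unconsumed = args.length > countPlaceholders(pattern);
        return (last instanceof Throwable && unconsumed) ? (Throwable) last : null;
    }

    /** Substitutes arguments into the placeholders, left to right; surplus arguments are ignored. */
    public static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int from = 0;
        int argIdx = 0;
        int idx;
        while (argIdx < args.length && (idx = pattern.indexOf("{}", from)) >= 0) {
            sb.append(pattern, from, idx).append(args[argIdx++]);
            from = idx + 2;
        }
        return sb.append(pattern.substring(from)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical pattern with four placeholders and five arguments.
        String pattern = "Decline checkpoint {} by task {} of job {} at {}.";
        Throwable reason = new IOException("Could not open output stream for state backend");
        Object[] logArgs = {4L, "task-0", "job-1", "127.0.1.1", reason};

        System.out.println(format(pattern, logArgs));
        Throwable t = extractThrowable(pattern, logArgs);
        if (t != null) {
            t.printStackTrace(System.out); // the unconsumed Throwable is printed as a stack trace
        }
    }
}
```

This is why the diff can add `message.getReason()` as a fifth argument without adding a fifth `{}` to the pattern.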







[GitHub] [flink] pnowojski commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472142222



##########
File path: flink-end-to-end-tests/test-scripts/common.sh
##########
@@ -387,6 +387,7 @@ function check_logs_for_exceptions {
    | grep -v  "WARN  org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
    | grep -v 'INFO.*AWSErrorCode' \
    | grep -v "RejectedExecutionException" \
+   | grep -v "CancellationException" \

Review comment:
       Why did this have to be added? Is it caused by one of your changes?

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
##########
@@ -887,7 +887,8 @@ public void receiveDeclineMessage(DeclineCheckpoint message, String taskManagerL
 					checkpointId,
 					message.getTaskExecutionId(),
 					job,
-					taskManagerLocationInfo);
+					taskManagerLocationInfo,
+					message.getReason());

Review comment:
       Isn't this missing a pattern/format change? Also, how would you like it to be logged? Just the `message.getReason().toString()`? Do we care about the stack trace?







[GitHub] [flink] klion26 commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
klion26 commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472703243



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       Yes, this was at `debug` level because logging it at `INFO` or above would flood the log, since we support very frequent checkpoints.







[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * 2a1887f46b6baf1b9c09e13691380de416302c01 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5662) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5633) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * 00115a2e9b08249705cc69403d726d494bb60e1d Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626) 
   * 2a1887f46b6baf1b9c09e13691380de416302c01 UNKNOWN
   





[GitHub] [flink] pnowojski commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472352387



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/AsyncCheckpointRunnable.java
##########
@@ -129,12 +129,10 @@ public void run() {
 					checkpointMetaData.getCheckpointId());
 			}
 		} catch (Exception e) {
-			if (LOG.isDebugEnabled()) {
-				LOG.debug("{} - asynchronous part of checkpoint {} could not be completed.",
-					taskName,
-					checkpointMetaData.getCheckpointId(),
-					e);
-			}
+			LOG.info("{} - asynchronous part of checkpoint {} could not be completed.",

Review comment:
       Won't this flood the log with exceptions in some cases?
   
   There was some [relevant discussion](https://github.com/apache/flink/pull/9873/files/d62e6daf023f440f25fa2ea5a55081e513a5fcd0#diff-9437cb4899bb946f234c272be6724770) about this when it was being introduced.
   
   CC @klion26 ?







[GitHub] [flink] flinkbot edited a comment on pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13180:
URL: https://github.com/apache/flink/pull/13180#issuecomment-674984135


   ## CI report:
   
   * b823669c9164f5a6d12f2fa4f42621958a1bdcc4 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5623) 
   * 00115a2e9b08249705cc69403d726d494bb60e1d Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=5626) 
   





[GitHub] [flink] rkhachatryan commented on a change in pull request #13180: [FLINK-18962][checkpointing] Improve logging when checkpoint declined

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #13180:
URL: https://github.com/apache/flink/pull/13180#discussion_r472284284



##########
File path: flink-end-to-end-tests/test-scripts/common.sh
##########
@@ -387,6 +387,7 @@ function check_logs_for_exceptions {
    | grep -v  "WARN  org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
    | grep -v 'INFO.*AWSErrorCode' \
    | grep -v "RejectedExecutionException" \
+   | grep -v "CancellationException" \

Review comment:
       Done.
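
For readers following the hunk above, here is a minimal sketch of what the added filter does. It is not the actual `check_logs_for_exceptions` function: the sample log lines, the `log_file` and `exception_count` names, and the final counting step are assumptions made for illustration, since the hunk shows only the `grep -v` filter lines.

```shell
# Hypothetical sample log; these lines are invented for illustration only.
log_file="$(mktemp)"
printf '%s\n' \
  'INFO  Task - java.util.concurrent.RejectedExecutionException: executor shut down' \
  'INFO  Task - java.util.concurrent.CancellationException' \
  'ERROR Task - java.lang.IllegalStateException: unexpected failure' \
  > "$log_file"

# Each `grep -v` drops lines that match an allow-listed pattern.
# The last stage (an assumption, not shown in the hunk) counts the
# remaining lines that still mention "exception", case-insensitively.
# `|| true` keeps the assignment from failing when the count is zero.
exception_count="$(grep -v "RejectedExecutionException" "$log_file" \
  | grep -v "CancellationException" \
  | grep -ic "exception" || true)"

echo "unexpected exception lines: $exception_count"
rm -f "$log_file"
```

Without the `CancellationException` filter, the second sample line would also be counted, which is the kind of false positive an extra `grep -v` stage is meant to suppress once such exceptions are logged at INFO level.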



