Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/13 19:27:13 UTC

[GitHub] [iceberg] kbendick opened a new pull request #3106: FLINK - Add tolerableCheckpointFailureNumber to debug CI timeouts

kbendick opened a new pull request #3106:
URL: https://github.com/apache/iceberg/pull/3106


   We are occasionally seeing CI runs take 6 hours and then ultimately time out.
   
   After adding further logging, it seems that there is a Flink test that is still trying to checkpoint after the job has entered the FINISHED state.
   
   I'm not 100% sure if adding this config will help with that (as it might not be considered a checkpoint failure), but it's worth a shot for further debugging. Ultimately, we should resolve this issue, but for now I just want to see if this will help.
   
   Further details (and logs) can be found here: https://github.com/apache/iceberg/issues/3091
   
   The relevant log that is spewed for hours until timeout is:
   ```
   2021-09-13T08:19:47.7896411Z > Task :iceberg-flink:test
   2021-09-13T08:19:47.7899950Z     [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: rightCustomSource -> rightIcebergSink-rightIcebergSink -> rightIcebergSink-IcebergStreamWriter (1/1) of job 437e46445e777ca2231677f60f87496a is not in state RUNNING but FINISHED instead. Aborting checkpoint.
   2021-09-13T08:19:47.7905489Z     [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: rightCustomSource -> rightIcebergSink-rightIcebergSink -> rightIcebergSink-IcebergStreamWriter (1/1) of job 437e46445e777ca2231677f60f87496a is not in state RUNNING but FINISHED instead. Aborting checkpoint.
   2021-09-13T08:19:47.7914766Z     [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: rightCustomSource -> rightIcebergSink-rightIcebergSink -> rightIcebergSink-IcebergStreamWriter (1/1) of job 437e46445e777ca2231677f60f87496a is not in state RUNNING but FINISHED instead. Aborting checkpoint.
   2021-09-13T08:19:47.7920502Z     [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: rightCustomSource -> rightIcebergSink-rightIcebergSink -> rightIcebergSink-IcebergStreamWriter (1/1) of job 437e46445e777ca2231677f60f87496a is not in state RUNNING but FINISHED instead. Aborting checkpoint.
   ```
   
   cc @nastra @openinx @rdblue @RussellSpitzer @stevenzwu in case you have any insight on how to resolve this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu edited a comment on pull request #3106: [WIP] FLINK - Add tolerableCheckpointFailureNumber to debug CI timeouts

Posted by GitBox <gi...@apache.org>.
stevenzwu edited a comment on pull request #3106:
URL: https://github.com/apache/iceberg/pull/3106#issuecomment-918655645


   @kbendick thx a lot for investigating this issue. 
   
   I added the `testTwoSinksInDisjointedDAG` test probably a few weeks ago. It seems that we may have an infinite wait problem in the `BoundedTestSource` when the other part of the DAG has completed, as @kbendick found out. Maybe we should ignore this test for now so that we can stop the bleeding first while we investigate the right fix.
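
As an illustration of the "infinite wait" concern, here is a minimal sketch of a wait that gives up after a deadline instead of blocking forever. It assumes the source is waiting for checkpoint-completion notifications, which the thread does not confirm. `CheckpointWaiter` and its methods are hypothetical names for this example and are not part of `BoundedTestSource` or Flink.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not the real BoundedTestSource.
class CheckpointWaiter {
    private final AtomicInteger completed = new AtomicInteger();

    // Would be called whenever a checkpoint completes (assumption).
    void notifyCheckpointComplete() {
        completed.incrementAndGet();
    }

    // Returns true if `target` checkpoints completed before the timeout,
    // false if the deadline passed or the thread was interrupted.
    boolean awaitCheckpoints(int target, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (completed.get() < target) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of waiting forever
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

With a guard like this, a test whose checkpoints are all aborted would fail after the timeout rather than hang until the CI job is killed.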




[GitHub] [iceberg] kbendick commented on pull request #3106: FLINK - Add tolerableCheckpointFailureNumber to debug CI timeouts

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3106:
URL: https://github.com/apache/iceberg/pull/3106#issuecomment-918547243


   I actually don't think this will work.
   
   The checkpoints are being aborted, not necessarily failing.
   
   There's more discussion here: https://issues.apache.org/jira/browse/FLINK-2491
   
   It seems as though the source finishes, so its task(s) shut down, while other parts are still checkpointing, so the checkpoint cannot complete as all tasks need to be up and running to acknowledge a checkpoint.
   
   I think we'll need to update `BoundedTestSource` and its checkpoint abort logic.
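
The behaviour described above can be sketched with a toy model: a checkpoint is aborted whenever any task is no longer running, and an abort is tracked separately from a failure. All class and method names below are hypothetical and are not Flink classes. That aborts do not count toward the tolerable failure number is the hypothesis from this thread, modelled here as an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model for illustration only; these are not Flink classes.
class ToyCheckpointCoordinator {
    enum TaskState { RUNNING, FINISHED }
    enum Outcome { COMPLETED, ABORTED }

    private final Map<String, TaskState> tasks = new LinkedHashMap<>();
    private int abortedCount = 0;
    // Assumption from the thread: aborts never increment this counter,
    // so a tolerable-failure threshold would never be reached.
    private int failureCount = 0;

    void setState(String task, TaskState state) {
        tasks.put(task, state);
    }

    // A checkpoint completes only if every task is still RUNNING.
    Outcome triggerCheckpoint() {
        for (TaskState state : tasks.values()) {
            if (state != TaskState.RUNNING) {
                abortedCount++;
                return Outcome.ABORTED;
            }
        }
        return Outcome.COMPLETED;
    }

    int abortedCount() { return abortedCount; }
    int failureCount() { return failureCount; }
}

class CheckpointAbortDemo {
    public static void main(String[] args) {
        ToyCheckpointCoordinator coordinator = new ToyCheckpointCoordinator();
        coordinator.setState("source", ToyCheckpointCoordinator.TaskState.RUNNING);
        coordinator.setState("sink", ToyCheckpointCoordinator.TaskState.RUNNING);
        System.out.println(coordinator.triggerCheckpoint());

        // Once the source finishes, every later checkpoint is aborted.
        coordinator.setState("source", ToyCheckpointCoordinator.TaskState.FINISHED);
        System.out.println(coordinator.triggerCheckpoint());
        System.out.println("aborted=" + coordinator.abortedCount()
            + " failures=" + coordinator.failureCount());
    }
}
```

In this model, anything that waits for a completed checkpoint after the source has finished never makes progress, and raising a failure threshold changes nothing because the failure counter stays at zero.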




[GitHub] [iceberg] kbendick closed pull request #3106: [WIP] FLINK - Add tolerableCheckpointFailureNumber to debug CI timeouts

Posted by GitBox <gi...@apache.org>.
kbendick closed pull request #3106:
URL: https://github.com/apache/iceberg/pull/3106


   




[GitHub] [iceberg] stevenzwu commented on pull request #3106: [WIP] FLINK - Add tolerableCheckpointFailureNumber to debug CI timeouts

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on pull request #3106:
URL: https://github.com/apache/iceberg/pull/3106#issuecomment-918655645


   @kbendick thx a lot for investigating this issue. 
   
   I added the `testTwoSinksInDisjointedDAG` test probably a few weeks ago. It seems that we may have an infinite wait problem in the `BoundedTestSource`. Maybe we should ignore this test for now so that we can stop the bleeding first while we investigate the right fix.




[GitHub] [iceberg] kbendick commented on pull request #3106: [WIP] FLINK - Add tolerableCheckpointFailureNumber to debug CI timeouts

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3106:
URL: https://github.com/apache/iceberg/pull/3106#issuecomment-918664734


   We've decided to just ignore this test for now, so I'm closing this PR in favor of this one instead: https://github.com/apache/iceberg/pull/3110

