Posted to commits@hudi.apache.org by "xccui (via GitHub)" <gi...@apache.org> on 2023/03/08 02:07:58 UTC

[GitHub] [hudi] xccui opened a new issue, #8120: [SUPPORT] Flink Job was stuck on `AbstractStreamWriteFunction.instantToWrite()` after recovering

xccui opened a new issue, #8120:
URL: https://github.com/apache/hudi/issues/8120

   Occasionally, our Flink job got stuck in `AbstractStreamWriteFunction.instantToWrite()` when recovering from a checkpoint. All the `stream_write` functions ended up spinning in the loop below and never left it.
   ```
   while (confirming) {
     // wait condition:
     // 1. there is no inflight instant
     // 2. the inflight instant does not change and the checkpoint has buffering data
     if (instant == null || invalidInstant(instant, hasData)) {
       // sleep for a while
       timeWait.waitFor();
       // refresh the inflight instant
       instant = lastPendingInstant();
     } else {
       // the pending instant changed, that means the last instant was committed
       // successfully.
       confirming = false;
     }
   }
   ```
   I checked the `ckp_meta/` folder of each table, and all of them were empty. It seems that the `stream_write` functions tried to fetch the instant files (in a snapshot) before the `StreamWriteOperatorCoordinator` had written them. I'm not sure whether it's related, but the checkpoint interval was set to only 30s because this is a dev environment.
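
To illustrate why an empty `ckp_meta/` folder keeps the loop above spinning, here is a minimal, self-contained sketch of the same polling pattern with a deadline added. It is not Hudi code: `BoundedInstantWait`, `awaitNewInstant`, and the `lookup` supplier are hypothetical names, and the check "instant is missing or unchanged" is a simplified stand-in for `instant == null || invalidInstant(instant, hasData)`. The point is only that when the lookup never returns a new instant, a bounded wait gives up and reports it, while an unbounded one hangs.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Hudi: polls for a new pending instant with a deadline. */
final class BoundedInstantWait {

    private BoundedInstantWait() {}

    /**
     * Polls {@code lookup} until it returns an instant that is non-null and differs from
     * {@code lastInstant}, or until {@code timeout} elapses.
     *
     * @return the new instant, or empty if the deadline passed or the thread was interrupted
     */
    static Optional<String> awaitNewInstant(
            Supplier<String> lookup, String lastInstant, Duration pollInterval, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            // Refresh the pending instant (stand-in for lastPendingInstant()).
            String instant = lookup.get();
            if (instant != null && !instant.equals(lastInstant)) {
                // The pending instant changed, so the writer can proceed.
                return Optional.of(instant);
            }
            if (System.nanoTime() - deadline >= 0) {
                // Nothing was ever published: give up instead of spinning forever.
                return Optional.empty();
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }
}
```

With a lookup that always returns `null` (the situation an empty `ckp_meta/` folder would produce under the assumption above), the call returns an empty `Optional` once the timeout passes, so the caller could fail the task and let Flink restart it rather than block indefinitely.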
   
   A workaround is to force-restart the JobManager.
   
   **Environment Description**
   
   * Hudi version : 0.12.2
   
   * Flink version : 1.14.4
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.4
   
   * Storage (HDFS/S3/GCS..) : s3a
   
   
   **Additional context**
   
   Metadata table was not enabled.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on issue #8120: [SUPPORT] Flink Job was stuck on `AbstractStreamWriteFunction.instantToWrite()` after recovering

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8120:
URL: https://github.com/apache/hudi/issues/8120#issuecomment-1459633851

   We changed the strategy for ckp metadata in 0.13.x: https://github.com/apache/hudi/pull/7620. Maybe that will solve your problem.




[GitHub] [hudi] xccui commented on issue #8120: [SUPPORT] Flink Job was stuck on `AbstractStreamWriteFunction.instantToWrite()` after recovering

Posted by "xccui (via GitHub)" <gi...@apache.org>.
xccui commented on issue #8120:
URL: https://github.com/apache/hudi/issues/8120#issuecomment-1460150659

   Thanks, @danny0405. I'll try that version and see if it solves the problem.




[GitHub] [hudi] xccui commented on issue #8120: [SUPPORT] Flink Job was stuck on `AbstractStreamWriteFunction.instantToWrite()` after recovering

Posted by "xccui (via GitHub)" <gi...@apache.org>.
xccui commented on issue #8120:
URL: https://github.com/apache/hudi/issues/8120#issuecomment-1587722160

   We haven't hit the same problem recently, so I'll mark this as resolved. Thanks!




[GitHub] [hudi] xccui closed issue #8120: [SUPPORT] Flink Job was stuck on `AbstractStreamWriteFunction.instantToWrite()` after recovering

Posted by "xccui (via GitHub)" <gi...@apache.org>.
xccui closed issue #8120: [SUPPORT] Flink Job was stuck on `AbstractStreamWriteFunction.instantToWrite()` after recovering
URL: https://github.com/apache/hudi/issues/8120

