Posted to commits@hudi.apache.org by "tpcross (via GitHub)" <gi...@apache.org> on 2023/04/29 06:07:52 UTC

[GitHub] [hudi] tpcross commented on issue #8584: [SUPPORT] Spark SQL query FileNotFoundException using cleaner policy KEEP_LATEST_BY_HOURS

tpcross commented on issue #8584:
URL: https://github.com/apache/hudi/issues/8584#issuecomment-1528678707

   
   Thanks for taking a look. I'll see if I can get the full .hoodie directory.
   I'll also check the ingestion job's time zone config: whether it is set to UTC or local time (UTC+10), and whether it has changed.
   
   The query started at 06:01 UTC on 21 April 2023.
   Yes, the file group of interest (994d5334-bc27-439b-89a9-3f129f658c90) had no new slices between 20221123052731868 and 20230421070656147.
   
   
   Headers from the clean at 07:12 (instant 20230421071249337), which followed the delta commit at 07:06:
   
   ```
     "earliestInstantToRetain": {
       "timestamp": "20230421013114885",
       "action": "deltacommit",
       "state": "COMPLETED"
     },
     "lastCompletedCommitTimestamp": "20230421070656147",
     "policy": "KEEP_LATEST_BY_HOURS",
   ```
   
   ```
     "startCleanTime": "20230421071249337",
     "timeTakenInMillis": 220976,
     "totalFilesDeleted": 10193,
     "earliestCommitToRetain": "20230421013114885",
     "lastCompletedCommitTimestamp": "20230421070656147",
   ```
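For reference, `KEEP_LATEST_BY_HOURS` derives `earliestCommitToRetain` by subtracting the configured window (`hoodie.cleaner.hours.retained`) from a reference instant; which instant is used as the reference depends on the Hudi version, so the sketch below is illustrative arithmetic only, not the actual `CleanPlanner` code. It works on Hudi's millisecond instant format (`yyyyMMddHHmmssSSS`):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class RetainCutoff {
  // Parse only the second-granularity prefix; carry the millisecond
  // suffix through unchanged to avoid fractional-second parsing quirks.
  static final DateTimeFormatter SECONDS_FMT =
      DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

  // Illustrative: subtract the retention window (in hours) from a
  // reference instant to get the earliest commit to retain.
  static String cutoff(String instant, int retainHours) {
    LocalDateTime t = LocalDateTime.parse(instant.substring(0, 14), SECONDS_FMT);
    return t.minusHours(retainHours).format(SECONDS_FMT) + instant.substring(14);
  }

  public static void main(String[] args) {
    // e.g. the clean's start instant with a hypothetical 6-hour window
    System.out.println(cutoff("20230421071249337", 6)); // 20230421011249337
  }
}
```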
   
   
   What I mean by irregular commits is that there can be long periods with no activity (insert/update/delete) on the source table: for example, a period with only one commit per day, followed by a period with 48 commits in 4 hours (one every 5 minutes).
   The table is also partitioned by customer (28 partitions in total), so partitions have different activity levels.
   
   So when using KEEP_LATEST_COMMITS with 10 commits retained, the period kept varied from 90 minutes to several days, which is why I changed to KEEP_LATEST_BY_HOURS.
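The variance is just commit-cadence arithmetic: KEEP_LATEST_COMMITS fixes the number of versions kept, so the wall-clock window it spans scales with the commit interval. A tiny illustration (numbers chosen to match the cadences above, not measurements from this table):

```java
public class RetainedWindow {
  // Rough model: retained wall-clock time ~= (versions kept - 1) * commit interval.
  static double retainedHours(int commitsKept, double commitIntervalMinutes) {
    return (commitsKept - 1) * commitIntervalMinutes / 60.0;
  }

  public static void main(String[] args) {
    System.out.println(retainedHours(10, 5));    // busy: one commit / 5 min -> 0.75 h
    System.out.println(retainedHours(10, 1440)); // quiet: one commit / day  -> 216 h (~9 days)
  }
}
```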
   
   I was thinking that for this kind of data pattern, it might help avoid the FileNotFoundException to retain one extra older slice, in case it is still in use by a running query, on the assumption that it would get cleaned up the next time there is a newer commit:
   
   ```
   diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
   index 64e69b1d2a..99fadc0bc0 100644
   --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
   +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
   @@ -341,8 +341,9 @@ public class CleanPlanner<T extends HoodieRecordPayload, I, K, O> implements Ser
                }
              } else if (policy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
                // This block corresponds to KEEP_LATEST_BY_HOURS policy
   -            // Do not delete the latest commit.
   -            if (fileCommitTime.equals(lastVersion)) {
    +            // Don't delete the latest commit, and also keep the last commit before the
    +            // earliest commit we are retaining
   +            if (fileCommitTime.equals(lastVersion) || (fileCommitTime.equals(lastVersionBeforeEarliestCommitToRetain))) {
                  // move on to the next file
                  continue;
                }
   ```
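The patch references `lastVersionBeforeEarliestCommitToRetain`, which isn't defined in the hunk. One way it could be computed (a hypothetical helper, not existing Hudi code) is to take, per file group, the latest commit time strictly before `earliestCommitToRetain`. Hudi instant timestamps sort correctly as strings, so plain `String.compareTo` suffices:

```java
import java.util.List;
import java.util.Optional;

public class LastVersionBeforeRetain {
  // Hypothetical helper: given a file group's commit times in ascending
  // order, return the latest one strictly before the retention cutoff.
  static Optional<String> lastVersionBefore(List<String> commitTimes,
                                            String earliestToRetain) {
    return commitTimes.stream()
        .filter(t -> t.compareTo(earliestToRetain) < 0)
        .reduce((a, b) -> b); // input is ascending, so the last match wins
  }

  public static void main(String[] args) {
    // Using the slices from the file group above: the 2022 slice is the
    // last version before the cutoff, so it would be retained one cycle.
    System.out.println(lastVersionBefore(
        List.of("20221123052731868", "20230421070656147"),
        "20230421013114885").orElse("none")); // 20221123052731868
  }
}
```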
   
   The alternative would be to go back to KEEP_LATEST_COMMITS, but I would need to set a large number of commits to retain, and presumably also increase hoodie.keep.min.commits?
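For what it's worth, that alternative might look something like the fragment below (values are illustrative only; the one hard constraint I'm aware of is that hoodie.keep.min.commits must be greater than hoodie.cleaner.commits.retained, since archival must not remove commits the cleaner still needs):

```properties
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
# Illustrative: enough commits to cover the busiest cadence (48 in 4 hours)
hoodie.cleaner.commits.retained=48
# Archival bounds must sit above the cleaner's retention count
hoodie.keep.min.commits=50
hoodie.keep.max.commits=60
```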
   

