Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/14 23:54:06 UTC

[GitHub] [hudi] hussein-awala opened a new issue, #6953: [SUPPORT] cleaner incremental mode doesn't work if there is no file to delete in the previous clean

hussein-awala opened a new issue, #6953:
URL: https://github.com/apache/hudi/issues/6953

   **Describe the problem you faced**
   
   As I understand it, the [CleanPlanner](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java) builds the list of files to delete by listing the files in each partition and checking which of them can be removed under the configured cleaner policy. If incremental cleaner mode is enabled and metadata from a previous clean operation is present in the timeline, it reads the starting instant of that previous clean from the avro file and checks only [the partitions that have changed since the last clean](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L171).
   
   But if, after listing the files in the partitions (all of them by brute force the first time, or only the partitions that have changed since a previous clean), there is no file to delete, the CleanPlanActionExecutor [will not create a clean request](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java#L149-L156) avro file, so the state is never transitioned to inflight or completed by the [CleanActionExecutor](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java#L195). In the next clean, we check the same partitions that we already checked in this clean, even if they haven't changed since then, so we get no benefit from the incremental cleaner mode.
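
The behaviour described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Hudi code: `ToyTimeline`, `partitions_to_scan`, `plan_clean`, and the `record_empty_plans` flag are hypothetical names invented for illustration, and the real planner tracks more state than a single instant.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ToyTimeline:
    """Toy stand-in for a timeline; hypothetical, not the Hudi API."""
    # instant time -> partitions written by that commit
    commits: Dict[int, Set[str]] = field(default_factory=dict)
    # instant recorded by the last clean that left metadata behind, if any
    last_clean_instant: Optional[int] = None


def partitions_to_scan(timeline: ToyTimeline, all_partitions: Set[str]) -> Set[str]:
    """Incremental mode: scan only partitions changed since the recorded clean."""
    if timeline.last_clean_instant is None:
        # No clean metadata in the timeline: fall back to a full listing.
        return set(all_partitions)
    changed: Set[str] = set()
    for instant, parts in timeline.commits.items():
        if instant > timeline.last_clean_instant:
            changed |= parts
    return changed


def plan_clean(timeline: ToyTimeline, all_partitions: Set[str], found_files: bool,
               now: int, record_empty_plans: bool) -> Set[str]:
    """Return the partitions scanned; record the clean only if something is written."""
    scanned = partitions_to_scan(timeline, all_partitions)
    if found_files or record_empty_plans:
        timeline.last_clean_instant = now
    return scanned


def run(record_empty_plans: bool, commits: int = 5) -> List[int]:
    """Write one new partition per commit; no clean ever finds a file to delete."""
    timeline = ToyTimeline()
    partitions: Set[str] = set()
    scanned_counts: List[int] = []
    for instant in range(1, commits + 1):
        partition = f"p{instant}"
        timeline.commits[instant] = {partition}
        partitions.add(partition)
        scanned = plan_clean(timeline, partitions, found_files=False,
                             now=instant, record_empty_plans=record_empty_plans)
        scanned_counts.append(len(scanned))
    return scanned_counts


print(run(record_empty_plans=False))  # scan grows with the table
print(run(record_empty_plans=True))   # scan stays at one partition per clean
```

In this model, skipping the metadata for empty plans makes every clean fall back to a full listing, while recording it keeps each scan limited to the newly written partition.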
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Activate the incremental cleaner mode, enable cleaning, and use the KEEP_LATEST_COMMITS policy with 1 retained commit
   2. Write data to 10 different partitions in 10 different commits
   3. Check whether a clean avro metadata file is created in the timeline
   4. Compare the time taken to list the files to delete across the different commits
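
Step 1 could be expressed as writer options along the following lines. This is a sketch, not the reporter's actual configuration: the keys are the cleaner settings as I understand them from the Hudi configuration reference and should be checked against the version in use, and the helper name and table name are hypothetical.

```python
def repro_cleaner_options(table_name: str) -> dict:
    """Build writer options for the reproduction (helper name is hypothetical)."""
    return {
        "hoodie.table.name": table_name,
        # Run the cleaner as part of each write.
        "hoodie.clean.automatic": "true",
        # Only inspect partitions changed since the last recorded clean.
        "hoodie.cleaner.incremental.mode": "true",
        # Retain file slices needed by the latest commit only.
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": "1",
    }


options = repro_cleaner_options("cleaner_repro")  # hypothetical table name
for key, value in sorted(options.items()):
    print(f"{key}={value}")
```

With PySpark, such a dictionary would typically be passed as `df.write.format("hudi").options(**options)`, together with the usual record key and partition path settings.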
   
   **Expected behavior**
   
   A `.clean` avro metadata file should be created and added to the timeline with the clean start time, so that the next clean can use it and avoid re-checking all the partitions.
   
   **Environment Description**
   
   * Hudi version : 0.12.0
   
   * Spark version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   **Additional context**
   
   I already checked the recent PRs that improve cleaning ([PR1](https://github.com/apache/hudi/pull/6890) and [PR2](https://github.com/apache/hudi/pull/6548)), but they don't solve this problem.
   I'm willing to submit a PR to fix the problem once it is confirmed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #6953: [SUPPORT] cleaner incremental mode doesn't work if there is no file to delete in the previous clean

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6953:
URL: https://github.com/apache/hudi/issues/6953#issuecomment-1283143573

   Thanks for raising this. It looks like a good enhancement to have. I have filed a tracking ticket here: https://issues.apache.org/jira/browse/HUDI-5053
   If you see a latency hit from the clean planning phase, you can consider increasing the value of https://hudi.apache.org/docs/configurations/#hoodiecleanmaxcommits
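
As a rough illustration of why raising this setting helps, assume (a simplification) that clean planning is attempted exactly once every `hoodie.clean.max.commits` commits. The function name below is hypothetical and the value 24 is only an example.

```python
def planning_attempts(total_commits: int, clean_max_commits: int) -> int:
    """Number of clean planning attempts under the once-every-N-commits assumption."""
    if clean_max_commits < 1:
        raise ValueError("clean_max_commits must be at least 1")
    return total_commits // clean_max_commits


# Example writer option; 24 is an illustrative value, not a recommendation.
workaround_options = {"hoodie.clean.max.commits": "24"}

n = int(workaround_options["hoodie.clean.max.commits"])
print(planning_attempts(240, 1))  # one planning attempt per commit
print(planning_attempts(240, n))  # far fewer planning attempts
```

This reduces how often the partition listing runs, but each planning attempt still pays the full listing cost until the underlying issue is fixed.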
   
   




[GitHub] [hudi] nsivabalan closed issue #6953: [SUPPORT] cleaner incremental mode doesn't work if there is no file to delete in the previous clean

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #6953: [SUPPORT] cleaner incremental mode doesn't work if there is no file to delete in the previous clean
URL: https://github.com/apache/hudi/issues/6953




[GitHub] [hudi] nsivabalan commented on issue #6953: [SUPPORT] cleaner incremental mode doesn't work if there is no file to delete in the previous clean

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6953:
URL: https://github.com/apache/hudi/issues/6953#issuecomment-1287490671

   done! 




[GitHub] [hudi] hussein-awala commented on issue #6953: [SUPPORT] cleaner incremental mode doesn't work if there is no file to delete in the previous clean

Posted by GitBox <gi...@apache.org>.
hussein-awala commented on issue #6953:
URL: https://github.com/apache/hudi/issues/6953#issuecomment-1284533076

   Thank you! Yes, we already set `CLEAN_MAX_COMMITS` to 24 to work around this problem.  
   Could you please assign the ticket to [me](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hussein-awala)? I will submit a PR this weekend.

