Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2023/01/04 07:26:30 UTC

[GitHub] [hudi] SabyasachiDasTR opened a new issue, #7600: Hoodie clean is not deleting old files for MOR table

SabyasachiDasTR opened a new issue, #7600:
URL: https://github.com/apache/hudi/issues/7600

   **Describe the problem you faced**
   
   We incrementally upsert data into our Hudi tables every 5 minutes.
   We have set CLEANER_POLICY to KEEP_LATEST_BY_HOURS with CLEANER_HOURS_RETAINED = 48.
   
   The old delta log files in our partitions from 2 months ago have still not been cleaned, and the CLI shows that the last cleanup happened 2 months ago, in November. I do not see any action being taken to clean the old log files. The only command we execute is upsert; we have a single writer, and compaction runs every hour.
   We think this is causing our EMR job to underperform and crash repeatedly, because a very large number of delta log files are piling up in the partitions and compaction tries to read them while processing the job.
   
   ![MicrosoftTeams-image (33)](https://user-images.githubusercontent.com/52735405/210500715-89227935-b74a-418a-9701-5b783c56a74e.png)
   
   **Options used during Upsert:**
   ![HudiOptionsLatest](https://user-images.githubusercontent.com/52735405/210503366-77d47c7c-169f-4a87-8234-0971079a9347.PNG)
   
   **Writing to s3**
   ![Upsertcmd](https://user-images.githubusercontent.com/52735405/210501558-28eb3712-fed8-4c93-9c85-ccb6ef3521dc.PNG)
   Partition structure: s3://bucket/table/partition/parquet and .log files
   
   **Expected behavior**
   As I understand it, log files older than CLEANER_HOURS_RETAINED (48 hours, i.e. 2 days) should be deleted.
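
To make that expectation concrete, here is a minimal Python sketch of the comparison the reporter has in mind. It is not Hudi's cleaner code. It assumes instants use the `yyyyMMddHHmmssSSS` format seen in the logs in this thread and that "older than the retention window" means earlier than the current instant minus the retained hours. The function names and the three sample instants are hypothetical.

```python
from datetime import datetime, timedelta

def parse_instant(instant):
    """Parse an instant string: 14 digits of yyyyMMddHHmmss plus optional 3-digit millis."""
    base = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    millis = int(instant[14:17]) if len(instant) >= 17 else 0
    return base + timedelta(milliseconds=millis)

def older_than_retention(instants, now_instant, hours_retained):
    """Return the instants that fall before (now - hours_retained)."""
    cutoff = parse_instant(now_instant) - timedelta(hours=hours_retained)
    return [i for i in instants if parse_instant(i) < cutoff]

# "now" is the clean instant that appears in the logs quoted in this thread.
now = "20230109114759346"
# Hypothetical commit instants, made up for illustration.
instants = ["20221105093000000", "20230107114759345", "20230108120000000"]

old = older_than_retention(instants, now, hours_retained=48)
print(old)
```

Under these assumptions only the first two instants fall outside the 48-hour window; whether the cleaner can actually act on them depends on the timeline state discussed in the replies.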
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.2.1
   
   * Hive version : Hive not installed on EMR cluster emr-6.7.0
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] koochiswathiTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
koochiswathiTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1384830525

   09T11:48:01.308+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.table.action.clean.CleanPlanner] [CleanPlanner]: Nothing to clean here. It is already clean
   
   Looking at this log, CleanPlanner.getPartitionPathsForCleanByCommits is not returning any list, so cleanup is not triggered.
   
   @xushiyan @nsivabalan Please help here.




[GitHub] [hudi] koochiswathiTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
koochiswathiTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1383508743

   @xushiyan, we are badly missing our SLAs because of the growing number of log files, and the accumulated data size is more than 120 TB.
   Any help is much appreciated.




[GitHub] [hudi] koochiswathiTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by "koochiswathiTR (via GitHub)" <gi...@apache.org>.
koochiswathiTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1411949976

   Hi @umehrot2,
   
   Below are the cleanup config changes.
   We process a batch every 5 minutes.
   5-minute ingestion gives 12 delta commits per hour and 288 (12 * 24) delta commits per day.
   Compaction runs every hour, which adds 24 commits per day.
   Total commits per day = delta commits + compaction commits = 288 + 24 = 312.
   We want to retain 3 days of commits: 312 * 3 = 936 commits.
   Minimum commits to keep is set to 937 (936 + 1).
   Maximum commits to keep is set to 960 (936 + 24).
   
   HoodieCompactionConfig.CLEANER_POLICY.key() -> HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name(),
   HoodieCompactionConfig.CLEANER_COMMITS_RETAINED.key() -> "936",
   HoodieCompactionConfig.MIN_COMMITS_TO_KEEP.key() -> "937", // CLEANER_COMMITS_RETAINED + 1
   HoodieCompactionConfig.MAX_COMMITS_TO_KEEP.key() -> "960", // CLEANER_COMMITS_RETAINED + 24
   
   Please let us know your thoughts on this.
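
The arithmetic in this comment can be reproduced with a short sketch. It is written in Python purely for illustration; the function name and its parameters are hypothetical. The string keys are the Hudi option names that, as far as we know, sit behind the `HoodieCompactionConfig` constants used here, so treat them as an assumption and check them against the configuration reference for your Hudi version.

```python
def retention_settings(ingest_interval_min, compactions_per_day, days_retained, max_slack=24):
    """Derive cleaner and archival commit counts from the ingestion cadence."""
    delta_commits_per_day = (60 // ingest_interval_min) * 24   # 12 per hour * 24
    commits_per_day = delta_commits_per_day + compactions_per_day
    retained = commits_per_day * days_retained
    return {
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": str(retained),
        # Archival bounds are kept above the cleaner's retention count,
        # mirroring the "+ 1" and "+ 24" choices in the comment.
        "hoodie.keep.min.commits": str(retained + 1),
        "hoodie.keep.max.commits": str(retained + max_slack),
    }

settings = retention_settings(ingest_interval_min=5, compactions_per_day=24, days_retained=3)
for key, value in settings.items():
    print(f"{key} = {value}")
```

Changing `days_retained` or the ingestion interval recomputes all three counts together, which avoids the archival limits drifting below the cleaner's retention when the cadence changes.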
   	




[GitHub] [hudi] victorxiang30 commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by "victorxiang30 (via GitHub)" <gi...@apache.org>.
victorxiang30 commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1696869127

   > @SabyasachiDasTR @koochiswathiTR The issue here is similar to #3739. I believe what is happening is that you have set CLEANER_HOURS_RETAINED to 2 days (48 hours), but meanwhile archival is running more aggressively. By default, archival keeps a maximum of 30 commits in the active timeline - https://hudi.apache.org/docs/0.11.1/configurations#hoodiekeepmaxcommits. Hence, in your case, by the time the cleaner runs and tries to clean up commits older than 2 days, those commits have already been archived. So even though the cleaner is scheduled, it does not find anything to clean, based on the logs you have provided.
   > 
   > If you want to continue with your current cleaner config, you should set https://hudi.apache.org/docs/0.11.1/configurations#hoodiekeepmaxcommits to be higher than the number of commits you have in a span of 2 days. Essentially, you want the cleaner to run at a higher frequency than archival.
   > 
   > As for cleaning the existing data, you should disable https://hudi.apache.org/docs/configurations/#hoodiecleanerincrementalmode while running the clean manually. This is needed because, in your case, you want the cleaner to go back in time and clean dangling files that are older than the last time the cleaner ran.
   
   Hi, what should I do if the cleaner runs out of memory (OOM) after I disable incremental mode?




[GitHub] [hudi] umehrot2 commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by "umehrot2 (via GitHub)" <gi...@apache.org>.
umehrot2 commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1416482538

   @koochiswathiTR Yes, the configs seem fine to me. Let us know if they helped.




[GitHub] [hudi] ad1happy2go commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1618920425

   @umehrot2 @koochiswathiTR Were you able to resolve this with those configs? Please let us know if you need any other help.




[GitHub] [hudi] SabyasachiDasTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
SabyasachiDasTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1373515682

   We would also like to understand the impact of running the "cleans run" command manually from the CLI. We have verified that compaction and commits are still working, but cleanup is not triggering automatically after them. If we run "cleans run" manually from the CLI, will it impact the data?




[GitHub] [hudi] xushiyan commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1374513596

   @SabyasachiDasTR Have you observed any errors or warnings in the logs? It's likely that something is blocking the clean or causing it to fail. Can you search the logs for any statements related to "clean"? It looks like cleaning simply stopped at some point.
   
   Yes, you can use the CLI to trigger a clean manually; it won't impact the data. If you want to be cautious, you can try it against a clone of the table first. If something is causing the clean to fail, the manual run will fail the same way, so we still need to check the logs.
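
One way to do that log search is a small filter script. The following is only a sketch in Python: it assumes the line layout seen in the log excerpts quoted in this thread (timestamp, level, correlation id, logger, class, message), and the function name and the third sample line are made up for illustration.

```python
import re

# Layout assumed from the excerpts in this thread:
# <ts> [LEVEL] [correlation id] [logger] [class]: message
LOG_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"\[(?P<corr>[^\]]*)\] "
    r"\[(?P<logger>[^\]]*)\] "
    r"\[(?P<cls>[^\]]*)\]: "
    r"(?P<msg>.*)"
)

def clean_related(lines):
    """Return (timestamp, level, message) for lines whose logger or message mentions 'clean'."""
    hits = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m and "clean" in (m["logger"] + " " + m["msg"]).lower():
            hits.append((m["ts"], m["level"], m["msg"]))
    return hits

sample = [
    "stderr.2023-01-09-10:2023-01-09T11:47:59.346+0000 [INFO] [1673249876388qa_correlation_id] "
    "[org.apache.hudi.client.BaseHoodieWriteClient] [BaseHoodieWriteClient]: Start to clean synchronously.",
    "stderr.2023-01-09-10:2023-01-09T11:48:01.308+0000 [INFO] [1673249876388qa_correlation_id] "
    "[org.apache.hudi.table.action.clean.CleanPlanner] [CleanPlanner]: Nothing to clean here. It is already clean",
    # Hypothetical unrelated line, invented for this example.
    "stderr.2023-01-09-10:2023-01-09T11:48:02.000+0000 [INFO] [1673249876388qa_correlation_id] "
    "[com.example.Job] [Job]: Batch finished",
]

hits = clean_related(sample)
for ts, level, msg in hits:
    print(ts, level, msg)
```

Because the filter keeps every level, INFO lines such as "Nothing to clean here" are not lost when there are no ERROR or WARN entries.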




[GitHub] [hudi] SabyasachiDasTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
SabyasachiDasTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1381812503

   Hi @xushiyan, any thoughts on the above logs?




[GitHub] [hudi] koochiswathiTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
koochiswathiTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1386920699

   Any update on this?
   




[GitHub] [hudi] danny0405 commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1396467854

   cc @nsivabalan, can you take a look?




[GitHub] [hudi] koochiswathiTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by "koochiswathiTR (via GitHub)" <gi...@apache.org>.
koochiswathiTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1399865531

   @nsivabalan  Any update on this?




[GitHub] [hudi] nsivabalan commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1404633967

   Sorry, this fell off my radar. Are you folks in the general channel of the Hudi Slack workspace? Let's connect there. We might need to inspect the timeline to see what's going on.
   




[GitHub] [hudi] umehrot2 commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by "umehrot2 (via GitHub)" <gi...@apache.org>.
umehrot2 commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1405874402

   @SabyasachiDasTR @koochiswathiTR The issue here is similar to https://github.com/apache/hudi/issues/3739. I believe what is happening is that you have set CLEANER_HOURS_RETAINED to 2 days (48 hours), but meanwhile archival is running more aggressively. By default, archival keeps a maximum of 30 commits in the active timeline - https://hudi.apache.org/docs/0.11.1/configurations#hoodiekeepmaxcommits. Hence, in your case, by the time the cleaner runs and tries to clean up commits older than 2 days, those commits have already been archived. So even though the cleaner is scheduled, it does not find anything to clean, based on the logs you have provided.
   
   If you want to continue with your current cleaner config, you should set https://hudi.apache.org/docs/0.11.1/configurations#hoodiekeepmaxcommits to be higher than the number of commits you have in a span of 2 days. Essentially, you want the cleaner to run at a higher frequency than archival.
   
   As for cleaning the existing data, you should disable https://hudi.apache.org/docs/configurations/#hoodiecleanerincrementalmode while running the clean manually. This is needed because, in your case, you want the cleaner to go back in time and clean dangling files that are older than the last time the cleaner ran.
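
To put rough numbers on the mismatch, here is a small Python sketch. It assumes the cadence reported in this issue (one delta commit every 5 minutes and one compaction per hour) and that both kinds of commit count toward the archival limit; the function names are hypothetical, and this is back-of-the-envelope arithmetic rather than anything Hudi computes itself.

```python
def commits_per_hour(ingest_interval_min, compactions_per_hour=1):
    """Delta commits per hour plus compaction commits per hour."""
    return 60 // ingest_interval_min + compactions_per_hour

def hours_in_active_timeline(keep_max_commits, per_hour):
    """Approximate span of time covered by the active timeline."""
    return keep_max_commits / per_hour

def commits_in_window(hours_retained, per_hour):
    """Number of commits produced during the cleaner's retention window."""
    return hours_retained * per_hour

per_hour = commits_per_hour(5)                           # 12 delta commits + 1 compaction
default_span = hours_in_active_timeline(30, per_hour)    # default limit of 30 commits
needed = commits_in_window(48, per_hour)                 # 48-hour retention window

print(f"30 commits cover about {default_span:.1f} hours of history")
print(f"hoodie.keep.max.commits should exceed {needed}")
```

With these inputs the default limit spans a little over two hours, while a 48-hour window contains 624 commits, which is why the archival limit has to be raised well beyond its default for the hours-based policy to find anything to clean.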
   




[GitHub] [hudi] SabyasachiDasTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
SabyasachiDasTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1378651450

   Hi @xushiyan, we enabled Hudi debug logging and scanned all the container logs. We did not find any ERROR or WARN logs related to 'clean'. Below are the INFO logs; it looks like the cleaner is not able to find the point in time from which it has to clean.
   What could be the reason?
   
   FYI, we did try the 'cleans run' command on one of our tables; it executed successfully and cleaned a lot of files. But auto clean is still not triggering on any of the tables, which is causing the number of log files to keep growing.
   
   `stderr.2023-01-09-10:2023-01-09T11:47:59.346+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.client.BaseHoodieWriteClient] [BaseHoodieWriteClient]: Start to clean synchronously.
   stderr.2023-01-09-10:2023-01-09T11:48:00.062+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.client.BaseHoodieWriteClient] [BaseHoodieWriteClient]: Scheduling cleaning at instant time :20230109114759346
   stderr.2023-01-09-10:2023-01-09T11:48:01.308+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.table.action.clean.CleanPlanner] [CleanPlanner]: No earliest commit to retain. No need to scan partitions !!
   stderr.2023-01-09-10:2023-01-09T11:48:01.308+0000 [INFO] [1673249876388qa_correlation_id] [org.apache.hudi.table.action.clean.CleanPlanner] [CleanPlanner]: Nothing to clean here. It is already clean`
   
   According to the logs there is nothing to clean ("Nothing to clean here. It is already clean"), but we do see a lot of log files from 2 months ago.
   I have attached the general logs here.
   [AllErrorLogs.txt](https://github.com/apache/hudi/files/10392060/AllErrorLogs.txt)
   
   [AllWARNLogs.txt](https://github.com/apache/hudi/files/10392070/AllWARNLogs.txt)
   
   [HudiErrorLogs.txt](https://github.com/apache/hudi/files/10392074/HudiErrorLogs.txt)
   
   [HudiWARNLogs.txt](https://github.com/apache/hudi/files/10392075/HudiWARNLogs.txt)
   
   




[GitHub] [hudi] SabyasachiDasTR commented on issue #7600: Hoodie clean is not deleting old files for MOR table

Posted by GitBox <gi...@apache.org>.
SabyasachiDasTR commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1370638353

   @nsivabalan We referred to https://github.com/apache/hudi/issues/3739, but we are using a different CLEANER_POLICY config.
   Could you please treat this as high priority and suggest a fix, as this is failing our prod job?

