Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/13 09:41:34 UTC

[GitHub] [hudi] boneanxs opened a new issue, #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

boneanxs opened a new issue, #6938:
URL: https://github.com/apache/hudi/issues/6938

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   Say we have a Spark streaming job writing data to a Hudi table with `hoodie.clean.automatic` disabled, so that heavy clean operations cannot block the write speed, while a separate clustering job compacts old small files and optimizes the file layout.
   
   We also run an independent clean job (using `KEEP_LATEST_FILE_VERSIONS` to clean old file versions). However, the replacecommit can sometimes be archived by the streaming writer's archiver before the clean runs, so the clean job cannot find any old version files to delete, and the duplicates issue occurs.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start a streaming job with `hoodie.clean.automatic` set to false, `hoodie.keep.min.commits` set to 2, and `hoodie.keep.max.commits` set to 3 to reproduce the issue quickly.
   2. Start a clustering job to cluster some partitions.
   3. Wait for the streaming job to write enough commits to trigger the archiver; the replacecommits are archived along with the regular commits.
   4. Run the clean job: it cannot clean the files that were replaced by the clustering job, because the corresponding replacecommit no longer exists on the active timeline.
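   The streaming-writer settings in step 1 can be sketched as a Hudi options map (an illustrative sketch: the config keys are standard Hudi write configs, but the table name and values are chosen only to reproduce the issue quickly):
   
   ```python
   # Hudi write options for the streaming job in step 1 (illustrative values).
   hudi_options = {
       "hoodie.table.name": "test_table",           # hypothetical table name
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.clean.automatic": "false",           # disable inline cleaning (step 1)
       "hoodie.keep.min.commits": "2",              # aggressive archival to reproduce fast
       "hoodie.keep.max.commits": "3",
   }
   
   # The streaming writer would then apply these, e.g.:
   # df.writeStream.format("hudi").options(**hudi_options).start(base_path)
   ```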
   
   **Expected behavior**
   
   Since it is sometimes necessary to disable `hoodie.clean.automatic` to keep the streaming writer stable (the clean action can use a lot of driver memory to build the file system view, especially when there are many small files and `KEEP_LATEST_FILE_VERSIONS` mode is used), perhaps the archiver should be smarter about whether a replacecommit can be archived, by checking whether the replaced file IDs still exist in the table. If the original replaced files still exist, no clean action has deleted the old file versions yet, so the replacecommit should not be archived.
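   The proposed guard could look like the following pure-logic sketch (the function and parameter names are hypothetical; a real fix would read the replaced file IDs from the replacecommit metadata and the live file IDs from the table's file system view):
   
   ```python
   def can_archive_replacecommit(replaced_file_ids, live_file_ids):
       """Hypothetical archiver check: return True only if every file group
       replaced by this replacecommit is already gone from the table.
       If any replaced file ID is still live, the cleaner has not run yet,
       and archiving the replacecommit would leave the old file versions
       dangling (causing duplicates on read)."""
       return not (set(replaced_file_ids) & set(live_file_ids))
   ```
   
   With this check, the archiver would skip the replacecommit until a clean has actually removed the replaced file groups.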
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 3.1.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
boneanxs commented on issue #6938:
URL: https://github.com/apache/hudi/issues/6938#issuecomment-1286726879

   Yea, makes sense. Currently we still enable automatic clean and don't use `KEEP_LATEST_FILE_VERSIONS` (to avoid the potential driver memory issue).
   
   I'll close the issue since there is a follow-up ticket.
   Thank you all.
   
   




[GitHub] [hudi] boneanxs closed issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
boneanxs closed issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates
URL: https://github.com/apache/hudi/issues/6938




[GitHub] [hudi] boneanxs commented on issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
boneanxs commented on issue #6938:
URL: https://github.com/apache/hudi/issues/6938#issuecomment-1280359415

   @yihua Yea, identifying replaced file groups might be time consuming: we have to list the affected partitions to build a `FileSystemView` to get the replaced file groups. I'm thinking that if we use `HoodieMetadataFileSystemView`, the cost of the listing operation can be reduced a lot. Besides, one replace operation usually doesn't touch many partitions, so the time spent here may be acceptable (we could also run the check in parallel if many partitions are affected).
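   The per-partition parallelism mentioned above could be sketched like this (a hypothetical helper `list_file_groups` stands in for a metadata-table-backed file system view lookup; none of these names come from the Hudi codebase):
   
   ```python
   from concurrent.futures import ThreadPoolExecutor
   
   def replaced_groups_still_live(partitions, replaced_ids, list_file_groups, workers=4):
       """List each affected partition in parallel and return the subset of
       replaced file IDs that are still present in the table, i.e. file
       groups the cleaner has not removed yet."""
       replaced = set(replaced_ids)
       with ThreadPoolExecutor(max_workers=workers) as pool:
           per_partition = pool.map(list_file_groups, partitions)
           return {fid for fids in per_partition for fid in fids if fid in replaced}
   ```
   
   If the returned set is non-empty, the archiver would hold off on archiving that replacecommit.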
   
   By the way, maybe we can provide a basic/simple fix that at least addresses the issue (duplicates are actually a critical problem), and try to improve this logic in the long term.
   
   Do you think it's worth a try? I'd very much appreciate your suggestions!




[GitHub] [hudi] yihua commented on issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6938:
URL: https://github.com/apache/hudi/issues/6938#issuecomment-1278589842

   @boneanxs thanks for reporting this. I think we need to see how we can efficiently identify the replaced file groups. Currently, the archiver does not read any instant metadata, for efficiency.




[GitHub] [hudi] nsivabalan commented on issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6938:
URL: https://github.com/apache/hudi/issues/6938#issuecomment-1283157298

   Even setting replace commits aside, there is a more fundamental issue here.
   If you set automatic clean = false, the regular writer never triggers a clean at all, but the archiver will still go ahead and keep archiving timeline files. If you try to clean up much later with a separate clean process, it may not find some of the timeline files (since the archiver will have archived them), and hence it may fail to clean up some of the data files pertaining to those instants.
   
   For now, I would advise relaxing the archiver configs on the regular ingestion pipeline so there won't be dangling data files.
   
   I have created a follow up jira for us to work on this gap https://issues.apache.org/jira/browse/HUDI-5054 
   
   Let me know if you need any more pointers/clarifications. 

