Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/13 09:41:34 UTC

[GitHub] [hudi] boneanxs opened a new issue, #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

boneanxs opened a new issue, #6938:
URL: https://github.com/apache/hudi/issues/6938

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   Say we have a Spark streaming job writing data to a Hudi table with `hoodie.clean.automatic` disabled, so that heavy clean operations cannot block the write speed, while a separate clustering job compacts old small files and optimizes the file layout.
   
   We also run an independent clean job (using `KEEP_LATEST_FILE_VERSIONS` to clean old file versions). However, the replacecommit can sometimes be archived by the streaming writer's archiver before the clean runs, so the clean job cannot find any old version files to delete, and the duplicates issue occurs.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start a streaming job with `hoodie.clean.automatic` set to false, `hoodie.keep.min.commits` set to 2, and `hoodie.keep.max.commits` set to 3 to reproduce the issue quickly.
   2. Start a clustering job to cluster some partitions.
   3. Wait for the streaming job to write enough commits to trigger the archiver; the replacecommits are archived along with the regular commits.
   4. Run the clean job: it cannot clean the files that were replaced by the clustering job, because the corresponding replacecommit no longer exists on the active timeline.
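   The streaming-writer settings in step 1 can be sketched as a Hudi options map (an illustrative sketch: the config keys are standard Hudi write configs, but the table name and values are chosen only to reproduce the issue quickly):
   
   ```python
   # Hudi write options for the streaming job in step 1 (illustrative values).
   hudi_options = {
       "hoodie.table.name": "test_table",           # hypothetical table name
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.clean.automatic": "false",           # disable inline cleaning (step 1)
       "hoodie.keep.min.commits": "2",              # aggressive archival to reproduce fast
       "hoodie.keep.max.commits": "3",
   }
   
   # The streaming writer would then apply these, e.g.:
   # df.writeStream.format("hudi").options(**hudi_options).start(base_path)
   ```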
   
   **Expected behavior**
   
   Since it is sometimes necessary to disable `hoodie.clean.automatic` to keep the streaming writer stable (the clean action can use a lot of driver memory to build the file system view, especially when there are many small files and `KEEP_LATEST_FILE_VERSIONS` mode is used), perhaps the archiver should be smarter about whether a replacecommit can be archived, by checking whether the replaced file IDs still exist in the table. If the original replaced files still exist, no clean action has deleted the old file versions yet, so the replacecommit should not be archived.
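   The proposed guard could look like the following pure-logic sketch (the function and parameter names are hypothetical; a real fix would read the replaced file IDs from the replacecommit metadata and the live file IDs from the table's file system view):
   
   ```python
   def can_archive_replacecommit(replaced_file_ids, live_file_ids):
       """Hypothetical archiver check: return True only if every file group
       replaced by this replacecommit is already gone from the table.
       If any replaced file ID is still live, the cleaner has not run yet,
       and archiving the replacecommit would leave the old file versions
       dangling (causing duplicates on read)."""
       return not (set(replaced_file_ids) & set(live_file_ids))
   ```
   
   With this check, the archiver would skip the replacecommit until a clean has actually removed the replaced file groups.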
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 3.1.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] boneanxs commented on issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
boneanxs commented on issue #6938:
URL: https://github.com/apache/hudi/issues/6938#issuecomment-1286726879

   Yea, makes sense. Currently we still enable automatic clean and don't use `KEEP_LATEST_FILE_VERSIONS` (to avoid the potential driver memory issue).
   
   I'll close the issue since there is a follow-up ticket.
   Thank you all.
   
   




[GitHub] [hudi] boneanxs closed issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
boneanxs closed issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates
URL: https://github.com/apache/hudi/issues/6938




[GitHub] [hudi] boneanxs commented on issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
boneanxs commented on issue #6938:
URL: https://github.com/apache/hudi/issues/6938#issuecomment-1280359415

   @yihua Yea, identifying replaced file groups might be time consuming: we have to list the affected partitions to build a `FileSystemView` to get the replaced file groups. I'm thinking that if we use `HoodieMetadataFileSystemView`, the cost of the listing operation can be reduced a lot. Besides, one replace operation usually doesn't touch many partitions, so the time spent here may be acceptable (we could also run the check in parallel if many partitions are affected).
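   The per-partition parallelism mentioned above could be sketched like this (a hypothetical helper `list_file_groups` stands in for a metadata-table-backed file system view lookup; none of these names come from the Hudi codebase):
   
   ```python
   from concurrent.futures import ThreadPoolExecutor
   
   def replaced_groups_still_live(partitions, replaced_ids, list_file_groups, workers=4):
       """List each affected partition in parallel and return the subset of
       replaced file IDs that are still present in the table, i.e. file
       groups the cleaner has not removed yet."""
       replaced = set(replaced_ids)
       with ThreadPoolExecutor(max_workers=workers) as pool:
           per_partition = pool.map(list_file_groups, partitions)
           return {fid for fids in per_partition for fid in fids if fid in replaced}
   ```
   
   If the returned set is non-empty, the archiver would hold off on archiving that replacecommit.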
   
   By the way, maybe we can provide a basic/simple fix that at least addresses the issue (duplicates are actually a critical problem), and try to improve this logic in the long term.
   
   Do you think it's worth a try? I'd very much appreciate your suggestions!




[GitHub] [hudi] yihua commented on issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6938:
URL: https://github.com/apache/hudi/issues/6938#issuecomment-1278589842

   @boneanxs thanks for reporting this. I think we need to see how we can efficiently identify the replaced file groups. Currently, the archiver does not read any instant metadata, for efficiency.




[GitHub] [hudi] nsivabalan commented on issue #6938: [SUPPORT] HoodieTimelineArchiver could archive uncleaned replace commits causing duplicates

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6938:
URL: https://github.com/apache/hudi/issues/6938#issuecomment-1283157298

   Even setting replace commits aside, there is a more fundamental issue here.
   If you set automatic clean = false, the regular writer never triggers a clean at all, but the archiver will still go ahead and keep archiving timeline files. If you try to clean up much later with a separate clean process, it may not find some of the timeline files (since the archiver will have archived them), and hence it may fail to clean up some of the data files pertaining to those instants.
   
   For now, I would advise relaxing the archiver configs on the regular ingestion pipeline so there won't be dangling data files.
   
   I have created a follow up jira for us to work on this gap https://issues.apache.org/jira/browse/HUDI-5054 
   
   Let me know if you need any more pointers/clarifications. 

