Posted to dev@iceberg.apache.org by 李响 <wa...@gmail.com> on 2019/09/19 07:33:29 UTC

How to use newRewrite (and why deleteFiles is empty)

Dear community,

I am trying to re-write a couple of data files in a table, like

val fileToDelete1 = DataFiles.builder(partitionSpec)
  ...
  .withPath(delete_path_1)
  ...
  .build
val fileToDelete2 = DataFiles.builder(partitionSpec)
  ...
  .withPath(delete_path_2)
  ...
  .build
val fileToAdd = DataFiles.builder(partitionSpec)
  ...
  .withPath(add1)
  ...
  .build

table.newRewrite()
     .rewriteFiles(JavaConversions.setAsJavaSet(Set(fileToDelete1, fileToDelete2)),
                   JavaConversions.setAsJavaSet(Set(fileToAdd)))
     .commit()

It is rejected with:
Exception in thread "main" org.apache.iceberg.exceptions.ValidationException: Missing required files to delete: delete_path_1, delete_path_2
    at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:42)
    at org.apache.iceberg.MergingSnapshotProducer.apply(MergingSnapshotProducer.java:275)
    at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:146)
    at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:238)
    at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:403)
    at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:212)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:188)
    at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:237)
    at Test$.main(Test.scala:68)
    at Test.main(Test.scala)

The check in MergingSnapshotProducer (line 275
<https://github.com/apache/incubator-iceberg/blob/433f169e9d0b10688d395abde64c4b6461d35ca9/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L275>)
causes this: failMissingDeletePaths is true, and
deletedFiles.containsAll(deletePaths) is false because deletedFiles is empty.

My questions are:
(1) What does failMissingDeletePaths mean? Does it control whether the commit
fails when some of the delete paths are missing?
(2) How do I make deletedFiles non-empty? (I believe it is supposed to contain
the files to delete.)

I have been reading the code but have not figured it out yet. I would really
appreciate it if you could share your thoughts at your convenience. Thanks!
-- 

李响 Xiang Li

Cellphone: +86-136-8113-8972
E-mail: waterlx@gmail.com

Re: How to use newRewrite (and why deleteFiles is empty)

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi,

RewriteFiles implements a swap operation. For example, you might compact
file_1 and file_2 to produce file_3, then swap the files. The correctness
of this operation depends on both file_1 and file_2 still being in the table;
otherwise the compaction would un-delete or duplicate rows. That's why it
validates that each file is actually removed from the dataset.
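
To illustrate the validation, here is a minimal sketch in plain Scala. It is
a simplified model, not Iceberg's actual code: the object
RewriteValidationSketch, the rewrite function, and the use of path strings in
place of DataFile objects are all assumptions made for illustration.

```scala
// Simplified model of the swap check; NOT Iceberg's implementation.
// `tableFiles` stands in for the data file paths currently tracked by the table.
object RewriteValidationSketch {
  final case class ValidationException(msg: String) extends RuntimeException(msg)

  /** Swap `toDelete` for `toAdd`, failing if any file to delete is not in the table. */
  def rewrite(tableFiles: Set[String], toDelete: Set[String], toAdd: Set[String]): Set[String] = {
    // Files that were actually found in the table and can therefore be removed.
    val deletedFiles = tableFiles.intersect(toDelete)
    // Anything requested for deletion but not found makes the swap unsafe.
    val missing = toDelete.diff(deletedFiles)
    if (missing.nonEmpty) {
      throw ValidationException(
        s"Missing required files to delete: ${missing.toSeq.sorted.mkString(", ")}")
    }
    tableFiles.diff(toDelete).union(toAdd)
  }
}
```

In this model, swapping file_1 and file_2 for file_3 succeeds only when both
are present in tableFiles. If deletedFiles comes out empty, as in the
exception above, none of the requested paths matched a file in the table.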

If you want an idempotent delete, then you should use the DeleteFiles API,
which doesn't add this validation.
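
A minimal sketch of such a delete, assuming an already-loaded
org.apache.iceberg.Table and Iceberg on the classpath. The names
IdempotentDeleteSketch and deleteByPath are hypothetical; newDelete,
deleteFile, and commit come from the Table and DeleteFiles interfaces.

```scala
import org.apache.iceberg.Table

// Hypothetical helper for illustration; not part of Iceberg.
object IdempotentDeleteSketch {
  /** Removes the given data file paths from the table in a single commit. */
  def deleteByPath(table: Table, paths: Seq[String]): Unit = {
    val delete = table.newDelete()
    // Register each path for removal; no replacement files are added.
    paths.foreach(path => delete.deleteFile(path))
    delete.commit()
  }
}
```

Unlike the rewrite, this only removes files, so it does not need to verify
that every path was still present in the table.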

rb

-- 
Ryan Blue
Software Engineer
Netflix