You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@iceberg.apache.org by Zoltán Borók-Nagy <bo...@cloudera.com.INVALID> on 2021/01/29 18:14:13 UTC

Dynamic INSERT OVERWRITE

Hey everyone,

I'm currently working on the INSERT OVERWRITE statement for Iceberg tables
in Impala.

Seems like ReplacePartitions is the perfect interface for this job:
https://github.infra.cloudera.com/CDH/iceberg/blob/cdpd-master/api/src/main/java/org/apache/iceberg/ReplacePartitions.java

IIUC Spark also uses this interface for dynamic overwrites. Though the
class comment says that this interface is not recommended to use, and use
OverwriteFiles instead. OverwriteFiles is more generic and more explicit,
but for this task it would need extra boilerplate code, therefore more
possibilities to do stg wrong.

So my question is, is there any problem with using ReplacePartitions for
dynamic overwrites? I see that it only replaces partitions with the current
partition spec, but that's probably fine. Otherwise handling partition
layout evolution and dynamic inserts can be complicated and error-prone
anyway.

Apart from which interface to use, is there anything I should be aware of?
E.g. I guess we don't want to allow dynamic overwrites if the table is
partitioned by the BUCKET transform.

Thanks,
    Zoltan

Re: Dynamic INSERT OVERWRITE

Posted by Zoltán Borók-Nagy <bo...@cloudera.com.INVALID>.
Thanks for your answer, Ryan.
In the short term we'll only have INSERT INTO/OVERWRITE for Icebeg tables
in Impala (so the way to go is AppendFiles and ReplacePartitions
accordingly).
But I agree that a MERGE INTO statement would be super useful. Hopefully
we'll add support for it as well in the not too distant future.

Thanks again,
Zoltan



On Fri, Jan 29, 2021 at 8:19 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Zoltan,
>
> The warning is that dynamic overwrites in general aren't recommended.
> ReplacePartitions is the right operation to use for dynamic overwrite, we
> just want to steer users away from dynamic overwrites in general.
>
> The problem with dynamic overwrite is that its behavior depends on the
> underlying physical structure of the table. If your table is partitioned by
> days, then it will overwrite by day. If it is partitioned by hours, it will
> overwrite by hour. That means that the behavior changes when table
> partitioning changes. That's a bad thing because we want SQL operations to
> be logical operations that do not depend on the table layout.
>
> OverwriteFiles is a better option because what gets overwritten is
> explicit. If you want to overwrite a day, you pass a filter for that day.
> Another way around this problem is to support MERGE INTO, which will detect
> the files that need to be changed and correctly rewrite them, wherever they
> are in the table.
>
> rb
>
> On Fri, Jan 29, 2021 at 10:14 AM Zoltán Borók-Nagy
> <bo...@cloudera.com.invalid> wrote:
>
>> Hey everyone,
>>
>> I'm currently working on the INSERT OVERWRITE statement for Iceberg
>> tables in Impala.
>>
>> Seems like ReplacePartitions is the perfect interface for this job:
>> https://github.infra.cloudera.com/CDH/iceberg/blob/cdpd-master/api/src/main/java/org/apache/iceberg/ReplacePartitions.java
>>
>> IIUC Spark also uses this interface for dynamic overwrites. Though the
>> class comment says that this interface is not recommended to use, and use
>> OverwriteFiles instead. OverwriteFiles is more generic and more explicit,
>> but for this task it would need extra boilerplate code, therefore more
>> possibilities to do stg wrong.
>>
>> So my question is, is there any problem with using ReplacePartitions for
>> dynamic overwrites? I see that it only replaces partitions with the current
>> partition spec, but that's probably fine. Otherwise handling partition
>> layout evolution and dynamic inserts can be complicated and error-prone
>> anyway.
>>
>> Apart from which interface to use, is there anything I should be aware
>> of? E.g. I guess we don't want to allow dynamic overwrites if the table is
>> partitioned by the BUCKET transform.
>>
>> Thanks,
>>     Zoltan
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Dynamic INSERT OVERWRITE

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Zoltan,

The warning is that dynamic overwrites in general aren't recommended.
ReplacePartitions is the right operation to use for dynamic overwrite, we
just want to steer users away from dynamic overwrites in general.

The problem with dynamic overwrite is that its behavior depends on the
underlying physical structure of the table. If your table is partitioned by
days, then it will overwrite by day. If it is partitioned by hours, it will
overwrite by hour. That means that the behavior changes when table
partitioning changes. That's a bad thing because we want SQL operations to
be logical operations that do not depend on the table layout.

OverwriteFiles is a better option because what gets overwritten is
explicit. If you want to overwrite a day, you pass a filter for that day.
Another way around this problem is to support MERGE INTO, which will detect
the files that need to be changed and correctly rewrite them, wherever they
are in the table.

rb

On Fri, Jan 29, 2021 at 10:14 AM Zoltán Borók-Nagy
<bo...@cloudera.com.invalid> wrote:

> Hey everyone,
>
> I'm currently working on the INSERT OVERWRITE statement for Iceberg tables
> in Impala.
>
> Seems like ReplacePartitions is the perfect interface for this job:
> https://github.infra.cloudera.com/CDH/iceberg/blob/cdpd-master/api/src/main/java/org/apache/iceberg/ReplacePartitions.java
>
> IIUC Spark also uses this interface for dynamic overwrites. Though the
> class comment says that this interface is not recommended to use, and use
> OverwriteFiles instead. OverwriteFiles is more generic and more explicit,
> but for this task it would need extra boilerplate code, therefore more
> possibilities to do stg wrong.
>
> So my question is, is there any problem with using ReplacePartitions for
> dynamic overwrites? I see that it only replaces partitions with the current
> partition spec, but that's probably fine. Otherwise handling partition
> layout evolution and dynamic inserts can be complicated and error-prone
> anyway.
>
> Apart from which interface to use, is there anything I should be aware of?
> E.g. I guess we don't want to allow dynamic overwrites if the table is
> partitioned by the BUCKET transform.
>
> Thanks,
>     Zoltan
>
>

-- 
Ryan Blue
Software Engineer
Netflix