Posted to dev@iceberg.apache.org by Saisai Shao <sa...@gmail.com> on 2019/11/22 07:33:22 UTC

Query about the semantics of "overwrite" in Iceberg

Hi Team,

I found that Iceberg's "overwrite" is different from Spark's built-in
sources like Parquet. The "overwrite" semantics of Iceberg look more like
"upsert": partitions that the new data does not touch are not deleted.

I would like to know the purpose of this design choice. Also, if I
want Spark Parquet's "overwrite" semantics, how would I achieve that?

Warning

*Spark does not define the behavior of DataFrame overwrite*. Like most
sources, Iceberg will dynamically overwrite partitions when the dataframe
contains rows in a partition. Unpartitioned tables are completely
overwritten.
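
To illustrate, a minimal sketch of that dynamic behavior with the Spark
2.4 DataFrame API; the table name "db.events" and the "day" partition
column are made up for this example:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("overwrite-demo").getOrCreate()
    import spark.implicits._

    // Suppose db.events is partitioned by "day" and already holds
    // day=2019-11-20 and day=2019-11-21.
    val updates = Seq(("2019-11-21", "b", 2L)).toDF("day", "id", "count")

    // With Iceberg, mode("overwrite") replaces only the partitions that
    // receive rows: day=2019-11-21 is rewritten, day=2019-11-20 stays.
    updates.write
      .format("iceberg")
      .mode("overwrite")
      .save("db.events")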

Best regards,
Saisai

Re: Query about the semantics of "overwrite" in Iceberg

Posted by Saisai Shao <sa...@gmail.com>.
Thanks Ryan for your response, let me spend more time on Spark 3.0's
overwrite behavior.

Best Regards
Saisai

Ryan Blue <rb...@netflix.com.invalid> wrote on Sat, Nov 23, 2019, 1:08 AM:

> Saisai,
>
> Iceberg's behavior matches Hive's and Spark's behavior when using dynamic
> overwrite mode.
>
> Spark does not specify the correct behavior -- it varies by source. In
> addition, it isn't possible for a v2 source in 2.4 to implement the static
> overwrite mode that is Spark's default. The problem is that the source is
> not passed the static partition values, only rows.
>
> This is fixed in 3.0 because Spark will choose its behavior and correctly
> configure the source with a dynamic overwrite or an overwrite using an
> expression.
>
> On Thu, Nov 21, 2019 at 11:33 PM Saisai Shao <sa...@gmail.com>
> wrote:
>
>> Hi Team,
>>
>> I found that Iceberg's "overwrite" is different from Spark's built-in
>> sources like Parquet. The "overwrite" semantics of Iceberg look more like
>> "upsert": partitions that the new data does not touch are not deleted.
>>
>> I would like to know the purpose of this design choice. Also, if I
>> want Spark Parquet's "overwrite" semantics, how would I achieve that?
>>
>> Warning
>>
>> *Spark does not define the behavior of DataFrame overwrite*. Like most
>> sources, Iceberg will dynamically overwrite partitions when the dataframe
>> contains rows in a partition. Unpartitioned tables are completely
>> overwritten.
>>
>> Best regards,
>> Saisai
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Query about the semantics of "overwrite" in Iceberg

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Saisai,

Iceberg's behavior matches Hive's and Spark's behavior when using dynamic
overwrite mode.

Spark does not specify the correct behavior -- it varies by source. In
addition, it isn't possible for a v2 source in 2.4 to implement the static
overwrite mode that is Spark's default. The problem is that the source is
not passed the static partition values, only rows.
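
To make the two modes concrete, a hedged sketch using Spark's session
configuration and SQL (the table names are hypothetical):

    // Static mode, Spark's default: the overwrite first drops everything
    // matched by the static part of the PARTITION clause -- with no
    // static values given, the whole table -- then writes the new rows.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
    spark.sql("INSERT OVERWRITE TABLE db.events SELECT * FROM staged_events")

    // Dynamic mode: only the partitions that receive rows from the query
    // are replaced; every other partition is left untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    spark.sql("INSERT OVERWRITE TABLE db.events SELECT * FROM staged_events")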

This is fixed in 3.0 because Spark will choose its behavior and correctly
configure the source with a dynamic overwrite or an overwrite using an
expression.
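
For reference, this is roughly how Spark 3.0 surfaces the choice through
the new DataFrameWriterV2 API (names are illustrative, not a definitive
recipe):

    import org.apache.spark.sql.functions.col

    // Dynamic overwrite: replace exactly the partitions covered by rows
    // in the dataframe.
    updates.writeTo("catalog.db.events").overwritePartitions()

    // Expression overwrite: replace the rows matching a filter, which is
    // how Spark can convey a static-style overwrite to the source.
    updates.writeTo("catalog.db.events").overwrite(col("day") === "2019-11-21")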

On Thu, Nov 21, 2019 at 11:33 PM Saisai Shao <sa...@gmail.com> wrote:

> Hi Team,
>
> I found that Iceberg's "overwrite" is different from Spark's built-in
> sources like Parquet. The "overwrite" semantics of Iceberg look more like
> "upsert": partitions that the new data does not touch are not deleted.
>
> I would like to know the purpose of this design choice. Also, if I
> want Spark Parquet's "overwrite" semantics, how would I achieve that?
>
> Warning
>
> *Spark does not define the behavior of DataFrame overwrite*. Like most
> sources, Iceberg will dynamically overwrite partitions when the dataframe
> contains rows in a partition. Unpartitioned tables are completely
> overwritten.
>
> Best regards,
> Saisai
>


-- 
Ryan Blue
Software Engineer
Netflix