Posted to dev@iceberg.apache.org by Shone Sadler <ss...@adobe.com.INVALID> on 2019/09/22 11:49:09 UTC

Incompatible Writes due to OutOfOrder Fields

Hello everyone,

This question is related to schema evolution support in Iceberg.

We have data coming in with fields out of order with respect to the schema in Iceberg (e.g., inbound struct(a,b,c) vs. Iceberg struct(c,b,a)).

As a result we are hitting the following error in Iceberg when saving the data: "Cannot write incompatible dataset to table with schema", raised within IcebergSource -> https://github.com/apache/incubator-iceberg/blob/d1f0b540f5f14f002be86133ef9f66445f7e0926/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java#L157
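
To make the failure concrete, here is a minimal sketch of the kind of
write we are doing (assuming a Spark DataFrame append through the
Iceberg source; the table identifier and column names below are made up
for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iceberg-append").getOrCreate()
import spark.implicits._

// The Iceberg table was created with schema struct<c: string, b: long, a: int>,
// but the incoming batch arrives as struct<a: int, b: long, c: string>.
val incoming = Seq((1, 2L, "x")).toDF("a", "b", "c")

// This append fails with "Cannot write incompatible dataset to table
// with schema", because the writer's compatibility check rejects the
// reordered struct.
incoming.write
  .format("iceberg")
  .mode("append")
  .save("db.events") // made-up table identifier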

I also noted in the documentation that re-ordering is allowed -> https://iceberg.apache.org/evolution/, which led me to believe that we could update the schema prior to writing the data. However, I see no means of re-ordering fields on the current UpdateSchema API.

How are people handling out-of-order fields today?

Our data is deeply nested, so I am hoping to avoid transforming/prepping it on ingest and am looking for alternatives.
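
For a flat schema that prep would be simple enough, something like the
sketch below (the column names are made up), but with deeply nested
structs every level would have to be rebuilt the same way:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Re-select the incoming DataFrame's top-level columns in the table's order.
def alignToTableOrder(incoming: DataFrame, tableColumnOrder: Seq[String]): DataFrame =
  incoming.select(tableColumnOrder.map(col): _*)

// e.g. alignToTableOrder(incoming, Seq("c", "b", "a"))
// This only fixes top-level order; each nested struct would have to be
// rebuilt with struct(...) column by column, which gets unwieldy.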

Any thoughts appreciated!

Regards,
Shone Sadler




Re: Incompatible Writes due to OutOfOrder Fields

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Is there anything that Iceberg needs to do differently here? We've had
requests to support reordering fields with `ADD COLUMN ... AFTER other_col`
and `UPDATE COLUMN col BEFORE other_col`. Otherwise, do you think we need
to change the internal checks?

-- 
Ryan Blue
Software Engineer
Netflix

Re: Incompatible Writes due to OutOfOrder Fields

Posted by Gautam <ga...@gmail.com>.
Shone and I synced offline, but I wanted to circle back here so others can
hopefully benefit, and so that people with more experience can correct me
if there's a better way to achieve this.

*Problem*:
  The use case is that incoming data has fields out of order w.r.t. data
already ingested in Iceberg. The same scenario applies to nested columns
as well (e.g., fields in a sub-struct are out of order). Incoming data
might also have added fields. The issue is that if the data is ingested
as is, Iceberg will complain via its compatibility checks. As it should.

*Solution*:
  Iceberg doesn't depend on field names or the natural order of fields; it
uses IDs to keep track of schema fields. So to apply evolution rules
correctly, one should first go back to the underlying Iceberg schema,
apply the schema changes using Iceberg's UpdateSchema API, and commit
them to the underlying table. Once this is done, Iceberg will have
created a new version of the schema with new IDs allotted to the added
fields. It also accounts for a different field order in the incoming
data, since it keeps the ID-to-name mapping for all columns.
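
As a rough sketch of the above (the table location and the added field
are made up; the gist linked below walks through these scenarios with
sample data), the schema update before the append looks something like
this:

import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.hadoop.HadoopTables
import org.apache.iceberg.types.Types

// Load the table directly (a Hadoop-tables layout is assumed here purely
// for illustration; use whatever catalog the table actually lives in).
val tables = new HadoopTables(new Configuration())
val table = tables.load("hdfs://path/to/table") // made-up location

// Add the field(s) that appear only in the incoming data. Iceberg assigns
// fresh IDs to the new columns; existing columns keep their IDs no matter
// where they appear in the incoming files.
table.updateSchema()
  .addColumn("d", Types.StringType.get()) // made-up new field
  .commit()

// With the schema committed, the append can proceed; columns are resolved
// by ID/name rather than by position.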

Here is a gist that captures the scenarios described above with sample
data: https://gist.github.com/prodeezy/b2cc35b87fca7d43ae681d45b3d7cab3

Cheers,
-Gautam.

Re: Incompatible Writes due to OutOfOrder Fields

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Shone,

Iceberg should be able to handle out-of-order data columns in nested
structures. We probably just need to relax that compatibility check to
allow it. Can you post the error message that you're getting?

-- 
Ryan Blue
Software Engineer
Netflix