Posted to dev@hudi.apache.org by Pratyaksh Sharma <pr...@gmail.com> on 2020/02/24 12:21:21 UTC

[DISCUSS] Consider defaultValue of field when writing to Hudi dataset

Hi,

Currently we recommend users to evolve schemas in a backwards compatible way.
When doing so, one of the most important steps is to define a default value
for newly added columns, so that records published with the previous schema
can still be consumed properly.

However, just before actually writing a record to a Hudi dataset, we rewrite
the record with a new Avro schema that includes the Hudi metadata columns [1].
This function only fetches values from the record, without considering the
fields' default values. As a result, schema validation fails. In essence,
this means we are not respecting backwards compatible schema changes.
IMO, this piece of code should fall back to a field's default value in case
the field's actual value is null.
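To illustrate the proposed behaviour, here is a minimal sketch in plain Java,
with Maps standing in for Avro records and a simple Field record standing in
for schema fields. The names (rewrite, Field) are illustrative, not Hudi's or
Avro's actual API: when rewriting a record against the new writer schema, fall
back to the field's declared default instead of emitting null.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RewriteWithDefaults {

    // A field of the target schema: name plus its (possibly null) default.
    record Field(String name, Object defaultValue) {}

    // Rewrite an incoming record against the new schema: prefer the record's
    // own value; if it is absent, use the field's declared default. This is
    // the fallback step the discussion argues the rewrite code should have.
    static Map<String, Object> rewrite(Map<String, Object> rec, List<Field> newSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Field f : newSchema) {
            Object value = rec.get(f.name());
            if (value == null) {
                value = f.defaultValue();
            }
            out.put(f.name(), value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> schema = List.of(
            new Field("id", null),
            new Field("country", "unknown")); // column added later, with a default
        Map<String, Object> oldRecord = Map.of("id", 1); // written before "country" existed
        System.out.println(rewrite(oldRecord, schema));
        // prints {id=1, country=unknown}
    }
}
```

With the fallback, a record produced under the old schema passes validation
against the new one; without it, "country" stays null and validation fails.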

Open to hearing others' thoughts.

[1]
https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L205
.

Re: [DISCUSS] Consider defaultValue of field when writing to Hudi dataset

Posted by Pratyaksh Sharma <pr...@gmail.com>.
https://issues.apache.org/jira/browse/HUDI-727 tracks this.


Re: [DISCUSS] Consider defaultValue of field when writing to Hudi dataset

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Hi Vinoth,

> in avro you define it as an optional field (union of type and null)..
Yes, that is correct. But imagine someone does not want to populate null, and
instead wants to populate a default value for the field, which is a very
common case.
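The distinction shows up in the Avro schema itself: whether a field is
optional (a union with null) is separate from what its default is. A newly
added field can be both optional and carry a non-null default (field name and
values below are illustrative):

```json
{"name": "country", "type": ["string", "null"], "default": "unknown"}
```

In Avro, the default must match the first branch of the union, so
["null", "string"] with "default": null makes old records decode as null,
while ["string", "null"] with "default": "unknown" makes them decode as
"unknown".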

> seems like it's being copied over?
When creating the writer schema, we copy the Avro fields properly. However,
at the time of actually populating values per the new writer schema, we do
not take default values into account, as I pointed out in my previous mail.

> Nonetheless, we should copy over the default values, if that code is not
doing so.
Yeah, I will raise a JIRA for this.



Re: [DISCUSS] Consider defaultValue of field when writing to Hudi dataset

Posted by Vinoth Chandar <vi...@apache.org>.
IIUC, the link between backwards compatibility and default values for fields
in a schema is that, in Avro, you define the new field as an optional field
(a union of the type and null). Not sure it has anything to do with default
values.

Nonetheless, we should copy over the default values, if that code is not
doing so.
https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L124
seems like it's being copied over?
