Posted to dev@hudi.apache.org by Pratyaksh Sharma <pr...@gmail.com> on 2020/01/02 06:26:09 UTC

Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Hi Vinoth,

As you explained above and as per what is mentioned in this FAQ (
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-What'sHudi'sschemaevolutionstory),
Hudi is able to support schema evolution only if the schema is *backwards
compatible*. What about the case when it is backwards incompatible? This
might be the case when, for some reason, you are unable to enforce things
like not deleting fields or not changing the order. Ideally we should be
foolproof and able to support schema evolution in every case possible. In
such a case, creating an uber schema can be useful. WDYT?
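
For what it's worth, here is a minimal sketch of how such an uber schema
could be assembled from the registry schema and the Hive-derived schema
(assuming Avro 1.8+; the class/method names below are illustrative only and
not our actual implementation):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.avro.Schema;

public class UberSchemaSketch {
  // Union of the fields of both schemas, keyed by field name; the registry
  // schema wins when a field exists in both.
  public static Schema mergeSchemas(Schema registrySchema, Schema hiveSchema) {
    List<Schema.Field> merged = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (Schema s : Arrays.asList(registrySchema, hiveSchema)) {
      for (Schema.Field f : s.getFields()) {
        if (seen.add(f.name())) {
          // Field objects cannot be reused across schemas, so copy them.
          merged.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
        }
      }
    }
    return Schema.createRecord(registrySchema.getName(), registrySchema.getDoc(),
        registrySchema.getNamespace(), false, merged);
  }
}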

On Wed, Jan 1, 2020 at 12:49 AM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Syed,
>
> Typically, I have seen the Confluent/avro schema registry used as the
> source of truth and the Hive schema is just a translation. That's how the
> hudi-hive sync also works..
> Have you considered making fields optional in the avro schema so that even
> if the source data does not have a few of them, there will be nulls..
> In general, the two places where I have dealt with this both made it work
> using the schema evolution rules avro supports.. and enforcing things like
> not deleting fields, not changing order etc.
>
> Hope that at least helps a bit
>
> thanks
> vinoth
>
> On Sun, Dec 29, 2019 at 11:55 PM Syed Abdul Kather <in...@gmail.com>
> wrote:
>
> > Hi Team,
> >
> > We pull data from Kafka generated by Debezium. The schema is maintained
> > in the schema registry by the Confluent framework during the population
> > of data.
> >
> > *Problem Statement Here: *
> >
> > All the additions/deletions of columns are maintained in the schema
> > registry. While running the Hudi pipeline, we have a custom schema
> > registry that pulls the latest schema from the schema registry as well
> > as from the Hive metastore, and we create an uber schema (so that
> > columns missing from the schema registry will be pulled from the Hive
> > metastore). But is there any better approach to solve this problem?
> >
> >
> >
> >
> >             Thanks and Regards,
> >         S SYED ABDUL KATHER
> >
>

Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

I was talking at the avro level.
https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
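
For reference, a minimal sketch of comparing schemas via that canonical
form with Avro's SchemaNormalization (the wrapper class and method names
below are just for illustration):

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class CanonicalFormCheck {
  // Two schemas that differ only in attributes the spec ignores (doc,
  // aliases, default values, whitespace) reduce to the same canonical form.
  public static boolean sameCanonicalForm(Schema a, Schema b) {
    return SchemaNormalization.toParsingForm(a)
        .equals(SchemaNormalization.toParsingForm(b));
  }

  // 64-bit Rabin fingerprint of the parsing canonical form, per the spec.
  public static long fingerprint(Schema s) {
    return SchemaNormalization.parsingFingerprint64(s);
  }
}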

Nonetheless, this deserves more holistic thinking. So look forward to the
RFC.

Thanks
Vinoth

On Fri, Feb 7, 2020 at 1:24 AM Pratyaksh Sharma <pr...@gmail.com>
wrote:


Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Pratyaksh Sharma <pr...@gmail.com>.
@Nishith

>> Hudi relies on Avro schema evolution rules which help prevent
>> breaking existing queries on such tables

I want to understand this statement from the code's perspective. As far as
I know, in the HoodieAvroUtils class we are trying to validate the
rewritten record against the Avro schema as given below -

private static GenericRecord rewrite(GenericRecord record, Schema
schemaWithFields, Schema newSchema) {
  GenericRecord newRecord = new GenericData.Record(newSchema);
  for (Schema.Field f : schemaWithFields.getFields()) {
    newRecord.put(f.name(), record.get(f.name()));
  }
  if (!GenericData.get().validate(newSchema, newRecord)) {
    throw new SchemaCompatabilityException(
        "Unable to validate the rewritten record " + record + " against schema " + newSchema);
  }
  return newRecord;
}

So I am trying to understand: is there any place where we are actually
checking the compatibility of the writer's and reader's schema in our code? The
above function simply validates the data types of the field values and checks
if they are non-null. Also, can someone explain the reason behind doing the
above validation? The record coming here gets created with the original target
schema, and newSchema simply adds the hoodie metadata fields. So I feel this
check is redundant.
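
As a side note, Avro itself ships a reader/writer compatibility check that
could be used for such a validation; a minimal sketch assuming plain Avro
1.7.7+ (the wrapper below is illustrative, not existing Hudi code):

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class CompatibilityCheck {
  // Ask Avro whether data written with writerSchema can be read with
  // readerSchema, i.e. whether this particular evolution is safe for readers.
  public static boolean canRead(Schema readerSchema, Schema writerSchema) {
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema);
    return result.getType() == SchemaCompatibilityType.COMPATIBLE;
  }
}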

On Fri, Feb 7, 2020 at 2:08 PM Pratyaksh Sharma <pr...@gmail.com>
wrote:


Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Pratyaksh Sharma <pr...@gmail.com>.
@Vinoth Chandar <vi...@apache.org> How does re-ordering cause problems here,
like you mentioned? Parquet files access fields by name rather than by index
by default, so re-ordering should not matter. Please help me understand.

On Fri, Feb 7, 2020 at 11:53 AM Vinoth Chandar <vi...@apache.org> wrote:


Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Vinoth Chandar <vi...@apache.org>.
@Pratyaksh Sharma <pr...@gmail.com> Please go ahead :)

@Benoit, you are right about Parquet deletion, I think.

Come to think of it, with an initial schema in place, how would we even
drop a field? All of the old data would need to be rewritten (prohibitively
expensive). So all we would end up doing is simply masking the field from
queries by mapping old data to the current schema? This can get messy
pretty quickly if field re-ordering is allowed, for example. What we do/advise
now is to instead embrace a more brittle schema management on the
write side (no renames, no dropping fields, all fields are nullable) and
ensure the reader schema is simpler to manage.. There is probably a
middle ground here somewhere.
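
A rough sketch of that masking idea, purely for illustration (projecting an
old record onto the current schema by field name, so dropped fields simply
disappear and re-ordered fields are still resolved; this is not existing
Hudi code):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class ProjectToCurrentSchema {
  public static GenericRecord project(GenericRecord oldRecord, Schema currentSchema) {
    GenericRecord projected = new GenericData.Record(currentSchema);
    for (Schema.Field f : currentSchema.getFields()) {
      if (oldRecord.getSchema().getField(f.name()) != null) {
        // Field still exists in the old data: copy it across by name.
        projected.put(f.name(), oldRecord.get(f.name()));
      } else {
        // Field was added after this record was written: fall back to its
        // declared default (assumes the field declares one).
        projected.put(f.name(), GenericData.get().getDefaultValue(f));
      }
    }
    return projected;
  }
}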



On Thu, Feb 6, 2020 at 12:10 PM Pratyaksh Sharma <pr...@gmail.com>
wrote:


Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Pratyaksh Sharma <pr...@gmail.com>.
@Vinoth Chandar <vi...@apache.org> I would like to drive this.

On Fri, Feb 7, 2020 at 1:08 AM Benoit Rousseau <b....@brci.fr> wrote:


Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Benoit Rousseau <b....@brci.fr>.
Hi,

I think deleting a field is supported with Avro, both backward and forward, as long as the field is optional and provides a default value.

A simple example of an Avro optional field defined using a union type and a default value:
{ "name": "foo", "type": ["null", "string"], "default": null }
Readers will use the default value when the field is not present.
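
For illustration, a minimal round trip showing that behaviour with the
plain Avro Java API (the schema strings and values are made up for the
example):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class DefaultValueResolution {
  public static void main(String[] args) throws Exception {
    // Old (writer) schema without "foo"; new (reader) schema adds "foo" as an
    // optional field with a null default.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"foo\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    GenericRecord rec = new GenericData.Record(writer);
    rec.put("id", 42L);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(decoded); // {"id": 42, "foo": null} -- default filled in
  }
}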

I believe the problem here is Parquet, which does not support field deletion.
One option is to set the Parquet field value to null; Parquet will use RLE encoding to encode the all-null values in the "deleted" field efficiently.

Regards,
Benoit

> On 6 Feb 2020, at 17:57, Nishith <n3...@gmail.com> wrote:

Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Vinoth Chandar <vi...@apache.org>.
This is a good topic for an RFC.. Like the one Nishith wrote for indexing,
something to gather ideas and share a broader perspective on..

Anyone interested in driving this?

On Thu, Feb 6, 2020 at 8:57 AM Nishith <n3...@gmail.com> wrote:


Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Nishith <n3...@gmail.com>.
Pratyaksh,

Deleting fields isn’t Avro schema backwards compatible. Hudi relies on Avro schema evolution rules which help prevent breaking existing queries on such tables - say someone was querying the field that is now deleted.
You can read more here -> https://avro.apache.org/docs/1.8.2/spec.html
That being said, I’m also looking at how we can support schema evolution slightly differently - some things could be more in our control and not break reader queries - but that’s not in the near future.

Thanks

Sent from my iPhone

> On Feb 5, 2020, at 11:22 PM, Pratyaksh Sharma <pr...@gmail.com> wrote:

Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Hi Vinoth,

We do not have any standard documentation for the said approach as it was
self thought through. Just logging a conversation from the #general channel
here for the record -

"Hello people, I'm doing a POC to use HUDI in our data pipeline, but I got
an error and I didnt find any solution for this... I wrote some parquet
files with HUDI using INSERT_OPERATION_OPT_VAL, MOR_STORAGE_TYPE_OPT_VAL
and sync with hive and worked perfectly. But after that, I try to wrote
another file in the same table (with some schema changes, just delete and
add some columns) and got this error Caused by:
org.apache.parquet.io.InvalidRecordException:
Parquet/Avro schema mismatch: Avro field 'field' not found. Anyone know
what to do?"

On Sun, Jan 5, 2020 at 2:00 AM Vinoth Chandar <vi...@apache.org> wrote:


Re: Regards to Uber Schema Registry ( Hive Schema + Schema Registry )

Posted by Vinoth Chandar <vi...@apache.org>.
In my experience, you need to follow some rules when evolving the schema and
keep the data backwards compatible. Or the only other option is to rewrite
the entire dataset :), which is very expensive.

If you have some pointers to learn more about any approach you are
suggesting, happy to read up.

On Wed, Jan 1, 2020 at 10:26 PM Pratyaksh Sharma <pr...@gmail.com>
wrote:
