Posted to dev@hudi.apache.org by Raghvendra Dubey <ra...@delhivery.com.INVALID> on 2020/03/15 11:38:08 UTC

Schema Reference in HudiDeltaStreamer

Hi Team,

I am reading Parquet data with HoodieDeltaStreamer and writing it into a Hudi dataset:
S3 > EMR (Hudi DeltaStreamer) > S3 (Hudi dataset)

I referenced an Avro schema as the target schema through the parameter
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc

The DeltaStreamer command looks like this:

spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --master yarn --deploy-mode client \
  ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field action_date \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://emr-spark-scripts/hudi_spark_test \
  --target-table hudi_spark_test \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/ \
  --continuous

But I am getting a schema error:
org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'cop_amt' not found
        at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
        at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
        at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
        at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
        at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
        at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)

I have included the field from the error in the schema but am still getting this issue.
Could you please help me understand how to reference the schema file correctly?

Thanks
Raghvendra



Re: Schema Reference in HudiDeltaStreamer

Posted by Raghvendra Dhar Dubey <ra...@delhivery.com.INVALID>.
Thanks Pratyaksh for the help.


Re: Schema Reference in HudiDeltaStreamer

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Hi Raghvendra,

As per the code flow of the Parquet reader, I do not see any reason why this
exception should be thrown if your target schema actually contains the field
in question. I would suggest printing the target schema just before the
ParquetReader flow starts in the HoodieCopyOnWriteTable class, i.e. print
writerSchema in HoodieMergeHandle and cross-check whether the field is
actually being passed to ParquetReader.
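
If it helps, here is a standalone sketch of the same cross-check done outside
the writer path: parse the target schema file with plain Avro and verify that
the field the reader complains about is actually there. This is illustrative
only; the class name and the local-path argument are made up for the example.

    // CheckTargetSchema.java - hypothetical helper, not part of Hudi.
    // Parses a local copy of the schema.avsc referenced by
    // hoodie.deltastreamer.schemaprovider.target.schema.file
    // and checks whether the failing field resolves.
    import java.io.File;
    import java.io.IOException;
    import org.apache.avro.Schema;

    public class CheckTargetSchema {
      public static void main(String[] args) throws IOException {
        Schema target = new Schema.Parser().parse(new File(args[0]));
        Schema.Field field = target.getField("cop_amt");
        System.out.println("cop_amt present: " + (field != null));
        if (field != null) {
          System.out.println("cop_amt type: " + field.schema());
          // defaultVal() is the Avro 1.8+ accessor for the field default.
          System.out.println("cop_amt default: " + field.defaultVal());
        }
      }
    }

Run it against a local copy of the same schema.avsc you point DeltaStreamer
at, with Avro and its dependencies on the classpath.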


Re: Schema Reference in HudiDeltaStreamer

Posted by Raghvendra Dhar Dubey <ra...@delhivery.com.INVALID>.
It is nullable
like {"name":"_id","type":["null","string"],"default":null}


Re: Schema Reference in HudiDeltaStreamer

Posted by Pratyaksh Sharma <pr...@gmail.com>.
How have you declared the field in your schema file? Is it nullable, or does
it have a default value?
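
For example, the two variants would look like this in the avsc file (the
double type here is just an assumption for illustration):

    {"name": "cop_amt", "type": ["null", "double"], "default": null}
    {"name": "cop_amt", "type": "double", "default": 0.0}

The first is a nullable union whose default is null; the second is a non-null
field with a concrete default value.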


Re: Schema Reference in HudiDeltaStreamer

Posted by Raghvendra Dhar Dubey <ra...@delhivery.com.INVALID>.
Thanks Pratyaksh,

I got your point, but as in the example, I used the Avro schema file on S3 to
reference the full merged schema, and it is not working.
I didn't try HiveSyncTool for this. Is there any option to use AWS Glue instead?



Re: Schema Reference in HudiDeltaStreamer

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Hi Raghvendra,

As mentioned in the FAQ, this error occurs when your schema has evolved by
deleting some field, in your case 'cop_amt'. Even if your current target
schema has this field, the problem occurs because some incoming record does
not have it. To fix this, you have the following options:

1. Make sure no fields ever get deleted.
2. Otherwise, give this field a default value and send all your records with
that default value.
3. Try creating an uber schema, as sketched below.

By uber schema I mean a schema that contains every field that was ever part
of your incoming records. If you are using HiveSyncTool along with
DeltaStreamer, then the Hive metastore can be a good source of truth for all
the fields ever ingested. Please let me know if this makes sense.
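
For illustration, an uber schema file might look along these lines (only
cop_amt, wbn, ad, action_date and _id come from this thread; the record name
and the field types are assumptions):

    {
      "type": "record",
      "name": "hudi_spark_test_record",
      "fields": [
        {"name": "_id", "type": ["null", "string"], "default": null},
        {"name": "wbn", "type": "string"},
        {"name": "ad", "type": "string"},
        {"name": "action_date", "type": ["null", "string"], "default": null},
        {"name": "cop_amt", "type": ["null", "double"], "default": null}
      ]
    }

Every field that ever appeared in an incoming record gets an entry, and fields
that may be absent are declared as nullable unions with "default": null so
that older records still resolve against the schema.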


Re: Schema Reference in HudiDeltaStreamer

Posted by Raghvendra Dhar Dubey <ra...@delhivery.com.INVALID>.
Thanks Pratyaksh,
But I am assigning the target schema here as

hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc

But it doesn't help. The troubleshooting guide asks me to build an uber schema
and use it as the target schema, but I am not sure what an uber schema is.
Could you please help me with this?

Thanks
Raghvendra


Re: Schema Reference in HudiDeltaStreamer

Posted by Pratyaksh Sharma <pr...@gmail.com>.
This might help: Caused by: org.apache.parquet.io.InvalidRecordException:
Parquet/Avro schema mismatch: Avro field 'col1' not found
<https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound>

Please let us know in case of any more queries.
