Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2019/06/01 14:58:02 UTC

Re: Strange exception after upgrade to 0.4.7

Maybe that alters the order of classes on the classpath? Hard to tell. We
are definitely going to look into shading Jackson and a few other
dependencies in a much better way.
0.4.7 has a Jackson version change, so that could be a difference. But
again, that version is the same as what Spark uses, so :( Let me try to
reproduce your issue as well when we rethink the deps.
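
In this context, shading means relocating the Jackson packages to a private namespace inside the bundle jar so they cannot clash with the copy Spark ships. A minimal maven-shade-plugin sketch of that idea (illustrative only: the plugin version and the relocated package prefix are assumptions, not what the Hudi build actually uses):

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <relocations>
                            <!-- move Jackson classes to a private namespace inside the bundle -->
                            <relocation>
                                <pattern>com.fasterxml.jackson</pattern>
                                <shadedPattern>com.uber.hoodie.shaded.com.fasterxml.jackson</shadedPattern>
                            </relocation>
                        </relocations>
                    </configuration>
                </execution>
            </executions>
        </plugin>

With a relocation like this in the bundle build, classpath ordering no longer determines which Jackson classes the bundled code sees.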

On Fri, May 31, 2019 at 4:54 PM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
fixed-term.Yuanbin.Cheng@us.bosch.com> wrote:

> Hi,
>
> I am using Spark 2.2.0. I found that if I declare the Hudi dependency below
> the Spark dependencies in Maven, Hudi runs correctly.
> However, if I put Hudi before the Spark dependencies, the exception always
> occurs, no matter whether I use hoodie-spark or hoodie-spark-bundle.
>
> Do you have any idea about the reason for this? It only happens in 0.4.7.
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
>
>
>
> -----Original Message-----
> From: Vinoth Chandar <vi...@apache.org>
> Sent: Friday, May 31, 2019 1:19 AM
> To: dev@hudi.apache.org
> Subject: Re: Strange exception after upgrade to 0.4.7
>
> Hi,
>
> This does sound like a jar mismatch issue tied to the Spark version. I have
> seen a similar ticket associated with Spark 2.1.x, IIRC. If you are building
> your own uber/fat jar, it is probably better to depend on the hoodie-spark
> module than on hoodie-spark-bundle, which is an uber jar itself.
>
> What version of spark are you using?
>
> Thanks
> Vinoth
>
> On Thu, May 30, 2019 at 11:24 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
> fixed-term.Yuanbin.Cheng@us.bosch.com> wrote:
>
> > Hi,
> >
> > The test case is really very simple, much like the Hudi test cases.
> > I have two dataframes and use the CopyOnWrite storage type: I first write
> > the first one with Overwrite, then write the second one with Append, both
> > operations using the format "com.uber.hoodie".
> > However, the exception occurs when I read the dataset after these two
> > write operations.
> > I use Maven to manage the dependencies; here is the relevant part of my
> > Maven dependencies:
> >
> >         <dependency>
> >             <groupId>com.uber.hoodie</groupId>
> >             <artifactId>hoodie-spark-bundle</artifactId>
> >             <version>0.4.7</version>
> >         </dependency>
> >
> > This exception only happens in 0.4.7; if I change it to 0.4.6, it works
> > very well.
> > I have run the same test against:
> > 1. the GitHub repository compiled on my laptop
> > 2. the source code of the 0.4.7 release compiled on my laptop
> > Both worked very well.
> >
> > Maybe it is because of the Maven release artifact.
> >
> > Best regards
> >
> > Yuanbin Cheng
> > CR/PJ-AI-S1
> >
> >
> >
> > -----Original Message-----
> > From: Vinoth Chandar <vi...@apache.org>
> > Sent: Wednesday, May 29, 2019 8:00 PM
> > To: dev@hudi.apache.org
> > Subject: Re: Strange exception after upgrade to 0.4.7
> >
> > Also, I am curious whether this error happens with 0.4.6 at all. Can you
> > please confirm that? It would help narrow things down.
> >
> > On Wed, May 29, 2019 at 6:25 PM vbalaji@apache.org
> > <vb...@apache.org>
> > wrote:
> >
> > >  Hi Yuanbin,
> > >
> > > Not sure if I completely understood the problem. Are you using the
> > > "com.uber.hoodie" format for reading the dataset? Are you using
> > > hoodie-spark-bundle?
> > > Judging from the Stack Overflow link
> > > https://stackoverflow.com/questions/48034825/why-does-streaming-query-fail-with-invalidschemaexception-a-group-type-can-not?noredirect=1&lq=1
> > > this could be because of the Parquet version. Assuming this is the
> > > issue, I just checked the spark-bundle and the Parquet class
> > > dependencies are all shaded. So the new version of
> > > hoodie-spark-bundle should not be a problem as such. Please make
> > > sure you are only using hoodie-spark-bundle and no other Hudi
> > > packages are on the classpath. Also, make sure Spark does not pull
> > > in an older version of Parquet.
> > > Balaji.V
> > >
> > >     On Wednesday, May 29, 2019, 4:58:37 PM PDT, FIXED-TERM Cheng
> > > Yuanbin
> > > (CR/PJ-AI-S1) <fi...@us.bosch.com> wrote:
> > >
> > >  All,
> > >
> > > After we upgraded to the new release 0.4.7, a strange exception
> > > occurred when we read the com.uber.hoodie dataset from Parquet.
> > > This exception never occurred in the previous version. I would really
> > > appreciate it if anyone could help me locate the cause of this exception.
> > > Here I attach part of the exception log.
> > >
> > > An exception or error caused a run to abort.
> > > java.lang.ExceptionInInitializerError
> > >               at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:293)
> > >               at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:285)
> > >               at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:283)
> > >               at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:303)
> > >               at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
> > >               at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
> > >               at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> > > ................
> > >
> > > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group
> > > type can not be empty. Parquet does not support empty group without leaves.
> > > Empty group: spark_schema
> > >               at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> > >               at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> > >               at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> > >               at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> > >
> > > It seems that this exception is caused by the schema of the dataframe
> > > written to the Hudi dataset. I carefully compared the dataframes in our
> > > test case; the only difference is the nullable field.
> > > All schemas in the Hudi test cases have nullable set to true,
> > > whereas some of my test cases have fields with nullable set to false.
> > > I tried converting nullable to true on every field in our dataset, but
> > > it still raises the same exception.
> > >
> > >
> > > Best regards
> > >
> > > Yuanbin Cheng
> > > CR/PJ-AI-S1
> > >
> > >
> >
>
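
As a concrete illustration of the advice above to depend on the plain hoodie-spark module rather than the bundle when building your own uber/fat jar, a minimal POM sketch (the Spark coordinates assume Scala 2.11 with the Spark 2.2.0 mentioned in the thread; the shade-plugin configuration for the fat jar itself is omitted):

        <dependency>
            <groupId>com.uber.hoodie</groupId>
            <artifactId>hoodie-spark</artifactId>
            <version>0.4.7</version>
        </dependency>
        <!-- Spark stays provided so the fat jar does not carry a second copy of Spark -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.0</version>
            <scope>provided</scope>
        </dependency>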

Re: Strange exception after upgrade to 0.4.7

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
 Hi Yuanbin,
We haven't seen the issue as such, but we are actively working on revamping our packaging strategy to keep packaging simple and give users more control.
We are in the process of cleaning our POMs. You can find the changes in this branch: https://github.com/apache/incubator-hudi/tree/hackathon-0619
We are also trying to see whether the Parquet/Jackson and other jars can be kept out of the bundle, so that the runtime versions of those packages no longer conflict with the ones in the bundle.

Balaji.V 
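
In the meantime, a downstream build can try excluding the conflicting artifacts from the bundle dependency and relying on the versions Spark already provides. A rough sketch follows; the group IDs below are assumptions about what actually conflicts, so verify with mvn dependency:tree. Note that exclusions only drop transitive dependencies and cannot strip classes already packed inside an uber jar, which is exactly why the packaging revamp matters:

        <dependency>
            <groupId>com.uber.hoodie</groupId>
            <artifactId>hoodie-spark-bundle</artifactId>
            <version>0.4.7</version>
            <exclusions>
                <!-- rely on the Jackson/Parquet versions that Spark itself ships -->
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.parquet</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>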



RE: Strange exception after upgrade to 0.4.7

Posted by "FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1)" <fi...@us.bosch.com>.
Hi,

Have you had a chance to reproduce this issue over the last few days? Do you see the same issue on your machine?
I have tried many things to locate the problem: I changed to several different Jackson versions and also excluded Jackson from Hudi.
However, the issue is still there.
Maybe this is not because of Jackson. The strange thing is that if I put the Spark dependency above Hudi in the POM, everything works fine. If I put the Hudi dependency above Spark, the exception remains, even after changing the Jackson version and all the other versions I tried.
Do you have any idea what might be causing this?
I would really appreciate any help.
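
For reference, a minimal sketch of the ordering described above that reportedly works: the Spark dependencies are declared before the Hudi bundle, so Spark's copies of Jackson and Parquet come first on the compile and test classpath (versions are placeholders taken from earlier in the thread; spark-sql_2.11 assumes Scala 2.11):

    <dependencies>
        <!-- Spark declared first, so its Jackson/Parquet classes take precedence -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <!-- Hudi bundle declared after Spark -->
        <dependency>
            <groupId>com.uber.hoodie</groupId>
            <artifactId>hoodie-spark-bundle</artifactId>
            <version>0.4.7</version>
        </dependency>
    </dependencies>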

Best regards

Yuanbin Cheng
CR/PJ-AI-S1  


