Posted to user@spark.apache.org by Yong Zhang <ja...@hotmail.com> on 2017/06/12 21:05:13 UTC

Parquet file generated by Spark, but not compatible read by Hive

We are using Spark 1.6.2 as an ETL tool to generate Parquet files for one dataset, partitioned by "brand" (a string column representing the brand in this dataset).


After the partition folders (e.g. the "brand=a" folder) are generated in HDFS, we add the partitions to Hive.
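For reference, this is roughly how the partition folders are produced and registered on our side (paths and table name below are illustrative, not the real ones):

    // Spark 1.6 (Scala): write the dataset partitioned by brand
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val df = hiveContext.read.parquet("/data/staging/dataset")          // illustrative input
    df.write.partitionBy("brand").parquet("/data/warehouse/tablename")  // creates brand=a, brand=b, ... folders

    // then, for each new folder, in Hive:
    // ALTER TABLE tablename ADD IF NOT EXISTS PARTITION (brand='a')
    //   LOCATION '/data/warehouse/tablename/brand=a';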


The Hive version is 1.2.1 (in fact, we are using HDP 2.5.0).


Now the problem is that for 2 brand partitions we cannot query the data generated by Spark, while the rest of the partitions work fine.


Below is the error from the Hive CLI and hive.log when I query one of the bad partitions, e.g. "select * from tablename where brand='BrandA' limit 3;":


Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable


Caused by: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
    at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
    at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
    at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
    at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
    at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
    at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
    ... 22 more

There is not much I can find by googling this error message, but it points to the Hive schema being different from the schema in the Parquet files.
But this is a very strange case, as the same schema works fine for the other brands, which are just values of the partition column and share the same Hive schema as above.

If I query "select * from tablename where brand='BrandB' limit 3;", everything works fine.

So is this really caused by a mismatch between the Hive schema and the Parquet files generated by Spark, by the data under the different partition keys, or by a compatibility issue between Spark and Hive?
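One check I can think of to narrow it down (paths are illustrative) is to compare the schema Spark sees in the Parquet files of a bad partition against the Hive table definition:

    // schema of the Parquet files under the bad partition, as Spark reads it
    hiveContext.read.parquet("/data/warehouse/tablename/brand=BrandA").printSchema()

    // column types as declared in the Hive metastore
    hiveContext.sql("DESCRIBE tablename").show(100, false)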

Thanks

Yong



Re: Parquet file generated by Spark, but not compatible read by Hive

Posted by Yong Zhang <ja...@hotmail.com>.
The issue is caused by the data, and is indeed a type mismatch between the Hive schema and Spark. It is fixed now.


Without that kind of data, the problem isn't triggered for some brands.
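For anyone hitting the same error: the stack trace suggests a column Hive declares as string arrived as a long (ParquetStringInspector is handed a LongWritable), and that matches what we found. A rough sketch of the kind of cast applied before writing (the column name is made up):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StringType

    // force the offending column to the type declared in the Hive schema before writing
    val fixed = df.withColumn("some_code", col("some_code").cast(StringType))
    fixed.write.partitionBy("brand").parquet("/data/warehouse/tablename")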


Thanks for taking a look at this problem.


Yong



Re: Parquet file generated by Spark, but not compatible read by Hive

Posted by ayan guha <gu...@gmail.com>.
Try setting the following parameter:

conf.set("spark.sql.hive.convertMetastoreParquet","false")
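In Spark 1.6 it can be set either on the HiveContext/SQLContext or at submit time, e.g. (just one way to do it):

    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

    // or when submitting the job:
    // spark-submit --conf spark.sql.hive.convertMetastoreParquet=false ...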


-- 
Best Regards,
Ayan Guha

Re: Parquet file generated by Spark, but not compatible read by Hive

Posted by Angel Francisco Orta <an...@gmail.com>.
Hello,

Do you use df.write, or do you do it with hivecontext.sql("insert into ...")?

Angel.
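(For reference, the two write paths being contrasted look roughly like this in Spark 1.6; the table and path names are illustrative:)

    // 1) write Parquet files directly, then add the partitions in Hive afterwards
    df.write.partitionBy("brand").parquet("/data/warehouse/tablename")

    // 2) insert through the Hive metastore from a registered temp table
    //    (dynamic partition settings may need to be enabled on the Hive side)
    df.registerTempTable("staging")
    hiveContext.sql("INSERT INTO TABLE tablename PARTITION (brand) SELECT * FROM staging")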
