Posted to issues@spark.apache.org by "Ramakrishna Prasad K S (Jira)" <ji...@apache.org> on 2020/08/07 04:08:00 UTC

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (the workaround of using spark.sql.orc.impl=hive is also not working)

    [ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172805#comment-17172805 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/7/20, 4:07 AM:
-------------------------------------------------------------------------

[~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the priority and target version; apologies for that.

Can someone help me with this issue? It is critical, and the workaround mentioned in [https://spark.apache.org/docs/latest/sql-migration-guide.html] does not work either.



> ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (the workaround of using spark.sql.orc.impl=hive is also not working)
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32558
>                 URL: https://issues.apache.org/jira/browse/SPARK-32558
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Environment: Spark 3.0 and a Hadoop cluster running Hive 2.1.1 (Red Hat Linux)
>            Reporter: Ramakrishna Prasad K S
>            Priority: Major
>
> Steps to reproduce the issue:
> ------------------------------- 
> Download Spark 3.0 from [https://spark.apache.org/downloads.html].
>  
> Step 1) Create an ORC file from the Spark shell, using the default Spark 3.0 native ORC implementation.
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
> res1: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> res2: org.apache.spark.sql.DataFrame = []
>
> scala> val dFrame = spark.sql("select * from df_table")
> dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>
> scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
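> For what it's worth, the file can first be read back with Spark's own reader to confirm it is not simply corrupt before copying it to the cluster. A quick sanity check, using the same path as the save() call above:
> {code}
> // Read the just-written ORC directory back through Spark's native reader;
> // the single row inserted above should come back.
> scala> spark.read.orc("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table").show()
> {code}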
>  
> Step 2) Copy the ORC files created in Step 1 to HDFS /tmp on a Hadoop cluster that has Hive 2.1.1 (for example CDH 6.x), and run the following command to read the metadata of the ORC file. As shown below, it fails to fetch the metadata.
> {code}
> [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
> Processing data file /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
>         at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
>         at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
>         at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
>         at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
>         at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
>         at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
>         at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
>         at org.apache.orc.tools.FileDump.main(FileDump.java:154)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
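> (For context: the trace points at org.apache.orc.OrcFile$WriterVersion.from, which resolves the writer-version id stored in the file's postscript through a fixed lookup table, so an id newer than anything the reader knows overruns that table. A minimal illustrative sketch of that failure mode follows; this is my own simplification, not the actual ORC source, and the list of known versions is assumed.)
> {code}
> // Illustrative sketch only -- NOT the actual ORC source. Older readers keep a
> // fixed table of the writer versions they know about; indexing it with a newer
> // id (here 7, the id found in the file's postscript per the trace above)
> // throws ArrayIndexOutOfBoundsException.
> object WriterVersionLookup {
>   // Hypothetical set of ids known to the ORC reader bundled with Hive 2.1.1.
>   private val known = Array("ORIGINAL", "HIVE_8732", "HIVE_4243",
>                             "HIVE_12055", "HIVE_13083", "ORC_101")
>   def from(id: Int): String = known(id)
> }
>
> // WriterVersionLookup.from(7)  // throws ArrayIndexOutOfBoundsException
> {code}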
> Step 3) Now create an ORC file using the Hive ORC implementation, as suggested in the Spark migration guide ([https://spark.apache.org/docs/latest/sql-migration-guide.html]), by setting spark.sql.orc.impl to hive.
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
>
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
>
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> res7: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')")
> res8: org.apache.spark.sql.DataFrame = []
>
> scala> val dFrame2 = spark.sql("select * from df_table2")
> dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>
> scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
> {code}
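> (Besides the SET statement above, the same option can also be applied through the runtime conf API or at shell launch; a sketch using standard Spark APIs, in case the way the option is set matters for the repro:)
> {code}
> // Equivalent ways of applying the suggested workaround:
> scala> spark.conf.set("spark.sql.orc.impl", "hive")
>
> // ...or pass it when launching the shell:
> // ./spark-shell --conf spark.sql.orc.impl=hive
> {code}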
> Step 4) Copy the ORC files created in Step 3 to HDFS /tmp on the same Hadoop cluster and run the same command to read the metadata. As shown below, it fails with the same exception, even after following the workaround suggested by Spark of setting spark.sql.orc.impl to hive.
> {code}
> [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
> Processing data file /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
>         at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
>         at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
>         at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
>         at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
>         at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
>         at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
>         at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
>         at org.apache.orc.tools.FileDump.main(FileDump.java:154)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> *Note: The same metadata fetch works fine with Hive 2.3 or later.*
> *So the main concern here is that setting {{spark.sql.orc.impl}} to hive does not produce ORC files that work with Hive 2.1.1 or below.*
>  Can someone help here? Is there any other workaround available? Can this be looked into with priority? Thank you.
>  
> References:
>  [https://spark.apache.org/docs/latest/sql-migration-guide.html] (the workaround of setting spark.sql.orc.impl=hive is mentioned here, but it does not work; a sketch of setting these options programmatically follows these references):
> {quote}
> Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, spark.sql.orc.impl and spark.sql.orc.filterPushdown change their default values to native and true respectively. ORC files created by native ORC writer cannot be read by some old Apache Hive releases. Use spark.sql.orc.impl=hive to create the files shared with Hive 2.1.1 and older.
> {quote}
> https://issues.apache.org/jira/browse/SPARK-26932
> https://issues.apache.org/jira/browse/HIVE-16683
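> For completeness, a sketch of setting both options from the quoted paragraph when building the session programmatically (standard SparkSession API; the app name is arbitrary):
> {code}
> import org.apache.spark.sql.SparkSession
>
> // Apply the migration-guide workaround at session construction time: force the
> // Hive ORC writer and, optionally, turn filter pushdown back off, i.e. the
> // pre-2.4 defaults mentioned in the quote above.
> val spark = SparkSession.builder()
>   .appName("orc-hive-compat")                      // arbitrary name
>   .config("spark.sql.orc.impl", "hive")
>   .config("spark.sql.orc.filterPushdown", "false")
>   .getOrCreate()
> {code}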


