Posted to issues@spark.apache.org by "Ramakrishna Prasad K S (Jira)" <ji...@apache.org> on 2020/08/06 08:24:00 UTC

[jira] [Created] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

Ramakrishna Prasad K S created SPARK-32558:
----------------------------------------------

             Summary: ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
                 Key: SPARK-32558
                 URL: https://issues.apache.org/jira/browse/SPARK-32558
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
         Environment: Spark 3.0 on Linux and Hadoop cluster having Hive_2.1.1 version.
            Reporter: Ramakrishna Prasad K S
             Fix For: 3.0.0


Steps to reproduce the issue:

Download Spark 3.0 on Linux: https://spark.apache.org/downloads.html


----------------------------------------------------------------------------------------------------
Step 1) Create an ORC file using the default Spark 3.0 native ORC writer from the spark-shell
----------------------------------------------------------------------------------------------------
Launch the Spark shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell
Welcome to Spark version 3.0.0
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table values('col1val1','col2val1')")
20/08/04 22:40:18 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res2: org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
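For convenience, here is the whole of Step 1 as one block that can be pasted into the spark-shell. It is only a recap of the commands above (same table name and output path), assuming the default native ORC writer is still in effect:

// Spark 3.0.0 spark-shell, default spark.sql.orc.impl=native (recap of Step 1)
spark.sql("set spark.sql.orc.impl").show()                         // expect: native
spark.sql("CREATE TABLE df_table(col1 STRING, col2 STRING)")
spark.sql("INSERT INTO df_table VALUES ('col1val1', 'col2val1')")
val dFrame = spark.sql("SELECT * FROM df_table")
dFrame.show()
// Write the result as ORC with the native writer (path from the steps above)
dFrame.write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")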

----------------------------------------------------------------------------------------------------
Step 2) Copy the ORC files created in Step 1 to HDFS /tmp on a Hadoop cluster that has Hive 2.1.1 (for example CDH 6.x) and run the following command to read the metadata of the ORC file. As shown below, it fails to fetch the metadata.
----------------------------------------------------------------------------------------------------
[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
Processing data file /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
        at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
        at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
        at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
        at org.apache.orc.tools.FileDump.main(FileDump.java:154)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

(The same failure occurs even after overriding spark.sql.orc.impl to hive; see Step 4.)
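For context (my reading, not part of the original report): the trace points at org.apache.orc.OrcFile$WriterVersion.from. The Scala sketch below is only a toy model of that kind of lookup, not the actual ORC or Hive source. If a reader resolves the writer-version id by indexing into a fixed table of versions it knows about, an id introduced by a newer writer (7 here) fails with exactly this ArrayIndexOutOfBoundsException; presumably the ORC reader available to Hive 2.1.1 only knows about older ids. The version names and list length below are illustrative assumptions.

// Toy model only (illustrative, not the ORC/Hive source): a reader that knows
// a fixed set of writer versions and resolves an id by array index.
object WriterVersionToy {
  // Assumption for illustration: this old reader only knows five writer versions.
  private val knownVersions =
    Array("ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055", "HIVE_13083")

  // Index-based lookup: fine for known ids, AIOOBE for ids from newer writers.
  def from(id: Int): String = knownVersions(id)

  def main(args: Array[String]): Unit = {
    println(from(4)) // a version this reader knows about
    println(from(7)) // throws java.lang.ArrayIndexOutOfBoundsException: 7
  }
}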

----------------------------------------------------------------------------------------------------
Step 3) Now create an ORC file using the Hive ORC implementation, as suggested by Spark in https://spark.apache.org/docs/latest/sql-migration-guide.html, by setting spark.sql.orc.impl to hive.
----------------------------------------------------------------------------------------------------
scala> spark.sql("set spark.sql.orc.impl=hive")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
20/08/04 22:43:26 WARN HiveMetaStore: Location: file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2 specified for non-external table:df_table2
res5: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')")
res8: org.apache.spark.sql.DataFrame = []

scala> val dFrame2 = spark.sql("select * from df_table2")
dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
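If the override needs to live in a standalone application rather than the interactive shell, the same configuration key can be set when the session is built. A minimal sketch under that assumption; the app name and output path here are illustrative, not from the report:

import org.apache.spark.sql.SparkSession

// Minimal sketch: apply the documented workaround (spark.sql.orc.impl=hive)
// when constructing the SparkSession instead of via SET in the shell.
val spark = SparkSession.builder()
  .appName("orc-hive-impl-example")      // illustrative name
  .config("spark.sql.orc.impl", "hive")
  .getOrCreate()

import spark.implicits._
val df = Seq(("col1val1", "col2val1")).toDF("col1", "col2")
// Illustrative output path, alongside the ones used in the shell session.
df.write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2_app")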

----------------------------------------------------------------------------------------------------
Step 4) Copy the ORC files created in Step 3 to HDFS /tmp on a Hadoop cluster that has Hive 2.1.1 (for example CDH 6.x) and run the following command to read the metadata of the ORC file. As shown below, it fails with the same exception, even after following the workaround suggested by Spark of setting spark.sql.orc.impl to hive.
----------------------------------------------------------------------------------------------------
[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
Processing data file /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
        at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
        at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
        at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
        at org.apache.orc.tools.FileDump.main(FileDump.java:154)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
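As a sanity check (my addition, not in the original report): reading the same output back with Spark's own ORC reader should succeed, which would suggest the file itself is well formed and the failure is specific to the older reader used by Hive 2.1.1. A minimal sketch using the Step 3 output path:

// Sanity check (sketch): read the ORC output back with Spark's reader.
// If the rows come back, the file is readable by newer ORC code and the
// problem is on the Hive 2.1.1 side.
val readBack = spark.read.format("orc")
  .load("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
readBack.show()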

----------------------------------------------------------------------------------------------------
Note: The same metadata fetch works fine with Hive 2.3 or later versions.
----------------------------------------------------------------------------------------------------

So the main concern here is that setting spark.sql.orc.impl to hive does not produce ORC files that work with Hive 2.1.1 or older. Can someone help here? Is there any other workaround available? Can this be looked into on priority? Thank you.

References:
https://spark.apache.org/docs/latest/sql-migration-guide.html (the workaround of setting spark.sql.orc.impl=hive is mentioned here but is not working): "Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, spark.sql.orc.impl and spark.sql.orc.filterPushdown change their default values to native and true respectively. ORC files created by native ORC writer cannot be read by some old Apache Hive releases. Use spark.sql.orc.impl=hive to create the files shared with Hive 2.1.1 and older."
 https://issues.apache.org/jira/browse/SPARK-26932
 https://issues.apache.org/jira/browse/HIVE-16683



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org