Posted to issues@spark.apache.org by "Andrey Zinovyev (JIRA)" <ji...@apache.org> on 2019/06/05 11:04:00 UTC
[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader
implements its own schema evolution
[ https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856592#comment-16856592 ]
Andrey Zinovyev commented on SPARK-27913:
-----------------------------------------
A simple way to reproduce it:
{code:sql}
create external table test_broken_orc(a struct<f1:int>) stored as orc;
insert into table test_broken_orc select named_struct("f1", 1);
drop table test_broken_orc;
create external table test_broken_orc(a struct<f1:int, f2:int>) stored as orc;
select * from test_broken_orc;
{code}
The last statement fails with an exception:
{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
{noformat}
You can also trigger the same failure by removing a column or adding a column in the middle of the struct. As far as I understand the current implementation, it supports by-name field resolution only at the top level of the ORC structure. Everything nested deeper is resolved by index and is expected to match the reader schema exactly.
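To make the failure mode concrete, here is a minimal sketch (hypothetical pseudocode, not Spark's actual deserializer) of the difference between by-index and by-name resolution of nested struct fields. The function names and data are invented for illustration only:

```python
# Hypothetical sketch of why positional resolution of nested struct
# fields breaks under schema evolution, while by-name resolution works.

# The file was written with struct<f1:int>: one stored value per struct.
stored_values = [1]
file_schema = ["f1"]

# The table was then recreated as struct<f1:int, f2:int>.
reader_schema = ["f1", "f2"]

def deserialize_by_index(values, schema):
    # Fetch each nested field by position. Position 1 does not exist in
    # the stored data, so this raises IndexError -- the Python analogue
    # of the ArrayIndexOutOfBoundsException in the stack trace above.
    return {name: values[i] for i, name in enumerate(schema)}

def deserialize_by_name(values, written_schema, schema):
    # Schema-evolution-aware resolution: match fields by name against
    # the schema the file was written with; missing fields become NULL.
    by_name = dict(zip(written_schema, values))
    return {name: by_name.get(name) for name in schema}

try:
    deserialize_by_index(stored_values, reader_schema)
except IndexError as e:
    print("by-index resolution fails:", e)

print(deserialize_by_name(stored_values, file_schema, reader_schema))
```

The by-name variant is essentially what ORC's own schema evolution does when it is given the reader schema; the by-index variant is what the native Spark deserializer effectively does for nested fields.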
> Spark SQL's native ORC reader implements its own schema evolution
> -----------------------------------------------------------------
>
> Key: SPARK-27913
> URL: https://issues.apache.org/jira/browse/SPARK-27913
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.3
> Reporter: Owen O'Malley
> Priority: Major
>
> ORC's reader handles a wide range of schema evolution, but the Spark SQL native ORC bindings do not provide the desired schema to the ORC reader. This causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org