Posted to issues@spark.apache.org by "Andrey Zinovyev (JIRA)" <ji...@apache.org> on 2019/06/05 11:04:00 UTC

[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

    [ https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856592#comment-16856592 ] 

Andrey Zinovyev commented on SPARK-27913:
-----------------------------------------

A simple way to reproduce it:


{code:sql}
create external table test_broken_orc(a struct<f1:int>) stored as orc;
insert into table test_broken_orc select named_struct("f1", 1);
drop table test_broken_orc;
create external table test_broken_orc(a struct<f1:int, f2:int>) stored as orc;
select * from test_broken_orc;
{code}

The last statement fails with the following exception:

{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
{noformat}

You can also trigger the same failure by removing a column or adding a column in the middle of the struct. As far as I understand the current implementation, it supports by-name field resolution only at the top level of the ORC structure; everything deeper is resolved by index and is expected to match the reader schema exactly.
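To make the by-index failure concrete, here is a self-contained sketch of the failure mode. The object and variable names are illustrative only, not Spark's actual internals; it just models what OrcDeserializer and OrcStruct.getFieldValue do according to the stack trace above.

{code:scala}
object NestedByIndexSketch {
  def main(args: Array[String]): Unit = {
    // The file wrote struct<f1:int>, so the materialized struct holds
    // exactly one field value.
    val fileStructValues: Array[Any] = Array(1)

    // The reader schema declares struct<f1:int, f2:int>, so the
    // deserializer walks ordinals 0 and 1 (as OrcStruct.getFieldValue
    // is asked to do in the stack trace above).
    val readerFieldCount = 2
    for (ordinal <- 0 until readerFieldCount) {
      // ordinal 1 throws java.lang.ArrayIndexOutOfBoundsException: 1,
      // matching the exception shown above.
      println(s"f${ordinal + 1} = ${fileStructValues(ordinal)}")
    }
  }
}
{code}

Until the native reader passes the desired schema down to ORC, switching spark.sql.orc.impl back to 'hive' should avoid the crash, since the description below notes the regression appeared when moving from 'hive' to 'native'.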


> Spark SQL's native ORC reader implements its own schema evolution
> -----------------------------------------------------------------
>
>                 Key: SPARK-27913
>                 URL: https://issues.apache.org/jira/browse/SPARK-27913
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.3
>            Reporter: Owen O'Malley
>            Priority: Major
>
> ORC's reader handles a wide range of schema evolution, but the Spark SQL native ORC bindings do not provide the desired schema to the ORC reader. This causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.
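For reference, "providing the desired schema to the ORC reader" means handing the reader schema to ORC's Reader.Options, which engages ORC's own SchemaEvolution to match nested fields and null-fill missing ones. A minimal sketch against the orc-core API follows; the file path is hypothetical and error handling is omitted.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{OrcFile, TypeDescription}

object OrcEvolutionSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical path to one of the files written by the repro above.
    val file = new Path("/warehouse/test_broken_orc/000000_0")

    // The desired (reader) schema, including the new nested field f2.
    val readerSchema =
      TypeDescription.fromString("struct<a:struct<f1:int,f2:int>>")

    val reader =
      OrcFile.createReader(file, OrcFile.readerOptions(new Configuration()))

    // Passing the reader schema here lets ORC handle the evolution:
    // f1 is matched and the missing f2 comes back as null, instead of
    // the positional lookup failing.
    val rows = reader.rows(reader.options().schema(readerSchema))
    val batch = readerSchema.createRowBatch()
    while (rows.nextBatch(batch)) {
      println(s"read ${batch.size} rows")
    }
    rows.close()
  }
}
{code}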



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org