Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/14 05:53:47 UTC

[GitHub] [spark] SaurabhChawla100 edited a comment on pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

SaurabhChawla100 edited a comment on pull request #29045:
URL: https://github.com/apache/spark/pull/29045#issuecomment-657978027


   > Can you be more specific about the problem? Are you saying that the actual file schema doesn't match the table schema specified by the user?
   
   In the case of ORC data created by Hive, there are no field names in the physical schema. Please see the code below for reference.
   https://github.com/apache/spark/blob/24be81689cee76e03cd5136dfd089123bbff4595/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L133
   
   From this code we send the index of the column, computed from the data schema.
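   
   As a rough sketch of that positional fallback (illustrative only, not the actual OrcUtils code; the `date_dim` column names below are just the usual TPC-DS ones, assumed here for the example):
   
   ```
   // Minimal sketch, assuming a Hive-written ORC file whose physical schema carries
   // only placeholder field names (_col0, _col1, ...): requested columns are matched
   // by position against the table (data) schema, and their ordinals are returned.
   val dataSchemaFields = Seq("d_date_sk", "d_date_id", "d_date", "d_month_seq",
     "d_week_seq", "d_quarter_seq", "d_year")                        // data schema order
   val physicalFields = dataSchemaFields.indices.map(i => s"_col$i") // _col0 ... _col6
   
   // Requested column "d_year" resolves to its ordinal in the data schema.
   val requestedColIds = Seq("d_year").map(dataSchemaFields.indexOf) // Seq(6)
   ```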
   
   Whereas in the code below, we pass the input result schema, and that result schema does not contain the index number that was passed from OrcUtils.scala:
   https://github.com/apache/spark/blob/24be81689cee76e03cd5136dfd089123bbff4595/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L211
   
   For example:
   
   ```
   val u = """select date_dim. d_year from date_dim limit 5"""
   
   spark.sql(u).collect
   ```
   
   Here the index of `d_year` returned by OrcUtils.scala#L133 is 6,
   
   whereas the resultSchema passed in OrcFileFormat.scala#L211 has only one field: struct<`d_year`:int>.
   
   So using the index value 6 against the resultSchema, which has size 1, throws the exception:
   
   ```
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, 192.168.0.103, executor driver): java.lang.ArrayIndexOutOfBoundsException: 6
       at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
       at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
   ```
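   
   A minimal sketch of why this blows up (illustrative only, not the reader code itself), assuming the ordinal 6 computed from the data schema is used to index the one-column result schema:
   
   ```
   import org.apache.spark.sql.types._
   
   // Result schema of "select d_year from date_dim": a single field.
   val resultSchema = StructType(Seq(StructField("d_year", IntegerType)))
   
   // Ordinal computed against the full data schema (d_year is the 7th column).
   val requestedColId = 6
   
   // Indexing the size-1 result schema with 6 fails the same way the batch reader does:
   // resultSchema.fields(requestedColId)  // java.lang.ArrayIndexOutOfBoundsException: 6
   ```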


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org