
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/06/10 19:26:38 UTC

[GitHub] [iceberg] RussellSpitzer opened a new issue #2692: [Spark] Opaque error when attempting to do vectorized read of Parquet file with unsupported encoding

RussellSpitzer opened a new issue #2692:
URL: https://github.com/apache/iceberg/issues/2692


   When Parquet vectorized reading is enabled and a file uses DELTA_BYTE_ARRAY encoding, we throw a NullPointerException.
   
   ```
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2) (macbook-pro.attlocal.net executor driver): java.lang.NullPointerException
   	at org.apache.iceberg.arrow.vectorized.parquet.BaseVectorizedParquetValuesReader.readUnsignedVarInt(BaseVectorizedParquetValuesReader.java:137)
   	at org.apache.iceberg.arrow.vectorized.parquet.BaseVectorizedParquetValuesReader.readNextGroup(BaseVectorizedParquetValuesReader.java:187)
   	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader.readBatchVarWidth(VectorizedParquetDefinitionLevelReader.java:714)
   	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.nextBatchVarWidthType(VectorizedPageIterator.java:393)
   	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator.nextBatchVarWidthType(VectorizedColumnIterator.java:182)
   	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:148)
   	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:70)
   	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:39)
   	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
   	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:88)
   	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
   	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
   	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   ```
   
   This can occur when Parquet files are added through non-Spark frameworks like Trino, or when files are added to a table manually using the Java API.
   
   Spark's vectorized reader also does not support this encoding, but it throws a clearer error:
   
   ```
   java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:783)
   ```
   
   Here is the metadata for one of the files in question:
   ```
   file:                  file:/Users/russellspitzer/Temp/vector/data/c7cd9613-1349-466b-90ba-24c98b0e3722.parquet
   creator:               null
   
   file schema:           table
   --------------------------------------------------------------------------------
   id:                    OPTIONAL BINARY L:STRING R:0 D:1
   ts:                    OPTIONAL INT64 R:0 D:1
   s_id:                  OPTIONAL BINARY L:STRING R:0 D:1
   
   row group 1:           RC:1 TS:165 OFFSET:4
   --------------------------------------------------------------------------------
   id:                     BINARY GZIP DO:0 FPO:4 SZ:53/35/0.66 VC:1 ENC:DELTA_BYTE_ARRAY ST:[min: 1, max: 1, num_nulls: 0]
   ts:                     INT64 GZIP DO:0 FPO:57 SZ:54/34/0.63 VC:1 ENC:DELTA_BINARY_PACKED ST:[min: 1619809949087, max: 1619809949087, num_nulls: 0]
   s_id:                   BINARY GZIP DO:0 FPO:111 SZ:58/40/0.69 VC:1 ENC:DELTA_BYTE_ARRAY ST:[min: 708546, max: 708546, num_nulls: 0]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on issue #2692: [Spark] NullPointerException error when attempting to do vectorized read of Parquet file with unsupported encoding

Posted by GitBox <gi...@apache.org>.

kbendick commented on issue #2692:
URL: https://github.com/apache/iceberg/issues/2692#issuecomment-861269643


   As a starting point, for the Spark vectorized Parquet reader, I think we should explicitly throw at read time when we encounter either
   - (1) an encoding that’s not supported, or
   - (2) a Parquet v2 file.
   
   
   I think approach 2 would potentially be simpler and more in line with the code from Spark (on which we modeled this class), which has explicit V1 and V2 paths for data pages, footers, etc.
   
   As far as I know, Spark 3.1.1 does not support vectorized reading of files written in the Parquet v2 format, though support seems to be in the works.
   
   A more helpful error message would go a long way until we’ve updated the Spark vectorized Parquet reader to support both the Parquet v1 and v2 formats.
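   
   A minimal sketch of what the two checks could look like. Everything here is hypothetical and self-contained: the `EncodingGuard` class, the local `Encoding` enum (a stand-in for Parquet's own encoding enum), the `SUPPORTED` set, and the integer page version are assumptions for illustration, not the actual Iceberg reader code or its real list of supported encodings.
   
   ```java
   import java.util.EnumSet;
   import java.util.Set;
   
   public class EncodingGuard {
     // Hypothetical stand-in for Parquet's encoding enum; only a few names are listed.
     enum Encoding { PLAIN, RLE_DICTIONARY, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY }
   
     // Assumption: the encodings the vectorized reader can decode.
     static final Set<Encoding> SUPPORTED = EnumSet.of(Encoding.PLAIN, Encoding.RLE_DICTIONARY);
   
     // Approach 1: reject an unsupported encoding before any values reader is used.
     static void checkEncoding(Encoding encoding) {
       if (!SUPPORTED.contains(encoding)) {
         throw new UnsupportedOperationException(
             "Cannot do vectorized read, unsupported encoding: " + encoding);
       }
     }
   
     // Approach 2: reject anything that is not a v1 data page.
     // The int version is a simplification of distinguishing v1 and v2 page types.
     static void checkPageVersion(int dataPageVersion) {
       if (dataPageVersion != 1) {
         throw new UnsupportedOperationException(
             "Cannot do vectorized read of Parquet data page v" + dataPageVersion);
       }
     }
   
     public static void main(String[] args) {
       checkEncoding(Encoding.PLAIN); // passes silently
       try {
         checkEncoding(Encoding.DELTA_BYTE_ARRAY);
       } catch (UnsupportedOperationException e) {
         // The failure now names the encoding instead of surfacing as an NPE.
         System.out.println(e.getMessage());
       }
     }
   }
   ```
   
   Either check would run when a page is initialized, so the failure happens before the reader dereferences state it never set up for that encoding.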



[GitHub] [iceberg] RussellSpitzer closed issue #2692: [Spark] NullPointerException error when attempting to do vectorized read of Parquet file with unsupported encoding

Posted by GitBox <gi...@apache.org>.

RussellSpitzer closed issue #2692:
URL: https://github.com/apache/iceberg/issues/2692


   

