Posted to dev@parquet.apache.org by "David Palmer (Jira)" <ji...@apache.org> on 2023/08/20 22:38:00 UTC

[jira] [Created] (PARQUET-2339) ArrayIndexOutOfBounds exception writing parquet from Avro in Apache Hudi

David Palmer created PARQUET-2339:
-------------------------------------

             Summary: ArrayIndexOutOfBounds exception writing parquet from Avro in Apache Hudi
                 Key: PARQUET-2339
                 URL: https://issues.apache.org/jira/browse/PARQUET-2339
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro, parquet-mr
    Affects Versions: 1.12.3
         Environment: Amazon EMR 6.12.x, Apache Hudi 0.13.1, Apache Spark 3.4.0, Linux in Docker
            Reporter: David Palmer


While writing an Apache Hudi table using the DeltaStreamer utility, I receive an exception from the Parquet `AvroWriteSupport` class:

```
23/08/17 22:43:50 ERROR HoodieCreateHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=id:05a3065f8cf0494f9dc449307a0fddd8,idx:01 partitionPath=event.year=2023/event.month=08/event.day=17/event.hour=22}, currentLocation='null', newLocation='null'}
java.lang.ArrayIndexOutOfBoundsException: Index 5 out of bounds for length 5
    at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:476) ~[parquet-column-1.12.3-amzn-0.jar:1.12.3-amzn-0]
    at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:358) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:287) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:200) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:174) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138) ~[parquet-hadoop-1.12.3-amzn-0.jar:1.12.3-amzn-0]
    at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:310) ~[parquet-hadoop-1.12.3-amzn-0.jar:1.12.3-amzn-0]
    at org.apache.hudi.io.storage.HoodieBaseParquetWriter.write(HoodieBaseParquetWriter.java:80) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.storage.HoodieAvroParquetWriter.writeAvroWithMetadata(HoodieAvroParquetWriter.java:67) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.storage.HoodieAvroFileWriter.writeWithMetadata(HoodieAvroFileWriter.java:45) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.storage.HoodieFileWriter.writeWithMetadata(HoodieFileWriter.java:39) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.HoodieCreateHandle.doWrite(HoodieCreateHandle.java:147) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:175) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:98) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:42) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:67) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46) ~[scala-library-2.12.15.jar:?]
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[scala-library-2.12.15.jar:?]
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]

```
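
For context, here is a minimal standalone sketch of the write path that fails. The schema and values are illustrative assumptions on my part (not the actual Hudi table contents); nullable and nested list fields are a common trigger for list-structure problems in `AvroWriteSupport`:

```java
// Hypothetical repro sketch -- schema and data are assumptions, not the
// actual Hudi table. Writes one Avro GenericRecord through parquet-avro,
// which is the same path as AvroWriteSupport.write(...) in the trace above.
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroParquetRepro {
  public static void main(String[] args) throws Exception {
    // Record with a nullable array of nullable strings (assumed shape).
    String schemaJson = "{"
        + "\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "  {\"name\":\"id\",\"type\":\"string\"},"
        + "  {\"name\":\"tags\",\"type\":[\"null\","
        + "    {\"type\":\"array\",\"items\":[\"null\",\"string\"]}],\"default\":null}"
        + "]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord record = new GenericData.Record(schema);
    record.put("id", "05a3065f8cf0494f9dc449307a0fddd8");
    record.put("tags", Arrays.asList("a", null, "b"));

    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("file:///tmp/repro.parquet"))
            .withSchema(schema)
            .withConf(new Configuration())
            .build()) {
      writer.write(record); // AvroWriteSupport.write(...) runs here
    }
  }
}
```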

I have tried setting `spark.hadoop.parquet.avro.write-old-list-structure: false`, but the exception persists.
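
For reference, the `spark.hadoop.` prefix just propagates the key into the job's Hadoop Configuration; the non-Spark equivalent is to set it directly on the configuration passed to the writer. A sketch (`AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE` is the constant for the `parquet.avro.write-old-list-structure` key):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroWriteSupport;

// Set the same flag directly on the Hadoop Configuration that the
// Parquet writer is built with; equivalent to the Spark-prefixed form.
Configuration conf = new Configuration();
conf.setBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, false);
// ...then pass conf to AvroParquetWriter.builder(...).withConf(conf)
```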


