You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/29 18:46:20 UTC

[GitHub] [hudi] soma1712 opened a new issue, #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

soma1712 opened a new issue, #6249:
URL: https://github.com/apache/hudi/issues/6249

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   Detailed Notes - 
   
   We have Incoming Delta transactions from an Oracle based application that are being pushed into S3 endpoint using AWS DMS services. These CDC records are applied as upserts on to already existing Hudi table in a different S3 bucket (Initial Load data). The UPSERTS are happening by running below Spark Submits -
   
   spark-submit \
   --deploy-mode client \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.default.parallelism=500 \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.initialExecutors=3 \
   --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.app.name=<table_1> \
   --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar /usr/lib/hudi/hudi-utilities-bundle.jar \
   --table-type MERGE_ON_READ \
   --op UPSERT \
   --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
   --source-ordering-field dms_seq_no \
   --props s3://bucket/cdc.properties \
   --hoodie-conf hoodie.datasource.hive_sync.database=glue_db \
   --target-base-path s3://bucket/table_1 \
   --target-table table_1 \
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/ \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource --enable-sync
   
   This table <table1> will be subsequently read with hudi options and joined with other hudi tables to populate the Final Enriched layer. While reading a Hudi table we are facing the ArrayIndexOutOfbound exception.
   
   Below are the Hudi props and Spark Submits we execute to read and populate the downstream.
   
   hoodie.datasource.write.partitionpath.field=
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.assume_date_partitioning=false
   hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
   hoodie.parquet.small.file.limit=134217728
   hoodie.parquet.max.file.size=1048576000
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   hoodie.cleaner.commits.retained=1
   hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,* from <SRC>
   hoodie.datasource.hive_sync.support_timestamp=true
   hoodie.datasource.compaction.async.enable=true
   hoodie.index.type=BLOOM
   hoodie.compact.inline=true
   hoodiecompactionconfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP=5
   hoodie.metadata.compact.max.delta.commits=5
   hoodie.clean.automatic=true
   hoodie.clean.async=true
   hoodie.datasource.hive_sync.table=table_1
   hoodie.datasource.write.recordkey.field=table_1_ID
   
   spark-submit --deploy-mode client --conf spark.yarn.appMasterEnv.SPARK_HOME=/prod/null --conf spark.executorEnv.SPARK_HOME=/prod/null --conf spark.shuffle.service.enabled=true --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar s3://pythonscripts/hudi_read.py
   
   
   
   TaskSetManager: Lost task 32.2 in stage 6.0 (TID 253) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
   22/07/21 15:50:26 INFO TaskSetManager: Starting task 32.3 in stage 6.0 (TID 296, ip-172-31-16-236.ec2.internal, executor 1, partition 32, PROCESS_LOCAL, 8887 bytes)
   22/07/21 15:50:26 INFO TaskSetManager: Lost task 33.2 in stage 6.0 (TID 256) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] alexeykudinkin commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on issue #6249:
URL: https://github.com/apache/hudi/issues/6249#issuecomment-1242536732

   @soma1712 if you can please fill out the issue following the format, that would greatly help us in investigation (for ex, what version of Hudi are you using?)
   
   Can you please paste full stack-trace of the exception?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] soma1712 commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

Posted by GitBox <gi...@apache.org>.

soma1712 commented on issue #6249:
URL: https://github.com/apache/hudi/issues/6249#issuecomment-1200084568

   hudi_read.txt is actually a .py file. As the system was not supporting to update a .py, I had to change it to .txt
   
   [hudi_read.txt](https://github.com/apache/hudi/files/9224774/hudi_read.txt)
   [results.txt](https://github.com/apache/hudi/files/9224775/results.txt)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] codope commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

Posted by GitBox <gi...@apache.org>.

codope commented on issue #6249:
URL: https://github.com/apache/hudi/issues/6249#issuecomment-1276330333

   Closing due to inactivity and non-reproducibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] alexeykudinkin commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on issue #6249:
URL: https://github.com/apache/hudi/issues/6249#issuecomment-1242540904

   @soma1712 are you sure you're running 0.11.1? 
   
   It seems like you're running an earlier version: looking at your stacktrace, i see method being used (`createRowWithRequiredSchema`) which was removed in 0.11.1:
   
   ```
   22/07/30 04:09:14 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 7, ip-172-31-18-101.ec2.internal, executor 2): java.lang.ArrayIndexOutOfBoundsException: 9157
           at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong(PlainValuesDictionary.java:165)
           at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
           at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
           at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getLong(MutableColumnarRow.java:120)
           at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.get(MutableColumnarRow.java:189)
           at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3$$anonfun$createRowWithRequiredSchema$1.apply(HoodieMergeOnReadRDD.scala:278)
           at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3$$anonfun$createRowWithRequiredSchema$1.apply(HoodieMergeOnReadRDD.scala:276)
           at scala.collection.Iterator$class.foreach(Iterator.scala:891)
           at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
           at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
           at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
           at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.createRowWithRequiredSchema(HoodieMergeOnReadRDD.scala:275)
           at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.hasNext(HoodieMergeOnReadRDD.scala:236)
           at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
           at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
           at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
           at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:297)
           at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:289)
           at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
           at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
           at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
           at org.apache.spark.scheduler.Task.run(Task.scala:123)
           at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
           at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
           at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6249:
URL: https://github.com/apache/hudi/issues/6249#issuecomment-1229275238

   can you share contents of .hoodie if its feasible. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] codope closed issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

Posted by GitBox <gi...@apache.org>.

codope closed issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception
URL: https://github.com/apache/hudi/issues/6249


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] codope commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

Posted by GitBox <gi...@apache.org>.

codope commented on issue #6249:
URL: https://github.com/apache/hudi/issues/6249#issuecomment-1263392242

   @soma1712 I have been running a deltastreamer MOR table job for over 30+ commits with sql transformer (though my setup is a bit different, I'm not using DMS). So far it's going well. No issues. We have had long-running deltastreamer tests in our test infra. Could you try with the latest Hudi?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] yihua commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception

Posted by GitBox <gi...@apache.org>.

yihua commented on issue #6249:
URL: https://github.com/apache/hudi/issues/6249#issuecomment-1200047941

   @soma1712 could you share how you read the Hudi table in `s3://pythonscripts/hudi_read.py` and the full stacktrace as well?  Which Hudi release do you use?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org