Posted to commits@hudi.apache.org by "kazdy (via GitHub)" <gi...@apache.org> on 2023/03/21 15:37:57 UTC
[GitHub] [hudi] kazdy opened a new issue, #8259: [SUPPORT] ParquetDecodingException: Can not read value at 0 in block -1 in file after switching to async services.
kazdy opened a new issue, #8259:
URL: https://github.com/apache/hudi/issues/8259
**Describe the problem you faced**
I'm observing failed Spark stages ("Doing partition and writing data: <table name>") in the "SparkUpsertCommitActionExecutor" job group; effectively, the job cannot update records.
There were no changes to the schema or data types; I only switched from inline to async table services for a CoW table that is updated by a streaming job.
Only one job suffers from this issue after switching to async services, and I can see that async services such as cleaning are running fine.
I'm using the in-process lock provider, and the metadata table is enabled.
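A minimal sketch of the setup described above, assuming a Structured Streaming upsert writer (the table name, paths, checkpoint location, and `source_df` are hypothetical placeholders):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CoW table, streaming upserts, async table services, in-process lock, MDT on.
hudi_options = {
    "hoodie.table.name": "my_table",  # hypothetical
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.metadata.enable": "true",
    # switched from inline to async table services:
    "hoodie.clean.async": "true",
    "hoodie.clustering.async.enabled": "true",
    # single writer plus async services still coordinate through a lock provider:
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
}

stream = (
    source_df.writeStream.format("hudi")  # source_df: the incoming stream (hypothetical)
    .options(**hudi_options)
    .option("checkpointLocation", "s3://bucket/checkpoints/my_table")  # hypothetical
    .outputMode("append")
    .start("s3://bucket/tables/my_table")  # hypothetical
)
```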
**To Reproduce**
Steps to reproduce the behavior:
Not sure how to reproduce; other jobs are running fine after enabling async services.
**Expected behavior**
Hudi continues to write updates with no failed stages.
**Environment Description**
* Hudi version : 0.12.1-amzn
* Spark version : 3.3.0-amzn-1
* Hive version : na
* Hadoop version : na
* Storage (HDFS/S3/GCS..) : s3
* Running on Docker? (yes/no) : no, running on EMR Serverless
**Stacktrace**
```
Job aborted due to stage failure: Task 0 in stage 90.0 failed 4 times, most recent failure: Lost task 0.3 in stage 90.0 (TID 203) ([executorid] executor 4): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:138)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
... 28 more
Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:161)
at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:147)
... 31 more
Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:155)
... 32 more
Caused by: org.apache.hudi.exception.HoodieException: operation has failed
at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.throwExceptionIfFailed(BoundedInMemoryQueue.java:248)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.readNextRecord(BoundedInMemoryQueue.java:226)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.access$100(BoundedInMemoryQueue.java:52)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueue$QueueIterator.hasNext(BoundedInMemoryQueue.java:278)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:36)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:135)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53)
at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:106)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
... 4 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://pathtoparuqetfile/somefile.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:48)
... 8 more
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.parquet.avro.AvroConverters$BinaryConverter.setDictionary(AvroConverters.java:77)
at org.apache.parquet.column.impl.ColumnReaderBase.<init>(ColumnReaderBase.java:415)
at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:46)
at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
... 11 more
```
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1480945493
@nsivabalan I would really appreciate any support or guidance here; I have two broken tables in prod now and need to come up with a fix or workaround. Thanks!
[GitHub] [hudi] mansipp commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "mansipp (via GitHub)" <gi...@apache.org>.
mansipp commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1530093282
@kazdy can you please share more information about this?
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema, corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1478507792
Another thing I noticed is that clustering created a file that's around 900 MB; is this expected? I'm using defaults all the way when it comes to file sizing.
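For reference, clustering sizes its output through its own plan-strategy settings rather than the regular write-path file-size configs, so an output approaching 1 GB is plausible with defaults. A sketch of the relevant knobs (values are the documented defaults as I understand them; worth verifying against your Hudi version):
```
# Clustering output sizing is governed by the plan strategy, not by the usual
# hoodie.parquet.max.file.size used on the write path.
clustering_sizing = {
    # target size for clustered output files (documented default is ~1 GB):
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
    # files below this size are clustering candidates (documented default is ~300 MB):
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
}
```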
[GitHub] [hudi] mansipp commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "mansipp (via GitHub)" <gi...@apache.org>.
mansipp commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1527846753
Hi @kazdy @nsivabalan
In my testing I have observed that it works with the open-source 0.13.0 release.
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1542821519
It was 0.12.2, not 0.13; sorry for the confusion. The issue does not occur with 0.13.
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] ParquetDecodingException: Can not read value at 0 in block -1 in file after switching to async services.
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1478402727
I inspected all of the files that were source files for clustering, and they all have the same schema.
The file created as a result of clustering has a changed schema: the column named "year" was moved from the middle of the file schema to the end.
It looks like the schema was modified during clustering, and this is now causing issues.
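A quick way to confirm this, as a sketch with hypothetical file paths:
```
# Compare the schema of a clustering input file against the clustering output
# (paths are hypothetical placeholders):
src = spark.read.parquet("s3://bucket/tables/my_table/year=2023/source_file.parquet")
dst = spark.read.parquet("s3://bucket/tables/my_table/year=2023/clustered_file.parquet")
src.printSchema()
dst.printSchema()

# Field-order differences show up directly:
print([f.name for f in src.schema.fields])
print([f.name for f in dst.schema.fields])
```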
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1482386262
https://hudi.apache.org/docs/metadata
This page advises using the in-process lock when the metadata table and async services are enabled in a single-writer scenario. Is it incorrect?
Or did you mean that I should not use async metadata indexing together with other async table services in single-writer mode with the in-process lock provider?
Nevertheless, I was able to reproduce the issue without async table services in the streaming job. It looks like the issue is in clustering itself.
[GitHub] [hudi] danny0405 commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1482220312
> I'm using in-process lock
The in-process lock cannot work correctly for the MDT with async table services.
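If the in-process lock is indeed insufficient here, one possible alternative on S3/EMR is an external lock provider such as the DynamoDB-based one; a sketch (the table, key, and region values are hypothetical, and the exact config keys should be verified against your Hudi version):
```
lock_options = {
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",        # hypothetical
    "hoodie.write.lock.dynamodb.partition_key": "my_table",  # hypothetical
    "hoodie.write.lock.dynamodb.region": "us-east-1",        # hypothetical
}
```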
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1530087223
@mansipp I was able to replicate this case using Hudi 0.13 as well. Files created by clustering have a changed schema. Interestingly, there's no issue if the order of types (not columns) does not change, e.g. if I have only columns of type String.
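A minimal repro sketch along these lines (hypothetical data and paths): an int partition column in the middle of a mixed-type schema, with inline clustering forced so it runs on every commit, plus the row-writer and hive-style-partitioning settings later reported to matter.
```
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(id="a", name="x", year=2023, amount=10, note="n"),  # int 'year' mid-schema
])

opts = {
    "hoodie.table.name": "clustering_repro",  # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "amount",
    "hoodie.datasource.write.partitionpath.field": "year",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.row.writer.enable": "true",
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "1",  # cluster on every commit
}

df.write.format("hudi").options(**opts).mode("append").save("s3://bucket/clustering_repro")
# Append a second batch, then compare the clustered file's schema against the
# pre-clustering files, e.g. with the printSchema() sketch earlier in the thread.
```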
[GitHub] [hudi] kazdy closed issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy closed issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
URL: https://github.com/apache/hudi/issues/8259
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1489217716
Hi @nsivabalan, I see that Danny assigned you to this ticket.
I was able to replicate this exact case; the previous repro did not quite expose the issue. I'll update the repo soon.
Here's what I found out, along with some additional info.
First, I'm wondering why I'm getting the following in the stack trace:
```
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
```
I don't have any Long/Bigint in the table schema or in the incoming schema; all numeric types are explicitly cast to INT or DECIMAL(10,0) (Spark's default).
The schema of the incoming batch should look like this:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
partitionCol1 int
partitionCol2 int
col6 String
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
```
Schema of clustered parquet files:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
col6 String // (earlier this position held partitionCol1 int) the reader expects an Int here but finds a String?
partitionCol2 int
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
partitionCol1 int
```
The schema in the replacecommit conforms to the incoming batch schema/table schema (i.e., it is correct).
I don't know whether Hudi resolves columns by position or by name, and whether that matters when reading a parquet file for merging.
If it's by position, then for col6 String (where partitionCol1 int used to be) Hudi would try to read the column as Int when it actually needs to read a String?
It then can't, since PlainLongDictionary has no decodeToBinary implementation?
I don't know if this makes sense at all, but it's my intuition.
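One way to check the positional hypothesis directly is to diff the parquet footers of an original file and a clustered file; a sketch using pyarrow (paths are hypothetical):
```
import pyarrow.parquet as pq

src = pq.read_schema("source_file.parquet")     # pre-clustering file (hypothetical)
dst = pq.read_schema("clustered_file.parquet")  # clustering output (hypothetical)

# If the reader resolved columns by position, any index where name/type differ
# would be decoded against the wrong expected type:
for i in range(min(len(src), len(dst))):
    if src.field(i).name != dst.field(i).name or src.field(i).type != dst.field(i).type:
        print(f"position {i}: {src.field(i).name}:{src.field(i).type}"
              f" -> {dst.field(i).name}:{dst.field(i).type}")
```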
[GitHub] [hudi] xushiyan commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1555017700
Tried a repro and found that moving the partition field to the end is still an issue in 0.13.0/0.12.3/0.12.2 when enabling both row-writing and hive-style partitioning. Will re-open for now.
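For clarity, the reported trigger combination as a config sketch (both enabled together):
```
# Per the comment above, enabling both of these reproduces the schema
# reordering in 0.13.0/0.12.3/0.12.2:
trigger_combo = {
    "hoodie.datasource.write.row.writer.enable": "true",        # row-writing path
    "hoodie.datasource.write.hive_style_partitioning": "true",  # hive-style partitions
}
```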
Re: [I] [SUPPORT] Clustering created files with modified schema resulting in corrupted table [hudi]
Posted by "VIKASPATID (via GitHub)" <gi...@apache.org>.
VIKASPATID commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1940894333
Hi, we are also facing a similar issue: enabling clustering and then running 2-3 upserts results in a column order/value mismatch that ends up corrupting the table. We are seeing this on emr-6.15.0, Spark 3.4.1, Apache Hudi 0.14. Please look into it, as we are not able to use clustering at all.
Re: [I] [SUPPORT] Clustering created files with modified schema resulting in corrupted table [hudi]
Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-2050334663
Can you try with [hoodie.datasource.write.row.writer.enable](https://hudi.apache.org/docs/configurations/#hoodiedatasourcewriterowwriterenable) = false?
It's described here; the problem is a bit different, but for us it helped to create files in the right format.
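Applied on the write path, the suggested workaround looks roughly like this (other writer options elided; names and paths are hypothetical):
```
(df.write.format("hudi")
   .options(**hudi_options)  # the table's usual writer options (hypothetical)
   .option("hoodie.datasource.write.row.writer.enable", "false")  # workaround
   .mode("append")
   .save("s3://bucket/tables/my_table"))  # hypothetical path
```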