Posted to commits@hudi.apache.org by "kazdy (via GitHub)" <gi...@apache.org> on 2023/03/21 15:37:57 UTC
[GitHub] [hudi] kazdy opened a new issue, #8259: [SUPPORT] ParquetDecodingException: Can not read value at 0 in block -1 in file after switching to async services.
kazdy opened a new issue, #8259:
URL: https://github.com/apache/hudi/issues/8259
**Describe the problem you faced**
I'm observing failed Spark stages ("Doing partition and writing data: <table name>") in the "SparkUpsertCommitActionExecutor" job group; effectively, the job cannot update records.
There were no changes to the schema or data types; I only switched from inline to async table services for a CoW table that is updated by a streaming job.
Only one job suffers from this issue after switching to async services, and I can see that async services such as cleaning are running fine.
I'm using the in-process lock provider, and the metadata table is enabled.
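A minimal sketch of the setup described above, assuming a Structured Streaming upsert writer (the table name, paths, checkpoint location, and `source_df` are hypothetical placeholders):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CoW table, streaming upserts, async table services, in-process lock, MDT on.
hudi_options = {
    "hoodie.table.name": "my_table",  # hypothetical
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.metadata.enable": "true",
    # switched from inline to async table services:
    "hoodie.clean.async": "true",
    "hoodie.clustering.async.enabled": "true",
    # single writer plus async services still coordinate through a lock provider:
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
}

stream = (
    source_df.writeStream.format("hudi")  # source_df: the incoming stream (hypothetical)
    .options(**hudi_options)
    .option("checkpointLocation", "s3://bucket/checkpoints/my_table")  # hypothetical
    .outputMode("append")
    .start("s3://bucket/tables/my_table")  # hypothetical
)
```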
**To Reproduce**
Steps to reproduce the behavior:
Not sure how to reproduce; other jobs are running fine after enabling async services.
**Expected behavior**
Hudi continues to write updates with no failed stages.
**Environment Description**
* Hudi version : 0.12.1-amzn
* Spark version : 3.3.0-amzn-1
* Hive version : na
* Hadoop version : na
* Storage (HDFS/S3/GCS..) : s3
* Running on Docker? (yes/no) : no, running on EMR Serverless
**Stacktrace**
```
Job aborted due to stage failure: Task 0 in stage 90.0 failed 4 times, most recent failure: Lost task 0.3 in stage 90.0 (TID 203) ([executorid] executor 4): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:138)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
... 28 more
Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:161)
at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:147)
... 31 more
Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:155)
... 32 more
Caused by: org.apache.hudi.exception.HoodieException: operation has failed
at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.throwExceptionIfFailed(BoundedInMemoryQueue.java:248)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.readNextRecord(BoundedInMemoryQueue.java:226)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.access$100(BoundedInMemoryQueue.java:52)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueue$QueueIterator.hasNext(BoundedInMemoryQueue.java:278)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:36)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:135)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53)
at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:106)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
... 4 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://pathtoparuqetfile/somefile.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:48)
... 8 more
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.parquet.avro.AvroConverters$BinaryConverter.setDictionary(AvroConverters.java:77)
at org.apache.parquet.column.impl.ColumnReaderBase.<init>(ColumnReaderBase.java:415)
at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:46)
at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
... 11 more
```
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1480945493
@nsivabalan I would really appreciate any support or guidance here; I have two broken tables in prod now and need to come up with a fix or workaround. Thanks!
[GitHub] [hudi] mansipp commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "mansipp (via GitHub)" <gi...@apache.org>.
mansipp commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1530093282
@kazdy can you please share more information about this?
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema, corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1478507792
Another thing I noticed is that clustering created a file that's around 900 MB; is this expected? I'm using defaults all the way when it comes to file sizing.
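For reference, clustering sizes its output through its own plan-strategy settings rather than the regular write-path file-size configs, so an output approaching 1 GB is plausible with defaults. A sketch of the relevant knobs (values are the documented defaults as I understand them; worth verifying against your Hudi version):
```
# Clustering output sizing is governed by the plan strategy, not by the usual
# hoodie.parquet.max.file.size used on the write path.
clustering_sizing = {
    # target size for clustered output files (documented default is ~1 GB):
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
    # files below this size are clustering candidates (documented default is ~300 MB):
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
}
```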
[GitHub] [hudi] mansipp commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "mansipp (via GitHub)" <gi...@apache.org>.
mansipp commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1527846753
Hi @kazdy @nsivabalan
In my testing I have observed that it works with the open-source 0.13.0 release.
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1542821519
It was 0.12.2, not 0.13; sorry for the confusion. The issue does not occur with 0.13.
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] ParquetDecodingException: Can not read value at 0 in block -1 in file after switching to async services.
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1478402727
I inspected all of the files that were source files for clustering, and they all have the same schema.
The file created as a result of clustering has a changed schema: the column named "year" was moved from the middle of the file schema to the end.
It looks like the schema was modified during clustering, and this is now causing issues.
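A quick way to confirm this, as a sketch with hypothetical file paths:
```
# Compare the schema of a clustering input file against the clustering output
# (paths are hypothetical placeholders):
src = spark.read.parquet("s3://bucket/tables/my_table/year=2023/source_file.parquet")
dst = spark.read.parquet("s3://bucket/tables/my_table/year=2023/clustered_file.parquet")
src.printSchema()
dst.printSchema()

# Field-order differences show up directly:
print([f.name for f in src.schema.fields])
print([f.name for f in dst.schema.fields])
```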
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1482386262
https://hudi.apache.org/docs/metadata
This page advises using the in-process lock when the metadata table and async services are enabled in a single-writer scenario. Is it incorrect?
Or did you mean that I should not use async metadata indexing together with other async table services in single-writer mode with the in-process lock provider?
Nevertheless, I was able to reproduce the issue without async table services in the streaming job. It looks like the issue is in clustering itself.
[GitHub] [hudi] danny0405 commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1482220312
> I'm using in-process lock
The in-process lock cannot work correctly for the MDT with async table services.
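If the in-process lock is indeed insufficient here, one possible alternative on S3/EMR is an external lock provider such as the DynamoDB-based one; a sketch (the table, key, and region values are hypothetical, and the exact config keys should be verified against your Hudi version):
```
lock_options = {
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",        # hypothetical
    "hoodie.write.lock.dynamodb.partition_key": "my_table",  # hypothetical
    "hoodie.write.lock.dynamodb.region": "us-east-1",        # hypothetical
}
```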
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1530087223
@mansipp I was able to replicate this case using Hudi 0.13 as well. Files created by clustering have a changed schema. Interestingly, there's no issue if the order of types (not columns) does not change, e.g. if I have only columns of type String.
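A minimal repro sketch along these lines (hypothetical data and paths): an int partition column in the middle of a mixed-type schema, with inline clustering forced so it runs on every commit, plus the row-writer and hive-style-partitioning settings later reported to matter.
```
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(id="a", name="x", year=2023, amount=10, note="n"),  # int 'year' mid-schema
])

opts = {
    "hoodie.table.name": "clustering_repro",  # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "amount",
    "hoodie.datasource.write.partitionpath.field": "year",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.row.writer.enable": "true",
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "1",  # cluster on every commit
}

df.write.format("hudi").options(**opts).mode("append").save("s3://bucket/clustering_repro")
# Append a second batch, then compare the clustered file's schema against the
# pre-clustering files, e.g. with the printSchema() sketch earlier in the thread.
```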
[GitHub] [hudi] kazdy closed issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy closed issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
URL: https://github.com/apache/hudi/issues/8259
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1489217716
Hi @nsivabalan, I see that Danny assigned you to this ticket.
I was able to replicate this exact case; the previous repro did not quite expose the issue. I'll update the repo soon.
Here's what I found out, along with some additional info.
First, I'm wondering why I'm getting the following in the stack trace:
```
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
```
I don't have any Long/Bigint in the table schema or in the incoming schema; all numeric types are explicitly cast to INT or DECIMAL(10,0) (Spark's default).
The schema of the incoming batch should look like this:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
partitionCol1 int
partitionCol2 int
col6 String
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
```
Schema of clustered parquet files:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
col6 String // (earlier this position held partitionCol1 int) the reader expects an Int here but finds a String?
partitionCol2 int
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
partitionCol1 int
```
The schema in the replacecommit conforms to the incoming batch schema/table schema (i.e., it is correct).
I don't know whether Hudi resolves columns by position or by name, and whether that matters when reading a parquet file for merging.
If it's by position, then for col6 String (where partitionCol1 int used to be) Hudi would try to read the column as Int when it actually needs to read a String?
It then can't, since PlainLongDictionary has no decodeToBinary implementation?
I don't know if this makes sense at all, but it's my intuition.
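One way to check the positional hypothesis directly is to diff the parquet footers of an original file and a clustered file; a sketch using pyarrow (paths are hypothetical):
```
import pyarrow.parquet as pq

src = pq.read_schema("source_file.parquet")     # pre-clustering file (hypothetical)
dst = pq.read_schema("clustered_file.parquet")  # clustering output (hypothetical)

# If the reader resolved columns by position, any index where name/type differ
# would be decoded against the wrong expected type:
for i in range(min(len(src), len(dst))):
    if src.field(i).name != dst.field(i).name or src.field(i).type != dst.field(i).type:
        print(f"position {i}: {src.field(i).name}:{src.field(i).type}"
              f" -> {dst.field(i).name}:{dst.field(i).type}")
```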
[GitHub] [hudi] xushiyan commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1555017700
Tried a repro and found that moving the partition field to the end is still an issue in 0.13.0/0.12.3/0.12.2 when enabling both row-writing and hive-style partitioning. Will re-open for now.
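For clarity, the reported trigger combination as a config sketch (both enabled together):
```
# Per the comment above, enabling both of these reproduces the schema
# reordering in 0.13.0/0.12.3/0.12.2:
trigger_combo = {
    "hoodie.datasource.write.row.writer.enable": "true",        # row-writing path
    "hoodie.datasource.write.hive_style_partitioning": "true",  # hive-style partitions
}
```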
Re: [I] [SUPPORT] Clustering created files with modified schema resulting in corrupted table [hudi]
Posted by "VIKASPATID (via GitHub)" <gi...@apache.org>.
VIKASPATID commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1940894333
Hi, we are also facing a similar issue: enabling clustering and then running 2-3 upserts results in a column order/value mismatch that ends up corrupting the table. We are seeing this on emr-6.15.0, Spark 3.4.1, Apache Hudi 0.14. Please look into it, as we are not able to use clustering at all.
Re: [I] [SUPPORT] Clustering created files with modified schema resulting in corrupted table [hudi]
Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-2050334663
Can you try with [hoodie.datasource.write.row.writer.enable](https://hudi.apache.org/docs/configurations/#hoodiedatasourcewriterowwriterenable) = false?
It's described here; the problem is a bit different, but for us it helped to create files in the right format.
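Applied on the write path, the suggested workaround looks roughly like this (other writer options elided; names and paths are hypothetical):
```
(df.write.format("hudi")
   .options(**hudi_options)  # the table's usual writer options (hypothetical)
   .option("hoodie.datasource.write.row.writer.enable", "false")  # workaround
   .mode("append")
   .save("s3://bucket/tables/my_table"))  # hypothetical path
```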