Posted to commits@hudi.apache.org by "eric9204 (via GitHub)" <gi...@apache.org> on 2023/02/02 03:08:11 UTC

[GitHub] [hudi] eric9204 opened a new issue, #7822: [SUPPORT][CDC]UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1

eric9204 opened a new issue, #7822:
URL: https://github.com/apache/hudi/issues/7822

   
   **Describe the problem you faced**
   
   With CDC enabled, the compaction table service cannot be performed.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Set `hoodie.table.cdc.enabled=true` and
      `hoodie.table.cdc.supplemental.logging.mode=data_before_after`.
   2. Table type: MOR.
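   
   A minimal sketch of a Spark structured streaming write with this configuration, assuming an active SparkSession `spark`; the source, record key, ordering field, and checkpoint path below are hypothetical, not taken from the report:
   
   ```scala
   // Hypothetical input stream; the reporter's real source is not shown.
   val df = spark.readStream
     .format("rate")
     .load()
     .selectExpr("value AS id", "CAST(timestamp AS STRING) AS ts", "value * 1.5 AS price")
   
   df.writeStream
     .format("hudi")
     .option("hoodie.table.name", "cdc_test")
     .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
     .option("hoodie.table.cdc.enabled", "true")
     .option("hoodie.table.cdc.supplemental.logging.mode", "data_before_after")
     .option("hoodie.datasource.write.recordkey.field", "id")   // hypothetical key field
     .option("hoodie.datasource.write.precombine.field", "ts")  // hypothetical ordering field
     .option("checkpointLocation", "/tmp/hudi/cdc_test_ckpt")   // hypothetical checkpoint path
     .start("/tmp/hudi/cdc_test")
   ```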
   
   **Expected behavior**
   
   Compaction of the CDC-enabled MOR table should complete successfully.
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 3.1.1
   
   * Hive version : 3.1.2
   
   * Hadoop version : none
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   
   **Stacktrace**
   
   ```
   23/02/02 10:58:21 ERROR HoodieStreamingSink: Micro batch id=1 threw following expections,aborting streaming app to avoid data loss: 
   org.apache.hudi.exception.HoodieCompactionException: Could not compact /tmp/hudi/cdc_test
   	at org.apache.hudi.table.action.compact.RunCompactionActionExecutor.execute(RunCompactionActionExecutor.java:116)
   	at org.apache.hudi.table.HoodieSparkMergeOnReadTable.compact(HoodieSparkMergeOnReadTable.java:140)
   	at org.apache.hudi.client.SparkRDDTableServiceClient.compact(SparkRDDTableServiceClient.java:75)
   	at org.apache.hudi.client.BaseHoodieTableServiceClient.lambda$runAnyPendingCompactions$2(BaseHoodieTableServiceClient.java:191)
   	at java.util.ArrayList.forEach(ArrayList.java:1259)
   	at org.apache.hudi.client.BaseHoodieTableServiceClient.runAnyPendingCompactions(BaseHoodieTableServiceClient.java:189)
   	at org.apache.hudi.client.BaseHoodieTableServiceClient.inlineCompaction(BaseHoodieTableServiceClient.java:160)
   	at org.apache.hudi.client.BaseHoodieTableServiceClient.runTableServicesInline(BaseHoodieTableServiceClient.java:334)
   	at org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:540)
   	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:249)
   	at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:102)
   	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:903)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:372)
   	at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:122)
   	at scala.util.Try$.apply(Try.scala:213)
   	at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:120)
   	at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:244)
   	at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:119)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:586)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:584)
   	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
   	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:584)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:226)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
   	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
   	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333)
   	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
   Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 67) (LAPTOP-DONGSJ executor driver): org.apache.avro.UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1
   	at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
   	at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
   	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
   	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
   	at org.apache.hudi.avro.HoodieAvroUtils.indexedRecordToBytes(HoodieAvroUtils.java:136)
   	at org.apache.hudi.avro.HoodieAvroUtils.avroToBytes(HoodieAvroUtils.java:128)
   	at org.apache.hudi.common.model.HoodieAvroPayload.<init>(HoodieAvroPayload.java:47)
   	at org.apache.hudi.io.HoodieCDCLogger.put(HoodieCDCLogger.java:175)
   	at org.apache.hudi.io.HoodieMergeHandleWithChangeLog.writeInsertRecord(HoodieMergeHandleWithChangeLog.java:106)
   	at org.apache.hudi.io.HoodieMergeHandle.writeIncomingRecords(HoodieMergeHandle.java:397)
   	at org.apache.hudi.io.HoodieMergeHandle.close(HoodieMergeHandle.java:405)
   	at org.apache.hudi.io.HoodieMergeHandleWithChangeLog.close(HoodieMergeHandleWithChangeLog.java:112)
   	at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:168)
   	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.handleUpdateInternal(HoodieSparkCopyOnWriteTable.java:224)
   	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.handleUpdate(HoodieSparkCopyOnWriteTable.java:215)
   	at org.apache.hudi.table.action.compact.CompactionExecutionHelper.writeFileAndGetWriteStats(CompactionExecutionHelper.java:64)
   	at org.apache.hudi.table.action.compact.HoodieCompactor.compact(HoodieCompactor.java:239)
   	at org.apache.hudi.table.action.compact.HoodieCompactor.lambda$compact$9cd4b1be$1(HoodieCompactor.java:137)
   	at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
   	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
   	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
   	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
   	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
   	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
   	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
   	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   
   Driver stacktrace:
   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
   	at scala.Option.foreach(Option.scala:407)
   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
   	at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:362)
   	at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:361)
   	at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
   	at org.apache.hudi.data.HoodieJavaRDD.collectAsList(HoodieJavaRDD.java:163)
   	at org.apache.hudi.table.action.compact.RunCompactionActionExecutor.execute(RunCompactionActionExecutor.java:101)
   	... 38 more
   ```
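   
   For context on the Avro error itself: `UnresolvedUnionException` is thrown when a value matches none of the branches of a union-typed field. Here the offending value, `20230202105806923_0_1`, has the commitTime_partition_sequence shape of a Hudi meta field (e.g. `_hoodie_commit_seqno`), yet it is being written into a column whose schema is `["null","double"]`, which suggests the record handed to the CDC logger still carried meta fields that the target schema does not expect. A standalone sketch that reproduces the same exception with an illustrative schema (not Hudi's actual CDC schema):
   
   ```scala
   import java.io.ByteArrayOutputStream
   import org.apache.avro.Schema
   import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
   import org.apache.avro.io.EncoderFactory
   
   // Illustrative schema: a single nullable double field, like the failing column.
   val schema = new Schema.Parser().parse(
     """{"type":"record","name":"Row","fields":
       |  [{"name":"price","type":["null","double"],"default":null}]}""".stripMargin)
   
   val record = new GenericData.Record(schema)
   // A meta-field-shaped string where the union expects null or a double.
   record.put("price", "20230202105806923_0_1")
   
   val writer  = new GenericDatumWriter[GenericRecord](schema)
   val encoder = EncoderFactory.get().binaryEncoder(new ByteArrayOutputStream(), null)
   writer.write(record, encoder) // throws org.apache.avro.UnresolvedUnionException
   ```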
   
   




[GitHub] [hudi] YannByron commented on issue #7822: [SUPPORT][CDC]UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1

Posted by "YannByron (via GitHub)" <gi...@apache.org>.
YannByron commented on issue #7822:
URL: https://github.com/apache/hudi/issues/7822#issuecomment-1416802202

   Some PRs related to `HoodieMergeHandle` have changed the CDC behavior: the CDC logger should write the record without any meta fields. I have submitted https://github.com/apache/hudi/pull/7852 to fix it. I think either https://github.com/apache/hudi/pull/7852 or https://github.com/apache/hudi/pull/7849 can solve this issue, and both of them should be merged. @xushiyan 
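   
   To illustrate the idea (a hedged sketch, not the actual diff in https://github.com/apache/hudi/pull/7852): project the record onto a schema with the Hudi meta fields removed before handing it to the CDC logger, so a meta-field string can never land in a data column such as the `["null","double"]` one above. `HoodieAvroUtils.removeMetadataFields` and `HoodieAvroUtils.rewriteRecord` are existing Hudi utilities; the helper itself is hypothetical.
   
   ```scala
   import org.apache.avro.generic.GenericRecord
   import org.apache.hudi.avro.HoodieAvroUtils
   
   // Hypothetical helper: strip Hudi meta fields (_hoodie_commit_time,
   // _hoodie_commit_seqno, ...) before a record is passed to the CDC logger.
   def toCdcRecord(recordWithMeta: GenericRecord): GenericRecord = {
     val dataSchema = HoodieAvroUtils.removeMetadataFields(recordWithMeta.getSchema)
     HoodieAvroUtils.rewriteRecord(recordWithMeta, dataSchema)
   }
   ```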




[GitHub] [hudi] xushiyan commented on issue #7822: [SUPPORT][CDC]UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on issue #7822:
URL: https://github.com/apache/hudi/issues/7822#issuecomment-1416655609

   > I can't reproduce this in my environment, so I haven't found the root cause. But with the current logic, the CDC query will not touch the CDC log files written by compaction (until https://issues.apache.org/jira/browse/HUDI-4705 is resolved), so for now there is no need to write CDC data when compacting a MOR table. @eric9204 can you cherry-pick this #7849 and run again?
   
   @eric9204 let us know if you can verify whether the patch fixes it.




[GitHub] [hudi] YannByron commented on issue #7822: [SUPPORT][CDC]UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1

Posted by "YannByron (via GitHub)" <gi...@apache.org>.
YannByron commented on issue #7822:
URL: https://github.com/apache/hudi/issues/7822#issuecomment-1416592493

   I can't reproduce this in my environment, so I haven't found the root cause.
   But with the current logic, the CDC query will not touch the CDC log files written by compaction (until https://issues.apache.org/jira/browse/HUDI-4705 is resolved), so for now there is no need to write CDC data when compacting a MOR table.
   @eric9204 can you cherry-pick this and run again? 
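   
   For reference, a minimal sketch of the CDC read path being discussed, using the table path from the report; the begin instant is a placeholder. With `hoodie.datasource.query.incremental.format=cdc`, the incremental query returns change records, and per HUDI-4705 it does not yet read CDC log files produced by compaction:
   
   ```scala
   // Minimal CDC read sketch; the begin instant below is a placeholder.
   val changes = spark.read.format("hudi")
     .option("hoodie.datasource.query.type", "incremental")
     .option("hoodie.datasource.query.incremental.format", "cdc")
     .option("hoodie.datasource.read.begin.instanttime", "20230202000000000")
     .load("/tmp/hudi/cdc_test")
   changes.show(false)
   ```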




[GitHub] [hudi] xushiyan closed issue #7822: [SUPPORT][CDC]UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan closed issue #7822: [SUPPORT][CDC]UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1
URL: https://github.com/apache/hudi/issues/7822




[GitHub] [hudi] danny0405 commented on issue #7822: [SUPPORT][CDC]UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #7822:
URL: https://github.com/apache/hudi/issues/7822#issuecomment-1415268718

   cc @YannByron, can you take a look at this bug?

