Posted to commits@hudi.apache.org by "nbeeee (via GitHub)" <gi...@apache.org> on 2023/02/09 02:27:32 UTC

[GitHub] [hudi] nbeeee opened a new issue, #7902: [SUPPORT].UnresolvedUnionException: Not in union exception occurred when writing data through spark

nbeeee opened a new issue, #7902:
URL: https://github.com/apache/hudi/issues/7902

   **Describe the problem you faced**

   When upserting data to a Hudi table through Spark, the job fails with `org.apache.avro.UnresolvedUnionException: Not in union` during the post-commit archive step (full stacktrace below).
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   Upsert data to Hudi through Spark repeatedly. The exception appears after multiple executions, which seems to be the point at which the archive operation is triggered.
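
   A minimal Java sketch of the repeated-upsert flow described here (the input path, record key, and precombine fields below are guesses pieced together from the SQL and logs later in this thread, not the reporter's actual job):

   ```
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;

   public class UpsertRepro {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().appName("hudi-upsert-repro").getOrCreate();
       // Hypothetical input; the real job reads from Kudu (Kudu2Hudi.scala)
       Dataset<Row> df = spark.read().parquet("/tmp/source");
       df.write().format("hudi")
           .option("hoodie.table.name", "ods_dts_stock_summary_all_df")
           .option("hoodie.datasource.write.operation", "upsert")
           // record key / precombine fields guessed from the SQL later in the thread
           .option("hoodie.datasource.write.recordkey.field", "company_id,business_id,ware_id")
           .option("hoodie.datasource.write.precombine.field", "time_stamp")
           .mode(SaveMode.Append)
           .save("hdfs://cluster/transdb/ods/stock/ods_dts_stock_summary_all_df");
       // After enough commits accumulate, the post-commit archive runs
       // (BaseHoodieWriteClient.autoArchiveOnCommit in the stacktrace) and fails there.
       spark.stop();
     }
   }
   ```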
   
   **Expected behavior**
   
   The upsert commit and the subsequent automatic archiving complete without errors.
   
   **Environment Description**
   
   * Hudi version : 0.12.1

   * Spark version : 3.1.1

   * Hive version : 2.7.3

   * Hadoop version : 3.1.2

   * Storage (HDFS/S3/GCS..) : HDFS

   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   ```
   /data/spark3/bin/spark-submit \
   --master yarn \
   --name xxxxx \
   --deploy-mode cluster \
   --driver-memory 4g \
   --num-executors 5 \
   --total-executor-cores 6 \
   --executor-memory 12g \
   --conf spark.memory.fraction=0.85 \
   --conf spark.memory.storageFraction=0.85 \
   --queue default \
   --class Kudu2Hudi \
   spark_dependence_jar/hudi/hudi.jar
   ```
   
   **Stacktrace**
   
   ```
   23/02/07 02:00:01 INFO HoodieLogFormatWriter: HoodieLogFile{pathStr='hdfs://cluster/transdb/ods/stock/ods_dts_stock_summary_all_df/.hoodie/archived/.commits_.archive.1_1-0-1', fileLen=0} exists. Appending to existing file
   23/02/07 02:00:01 ERROR HoodieTimelineArchiver: Failed to archive commits, .commit file: 20230201144304071.clean.requested
   org.apache.avro.UnresolvedUnionException: Not in union ["null",{"type":"map","values":{"type":"array","items":{"type":"string","avro.java.string":"String"}},"avro.java.string":"String"}]: KEEP_LATEST_FILE_VERSIONS
   	at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
   	at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
   	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
   	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
   	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.serializeRecords(HoodieAvroDataBlock.java:119)
   	at org.apache.hudi.common.table.log.block.HoodieDataBlock.getContentBytes(HoodieDataBlock.java:131)
   	at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:158)
   	at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:135)
   	at org.apache.hudi.client.HoodieTimelineArchiver.writeToFile(HoodieTimelineArchiver.java:671)
   	at org.apache.hudi.client.HoodieTimelineArchiver.archive(HoodieTimelineArchiver.java:643)
   	at org.apache.hudi.client.HoodieTimelineArchiver.archiveIfRequired(HoodieTimelineArchiver.java:171)
   	at org.apache.hudi.client.BaseHoodieWriteClient.archive(BaseHoodieWriteClient.java:909)
   	at org.apache.hudi.client.BaseHoodieWriteClient.autoArchiveOnCommit(BaseHoodieWriteClient.java:629)
   	at org.apache.hudi.client.BaseHoodieWriteClient.postCommit(BaseHoodieWriteClient.java:534)
   	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:237)
   	at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:125)
   	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:714)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:340)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:144)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
   	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
   	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
   	at Kudu2Hudi$.main(Kudu2Hudi.scala:120)
   	at Kudu2Hudi.main(Kudu2Hudi.scala)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
   23/02/07 02:00:01 ERROR HoodieTimelineArchiver: Failed to archive commits, .commit file: 20230201144304071.clean.inflight.
   ```
   
   




[GitHub] [hudi] jonvex commented on issue #7902: [SUPPORT].UnresolvedUnionException: Not in union exception occurred when writing data through spark

Posted by "jonvex (via GitHub)" <gi...@apache.org>.
jonvex commented on issue #7902:
URL: https://github.com/apache/hudi/issues/7902#issuecomment-1431612529

   If you take a look at the code for [UnresolvedUnionException.java](https://github.com/apache/avro/blob/f23eabb42f315b0db9135b075434b8a88680659c/lang/java/avro/src/main/java/org/apache/avro/UnresolvedUnionException.java), the ending item is 'unresolvedDatum'. In the exception you provided, that appears to be KEEP_LATEST_FILE_VERSIONS.  
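
   To make that concrete, here is a minimal, self-contained sketch in plain Avro (not Hudi code) that reproduces the same message shape, using the union from the stacktrace; it suggests the archiver handed Avro the cleaner policy string where the schema expects null or a map:

   ```
   import org.apache.avro.Schema;
   import org.apache.avro.UnresolvedUnionException;
   import org.apache.avro.generic.GenericData;

   public class UnionRepro {
     public static void main(String[] args) {
       // The same union that appears in the stacktrace: ["null", map<string, array<string>>]
       Schema union = new Schema.Parser().parse(
           "[\"null\",{\"type\":\"map\",\"values\":{\"type\":\"array\",\"items\":\"string\"}}]");
       try {
         // A bare String matches neither branch, so Avro cannot resolve the union
         GenericData.get().resolveUnion(union, "KEEP_LATEST_FILE_VERSIONS");
       } catch (UnresolvedUnionException e) {
         // Prints: Not in union ["null",{...}]: KEEP_LATEST_FILE_VERSIONS
         System.out.println(e.getMessage());
       }
     }
   }
   ```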




[GitHub] [hudi] nbeeee commented on issue #7902: [SUPPORT].UnresolvedUnionException: Not in union exception occurred when writing data through spark

Posted by "nbeeee (via GitHub)" <gi...@apache.org>.
nbeeee commented on issue #7902:
URL: https://github.com/apache/hudi/issues/7902#issuecomment-1429275756

   The SQL:
   ```
   SELECT
   	trim(compid) company_id
   	,trim(busno) business_id
   	,trim(wareid) ware_id
   	,if(sumqty = '', null, cast(sumqty as decimal(20,4))) as sumqty
   	,autocomputemaxstore
   	,autocomputeminstore
   	,if(maxday = '', null, cast(maxday as decimal(20,4))) as maxday
   	,if(minday = '', null, cast(minday as decimal(20,4))) as minday
   	,if(maxstore = '', null, cast(maxstore as decimal(20,4))) as maxstore
   	,if(minstore = '', null, cast(minstore as decimal(20,4))) as minstore
   	,if(storepurprice = '', null, cast(storepurprice as decimal(20,4))) as storepurprice
   	,if(lastmdistqty = '', null, cast(lastmdistqty as decimal(20,4))) as lastmdistqty
   	,autodistapply
   	,if(lastm2qty = '', null, cast(lastm2qty as decimal(20,4))) as lastm2qty
   	,if(lastm3qty = '', null, cast(lastm3qty as decimal(20,4))) as lastm3qty
   	,if(lastmqty = '', null, cast(lastmqty as decimal(20,4))) as lastmqty
   	,if(lastyqty = '', null, cast(lastyqty as decimal(20,4))) as lastyqty
   	,oosdays
   	,if(sumawaitqty = '', null, cast(sumawaitqty as decimal(20,4))) as sumawaitqty
   	,if(sumpendingqty = '', null, cast(sumpendingqty as decimal(20,4))) as sumpendingqty
   	,if(sumawaitqty_nobatch = '', null, cast(sumawaitqty_nobatch as decimal(20,4))) as sumawaitqty_nobatch
   	,lastsaledate
   	,if(lastapplyqty = '', null, cast(lastapplyqty as decimal(20,4))) as lastapplyqty
   	,lastapplydate
   	,if(lastdistqty = '', null, cast(lastdistqty as decimal(20,4))) as lastdistqty
   	,lastdistdate
   	,if(lowestqty = '', null, cast(lowestqty as decimal(20,4))) as lowestqty
   	,if(onlineextqty = '', null, cast(onlineextqty as decimal(20,4))) as onlineextqty
   	,if(allocqty = '', null, cast(allocqty as decimal(20,4))) as allocqty
   	,if(minstroeqty = '', null, cast(minstroeqty as decimal(20,4))) as minstroeqty
   	,if(enrouteqty = '', null, cast(enrouteqty as decimal(20,4))) as enrouteqty
   	,if(reserveqty = '', null, cast(reserveqty as decimal(20,4))) as reserveqty
   	,if(sumdefectqty = '', null, cast(sumdefectqty as decimal(20,4))) as sumdefectqty
   	,if(shelvesqty = '', null, cast(shelvesqty as decimal(20,4))) as shelvesqty
   	,if(recallqty = '', null, cast(recallqty as decimal(20,4))) as recallqty
   	,if(resaleqty = '', null, cast(resaleqty as decimal(20,4))) as resaleqty
   	,if(sumtestqty = '', null, cast(sumtestqty as decimal(20,4))) as sumtestqty
   	,if(saledayqty = '', null, cast(saledayqty as decimal(20,4))) as saledayqty
   	,lastbreakdate
   	,lastbreakdhdate
   	,utimeforplat
   	,transplatstatus
   	,subitemid
   	,if(dayavgqty = '', null, cast(dayavgqty as decimal(20,4))) as dayavgqty
   	-- ,substr(breakstockdate, 1, 10) breakstockdate
   	,case when sumqty <='0'  and breakstockdate is null then  '2023-02-07'  else  substr(breakstockdate, 1, 10) end  breakstockdate
   	,time_stamp
   	,trim(group_id) group_id
   	FROM
		t_store_stock
   ```
   
   Sample data from t_store_stock:
   |group_id|busno|wareid|compid|data_source|time_stamp       |sumqty|autocomputemaxstore|autocomputeminstore|maxday|minday|maxstore|minstore|storepurprice|lastmdistqty|autodistapply|lastm2qty|lastm3qty|lastmqty|lastyqty|oosdays|sumawaitqty|sumpendingqty|sumawaitqty_nobatch|lastsaledate         |lastapplyqty|lastapplydate|lastdistqty|lastdistdate           |lowestqty|onlineextqty|allocqty|minstroeqty|enrouteqty|reserveqty|sumdefectqty|shelvesqty|recallqty|resaleqty|sumtestqty|saledayqty|lastbreakdate|lastbreakdhdate|utimeforplat|transplatstatus|subitemid|dayavgqty|breakstockdate|
   |--------|-----|------|------|-----------|-----------------|------|-------------------|-------------------|------|------|--------|--------|-------------|------------|-------------|---------|---------|--------|--------|-------|-----------|-------------|-------------------|---------------------|------------|-------------|-----------|-----------------------|---------|------------|--------|-----------|----------|----------|------------|----------|---------|---------|----------|----------|-------------|---------------|------------|---------------|---------|---------|--------------|
   |123456  |1013 |164083|123456|h1         |1,676,360,535,095|0.0   |1                  |1                  |60.0  |      |7.0     |3.0     |14.0         |1.0         |1            |2.0      |7.0      |        |        |       |0.0        |0.0          |0.0                |2023-01-29 00:00:00.0|            |             |1.0        |2023-01-14 09:53:57.457|         |0.0         |0.0     |           |          |          |            |          |         |         |          |          |             |               |            |               |         |         |              |
   
   
   > nothing special about this setup so it looks like a data issue. @nbeeee can you share the schema and some sample data to help reproduce? cc @jonvex
   
   




[GitHub] [hudi] xushiyan commented on issue #7902: [SUPPORT].UnresolvedUnionException: Not in union exception occurred when writing data through spark

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on issue #7902:
URL: https://github.com/apache/hudi/issues/7902#issuecomment-1433373264

   > If you take a look at the code for [UnresolvedUnionException.java](https://github.com/apache/avro/blob/f23eabb42f315b0db9135b075434b8a88680659c/lang/java/avro/src/main/java/org/apache/avro/UnresolvedUnionException.java), the ending item is 'unresolvedDatum'. In the exception you provided, that appears to be KEEP_LATEST_FILE_VERSIONS.
   
   Yes, we should look into the commit metadata of 20230201144304071 itself; we can examine its content, since serializing it into the archived commits hit a schema issue. @nbeeee it would also help to see the complete Hoodie write configs you set for this job, and a zip of the `.hoodie/` directory for inspection.
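
   One way to peek at that instant, assuming it was serialized the way Hudi's TimelineMetadataUtils writes timeline metadata (as an Avro data file), is to copy it out of HDFS and read it back generically; the local file name below is hypothetical:

   ```
   import java.io.File;
   import org.apache.avro.file.DataFileReader;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericRecord;

   public class InspectCleanPlan {
     public static void main(String[] args) throws Exception {
       // Hypothetical local copy, e.g. fetched first with:
       //   hdfs dfs -get <table>/.hoodie/20230201144304071.clean.requested .
       File f = new File("20230201144304071.clean.requested");
       try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(f, new GenericDatumReader<>())) {
         // Writer schema embedded in the file
         System.out.println(reader.getSchema().toString(true));
         while (reader.hasNext()) {
           // The serialized clean plan record(s)
           System.out.println(reader.next());
         }
       }
     }
   }
   ```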




[GitHub] [hudi] xushiyan commented on issue #7902: [SUPPORT].UnresolvedUnionException: Not in union exception occurred when writing data through spark

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on issue #7902:
URL: https://github.com/apache/hudi/issues/7902#issuecomment-1428631149

   nothing special about this setup so it looks like a data issue. @nbeeee can you share the schema and some sample data to help reproduce? cc @jonvex 

