Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/25 20:40:25 UTC

[GitHub] [hudi] TarunMootala opened a new issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

TarunMootala opened a new issue #4914:
URL: https://github.com/apache/hudi/issues/4914


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   My use case is to populate default values for missing fields. I am using the property hoodie.datasource.write.reconcile.schema to inject default values for the missing fields, but the upsert fails with the error "org.apache.avro.UnresolvedUnionException: Not in union ["null","string"]: 1".
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a Hudi COW table and insert some sample data.
   
   ```
   inputDF = spark.createDataFrame(
       [
           ("100", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
           ("101", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
           ("102", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
           ("103", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
           ("104", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z"),
           ("105", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z")
       ],
       ["id", "name", "creation_date", "last_update_time"]
   )
   
   table_name = "first_hudi_table"
   table_path =  f"s3://<bucket_name>/Hudi/{table_name}"
   
   hudiOptions = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.database':'streaming_dev',
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
   }
   
   print(table_name, table_path)
   
   inputDF.write\
   .format('hudi')\
   .option('hoodie.datasource.write.operation', 'insert')\
   .options(**hudiOptions)\
   .mode('overwrite')\
   .save(table_path)
   
   ```
   2. Create a Spark DataFrame with fewer fields than the table schema. All mandatory fields (record key, precombine, and partition path fields) must be present in the DataFrame.
   3. Enable the property hoodie.datasource.write.reconcile.schema and upsert the Spark DataFrame into the Hudi table.
   ```
   inputDF = spark.createDataFrame(
       [
           ("110", '2015-01-01', "2015-01-02T13:51:39.340396Z"),
       ],
       ["id", "creation_date", "last_update_time"]
   )
   
   hudiOptions = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   'hoodie.datasource.write.reconcile.schema': 'true',
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.database':'streaming_dev',
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
   }
   
   inputDF.write\
   .format('hudi')\
   .option('hoodie.datasource.write.operation', 'upsert')\
   .options(**hudiOptions)\
   .mode('append')\
   .save(table_path)
   ```
   
   **Expected behavior**
   The upsert should succeed, with default values injected for the missing fields.
   
   **Environment Description**
   Using a Jupyter notebook with AWS EMR 6.5.0 and the Glue Data Catalog as the Hive metastore.
   
   * Hudi version : 0.9
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Stacktrace**
   
   ```
   An error was encountered:
   An error occurred while calling o205.save.
   : org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220225180756
   	at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
   	at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
   	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:98)
   	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:88)
   	at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:157)
   	at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:265)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:169)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:194)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:190)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
   	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
   	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 12.0 failed 4 times, most recent failure: Lost task 3.3 in stage 12.0 (TID 34) (ip-172-31-0-85.ec2.internal executor 4): java.io.IOException: Could not create payload for class: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
   	at org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:133)
   	at org.apache.hudi.DataSourceUtils.createHoodieRecord(DataSourceUtils.java:236)
   	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$7(HoodieSparkSqlWriter.scala:237)
   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
   	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
   	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
   	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:91)
   	at org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:130)
   	... 16 more
   Caused by: java.lang.reflect.InvocationTargetException
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
   	... 17 more
   Caused by: org.apache.avro.UnresolvedUnionException: Not in union ["null","string"]: 1
   	at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
   	at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
   	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
   	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
   	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
   	at org.apache.hudi.avro.HoodieAvroUtils.indexedRecordToBytes(HoodieAvroUtils.java:102)
   	at org.apache.hudi.avro.HoodieAvroUtils.avroToBytes(HoodieAvroUtils.java:94)
   	at org.apache.hudi.common.model.BaseAvroPayload.<init>(BaseAvroPayload.java:49)
   	at org.apache.hudi.common.model.OverwriteWithLatestAvroPayload.<init>(OverwriteWithLatestAvroPayload.java:42)
   	... 22 more
   
   Driver stacktrace:
   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2470)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2419)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2418)
   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2418)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1125)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1125)
   	at scala.Option.foreach(Option.scala:407)
   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1125)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2684)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2626)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2615)
   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2262)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2281)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2306)
   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
   	at org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:366)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
   	at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:366)
   	at org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314)
   	at org.apache.hudi.index.bloom.SparkHoodieBloomIndex.lookupIndex(SparkHoodieBloomIndex.java:114)
   	at org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:84)
   	at org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:60)
   	at org.apache.hudi.table.action.commit.AbstractWriteHelper.tag(AbstractWriteHelper.java:69)
   	at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:51)
   	... 45 more
   
   Traceback (most recent call last):
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
       self._jwrite.save(path)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1057503160


   wrt setting auto defaults, do you think [this](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteNonDefaultsWithLatestAvroPayload.java) payload will help you?
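
   For reference, a minimal sketch of how that payload could be tried. It uses the same hoodie.datasource.write.payload.class key that is exercised later in this thread; the other options mirror the reproduction above and are included only for illustration:

   ```
   hudiOptions = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   # Hypothetical: swap in the non-default-overwriting payload for the upsert.
   'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload'
   }

   inputDF.write\
   .format('hudi')\
   .option('hoodie.datasource.write.operation', 'upsert')\
   .options(**hudiOptions)\
   .mode('append')\
   .save(table_path)
   ```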





[GitHub] [hudi] TarunMootala commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
TarunMootala commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1068502313


   @[nsivabalan](https://github.com/nsivabalan)
   Thanks for confirming. 





[GitHub] [hudi] mkk1490 commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
mkk1490 commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1055596452


   The Hive schema has to be rebuilt for the new fields. The new fields will be added only in the Parquet files; they then have to be added in the metastore as well.
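
   A quick way to confirm whether the metastore has actually picked up the new columns is to describe the synced table from Spark and compare it with a path-based read (a sketch only; it assumes the table was synced as streaming_dev.first_hudi_table, per the hive_sync options above):

   ```
   # Hypothetical check: columns the metastore knows about for the synced table
   spark.sql('DESCRIBE TABLE streaming_dev.first_hudi_table').show(truncate=False)
   # Columns actually present in the Hudi table files at the path
   spark.read.format('hudi').load(table_path).printSchema()
   ```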





[GitHub] [hudi] TarunMootala commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
TarunMootala commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1055704640


   I'm using the Glue Data Catalog as the Hive metastore and have Hive sync enabled. I can see that the Glue Catalog table is updated with the latest schema, i.e. the new fields are added, and I can query them from Athena and Redshift Spectrum as well. The only issue I've seen is with Spark SQL.
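
   Since the catalog already shows the new columns, one thing worth ruling out on the Spark SQL side is a stale cached relation; a sketch (assuming the synced table is streaming_dev.first_hudi_table):

   ```
   # Hypothetical: drop any cached metadata Spark holds for the synced table,
   # then re-resolve its schema from the Glue Data Catalog.
   spark.sql('REFRESH TABLE streaming_dev.first_hudi_table')
   spark.table('streaming_dev.first_hudi_table').printSchema()
   ```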





[GitHub] [hudi] nsivabalan closed issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4914:
URL: https://github.com/apache/hudi/issues/4914


   





[GitHub] [hudi] nsivabalan commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1054853406


   The schema has to be backwards compatible, and new fields have to be added at the end. In your example, the field of interest is "name", which is in the middle.
   
   ```
   
   inputDF = spark.createDataFrame(
       [
           ("100", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z", "2015-01-01"),
           ("101", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z", "2015-01-01"),
           ("102", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z", "2015-01-01"),
           ("103", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z", "2015-01-01"),
           ("104", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z", "2015-01-01"),
           ("105", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01")
       ],
       ["id", "name", "creation_date", "last_update_time", "creation_date_1"]
   )
   
   table_name = "first_hudi_table"
   table_path =  f"s3://<bucket_name>/Hudi/{table_name}"
   
   hudiOptions = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   }
   
   print(table_name, table_path)
   
   inputDF.write\
   .format('hudi')\
   .option('hoodie.datasource.write.operation', 'insert')\
   .options(**hudiOptions)\
   .mode('overwrite')\
   .save(table_path)
   
   
   
   inputDF = spark.createDataFrame(
       [
           ("100", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
           ("101", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
           ("102", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
           ("103", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
           ("104", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z"),
           ("105", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z")
       ],
       ["id", "name", "creation_date", "last_update_time"]
   )
   
   
   hudiOptions = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   'hoodie.datasource.write.reconcile.schema': 'true',
   }
   
   inputDF.write\
   .format('hudi')\
   .option('hoodie.datasource.write.operation', 'insert')\
   .options(**hudiOptions)\
   .mode('append')\
   .save(table_path)
   
   ```
   This works. 
   
   
   





[GitHub] [hudi] TarunMootala edited a comment on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
TarunMootala edited a comment on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1055704640


   I'm using the Glue Data Catalog as the Hive metastore and have Hive sync enabled. I can see that the Glue Catalog table is updated with the latest schema, i.e. the new fields are added, and I can query them from Athena and Redshift Spectrum as well. The only issue I've seen is with Spark SQL. Is there any specific configuration for this?





[GitHub] [hudi] TarunMootala commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
TarunMootala commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1055586449


   Thanks for the response. It works when the fields are added at the end. I have a use case to populate default values for missing fields at any position. Is there any option to do so?
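
   One client-side workaround (a minimal sketch, not verified against Hudi 0.9) is to add the missing columns to the incoming DataFrame explicitly, with the defaults you want, before writing:

   ```
   from pyspark.sql import functions as F

   # Hypothetical: columns the table has but the incoming DataFrame lacks,
   # mapped to the default value to inject (a null string here).
   missing_defaults = {'name': None}

   for col_name, default in missing_defaults.items():
       inputDF = inputDF.withColumn(col_name, F.lit(default).cast('string'))

   # Optionally reorder to the table's column order before the upsert,
   # skipping the Hudi metadata columns.
   table_columns = [c for c in spark.read.format('hudi').load(table_path).columns
                    if not c.startswith('_hoodie_')]
   inputDF = inputDF.select(*table_columns)
   ```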
   
   Also, I have another observation: new fields that are added at the end are not reflected when using Spark SQL. Below is the PySpark code that I've used in the notebook.
   
   ```
   # Added the jar /usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar to spark.jars as recommended by the AWS EMR team to resolve the version conflict in JsonUnmarshallerContext inside hudi-spark-bundle.jar
   
   table_name = "test_hudi_table7"
   table_path =  f"s3://<bucket_name>/Hudi/{table_name}"
   
   inputDF = spark.createDataFrame(
       [
           ("100", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
           ("101", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
           ("102", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
           ("103", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
           ("104", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z"),
           ("105", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z")
       ],
       ["id", "name", "creation_date", "last_update_time"]
   )
   
   hudiOptions = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   'hoodie.datasource.write.reconcile.schema': 'true',
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.database':'streaming_dev',
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
   }
   
   inputDF.write\
   .format('hudi')\
   .option('hoodie.datasource.write.operation', 'insert')\
   .options(**hudiOptions)\
   .mode('overwrite')\
   .save(table_path)
   
   
   # Added 2 new fields at the end 
   
   inputDF = spark.createDataFrame(
       [
           ("106", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z", "2015-01-01", "2015-01-01"),
           ("107", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z", "2015-01-01", "2015-01-01"),
           ("108", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z", "2015-01-01", "2015-01-01"),
           ("109", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z", "2015-01-01", "2015-01-01"),
           ("110", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z", "2015-01-01", "2015-01-01"),
           ("111", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01", "2015-01-01")
       ],
       ["id", "name", "creation_date", "last_update_time", "creation_date1", "creation_date2"]
   )
   
   hudiOptions = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   'hoodie.datasource.write.reconcile.schema': 'true',
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.database':'streaming_dev',
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
   }
   
   print(table_name, table_path)
   
   inputDF.write\
   .format('hudi')\
   .option('hoodie.datasource.write.operation', 'upsert')\
   .options(**hudiOptions)\
   .mode('append')\
   .save(table_path)
   
   
   spark.read.format('hudi').load(table_path).show() # can see the new fields added 
   spark.sql('select * from <table name>').show() # can't see the new fields 
   spark.table('<table name>').show() # can't see the new fields 
   
   ```
   
   Note: The same example works in EMR 6.4 (Hudi 0.8).





[GitHub] [hudi] nsivabalan commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1057500304


   @YannByron: Can you help here, please?





[GitHub] [hudi] TarunMootala commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
TarunMootala commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1064191264


   @[YannByron](https://github.com/YannByron)
   Thanks for checking. We are using AWS; unfortunately, the latest Hudi version available in both EMR and Glue is 0.9. Can I consider the issues I've mentioned as bugs in Hudi 0.9?





[GitHub] [hudi] YannByron commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
YannByron commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1063623677


   @TarunMootala can you upgrade Hudi to 0.10.1? It can reconcile the schema wherever the new field is placed. Spark SQL still has a problem where a new field added in the middle of the schema can't be shown.
   However, when I tested on the master branch, all of the problems above were gone.
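
   If the cluster-provided Hudi jar cannot be upgraded, one way to try a newer release from a notebook is to pull the bundle via spark.jars.packages when the session is created. This is a sketch only: the bundle coordinates below are an assumption for Spark 3.1 / Scala 2.12 and should be verified against the Hudi 0.10.1 release, and the setting only takes effect if no Spark context is already running.

   ```
   from pyspark.sql import SparkSession

   # Hypothetical session config pulling a newer Hudi bundle from Maven.
   spark = SparkSession.builder \
       .appName('hudi-0-10-1-test') \
       .config('spark.jars.packages', 'org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1') \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .getOrCreate()
   ```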





[GitHub] [hudi] TarunMootala commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
TarunMootala commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1058226404


   I tested with OverwriteNonDefaultsWithLatestAvroPayload, but it's not working as expected. When a field is missing in the middle, values are overwritten based on position instead of field name. The same happens with OverwriteWithLatestAvroPayload.
   
   ```
   # Missing the 2nd field 'name' 
   inputDF = spark.createDataFrame(
       [
           ("111", "2022-01-01", "2022-01-01T12:15:00.512679Z", "2022-01-01", "2022-01-01"),
           ("112", "2015-01-01", "2015-01-01T12:15:00.512679Z", "2015-01-01", "2015-01-01"),
           ("113", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01", "2015-01-01"),
           ("114", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01", "2015-01-01"),
           ("115", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01", "2015-01-01"),
           ("116", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01", "2015-01-01"),
           ("117", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01", "2015-01-01")
       ],
       ["id", "creation_date", "last_update_time", "creation_date1", "creation_date2"]
   )
   
   hudiOptions = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.recordkey.field': 'id',
   'hoodie.datasource.write.precombine.field': 'last_update_time',
   'hoodie.datasource.write.reconcile.schema': 'true',
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.database':'streaming_dev',
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
   'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload'
   }
   
   print(table_name, table_path)
   
   inputDF.write\
   .format('hudi')\
   .option('hoodie.datasource.write.operation', 'upsert')\
   .options(**hudiOptions)\
   .mode('append')\
   .save(table_path)
   ```
   
   Output:
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----------+--------------------+--------------------+--------------+--------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|      name|       creation_date|    last_update_time|creation_date1|creation_date2|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----------+--------------------+--------------------+--------------+--------------+
   |     20220301154618|  20220301154618_0_1|               100|                      |a23a3468-f505-4e1...|100|       AAA|          2015-01-01|2015-01-01T13:51:...|          null|          null|
   |     20220301154618|  20220301154618_0_2|               101|                      |a23a3468-f505-4e1...|101|       BBB|          2015-01-01|2015-01-01T12:14:...|          null|          null|
   |     20220301154618|  20220301154618_0_3|               102|                      |a23a3468-f505-4e1...|102|       CCC|          2015-01-01|2015-01-01T13:51:...|          null|          null|
   |     20220301154618|  20220301154618_0_4|               103|                      |a23a3468-f505-4e1...|103|       DDD|          2015-01-01|2015-01-01T13:51:...|          null|          null|
   |     20220301154618|  20220301154618_0_5|               104|                      |a23a3468-f505-4e1...|104|       EEE|          2015-01-01|2015-01-01T12:15:...|          null|          null|
   |     20220301154618|  20220301154618_0_6|               105|                      |a23a3468-f505-4e1...|105|       FFF|          2015-01-01|2015-01-01T13:51:...|          null|          null|
   |     20220301154640|  20220301154640_0_3|               106|                      |a23a3468-f505-4e1...|106|       AAA|          2015-01-01|2015-01-01T13:51:...|    2015-01-01|    2015-01-01|
   |     20220301154640|  20220301154640_0_4|               107|                      |a23a3468-f505-4e1...|107|       BBB|          2015-01-01|2015-01-01T12:14:...|    2015-01-01|    2015-01-01|
   |     20220301154640|  20220301154640_0_5|               108|                      |a23a3468-f505-4e1...|108|       CCC|          2015-01-01|2015-01-01T13:51:...|    2015-01-01|    2015-01-01|
   |     20220301154640|  20220301154640_0_6|               109|                      |a23a3468-f505-4e1...|109|       DDD|          2015-01-01|2015-01-01T13:51:...|    2015-01-01|    2015-01-01|
   |     20220301154640|  20220301154640_0_1|               110|                      |a23a3468-f505-4e1...|110|       EEE|          2015-01-01|2015-01-01T12:15:...|    2015-01-01|    2015-01-01|
   |     20220303161634|  20220303161634_0_7|               111|                      |a23a3468-f505-4e1...|111|2022-01-01|2022-01-01T12:15:...|          2022-01-01|    2022-01-01|          null|
   |     20220303161634|  20220303161634_0_8|               112|                      |a23a3468-f505-4e1...|112|2015-01-01|2015-01-01T12:15:...|          2015-01-01|    2015-01-01|          null|
   |     20220303161634|  20220303161634_0_9|               113|                      |a23a3468-f505-4e1...|113|2015-01-01|2015-01-01T13:51:...|          2015-01-01|    2015-01-01|          null|
   |     20220303161634| 20220303161634_0_10|               114|                      |a23a3468-f505-4e1...|114|2015-01-01|2015-01-01T13:51:...|          2015-01-01|    2015-01-01|          null|
   |     20220303161634| 20220303161634_0_11|               115|                      |a23a3468-f505-4e1...|115|2015-01-01|2015-01-01T13:51:...|          2015-01-01|    2015-01-01|          null|
   |     20220303161634| 20220303161634_0_12|               116|                      |a23a3468-f505-4e1...|116|2015-01-01|2015-01-01T13:51:...|          2015-01-01|    2015-01-01|          null|
   |     20220303161634| 20220303161634_0_13|               117|                      |a23a3468-f505-4e1...|117|2015-01-01|2015-01-01T13:51:...|          2015-01-01|    2015-01-01|          null|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----------+--------------------+--------------------+--------------+--------------+
   ```





[GitHub] [hudi] nsivabalan commented on issue #4914: [SUPPORT] reconcile schema failing to inject default values for missing fields

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1067584612


   Yes, you can.

