Posted to commits@hudi.apache.org by "parisni (via GitHub)" <gi...@apache.org> on 2023/06/17 19:35:43 UTC

[GitHub] [hudi] parisni opened a new issue, #9005: [SUPPORT]New merger fails with NPE when schema evolution with string occurs

parisni opened a new issue, #9005:
URL: https://github.com/apache/hudi/issues/9005

   hudi 0.13.x
   spark 3.x.y
   
   When schema evolution occurs with a non-primitive type (string, array, ...), the new merger fails with an NPE. Other types such as int and timestamp won't fail.
   
   ```python
   tableName = 'test_hudi_merger'
   basePath = "/tmp/{tableName}".format(tableName=tableName)
   
   df = spark.sql("select '1' as event_id, '2' as ts, '3' as version, 'foo' as event_date")
   hudi_options = {
       "hoodie.table.name": tableName,
       "hoodie.datasource.write.recordkey.field": "event_id",
       "hoodie.datasource.write.partitionpath.field": "version,event_date",
       "hoodie.datasource.write.table.name": tableName,
       "hoodie.datasource.write.operation": "insert",
       "hoodie.datasource.write.precombine.field": "ts",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.datasource.hive_sync.enable": "false",
       "hoodie.metadata.enable": "false",
       # NEW MERGER
       "hoodie.datasource.write.record.merger.impls": "org.apache.hudi.HoodieSparkRecordMerger",
   }
   (df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath))
   
   # NOW ADD A STRING COLUMN 
   df = spark.sql("select '2' as event_id, '2' as ts, '3' as version, 'foo' as event_date, 'bar' as add_col")
   hudi_options = {
       "hoodie.table.name": tableName,
       "hoodie.datasource.write.recordkey.field": "event_id",
       "hoodie.datasource.write.partitionpath.field": "version,event_date",
       "hoodie.datasource.write.table.name": tableName,
       "hoodie.datasource.write.operation": "insert",
       "hoodie.datasource.write.precombine.field": "ts",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.datasource.hive_sync.enable": "false",
       "hoodie.metadata.enable": "false",
       # NEW MERGER
       "hoodie.datasource.write.record.merger.impls": "org.apache.hudi.HoodieSparkRecordMerger",
   }
   (df.write.format("hudi").options(**hudi_options).mode("append").save(basePath))
   ```
   
   Fails with:
   ```
   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 140.0 failed 1 times, most recent failure: Lost task 0.0 in stage 140.0 (TID 138) (192.168.1.18 executor driver): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
           at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
           at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
           at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
           at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
           at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
           at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
           at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
           at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
           at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498)
           at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
           at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
           at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
           at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
           at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
           at org.apache.spark.scheduler.Task.run(Task.scala:131)
           at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
           at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
           at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.lang.NullPointerException
           at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:156)
           at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:372)
           at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:363)
           at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
           ... 29 more
   Caused by: org.apache.hudi.exception.HoodieException: java.lang.NullPointerException
           at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:73)
           at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:154)
           ... 32 more
   Caused by: java.lang.NullPointerException
           at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
           at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Unknown Source)
           at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
           at org.apache.spark.sql.HoodieCatalystExpressionUtils$$anon$1.apply(HoodieCatalystExpressionUtils.scala:105)
           at org.apache.spark.sql.HoodieInternalRowUtils$.$anonfun$genUnsafeRowWriter$1(HoodieInternalRowUtils.scala:136)
           at org.apache.hudi.common.model.HoodieSparkRecord.rewriteRecordWithNewSchema(HoodieSparkRecord.java:211)
           at org.apache.hudi.common.model.HoodieRecord.rewriteRecordWithNewSchema(HoodieRecord.java:369)
           at org.apache.hudi.table.action.commit.HoodieMergeHelper.lambda$runMerge$1(HoodieMergeHelper.java:143)
           at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:66)
           ... 33 more
   ```




[GitHub] [hudi] ad1happy2go commented on issue #9005: [SUPPORT]New merger fails with NPE when schema evolution with string occurs

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #9005:
URL: https://github.com/apache/hudi/issues/9005#issuecomment-1607388794

   Similar issue here - https://github.com/apache/hudi/issues/9042
   Making the source schema nullable will fix this too.
   
   We can track this via this JIRA: https://issues.apache.org/jira/browse/HUDI-6438
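
   A minimal sketch of that workaround (the `with_nullable_schema` helper name is illustrative, not from this issue, and it only touches top-level fields):

   ```python
   from pyspark.sql.types import StructField, StructType

   def with_nullable_schema(df):
       # Rebuild the DataFrame with every top-level field marked nullable.
       # Nested fields inside structs or arrays are left as-is; a complete
       # solution would recurse into complex types.
       schema = StructType([
           StructField(f.name, f.dataType, nullable=True, metadata=f.metadata)
           for f in df.schema.fields
       ])
       return spark.createDataFrame(df.rdd, schema)

   # usage: with_nullable_schema(df).write.format("hudi").options(**hudi_options).mode("append").save(basePath)
   ```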




Re: [I] [SUPPORT]New merger fails with NPE when schema evolution with string occurs [hudi]

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed issue #9005: [SUPPORT]New merger fails with NPE when schema evolution with string occurs
URL: https://github.com/apache/hudi/issues/9005




[GitHub] [hudi] ad1happy2go commented on issue #9005: [SUPPORT]New merger fails with NPE when schema evolution with string occurs

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #9005:
URL: https://github.com/apache/hudi/issues/9005#issuecomment-1607390352

   @parisni We will update this issue once we confirm whether making source fields nullable is the right solution.




[GitHub] [hudi] parisni commented on issue #9005: [SUPPORT]New merger fails with NPE when schema evolution with string occurs

Posted by "parisni (via GitHub)" <gi...@apache.org>.
parisni commented on issue #9005:
URL: https://github.com/apache/hudi/issues/9005#issuecomment-1595861748

   After some investigation, the legacy merger also fails in this particular case (see the log below). In both cases, the reason is that the newly added column has `nullable=false`. Moreover, when `hoodie.avro.schema.validate=true` is set, the write fails early with the error: `Incoming batch schema is not compatible with the table's one`.
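
   To reproduce that early failure, one can simply set that validation flag on the second write (a small sketch reusing the options dict from the reproduction above):

   ```python
   # With validation enabled, the append is rejected up front with
   # "Incoming batch schema is not compatible with the table's one"
   # instead of an NPE deep inside the merge path.
   hudi_options["hoodie.avro.schema.validate"] = "true"
   (df.write.format("hudi").options(**hudi_options).mode("append").save(basePath))
   ```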
   
   So the NPE is likely not a real bug (though it is hard for the end user to understand): for records already in the table, we try to write a null into the new column, which is declared not null.
   
   Indeed, when the new column has `nullable=true`, schema evolution goes fine. For example, by inserting `df = spark.sql("select '2' as event_id, '2' as ts, '3' as version, 'foo' as event_date, case when 1=1 then 'foo' else null end as add_col")`
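
   The nullability difference can be checked directly on the Spark schema (a small sketch; literal columns are inferred as non-nullable, which is exactly what trips the merge):

   ```python
   # A plain string literal yields a non-nullable column...
   spark.sql("select 'bar' as add_col").schema["add_col"].nullable  # False
   # ...while routing it through CASE ... ELSE NULL makes it nullable.
   spark.sql(
       "select case when 1=1 then 'bar' else null end as add_col"
   ).schema["add_col"].nullable  # True
   ```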
   
   What about providing the ability to automatically turn the incoming schema to `nullable=true` in Hudi, to avoid such errors? @nsivabalan @yihua
   
   ```
   Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key event_id:1 from old file /tmp/test_hudi_merger/version=3/event_date=foo/17d0126f-b7a8-4b7f-95e7-1e65a8f36e8d-0_0-83-83_20230617230002764.parquet to new file /tmp/test_hudi_merger/version=3/event_date=foo/17d0126f-b7a8-4b7f-95e7-1e65a8f36e8d-0_0-91-91_20230617230003209.parquet with writerSchema {
     "type" : "record",
     "name" : "test_hudi_merger_record",
     "namespace" : "hoodie.test_hudi_merger",
     "fields" : [ {
       "name" : "_hoodie_commit_time",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_commit_seqno",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_record_key",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_partition_path",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_file_name",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "event_id",
       "type" : "string"
     }, {
       "name" : "ts",
       "type" : "string"
     }, {
       "name" : "version",
       "type" : "string"
     }, {
       "name" : "event_date",
       "type" : "string"
     }, {
       "name" : "add_col",
       "type" : "string"
     } ]
   }
           at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:370)
           at org.apache.hudi.table.action.commit.BaseMergeHelper$UpdateHandler.consume(BaseMergeHelper.java:54)
           at org.apache.hudi.table.action.commit.BaseMergeHelper$UpdateHandler.consume(BaseMergeHelper.java:44)
           at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:67)
           ... 33 more
   Caused by: java.lang.RuntimeException: Null-value for required field: add_col
           at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:200)
           at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:171)
           at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
           at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:310)
           at org.apache.hudi.io.storage.HoodieBaseParquetWriter.write(HoodieBaseParquetWriter.java:80)
           at org.apache.hudi.io.storage.HoodieAvroParquetWriter.writeAvro(HoodieAvroParquetWriter.java:76)
           at org.apache.hudi.io.storage.HoodieAvroFileWriter.write(HoodieAvroFileWriter.java:51)
           at org.apache.hudi.io.storage.HoodieFileWriter.write(HoodieFileWriter.java:43)
           at org.apache.hudi.io.HoodieMergeHandle.writeToFile(HoodieMergeHandle.java:384)
           at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:365)
           ... 36 more
   ```




[GitHub] [hudi] danny0405 commented on issue #9005: [SUPPORT]New merger fails with NPE when schema evolution with string occurs

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #9005:
URL: https://github.com/apache/hudi/issues/9005#issuecomment-1595916171

   > What about providing the ability to automatically turn the incoming schema to nullable=true in Hudi, to avoid such errors?
   
   +1




Re: [I] [SUPPORT]New merger fails with NPE when schema evolution with string occurs [hudi]

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #9005:
URL: https://github.com/apache/hudi/issues/9005#issuecomment-1886362639

   @parisni Closing out this issue, as we already have a solution implemented in this PR - https://github.com/apache/hudi/pull/9262

