Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/30 02:38:54 UTC

[GitHub] [hudi] peanut-chenzhong opened a new issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

peanut-chenzhong opened a new issue #3735:
URL: https://github.com/apache/hudi/issues/3735


   As I understand it, when OverwriteNonDefaultsWithLatestAvroPayload is used, Hudi updates a record column by column: if the upsert data has columns whose value is null, Hudi ignores those columns and updates only the others. The current behavior does not seem to match this.
   
   Steps to reproduce the behavior:
   
   1. Use spark-sql to initialize the test data
   create table test_payload (par1 int,par2 int,key int,col0 string,col1 double,col2 date,col3 timestamp);
   insert into test_payload select 1,20,100,'bb',220.22,'2011-02-10','2011-01-10 01:11:20';
   insert into test_payload select 1,10,100,'cc',null,null,'2011-01-10 01:11:00';
   
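   2. Insert the first row into Hudi using OverwriteNonDefaultsWithLatestAvroPayload
   // spark-shell imports for COW_TABLE_TYPE_OPT_VAL and the Overwrite/Append save modes
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.spark.sql.SaveMode._
   val base_data = sql("select * from test_payload where col0='aa' or col0='bb'")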
   base_data.write.format("hudi").
   option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
   option("hoodie.datasource.write.precombine.field", "col3").
   option("hoodie.datasource.write.recordkey.field", "key").
   option("hoodie.datasource.write.partitionpath.field", "").
   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
   option("hoodie.datasource.write.operation", "upsert").
   option("hoodie.datasource.write.payload.class", "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload").
   option("hoodie.upsert.shuffle.parallelism", 4).
   option("hoodie.datasource.write.hive_style_partitioning", "true").
   option("hoodie.table.name", "tb_test_payload").mode(Overwrite).save(s"/tmp/huditest/tb_test_payload")
   
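   3. Upsert the second row into Hudi using OverwriteNonDefaultsWithLatestAvroPayload
   // upsert_data was not defined in the report; it is assumed here to select the second row
   val upsert_data = sql("select * from test_payload where col0='cc'")
   upsert_data.write.format("hudi").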
   option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
   option("hoodie.datasource.write.precombine.field", "col3").
   option("hoodie.datasource.write.recordkey.field", "key").
   option("hoodie.datasource.write.partitionpath.field", "").
   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
   option("hoodie.datasource.write.payload.class", "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload").
   option("hoodie.datasource.write.operation", "upsert").
   option("hoodie.upsert.shuffle.parallelism", 4).
   option("hoodie.datasource.write.hive_style_partitioning", "true").
   option("hoodie.table.name", "tb_test_payload").mode(Append).save(s"/tmp/huditest/tb_test_payload")
   
   4. Query the table
   spark.read.format("org.apache.hudi").load("/tmp/huditest/tb_test_payload/*").createOrReplaceTempView("hudi_ro_table")
   spark.sql("select * from hudi_ro_table").show(30,false)
   +-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                      |par1|par2|key|col0|col1|col2|col3               |
   +-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
   |20210930083222     |20210930083222_0_6  |100               |                      |191bf655-bc6c-4944-b7bb-1f00304c033e-0_0-190-316_20210930083222.parquet|1   |10  |100|cc  |null|null|2011-01-10 01:11:00|
   +-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
   
   You can see that the whole row has been updated, even though col1 and col2 are null in the upsert data.
   
   **Expected behavior**
   
   The expected behavior is that col1 and col2 should not be updated.
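
To make the expectation concrete, here is a minimal sketch in plain Java of the column-by-column merge described in this report, applied to the two rows from step 1. It is not Hudi code: `PartialUpdateSketch`, `mergeNonNull` and `row` are hypothetical names, rows are modeled as maps, and null is assumed to be the default value of every column.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartialUpdateSketch {

    // Hypothetical helper: start from the stored row and copy over only the
    // incoming columns that are non-null (null is assumed to be the default).
    static Map<String, Object> mergeNonNull(Map<String, Object> stored,
                                            Map<String, Object> incoming) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        for (Map.Entry<String, Object> column : incoming.entrySet()) {
            if (column.getValue() != null) {
                merged.put(column.getKey(), column.getValue());
            }
        }
        return merged;
    }

    // Hypothetical helper: build a row with the column names of test_payload.
    static Map<String, Object> row(Object par1, Object par2, Object key, Object col0,
                                   Object col1, Object col2, Object col3) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("par1", par1);
        r.put("par2", par2);
        r.put("key", key);
        r.put("col0", col0);
        r.put("col1", col1);
        r.put("col2", col2);
        r.put("col3", col3);
        return r;
    }

    public static void main(String[] args) {
        Map<String, Object> stored =
            row(1, 20, 100, "bb", 220.22, "2011-02-10", "2011-01-10 01:11:20");
        Map<String, Object> incoming =
            row(1, 10, 100, "cc", null, null, "2011-01-10 01:11:00");
        // col1 and col2 keep the stored values; every other column is taken from the upsert.
        System.out.println(mergeNonNull(stored, incoming));
    }
}
```

Under these assumptions the merged row keeps col1=220.22 and col2=2011-02-10 while par2, col0 and col3 take the values from the upsert, which is the result this report expected instead of the nulls shown in the query output.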
   
   
   **Environment Description**
   
   * Hudi version :0.9
   
   * Spark version :3.1.1
   
   * Hive version :3.1
   
   * Hadoop version :3.1.1
   
   * Storage (HDFS/S3/GCS..) :hdfs
   
   * Running on Docker? (yes/no) :no
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] peanut-chenzhong commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

Posted by GitBox <gi...@apache.org>.
peanut-chenzhong commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-938266751


   BTW, could someone help add me to the HUDI JIRA group so that I can assign the task to myself? @nsivabalan @codope





[GitHub] [hudi] xushiyan closed issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

Posted by GitBox <gi...@apache.org>.
xushiyan closed issue #3735:
URL: https://github.com/apache/hudi/issues/3735


   





[GitHub] [hudi] codope commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

Posted by GitBox <gi...@apache.org>.
codope commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-931133022


   I think this is a bug. The [overwrite check](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java#L102) in OverwriteWithLatestAvroPayload simply tests whether the two objects are equal. Because the nullable column is converted to an Avro schema, the [defaultVal() method](https://avro.apache.org/docs/1.8.2/api/java/org/apache/avro/Schema.Field.html#defaultVal()) returns the corresponding `JsonProperties.Null`, but the other object in the check is not of the same type. The check therefore returns false, and the field gets overwritten. Instead, we should modify that method to something like:
   ```
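   // Returns true when the incoming value matches the field default, so the field is not overwritten.
   public Boolean overwriteField(Object value, Object defaultValue) {
     if (value == null) {
       // A null default is surfaced by Avro as JsonProperties.Null rather than a Java null.
       return defaultValue instanceof JsonProperties.Null;
     }
     return Objects.equals(value, defaultValue);
   }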
   ```
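
The mismatch can be reproduced without Avro on the classpath. The sketch below is only an illustration under stated assumptions: `NullMarker` is a hypothetical stand-in for `JsonProperties.Null`, the existing check is modeled as a plain `Objects.equals` (as described above, not copied from the Hudi source), and the class and method names are made up for this example.

```java
import java.util.Objects;

public class OverwriteFieldSketch {

    // Hypothetical stand-in for Avro's JsonProperties.Null marker type.
    static final class NullMarker {}

    // Plays the role of the value defaultVal() returns for a null default.
    static final Object NULL_DEFAULT = new NullMarker();

    // Assumed shape of the existing check: a plain equality test.
    static boolean equalsOnlyCheck(Object value, Object defaultValue) {
        return Objects.equals(value, defaultValue);
    }

    // Shape of the proposed check: a null value matches a null-marker default.
    static boolean nullAwareCheck(Object value, Object defaultValue) {
        if (value == null) {
            return defaultValue instanceof NullMarker;
        }
        return Objects.equals(value, defaultValue);
    }

    public static void main(String[] args) {
        // A Java null never equals the marker object, so the field is treated as changed.
        System.out.println(equalsOnlyCheck(null, NULL_DEFAULT));  // false
        // The null-aware variant recognises the null as the default value.
        System.out.println(nullAwareCheck(null, NULL_DEFAULT));   // true
        // Non-null values still go through the ordinary equality test.
        System.out.println(nullAwareCheck("cc", NULL_DEFAULT));   // false
    }
}
```

With the equality-only check, a null column in the upsert is reported as different from the default and is written over the stored value. The null-aware variant reports it as equal to the default, so the stored value would be kept.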
   
   @peanut-chenzhong  I have filed [a bug](https://issues.apache.org/jira/browse/HUDI-2509). Please raise a PR if you have the fix.





[GitHub] [hudi] peanut-chenzhong commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

Posted by GitBox <gi...@apache.org>.
peanut-chenzhong commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-938263412


   PR raised: https://github.com/apache/hudi/pull/3761





[GitHub] [hudi] xushiyan edited a comment on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

Posted by GitBox <gi...@apache.org>.
xushiyan edited a comment on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-946991414


   @peanut-chenzhong ~What is your JIRA ID in Apache JIRA? I can add you quickly.~
   
   OK, added your JIRA account to the contributor group: https://issues.apache.org/jira/secure/ViewProfile.jspa?name=peanut
   
   Also fixed the JIRA link in your PR; it should be https://issues.apache.org/jira/browse/HUDI-2509





[GitHub] [hudi] peanut-chenzhong commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

Posted by GitBox <gi...@apache.org>.
peanut-chenzhong commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-930703026


   @n3nash could you kindly help check whether this is an issue?
   If yes, I can raise a PR to solve it soon.





[GitHub] [hudi] xushiyan commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-946991414


   @peanut-chenzhong What is your JIRA ID in Apache JIRA? I can add you quickly.





[GitHub] [hudi] peanut-chenzhong commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

Posted by GitBox <gi...@apache.org>.
peanut-chenzhong commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-938254613


   @codope sure, will raise PR soon.

